~/notes/causal-inference-over-deep-learning
Causal Inference Over Deep Learning
When propensity matching and do-calculus beat a transformer — a FinOps lens on a 40% cost reduction at Amber Electric.
▸ Anchored to Amber Electric — 40% avg infrastructure cost reduction across 10k+ energy sites
“Correlation is the problem deep learning is good at. Causation is the problem your CFO actually wants solved.”
This post is a confession. The right answer at Amber was rarely “another transformer.” For a forecasting problem on 10k+ energy sites, with strict cost ceilings and an actively changing tariff regime, the classical causal toolkit beat deep learning on cost, latency, and explainability — by a wide margin.
The setup
We had two questions:
- Forecasting: what will site s consume in the next 30 minutes?
- Counterfactual: if we had switched site s to tariff plan T' instead of T, what would consumption have looked like?
Question 1 is a regression. Question 2 is causal — and argmin RMSE over a feed-forward net does not, by itself, get you there.
Do-calculus, briefly
The key move is to re-interpret the intervention do(T = T') as a graph mutation, not a conditioning. In Pearl’s notation:
P(Y | do(T = T')) ≠ P(Y | T = T')
The right-hand side is a passive observation — it suffers from confounding (e.g., users on tariff T' may also be heavy AC users in summer). The left-hand side is an intervention — it severs the incoming edges to T in the causal graph and asks: holding everything else fixed, what would Y have been?
If the causal graph satisfies the back-door criterion with respect to a set of covariates Z, then:
P(Y | do(T = t)) = Σ_z P(Y | T = t, Z = z) · P(Z = z)
This is adjustment by Z — and it can be estimated with classical regressions, no neural net required.
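To make that concrete, here is a toy R simulation — synthetic data, not Amber’s, with illustrative variable names — where summer AC usage confounds tariff choice and consumption. The naive conditional difference is badly biased; a plain regression that adjusts by the confounder recovers the true effect.
# toy confounding example: summer AC usage (Z) drives both tariff choice (T) and load (Y)
set.seed(1)
n      <- 100000
summer <- rbinom(n, 1, 0.5)                  # confounder Z
tariff <- rbinom(n, 1, 0.2 + 0.5 * summer)   # heavy-AC households pick T' more often
load   <- 2 + 1.5 * summer + 0.3 * tariff + rnorm(n, sd = 0.5)   # true effect of T' is +0.3

# passive observation P(Y | T): inflated by the confounder
mean(load[tariff == 1]) - mean(load[tariff == 0])   # ~1.05, not the causal 0.3

# back-door adjustment by Z, via a classical regression
coef(lm(load ~ tariff + summer))["tariff"]          # ~0.3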
Propensity score matching, with a twist
For continuous treatments (tariff is not binary — it’s a vector of price-by-time-of-day), we used generalised propensity scores (GPS): the conditional density of the observed treatment given covariates,
r(t, x) = f_{T|X}(t | x),  estimated per site as \hat{r}_i = \hat{f}(T_i | X_i),
then matched units across treatments by \hat{r} percentile bins. R-flavoured pseudocode:
# generalised propensity score for continuous treatment
# (sketch: assumes the treatment residual is roughly Gaussian)
library(mgcv)

gps <- function(data) {
  # 1. model the treatment given covariates, then evaluate the conditional
  #    density of each site's observed treatment — that density is the GPS
  treat_model <- gam(treatment ~ s(load_avg) + s(temp_avg) + state, data = data)
  mu    <- predict(treat_model, type = "response")
  sigma <- sqrt(mean(residuals(treat_model)^2))
  data$gps_pred <- dnorm(data$treatment, mean = mu, sd = sigma)

  # 2. estimate the dose-response surface in (treatment, GPS)
  dose_response <- gam(
    outcome ~ s(treatment) + s(gps_pred),
    data = data,
    method = "REML"
  )

  # 3. counterfactual: holding GPS fixed, sweep treatment
  grid <- expand.grid(
    treatment = seq(min(data$treatment), max(data$treatment), length.out = 20),
    gps_pred  = median(data$gps_pred)
  )
  grid$predicted_outcome <- predict(dose_response, newdata = grid)
  grid
}
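A minimal usage sketch follows — simulated sites with illustrative column names matching what gps() above expects, not Amber’s actual schema:
# illustrative only: simulated site data, schema matching gps() above
set.seed(42)
n_sites <- 500
sites <- data.frame(
  load_avg = rnorm(n_sites, mean = 12, sd = 3),
  temp_avg = rnorm(n_sites, mean = 22, sd = 5),
  state    = factor(sample(c("NSW", "VIC", "QLD"), n_sites, replace = TRUE))
)
sites$treatment <- 0.20 + 0.010 * sites$load_avg + 0.005 * sites$temp_avg +
  rnorm(n_sites, sd = 0.03)                    # continuous "tariff intensity"
sites$outcome   <- 10 + 3 * sites$treatment + 0.5 * sites$load_avg + rnorm(n_sites)

dose_curve <- gps(sites)
plot(dose_curve$treatment, dose_curve$predicted_outcome, type = "l",
     xlab = "tariff intensity (counterfactual)", ylab = "predicted consumption")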
This pipeline ran on a single n2-standard-4 instance per state. The deep-learning alternative cost roughly $6,000/month per state — and gave us worse counterfactual estimates, because its loss function never asked it to answer the counterfactual question.
When deep learning is the right call
Deep learning earned its place in two areas of the same platform:
- Ultra-short-horizon (≤5 min) ramp prediction, where temporal patterns dominate and the signal is high-frequency.
- Anomaly detection on grid-quality telemetry, where representation learning materially beat hand-engineered features.
That’s two slots. The other thirty-plus modelling problems wanted causal or classical methods.
The FinOps angle
The 40% cost reduction at Amber was not from compute optimisation alone. Most of it came from deciding not to deploy a deep model where a causal one was cheaper, faster, and more honest about its assumptions. A chunk of MLOps debt is paid in compute bills generated by the wrong model class.
If the model is wrong, GKE node-pool autoscaling cannot save you.
The Staff+ takeaway
When a stakeholder asks for a deep model, the Staff+ response is rarely “yes.” It is: what is the underlying decision, and what is the cheapest model that lets you make it well? Often that is a regression. Sometimes it is a back-door adjustment. Occasionally it is a transformer. Get the order right.
Anchored to: Amber Electric Energy Forecasting Platform, 2021–2023. Cost figures from GCP billing diff, 6-month rolling, post-FinOps refactor.