~/notes/causal-inference-over-deep-learning
Causal Inference Over Deep Learning
When propensity matching and do-calculus beat a transformer — a FinOps lens on a 40% cost reduction at Amber Electric.
▸ Anchored to Amber Electric — 40% avg infrastructure cost reduction across 10k+ energy sites
“Correlation is the problem deep learning is good at. Causation is the problem your CFO actually wants solved.”
This post is a confession. The right answer at Amber was rarely “another transformer.” For a forecasting problem on 10k+ energy sites, with strict cost ceilings and an actively changing tariff regime, the classical causal toolkit beat deep learning on cost, latency, and explainability — by a wide margin.
The setup
We had two questions:
- Forecasting: what will site s consume in the next 30 minutes?
- Counterfactual: if we had switched site s to tariff plan T' instead of T, what would consumption have looked like?
Question 1 is a regression. Question 2 is causal — and argmin RMSE over a feed-forward net does not, by itself, get you there.
Do-calculus, briefly
The key move is to re-interpret the intervention do(T = T') as a graph mutation, not a conditioning. In Pearl’s notation:
P(Y | do(T = T')) ≠ P(Y | T = T')
The right-hand side is a passive observation — it suffers from confounding (e.g., users on tariff T' may also be heavy AC users in summer). The left-hand side is an intervention — it severs the incoming edges to T in the causal graph and asks: holding everything else fixed, what would Y have been?
If the causal graph satisfies the back-door criterion with respect to a set of covariates Z, then:
P(Y | do(T = t)) = Σ_z P(Y | T = t, Z = z) · P(Z = z)
This is adjustment by Z — and it can be estimated with classical regressions, no neural net required.
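To make that concrete, here is a toy R simulation — synthetic data, not Amber’s, with illustrative variable names — where summer AC usage confounds tariff choice and consumption. The naive conditional difference is badly biased; a plain regression that adjusts by the confounder recovers the true effect.
# toy confounding example: summer AC usage (Z) drives both tariff choice (T) and load (Y)
set.seed(1)
n      <- 100000
summer <- rbinom(n, 1, 0.5)                  # confounder Z
tariff <- rbinom(n, 1, 0.2 + 0.5 * summer)   # heavy-AC households pick T' more often
load   <- 2 + 1.5 * summer + 0.3 * tariff + rnorm(n, sd = 0.5)   # true effect of T' is +0.3

# passive observation P(Y | T): inflated by the confounder
mean(load[tariff == 1]) - mean(load[tariff == 0])   # ~1.05, not the causal 0.3

# back-door adjustment by Z, via a classical regression
coef(lm(load ~ tariff + summer))["tariff"]          # ~0.3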
Propensity score matching, with a twist
For continuous treatments (tariff is not binary — it’s a vector of price-by-time-of-day), we used generalised propensity scores (GPS): the conditional density of the observed treatment given covariates,
r(t, x) = f_{T|X}(t | x),  estimated per site as \hat{r}_i = \hat{f}(T_i | X_i),
then matched units across treatments by \hat{r} percentile bins. R-flavoured pseudocode:
# generalised propensity score for continuous treatment
# (sketch: assumes the treatment residual is roughly Gaussian)
library(mgcv)

gps <- function(data) {
  # 1. model the treatment given covariates, then evaluate the conditional
  #    density of each site's observed treatment — that density is the GPS
  treat_model <- gam(treatment ~ s(load_avg) + s(temp_avg) + state, data = data)
  mu    <- predict(treat_model, type = "response")
  sigma <- sqrt(mean(residuals(treat_model)^2))
  data$gps_pred <- dnorm(data$treatment, mean = mu, sd = sigma)

  # 2. estimate the dose-response surface in (treatment, GPS)
  dose_response <- gam(
    outcome ~ s(treatment) + s(gps_pred),
    data = data,
    method = "REML"
  )

  # 3. counterfactual: holding GPS fixed, sweep treatment
  grid <- expand.grid(
    treatment = seq(min(data$treatment), max(data$treatment), length.out = 20),
    gps_pred  = median(data$gps_pred)
  )
  grid$predicted_outcome <- predict(dose_response, newdata = grid)
  grid
}
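A minimal usage sketch follows — simulated sites with illustrative column names matching what gps() above expects, not Amber’s actual schema:
# illustrative only: simulated site data, schema matching gps() above
set.seed(42)
n_sites <- 500
sites <- data.frame(
  load_avg = rnorm(n_sites, mean = 12, sd = 3),
  temp_avg = rnorm(n_sites, mean = 22, sd = 5),
  state    = factor(sample(c("NSW", "VIC", "QLD"), n_sites, replace = TRUE))
)
sites$treatment <- 0.20 + 0.010 * sites$load_avg + 0.005 * sites$temp_avg +
  rnorm(n_sites, sd = 0.03)                    # continuous "tariff intensity"
sites$outcome   <- 10 + 3 * sites$treatment + 0.5 * sites$load_avg + rnorm(n_sites)

dose_curve <- gps(sites)
plot(dose_curve$treatment, dose_curve$predicted_outcome, type = "l",
     xlab = "tariff intensity (counterfactual)", ylab = "predicted consumption")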
This pipeline ran on a single n2-standard-4 instance per state. The deep-learning alternative cost roughly $6,000/month per state — and gave us worse counterfactual estimates, because its loss function never asked it to answer the counterfactual question.
When deep learning is the right call
Deep learning earned its place in two areas of the same platform:
- Ultra-short-horizon (≤5 min) ramp prediction, where temporal patterns dominate and the signal is high-frequency.
- Anomaly detection on grid-quality telemetry, where representation learning materially beat hand-engineered features.
That’s two slots. The other thirty-plus modelling problems wanted causal or classical methods.
The FinOps angle
The 40% cost reduction at Amber was not from compute optimisation alone. Most of it came from deciding not to deploy a deep model where a causal one was cheaper, faster, and more honest about its assumptions. A chunk of MLOps debt is paid in compute bills generated by the wrong model class.
If the model is wrong, GKE node-pool autoscaling cannot save you.
The Staff+ takeaway
When a stakeholder asks for a deep model, the Staff+ response is rarely “yes.” It is: what is the underlying decision, and what is the cheapest model that lets you make it well? Often that is a regression. Sometimes it is a back-door adjustment. Occasionally it is a transformer. Get the order right.
Anchored to: Amber Electric Energy Forecasting Platform, 2021–2023. Cost figures from GCP billing diff, 6-month rolling, post-FinOps refactor.