The short answer

Model accuracy in a marketing mix model is not a single number - it is a framework of complementary signals evaluated continuously. The three core components are nRMSE (Normalized Root Mean Squared Error), which measures predictive error; R², which reflects how well the model explains historical variance; and back-testing, which validates at key checkpoints whether the model generalizes reliably to data it has not seen. No single metric is sufficient on its own. Used together, and monitored over time rather than at a single point, they give a robust and transparent picture of model performance.

Marketing mix models guide some of the largest budget decisions a performance team will make. The natural question follows: how do you know the model is actually accurate? And how do you make that accuracy visible and verifiable to finance, leadership, and external stakeholders?

Accuracy, properly measured, requires multiple complementary perspectives - different metrics reveal different things about how a model is performing.

Why does measuring model accuracy require more than one metric?

Evaluating a model's accuracy comes down to two distinct questions that pull in different directions.

The first is how well the model learns from historical data - how closely its outputs match the patterns already in the training set. The second is how well it performs on data it has not seen - whether the relationships it has learned hold up in genuinely new periods.

These two questions reflect what is known in statistics as the bias-variance tradeoff - the tension between a model too simple to capture the real patterns in historical data and one so flexible that it memorizes them, noise included. Finding the right balance is central to building models that perform consistently on new data. A model that fits historical data too closely tends to absorb noise rather than meaningful structure - and when the environment shifts, its predictions become unreliable. A model with a slightly imperfect fit on training data can be the more reliable choice if its predictions remain stable on genuinely new periods.
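To make the tradeoff concrete, here is a minimal sketch using synthetic data and made-up numbers (not Fospha's modeling code): a flexible polynomial fit beats a simpler one on the weeks it was trained on, then degrades on a held-out future period.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic weekly "sales" series: trend + yearly seasonality + noise (illustrative only).
weeks = np.arange(104, dtype=float)
sales = 100 + 0.5 * weeks + 10 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 5, weeks.size)

x = weeks / weeks.max()                        # rescale time for numerical stability
train, hold = slice(0, 78), slice(78, None)    # time-ordered split: last 26 weeks held out

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

for degree in (3, 9):                          # a simple fit vs. a highly flexible one
    coefs = np.polyfit(x[train], sales[train], degree)
    pred = np.polyval(coefs, x)
    print(f"degree {degree}: train RMSE = {rmse(sales[train], pred[train]):.2f}, "
          f"holdout RMSE = {rmse(sales[hold], pred[hold]):.2f}")
# Typically the flexible fit wins in-sample and loses badly out-of-sample -
# the pattern the bias-variance tradeoff describes.
```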

This is why a sound accuracy framework uses both performance metrics, such as nRMSE and R², and out-of-sample validation through back-testing. Each provides a signal the others cannot.

What does each accuracy metric actually measure?

Normalized Root Mean Squared Error (nRMSE)

Normalized Root Mean Squared Error (nRMSE) is a measure of predictive error - how closely the model's predictions align with observed outcomes. It is calculated by dividing RMSE by the mean of observed outcomes, which makes the metric comparable across brands and scales. Other normalization conventions exist, such as using the range or standard deviation, so it is worth confirming definitions when comparing providers.

nRMSE is most usefully read as a trend rather than a single number. A low, stable nRMSE time series is a strong signal of dependable predictive performance. A rising or erratic nRMSE trend may indicate the model is drifting or that the underlying data environment has shifted - a signal worth investigating.
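For reference, this is roughly how a mean-normalized nRMSE is computed - a minimal sketch with made-up numbers, using the mean-of-observed convention described above rather than range or standard-deviation normalization.

```python
import numpy as np

def nrmse(actual: np.ndarray, predicted: np.ndarray) -> float:
    """Root Mean Squared Error divided by the mean of the observed outcomes.

    Mean-normalization makes the error comparable across brands of very
    different scale; some providers normalize by the range or standard
    deviation instead, so always confirm the convention in use.
    """
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    return float(rmse / np.mean(actual))

# Illustrative numbers only: daily observed vs. modeled revenue.
observed = np.array([1200.0, 1350.0, 980.0, 1500.0, 1420.0])
modeled = np.array([1150.0, 1400.0, 1020.0, 1450.0, 1380.0])

print(f"nRMSE = {nrmse(observed, modeled):.3f}")   # ~0.036, i.e. roughly 3.6% error relative to the mean
```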

R²

R² represents the proportion of variation in the outcome that the model can explain based on its inputs. A practical way to read it: an R² of 0.90 means the model accounts for roughly 90% of the rises, dips, and shifts in your historical sales data.

R² reflects in-sample fit - how well the model captures patterns in the training data - rather than predictive accuracy on new data. In time-series settings, R² can appear artificially inflated due to trends, seasonality, non-stationarity, or data leakage, so it is best read alongside out-of-sample metrics such as nRMSE. High R² with weak predictive accuracy can indicate over-fitting. Moderate R² with strong predictive accuracy can reflect a well-calibrated model operating in a genuinely complex, noisy environment.
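A short sketch of why R² is read with care in time-series settings (synthetic, illustrative numbers only): a naive model that captures nothing but the upward trend in sales can still post a high in-sample R², which is exactly why it is paired with out-of-sample error metrics.

```python
import numpy as np

def r_squared(actual, predicted):
    """Proportion of variance in the observed outcome explained by the model."""
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - np.mean(actual)) ** 2)
    return float(1 - ss_res / ss_tot)

rng = np.random.default_rng(1)

# Illustrative only: strongly trending sales that are actually driven by spend.
weeks = np.arange(104, dtype=float)
spend = 50 + 0.4 * weeks + rng.normal(0, 2, weeks.size)
sales = 20 * spend + rng.normal(0, 60, weeks.size)

# A naive "trend-only" model that ignores marketing entirely.
slope, intercept = np.polyfit(weeks, sales, 1)
trend_only = intercept + slope * weeks

print(f"in-sample R² of a trend-only model: {r_squared(sales, trend_only):.2f}")
# High, purely because sales trend upward over time - the model has learned
# nothing about what actually drives them, which is why R² is read alongside
# out-of-sample error such as nRMSE.
```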

Back-testing

Back-testing is a form of out-of-sample validation that evaluates how well the model generalizes to unseen future periods, preserving the time order of the data. It is typically run at key checkpoints - such as model build or retraining - rather than as a continuously updated signal. At its simplest, it involves comparing model performance between the periods it learned from and the future periods it has not seen. If performance degrades on the unseen periods, it may indicate over-fitting or instability. If performance remains consistent, it suggests the model has learned meaningful structure rather than memorizing historical noise. Back-testing adds a layer of confidence that the model will behave reliably in real-world, forward-facing conditions.
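As an illustration of the mechanics (synthetic data and a generic scikit-learn regression, not Fospha's modeling stack), a back-test in its simplest form looks something like this: fit on the earlier weeks, score the most recent weeks the model never saw, and compare the two errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Illustrative weekly data: sales driven by two media channels plus noise.
n_weeks = 104
spend_a = rng.uniform(10, 100, n_weeks)
spend_b = rng.uniform(5, 60, n_weeks)
sales = 500 + 6 * spend_a + 9 * spend_b + rng.normal(0, 40, n_weeks)

X = np.column_stack([spend_a, spend_b])

# Back-test: train on the first 78 weeks, hold out the most recent 26 -
# the time order is preserved, never shuffled.
cut = 78
model = LinearRegression().fit(X[:cut], sales[:cut])

def nrmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)) / np.mean(actual))

train_err = nrmse(sales[:cut], model.predict(X[:cut]))
hold_err = nrmse(sales[cut:], model.predict(X[cut:]))

print(f"train nRMSE   = {train_err:.3f}")
print(f"holdout nRMSE = {hold_err:.3f}")
# A large gap between the two is a warning sign of over-fitting or drift;
# comparable values suggest the model has learned stable structure.
```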

Inside the Glassbox

Accuracy is a continuous discipline at Fospha, not a one-time check. This sits inside Glassbox - Fospha's commitment to full transparency across the modeling stack: every model layer, validation step, and metric is open to inspection. Customers can see how the ensemble model is constructed, how different measurement components contribute (click measurement, impression measurement, post-purchase, halo), the validation metrics behind every prediction, and the daily, ad-level outputs that budget decisions rely on.

Building the Full Funnel View

In practice, each modeling cycle follows a structured loop: data refresh and retraining; evaluation on held-out periods to assess generalization; ongoing monitoring of nRMSE and R² to track predictive error, model fit, and stability over time; and transparent reporting, with accuracy measures available to customers on request.

nRMSE is computed daily for every model Fospha runs, including click-based components and impression-based MMM, so performance is continuously visible. Accuracy metrics are available to customers on request and typically shared via their CSM, complete with plain-English definitions and guidance, so model health is straightforward to understand and verify without requiring statistical expertise.

Healthy accuracy ranges are brand-specific and derived empirically. The goal is not a single universal benchmark, but a stable band for each brand that signals the model is learning meaningful structure and generalizing reliably over time.
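As one hedged sketch of what monitoring against a brand-specific band could look like in practice (the rolling window, threshold, and flag_drift helper below are illustrative assumptions, not Fospha's production logic):

```python
import numpy as np

def flag_drift(nrmse_series, window: int = 28, n_sigmas: float = 3.0):
    """Flag days whose nRMSE sits well outside the brand's recent band.

    The band is the rolling mean plus n_sigmas rolling standard deviations over
    the previous `window` days. Window and threshold here are illustrative.
    """
    nrmse_series = np.asarray(nrmse_series, dtype=float)
    flags = []
    for i in range(window, nrmse_series.size):
        history = nrmse_series[i - window:i]
        upper = history.mean() + n_sigmas * history.std()
        if nrmse_series[i] > upper:
            flags.append(i)
    return flags

# Illustrative daily nRMSE readings: stable around 0.05, then a drift upward.
rng = np.random.default_rng(3)
readings = np.concatenate([
    rng.normal(0.05, 0.005, 90),   # healthy, stable band
    rng.normal(0.12, 0.01, 10),    # sudden deterioration worth investigating
])

print("days flagged for review:", flag_drift(readings))
```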

Common questions

Q: What is a good nRMSE score for a marketing mix model?

There is no universal benchmark - healthy nRMSE ranges are brand-specific and derived empirically based on the data environment and business context. The more useful signal is the trend over time: a low, stable nRMSE series indicates dependable predictive performance, while a rising or volatile trend warrants investigation. A single low score at one point in time is less informative than consistent stability across many measurement periods.

Q: Can R² alone tell me if my MMM is accurate?

No. R² reflects in-sample fit - how well the model explains historical patterns - but it does not tell you whether those relationships will hold on new data. In time-series settings, R² can be artificially inflated by trends, seasonality, non-stationarity, or data leakage. A high R² alongside weak out-of-sample performance is a sign of over-fitting. R² is best read alongside predictive accuracy metrics such as nRMSE and validated through back-testing.

Q: What is back-testing and why does it matter for MMM?

Back-testing is out-of-sample validation that checks whether a model generalizes beyond the data it was trained on. It works by evaluating model performance on future periods the model has not seen, preserving the time order of the data. If performance degrades significantly on those unseen periods compared to the training period, it may suggest the model has over-fitted to historical noise. Consistent performance across both periods is a positive indicator that the model has learned genuine, stable structure - and is more likely to produce reliable outputs in real-world conditions.

Q: How often should model accuracy be monitored?

Continuous monitoring is more reliable than periodic checks. Marketing environments shift - media mix changes, spending levels fluctuate, audience behavior evolves. A model calibrated under one set of conditions may drift as those conditions change. Tracking metrics such as nRMSE on a daily basis, rather than waiting for quarterly model refreshes, makes it possible to detect and address emerging issues early.
