
Your AI Forecast Says 90%. Nobody Checked If It Means It.

Phil Bolton · June 15, 2026 · 3 min read

A founder I work with put a new FP&A tool in front of his board last month. Cash forecast, 90 days out, with a tidy band around it: "85% confidence we land between $1.4M and $1.9M." The board loved it. Looked rigorous. I asked him one question afterward. Over the last six months, how often did actuals land inside the band the tool drew at the time? He hadn't looked. We pulled it. Of the eight forecasts where the tool claimed 85%, actuals landed in the band three times.

The number wasn't 85%. Three hits in eight is 37.5%, so call it 40% on a good day. And he'd been leading with the 85%.
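Eight forecasts is a small sample, so it's fair to ask whether three hits could just be bad luck. Here is a rough sketch of that check. The eight forecasts, three hits, and 85% claim come from the example above. Treating each forecast as an independent draw is my simplifying assumption (90-day windows overlap, so it's only approximately true), and `prob_at_most` is a helper name I made up for illustration.

```python
from math import comb


def prob_at_most(hits: int, n: int, p: float) -> float:
    """P(X <= hits) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(hits + 1))


n_forecasts = 8   # forecasts where the tool claimed 85%
hits = 3          # times actuals landed inside the band
claimed = 0.85    # stated confidence

observed = hits / n_forecasts
# Assumption: forecasts are independent. Overlapping windows weaken this.
tail = prob_at_most(hits, n_forecasts, claimed)

print(f"observed hit rate: {observed:.1%}")
print(f"chance of {hits} or fewer hits if the 85% were honest: {tail:.2%}")
```

Under that independence assumption the chance comes out well under 1%. Even with a generous discount for overlapping windows, bad luck is a hard story to sell.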

Accuracy and calibration are not the same test

These tools really did get better this year. The vendors quote 40% gains in forecast accuracy, and the point estimates are genuinely tighter than the spreadsheet they replaced. That part is real. The part nobody tests is calibration: when the model says 90%, do nine out of ten of those calls actually land where it said?

Accuracy is whether the center is close. Calibration is whether the confidence is honest. A model can be more accurate than your old method and still lie about how sure it is. That's the trap. The forecast got sharper, so everyone assumed the confidence got trustworthy too. It didn't. Confidence calibration is a separate property, and most tools never report it because backtesting their own bands isn't flattering.
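To make the split concrete, here is a minimal sketch that scores the same set of forecasts two ways. Every number in it is invented for illustration, not taken from any client or vendor. The function names `mape` and `coverage` are my own, and mean absolute percentage error is just one reasonable stand-in for "accuracy."

```python
def mape(points, actuals):
    """Accuracy: mean absolute percentage error of the point estimates."""
    return sum(abs(p - a) / abs(a) for p, a in zip(points, actuals)) / len(actuals)


def coverage(lows, highs, actuals):
    """Calibration: share of actuals that landed inside the stated band."""
    inside = sum(1 for lo, hi, a in zip(lows, highs, actuals) if lo <= a <= hi)
    return inside / len(actuals)


# Hypothetical figures in $M, made up for this example.
actuals = [1.50, 1.80, 1.30, 2.00, 1.60]

# Old spreadsheet: rough centers, wide bands (point +/- 0.40).
old_points = [1.80, 1.50, 1.60, 1.70, 1.90]
old_lows = [p - 0.40 for p in old_points]
old_highs = [p + 0.40 for p in old_points]

# New tool: tighter centers, much narrower bands (point +/- 0.08).
new_points = [1.60, 1.70, 1.40, 1.85, 1.65]
new_lows = [p - 0.08 for p in new_points]
new_highs = [p + 0.08 for p in new_points]

print(f"old: error {mape(old_points, actuals):.1%}, "
      f"coverage {coverage(old_lows, old_highs, actuals):.0%}")
print(f"new: error {mape(new_points, actuals):.1%}, "
      f"coverage {coverage(new_lows, new_highs, actuals):.0%}")
```

In this made-up data the new tool wins on error and loses badly on coverage. One test says upgrade, the other says don't believe the band. You need both.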

The band is the part that drives decisions

Here's why this matters more than the point estimate. Nobody bets the company on a single number. People bet on the range. "We're 85% sure we clear $1.4M" is what tells a founder he can sign the lease, make the hire, skip the raise. The band sets the action. So when the band is decoration, the decision is built on a confidence that was never earned.

A point estimate that's wrong gets caught at month-end. A confidence interval that's wrong gets caught when you've already committed to the thing it told you was safe.

It gets worse as conditions shift. These models drift. Train one on two stable years and the bands stay calm right up until a tariff change or a churned anchor customer moves the ground under it. The point estimate adjusts a little. The confidence band, in my experience, is the last thing to widen, which means the tool sounds most certain right when it should sound least.

Backtest the band, not just the number

Fixing it costs an afternoon. Take every forecast the tool produced over the last six months whose actuals are now in, group them by stated confidence, and count how often actuals landed inside the band. If it claimed 80% and actuals landed inside the band about 80% of the time, trust it and move on. If it claimed 80% and hit 45%, you don't throw the tool out. You write the real number next to it and widen what you tell the board.
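The backtest itself is short. Below is a sketch assuming you can export each past forecast as a stated confidence, a low, a high, and the actual that eventually came in. The `Forecast` record and `hit_rates` function are names I made up, not any vendor's API. The sample rows are hypothetical, arranged to mirror the eight-forecast, three-hit example from the top of the post.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Forecast:
    claimed: float  # stated confidence, e.g. 0.85
    low: float      # bottom of the band, $M
    high: float     # top of the band, $M
    actual: float   # what actually happened, $M


def hit_rates(history):
    """Return {claimed confidence: (hits, total, hit rate)}."""
    counts = defaultdict(lambda: [0, 0])  # claimed -> [hits, total]
    for f in history:
        counts[f.claimed][1] += 1
        if f.low <= f.actual <= f.high:
            counts[f.claimed][0] += 1
    return {c: (h, n, h / n) for c, (h, n) in sorted(counts.items())}


# Hypothetical history: eight forecasts at a claimed 85%, three inside the band.
history = [
    Forecast(0.85, 1.4, 1.9, 1.6),
    Forecast(0.85, 1.3, 1.8, 1.1),
    Forecast(0.85, 1.5, 2.0, 2.3),
    Forecast(0.85, 1.4, 1.9, 1.2),
    Forecast(0.85, 1.6, 2.1, 1.8),
    Forecast(0.85, 1.5, 2.0, 1.3),
    Forecast(0.85, 1.4, 1.9, 1.5),
    Forecast(0.85, 1.3, 1.8, 2.0),
]

for claimed, (hits, total, rate) in hit_rates(history).items():
    print(f"claimed {claimed:.0%}: {hits}/{total} inside the band = {rate:.1%}")
```

The realized rate is the number to write next to the vendor's. If the tool reports several confidence levels, each one gets its own row, and each row gets its own reality check.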

Then keep checking it quarterly, because calibration decays as the business changes. The model that was honest in January is describing a company that no longer exists by June.
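For the quarterly check, one simple approach is to bucket the same backtest by the quarter each forecast was made in and watch the hit rate over time. This is a sketch under assumptions: the tuple layout, the `quarter` and `coverage_by_quarter` names, and the sample dates and figures are all hypothetical.

```python
from datetime import date


def quarter(d: date) -> str:
    """Label a date with its calendar quarter, e.g. '2026-Q1'."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"


def coverage_by_quarter(records):
    """records: iterable of (forecast_date, low, high, actual).

    Returns {quarter label: share of actuals inside the band}.
    """
    tally = {}  # quarter -> (hits, total)
    for made_on, low, high, actual in records:
        key = quarter(made_on)
        hits, total = tally.get(key, (0, 0))
        tally[key] = (hits + (1 if low <= actual <= high else 0), total + 1)
    return {q: hits / total for q, (hits, total) in sorted(tally.items())}


# Hypothetical records in $M: calm first quarter, shaky second.
records = [
    (date(2026, 1, 15), 1.4, 1.9, 1.6),
    (date(2026, 2, 1), 1.4, 1.9, 1.7),
    (date(2026, 2, 15), 1.5, 2.0, 1.8),
    (date(2026, 3, 1), 1.5, 2.0, 1.6),
    (date(2026, 4, 1), 1.5, 2.0, 1.2),
    (date(2026, 4, 15), 1.5, 2.0, 1.3),
    (date(2026, 5, 1), 1.4, 1.9, 1.5),
    (date(2026, 5, 15), 1.4, 1.9, 1.1),
]

for q, rate in coverage_by_quarter(records).items():
    print(f"{q}: {rate:.0%} of actuals inside the band")
```

A few forecasts per quarter is a thin sample, so I'd read this as a smoke alarm, not a verdict. A steady slide from one quarter to the next is the signal to widen the band before the board sees it.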

A forecast that admits it's a coin flip is worth more than one that's confidently wrong. The second one just costs you later, after you've already acted on it.

Phil Bolton

Founder & Principal at Manitou Advisory

