What an automation audit should actually look like

Forward this to procurement before the next AI pitch.

A friend of mine got four pitches for “AI automation audits” in the last quarter alone. All four opened with the same statistic: the MIT 2025 “State of AI in Business” finding that 95 percent of enterprise generative AI pilots fail to produce a measurable financial return. All four positioned their audit as the cure. None of the four would have changed that failure rate, because the audit itself is one of the leading causes of it.

He forwarded me the decks, and they’re nearly identical. The same stat. The same “opportunity assessment” deliverable. The same logo wall of prior clients. And the same gap between the deck and anything an executive could do on Monday morning without the auditor in the room.

That’s what caught my attention: not why automation fails, but why the most-marketed instrument for fixing it is itself one of the failure mechanisms.

The audit that’s a sales motion

Most “automation audits” in market right now are sales tools dressed as diagnostics. They exist to sell the next engagement, not to give the buyer something they can act on without the auditor present. I’ve found four specific tells to be most common. Once you have language for them, you can spot the pattern in the first thirty minutes of a pitch.

  1. Vendor-aligned framing. Every recommendation routes back to a tool or platform the auditor implements. The Big 4 firms all launched in-house agentic AI audit platforms in 2026, and their assessments increasingly conclude, to no one’s surprise, that the buyer needs their platform. That isn’t an accident; it’s the business model. A 2026 DAS Advanced Systems critique of Big 4 work with mid-market companies put it directly: these firms excel at strategy and pilot development but tend to disappear when it’s time for full implementation. The audit is the lead-gen instrument for the next engagement, not a finding the buyer can carry alone.

  2. Vague-promise language with no measurable outcome by a date. The deck talks about enterprise-wide change, strategic roadmaps, and organizational alignment, but it never commits to a number by a deadline. McKinsey itself has named this failure mode “death by 1,000 pilots.” A pilot that the auditor won’t agree to kill, in writing, is not a pilot. It’s a budget line with a fancy name.

  3. A deliverable orphaned from execution. The team reads the slides, nods, and nothing happens, because nothing in the slides told them what to do on Monday morning. McKinsey’s “Rewired” research found that organizational factors (workflow redesign, sustained executive commitment, the political work of moving people through change) beat technology choices in the programs that actually worked. My own bias leans toward two failure modes: anti-AI sentiment from operators who fear they’ll be replaced, and poor task selection at the front of the project. A slide deck doesn’t redesign a workflow, it doesn’t unstick a manager who is quietly sabotaging the pilot, and it definitely doesn’t surface the wrong-task-chosen problem until the pilot is well underway.

  4. The recommendation is always cost reduction. The audit lands on labor savings, expense control, efficiency gains, and headcount reduction. The commercial side of the org, the part where revenue lives, barely appears, because the auditor was scoped to operations from the day the SOW was signed. This is the loudest tell because it’s the easiest to sell (what CFO doesn’t want to hear “margin expansion”?). Revenue is harder to scope and easier to defer. An audit calibrated towards the easiest sell is calibrated away from the most important business question.

If you can spot the bad audit, you can specify the good one.

The three artifacts a real audit produces

Here’s what a real audit with standalone value produces: three things an executive can act on without the auditor present. Three. Not seven principles, not twelve workstreams, not a slide that says “next steps.” Three artifacts that survive the consultant walking out the door. Anything else is theater.

A task inventory with agentability scores. A clear-eyed map of where the org is today. Fifty to a hundred and fifty named tasks across both the operations and the commercial surface of the organization, each scored on a rubric covering input clarity, decision determinism, failure cost, verification ease, and exception frequency (I cover the five scoring dimensions in depth in the agentability scoring piece). Not “AI opportunities.” Specific named tasks, with the score and the reasoning behind the score. The commercial-side inclusion is the deliberate move that separates a real audit from the operations-biased default. Most audits skip commercial work because it’s harder to scope and the CFO can’t put a savings line item on it. That is exactly why a real audit insists on including it. The first-pass response on an inbound RFP, the qualification on a new lead, the prep work for a senior commercial rep ahead of a renewal conversation: all of these are scorable, most of them are high-agentability, and almost none of them appear in the typical audit. An exec who walks out with the inventory can sort it and see the top twenty candidates without re-engaging the consultant.
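To make “sort it and see the top twenty” concrete, here is a minimal sketch of how such an inventory might be represented. The five dimensions are the ones named above; everything else is an assumption for illustration: the 1-to-5 scale (where 5 always means more favorable to automation, so a low failure cost earns a high score), the equal weighting, the TaskScore and top_candidates names, and the sample rows, which are made up rather than drawn from any real audit.

```python
from dataclasses import dataclass


# Hypothetical rubric: every dimension is scored 1 (unfavorable to automation)
# to 5 (favorable). A LOW failure cost or LOW exception frequency therefore
# earns a HIGH score. Scale and equal weighting are illustrative assumptions.
@dataclass(frozen=True)
class TaskScore:
    name: str
    surface: str  # "operations" or "commercial"
    input_clarity: int
    decision_determinism: int
    failure_cost: int
    verification_ease: int
    exception_frequency: int
    reasoning: str  # the written justification behind the score

    def agentability(self) -> float:
        """Unweighted mean of the five rubric dimensions."""
        dims = (
            self.input_clarity,
            self.decision_determinism,
            self.failure_cost,
            self.verification_ease,
            self.exception_frequency,
        )
        return sum(dims) / len(dims)


def top_candidates(inventory: list[TaskScore], n: int = 20) -> list[TaskScore]:
    """Return the n highest-scoring tasks, best first."""
    return sorted(inventory, key=lambda t: t.agentability(), reverse=True)[:n]


# Made-up sample rows for illustration only.
inventory = [
    TaskScore("First-pass response on inbound RFP", "commercial", 4, 3, 4, 4, 3,
              "Structured inputs; a human reviews the draft before it goes out."),
    TaskScore("Qualification on a new lead", "commercial", 4, 4, 5, 4, 4,
              "Clear criteria; a wrong call is cheap to reverse."),
    TaskScore("Month-end accrual adjustments", "operations", 3, 2, 2, 3, 2,
              "Judgment-heavy; errors are costly to unwind."),
]

for task in top_candidates(inventory, n=2):
    print(f"{task.agentability():.1f}  [{task.surface}]  {task.name}")
```

The reasoning field is the part that matters most in this sketch: a score without the written justification behind it cannot be challenged or re-scored once the consultant has left.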

A stack diagnosis. A read on what’s already paid for but sitting unused. What the existing TMS, ERP, RPA tools, and data warehouse can already do, but aren’t doing. Most automation projects fail because the buyer pays for a new tool when the current stack had the capability sitting unused, configured two staffing changes ago and never picked back up. The diagnosis names the gap between what’s provisioned and what’s being used. Pair this with the workflow re-engineering lens: some workflows score low on agentability not because the tools are missing but because the workflow shape itself prevents an agent from operating reliably, and re-engineering the workflow is the prerequisite to automating it. A real audit names both gaps, the unused tool capability and the workflow shapes that block agentability, so the buyer can sequence them. Sometimes the right move is to buy nothing, redesign the workflow, then revisit.

A pilot roadmap with kill criteria. The map forward, engineered so that failure shows up early enough to act on, not six months and a million dollars in. Two to four pilots, each with an owner, a budget, a success metric, and a kill criterion. Of the three artifacts, the kill criterion is the one auditors most often refuse to write, because committing to it on paper commits them to invoking it in practice. That commitment is what separates a real pilot from a budget line with a fancier name. Kill criteria deserve their own section.

Kill criteria: the undervalued line

A pilot without a kill criterion is a budget line, not a project. The criterion exists for one reason: to give the buyer permission to stop, in writing, before the sunk-cost dynamic kicks in. Every executive who has wrestled with a long pilot knows the feeling. Eight months of investment, the team has built relationships with the agent’s outputs, the consultant is angling for the next phase, and the question “is this actually working” has gotten harder to ask, not easier.

Compare two versions.

The fake: “We’ll evaluate the pilot at month six.” No criterion, no metric, no commitment. The evaluation is a conversation, the conversation is steered by whoever has the most political weight at the table, and the pilot survives.

The real: “We stop the pilot if the agent’s draft variance memo requires more than fifteen minutes of analyst rework on average across weeks four through six, and the rework time isn’t trending downward week over week.” Specific. Measurable. Time-bound. The trend qualifier matters, because a slow start shouldn’t trigger a premature kill. The criterion gets baked into the SOW before the pilot begins, and reviewed at the named weeks.
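A criterion written this precisely can be checked mechanically, which is a fair test of whether it is precise enough. The sketch below encodes the example above under two assumptions of mine: the input is the average rework minutes per memo for each week (so the window average is an unweighted mean of weekly averages), and “trending downward week over week” means every week in the window is strictly lower than the one before. The function name should_kill and its parameters are hypothetical.

```python
from statistics import mean


def should_kill(
    rework_minutes_by_week: dict[int, float],
    weeks: tuple[int, ...] = (4, 5, 6),
    threshold_minutes: float = 15.0,
) -> bool:
    """Return True when the pilot meets the written kill criterion."""
    window = [rework_minutes_by_week[w] for w in weeks]

    # Condition 1: average rework across the review window exceeds the threshold.
    over_threshold = mean(window) > threshold_minutes

    # Condition 2 (assumed reading): "trending downward" means each week is
    # strictly lower than the week before it.
    trending_down = all(later < earlier for earlier, later in zip(window, window[1:]))

    # Kill only if the pilot is over the threshold AND not improving.
    return over_threshold and not trending_down


# Slow start that is improving: over the threshold, but trending down.
print(should_kill({4: 22, 5: 18, 6: 14}))

# Stalled: over the threshold and not improving.
print(should_kill({4: 19, 5: 21, 6: 20}))
```

If the two parties cannot agree on how to fill in the inputs to a check like this before signing, the criterion is not yet specific enough to enforce.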

The reason most consultants resist this line is structural, not malicious. A kill criterion is a commitment to walk away from the next phase if the pilot fails. The consultants who write it are the ones with the conviction to lose the second engagement rather than carry a failed pilot into it. Most won’t write it for the same reason most car salesmen won’t price the trade-in before the sale: doing so weakens the negotiating position that pays them.

Once you know what the artifacts should be, you can interview your auditor.

The diligence questions for your auditor

Turn the tables. You’re the buyer. You get to interview the auditor after scoping but before signing.

Five questions separate the real audit from the sales motion. An auditor who fails three of five isn’t worth a second meeting. The point of these isn’t to trap anyone. It’s to test the methodology at the moment they have enough information to give a real answer, with no signed engagement to hide behind.

1. Show me the rubric you’ll use on day one.

Wrong answer: “We customize per client.” Translation: no rubric exists, they’ll improvise, and whatever arrives in the deliverable will look novel because it was invented mid-engagement.

Right answer: They produce a real rubric used in prior engagements, with the dimensions defined and the scoring scale calibrated. You don’t need them to share the proprietary parts of their work. You do need them to show that “methodology” is more than a slide.

2. Sketch a kill criterion for the first pilot, then walk me through what would trigger it.

Wrong answer: “We’ll keep refining this together.” Translation: they have the scoping facts in hand and are treating the criterion as a moving target. That posture predicts how they’ll behave when the pilot is running and the criterion needs to be enforced.

Right answer: A specific, measurable criterion proposed from what scoping surfaced, with the trigger conditions named (a metric, a threshold, a time-bound window) and a plausible scenario in which they’d have to invoke it. This isn’t a portfolio test, and it isn’t a request for finalized contract language (that gets negotiated before signing). The point is to confirm alignment on what a kill criterion should include and willingness to commit one as a structural part of the pilot. Both matter on every audit.

3. Who owns the pilot if it works? Who owns it if it doesn’t?

Wrong answer: A vague handoff to “your team.” Translation: they’ll exit, the project will die, and the post-mortem will conclude that AI just isn’t ready for your industry yet.

Right answer: A named human embedded during the pilot, plus a clean transition plan with named handoff points and a written brief of what the embedded person learned.

4. What does your deliverable look like in ninety days if our team stops responding to your emails?

Wrong answer: “We’ll keep you accountable.” Translation: the deliverable lives or dies on their continued involvement.

Right answer: A self-executing artifact, plus a one-page brief that a new hire could read and continue from. The deliverable should be valuable in the absence of the consultant. If it isn’t, what you bought is a relationship, not a finding.

5. What’s your stake in the game?

Wrong answer: Hourly billing or a fixed retainer with no outcome exposure. Or a vague “your success is our success.”

Right answer: Outcome-tied pricing on at least one pilot, or a structure where their continued payment depends on a measurable result. You don’t need every dollar tied to outcomes. You need at least one dollar that hurts to lose if the pilot misses.

What you should actually pay

Automation audits live in four tiers in the current market. Three are worth knowing about. One is a trap. The right tier depends on the scope of the decision you’re trying to make, not on the size of your company.

The vendor-funded pitch ($0 to $5K). A vendor offering a “complimentary audit” or a heavily discounted one. The conclusion will, with surprising regularity, be that you need their tool. Useful only as competitive intelligence on what the vendor is selling. Never as a buying decision.

The focused diagnostic ($2K to $10K). Two to five days of senior consulting time. Scope is deliberately narrow: one workflow, one business unit, or one decision you’re stuck on. Produces a subset of the three artifacts described above, typically the task inventory for the scoped area plus one or two pilot recommendations, not a full org-wide stack diagnosis. The right tier for a first engagement, a workflow you’re stuck on, or a sanity check before committing to a larger scope. Anti-vendor-lock-in by design: a focused diagnostic earns the consultant the right to a larger engagement. It doesn’t assume one.

The honest assessment ($15K to $50K). Four to eight weeks. One senior consultant plus an analyst. Produces all three artifacts at full scope across both the operations and commercial surfaces of the organization. The right tier for an org-wide automation strategy or a multi-departmental rollout, and the only tier where a buyer should expect a defensible answer to the “what should we do across the whole org” question.

The multi-quarter program ($150K and up). A program that bundles assessment with implementation. Worth it only if you’ve already done a focused diagnostic or an honest assessment separately and know what you’re buying. Otherwise the implementation labor pads out the engagement, the diagnostic gets buried under build work, and the deliverable becomes a relationship rather than a finding.

The pattern that distinguishes the focused diagnostic from the vendor-funded pitch: a focused diagnostic still produces an artifact, the task inventory and pilot recommendation, that the buyer can act on without the consultant. A vendor-funded pitch produces a sales conclusion that the buyer can’t act on without buying the vendor’s tool. The price points overlap. The artifacts don’t.

The piece itself is a procurement instrument.

Forward this before the next pitch

Send this to your procurement team before the next pitch. The five questions are a screening tool. If the auditor fails three of five, walk. If they pass four of five and the fifth is “stake in the game,” negotiate the stake into the engagement structure. That conversation alone tells you whether you’re hiring a consultant or buying a brochure.
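For a procurement team that wants the screening rule written down as a checklist, a minimal sketch follows. The question keys and the screen function are hypothetical names. I read “fails three of five” as three or more. The essay only specifies two outcomes (walk, and negotiate the stake), so the “proceed” and “judgment call” branches are assumptions added to make the function total.

```python
# Short hypothetical keys for the five diligence questions, in order.
QUESTIONS = ("rubric", "kill_criterion", "ownership", "standalone_deliverable", "stake")


def screen(passed: dict[str, bool]) -> str:
    """Map pass/fail answers on the five questions to a screening outcome."""
    missing = set(QUESTIONS) - set(passed)
    if missing:
        raise ValueError(f"unanswered questions: {sorted(missing)}")

    failed = [q for q in QUESTIONS if not passed[q]]

    if len(failed) >= 3:
        return "walk"  # stated rule (read as three or more failures)
    if failed == ["stake"]:
        return "negotiate stake"  # stated rule: four passes, stake is the miss
    if not failed:
        return "proceed"  # assumption: a clean pass needs no special handling
    return "judgment call"  # assumption: the essay does not cover these cases


answers = {
    "rubric": True,
    "kill_criterion": True,
    "ownership": True,
    "standalone_deliverable": True,
    "stake": False,
}
print(screen(answers))
```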

The pattern holds across regulated industries. Maritime fleet operations, regional banks, mid-market healthcare, regulated manufacturers: the same four tells, the same three artifacts, the same five questions. The vocabulary on the deck will change, but the mechanics of a real engagement should not.

I’d love to hear from you. If you’ve sat through one of these pitches recently, the kind where the deck opens with the MIT statistic and ends with “let’s talk about phase two,” I want to know what survived contact with your team and what didn’t. I read every comment.