Agentability scoring

Most automation projects fail at the task-selection step. Here's a rubric for picking the right tasks before the next vendor walks in.

27 May 2026 · 18 min read

The vendor demo was impressive. The agent drafted customer responses in seconds, handled edge cases without breaking, and produced output that looked indistinguishable from a senior support analyst’s. The pilot ran for six months. The team reported a 12 percent reduction in inbound queue volume. Everyone shook hands and called it a win.

Meanwhile, the commercial side didn’t bid on 4 of the 7 major RFPs that came in during the same six months. The two senior commercial people who were supposed to chase the expansion conversations from Q1 still hadn’t. The team had automated a queue, but the business kept losing the same deals it was losing before the pilot started.

You didn’t pick the wrong vendor; you picked the wrong work.

The 12-percent problem

Most automation projects fail at the task-selection step, not the implementation step. The vendor’s agent is fine. The team is fine. The pilot is fine. The pilot just picked the wrong tasks to automate, which is why 12 percent of inbound disappearing didn’t produce a 12 percent lift to the business.

Two failure modes you’ve probably seen.

First: the pilot automates the easy work and leaves the hard work concentrated. The 88 percent that remained was the 88 percent that already took the team’s time. The 12 percent automated was the 12 percent that took the least. Net relief on the team’s actual day: marginal.

Second: the pilot automates something that wasn’t actually the bottleneck. The work that felt like the bottleneck (drafting responses) wasn’t the work constraining the team (deciding which responses needed a phone call). The vendor solved the visible problem instead of the operational one.

Either failure mode, on its own, is recoverable. Together, you have a six-month pilot with great metrics and zero real change.

The deeper pattern is structural. Most automation projects get scoped by the operations side of the org, because operations is where the visible work lives and where the cost-cutting framing already exists. The commercial side, where the revenue work is, rarely gets a seat at the table when the task inventory gets built. You end up auditing the wrong half of the org before any scoring rubric ever gets applied, then picking the wrong half of that half.

The fix is not better vendors. The fix is a triage rubric you can run on your own work, on both sides of the org, before anyone walks in the door to pitch.

What agentability actually means

Agentability is the readiness of a specific named task to be handed off to an AI agent reliably enough that you can remove it from the human review queue. The definition is important because it forces you to score tasks, not the more ambiguous “opportunities” or “use cases”.

The vocabulary matters, because each of the alternative terms gives you the wrong unit of analysis.

“AI opportunity” is the vendor’s term. It refers to anywhere AI could theoretically be applied. Every task is an AI opportunity, which means the term is useless for triage. If everything qualifies, then nothing does.

“Use case” is the consultant’s term, which is slightly better. A use case refers to a specific deployment pattern, like “AI-assisted underwriting.” It has a distinct shape, but it’s still scoped at the wrong level, because a single use case might cover dozens of distinct tasks, each with different agentability. “AI-assisted underwriting” includes initial intake, document classification, risk scoring, exception flagging, communication drafting, and final approval. Four of those might be fully ready. Two might not be ready for another two years. The use case averages them. The rubric scores them individually.

“Task” is the buyer’s term. A task is a specific, repeatable unit of work that someone in your org does today. “Extracting opportunity signals from an inbound commercial inquiry and routing to the right desk” is a task. “Improving customer experience with AI” is not. The difference is observable versus amorphous.

Agentability scores tasks. Anything that doesn’t score at the task level is a marketing instrument, not a triage mechanism. You can tell which is which because the buyer can’t use the marketing instrument to make a decision.

The five dimensions

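Agentability is the joint product of five dimensions. A task that reads favorably on four of them and poorly on one is not ready. A task that reads moderate across all five usually is. Mind the direction of each scale: for Input clarity, Decision determinism, and Verification ease, high is the favorable reading; for Failure cost and Exception frequency, low is.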

These five dimensions are my working version. I’m still pressure-testing them against the operators I talk to, and they’ll likely evolve. What I’m more confident about is the underlying move: scoring tasks against named dimensions. The specific dimensions can shift, but the discipline should not.

The rubric is a mental model, not a spreadsheet. You should be able to give a low, medium, or high read on each dimension as you think about the task itself, without sitting down with a 25-point scoring grid. Numerical scoring is useful for cross-team comparison or rigorous prioritization, but the goal is a tool you can apply at conversation speed.

1. Input clarity. Is the shape of the input consistent across instances of the task? This is about format and structure, not about content novelty (Exception frequency, the fifth dimension, handles that). A consistently shaped input can still contain surprising content. A varied-shape input can still be substantively routine.

High: structured form fields, templated documents with predictable shape, API payloads.

Low: free-text customer emails in five languages, scanned paper documents with no consistent layout, voice messages.

2. Decision determinism. Does the task have a single right answer that a competent human would reach the same way every time?

High: classification (“which queue does this belong in?”), extraction (“what is the invoice number?”), pattern matching against a known taxonomy.

Low: tradeoff weighing (“is this multi-year charter offer for our vessel worth accepting?”), strategic judgment, novel reasoning under ambiguity.

3. Failure cost. How recoverable is a wrong output, for the party who bears it?

This rolls together two things that operate in tandem: the magnitude of the loss and its reversibility. A wrong sales-forecast slide is medium-magnitude and mostly recoverable next quarter. A wrong wire transfer is large-magnitude and not recoverable at all. They land at very different agentability scores even if the nominal dollar exposure looks similar. The question to ask is “how acceptable is this loss to the specific party in question,” not “how big is the number.”

High: regulatory fine, customer-relationship damage, irreversible financial commitment, safety event.

Low: minor rework, the human catches it on review, the next instance gets it right, no external party absorbs the cost.

4. Verification ease. Can a human spot a bad output quickly?

High: 30-second skim tells you whether the routing was right, the classification was correct, the extracted field is plausible.

Low: catching the error requires re-reading the source document, cross-referencing five systems, or actually being the expert who originally would have done the task.

5. Exception frequency. How often does the task encounter content that surprises the agent, situations outside the patterns it can handle reliably?

This is distinct from Input clarity, which is about the shape of the input. A consistently shaped input can still surprise the agent with unusual content. A varied-shape input can still be substantively routine.

High: the tail is long, exceptions are the norm, and “edge case” is most of the work.

Low: stable, repeatable, the long tail is short.
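The five reads can be written down as a small data structure, if that helps you keep the directions straight. What follows is a minimal sketch, not the rubric itself: the names `Level`, `TaskReading`, and `is_ready` are mine, and the weakest-link rule (ready only if no dimension reads unfavorably) is one assumed way to encode "high on four and low on one is not ready." Note that Failure cost and Exception frequency are flipped before comparison, because a high read on either counts against the task.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class TaskReading:
    """Raw low/medium/high reads, using the scales as described in the text."""
    name: str
    input_clarity: Level
    decision_determinism: Level
    failure_cost: Level         # HIGH is bad: costly or irreversible
    verification_ease: Level
    exception_frequency: Level  # HIGH is bad: long tail of surprises

    def favorability(self) -> dict[str, Level]:
        """Flip the two 'high is bad' dimensions so HIGH always means agent-friendly."""
        def flip(level: Level) -> Level:
            return Level(4 - level)

        return {
            "input_clarity": self.input_clarity,
            "decision_determinism": self.decision_determinism,
            "failure_cost": flip(self.failure_cost),
            "verification_ease": self.verification_ease,
            "exception_frequency": flip(self.exception_frequency),
        }


def is_ready(task: TaskReading) -> bool:
    """Assumed weakest-link rule: a single unfavorable dimension blocks the task."""
    return min(task.favorability().values()) >= Level.MEDIUM


# Favorable on four dimensions, but bad outputs are hard to spot.
four_good_one_bad = TaskReading(
    "hypothetical task A",
    input_clarity=Level.HIGH,
    decision_determinism=Level.HIGH,
    failure_cost=Level.LOW,
    verification_ease=Level.LOW,
    exception_frequency=Level.LOW,
)

# Moderate everywhere.
moderate_everywhere = TaskReading(
    "hypothetical task B",
    Level.MEDIUM, Level.MEDIUM, Level.MEDIUM, Level.MEDIUM, Level.MEDIUM,
)

print(is_ready(four_good_one_bad))    # False
print(is_ready(moderate_everywhere))  # True
```

The point of the weakest-link choice is that strength on four dimensions cannot buy back a poor read on the fifth. An average would hide exactly the dimension that sinks the deployment.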

The two binary gates

Two things sit outside the rubric, as binary gates rather than scoring dimensions. Either one can short-circuit a task that otherwise scores well on all five.

Tool access. Can the agent actually perform the action: push the record to the CRM, send the message through Slack, write to the system of record? Some tasks score well on every dimension but cannot deploy because the agent has no API access to the system where the work happens. Binary check, run before scoring.

Audit and explainability requirement. Does the judgment this task requires have to be defensible to an auditor, a regulator, or a counterparty? Lending decisions, hiring screens, medical triage, regulated-industry compliance review. If the answer is yes, the task likely stays human even if the rubric scores it high. Another binary check.

The rubric scores the tasks that survive both gates. If a task fails a gate, the rubric doesn’t apply yet. Fix the gate first, or accept that the task stays human until you do.

A clarifying note on the audit gate. It applies to the judgment portion of a task, not necessarily the whole task. Many regulated workflows have a deterministic core that can be agent-handled (the loan application’s debt-to-income calculation; the medical triage protocol’s vital-signs sorting; the compliance review’s match against a known sanctions list) and a judgment margin that requires human accountability (the loan officer’s discretionary call on a borderline applicant; the clinician’s override on an ambiguous presentation; the compliance officer’s signature on a false-positive exception). The gate applies to the margin. If the workflow allows you to cleanly separate the two, you have two tasks instead of one, and the rubric applies to each independently. Segmenting the task to isolate the deterministic core from the judgment margin is itself a workflow re-engineering move.
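Here is one way to sketch the gates as code that runs before any scoring happens. Everything named here (`GateCheck`, `gate_verdict`, the field names, the wording of the reasons) is an illustrative assumption. The example also shows the segmentation move: a loan workflow split into a deterministic core and a judgment margin becomes two gate checks with two different outcomes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GateCheck:
    task: str
    has_tool_access: bool              # can the agent act in the system of record?
    judgment_must_be_defensible: bool  # to an auditor, regulator, or counterparty


def gate_verdict(check: GateCheck) -> Optional[str]:
    """Return a blocking reason, or None if the task may go on to be scored."""
    if not check.has_tool_access:
        return "blocked: no tool access (fix the gate or keep human)"
    if check.judgment_must_be_defensible:
        return "blocked: audit requirement (likely stays human)"
    return None


# One regulated workflow, segmented into two tasks.
loan_workflow = [
    GateCheck("debt-to-income calculation", True, False),
    GateCheck("discretionary call on a borderline applicant", True, True),
]

for check in loan_workflow:
    reason = gate_verdict(check)
    print(f"{check.task}: {reason or 'passes both gates, score it'}")
```

Only the first task reaches the rubric. The second stays with the loan officer, which is the outcome the audit gate is there to protect.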

A note on calibration

This rubric is calibrated for LLM-class agents, which fail differently than RPA. RPA-era thinking treated Exception frequency as a near-fatal score; brittle robotic-process automation broke on anything unusual. LLM-era agents handle messy inputs and unusual content reasonably well, which softens Exception frequency’s weight.

The trade-off is that they fail differently. An LLM agent is confidently wrong in ways that look right, which means Verification ease matters more in 2026 than it did in the RPA era, because the wrong outputs do not announce themselves.

If you’re applying RPA-era intuitions to LLM-era agents, you’ll under-weight verification and over-weight exception handling. Don’t.
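If you do use numbers for prioritization, the calibration shift shows up as a change in weights. The sketch below is illustrative only: the weight values are invented, and only their direction (Exception frequency down, Verification ease up) comes from the argument above. Scores are favorability values on a 1-to-3 scale where 3 is always the agent-friendly end, so Failure cost and Exception frequency are assumed to be flipped already.

```python
# Hypothetical weights. Only the relative emphasis reflects the text.
RPA_ERA_WEIGHTS = {
    "input_clarity": 1.0,
    "decision_determinism": 1.0,
    "failure_cost": 1.0,
    "verification_ease": 0.5,
    "exception_frequency": 2.0,
}
LLM_ERA_WEIGHTS = {
    "input_clarity": 1.0,
    "decision_determinism": 1.0,
    "failure_cost": 1.0,
    "verification_ease": 2.0,
    "exception_frequency": 0.5,
}


def weighted_score(favorability: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of favorability scores (3 = agent-friendly on every dimension)."""
    total_weight = sum(weights.values())
    return sum(weights[dim] * favorability[dim] for dim in weights) / total_weight


# Two made-up tasks that differ only on the two dimensions in question.
messy_but_checkable = {
    "input_clarity": 2, "decision_determinism": 2, "failure_cost": 2,
    "verification_ease": 3,    # bad outputs are easy to spot
    "exception_frequency": 1,  # long tail of surprises
}
tidy_but_opaque = {
    "input_clarity": 2, "decision_determinism": 2, "failure_cost": 2,
    "verification_ease": 1,    # bad outputs look right
    "exception_frequency": 3,  # stable and repeatable
}

for label, weights in [("RPA-era", RPA_ERA_WEIGHTS), ("LLM-era", LLM_ERA_WEIGHTS)]:
    a = weighted_score(messy_but_checkable, weights)
    b = weighted_score(tidy_but_opaque, weights)
    winner = "messy_but_checkable" if a > b else "tidy_but_opaque"
    print(f"{label}: prioritize {winner}")
```

Under the RPA-era weights the tidy task ranks first; under the LLM-era weights the ranking flips. A weighted mean like this is for ordering candidates, not for declaring one ready: a task with a poor read on any single dimension still deserves a second look.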

Walking the rubric through three tasks

Definitions are abstract. The rubric is a thinking tool only when you walk it through real work. Three tasks. Three different verdicts.

Routing commercial inquiries to the right desk (high agentability)

Most B2B businesses have a generic inquiry inbox. A commercial inquiry lands, someone scans it, decides which desk it belongs on (account management, new business, partnerships), and forwards it. The work is judgment-light, classification-shaped, and consistently delayed because nobody owns the inbox as their primary work.

Input clarity: medium-high. Inquiries vary in format, but the routing decision keys off a few stable signals (cargo and lane for maritime, product line and deal size for SaaS, claim type and severity for insurance, and so on). The shape is industry-specific, but the pattern is the same.
Decision determinism: high. The right desk exists. The task is classification, occasionally with a confidence score and a fallback to manual review for the ambiguous 5 percent.
Failure cost: low to medium. A misrouted inquiry gets re-routed manually, and the cost is one or two hours of delay, not a lost deal.
Verification ease: high. A senior commercial person can confirm the routing in seconds. Sample the agent’s outputs weekly, and if a misrouting pattern shows up, retrain the classifier on the misses.
Exception frequency: low to medium. Long-tail inquiries exist but are bounded. Most inquiries fit one of usually seven or eight recognizable shapes.

Verdict: high agentability. Automate, with a weekly sample review.

The payoff isn’t labor savings. The payoff is making sure no commercial inquiry sits in a generic inbox for two days while the rep who should see it is in meetings. Speed of routing converts to speed of response, which converts to deals you would have lost to the competitor who responded first.

Drafting first-pass RFP responses (mid agentability)

An RFP arrives. Someone needs to read it, extract the relevant fields, draft a quote with the right pricing posture, and send it back. The work is structured enough to be agent-friendly, and judgment-heavy enough that a senior sales rep still needs to be in the loop.

Input clarity: medium. RFPs vary in structure, but most have the same five or six fields that drive the quote (for maritime logistics, it might be cargo, lane, dates, volume, terms preferences, counterparty context, etc.).
Decision determinism: medium. A competent commercial person would draft similarly, but with judgment on pricing posture, terms, and which clauses to push back on. Two senior reps might draft the same RFP differently and both would be defensible.
Failure cost: medium to high. A bad draft can either undercut margin or annoy a counterparty. A senior review catches most of the issues before the draft goes out, but the review has to actually happen.
Verification ease: medium. A senior reviewing the draft can spot a pricing-posture issue in 60 to 90 seconds. The catch is that the senior has to make time for the review on every single draft.
Exception frequency: medium. Most RFPs fit a pattern. A small share need strategic judgment (long-term relationship value, risk expansion, unusual terms) and should escalate immediately.

Verdict: mid agentability. Automate the draft. Keep the senior in the review loop. Instrument the queue so the senior’s review time per RFP stays under a target.

The payoff is response capacity. The team bids on RFPs they were previously dropping because nobody had time to start the draft. The rep’s role shifts from drafting to reviewing, building momentum on the pipeline and letting one senior cover three to five times as many opportunities without working longer hours.

Pricing exceptions in renewal negotiations (low agentability)

A customer’s renewal is up. The standard renewal is templated, but this particular customer has a history of pushing for off-list pricing, terms exceptions, or scope changes. The senior on the account is going to negotiate this themselves, and the team is going to live with whatever they decide for the next 12 to 24 months.

Input clarity: low. The conversation is the input. Context lives in CRM notes, sales calls, the rep’s memory, and a folder of past contracts.
Decision determinism: low. Price judgment depends on customer history, competitive context, strategic relationship value, and the rep’s read of how aggressive the customer is feeling this quarter. Two reps making the same call might land in different places, both defensibly.
Failure cost: high. A wrong price decision affects revenue, retention, and precedent for other customers. The customer will reference your concession in the next renewal too.
Verification ease: low. Catching a bad call requires being the rep, or being someone who has seen the customer’s full history. You can’t sample-review this from the outside.
Exception frequency: high. Every negotiation is at least partially novel. The “edge case” is the work.

Verdict: low agentability today. Keep human.

It’s important to understand that the rubric is not saying low-agentability tasks stay human forever. It is saying that the combination of the current generation of agents AND the workflow as it stands today will produce errors faster than the team can catch them.

Two paths can raise the score: the agents improve (passive, happens over time without you), or you re-engineer the workflow so its inputs, decisions, and verification become more agent-friendly (active, a separate project you choose to undertake).

That second path deserves a name, because it gets confused with automation itself. Workflow re-engineering means changing the shape of the task before applying an agent, so the task scores differently on the rubric. For the pricing-exceptions example: you could re-engineer the workflow so customer history lives in structured fields rather than free-text CRM notes, so price-sensitivity signals are extracted and standardized rather than tacit, so the senior’s judgment is bounded by an explicit decision framework rather than full discretion. After re-engineering, the same task might score moderate rather than low.

The re-engineering is not automation. It’s the prerequisite to automation, and it’s often the bigger project of the two. The political dimensions, covered next, are usually heavier here than the political dimensions of deploying an agent on an already-high-agentability task.
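For readers who want the three walkthroughs side by side, here is a rough transcription into code. The readings are taken from the text ("low to medium" becomes 1.5, "medium-high" becomes 2.5, and so on). The verdict thresholds are my assumption, chosen so the sketch reproduces the three verdicts above. They are not part of the rubric, and the task keys are shorthand.

```python
# Readings on a 1-to-3 scale, with half steps for "low to medium" style reads.
LOW, LOW_MED, MED, MED_HIGH, HIGH = 1.0, 1.5, 2.0, 2.5, 3.0

# Dimensions where a high reading counts against the task.
HIGH_IS_BAD = {"failure_cost", "exception_frequency"}

TASKS = {
    "route commercial inquiries": {
        "input_clarity": MED_HIGH, "decision_determinism": HIGH,
        "failure_cost": LOW_MED, "verification_ease": HIGH,
        "exception_frequency": LOW_MED,
    },
    "draft first-pass RFP responses": {
        "input_clarity": MED, "decision_determinism": MED,
        "failure_cost": MED_HIGH, "verification_ease": MED,
        "exception_frequency": MED,
    },
    "pricing exceptions in renewals": {
        "input_clarity": LOW, "decision_determinism": LOW,
        "failure_cost": HIGH, "verification_ease": LOW,
        "exception_frequency": HIGH,
    },
}


def weakest_dimension(readings: dict[str, float]) -> tuple[str, float]:
    """Return the least agent-friendly dimension after flipping the inverted scales."""
    favorable = {
        dim: (4.0 - value if dim in HIGH_IS_BAD else value)
        for dim, value in readings.items()
    }
    dim = min(favorable, key=favorable.get)
    return dim, favorable[dim]


def verdict(readings: dict[str, float]) -> str:
    """Assumed thresholds on the weakest dimension; tune them to your own judgment."""
    _, score = weakest_dimension(readings)
    if score >= 2.5:
        return "high"
    if score >= 1.5:
        return "mid"
    return "low"


for name, readings in TASKS.items():
    dim, score = weakest_dimension(readings)
    print(f"{name}: {verdict(readings)} agentability (weakest: {dim} at {score})")
```

Reporting the weakest dimension alongside the verdict is the useful part. It tells you where a re-engineering project would have to start if you wanted the score to move.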

The diagnostic before the vendor meeting

The rubric is most useful before any vendor walks in. Run it on your own org first, with this three-step diagnostic.

1. List 50 to 100 tasks across both operations and commercial. Not opportunities, but specific named tasks that someone on the team does today. Pull from operations job descriptions, process documentation, or even a week of observing the team directly. Then deliberately go to the commercial side and pull the same data: inquiry handling, RFP drafting, follow-up sequencing on stalled deals, expansion-conversation planning, counterparty intelligence work. Most lists miss the commercial half entirely, which is how the wrong half of the org ends up audited and later burned by failed automation attempts.
2. Score each on the five dimensions. Low, medium, or high is enough for a first pass. The rubric is designed to be applied at conversation speed. A 1-to-5 numerical scale is fine if you want to sort by aggregate or compare across teams. Either way is better than not scoring. The point is the conversation the rubric produces, not the precision of the number. Don’t forget the two binary gates from earlier: a task that scores high on the rubric but fails the tool-access or audit-requirement gate is still not deployable.
3. Walk into the vendor meeting with two lists. List A: the 20 tasks scoring high. These are your candidate pilots. List B: 5 tasks scoring low that you specifically want to keep human, with explicit reasoning. Hand both to the vendor at the top of the meeting.
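If your inventory lives in a spreadsheet export, the two lists fall out of a sort and a filter. This is a sketch under stated assumptions: `ScoredTask`, `build_lists`, `missing_sides`, the 1-to-5 aggregate, and the cutoff values are all hypothetical, and the scores in the example are made up. The `missing_sides` check is there because the most common failure is an inventory with no commercial tasks in it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredTask:
    name: str
    side: str                    # "operations" or "commercial"
    score: int                   # aggregate on a 1-to-5 scale, 5 = most agent-ready
    passes_gates: bool           # tool access exists and no audit requirement
    keep_human_reason: str = ""  # explicit reasoning for List B candidates


def build_lists(
    inventory: list[ScoredTask],
    a_size: int = 20,
    b_size: int = 5,
    high_cutoff: int = 4,
    low_cutoff: int = 2,
) -> tuple[list[ScoredTask], list[ScoredTask]]:
    """List A: top-scoring tasks that pass both gates. List B: low scorers with a stated reason."""
    ranked = sorted(inventory, key=lambda t: t.score, reverse=True)
    list_a = [t for t in ranked if t.passes_gates and t.score >= high_cutoff][:a_size]
    list_b = [
        t for t in reversed(ranked)
        if t.score <= low_cutoff and t.keep_human_reason
    ][:b_size]
    return list_a, list_b


def missing_sides(inventory: list[ScoredTask]) -> set[str]:
    """Which half of the org has no tasks in the inventory at all?"""
    return {"operations", "commercial"} - {t.side for t in inventory}


# Made-up scores for illustration.
inventory = [
    ScoredTask("route commercial inquiries", "commercial", 5, True),
    ScoredTask("draft first-pass RFP responses", "commercial", 3, True),
    ScoredTask("negotiate renewal pricing exceptions", "commercial", 1, True,
               "every negotiation is partly novel and failure cost is high"),
    ScoredTask("classify inbound support tickets", "operations", 4, True),
    ScoredTask("review sanctioned-counterparty matches", "operations", 4, False,
               "officer's signature is on the audit trail"),
]

list_a, list_b = build_lists(inventory)
print("List A:", [t.name for t in list_a])
print("List B:", [(t.name, t.keep_human_reason) for t in list_b])
print("Sides missing from inventory:", missing_sides(inventory))
```

Note that the sanctions review scores a 4 and still stays off List A, because it fails a gate. List B only admits tasks that carry a written reason, which is what lets you hand the list to a vendor without having to defend it on the spot.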

The vendor’s reaction tells you which kind of vendor you are dealing with. A good vendor responds to List A by walking through how their agent maps to each task and where the friction will be. They respond to List B by agreeing not to scope into those tasks, and perhaps by suggesting tools that support the people doing that work as an adjacent project. A bad vendor responds to List B by trying to demo something fancy on a task you just told them to leave alone.

The shift here is small but material, and it speeds up every vendor meeting that follows. When the vendor walks into a meeting where you’ve already done the triage, the conversation moves from “what can your AI do?” to “where does your offering fit on these lists, and what would you add or remove?” The two questions sound similar but produce entirely different meetings.

The political trap

A task can score high on the rubric and still be impossible to deploy, because the deployment runs into a political dimension the rubric doesn’t measure. Political ownership creates a different kind of low-agentability. It’s not on the task, it’s on the org itself. The rubric scores the task. Deployment, or workflow re-engineering, requires scoring the org as well.

The trap shows up in both modes: deploying agents on tasks that already score high, and re-engineering workflows to raise tasks from low to moderate. The political dynamics overlap. Re-engineering is usually the harder of the two, because it requires the workflow’s current owners to redesign their own roles, not just hand off a piece of them. The patterns below apply to both. In each, the deployment case is the easier one to navigate. The re-engineering case is where most automation programs quietly die.

The three patterns to watch for.

Someone’s job identity is the task. The senior analyst whose value to the org is “I am the one who writes the variance memos.” Automating the memos automates their identity. The pattern: the analyst’s supervisor, not the analyst alone, has to redesign the role before the pilot, not after. If the supervisor isn’t willing to have that conversation, defer the pilot. You don’t fix this with technology.

Someone’s standing is the task. The ops manager who controls a queue because controlling the queue gives them visibility into the rest of the org. Automating the queue removes that standing. The pattern: give the manager a new useful control surface before the automation lands, not after. Otherwise they will quietly sabotage the pilot. They won’t admit they’re sabotaging it. They’ll just coincidentally be too busy to attend the planning meetings.

Someone’s compliance accountability is the task. The compliance officer who personally reviews every sanctioned-counterparty match because their signature is on the audit trail. Automating the review changes the accountability chain in ways that affect liability. The pattern: the automation needs to augment, not replace, the signature loop. The compliance officer needs to be in the room when the rubric is run, not consulted after the fact.

The rubric is a thinking tool, not a decision tool. The decision still requires the exec to read the politics. If the org isn’t ready, defer. The agent will still be there in three months, and the task will score the same. The political conditions might not.

The shift in the meeting

The next time a vendor walks in with a demo, the question you ask first is different.

Not “what can your AI do?” That question gives the vendor the floor. They’ll show you the most impressive thing in their demo, which may or may not have anything to do with your operating reality.

Ask instead: “Here are 20 tasks I think are ready. Here are 5 I am keeping human. Where does your offering sit on these lists, and what would you add or remove?”

The room changes when you bring the rubric. The vendor stops pitching and starts answering, and the conversation moves from “buy this” to “fit this.” That single move is worth more than most paid consulting engagements.

It’s also the kind of move that compounds. The next vendor meeting you take has the same two lists, and so does the vendor meeting after that. By the third meeting, you’ve stopped buying based on vendor demos and started buying based on operational fit. Which is the only frame in which automation actually moves the business.

Thanks for reading. I read every comment, so if you’ve run a rubric like this on your own org, or there’s a dimension I’m missing, I’d love to hear about it.