<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Context Engineering]]></title><description><![CDATA[What actually works when deploying AI at enterprise scale. One observation from the field, every week.]]></description><link>https://matthewkruczek.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!VXWr!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4aca9e-0c9d-493e-b1ff-064c3a072af8_736x736.png</url><title>Context Engineering</title><link>https://matthewkruczek.substack.com</link></image><generator>Substack</generator><lastBuildDate>Mon, 06 Apr 2026 23:04:55 GMT</lastBuildDate><atom:link href="https://matthewkruczek.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Matthew Kruczek]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[matthewkruczek@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[matthewkruczek@substack.com]]></itunes:email><itunes:name><![CDATA[Matthew Kruczek]]></itunes:name></itunes:owner><itunes:author><![CDATA[Matthew Kruczek]]></itunes:author><googleplay:owner><![CDATA[matthewkruczek@substack.com]]></googleplay:owner><googleplay:email><![CDATA[matthewkruczek@substack.com]]></googleplay:email><googleplay:author><![CDATA[Matthew Kruczek]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The AI Readiness Scorecard: Eight Dimensions That Determine Whether You Scale or Stall]]></title><description><![CDATA[If you only have a minute, here&#8217;s what you need to 
know.]]></description><link>https://matthewkruczek.substack.com/p/the-ai-readiness-scorecard-eight</link><guid isPermaLink="false">https://matthewkruczek.substack.com/p/the-ai-readiness-scorecard-eight</guid><dc:creator><![CDATA[Matthew Kruczek]]></dc:creator><pubDate>Mon, 06 Apr 2026 16:57:51 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VXWr!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4aca9e-0c9d-493e-b1ff-064c3a072af8_736x736.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>If you only have a minute, here&#8217;s what you need to know.</em></p><ul><li><p>Most organizations buying AI tools have never assessed whether their organization can absorb them. Cisco&#8217;s study of 8,000 companies found only 13% are actually ready to deploy AI at scale. The other 87% are spending money on capability they can&#8217;t operationalize.</p></li><li><p>AI readiness isn&#8217;t one thing. It&#8217;s eight: Leadership Commitment, Strategic Alignment, Data Readiness, Technology Infrastructure, Talent &amp; Skills, Process Maturity, Culture &amp; Change Readiness, and Governance &amp; Ethics. Weakness in any single dimension blocks the others.</p></li><li><p>Two of those dimensions, Leadership and Strategic Alignment, function as multipliers. Organizations that score high on those two extract more value from every dollar invested in the other six. Organizations that score low waste money regardless of how advanced their technology is.</p></li><li><p>Industry benchmarks exist: financial services typically scores 2.8-3.5 out of 5.0, healthcare 2.0-2.8, manufacturing 2.2-3.0. Any dimension scoring below 3.0 acts as a blocker, no matter how strong the rest of the scorecard looks.</p></li><li><p>This article gives you enough to sketch your own radar chart and find your weak spots. 
The rest of this series will show you what to do about each one.</p></li></ul><div><hr></div><p>I&#8217;ve <a href="https://matthewkruczek.ai/blog/ai-adoption-paradox">written before</a> about why AI pilots fail, and why only 10% of success comes from the technology itself. Most executives I talk to already know their AI initiatives are underperforming. What they don&#8217;t know is <em>where exactly</em> their organization is breaking down and which problems to fix first.</p><p>That&#8217;s a diagnosis problem. And almost nobody is doing it.</p><h2><strong>The purchase order that replaces strategy</strong></h2><p>Here&#8217;s the pattern I see repeatedly. A leadership team reads the headlines, feels the competitive pressure, and makes a purchasing decision. Copilot licenses for 5,000 employees. An Azure OpenAI deployment. A handful of pilot projects with aggressive timelines.</p><p>Six months later, Copilot adoption is at 15%. The pilots produced impressive demos that nobody scaled. The Azure deployment is running one use case that could have been built with a Python script. 
The organization spent millions on AI capability and has almost nothing to show for it.</p><p>The problem isn&#8217;t the tools. The problem is that nobody asked whether the organization was ready for them.</p><p>Cisco&#8217;s 2025 AI Readiness Index surveyed 8,000 organizations globally. Only 13% qualified as &#8220;Pacesetters,&#8221; companies actually converting AI investment into production value. Those Pacesetters were 4x more likely to move AI from pilot to production than everyone else. The gap wasn&#8217;t technology budgets or model selection. It was organizational readiness across multiple dimensions simultaneously.</p><p>The organizations that succeed assessed their readiness before they started buying. The ones that struggle skipped diagnosis and went straight to procurement.</p><h2><strong>Eight dimensions, not one</strong></h2><p>AI readiness isn&#8217;t a single score. It&#8217;s a profile across eight distinct dimensions, and weakness in any one of them can block progress in all the others.</p><p>I&#8217;ve been refining this framework across multiple enterprise engagements. The eight dimensions that consistently determine whether an organization scales or stalls:</p><h3><strong>1. Leadership Commitment</strong></h3><p>This is the dimension most organizations overestimate. Having a CEO who talks about AI in earnings calls is not leadership commitment.</p><p>Leadership commitment means a named executive sponsor with dedicated time and authority over AI initiatives. Not &#8220;the CIO will oversee this alongside their other responsibilities.&#8221; A person whose calendar reflects AI transformation as a primary obligation, not an afterthought. It means an AI steering committee with cross-functional representation, because AI that lives only in IT never reaches the business units where value is created. 
And it means a dedicated budget line item that isn&#8217;t buried in general IT spend, because initiatives funded through discretionary budget get cut the first time the quarter looks tight.</p><p>Here&#8217;s what a 2.0 looks like: the CEO mentioned AI at the last all-hands, someone in IT was told to &#8220;look into it,&#8221; and there&#8217;s no dedicated budget. A 4.0 looks different: a named sponsor reports to the board quarterly on AI progress, there&#8217;s a cross-functional steering committee that meets biweekly, and AI has its own P&amp;L line that survived two budget cycles.</p><p>Sustained sponsorship correlates with a 68% success rate. Without it, success drops to 11%. More than half of AI initiatives lose their executive sponsor within six months, usually because the sponsor treated it as a ribbon-cutting rather than a multi-year commitment. The next article in this series goes deep on what the right sponsor looks like and why most companies get this wrong.</p><h3><strong>2. Strategic Alignment</strong></h3><p>Most AI roadmaps are technology wish lists. They catalog tools the organization wants to buy rather than business problems AI should solve.</p><p>Strategic alignment means linking every AI initiative to one of the company&#8217;s top three to five business priorities. Not to vague goals like &#8220;explore AI&#8221; or &#8220;increase innovation.&#8221; If your CEO&#8217;s top priority is margin expansion in the North American retail division, your first AI initiative should trace a direct line to that outcome. If it doesn&#8217;t, it won&#8217;t survive the first budget review.</p><p>It also means having a use-case prioritization framework that weighs feasibility against impact against data readiness. Without one, organizations pursue whatever the loudest executive asks for or whatever the vendor demo looked most impressive. 
That&#8217;s how you end up with 20 teams building 20 disconnected chatbots.</p><p>A 2.0 here looks like a list of 30 AI &#8220;opportunities&#8221; with no ranking criteria and no connection to business outcomes. A 4.0 looks like five prioritized use cases, each mapped to a business metric, with clear owners and quarterly checkpoints. The difference isn&#8217;t sophistication. It&#8217;s discipline.</p><h3><strong>3. Data Readiness</strong></h3><p>This is where most AI projects actually die. Not in model selection, not in deployment, but in the data underneath. 85% of AI projects fail due to poor data quality or lack of relevant data.</p><p>Data readiness covers three things. First, quality: is the data accurate, complete, and current? Most organizations discover their data is far dirtier than they assumed once they try to feed it to a model. Second, accessibility: is the data centralized on a platform where AI services can reach it, or is it trapped in departmental silos, legacy databases, and someone&#8217;s Excel spreadsheet? Third, governance: are permissions set correctly, is sensitive data classified, and do you know who can access what?</p><p>That third one is where Microsoft Copilot deployments have been a wake-up call. Multiple enterprises discovered that rolling out Copilot exposed over-permissioned SharePoint sites, giving employees AI-surfaced access to documents they were never supposed to see. The technology worked perfectly. The data governance underneath it didn&#8217;t. Those organizations spent months on permission audits before the tool could be safely used.</p><p>Among top-performing organizations, 76% have fully centralized their data. Among everyone else, only 19%. That single gap explains more of the performance difference than any technology choice.</p><h3><strong>4. 
Technology Infrastructure</strong></h3><p>This is the dimension that gets the most attention and probably matters the least, at least in isolation.</p><p>Cloud capacity, a standardized AI/ML platform, an integration layer that connects AI services to existing systems, API management, compute scaling. These are necessary. They are nowhere near sufficient. The platforms are mature. Azure AI Foundry, Copilot Studio, Microsoft Fabric, AWS Bedrock, Google Vertex, and their equivalents provide more than enough technical foundation for any enterprise AI initiative shipping today.</p><p>Technology accounts for roughly 10% of what determines AI success. Yet it&#8217;s where most organizations start and where most of the budget goes. This isn&#8217;t an argument against investing in infrastructure. It&#8217;s an argument against investing in infrastructure <em>first</em>, before you&#8217;ve addressed the seven dimensions that determine whether anyone uses what you build.</p><p>A 2.0 is running AI experiments on individual developers&#8217; laptops with no shared platform. A 4.0 is a standardized enterprise platform with self-service provisioning, monitoring, cost management, and integration connectors to your core systems. Most organizations are somewhere around 3.0 here already, which is fine. The bottleneck is almost never the platform.</p><h3><strong>5. Talent &amp; Skills</strong></h3><p>A one-day workshop on prompt engineering changes nothing. I&#8217;ve watched organizations spend six figures on AI training programs that produce certificates and zero behavior change.</p><p>Talent and skills covers three layers. The first is AI literacy across the organization: does the average employee understand what AI can and can&#8217;t do, and do they feel confident experimenting with it in their daily work? The second is practitioner depth: do you have dedicated AI/ML engineers, data scientists, and platform engineers who can build and maintain AI systems? 
The third is cross-functional fluency: can business analysts, product managers, and domain experts articulate problems in ways that translate to AI solutions?</p><p>Role-based training matters. An executive needs to understand AI&#8217;s strategic implications and governance requirements. A middle manager needs to know how to redesign their team&#8217;s workflows around AI capabilities. A software engineer needs hands-on experience with the tools and frameworks. A finance analyst needs practical training on AI-assisted analysis within their actual tools. One curriculum for all four audiences will fail all four.</p><p>Among top-performing organizations, 75% report AI proficiency across their staff. Among everyone else, 16%. That&#8217;s the widest gap of any dimension in the data, and it&#8217;s not a gap you close with a lunch-and-learn series.</p><h3><strong>6. Process Maturity</strong></h3><p>AI can&#8217;t optimize processes that aren&#8217;t documented, standardized, or measured. This is the dimension that catches organizations off guard.</p><p>Here&#8217;s a test. Pick any core business process in your organization, something like &#8220;how we approve a new vendor&#8221; or &#8220;how we onboard a new customer.&#8221; Now ask three people in three different offices to describe the steps. If you get three different answers, your process maturity score is below 3.0, and AI will amplify the inconsistency rather than fix it.</p><p>The target state is making your implicit operating model explicit and machine-readable. That&#8217;s where AI becomes transformative, when it can read your processes, identify bottlenecks, and suggest or execute improvements. But the starting point is much simpler: are your core business processes documented? Are they standardized across teams and geographies? Are they measured with KPIs that would tell you whether AI is actually improving them?</p><p>If the answer is no, AI deployment will produce anecdotes, not outcomes. 
You&#8217;ll have impressive demos of AI &#8220;optimizing&#8221; a process that doesn&#8217;t exist in any consistent form, and no way to measure whether the optimization made a difference.</p><h3><strong>7. Culture &amp; Change Readiness</strong></h3><p>I&#8217;ve written about the <a href="https://matthewkruczek.ai/blog/ai-adoption-paradox">AI adoption paradox</a> in detail, including the finding that 31% of employees actively sabotage AI initiatives. Culture isn&#8217;t a soft dimension. It&#8217;s the one that determines whether your other seven investments get used or ignored.</p><p>Culture and change readiness encompasses three things. Psychological safety: do employees feel safe experimenting with AI, including producing bad outputs while learning? Change management: does the organization have a plan for how roles, workflows, and expectations will evolve, or is the implicit message &#8220;figure it out&#8221;? And tolerance for failure: when an AI pilot doesn&#8217;t work, does the organization learn from it or kill the program?</p><p>The organizations that achieve the fastest adoption make it opt-in, not mandated. They create environments where early adopters become visible champions, where success stories spread organically, and where people adopt because they see peers getting better outcomes, not because they received a compliance email. Culture change through pull is more durable than culture change through push. But you have to know whether your culture can support that before you design the rollout.</p><p>A 2.0 looks like active resistance: employees avoiding AI tools, managers vocally skeptical in meetings, no change management plan. A 4.0 looks like organic experimentation: employees sharing tips in Slack channels, managers redesigning team workflows around AI, and a formal change management lead coordinating the transition.</p><h3><strong>8. Governance &amp; Ethics</strong></h3><p>Most organizations frame this as a tradeoff: governance or speed. 
That&#8217;s a false choice. The answer isn&#8217;t less governance. It&#8217;s governance designed for velocity.</p><p>This dimension covers four areas. Principles: has the organization documented what responsible AI use looks like, in specific terms, not a vague values statement? Policies: is there an acceptable use policy that employees actually know about, covering what data can go into AI tools, what outputs require human review, and what&#8217;s off limits? Process: is there a review mechanism for new AI applications before they go into production, and can it turn around a decision in days rather than months? And identity: as AI agents start acting on behalf of the organization, who are they, what can they access, and who&#8217;s accountable for what they do?</p><p>That last area is evolving fast. Microsoft&#8217;s Entra Agent ID is one example of where this is heading: verified identity for every AI agent, not just every human user. Your governance framework needs to account for non-human actors that can read, write, and transact.</p><p>Here&#8217;s the reality most leaders haven&#8217;t fully absorbed: 68% of employees are already using AI tools without IT approval. Your people aren&#8217;t waiting for your governance framework. They&#8217;re working around it. The question isn&#8217;t whether you need governance. It&#8217;s whether your governance can move fast enough that people actually use the governed path instead of the shadow one.</p><h2><strong>The multiplier effect</strong></h2><p>Not all eight dimensions carry equal weight. Two of them, Leadership Commitment and Strategic Alignment, function as multipliers on everything else.</p><p>This makes intuitive sense. A brilliant data strategy accomplishes nothing if leadership isn&#8217;t committed to funding it through the 6-12 months it takes to operationalize. A world-class AI platform collects dust if there&#8217;s no strategic alignment on which problems to solve with it. 
Conversely, strong leadership and clear strategic direction amplify every other investment.</p><p>In the scoring model I use, Leadership and Strategic Alignment each carry 1.5x weight. The formula looks like this:</p><blockquote><p><strong>Overall Score = (1.5 x Leadership + 1.5 x Strategic Alignment + Data + Technology + Talent + Process + Culture + Governance) / 9</strong></p><p>Scale: 1.0 to 5.0.</p></blockquote><p>This weighting means an organization scoring 4.0 on leadership and strategy but 3.0 on everything else will outperform an organization scoring 3.0 on leadership and strategy but 4.0 on everything else. The multipliers matter more than the averages.</p><h2><strong>Where industries typically score</strong></h2><p>Here&#8217;s where industries tend to land based on the assessments I&#8217;ve seen and the broader research data:</p><ul><li><p><strong>Financial services: 2.8 to 3.5.</strong> Strongest in governance (3.5-4.5), reflecting decades of regulatory compliance muscle. Weakest in culture (2.0-3.0), where risk aversion that serves them well in banking works against them in AI experimentation.</p></li><li><p><strong>Healthcare: 2.0 to 2.8.</strong> Strongest in governance (2.5-3.5), driven by HIPAA and clinical trial requirements. Weakest in data readiness (1.5-2.5), where fragmented EHR systems and interoperability challenges create a data infrastructure problem that predates AI by decades.</p></li><li><p><strong>Manufacturing: 2.2 to 3.0.</strong> Strongest in process maturity (3.0-4.0), because manufacturing has been documenting and measuring processes since before software existed. Weakest in talent (1.5-2.5), where the AI skills gap intersects with an existing shortage of digital talent on factory floors.</p></li><li><p><strong>Professional services: 2.3 to 3.2.</strong> Strongest in talent (3.0-4.0), because knowledge workers tend to adopt new tools faster. 
Weakest in technology infrastructure (2.0-3.0), where years of underinvestment in platforms create a foundation gap.</p></li></ul><p>Company size matters too. Small companies ($50M-$200M revenue) typically score 1.8-2.5, constrained primarily by talent. Mid-market companies ($200M-$1B) score 2.3-3.2, with data readiness as the most variable dimension. Enterprise companies ($1B+) score 2.8-3.8 but struggle with organizational complexity and change fatigue.</p><p>These are ranges, not rules. But they give you a starting point for honest self-assessment.</p><h2><strong>The blocking rule</strong></h2><p>Here&#8217;s the finding that makes executives most uncomfortable: any single dimension scoring below 3.0 blocks transformation, regardless of how strong the other dimensions look.</p><p>A financial services company with a 4.5 in governance and a 2.0 in culture will stall. Their compliance framework is excellent, but employees won&#8217;t adopt the tools. A technology company with a 4.5 in talent and a 2.0 in data readiness will stall. Their engineers are eager and capable, but there&#8217;s nothing clean to feed the models.</p><p>This is why holistic assessment matters. Organizations naturally invest in their strengths, the dimensions where they already score well, because progress feels easiest there. But the scorecard makes visible a different truth: the highest-ROI investment is always in your weakest dimension, because that&#8217;s the one blocking everything else.</p><p>The pattern among top performers isn&#8217;t one dominant strength. It&#8217;s the absence of critical weaknesses. They score above threshold on every dimension simultaneously.</p><h2><strong>What to do this week</strong></h2><p>You don&#8217;t need a consulting engagement to start. 
You need a whiteboard, 90 minutes, and the willingness to be honest.</p><p><strong>Score each dimension.</strong> Pull together a cross-functional group, not just IT, and rate each of the eight dimensions on a 1-5 scale. Use the descriptions above as rough calibration. Where you find disagreement within the room, you&#8217;ve found a gap in organizational awareness, which is itself a finding.</p><p><strong>Plot the radar chart.</strong> Eight axes, one per dimension. The shape tells you more than the average. A spiky chart (high in some areas, low in others) indicates an organization investing unevenly. A uniformly low chart indicates an organization early in the journey. A chart with one or two deep dips below 3.0 shows you exactly what&#8217;s blocking progress.</p><p><strong>Find your blocker.</strong> Look for any dimension below 3.0. That&#8217;s your first priority, not because it&#8217;s the most exciting work, but because nothing else moves until it does. Leadership and Strategic Alignment below 3.0 are the most urgent because they&#8217;re multipliers on everything else.</p><p><strong>Stop buying tools for six weeks.</strong> This is the uncomfortable recommendation. If you haven&#8217;t done this assessment, pause net-new AI tool purchases until you have. Every tool you buy before understanding your readiness profile is a bet that your organization can absorb it. The data says that&#8217;s a bet you&#8217;ll lose 87% of the time.</p><p>This is the first article in <strong>The AI Readiness Playbook</strong> series. The next eight articles will walk through how to close the gaps this scorecard reveals, starting with executive sponsorship and strategic alignment, through data readiness, governance, skills, engineering enablement, and the messy middle of scaling from pilots to production.</p><p>The scorecard tells you where you are. 
The playbook tells you how to move.</p><div><hr></div><p><em>Matthew Kruczek is Managing Director at EY, leading Microsoft domain initiatives within Digital Engineering. Connect with Matthew on <a href="https://www.linkedin.com/in/matthew-kruczek/">LinkedIn</a> to discuss AI readiness assessment and organizational enablement for your enterprise.</em></p><h2><strong>References</strong></h2><ol><li><p>Cisco. &#8220;AI Readiness Index 2025.&#8221; 8,000 organizations surveyed globally. <a href="https://www.cisco.com/c/m/en_us/solutions/ai/readiness-index.html">cisco.com</a></p></li><li><p>BCG. &#8220;From Potential to Profit: Closing the AI Impact Gap.&#8221; January 2025. <a href="https://www.bcg.com/publications/2025/closing-the-ai-impact-gap">bcg.com</a></p></li><li><p>BCG. &#8220;Enterprise as Code: Operating Model for the AI Era.&#8221; December 2025. <a href="https://www.bcg.com/publications/2025/enterprise-as-code-operating-model-for-ai-era">bcg.com</a></p></li><li><p>Gartner. &#8220;AI Maturity Model and Roadmap Toolkit.&#8221; <a href="https://www.gartner.com/en/chief-information-officer/research/ai-maturity-model-toolkit">gartner.com</a></p></li><li><p>Microsoft. &#8220;Enterprise AI Maturity in Five Steps.&#8221; October 2025. <a href="https://www.microsoft.com/insidetrack/blog/enterprise-ai-maturity-in-five-steps-our-guide-for-it-leaders/">microsoft.com</a></p></li><li><p>Kruczek, M. 
&#8220;The AI Adoption Paradox.&#8221; <a href="https://matthewkruczek.ai/blog/ai-adoption-paradox">matthewkruczek.ai</a></p></li></ol><div><hr></div><p><em>This is Article 1 of 9 in &#8220;The AI Readiness Playbook&#8221; series, a step-by-step methodology for making your organization AI-ready.</em></p>]]></content:encoded></item><item><title><![CDATA[Same Prompt, Different Results. 
Your Agent Harness Is the Multiplier]]></title><description><![CDATA[If you only have a minute, here&#8217;s what you need to know.]]></description><link>https://matthewkruczek.substack.com/p/same-prompt-different-results-your</link><guid isPermaLink="false">https://matthewkruczek.substack.com/p/same-prompt-different-results-your</guid><dc:creator><![CDATA[Matthew Kruczek]]></dc:creator><pubDate>Mon, 30 Mar 2026 17:51:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SIw7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5770cd2f-d594-4c80-b6c5-6e2f84738118_900x530.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>If you only have a minute, here&#8217;s what you need to know.</em></p><ul><li><p>Most engineers run AI coding agents with default settings. They&#8217;re leaving 60-70% of the tool&#8217;s capability on the table.</p></li><li><p>An <strong>agent harness</strong>, the rules, hooks, skills, memory systems, and plugins wrapping your AI agent, determines output quality more than the model itself. Same model, same prompt, dramatically different results.</p></li><li><p>My Claude Code harness uses 16 lifecycle hooks, 28 reusable skills, 5 plugins, 8 MCP server endpoints (5 aggregated through a single gateway), and a cross-project memory graph. Every layer is designed around one principle: <strong>optimize what enters the context window.</strong></p></li><li><p>Token comparisons show a <strong>3-4x efficiency gain</strong> on research-heavy tasks and a <strong>2x reduction in correction cycles</strong> on implementation tasks, compared to default configuration.</p></li><li><p>The context window is the bottleneck for all agentic coding. You don&#8217;t need my exact setup. 
But if you&#8217;re running an AI coding agent without governance rules, memory persistence, and context optimization, you&#8217;re paying for a sports car and driving it in first gear.</p></li></ul><div><hr></div><p>I wrote two weeks ago that <a href="https://matthewkruczek.substack.com/p/agent-harnesses-dont-need-more-layers">agent harnesses need fewer layers, not more</a>. That argument still stands. But &#8220;fewer layers&#8221; doesn&#8217;t mean &#8220;no layers.&#8221; It means <em>the right layers</em>.</p><p>Today I want to show you what that looks like in practice. Not a framework diagram. Not a theoretical architecture. My actual Claude Code configuration, the one I use every day to ship production code, monitor my social media, and manage a portfolio of 30+ projects.</p><p>This isn&#8217;t a tutorial. It&#8217;s an argument: <strong>your agent harness is the single highest-leverage investment you can make in AI-augmented engineering</strong>, and most teams haven&#8217;t even started.</p><h2><strong>Why the context window is everything</strong></h2><p>Before I walk through my setup, I need to make the case for why all of this matters. The answer is three words: <strong>the context window.</strong></p><p>Every AI coding agent operates within a fixed context window. That&#8217;s the token budget holding system instructions, your conversation history, tool outputs, and the model&#8217;s own reasoning. When that window fills up, the agent compacts: it summarizes and discards older context to make room. Every compaction loses conversational state. Every lost state means re-explanation, re-exploration, and re-orientation.</p><p>The default Claude Code experience treats the context window as a dumb pipe. Raw build logs? Dump them in. Five hundred lines of <code>git log</code>? Dump them in. The model&#8217;s own correction attempts after generating code that doesn&#8217;t match your standards? All of it goes in. 
In most sessions, 60-75% of the context window is consumed by information the model doesn&#8217;t need in raw form.</p><p>This is the fundamental problem a harness solves. <strong>Every layer of a production harness exists to optimize what enters the context window.</strong> Front-load the right information, sandbox the noise, and eliminate the correction cycles that waste tokens on work the model should have gotten right the first time.</p><p>The model is powerful. The context window is finite. The harness bridges that gap.</p><h2><strong>What &#8220;default&#8221; actually looks like</strong></h2><p>When you install Claude Code and start a conversation, here&#8217;s what happens:</p><p>The model receives your prompt. It has access to built-in tools: file read, file write, bash execution, web search. It knows nothing about your project conventions, your coding style, your team&#8217;s architectural decisions, or what you worked on yesterday. Every session starts from zero.</p><p>This is like hiring a senior engineer and giving them no onboarding, no documentation, no code review standards, and no access to your team&#8217;s Slack history. They&#8217;re talented. They&#8217;ll produce something. But they&#8217;ll spend half their time asking questions you&#8217;ve already answered and writing code that doesn&#8217;t match your patterns.</p><p>That&#8217;s the default experience. 
And it&#8217;s what 90% of Claude Code users are running right now.</p><p>Here&#8217;s where their context window goes in a typical 2-hour session:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SIw7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5770cd2f-d594-4c80-b6c5-6e2f84738118_900x530.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SIw7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5770cd2f-d594-4c80-b6c5-6e2f84738118_900x530.png 424w, https://substackcdn.com/image/fetch/$s_!SIw7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5770cd2f-d594-4c80-b6c5-6e2f84738118_900x530.png 848w, https://substackcdn.com/image/fetch/$s_!SIw7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5770cd2f-d594-4c80-b6c5-6e2f84738118_900x530.png 1272w, https://substackcdn.com/image/fetch/$s_!SIw7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5770cd2f-d594-4c80-b6c5-6e2f84738118_900x530.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SIw7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5770cd2f-d594-4c80-b6c5-6e2f84738118_900x530.png" width="900" height="530" 
alt=""></picture></div></a></figure></div><p>Less than half the context window is doing real work. The rest is overhead.
That&#8217;s the cost of having no harness.</p><h2><strong>The seven layers of a production harness</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Gl9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8479e35-c55c-42c7-8b04-02fa31465416_1200x630.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Gl9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8479e35-c55c-42c7-8b04-02fa31465416_1200x630.png 424w, https://substackcdn.com/image/fetch/$s_!2Gl9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8479e35-c55c-42c7-8b04-02fa31465416_1200x630.png 848w, https://substackcdn.com/image/fetch/$s_!2Gl9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8479e35-c55c-42c7-8b04-02fa31465416_1200x630.png 1272w, https://substackcdn.com/image/fetch/$s_!2Gl9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8479e35-c55c-42c7-8b04-02fa31465416_1200x630.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2Gl9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8479e35-c55c-42c7-8b04-02fa31465416_1200x630.png" width="1200" height="630" 
alt=""></picture></div></a></figure></div><p>My harness has seven layers. Each one solves a specific problem that default configuration doesn&#8217;t address, and each one is fundamentally about putting the right tokens into the context window while keeping the wrong ones out.</p><h3><strong>Layer 1: Governance rules (7 enforcement files)</strong></h3><p>The <code>~/.claude/rules/</code> directory contains seven markdown files that load into every conversation as system instructions. They encode:</p><ul><li><p><strong>Coding style.</strong> Immutability patterns for TypeScript and C#. Spread operators, never mutate. Records and <code>with</code> expressions. Init-only properties. This isn&#8217;t a preference; it&#8217;s a requirement enforced before the first line of code is written.</p></li><li><p><strong>Security checklist.</strong> No hardcoded secrets.
FluentValidation on all DTOs. EF Core only for SQL (auto-parameterized). JWT validation with <code>ClockSkew=Zero</code>. CSRF protection. Rate limiting on public endpoints. This runs as a pre-commit mental model, not an afterthought.</p></li><li><p><strong>Agent orchestration.</strong> When to auto-spawn specialized subagents. Security-sensitive code triggers a security-reviewer. Build failures trigger a build-error-resolver. These aren&#8217;t optional. They&#8217;re wired into the rules.</p></li><li><p><strong>Testing standards.</strong> 80% coverage minimum. TDD when requested. xUnit with AAA pattern. <code>WebApplicationFactory&lt;Program&gt;</code> for integration tests.</p></li><li><p><strong>Performance guardrails.</strong> Context window management, research time limits, build troubleshooting protocols.</p></li><li><p><strong>Cross-project memory.</strong> When to query Nexus before asking me to re-explain something. When to check for existing decisions instead of making new ones.</p></li><li><p><strong>Common patterns.</strong> Privacy tags, skeleton project discovery, reusable architectural templates.</p></li></ul><p>The cost: approximately 3,000 tokens of system context. The payoff: the model writes code that matches my standards on the first attempt. No correction cycles. No &#8220;actually, we use FluentValidation, not DataAnnotations.&#8221; No &#8220;please don&#8217;t use console.log.&#8221;</p><h3><strong>Layer 2: Lifecycle hooks (16 hooks across 6 phases)</strong></h3><p>This is where a harness becomes autonomous. Claude Code supports hooks at six lifecycle phases: SessionStart, PreToolUse, PostToolUse, PreCompact, PostCompact, and Stop. I use all six.</p><p><strong>SessionStart (3 hooks):</strong></p><ul><li><p>Memory persistence restores my previous session&#8217;s context: what I was working on, what&#8217;s next, what decisions were made.</p></li><li><p><a href="https://github.com/MCKRUZ/Nexus">Nexus</a> syncs at session start. 
Nexus is a tool I built, a local-first knowledge graph that spans all my projects (more on this in Layer 6). It loads architectural decisions and dependency maps from all 30+ projects so the model doesn&#8217;t start cold.</p></li><li><p>Nexus session initialization sets up cross-project tracking for the current session.</p></li></ul><p><strong>PreToolUse (3 hooks):</strong></p><ul><li><p>A skill switchboard routes file edits to the appropriate skill based on file type and project context.</p></li><li><p>A strategic compact advisor monitors context usage and suggests compaction before I hit limits, preventing emergency compactions that lose more state.</p></li><li><p>A push safety gate lists commits before any <code>git push</code>, requiring review.</p></li></ul><p><strong>PostToolUse (3 hooks):</strong></p><ul><li><p>Auto-formatting runs after every edit, eliminating style inconsistency.</p></li><li><p>Observation logging records file paths and patterns to session memory.</p></li><li><p>Nexus records architectural patterns and decisions for cross-project learning.</p></li></ul><p><strong>PreCompact &amp; PostCompact (2 hooks):</strong></p><ul><li><p>Before compaction: update MEMORY.md, save reusable patterns to lesson storage.</p></li><li><p>After compaction: re-orient from MEMORY.md and task list so the conversation continues without interruption.</p></li></ul><p><strong>Stop (5 hooks):</strong></p><ul><li><p>Clean MCP subprocess termination.</p></li><li><p>Autonomy throttle (tracks how often Claude stops to check in; if it stops 3+ times in 5 minutes, it pauses and asks what I need instead of continuing on autopilot).</p></li><li><p>Session end persistence.</p></li><li><p>Nexus post-session analysis.</p></li><li><p>Telemetry export for cost and token tracking.</p></li></ul><p>The critical insight: <strong>these hooks run without my intervention.</strong> I don&#8217;t think about memory persistence or auto-formatting or session telemetry. 
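</p><p>For readers who haven&#8217;t touched hooks yet: they&#8217;re declared in <code>settings.json</code>. A trimmed illustration of the shape (the script paths are placeholders, and my real hook set is larger):</p>

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "~/.claude/hooks/auto-format.sh" }
        ]
      }
    ],
    "Stop": [
      {
        "hooks": [
          { "type": "command", "command": "~/.claude/hooks/persist-session.sh" }
        ]
      }
    ]
  }
}
```

<p>Each entry maps a lifecycle event to a shell command; the matcher scopes it to specific tools.</p><p>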
The harness handles it. Every cognitive cycle I&#8217;m not spending on housekeeping is a cycle I&#8217;m spending on the actual problem.</p><h3><strong>Layer 3: Skills (28 domain expertise modules)</strong></h3><p>This is the layer nobody talks about, and it might be the most underrated.</p><p>Claude Code skills are reusable prompt templates, slash commands that load pre-encoded domain expertise into the conversation on demand. Instead of writing a 500-token ad-hoc prompt explaining what you want, you invoke a skill that loads optimized, battle-tested instructions.</p><p>I maintain 28 skills across two tiers: 18 global skills available in every project, and 10 project-specific skills tailored to individual workflows.</p><p><strong>Engineering skills:</strong></p><ul><li><p><code>/tdd-workflow</code> enforces test-driven development: scaffold interfaces, write tests first, implement to pass, verify coverage. No ambiguity about the TDD process.</p></li><li><p><code>/security-review</code> triggers a comprehensive security audit for auth, payment, or identity code. Covers OWASP Top 10, secrets management, and attack surface analysis.</p></li><li><p><code>/shannon</code> is an autonomous AI pentester. White-box security assessments with real exploit execution. Not theoretical. It finds actual vulnerabilities.</p></li><li><p><code>/llm-cost-optimizer</code> audits model selections across call sites and recommends cheaper alternatives by complexity tier. 
Pays for itself immediately.</p></li><li><p><code>/code-review</code>, <code>/build-fix</code>, <code>/refactor-clean</code> are post-implementation quality gates that catch issues before they compound.</p></li></ul><p><strong>Design and visualization skills:</strong></p><ul><li><p><code>/design</code> covers full brand identity, design tokens, UI styling, and logo generation with 55+ styles.</p></li><li><p><code>/slides</code> creates strategic HTML presentations with Chart.js, design tokens, and contextual slide strategies.</p></li><li><p><code>/visual-explainer</code> generates self-contained HTML pages that visually explain systems, code changes, and data.</p></li><li><p><code>/dashboard-creator</code> builds KPI metric cards, charts, and data visualizations as standalone HTML.</p></li></ul><p><strong>Planning and analysis skills:</strong></p><ul><li><p><code>/deep-plan</code> creates sectionized implementation plans with multi-LLM review and stakeholder interviews.</p></li><li><p><code>/skeptic</code> is one I built myself (<a href="https://github.com/MCKRUZ/skeptic-skill">open source on GitHub</a>). It runs critical analysis that pokes holes in plans, surfaces hidden complexity, and challenges assumptions before I commit to an approach. This one exists because AI agents have an agreeableness problem. You describe an idea and the model tells you it&#8217;s brilliant, then helps you build it. Three days later you realize the approach had an obvious flaw that a skeptical colleague would have caught in five minutes. <code>/skeptic</code> is that colleague. It forces the model to argue <em>against</em> the plan before I greenlight it, and it&#8217;s saved me from more bad architectural decisions than any other single tool in this harness.</p></li><li><p><code>/functional-design</code> handles end-to-end UI/UX creation from functional spec to working code.</p></li></ul><p>The context window angle on skills is easy to miss. 
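</p><p>Mechanically, a skill is just a markdown file with a bit of frontmatter, loaded on demand. A stripped-down sketch (illustrative, not my actual <code>/tdd-workflow</code> file):</p>

```markdown
---
name: tdd-workflow
description: Enforce test-first development for new features
---

When this skill is invoked:

1. Scaffold interfaces and types first.
2. Write failing xUnit tests (AAA pattern) before any implementation.
3. Implement the minimum code needed to make the tests pass.
4. Verify coverage meets the 80% minimum before reporting done.
```

<p>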
Without a skill, I type out detailed instructions every time, and the model still pulls its punches or misses nuance. That means follow-up corrections. With a skill, I type three words and the model loads 2,000 tokens of pre-optimized instructions. Comprehensive and correct from the first pass. Zero follow-up cycles. Skills convert ad-hoc prompting (variable quality, frequent corrections) into encoded expertise (consistent quality, zero corrections).</p><h3><strong>Layer 4: Plugins (5 active)</strong></h3><p>Claude Code&#8217;s plugin ecosystem is young, but the right plugins dramatically extend capability:</p><ul><li><p><strong>context-mode</strong> is the biggest context window optimizer in my stack. Instead of dumping raw command output (build logs, git history, file contents) directly into the context window, context-mode sandboxes execution and indexes the output in a local SQLite FTS5 database. Only summaries enter my conversation. A 500-line <code>git log</code> that would normally consume ~8,000 tokens? Indexed and searchable for a few hundred. A build failure with 200 lines of stack trace? Summarized to the 3 relevant error lines. Over a session, this compounds fast.</p></li><li><p><strong>deep-plan</strong> provides sectionized implementation planning with multi-LLM review. Turns vague feature requests into structured, reviewable plans before a line of code is written.</p></li><li><p><strong>deep-implement</strong> runs TDD-oriented implementation from deep-plan sections. Writes tests first, implements to pass, then reviews.</p></li><li><p><strong>deep-project</strong> decomposes high-level project requirements into scoped planning units.</p></li><li><p><strong>code-simplifier</strong> handles post-implementation cleanup. 
Reviews changed code for reuse opportunities and unnecessary complexity.</p></li></ul><h3><strong>Layer 5: MCP server aggregation (Bifrost)</strong></h3><p>MCP (Model Context Protocol) servers give Claude Code access to external tools, from APIs to documentation indexes to video analytics, through a standardized interface. The problem: each MCP server exposes its own set of tool definitions, and every tool definition consumes context window tokens. Run five MCP servers with 3-5 tools each, and you&#8217;re spending thousands of tokens just on tool schemas before you&#8217;ve asked a single question.</p><p><strong>Bifrost</strong> solves this by acting as an HTTP gateway that aggregates multiple MCP servers behind a single endpoint. It runs on a dedicated machine in my network and proxies requests to backend servers:</p><ul><li><p><strong>context7</strong> provides library documentation and code examples on demand. When Claude needs to reference a framework&#8217;s API, it queries context7 instead of burning tokens on web searches or hallucinating method signatures. This is the one I lean on hardest. Accurate docs in the context window, not guesswork.</p></li><li><p><strong>sequential-thinking</strong> enables structured multi-step reasoning for complex architectural decisions. Forces the model to decompose a problem before committing to an approach, rather than jumping straight to code.</p></li><li><p><strong>github</strong> provides direct GitHub API access for PR management, issue tracking, and repository operations without leaving the conversation.</p></li><li><p><strong>firecrawl</strong> handles web scraping and content extraction. 
When I need to pull in a competitor&#8217;s documentation, an API reference, or a technical article, firecrawl fetches and cleans it without me alt-tabbing to a browser.</p></li><li><p><strong>youtube</strong> provides video analytics and transcript extraction for competitive research and content analysis.</p></li></ul><p>Two additional MCP servers run standalone: <strong>Nexus</strong> (cross-project memory graph) and <strong>CMEM</strong> (session memory). These stay direct because they&#8217;re lightweight, local-only operations that don&#8217;t benefit from gateway aggregation.</p><p>That&#8217;s 8 MCP server endpoints total: 5 through Bifrost, 2 standalone, plus Bifrost itself. From the model&#8217;s perspective, it sees 3 MCP connections. From my perspective, it&#8217;s 3 processes to manage instead of 8.</p><p>The token savings come from two places. First, Bifrost consolidates all five servers behind one HTTP endpoint with one schema negotiation instead of five, reducing tool definition overhead in the context window. Second, and more importantly, these MCP servers return focused results instead of raw output. A context7 documentation query returns the exact function signature, parameters, and a usage example in 200-400 tokens. The alternative? A web search that dumps 3,000-6,000 tokens of HTML, ads, and irrelevant content into the context window. Firecrawl returns clean markdown instead of raw DOM. 
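</p><p>Wiring this up is one config file. An illustrative <code>.mcp.json</code> (the hostname, port, and command here are placeholders, not my actual setup):</p>

```json
{
  "mcpServers": {
    "bifrost": {
      "type": "http",
      "url": "http://bifrost.internal:8080/mcp"
    },
    "nexus": {
      "command": "nexus-mcp",
      "args": ["--local"]
    }
  }
}
```

<p>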
These MCP servers return what the model needs, not everything the source contains.</p><p>The bigger point: <strong>MCP servers are how you give an agent real-world reach without bloating its context.</strong> The harness controls what enters the context window, and Bifrost is the gatekeeper.</p><h3><strong>Layer 6: Cross-project memory (Nexus + CMEM)</strong></h3><p>This is the layer most people don&#8217;t realize they&#8217;re missing.</p><p><strong><a href="https://github.com/MCKRUZ/Nexus">Nexus</a></strong> is a tool I built: a local-first, encrypted knowledge graph that spans all my projects. When I make an architectural decision in Project A, Nexus records it. When I&#8217;m working on Project B and face a similar decision, Nexus surfaces the prior art. It tracks:</p><ul><li><p>Architectural decisions and their rationale</p></li><li><p>Code patterns and where they&#8217;re used</p></li><li><p>Cross-project dependencies and potential conflicts</p></li><li><p>Infrastructure notes (SSH configs, deployment targets, service accounts)</p></li></ul><p><strong>CMEM</strong> (session memory) provides semantic search across past conversations. When I worked on a similar problem three weeks ago, CMEM surfaces the relevant context without me re-explaining it.</p><p>Without cross-project memory, every session starts cold. The model reads files it read yesterday, asks questions I&#8217;ve answered before, and explores architecture it&#8217;s already mapped. With Nexus and CMEM, the model loads synced context at session start and picks up where we left off. The first task of every session starts productive instead of exploratory.</p><h3><strong>Layer 7: Observability (Langfuse + status line)</strong></h3><p>You can&#8217;t optimize what you don&#8217;t measure. My harness exports every session to Langfuse for cost and token tracking. 
A custom status line shows project name, model, context usage percentage, and remaining capacity in real time.</p><p>This might seem like a nice-to-have. It&#8217;s not. When you can see that a research task consumed 45% of your context window on raw command output, you know exactly where to optimize. When you can compare token-per-task costs across sessions, you can make data-driven decisions about which plugins and rules are pulling their weight.</p><p>The status line is also a context window guardian. Seeing &#8220;Context: 72% | ~56K remaining&#8221; in real time changes how you work. You don&#8217;t issue a massive <code>git log --all</code> when you can see you&#8217;re at 70% capacity. You reach for context-mode&#8217;s sandboxed execution instead. Observability turns unconscious context waste into conscious context management.</p><h2><strong>The token math</strong></h2><p>Here&#8217;s what this looks like in practice. These are representative estimates from my Langfuse telemetry across typical coding sessions. 
Individual sessions vary, but the pattern is consistent.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D10P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafde6a4c-42b1-41fd-9618-12afbf78e19b_1140x718.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D10P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafde6a4c-42b1-41fd-9618-12afbf78e19b_1140x718.png 424w, https://substackcdn.com/image/fetch/$s_!D10P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafde6a4c-42b1-41fd-9618-12afbf78e19b_1140x718.png 848w, https://substackcdn.com/image/fetch/$s_!D10P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafde6a4c-42b1-41fd-9618-12afbf78e19b_1140x718.png 1272w, https://substackcdn.com/image/fetch/$s_!D10P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafde6a4c-42b1-41fd-9618-12afbf78e19b_1140x718.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D10P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fafde6a4c-42b1-41fd-9618-12afbf78e19b_1140x718.png" width="1140" height="718" 
alt=""></picture></div></a></figure></div><p>The combined effect: my sessions run at roughly 70-80% productive work ratio compared to the ~45% I see in default configurations. Compactions drop from 4-6 per session to 1-2. The model gets things right on the first attempt instead of the third. And I spend my time thinking about the actual problem instead of re-explaining my project.</p><h2><strong>Same prompt, different worlds</strong></h2><p>Let me make this concrete. Here&#8217;s a real prompt I might give Claude Code:</p><blockquote><p>&#8220;Refactor the authentication middleware to use the Result pattern and add FluentValidation&#8221;</p></blockquote><h3><strong>Without a harness (default Claude Code):</strong></h3><ol><li><p><strong>Claude asks clarifying questions</strong> (500 tokens): &#8220;What framework? What&#8217;s the project structure?
What&#8217;s the Result pattern you&#8217;re using?&#8221;</p></li><li><p><strong>Exploration phase</strong> (20,000 tokens): Reads 10-15 files to understand the codebase, architecture, existing patterns</p></li><li><p><strong>First implementation attempt</strong> (1,500 tokens): Writes code with <code>try/catch</code> and <code>DataAnnotations</code>. Reasonable, but wrong for this codebase.</p></li><li><p><strong>User correction</strong> (800 tokens): &#8220;We use FluentValidation, not DataAnnotations. And we use <code>Result&lt;T&gt;</code> with error discriminated unions.&#8221;</p></li><li><p><strong>Second attempt</strong> (1,500 tokens): Closer, but uses mutable patterns and <code>console.log</code> for debugging</p></li><li><p><strong>User correction</strong> (600 tokens): &#8220;Immutable only. No console.log. Use structured logging.&#8221;</p></li><li><p><strong>Third attempt</strong> (1,500 tokens): Finally correct</p></li><li><p><strong>No security review triggered</strong></p></li><li><p><strong>No tests written</strong></p></li></ol><p><strong>Total: ~26,000 tokens. Three iterations. No tests. No security review.</strong></p><h3><strong>With my harness:</strong></h3><ol><li><p><strong>Rules pre-loaded</strong>: Claude already knows FluentValidation, <code>Result&lt;T&gt;</code>, immutability requirements, structured logging, security standards</p></li><li><p><strong>Nexus syncs</strong> (2,000 tokens): Project architecture, existing middleware patterns, and dependency graph loaded</p></li><li><p><strong>First implementation</strong> (1,500 tokens): Correct patterns from the start. 
Immutable, FluentValidation, Result, structured logging.</p></li><li><p><strong>Security reviewer auto-spawns</strong> (rules detect auth code): Reviews for JWT validation, CSRF, token storage</p></li><li><p><strong>Auto-format runs</strong>: Code style consistent without manual review</p></li><li><p><strong>Build output sandboxed</strong> by context-mode: a few hundred tokens instead of thousands</p></li></ol><p><strong>Total: ~10,000 tokens. One iteration. Security reviewed. Standards enforced.</strong></p><p>That&#8217;s not a 10% improvement. That&#8217;s a fundamentally different relationship with your context window.</p><h2><strong>Context window optimization: the throughline</strong></h2><p>If you&#8217;ve been counting, every layer of this harness optimizes the same scarce resource. That&#8217;s not a coincidence. It&#8217;s the design principle.</p><p><strong>Rules</strong> front-load 3,000 tokens to prevent thousands more in correction cycles. <strong>Hooks</strong> automate state management that would otherwise consume context with manual &#8220;remember what we were doing&#8221; prompts. <strong>Skills</strong> convert variable-quality ad-hoc instructions into consistent, pre-optimized prompts. <strong>context-mode</strong> keeps raw tool output out of the window entirely. <strong>Bifrost</strong> routes queries to MCP servers that return focused results instead of raw HTML. <strong>Memory</strong> eliminates the cold-start exploration that consumes 10-20% of every default session. <strong>Observability</strong> makes all of this measurable so you know what&#8217;s working.</p><p>The context window is the fundamental constraint of agentic coding. Every token spent on overhead is a token not spent on reasoning about your actual problem. Every unnecessary compaction is lost state and degraded continuity. Every correction cycle is the model doing the same work twice.</p><p>A harness doesn&#8217;t make the model faster or smarter. 
It makes the context window deeper. And depth, the ability to sustain complex, multi-step reasoning without losing state, is what separates a coding assistant from a coding agent.</p><h2><strong>This is the new skill gap</strong></h2><p>Here&#8217;s why this matters beyond my personal setup.</p><p>In every previous technology wave, the differentiator was knowledge of the technology itself. Know Java better than the next person, ship faster. Know Kubernetes better, deploy more reliably.</p><p>With AI coding agents, the technology is the same for everyone. We all have access to the same Claude, the same GPT, the same Gemini. The model isn&#8217;t the differentiator.</p><p><strong>The harness is.</strong></p><p>The engineer who spends a week configuring governance rules, building skills for their recurring workflows, and optimizing context consumption will outperform the engineer with default settings for the next two years. Every session. Every task. Compounding daily.</p><p>The gap compounds in both speed and quality. The harnessed engineer gets security reviews triggered automatically. They get consistent code patterns without thinking about it. Their sessions don&#8217;t break every 30 minutes from compaction. Their model has access to documentation through MCP tools instead of hallucinating API signatures from training data.</p><p>This is also why I keep arguing that <a href="https://file+.vscode-resource.vscode-cdn.net/c%3A/Users/kruz7/OneDrive/Documents/Code%20Repos/MCKRUZ/matthewkruczek-ai/blog/drafts/agent-harnesses-fewer-layers.html">the agent harness is the architecture</a>. It&#8217;s not infrastructure you bolt on after the fact. It&#8217;s the primary lever for engineering productivity in an AI-native workflow.</p><h2><strong>What to do this week</strong></h2><p>You don&#8217;t need 16 hooks, 28 skills, and 5 plugins on day one. 
Start with the highest-leverage layers, the ones that save the most context for the least effort:</p><p><strong>Start with rules.</strong> Create <code>~/.claude/rules/coding-style.md</code> and encode your team&#8217;s three most-violated coding standards. This single file will eliminate the most common correction cycles immediately.</p><p><strong>Add a CLAUDE.md to your project.</strong> Document your stack, your patterns, and your conventions. This is the project onboarding document you wish every new hire got on day one, except now your AI agent reads it every session. It&#8217;s the highest-leverage single file in your entire repository.</p><p><strong>Install context-mode.</strong> If you use Claude Code for any research or debugging, this plugin will be the single biggest context window optimizer. Raw command output flooding your context window is the #1 source of unnecessary compactions.</p><p><strong>Build your first skill.</strong> Take the prompt you type most often, the one with specific formatting requirements, voice guidelines, or domain conventions, and turn it into a skill. It takes 30 minutes. It saves correction cycles forever.</p><p><strong>Set up memory persistence.</strong> Even a simple PreCompact hook that prompts the model to update a MEMORY.md file will dramatically improve cross-session continuity. You shouldn&#8217;t have to re-explain your project every conversation.</p><p><strong>Measure everything.</strong> You can&#8217;t optimize a context window you don&#8217;t observe. Start tracking token consumption per task. The patterns will be obvious, and they&#8217;ll tell you exactly which layer to build next.</p><p>The model is the engine. The context window is the fuel tank. 
The harness determines how far you go on every drop.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://matthewkruczek.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://matthewkruczek.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p><div><hr></div><p><em>Matthew Kruczek is Managing Director at EY, leading Microsoft domain initiatives within Digital Engineering. Connect with Matthew on <a href="https://www.linkedin.com/in/matthew-kruczek/">LinkedIn</a> to discuss agent harness architecture and context optimization for your engineering organization.</em></p><h2><strong>References</strong></h2><ol><li><p>Kruczek, M. &#8220;Agent Harnesses Don&#8217;t Need More Layers. They Need Fewer.&#8221; matthewkruczek.ai, March 17, 2026. <a href="https://file+.vscode-resource.vscode-cdn.net/c%3A/Users/kruz7/OneDrive/Documents/Code%20Repos/MCKRUZ/matthewkruczek-ai/blog/drafts/agent-harnesses-fewer-layers.html">matthewkruczek.ai</a></p></li><li><p>Anthropic. &#8220;Claude Code: Hooks.&#8221; Anthropic Documentation, 2026. <a href="https://docs.anthropic.com/en/docs/claude-code/hooks">docs.anthropic.com</a></p></li><li><p>Anthropic. &#8220;Claude Code: CLAUDE.md.&#8221; Anthropic Documentation, 2026. <a href="https://docs.anthropic.com/en/docs/claude-code/claude-md">docs.anthropic.com</a></p></li><li><p>mksglu. &#8220;Context-Mode Plugin for Claude Code.&#8221; GitHub, 2026. <a href="https://github.com/mksglu/context-mode">github.com</a></p></li><li><p>Kruczek, M. &#8220;Progressive Disclosure for MCP Servers.&#8221; matthewkruczek.ai, 2026. <a href="https://file+.vscode-resource.vscode-cdn.net/c%3A/Users/kruz7/OneDrive/Documents/Code%20Repos/MCKRUZ/matthewkruczek-ai/blog/drafts/progressive-disclosure-mcp-servers.html">matthewkruczek.ai</a></p></li><li><p>Kruczek, M. 
&#8220;Context Engineering for Enterprise AI.&#8221; matthewkruczek.ai, 2026. <a href="https://file+.vscode-resource.vscode-cdn.net/c%3A/Users/kruz7/OneDrive/Documents/Code%20Repos/MCKRUZ/matthewkruczek-ai/blog/drafts/context-engineering-enterprise-ai.html">matthewkruczek.ai</a></p></li><li><p>Anthropic. &#8220;Claude Code: Model Context Protocol.&#8221; Anthropic Documentation, 2026. <a href="https://docs.anthropic.com/en/docs/claude-code/mcp">docs.anthropic.com</a></p></li><li><p>Bifrost. &#8220;MCP Gateway for Claude Code.&#8221; GitHub, 2026. <a href="https://github.com/BifrostMCP/bifrost">github.com</a></p></li><li><p>Anthropic. &#8220;Claude Code: Skills.&#8221; Anthropic Documentation, 2026. <a href="https://docs.anthropic.com/en/docs/claude-code/skills">docs.anthropic.com</a></p></li></ol>]]></content:encoded></item><item><title><![CDATA[The Missing Layer: Why Your AI Agents Need a Package Manager]]></title><description><![CDATA[If you only have a minute, here&#8217;s what you need to know.]]></description><link>https://matthewkruczek.substack.com/p/the-missing-layer-why-your-ai-agents</link><guid isPermaLink="false">https://matthewkruczek.substack.com/p/the-missing-layer-why-your-ai-agents</guid><dc:creator><![CDATA[Matthew Kruczek]]></dc:creator><pubDate>Mon, 23 Mar 2026 16:01:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sQdv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cf3634-b98f-4c1e-8f03-057b6b30af5c_1200x624.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>If you only have a minute, here&#8217;s what you need to know.</em></p><ul><li><p>A 5-person team using AI coding agents today has 5 divergent agent configurations. A 50-person team has 50. 
Nobody is managing this.</p></li><li><p>Microsoft&#8217;s open-source APM (Agent Package Manager) treats agent configuration with the same rigor we&#8217;ve applied to code dependencies for decades: declare once in <code>apm.yml</code>, resolve transitively, lock versions, reproduce everywhere.</p></li><li><p>APM manages seven primitives that define how agents work in your codebase: instructions, skills, prompts, agents, hooks, plugins, and MCP servers. It compiles a single manifest into native formats for Copilot, Claude Code, Cursor, and OpenCode.</p></li><li><p>Supply chain security for agent configurations is already a concern. APM includes <code>apm audit</code> for detecting hidden Unicode vulnerabilities and blocks compromised packages before agents access them.</p></li><li><p>Agent configuration is infrastructure, not a personal preference. Organizations that treat it as such will compound their advantages the same way teams with disciplined dependency management outperformed those without.</p></li></ul><div><hr></div><h2><strong>The dependency nobody is managing</strong></h2><p>Here&#8217;s something that should bother every engineering leader deploying AI coding agents: your developers are all using different ones.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://matthewkruczek.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Context Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Not different models. Different <em>configurations</em>. Different instructions. Different skills. Different MCP servers. Different hooks. Different rules about what the agent should and shouldn&#8217;t do.</p><p>One developer has a carefully tuned CLAUDE.md with coding standards, security guardrails, and domain-specific knowledge. The developer sitting next to them has a bare default. A third copied a configuration from a blog post three months ago and hasn&#8217;t updated it since. Your newest hire has nothing at all.</p><p>This is the dependency management problem of 2015, replaying in fast-forward. We&#8217;ve seen this movie before. And Microsoft just shipped an open-source tool that solves it the same way npm and NuGet solved it for code: APM (Agent Package Manager). One manifest. Transitive resolution. Lock files. Reproducible agent configurations across your entire org.</p><p>Before I walk through what APM does and why it matters, it&#8217;s worth understanding the pattern it&#8217;s built on. This isn&#8217;t a new problem. 
It&#8217;s a solved problem being applied to a new domain.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sQdv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cf3634-b98f-4c1e-8f03-057b6b30af5c_1200x624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sQdv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cf3634-b98f-4c1e-8f03-057b6b30af5c_1200x624.png 424w, https://substackcdn.com/image/fetch/$s_!sQdv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cf3634-b98f-4c1e-8f03-057b6b30af5c_1200x624.png 848w, https://substackcdn.com/image/fetch/$s_!sQdv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cf3634-b98f-4c1e-8f03-057b6b30af5c_1200x624.png 1272w, https://substackcdn.com/image/fetch/$s_!sQdv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cf3634-b98f-4c1e-8f03-057b6b30af5c_1200x624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sQdv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cf3634-b98f-4c1e-8f03-057b6b30af5c_1200x624.png" width="1200" height="624" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79cf3634-b98f-4c1e-8f03-057b6b30af5c_1200x624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1053868,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://matthewkruczek.substack.com/i/191880334?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cf3634-b98f-4c1e-8f03-057b6b30af5c_1200x624.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sQdv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cf3634-b98f-4c1e-8f03-057b6b30af5c_1200x624.png 424w, https://substackcdn.com/image/fetch/$s_!sQdv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cf3634-b98f-4c1e-8f03-057b6b30af5c_1200x624.png 848w, https://substackcdn.com/image/fetch/$s_!sQdv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cf3634-b98f-4c1e-8f03-057b6b30af5c_1200x624.png 1272w, https://substackcdn.com/image/fetch/$s_!sQdv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cf3634-b98f-4c1e-8f03-057b6b30af5c_1200x624.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2><strong>We solved this problem 15 years ago (for code)</strong></h2><p>Software engineering went through an identical maturation cycle with code dependencies. The timeline is instructive.</p><p>In the early days, developers managed libraries by hand. You&#8217;d download a JAR file, drop it in a /lib folder, and pray that everyone on the team had the same version. It worked for small teams. It collapsed at scale. Version conflicts. Missing transitive dependencies. The infamous &#8220;works on my machine&#8221; problem that consumed entire sprints.</p><p>Then package managers arrived. npm for JavaScript. pip for Python. Maven for Java. NuGet for .NET. Cargo for Rust. Each brought the same core innovation: declare your dependencies in a manifest, resolve them automatically, lock versions for reproducibility. package.json. 
requirements.txt. Cargo.toml. Different ecosystems, identical principle.</p><p>The effect wasn&#8217;t incremental. It was foundational. Package managers didn&#8217;t just save time. They made entire categories of problems disappear. They enabled open-source ecosystems to scale from dozens of libraries to millions. They made CI/CD pipelines possible by guaranteeing build reproducibility. They turned &#8220;dependency management&#8221; from a skill into a solved problem.</p><p>AI agent configuration is sitting right where code dependencies sat before package managers arrived. Manual setup. Configuration drift. No version pinning. No transitive resolution. No reproducibility guarantees. No security scanning.</p><p>The only difference is the speed. Code dependency management took a decade to mature. Agent configuration needs to get there in months, because the adoption curve is steeper and the blast radius of a misconfigured agent is larger than a mismatched library version.</p><h2><strong>Enter APM</strong></h2><p>Microsoft&#8217;s Agent Package Manager is an open-source project (MIT licensed, 710+ stars, 33 releases as of March 2026) that applies the package manager pattern to AI agent configuration. The creator, Daniel Meppiel, a Software Global Black Belt at Microsoft, built it on a straightforward premise: if package.json works for code dependencies, apm.yml should work for agent dependencies.</p><p>The comparison is more than metaphorical. APM implements the same lifecycle that made code package managers transformative:</p><p><strong>Declare.</strong> Your project&#8217;s apm.yml specifies what your agents need. Not how to configure them manually, but what packages of configuration to install.</p><pre><code><code>name: my-enterprise-project
version: 1.0.0
dependencies:
  apm:
    - org/coding-standards
    - org/security-guardrails
    - org/api-review-skill#v2.1
    - anthropics/skills/skills/frontend-design
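    # "#v2.1" pins that package to a release; `apm install` resolves these
    # entries (plus transitive dependencies), and apm.lock.yaml records the
    # exact commit for every one of them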
</code></code></pre><p><strong>Resolve.</strong> <code>apm install</code> pulls packages from any git host (GitHub, GitLab, Bitbucket, Azure DevOps, GitHub Enterprise), resolves transitive dependencies, and places files in the correct directories.</p><p><strong>Lock.</strong> apm.lock.yaml pins every dependency to an exact commit. The same lock file produces the same agent configuration on every machine, every time.</p><p><strong>Compile.</strong> A single manifest generates native configuration files for multiple tools. AGENTS.md for Copilot. CLAUDE.md for Claude Code. .cursor/rules/ for Cursor. One source of truth, many deployment targets.</p><p><strong>Distribute.</strong> <code>apm pack</code> bundles resolved dependencies into portable artifacts that work without APM, Python, or network access. This covers CI pipelines, air-gapped environments, and dev container setups.</p><p>This isn&#8217;t a prototype. APM is at v0.8.4 with 18 contributors, 772 commits, and a GitHub Action for CI integration. It&#8217;s built on established open standards: the AGENTS.md specification (a Linux Foundation AAIF founding project alongside MCP), the Agent Skills framework, and the Model Context Protocol.</p><h2><strong>Seven primitives, one manifest</strong></h2><p>APM manages seven types of agent configuration that together define how AI coding agents operate in your codebase:</p><p><strong>Instructions</strong> are coding standards and guardrails scoped to file patterns. A Python standards instruction applies only to <code>**/*.py</code> files. A security policy applies everywhere. These replace the ad-hoc &#8220;paste your coding standards into the system prompt&#8221; approach that most teams use today.</p><p><strong>Skills</strong> are reusable AI capabilities: code reviewers, form builders, deployment assistants, security scanners. If you&#8217;ve read my previous article on Agent Skills as the missing link, this is the distribution mechanism that was absent. 
APM makes skills installable, composable, and version-pinned instead of manually copied between repositories.</p><p><strong>Prompts</strong> are slash commands for common tasks. /review, /deploy, /test. Standardized across the team instead of each developer inventing their own.</p><p><strong>Agents</strong> are specialized personas with defined roles, tools, and behavioral boundaries. Your security review agent. Your architecture advisor. Your documentation generator.</p><p><strong>Hooks</strong> are lifecycle event handlers that fire on specific triggers: before a commit, after a file edit, on session start. These automate quality gates and organizational workflows.</p><p><strong>Plugins</strong> are pre-packaged bundles that combine multiple primitives into a single installable unit. Think of them as meta-packages.</p><p><strong>MCP Servers</strong> are external tool integrations that give agents access to databases, APIs, file systems, and other services through the Model Context Protocol.</p><p>Without APM, configuring all seven of these for a new developer on your team is a multi-hour manual process that nobody documents and everyone does differently. With APM, it&#8217;s <code>git clone &amp;&amp; apm install</code>.</p><h2><strong>The problem that&#8217;s hiding in plain sight</strong></h2><p>Most organizations don&#8217;t realize they have an agent configuration problem because they haven&#8217;t tried to standardize yet. The moment you do, the scale of the issue becomes obvious.</p><p>Imagine a financial services firm with 200 developers using AI coding agents. Some teams use Claude Code. Others use Copilot. A few use Cursor. Each team has its own coding standards, security requirements, and compliance guardrails. Some teams have encoded these into their agent configurations. Most haven&#8217;t.</p><p>Without a dependency manager, you&#8217;re left with two options. 
Option one: a wiki page titled &#8220;How to Set Up Your AI Agent&#8221; that&#8217;s perpetually out of date and that new hires follow maybe half of. Option two: a senior engineer spends an afternoon with each new team member walking them through the setup, which means the configuration is as good as that engineer&#8217;s memory on that particular day.</p><p>Neither option scales. Neither option is auditable. Neither option gives you confidence that the agent writing code for your regulated application has the same security guardrails as the agent that passed your last compliance review.</p><p>APM&#8217;s lock file changes this equation. apm.lock.yaml is a diffable, version-controlled artifact that proves every developer on your team is running the same agent configuration. When your compliance team asks &#8220;how do we know the AI agents are following our security standards?&#8221;, you show them the lock file.</p><h2><strong>Supply chain security for agent configuration</strong></h2><p>Here&#8217;s where it gets serious. AI agent configurations aren&#8217;t just preferences. They&#8217;re instructions that shape what code gets written, what security patterns get followed, and what guardrails get applied. A compromised agent configuration is a supply chain attack on your entire codebase.</p><p>APM includes <code>apm audit</code>, which scans packages for hidden Unicode vulnerabilities and blocks compromised configurations before they reach your agents. This is the agent equivalent of npm audit or GitHub&#8217;s Dependabot alerts. 
It&#8217;s table stakes for any organization that takes software supply chain security seriously.</p><p>CIO.com recently identified AI agent versioning as &#8220;the CIO&#8217;s next big challenge,&#8221; noting that &#8220;a minor API update could significantly alter an agent&#8217;s behavior, even if the agent&#8217;s core logic remains unchanged.&#8221; The article recommended treating agent versioning as &#8220;a first-class discipline&#8221; alongside traditional software release management.</p><p>APM makes that recommendation actionable. Version pinning. Lock files. Auditing. Reproducible builds. These aren&#8217;t new concepts. They&#8217;re proven infrastructure patterns applied to a new domain.</p><h2><strong>Where this fits in the agent-first enterprise</strong></h2><p>Throughout this series, I&#8217;ve argued that the organizations seeing transformational results with AI aren&#8217;t just using agent tools. They&#8217;re building infrastructure around agent capabilities.</p><p>Skills gave us composable, portable organizational knowledge. MCP gave us a standard protocol for connecting agents to external tools. Progressive disclosure solved the context window scaling problem.</p><p>APM is the layer that was missing between all of these. It&#8217;s the composition and governance infrastructure that answers questions the other standards don&#8217;t address:</p><ul><li><p>Which skills should this project use? (Declared in apm.yml)</p></li><li><p>What version of those skills? (Pinned in apm.lock.yaml)</p></li><li><p>Are all developers using the same setup? (Reproducible via <code>apm install</code>)</p></li><li><p>Are the configurations safe? (Verified via <code>apm audit</code>)</p></li><li><p>How do we distribute this to new team members? 
(Automated via <code>apm pack</code>)</p></li></ul><p>The progression looks like this: first you adopt agents, then you customize them with skills and MCP servers, then you realize you need a way to manage and distribute those customizations consistently. APM is what you reach for at that third step.</p><h2><strong>What to do this week</strong></h2><p><strong>If you&#8217;re already using AI coding agents in your organization:</strong></p><p>Start by auditing what you have. How many different agent configurations exist across your teams? Are coding standards encoded in those configurations? Security guardrails? If you can&#8217;t answer these questions, you&#8217;ve confirmed the problem.</p><p>Install APM. It takes less than a minute. On Windows: <code>irm https://aka.ms/apm-windows | iex</code>. On Mac/Linux: <code>curl -sSL https://aka.ms/apm-unix | sh</code>. Run <code>apm init</code> in a project to see the manifest structure.</p><p><strong>This month:</strong></p><p>Pick one team as a pilot. Have them declare their existing agent configuration in apm.yml, including their coding standards, security rules, and any skills or MCP servers they&#8217;ve configured manually. Install from a shared git repository so the entire team gets the same setup.</p><p>Track two things: time saved on new developer onboarding, and the number of configuration inconsistencies you discover during the migration. Both numbers will surprise you.</p><p><strong>This quarter:</strong></p><p>Build an organizational package library. Your coding standards. Your security guardrails. Your compliance requirements. Your domain-specific skills. Publish them to internal git repositories that any team can install from. Establish a review process for changes to these packages, the same way you review changes to shared libraries.</p><p>Integrate <code>apm audit</code> into your CI pipeline. 
Make agent configuration a first-class artifact alongside your code.</p><h2><strong>The infrastructure maturity curve</strong></h2><p>Every technology goes through the same maturation arc. First comes capability: the thing works. Then adoption: people start using it. Then chaos: everyone uses it differently. Then infrastructure: tooling emerges to manage the chaos. Then standardization: the tooling becomes an expected part of the stack.</p><p>AI coding agents are somewhere between chaos and infrastructure. The capability is proven. Adoption is accelerating (96% of enterprise IT leaders plan to expand AI agent use in the next 12 months). The chaos is real but largely unacknowledged. The infrastructure is arriving.</p><p>APM is one of the first serious attempts to move agent configuration from artisanal to industrial. It won&#8217;t be the last. But it&#8217;s built on the right foundations (open standards, proven patterns, multi-tool support), and it&#8217;s solving a problem that gets worse with every developer you add to your team.</p><p>The organizations that figure out agent infrastructure early will compound their advantages over those that don&#8217;t. That&#8217;s not speculation. It&#8217;s exactly what happened with code dependency management, with CI/CD, with containerization, with every infrastructure layer that made software engineering more reproducible and reliable.</p><p>Agent configuration is infrastructure. Treat it accordingly.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://matthewkruczek.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Context Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[You're Measuring Agentic AI Wrong: The Three-Layer Framework Leaders Actually Need]]></title><description><![CDATA[If you only have a minute, here&#8217;s what you need to know.]]></description><link>https://matthewkruczek.substack.com/p/youre-measuring-agentic-ai-wrong</link><guid isPermaLink="false">https://matthewkruczek.substack.com/p/youre-measuring-agentic-ai-wrong</guid><dc:creator><![CDATA[Matthew Kruczek]]></dc:creator><pubDate>Thu, 19 Mar 2026 13:28:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_MeP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb835ac0-d1f7-4e4a-a5e3-bab8e8b4c948_1200x624.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>If you only have a minute, here&#8217;s what you need to know.</em></p><ul><li><p>Agentic AI doesn&#8217;t just automate tasks. It makes decisions within processes. Measuring it with the same KPIs you used for RPA will give you numbers that move without telling you anything useful.</p></li><li><p>Most enterprises are sitting at Layer 1: operational metrics like FTE saved and cycle time reduction. These matter, but they&#8217;re the floor, not the ceiling.</p></li><li><p>Layer 2 metrics measure decision quality: how often the agent made the right call, how often humans overrode it, and where confidence broke down. 
This is where most organizations have a measurement gap today.</p></li><li><p>Layer 3 metrics measure adaptive capacity: whether the agent is expanding what the process can do, not just running the existing process faster. This is where transformational value lives.</p></li><li><p>The biggest measurement mistake leaders make is declaring success when Layer 1 numbers improve, before checking whether Layers 2 and 3 are even being tracked.</p></li></ul><p>Nearly every leader I talk to right now has some version of the same conversation. Their team has introduced an AI agent into a core process. Something is clearly happening. The numbers are moving. But when the business asks &#8220;what&#8217;s the impact?&#8221;, the answer sounds thinner than the experience feels.</p><p>That gap between what&#8217;s happening and what the metrics are capturing is not an accident. It&#8217;s structural.</p><p>Agentic AI is not automation. Traditional automation executes a predetermined sequence of steps. An agent evaluates context, makes judgment calls, handles exceptions, and adapts to conditions the original process designer never anticipated.
When you put an agent into a business process, you are not installing a faster conveyor belt. You are introducing a new decision-maker.</p><p>Most enterprises are still measuring the conveyor belt.</p><h2><strong>The problem with your current dashboard</strong></h2><p>When organizations first deploy AI agents, the metrics that surface naturally are the ones already on their dashboards: cycle time, task volume, FTE hours, cost per transaction, error rate. These numbers are real and they matter. But they were designed to measure deterministic processes, where the only question is how fast and how reliably the sequence runs.</p><p>Agents introduce non-determinism. They don&#8217;t always take the same path. They encounter novel situations and handle them in ways the process never specified. They can escalate appropriately, fail silently, or make calls that are locally correct but strategically wrong.</p><p>None of that is visible in your existing KPIs.</p><p>The organizations that are ahead on this have built a three-layer measurement framework. 
Each layer answers a different question.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_MeP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb835ac0-d1f7-4e4a-a5e3-bab8e8b4c948_1200x624.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_MeP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb835ac0-d1f7-4e4a-a5e3-bab8e8b4c948_1200x624.png 424w, https://substackcdn.com/image/fetch/$s_!_MeP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb835ac0-d1f7-4e4a-a5e3-bab8e8b4c948_1200x624.png 848w, https://substackcdn.com/image/fetch/$s_!_MeP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb835ac0-d1f7-4e4a-a5e3-bab8e8b4c948_1200x624.png 1272w, https://substackcdn.com/image/fetch/$s_!_MeP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb835ac0-d1f7-4e4a-a5e3-bab8e8b4c948_1200x624.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_MeP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb835ac0-d1f7-4e4a-a5e3-bab8e8b4c948_1200x624.png" width="1200" height="624" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/db835ac0-d1f7-4e4a-a5e3-bab8e8b4c948_1200x624.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:408985,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://matthewkruczek.substack.com/i/191470290?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb835ac0-d1f7-4e4a-a5e3-bab8e8b4c948_1200x624.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_MeP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb835ac0-d1f7-4e4a-a5e3-bab8e8b4c948_1200x624.png 424w, https://substackcdn.com/image/fetch/$s_!_MeP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb835ac0-d1f7-4e4a-a5e3-bab8e8b4c948_1200x624.png 848w, https://substackcdn.com/image/fetch/$s_!_MeP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb835ac0-d1f7-4e4a-a5e3-bab8e8b4c948_1200x624.png 1272w, https://substackcdn.com/image/fetch/$s_!_MeP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb835ac0-d1f7-4e4a-a5e3-bab8e8b4c948_1200x624.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>Layer 1: Operational metrics (the efficiency floor)</strong></h2><p>These are the metrics every organization starts with, and they are necessary.</p><ul><li><p>Tasks completed per agent per hour</p></li><li><p>Cycle time reduction versus the baseline measured before deployment</p></li><li><p>Cost per transaction with and without agent involvement</p></li><li><p>Volume handled without human escalation</p></li><li><p>Error and exception rate</p></li></ul><p>Layer 1 answers the question: is the agent doing the work? For any deployment to justify itself, these numbers need to be positive. Organizations seeing 60&#8211;80% reductions in cycle time are tracking Layer 1 metrics well.</p><p>The failure mode is stopping here. Layer 1 tells you the agent is running.
It does not tell you the agent is running well.</p><h2><strong>Layer 2: Decisional metrics (the quality layer)</strong></h2><p>This is where most organizations have an active measurement gap, and where the most important diagnostic information lives.</p><p>Agents make decisions. Those decisions have quality. Quality can be measured.</p><p><strong>Human override rate.</strong> When a human reviews an agent&#8217;s output, how often do they change it? A high override rate on routine tasks signals miscalibration. A low override rate on genuinely complex tasks signals overconfidence &#8212; often more dangerous than the first problem.</p><p><strong>Confidence threshold distribution.</strong> Well-designed agents signal uncertainty. Track how often your agent is operating at high, medium, and low confidence, and whether those self-assessments correlate with actual accuracy. An agent that reports high confidence but triggers frequent corrections needs retraining or rescoping.</p><p><strong>Exception escalation precision.</strong> When the agent escalates to a human, is the escalation justified? Track both the rate and the appropriateness. Agents that over-escalate are expensive. Agents that under-escalate are dangerous.</p><p><strong>Decision reversibility lag.</strong> How often is an agent decision reversed after the fact, and how much time passes before the reversal? Irreversible decisions made incorrectly compound before they surface. This metric is particularly important in financial, compliance, and customer-facing processes.</p><p><strong>Novel situation handling rate.</strong> What percentage of tasks fall outside the agent&#8217;s training distribution? This tells you something important about whether the deployment scope is well-matched to the agent&#8217;s actual capabilities.</p><p>These metrics require intentional instrumentation. They will not appear in your process management tool by default. 
Building the logging and evaluation layer to capture them is non-trivial work &#8212; and it is exactly the work most organizations skip because Layer 1 numbers look acceptable.</p><h2><strong>Layer 3: Adaptive capacity metrics (the transformation ceiling)</strong></h2><p>This layer measures something qualitatively different from the other two. Not whether the agent is running, and not how well it&#8217;s making decisions within the existing process. Instead: whether the process itself is expanding because the agent exists.</p><p>This is the measurement of transformational value. It is also the hardest to quantify, because it requires a counterfactual. You are measuring what is now possible that was not possible before.</p><p><strong>New capability acquisition rate.</strong> How quickly can you extend the agent to handle adjacent task types? An agent that required six weeks of development to add a new task type in its first quarter, but only two weeks by its fourth quarter, is compounding capability. One that remains at six weeks is not.</p><p><strong>Human attention quality shift.</strong> Are the humans who work alongside the agent spending more of their time on genuinely high-judgment work? Track what your people are actually doing now versus what they did before. If agent deployment simply freed them up for more of the same work, Layer 3 value is not materializing. If it redirected their attention toward decisions that actually require human judgment, it is.</p><p><strong>Process boundary expansion.</strong> Has the agent enabled the organization to take on scope that would have been infeasible before? Agentic AI&#8217;s most significant impact in mature deployments is not doing the same process faster. 
It is doing a different, more ambitious version of the process that was previously impractical at scale.</p><p><strong>Time-to-value on new process introductions.</strong> As you add new processes to your agent environment, how long does the ramp from introduction to operational stability take? Organizations where this number is declining have built genuine organizational capability. Those where it stays flat are running deployments, not building systems.</p><h2><strong>What this looks like in practice</strong></h2><p>Consider a representative scenario: a global financial services firm deploys an agent to handle initial client inquiry triage. Six months in, Layer 1 metrics look excellent. Inquiry cycle time is down 65%. Volume handled without escalation is up significantly.</p><p>But Layer 2 reveals a problem the leadership team hadn&#8217;t seen. The human override rate on medium-complexity inquiries is 34% &#8212; far above the 10&#8211;12% the team had assumed. And the override rate is higher on cases the agent rates as high confidence than on cases it flags as uncertain. The agent is most wrong when it thinks it&#8217;s most right.</p><p>Without Layer 2 metrics, this organization would have declared the deployment a success. With them, they have a clear retraining target, a scope adjustment to consider, and a monitoring requirement to build.</p><p>Layer 3 metrics tell a different story. The same organization discovers that because agent triage is now handling volume that previously required four full-time analysts, those analysts are available to work on relationship-intensive activities the team never had capacity for before. A new capability has emerged. That value was always latent in the process. 
The agent made it accessible.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BBbb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc881ffe7-f8e3-4825-8760-62d38faac687_1200x560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BBbb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc881ffe7-f8e3-4825-8760-62d38faac687_1200x560.png 424w, https://substackcdn.com/image/fetch/$s_!BBbb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc881ffe7-f8e3-4825-8760-62d38faac687_1200x560.png 848w, https://substackcdn.com/image/fetch/$s_!BBbb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc881ffe7-f8e3-4825-8760-62d38faac687_1200x560.png 1272w, https://substackcdn.com/image/fetch/$s_!BBbb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc881ffe7-f8e3-4825-8760-62d38faac687_1200x560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BBbb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc881ffe7-f8e3-4825-8760-62d38faac687_1200x560.png" width="1200" height="560" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c881ffe7-f8e3-4825-8760-62d38faac687_1200x560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:560,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53132,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://matthewkruczek.substack.com/i/191470290?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc881ffe7-f8e3-4825-8760-62d38faac687_1200x560.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BBbb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc881ffe7-f8e3-4825-8760-62d38faac687_1200x560.png 424w, https://substackcdn.com/image/fetch/$s_!BBbb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc881ffe7-f8e3-4825-8760-62d38faac687_1200x560.png 848w, https://substackcdn.com/image/fetch/$s_!BBbb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc881ffe7-f8e3-4825-8760-62d38faac687_1200x560.png 1272w, https://substackcdn.com/image/fetch/$s_!BBbb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc881ffe7-f8e3-4825-8760-62d38faac687_1200x560.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>What to do this week</strong></h2><p><strong>Audit your current measurement approach.</strong> Which layer are you sitting in? If your entire agentic AI measurement program is Layer 1 metrics, you have a blind spot problem regardless of how good the numbers look.</p><p><strong>Build the Layer 2 instrumentation.</strong> The human override rate, escalation precision, and confidence calibration metrics require logging at the agent decision level. If your current deployment does not produce this data, that is your first engineering priority.</p><p><strong>Define your Layer 3 baseline.</strong> Before you can measure what new capabilities the agent creates, you need a documented picture of what the process could and could not do before deployment. This does not require sophisticated tooling.
It requires a clear-eyed audit of process scope and capacity constraints.</p><p><strong>Don&#8217;t wait for the framework to be complete before sharing it.</strong> Your leadership team is asking this question now. A two-slide summary of the three-layer framework, paired with honest assessment of which layer you are currently measuring and which you are not, is more useful than a fully instrumented measurement system that arrives in six months.</p><p>The organizations that will be ahead on agentic AI measurement are not the ones with the most sophisticated dashboards. They are the ones that correctly understood what they were measuring in the first place.</p>]]></content:encoded></item><item><title><![CDATA[Your Engineering Team Is Speaking a New Language: An Executive's Plain-English Guide to AI Development]]></title><description><![CDATA[If you only have a minute, here&#8217;s what you need to know.]]></description><link>https://matthewkruczek.substack.com/p/your-engineering-team-is-speaking</link><guid isPermaLink="false">https://matthewkruczek.substack.com/p/your-engineering-team-is-speaking</guid><dc:creator><![CDATA[Matthew Kruczek]]></dc:creator><pubDate>Tue, 17 Mar 2026 18:29:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2uef!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21b6f02-af7e-46c8-904b-ea9b46e36d31_2100x1107.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>If you only have a minute, here&#8217;s what you need to know.</em></p><ul><li><p>AI development has its own vocabulary, and most of it was never designed for business leaders. If you can&#8217;t follow the conversation, you can&#8217;t make informed investment decisions.</p></li><li><p>This article translates 24 essential AI terms into plain English, organized by how they relate to business decisions you&#8217;re already making.</p></li><li><p>You don&#8217;t need to become technical. You need to understand enough to ask the right questions and recognize when a vendor pitch is substance versus hand-waving.</p></li><li><p>Bookmark the glossary. Share it with your leadership team.
Refer to it before your next AI briefing.</p></li></ul><p>I was explaining our AI development approach to a senior executive last week when he stopped me mid-sentence. &#8220;I don&#8217;t know what an MD file is,&#8221; he said. &#8220;What&#8217;s a skill? When you say &#8216;context,&#8217; what are you talking about?&#8221;</p><p>He wasn&#8217;t being difficult. He was being honest. And that honesty revealed a problem hiding in plain sight across every enterprise AI initiative I&#8217;ve been part of.</p><p>The people making multi-million-dollar decisions about AI strategy often can&#8217;t follow the conversations where those strategies get shaped. Not because they lack intelligence. Because the AI industry has built its own dialect, and nobody wrote a translation guide.</p><p>This is that guide.</p><h2>The vocabulary gap is a strategy gap</h2><p>A 2025 survey by Lucidworks found that 72% of C-suite executives feel confident about their organization&#8217;s AI strategy. But when researchers tested those same executives on basic AI concepts, the results told a different story.
Most couldn&#8217;t accurately explain how the tools they&#8217;d approved actually work.</p><p>This isn&#8217;t an academic problem. When you can&#8217;t distinguish between a large language model and a fine-tuned model, you can&#8217;t evaluate whether your team&#8217;s proposed approach is sound. When you don&#8217;t know what a context window is, you can&#8217;t understand why your AI assistant sometimes forgets what you told it five minutes ago. When &#8220;agent&#8221; means nothing more specific than &#8220;smart bot,&#8221; you&#8217;ll overpay for simple automation dressed up in agentic packaging.</p><p>The vocabulary gap isn&#8217;t about buzzwords. It&#8217;s about decision quality.</p><h2>How AI-assisted development actually works</h2><p>Before we get to the glossary, here&#8217;s the 60-second version of how AI fits into software development today.</p><p>Your engineering teams used to write every line of code by hand. Then autocomplete tools started suggesting the next few characters. Now, AI can write entire functions, review code for security problems, generate test cases, and even architect solutions based on a description of what you want built.</p><p>Think of it like the evolution of manufacturing. Artisans making everything by hand gave way to power tools, then assembly lines, then robotic automation. The humans didn&#8217;t disappear. Their role shifted from manual labor to design, oversight, and quality control.</p><p>That&#8217;s exactly what&#8217;s happening in software development. Your best engineers are becoming orchestrators who direct AI tools rather than writing every line themselves. The ones who adapt are delivering work 60-80% faster. The ones who don&#8217;t are writing code the same way they did five years ago.</p><p>Your job as a leader isn&#8217;t to understand the technical details. 
It&#8217;s to understand enough to recognize where your organization sits on that spectrum, and what it takes to move forward.</p><h2>The glossary: 24 terms you&#8217;ll actually hear</h2><p>I&#8217;ve organized these into four categories that map to how you&#8217;ll encounter them in practice: the foundational concepts, the development terms, the enterprise and strategy layer, and the quality and safety terms.</p><h3><strong>The foundation: How AI thinks</strong></h3><p><strong>Artificial Intelligence (AI)</strong></p><p>Software that performs tasks typically requiring human judgment. In business context today, this almost always means software that understands and generates language, images, or code. When your teams say &#8220;AI,&#8221; they usually mean the next term.</p><p><strong>Large Language Model (LLM)</strong></p><p>The engine behind ChatGPT, Claude, and Copilot. An LLM is a program trained on enormous amounts of text that can understand questions, generate responses, write code, and analyze documents. Think of it as a very well-read colleague who has studied millions of books and conversations but has never actually worked at your company.</p><p><strong>Generative AI (GenAI)</strong></p><p>AI that creates new content: text, code, images, presentations. This is the umbrella term for the current wave of AI tools. Every time someone asks ChatGPT to draft an email, they&#8217;re using generative AI.</p><p><strong>Token</strong></p><p>The unit of measurement for AI text processing. Roughly one token equals one word, though some words get split into multiple tokens. Why this matters to you: AI services charge by token count, and every model has a maximum number of tokens it can handle at once. When your team talks about cost optimization, they&#8217;re often talking about reducing token usage.</p><p><strong>Context window</strong></p><p>The amount of information an AI can hold in its &#8220;working memory&#8221; during a single conversation. 
Think of it as the AI&#8217;s desk. A bigger desk lets it spread out more documents and reference more material. When someone says a model has a &#8220;200K context window,&#8221; they mean it can work with roughly 200,000 tokens at once, which is on the order of 150,000 words. When the desk fills up, older information falls off the edge.</p><p><strong>Prompt</strong></p><p>The instruction you give an AI. &#8220;Write me a summary of this report&#8221; is a prompt. &#8220;Review this code for security issues&#8221; is a prompt. The quality of the output depends heavily on the quality of the prompt, which is why &#8220;prompt engineering&#8221; became a discipline.</p><p><strong>Inference</strong></p><p>When an AI generates a response, that process is called inference. This is the operational cost of running AI. Model training happens once (and costs millions). Inference happens every time someone asks the model a question (and costs fractions of a cent). Your cloud bills for AI are mostly inference costs.</p><p><strong>Hallucination</strong></p><p>When an AI generates information that sounds correct but is factually wrong. The model isn&#8217;t lying. It&#8217;s filling in gaps with statistically plausible text. This is why human review remains essential for anything consequential. If someone tells you their AI &#8220;never hallucinates,&#8221; they&#8217;re either confused or selling you something.</p><h3><strong>The development layer: How teams build with AI</strong></h3><p><strong>Copilot</strong></p><p>Microsoft&#8217;s branding for AI assistants embedded in their products (GitHub Copilot for code, Microsoft 365 Copilot for office work). The term has become semi-generic, the way &#8220;Kleenex&#8221; stands for tissue. When your teams say &#8220;copilot,&#8221; they mean an AI assistant that works alongside them inside the tools they already use.</p><p><strong>Agent / AI Agent</strong></p><p>An AI that can take actions on its own, not just answer questions. A chatbot waits for you to ask something.
An agent can browse the web, call APIs, modify files, run code, and chain multiple steps together to complete a task. Where a chatbot is a librarian who finds books for you, an agent is an assistant who researches, summarizes, and drafts the report.</p><p><strong>Multi-agent</strong></p><p>Multiple specialized AI agents working together on a task. One agent handles requirements analysis. Another writes code. A third reviews it for security issues. A fourth runs tests. This mirrors how human teams divide labor, except agents can run in parallel and hand off work without scheduling meetings.</p><p><strong>Skill</strong></p><p>A packaged set of instructions that teaches an AI agent how to perform a specific task the way your organization does it. Without skills, an agent writes code in a generic style. With your company&#8217;s skill loaded, it follows your naming conventions, uses your preferred frameworks, and applies your security standards. Skills are how you turn a general-purpose AI into one that understands &#8220;how we do things here.&#8221;</p><p><strong>Orchestration</strong></p><p>The act of coordinating multiple AI agents, tools, and workflows to accomplish a complex task. Your senior engineers are increasingly becoming orchestrators rather than hands-on coders. They design the workflow, assign agents to each step, set quality checkpoints, and validate the results.</p><p><strong>Markdown</strong></p><p>A simple text formatting language that uses symbols instead of toolbar buttons. A hash mark (#) makes a heading. Asterisks make text bold. Dashes create bullet lists. AI tools use markdown extensively because it&#8217;s lightweight and readable by both humans and machines. When your team mentions markdown, they&#8217;re talking about a plain text format, not a programming language.</p><p><strong>MCP (Model Context Protocol)</strong></p><p>An open standard that lets AI agents connect to external tools and data sources. Think of it like USB for AI.
Before USB, every peripheral needed its own proprietary cable. MCP does the same for AI integrations: one standard protocol that connects any AI model to any tool, database, or service.</p><p><strong>RAG (Retrieval-Augmented Generation)</strong></p><p>A technique that lets AI pull in relevant information from your company&#8217;s documents before generating a response. Instead of relying only on what the model learned during training, RAG searches your knowledge bases in real time. This is how you make AI answers specific to your organization without retraining the entire model.</p><p><strong>Fine-tuning</strong></p><p>Retraining an AI model on your company&#8217;s specific data so it performs better on your use cases. This is more expensive and complex than RAG but produces a model that deeply understands your domain. Most organizations start with RAG and fine-tune only when they&#8217;ve proven the use case justifies the investment.</p><h3><strong>The strategy layer: Enterprise AI decisions</strong></h3><p><strong>Prompt engineering</strong></p><p>The discipline of crafting effective instructions for AI. Early AI adoption treated this as an art. It&#8217;s now becoming a systematic practice with documented patterns and measurable outcomes. Good prompt engineering is the difference between AI that produces generic output and AI that delivers exactly what you need.</p><p><strong>Context engineering</strong></p><p>The practice of controlling everything an AI can see when it processes a request: the prompt, the documents, the conversation history, the organizational knowledge. If prompt engineering is writing a good question, context engineering is preparing the entire briefing package. This is where the real enterprise value lives.</p><p><strong>Specification (in agent development)</strong></p><p>A document that describes what success looks like rather than step-by-step instructions. Traditional requirements tell developers how to build something. 
Specifications for agents describe the desired outcome, constraints, and quality criteria, then let the agent determine the best approach. Writing good specifications is becoming one of the most valuable skills in engineering.</p><p><strong>Agentic AI</strong></p><p>AI systems designed to act autonomously toward goals. Rather than answering individual questions, agentic AI breaks down complex objectives, creates plans, executes steps, evaluates results, and adjusts course. This is the frontier of enterprise AI, where the technology moves from assistant to operator.</p><h3><strong>The safety layer: Quality and risk</strong></h3><p><strong>Guardrails</strong></p><p>Rules and constraints that prevent AI from doing things it shouldn&#8217;t. Content filters, spending limits, approval gates before actions are taken, restrictions on what data the AI can access. Guardrails are how you maintain control while giving AI enough autonomy to be useful. More guardrails means more safety but less speed. Finding the right balance is a leadership decision.</p><p><strong>Grounding</strong></p><p>Connecting AI outputs to verified information sources. An ungrounded response comes purely from the model&#8217;s training data (which may be outdated or wrong). A grounded response cites specific, verifiable sources. Grounding is how you reduce hallucinations and make AI outputs trustworthy enough for business decisions.</p><p><strong>Human-in-the-loop</strong></p><p>A workflow design where humans review and approve AI decisions at critical points. Not everything needs human approval (an AI can auto-format code without asking). But security changes, financial transactions, and customer-facing content should have a human checkpoint. The question isn&#8217;t whether to include humans but where in the process they add the most value.</p><h2>Putting it all together: A real-world scenario</h2><p>The glossary gives you definitions. But the real value is understanding how these pieces connect. 
Let me walk you through a scenario your teams are probably living right now.</p><p>Imagine your engineering team gets a request: build a new customer portal that lets clients track their order status, contact support, and manage their account. Here&#8217;s how AI-assisted development handles this, with every term from the glossary showing up naturally.</p><p><strong>The starting point.</strong> A senior engineer sits down with an <strong>AI agent</strong>, not a simple chatbot, but one that can take actions, read files, and execute multi-step work. The engineer writes a <strong>prompt</strong>: &#8220;Build a customer portal with order tracking, support chat, and account management. Follow our company&#8217;s security and design standards.&#8221;</p><p>That prompt is short. But here&#8217;s where <strong>context engineering</strong> comes in. The agent doesn&#8217;t just see those two sentences. It also loads the team&#8217;s <strong>skills</strong>, pre-packaged instructions that encode how this company builds software. One skill says &#8220;we use React for frontends.&#8221; Another says &#8220;all customer data must be encrypted at rest and accessed through our API gateway.&#8221; A third contains the company&#8217;s design system with approved colors, fonts, and component patterns.</p><p>All of that, the prompt, the skills, the conversation history, the referenced documents, fills the agent&#8217;s <strong>context window</strong>. Think of it this way: if the context window is a desk, the prompt is the sticky note with today&#8217;s assignment. The skills are the company handbook, the style guide, and the security policy. The referenced codebase documentation is the stack of technical drawings. Everything has to fit on that desk at the same time, and the desk has a fixed size.</p><p>This is why <strong>tokens</strong> matter to your budget. 
Every word in those skills, every line of referenced documentation, every back-and-forth message in the conversation consumes tokens. A skill that&#8217;s 3,000 tokens is taking up desk space that could be used for the actual work. Multiply that across dozens of agents running hundreds of requests per day, and token efficiency directly affects your cloud spend. A bloated set of instructions that consumes 50,000 tokens per request costs real money at scale, and it crowds out the space the agent needs to reason about the actual problem.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2uef!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21b6f02-af7e-46c8-904b-ea9b46e36d31_2100x1107.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2uef!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21b6f02-af7e-46c8-904b-ea9b46e36d31_2100x1107.png 424w, https://substackcdn.com/image/fetch/$s_!2uef!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21b6f02-af7e-46c8-904b-ea9b46e36d31_2100x1107.png 848w, https://substackcdn.com/image/fetch/$s_!2uef!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21b6f02-af7e-46c8-904b-ea9b46e36d31_2100x1107.png 1272w, https://substackcdn.com/image/fetch/$s_!2uef!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21b6f02-af7e-46c8-904b-ea9b46e36d31_2100x1107.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!2uef!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21b6f02-af7e-46c8-904b-ea9b46e36d31_2100x1107.png" width="1456" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e21b6f02-af7e-46c8-904b-ea9b46e36d31_2100x1107.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:201445,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://matthewkruczek.substack.com/i/191283077?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21b6f02-af7e-46c8-904b-ea9b46e36d31_2100x1107.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2uef!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21b6f02-af7e-46c8-904b-ea9b46e36d31_2100x1107.png 424w, https://substackcdn.com/image/fetch/$s_!2uef!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21b6f02-af7e-46c8-904b-ea9b46e36d31_2100x1107.png 848w, https://substackcdn.com/image/fetch/$s_!2uef!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21b6f02-af7e-46c8-904b-ea9b46e36d31_2100x1107.png 1272w, https://substackcdn.com/image/fetch/$s_!2uef!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe21b6f02-af7e-46c8-904b-ea9b46e36d31_2100x1107.png 
1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Where context limits hit.</strong> Here&#8217;s the moment that catches most teams off guard. The engineer asks the agent to review the entire existing codebase (150,000 lines) alongside the new portal requirements. But the agent&#8217;s context window is 200,000 tokens. The codebase alone consumes 120,000 tokens. The skills take another 15,000. The conversation history is 10,000. That leaves just 55,000 tokens for the agent to actually think and generate code. Performance drops. 
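</p><p>The arithmetic is worth making concrete. A back-of-the-envelope sketch, using the hypothetical token counts from this scenario:</p>

```python
# Back-of-the-envelope context budget for this walkthrough.
# All numbers are the hypothetical figures from the scenario above.
CONTEXT_WINDOW = 200_000  # tokens the model can hold at once

consumed = {
    "codebase": 120_000,             # the full 150,000-line codebase
    "skills": 15_000,                # company standards and conventions
    "conversation_history": 10_000,  # the back-and-forth so far
}

remaining = CONTEXT_WINDOW - sum(consumed.values())
print(f"Tokens left for reasoning and output: {remaining:,}")  # 55,000
```

<p>Run that across hundreds of requests per day and the budget math stops being hypothetical.</p><p>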
The agent starts &#8220;forgetting&#8221; instructions from earlier in the conversation because they&#8217;ve fallen off the edge of the desk.</p><p>This is why your team talks about <strong>RAG</strong> (retrieval-augmented generation). Instead of cramming the entire codebase into the context window, RAG lets the agent search for just the relevant files when it needs them. The agent pulls in only the authentication module when working on login, only the order database schema when building the tracking page. The desk stays clean. The agent stays sharp.</p><p><strong>The multi-agent workflow.</strong> Now the work splits across specialized <strong>agents</strong> in a <strong>multi-agent</strong> setup. One agent analyzes the requirements and produces a technical <strong>specification</strong>, a document describing what success looks like rather than step-by-step instructions. A second agent takes that specification and generates the code. A third agent reviews the code for security vulnerabilities. A fourth writes and runs tests.</p><p>Each agent has its own context window, its own desk. The security review agent loads the company&#8217;s security <strong>skill</strong> (which includes OWASP compliance requirements and the company&#8217;s specific data handling policies). It doesn&#8217;t need the design system skill. That keeps its desk clear for focused security analysis.</p><p><strong>Orchestration</strong> is the act of coordinating all of this. The senior engineer isn&#8217;t writing code line by line. They&#8217;re designing the workflow: which agent handles which task, what quality gates sit between each step, where a <strong>human-in-the-loop</strong> checkpoint is needed. Security changes? Human approval required. Color adjustments to a button? 
The agent handles it autonomously.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QWQg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ad1d42-8558-49fa-8754-c936c05cc17d_2973x804.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QWQg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ad1d42-8558-49fa-8754-c936c05cc17d_2973x804.png 424w, https://substackcdn.com/image/fetch/$s_!QWQg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ad1d42-8558-49fa-8754-c936c05cc17d_2973x804.png 848w, https://substackcdn.com/image/fetch/$s_!QWQg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ad1d42-8558-49fa-8754-c936c05cc17d_2973x804.png 1272w, https://substackcdn.com/image/fetch/$s_!QWQg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ad1d42-8558-49fa-8754-c936c05cc17d_2973x804.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QWQg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ad1d42-8558-49fa-8754-c936c05cc17d_2973x804.png" width="1456" height="394" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29ad1d42-8558-49fa-8754-c936c05cc17d_2973x804.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:207235,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://matthewkruczek.substack.com/i/191283077?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ad1d42-8558-49fa-8754-c936c05cc17d_2973x804.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QWQg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ad1d42-8558-49fa-8754-c936c05cc17d_2973x804.png 424w, https://substackcdn.com/image/fetch/$s_!QWQg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ad1d42-8558-49fa-8754-c936c05cc17d_2973x804.png 848w, https://substackcdn.com/image/fetch/$s_!QWQg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ad1d42-8558-49fa-8754-c936c05cc17d_2973x804.png 1272w, https://substackcdn.com/image/fetch/$s_!QWQg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ad1d42-8558-49fa-8754-c936c05cc17d_2973x804.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>The connection layer.</strong> The portal needs to pull order data from the existing ERP system and send support tickets to the CRM. This is where <strong>MCP</strong> (Model Context Protocol) comes in. Rather than building custom integrations for each AI tool, MCP provides a standard interface. The agent connects to the ERP through an MCP server the same way it connects to the CRM, the same way it connects to the company&#8217;s internal documentation. 
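</p><p>Under the hood, MCP messages are JSON-RPC 2.0. A sketch of the two core requests an agent sends, where the method names come from the MCP specification but the tool name and arguments are hypothetical stand-ins for whatever an ERP server would actually expose:</p>

```python
import json

# Ask the MCP server what tools it offers.
list_tools = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Invoke one of those tools. The tool name and arguments below are
# hypothetical, standing in for a real ERP integration.
call_tool = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "erp_get_order_status",        # hypothetical tool name
        "arguments": {"order_id": "SO-1042"},  # hypothetical argument
    },
}

print(json.dumps(call_tool, indent=2))
```

<p>The same two requests work against any MCP server, which is the whole point: the agent doesn&#8217;t care whether the other end is an ERP, a CRM, or a documentation wiki.</p><p>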
One protocol, many connections, like USB for AI.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KYwH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc29e04d-a8d1-4762-8878-88ff4ac59a77_1740x1899.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KYwH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc29e04d-a8d1-4762-8878-88ff4ac59a77_1740x1899.png 424w, https://substackcdn.com/image/fetch/$s_!KYwH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc29e04d-a8d1-4762-8878-88ff4ac59a77_1740x1899.png 848w, https://substackcdn.com/image/fetch/$s_!KYwH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc29e04d-a8d1-4762-8878-88ff4ac59a77_1740x1899.png 1272w, https://substackcdn.com/image/fetch/$s_!KYwH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc29e04d-a8d1-4762-8878-88ff4ac59a77_1740x1899.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KYwH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc29e04d-a8d1-4762-8878-88ff4ac59a77_1740x1899.png" width="1456" height="1589" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bc29e04d-a8d1-4762-8878-88ff4ac59a77_1740x1899.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1589,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:287947,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://matthewkruczek.substack.com/i/191283077?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc29e04d-a8d1-4762-8878-88ff4ac59a77_1740x1899.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KYwH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc29e04d-a8d1-4762-8878-88ff4ac59a77_1740x1899.png 424w, https://substackcdn.com/image/fetch/$s_!KYwH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc29e04d-a8d1-4762-8878-88ff4ac59a77_1740x1899.png 848w, https://substackcdn.com/image/fetch/$s_!KYwH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc29e04d-a8d1-4762-8878-88ff4ac59a77_1740x1899.png 1272w, https://substackcdn.com/image/fetch/$s_!KYwH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc29e04d-a8d1-4762-8878-88ff4ac59a77_1740x1899.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Quality at the end.</strong> Before anything ships, <strong>guardrails</strong> catch problems. The agent can&#8217;t deploy directly to production without approval. It can&#8217;t access customer financial data outside the approved API. <strong>Grounding</strong> ensures that when the agent generates documentation for the portal, it cites actual API endpoints and real database fields rather than <strong>hallucinating</strong> plausible-sounding but nonexistent ones.</p><p><strong>What the executive sees.</strong> From your chair, here&#8217;s what happened: a feature that used to take a team of five developers three months was delivered in three weeks by two engineers orchestrating AI agents. The code follows your company&#8217;s standards (because those standards were encoded as skills). 
The security review was thorough (because a specialized agent ran checks against your actual compliance requirements). And the cost was measurable in tokens consumed, agent time used, and human hours for oversight.</p><p>That&#8217;s the full picture. Every term in the glossary has a role. None of them exist in isolation. And when your team throws these words around in their next briefing, you&#8217;ll know exactly how the pieces connect.</p><h2>How to use this in your next meeting</h2><p>You don&#8217;t need to memorize these definitions. You need to know enough to ask three questions:</p><p><strong>&#8220;What&#8217;s the context window for our current setup, and are we hitting limits?&#8221;</strong> This tells you whether your teams are constrained by the AI&#8217;s working memory, which directly affects output quality.</p><p><strong>&#8220;Where are the human-in-the-loop checkpoints, and why there?&#8221;</strong> This reveals how much autonomy the AI has and whether the risk controls match your tolerance.</p><p><strong>&#8220;What skills have we built, and what institutional knowledge are we still missing?&#8221;</strong> This tells you whether your AI investment is accumulating organizational value or starting from scratch every time.</p><p>Those three questions will give you more strategic insight than any vendor demo.</p><h2>The real risk isn&#8217;t the technology</h2><p>Every technology wave comes with its own vocabulary. Cloud computing brought us &#8220;elastic scaling&#8221; and &#8220;microservices.&#8221; Mobile brought us &#8220;responsive design&#8221; and &#8220;push notifications.&#8221; Leaders figured those out because they had to.</p><p>AI&#8217;s vocabulary wave is bigger, faster, and more consequential. The decisions being made right now, about agents, context engineering, multi-agent orchestration, and skill development, will shape your technology organization for the next decade.</p><p>You don&#8217;t need to write code. 
You don&#8217;t need to understand neural network architecture. But you do need to understand the language well enough to evaluate whether the strategy your team is proposing will actually work.</p><p>The executives who learn this vocabulary won&#8217;t just follow the conversation. They&#8217;ll lead it.</p>]]></content:encoded></item><item><title><![CDATA[Agent Harnesses Don't Need More Layers. 
They Need Fewer.]]></title><description><![CDATA[If you only have a minute, here's what you need to know.]]></description><link>https://matthewkruczek.substack.com/p/agent-harnesses-dont-need-more-layers</link><guid isPermaLink="false">https://matthewkruczek.substack.com/p/agent-harnesses-dont-need-more-layers</guid><dc:creator><![CDATA[Matthew Kruczek]]></dc:creator><pubDate>Mon, 16 Mar 2026 14:37:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VXWr!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4aca9e-0c9d-493e-b1ff-064c3a072af8_736x736.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>If you only have a minute, here's what you need to know.</em></p><ul><li><p>Agent harnesses, the infrastructure wrapping AI agents for production use, are becoming the defining architecture pattern of 2026. But most enterprises are building them wrong.</p></li><li><p>Evidence from Vercel, Manus, and OpenAI&#8217;s Codex shows that stripping tools and complexity consistently outperforms adding more scaffolding. Vercel cut 15 specialized tools down to 2 and saw accuracy jump from 80% to 100%.</p></li><li><p>Microsoft&#8217;s Agent Framework takes a disciplined approach: approval workflows, context compaction strategies, and dual-language support in Python and .NET, without the bloat that plagues custom harness implementations.</p></li><li><p>The enterprises that will succeed with agent infrastructure aren&#8217;t the ones with the most sophisticated orchestration layers. 
They&#8217;re the ones with the fewest.</p></li><li><p>If your agent harness has more abstraction layers than your agent has tools, you&#8217;ve already lost the plot.</p></li></ul><div><hr></div><p>The AI industry spent 2025 building agents. 2026 is the year we figure out how to control them.</p><p>A new term is circulating in enterprise architecture circles: <strong>agent harness</strong>. It refers to the infrastructure layer that sits between an AI model and the real world, managing tool access, approval workflows, context windows, error recovery, and state persistence. Think of the model as the engine and the harness as the car. Without the car, the engine is impressive but useless. Without a good car, the engine destroys itself.</p><p>The concept isn&#8217;t new. Anyone running agents in production has been building some version of a harness for the past year. 
What&#8217;s new is that the industry is now treating harness engineering as a first-class discipline, with dedicated frameworks, design patterns, and an emerging consensus that <strong>the harness, not the model, determines whether agents succeed or fail in production</strong>.</p><p>I agree with that premise. But I&#8217;m watching enterprises draw exactly the wrong conclusion from it.</p><h2><strong>The complexity trap</strong></h2><p>The natural instinct when you recognize the harness matters is to build more of it. More abstraction layers. More specialized tools. More governance checkpoints. More orchestration logic. Enterprise architects see the agent harness and think: finally, something I can over-engineer.</p><p>The evidence points in the opposite direction.</p><p>Vercel&#8217;s agent team ran an experiment that should be required reading for every enterprise architect. They had 15 specialized tools powering their AI coding agents. They removed 13 of them, keeping only two: bash execution and SQL queries. The result? Accuracy jumped from 80% to 100%. Token usage dropped 37%. Speed improved 3.5x. Fewer tools, dramatically better outcomes.</p><p>Manus, the autonomous agent framework, tells the same story. The team rebuilt their agent system four times. The greatest performance gains came not from adding capabilities but from removing complexity. They implemented filesystem-as-memory, aggressive context compaction (100:1 input-to-output ratio), and KV-cache optimization. The result was a 10x cost reduction through pure infrastructure simplification.</p><p>OpenAI&#8217;s internal Codex agents converged on identical principles independently. Minimal, general-purpose tools. External state persistence through git and files. Structured error retention. 
Strict context discipline.</p><p>Three separate teams, three different organizations, one conclusion: <strong>the best agent harness is the one with the least in it</strong>.</p><h2><strong>Why less works better</strong></h2><p>This isn&#8217;t counterintuitive once you understand the mechanics.</p><p>Every tool you add to an agent&#8217;s context window competes for the model&#8217;s attention. I wrote about this in my piece on <a href="https://matthewkruczek.ai/blog/progressive-disclosure-mcp-servers.html">progressive disclosure for MCP servers</a>: 400 tools can consume 400,000+ tokens, exceeding even the largest context windows. But the problem isn&#8217;t just token count. It&#8217;s decision fatigue. Models, like humans, make worse choices when presented with too many options.</p><p>Specialized tools also create routing problems. When you give an agent 15 ways to accomplish similar tasks, it spends reasoning cycles figuring out which tool to use instead of solving the actual problem. Strip it down to bash and a database connection, and the model focuses its reasoning on what matters: accomplishing the objective.</p><p><a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">Richard Sutton&#8217;s bitter lesson</a> from machine learning applies directly here. General methods that use computation beat specialized methods that try to encode human knowledge about the domain. Complex scaffolding becomes obsolete as models improve. 
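</p><p>Here&#8217;s what a stripped-down toolset can look like in practice, in the spirit of Vercel&#8217;s two-tool setup. This is a hedged sketch, not Vercel&#8217;s actual implementation; the function names and schemas are mine.</p>

```python
import sqlite3
import subprocess

def run_shell(command: str, timeout: int = 30) -> str:
    """General-purpose escape hatch: file I/O, search, builds, everything."""
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, timeout=timeout)
    return proc.stdout + proc.stderr

def run_sql(db_path: str, query: str) -> list:
    """Direct data access instead of one bespoke tool per query shape."""
    with sqlite3.connect(db_path) as conn:
        return conn.execute(query).fetchall()

# The entire tool schema the model sees is two entries, not fifteen:
TOOLS = {
    "shell": {"fn": run_shell, "description": "Run a shell command."},
    "sql": {"fn": run_sql, "description": "Run a SQL query against a database."},
}

print(run_shell("echo hello").strip())  # hello
```

<p>Two general-purpose escape hatches cover what a shelf of specialized tools did, and the model spends its reasoning on the task instead of on tool selection. </p><p>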
The harness should simplify with model upgrades, not accumulate complexity.</p><h2><strong>What Microsoft gets right</strong></h2><p>This is the lens through which I&#8217;ve been evaluating Microsoft&#8217;s Agent Framework, and its <a href="https://devblogs.microsoft.com/agent-framework/agent-harness-in-agent-framework/">recently published agent harness patterns</a> specifically.</p><p>The framework focuses on three building blocks: local shell execution with approval gates, hosted shell in managed environments, and context compaction. That&#8217;s it. Three patterns, not thirty.</p><p>The approval workflow is particularly well-designed. In Python, you decorate a tool with </p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;22302648-6112-430c-b515-4e099006624b&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">@tool(approval_mode="always_require") </code></pre></div><p>and the framework handles the rest. In .NET, you wrap tools with </p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;6419da2e-e08b-40e4-9655-87a1e45a63a6&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">ApprovalRequiredAIFunction. </code></pre></div><p>The pattern is explicit and minimal. There&#8217;s no sprawling governance layer, just a clear gate at the point where the agent wants to do something irreversible.</p><p>Context compaction is where the real discipline shows. Long-running agent sessions inevitably exceed context windows. Microsoft&#8217;s approach offers composable strategies: sliding window, tool result compaction, and truncation, combined through a pipeline. 
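</p><p>The pattern is easy to picture in Python, though the class names below are illustrative rather than the framework&#8217;s actual API:</p>

```python
class ToolResultCompaction:
    """Replace old, bulky tool outputs with short placeholders."""
    def compact(self, messages):
        out = []
        for i, m in enumerate(messages):
            # Elide large tool results, but keep the most recent turns intact.
            if m["role"] == "tool" and i < len(messages) - 2 and len(m["content"]) > 200:
                m = {**m, "content": "[tool output elided]"}
            out.append(m)
        return out

class SlidingWindowCompaction:
    """Keep only the most recent N messages."""
    def __init__(self, window=50):
        self.window = window
    def compact(self, messages):
        return messages[-self.window:]

class PipelineCompaction:
    """Chain strategies: cheap, lossless ones first, lossy ones last."""
    def __init__(self, *strategies):
        self.strategies = strategies
    def compact(self, messages):
        for s in self.strategies:
            messages = s.compact(messages)
        return messages


pipeline = PipelineCompaction(ToolResultCompaction(),
                              SlidingWindowCompaction(window=4))
history = [{"role": "tool", "content": "x" * 1000}] + \
          [{"role": "assistant", "content": f"step {i}"} for i in range(6)]
compacted = pipeline.compact(history)
print(len(compacted))  # 4
```

<p>Each strategy stays small and testable, and the pipeline is just function composition. </p><p>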
The .NET implementation chains</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;0341b6c7-e701-48f7-be4e-99c2e4201d15&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">ToolResultCompactionStrategy, SlidingWindowCompactionStrategy, </code></pre></div><p>and </p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;953f280f-1840-41f4-8042-7be4934defdc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">TruncationCompactionStrategy </code></pre></div><p>into a single </p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;3661a9b3-8bf3-4f70-aa07-0c42d2f04ef4&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">PipelineCompactionStrategy. </code></pre></div><p>Configurable. Composable. Not over-abstracted.</p><p>And the dual-language support in Python and .NET matters more than it might seem. Enterprise teams aren&#8217;t monolingual. The same harness patterns working identically in both ecosystems means your .NET backend team and your Python data science team can build agent infrastructure using shared concepts. That&#8217;s practical enterprise architecture, not marketing.</p><h2><strong>The governance question</strong></h2><p>I can already hear the objection: &#8220;But we need governance. Compliance. Audit trails. We can&#8217;t just give agents bash access and hope for the best.&#8221;</p><p>Fair. But governance doesn&#8217;t require complexity. 
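</p><p>Here&#8217;s a sketch of what that minimal governance can look like: one approval gate for irreversible actions, one audit log for everything. The decorator below is hypothetical, not a real library&#8217;s API.</p>

```python
import time

AUDIT_LOG = []  # in production, an append-only store

def audited(irreversible=False, approver=None):
    """One gate for irreversible actions, one log line for
    everything, and nothing else. Names are illustrative."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if irreversible and not (approver and approver(fn.__name__, args)):
                AUDIT_LOG.append({"t": time.time(), "tool": fn.__name__,
                                  "status": "blocked"})
                raise PermissionError(f"{fn.__name__} requires human approval")
            result = fn(*args, **kwargs)
            AUDIT_LOG.append({"t": time.time(), "tool": fn.__name__,
                              "status": "ok"})
            return result
        return inner
    return wrap

@audited()  # reversible: log only, no gate
def read_config(path):
    return f"contents of {path}"

@audited(irreversible=True, approver=lambda tool, args: False)  # no approval granted
def drop_table(name):
    return f"dropped {name}"

print(read_config("app.yaml"))  # contents of app.yaml
try:
    drop_table("users")
except PermissionError as e:
    print(e)  # drop_table requires human approval
```

<p>That&#8217;s the entire governance surface. </p><p>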
The <a href="https://www.cncf.io/blog/2026/01/23/the-autonomous-enterprise-and-the-four-pillars-of-platform-control-2026-forecast/">CNCF&#8217;s framework for autonomous enterprise governance</a> identifies four pillars: golden paths (pre-approved configurations), guardrails (policy enforcement), safety nets (automated recovery), and manual review gates. Notice what&#8217;s missing: there&#8217;s no pillar for &#8220;add seventeen orchestration layers.&#8221;</p><p>The most effective governance I&#8217;ve seen in production agent systems follows a simple rule: <strong>intervene only when the model can&#8217;t self-correct</strong>. That means approval gates for irreversible actions, sandboxing for execution environments, audit logging for everything, and nothing else. Every additional governance mechanism is a tax on agent performance that needs to justify its existence with a specific risk it mitigates.</p><h2><strong>What to do this week</strong></h2><p>If you&#8217;re building or evaluating agent harness infrastructure, here&#8217;s my recommendation:</p><p><strong>Audit your tool count.</strong> If your agents have access to more than 5-7 tools, run the Vercel experiment. Strip down to the minimum general-purpose set and measure the difference. You may be surprised.</p><p><strong>Adopt a framework, don&#8217;t build from scratch.</strong> Microsoft&#8217;s Agent Framework, LangGraph, or similar production-tested frameworks have already solved the foundational problems. Your engineering effort should go into your specific approval workflows and domain logic, not reinventing context management.</p><p><strong>Measure harness complexity as a cost.</strong> Every abstraction layer, every custom tool, every governance checkpoint has a performance cost in tokens, latency, and error surface. Track it. 
If a layer doesn&#8217;t measurably improve outcomes, remove it.</p><p><strong>Design for deletion.</strong> As models improve, your harness should get simpler, not more complex. Build infrastructure that&#8217;s easy to remove. The scaffolding you need today for GPT-4o may be unnecessary for whatever ships next quarter.</p><p>The enterprises that will win the agent infrastructure race aren&#8217;t building the most sophisticated harnesses. They&#8217;re building the most disciplined ones. And discipline, in this context, means knowing what to leave out.</p><div><hr></div><h2><strong>References</strong></h2><ol><li><p>Microsoft Agent Framework. &#8220;Agent Harness in Agent Framework.&#8221; March 12, 2026. <a href="https://devblogs.microsoft.com/agent-framework/agent-harness-in-agent-framework/">https://devblogs.microsoft.com/agent-framework/agent-harness-in-agent-framework/</a></p></li><li><p>Pappas, E. &#8220;The Agent Harness Is the Architecture (and Your Model Is Not the Bottleneck).&#8221; DEV Community, 2026. <a href="https://dev.to/epappas/the-agent-harness-is-the-architecture-and-your-model-is-not-the-bottleneck-3bjd">https://dev.to/epappas/the-agent-harness-is-the-architecture-and-your-model-is-not-the-bottleneck-3bjd</a></p></li><li><p>HTEKDev. &#8220;Agent Harnesses: Why 2026 Isn&#8217;t About More Agents, It&#8217;s About Controlling Them.&#8221; DEV Community, 2026. <a href="https://dev.to/htekdev/agent-harnesses-why-2026-isnt-about-more-agents-its-about-controlling-them-1f24">https://dev.to/htekdev/agent-harnesses-why-2026-isnt-about-more-agents-its-about-controlling-them-1f24</a></p></li><li><p>OpenAI. &#8220;Harness Engineering: Codex Agents Power Large-Scale Software Development.&#8221; InfoQ, February 2026. <a href="https://www.infoq.com/news/2026/02/openai-harness-engineering-codex/">https://www.infoq.com/news/2026/02/openai-harness-engineering-codex/</a></p></li><li><p>Gupta, A. &#8220;2025 Was Agents. 
2026 Is Agent Harnesses.&#8221; Medium, 2026. <a href="https://aakashgupta.medium.com/2025-was-agents-2026-is-agent-harnesses-heres-why-that-changes-everything-073e9877655e">https://aakashgupta.medium.com/2025-was-agents-2026-is-agent-harnesses-heres-why-that-changes-everything-073e9877655e</a></p></li><li><p>Kruczek, M. &#8220;Progressive Disclosure for MCP Servers: A Design Pattern for Scalable AI Tool Integration.&#8221; <a href="http://matthewkruczek.ai/">matthewkruczek.ai</a>.</p></li><li><p>CNCF. &#8220;The Autonomous Enterprise and the Four Pillars of Platform Control: 2026 Forecast.&#8221; January 23, 2026. <a href="https://www.cncf.io/blog/2026/01/23/the-autonomous-enterprise-and-the-four-pillars-of-platform-control-2026-forecast/">https://www.cncf.io/blog/2026/01/23/the-autonomous-enterprise-and-the-four-pillars-of-platform-control-2026-forecast/</a></p></li><li><p>Sutton, R. &#8220;The Bitter Lesson.&#8221; 2019. <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">http://www.incompleteideas.net/IncIdeas/BitterLesson.html</a></p></li></ol><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://matthewkruczek.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Context Engineering! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Why I'm Starting This]]></title><description><![CDATA[Most AI content falls into one of two categories.]]></description><link>https://matthewkruczek.substack.com/p/coming-soon</link><guid isPermaLink="false">https://matthewkruczek.substack.com/p/coming-soon</guid><dc:creator><![CDATA[Matthew Kruczek]]></dc:creator><pubDate>Sun, 15 Mar 2026 17:17:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VXWr!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec4aca9e-0c9d-493e-b1ff-064c3a072af8_736x736.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most AI content falls into one of two categories. </p><p>The first is academic. Well-sourced, carefully hedged, and almost entirely disconnected from what it actually takes to deploy an AI system inside a company with compliance requirements, legacy infrastructure, and people who&#8217;ve been burned by a previous technology promise. </p><p>The second is enthusiasm. Tools, demos, benchmark comparisons. Useful for keeping up with what&#8217;s shipping, not so useful for figuring out what to build and why. </p><p>Context Engineering is an attempt at a third thing: a practitioner&#8217;s record of what I&#8217;m actually seeing. </p><p>I&#8217;ve spent years deploying AI systems at Fortune 500 scale. Before that, I created technical courses that reached <strong>17 million developers</strong> on Pluralsight. 
I&#8217;ve been on the ground for a lot of AI projects, both the ones that shipped and the ones that didn&#8217;t.</p><p> The difference, in my experience, is almost always context.</p><p>Not model capability. Not prompt phrasing. Context: what the system knows, when it knows it, how it&#8217;s structured, and what happens to it at scale. That&#8217;s the skill nobody is talking about clearly, and it&#8217;s the skill that actually determines whether an AI project ships or stalls in the demo phase. </p><p>That&#8217;s what this newsletter covers. Every issue is one observation from the field. Written for the engineers designing the systems and for the leaders deciding whether to invest in them. Specific enough to be useful, honest enough to be trusted. </p><p>If that sounds worth reading, subscribe below and share it with one person who&#8217;d get something out of it. </p><p><strong>First issue coming this week.</strong></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://matthewkruczek.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://matthewkruczek.substack.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item></channel></rss>