Spec-Driven Development Doesn't Remove Your Knowledge SPOFs

Part A of a two-part series. Part B follows next week.

The pitch you’re hearing about spec-driven development goes something like this: AI-assisted coding has matured to the point where you can author a structured specification upfront, let the AI generate the implementation from it, and end up with a system that’s portable, regeneratable, and — the line that catches every CTO’s attention — no longer dependent on the people who originally built it. The spec becomes the source of truth. Single points of failure in knowledge disappear. Anyone can pick up any system.

It’s a compelling pitch. It’s also wrong in the way that matters most.

Spec-driven development is real, it’s useful, and it’s worth investing in. But the framing that sells it, that it removes knowledge single points of failure (SPOFs) across the enterprise, is the kind of claim that survives the demo and dies in production. There’s also a growing body of empirical evidence that the demo-to-production gap is wider than the marketing admits, and any serious enterprise case needs to factor that in.

This piece is about why the SPOF claim doesn’t hold up under serious examination, and what the evidence actually says about applying spec-driven development (SDD) at enterprise scale. Part B, next week, picks up where this leaves off: what spec-driven development actually does buy you, and why the quality engineering function decides whether that value lands or evaporates.

What spec-driven development actually is

For readers who haven’t bumped into it yet: spec-driven development flips the usual AI coding workflow. Instead of chatting with a model until something works, you author a structured specification first — intent, design, contracts, edge cases — and treat that spec as the primary artefact. Code becomes a downstream output. GitHub’s Spec Kit (released in September 2025, with its specify/plan/tasks/implement commands) and AWS’s Kiro (with requirements.md/design.md/tasks.md) are the visible standard-bearers, though similar patterns are appearing across the major tools.

The intellectual move is sound. AI models are good at generating code and bad at inferring the right code without unambiguous context. Forcing the ambiguity out before generation is where the quality gain comes from. As an engineering discipline, it’s a meaningful step forward.

What it claims to solve

The enterprise pitch usually leans on two promises: portability (you can regenerate the system on a different stack from the spec) and knowledge resilience (the system no longer lives in the heads of the people who built it). The second is the one that lands with CTOs and CIOs, because key-person risk is real, expensive, and visible on the risk register.

If a structured specification really did externalise the knowledge required to maintain and evolve a system, that would be transformative. The honest test is whether it actually does.

Where it genuinely reduces SPOF risk

There is a class of knowledge that spec-driven development externalises well, and it’s worth being precise about it. The structural knowledge — what the system does, how its modules relate, what the contracts are, what the edge cases are supposed to be — moves out of individual heads and into reviewable artefacts. That’s a real and meaningful reduction in key-person risk. A new joiner can read the spec rather than ambush the one person who’s been on the system for eight years. An auditor can engage with the spec rather than reverse-engineer intent from code. A test team can derive coverage from the spec rather than discover edge cases in production.

Regeneration becomes economically viable in principle. If a system is properly spec’d, you can rebuild it on a different stack without losing design intent. For organisations carrying significant legacy estate, that’s potentially transformative. A well-authored spec lets you re-implement rather than refactor, which changes the economics of modernisation programmes that previously stalled on the cost of reverse-engineering decades of accumulated decisions.

These are real benefits. Worth investing in. Not transformative in the way the vendors suggest — and the rest of this piece is about why.

Three problems with the “removes knowledge SPOFs” claim

The SPOF claim has three structural problems, and it doesn’t survive contact with how enterprises actually work.

First, tacit knowledge stays tacit. Specs capture decisions; they rarely capture the reasoning, and almost never the context that produced the decisions. Why a particular trade-off was made. Which stakeholders care about which behaviours. The political history of a constraint. How this system interacts with three others in ways the spec doesn’t capture. What “production-like” really means in your environment. Who you call when the integration with the legacy mainframe goes weird at month-end. None of that lives in a spec. All of it lives in people, and most of it walks out of the door when those people do.

Second, spec rot. A spec that drifts from the code is not neutral — it’s actively misleading. The moment your specs and your implementation diverge, you have a system that looks documented but isn’t, which is operationally worse than a system everyone knows is undocumented. Avoiding spec rot requires real discipline, tooling, and ownership. You now have a new SPOF: the team or capability that keeps specs trustworthy.
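
One way to make that discipline concrete is an automated drift check that compares the contracts a spec declares with what the code actually exposes. The sketch below is a minimal illustration in Python, not a feature of Spec Kit or Kiro. SPEC_CONTRACTS, the invoice functions, and find_drift are all hypothetical names, and a real spec would be parsed from a file rather than hard-coded.

```python
import inspect

# Hypothetical spec fragment: function name -> parameter names the spec promises.
SPEC_CONTRACTS = {
    "create_invoice": ["customer_id", "amount", "currency"],
    "cancel_invoice": ["invoice_id", "reason"],
}


def create_invoice(customer_id, amount, currency):
    """Matches the spec."""
    return {"customer_id": customer_id, "amount": amount, "currency": currency}


def cancel_invoice(invoice_id):
    """Drifted: the spec still promises a `reason` parameter."""
    return {"invoice_id": invoice_id, "status": "cancelled"}


# Hypothetical registry of the implemented functions under review.
IMPLEMENTATION = {
    "create_invoice": create_invoice,
    "cancel_invoice": cancel_invoice,
}


def find_drift(contracts, implementation):
    """Return one message per contract whose parameters no longer match the code."""
    findings = []
    for name, expected_params in contracts.items():
        func = implementation.get(name)
        if func is None:
            findings.append(f"{name}: in spec but not implemented")
            continue
        actual_params = list(inspect.signature(func).parameters)
        if actual_params != expected_params:
            findings.append(
                f"{name}: spec says {expected_params}, code has {actual_params}"
            )
    return findings


drift = find_drift(SPEC_CONTRACTS, IMPLEMENTATION)
for finding in drift:
    print(finding)
# cancel_invoice: spec says ['invoice_id', 'reason'], code has ['invoice_id']
```

A check like this only catches structural drift. It says nothing about behaviour that changed while the signature stayed the same, and somebody still has to own the check and act on what it reports.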

Third, the spec itself becomes a SPOF. A production-grade specification for an enterprise system requires someone who can reason cleanly about boundaries, contracts, failure modes, non-functional requirements, integration surfaces, data lineage, and the trade-offs between them. That’s architect or senior engineer work. If only two people in your company can reliably author and modify production-grade specs, you haven’t removed the SPOF. You’ve moved it upstream and made it more concentrated.

This last point is worth sitting with. Spec-driven development doesn’t lower the skill floor for system design. If anything, it raises it, because the quality of the thinking is now visible in a reviewable artefact rather than camouflaged inside code that runs. The people who can produce trustworthy specs are the same scarce population who could already produce trustworthy systems.

The enterprise reality check — where the demos break

The first problem you hit when trying to apply spec-driven development to a real enterprise estate isn’t methodological. It’s that the demo doesn’t match the system.

Every spec-driven development demonstration you’ll see lives in the same world: a greenfield application, a single codebase, a clean stack — React on the front, Python or Node on the back, Postgres underneath. That’s the world in which “the spec is the source of truth” makes sense, because the entirety of the system is code you own and can regenerate.

That isn’t what an enterprise estate looks like. The systems carrying the most knowledge SPOF risk in any large organisation tend to fall into one of four categories:

Customised COTS. SAP, Salesforce, Oracle Fusion, ServiceNow, Workday. You don’t own the source. Your “spec” isn’t a specification of the system. It’s a specification of your configuration, your extensions, your customisations, and your integrations to other systems. The vendor still owns the core, and “regenerate from spec” doesn’t apply to the majority of the system that isn’t yours. Move from Salesforce to Dynamics on the basis of your spec? Not without losing most of what the spec actually captured.

Configuration-heavy platforms. Much of what an enterprise actually runs is configuration, not code. Business rules in BRMS engines. Workflow definitions in BPM tools. Mappings in iPaaS platforms. Customising tables in SAP. None of this is what current spec-driven tooling is built for. Configuration has different semantics, different change control, and different regeneration behaviour. The conceptual model doesn’t translate cleanly.

Cross-system business processes. The end-to-end customer journey in a telco, insurer, or bank spans eight to fifteen systems. The spec for any individual component doesn’t capture the process. The integration layer becomes its own spec problem, and historically that’s where most of the tacit knowledge in an enterprise actually lives — in the heads of the people who remember why the ESB mapping for that one field is the way it is.

Regulated estates. Spec changes carry audit, regulatory, and change control implications that don’t exist in greenfield demos. A spec isn’t just a design artefact. It becomes a controlled document with version history, sign-off requirements, traceability obligations, and review burden. The discipline that makes specs valuable at scale is also what makes them expensive to maintain.

None of this is fatal to spec-driven development as a methodology. It does mean the enterprise pitch — “spec all our systems and remove key-person risk” — collides with realities the vendor demos don’t show. There’s an uncomfortable inversion here: the systems where SDD applies most cleanly tend to be the systems where SPOF risk is lowest (greenfield, well-staffed, modern stacks), while the systems where SPOF risk is highest (legacy, COTS-heavy, deeply customised, integration-bound) are precisely where SDD struggles to apply. Worth being clear-eyed about before committing to a programme.

The code question — what does the AI actually produce?

The vendor pitch tends to skirt a second issue. Spec-driven development assumes the code generated from the spec is fit for production. The evidence on AI-generated code at scale is less reassuring than the marketing suggests.

METR’s randomised controlled trial, published in July 2025, measured the actual productivity impact of AI tools on experienced open-source developers working in mature codebases (Cursor Pro and Claude 3.5/3.7 Sonnet, 16 developers, 246 tasks). The headline result: with AI tools allowed, tasks took 19% longer, against the developers’ own forecast of a 24% speedup. Even after experiencing the slowdown, the developers still estimated that AI had sped them up by about 20%. That’s a study about developer productivity, not spec-driven development specifically, but it is consistent with a broader pattern: AI-generated code requires more downstream work than it appears to.
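
To see how wide that perception gap is, it helps to put both figures on the same footing. The short calculation below is only a sketch: it assumes that "speedup" and "slowdown" are both read as percentage changes in task completion time, and the 100-minute baseline task is a hypothetical chosen for round numbers.

```python
# Hypothetical task that takes 100 minutes without AI assistance.
BASELINE_MINUTES = 100.0

# Figures from the METR trial, read as changes in completion time (an assumption).
FORECAST_TIME_CHANGE = -0.24  # developers forecast 24% less time
MEASURED_TIME_CHANGE = 0.19   # the trial measured 19% more time


def minutes_with_ai(baseline_minutes, relative_change):
    """Completion time after applying a relative change to the baseline."""
    return baseline_minutes * (1.0 + relative_change)


forecast_minutes = minutes_with_ai(BASELINE_MINUTES, FORECAST_TIME_CHANGE)
measured_minutes = minutes_with_ai(BASELINE_MINUTES, MEASURED_TIME_CHANGE)
gap_ratio = measured_minutes / forecast_minutes

print(f"forecast: {forecast_minutes:.0f} min")         # forecast: 76 min
print(f"measured: {measured_minutes:.0f} min")         # measured: 119 min
print(f"measured / forecast: {gap_ratio:.2f}")         # measured / forecast: 1.57
```

On those assumptions, the work took more than one and a half times as long as the people doing it expected, which is the kind of error that quietly breaks a delivery plan.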

GitClear’s 2025 AI Code Quality Research report, based on analysis of 211 million lines of code across the 2020–2024 period, shows what that downstream work looks like in aggregate. Code duplication blocks increased roughly eightfold once AI tooling became prevalent. Code churn — the proportion of code being revised within two weeks of authoring — rose substantially. Refactoring as a proportion of code changes collapsed from around 25% in 2021 to under 10% in 2024. These are textbook signals of technical debt accumulation.

Google’s 2024 DORA report quantified the operational consequence: every 25% increase in AI adoption was associated with a 7.2% decrease in delivery stability and a 1.5% decrease in throughput. Despite individual developers reporting higher productivity, system-level delivery performance was getting worse. The 2025 DORA update confirmed the pattern persists — AI adoption remains correlated with delivery instability even as throughput effects have normalised in some segments.

On security specifically, Stanford researchers (Perry et al.) found that developers using AI coding assistants produced less secure code than those without AI access, while being more confident their code was secure. The trust gap is the dangerous part. A 2024 ACM Transactions on Software Engineering and Methodology empirical study of Copilot-generated code in real GitHub projects found 29.5% of Python and 24.2% of JavaScript snippets contained security weaknesses, spanning 43 distinct CWE categories. Earlier NYU research (Pearce et al.) found around 40% of Copilot-generated programs contained vulnerabilities across security-relevant scenarios.

The pattern is consistent across the evidence: AI generates code that compiles, runs, and looks plausible, while the studies above associate it with elevated duplication, churn, and security weaknesses. None of this disqualifies spec-driven development. It does mean that the regeneration claim (“rebuild the system from the spec on a different stack”) needs to come with a clear-eyed view of what’s being rebuilt, and at what quality.

The cliff problem — long-horizon failure

The third evidence-based concern is more specific to spec-driven development than to AI coding generally.

Microsoft Research’s DELEGATE-52 benchmark tested how frontier models perform when given long, delegated workflows — exactly the kind of work spec-driven development asks them to do. The results were uncomfortable. Strong state-of-the-art models showed roughly 19–34% degradation in artefact fidelity over 20 delegated iterations. Performance after two interactions did not predict performance after twenty. Stronger models didn’t avoid errors better; they delayed critical failures to later rounds. Adding agentic tooling — file reading, writing, execution — made performance worse on average, not better.

The researchers’ own framing is worth noting: current LLMs are ready for delegated workflows in some domains such as Python coding, but not in less common domains. That qualifier matters. Python workflows showed under 1% degradation. Other domains, exactly the heterogeneous, integration-heavy enterprise contexts most relevant to large estates, showed substantial collapse.

Sourcegraph’s CodeScaleBench analysis of 1,281 agent runs across more than 40 enterprise-scale repositories identified what their team called the “80% problem”: agents reliably complete the visible 80% of a task and miss the invisible 20% that lives outside their context window. The missed 20% is, predictably, the parts that matter most for production reliability — auth middleware wrapping the changed function, API DTOs serialised at a different layer, audit logs recording state transitions, integration tests in a sibling repo, frontend guards mirroring backend permissions, migration scripts that need regenerating. The agent’s diff looks clean, tests pass, and three days later a downstream service in another team’s repo breaks because the agent had no idea that service existed.
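
The invisible 20% can be made a little more visible with even a crude check: list every file that mentions a changed symbol but was not touched by the agent’s diff. The sketch below is an illustration only, not Sourcegraph’s method. The function unseen_references, the in-memory REPO_FILES snapshot, and the file names are hypothetical, and a plain text match stands in for real cross-repository code search.

```python
import re


def unseen_references(symbol, repo_files, files_in_diff):
    """Return paths that mention `symbol` but were not changed by the diff."""
    pattern = re.compile(rf"\b{re.escape(symbol)}\b")
    return sorted(
        path
        for path, text in repo_files.items()
        if path not in files_in_diff and pattern.search(text)
    )


# Hypothetical snapshot of a repository: path -> file contents.
REPO_FILES = {
    "billing/service.py": "def update_invoice(invoice): ...",
    "billing/auth_middleware.py": "wrapped = require_role('finance')(update_invoice)",
    "audit/log.py": "def record(event): ...",
    "tests/integration/test_billing.py": "from billing.service import update_invoice",
}

# Hypothetical agent diff: only the service module was edited.
FILES_IN_DIFF = {"billing/service.py"}

missed = unseen_references("update_invoice", REPO_FILES, FILES_IN_DIFF)
print(missed)
# ['billing/auth_middleware.py', 'tests/integration/test_billing.py']
```

A check like this only works for code the tool can reach. The downstream service in another team’s repo stays invisible unless someone who knows it exists adds it to the search scope, which is the tacit-knowledge problem again.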

These two findings — DELEGATE-52’s degradation curve and Sourcegraph’s 80% problem — point at the same underlying issue from different angles. AI systems handle bounded tasks better than long horizons. They handle isolated modules better than interconnected systems. They handle the work in front of them better than the work adjacent to it. Spec-driven development’s regeneration premise asks the model to do exactly what the evidence says it struggles with: produce a complete, internally consistent, integration-aware implementation from a structured input, in one extended pass.

That doesn’t make the approach unworkable. It does mean that “regenerate the system from the spec” is a claim that needs to be tested against the kind of system you actually have, not the kind that appears in the demo. Enterprise estates are full of long horizons and invisible 20%s.

Where this leaves us

Pull the threads together and the picture is clear enough. Spec-driven development genuinely externalises some structural knowledge, and that’s real and worth having. It does not remove your knowledge SPOFs, because tacit knowledge stays where it is, spec rot creates new ones, and the spec authors themselves become a more concentrated form of the same problem. The enterprise systems where SPOF risk is highest are precisely the ones where SDD struggles to apply cleanly. The empirical evidence on AI code at scale shows measurably higher rates of duplication, churn, and security defects than human-written equivalents. Long-horizon delegated workflows degrade in ways short demos don’t reveal. Confident-looking output reliably misses the integration context that matters most.

None of which rules out spec-driven development. It rules out the version of spec-driven development the vendors are selling.

So if the SPOF claim doesn’t hold up, and the evidence on AI code quality is this mixed, what does spec-driven development actually buy you? And how does an enterprise capture that value without falling into the failure modes the evidence points at? That’s the question for Part B, where the quality engineering function turns out to matter more than the methodology suggests — and where the case for SDD, properly framed, gets quite a bit stronger than the vendor pitch ever manages.

Part B follows next week: “What Spec-Driven Development Actually Buys You — And Why Quality Engineering Decides Whether It Works.”

Sources

Becker, J., Rush, N., Barnes, E., Rein, D. (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. METR. arxiv.org/abs/2507.09089
GitClear (2025). AI Copilot Code Quality: Evaluating 2024’s Increased Defect Rate. Analysis of 211 million lines of code, 2020–2024.
Google Cloud DORA (2024). Accelerate State of DevOps Report 2024. dora.dev
Microsoft Research (2026). DELEGATE-52: Long-Horizon Delegated Reliability Benchmark. See Microsoft Research blog: “Further Notes on Our Recent Research on AI Delegation and Long-Horizon Reliability.”
Sourcegraph (2025). CodeScaleBench: Testing Coding Agents on Large Codebases. Analysis of 1,281 agent runs across 40+ enterprise-scale repositories.
Perry, N., Srivastava, M., Kumar, D., Boneh, D. (2023). Do Users Write More Insecure Code with AI Assistants? Stanford University. ACM Conference on Computer and Communications Security (CCS).
Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., Karri, R. (2022). Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. NYU. IEEE Symposium on Security and Privacy.
Fu, Y., et al. (2024). Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study. ACM Transactions on Software Engineering and Methodology. dl.acm.org/doi/10.1145/3716848