We Keep Building Castles on the Swamp

When even the model providers can't tell when their own toolchain has regressed


Last month, Anthropic's own eval suite missed a quality regression in Claude Code that users had been reporting for weeks. If the people who built the model can't tell when their toolchain has shifted, the rest of us are building a castle in a swamp and wondering why our feet are wet.

I've spent the last few months building governance automation across a handful of LLM harnesses. Each one is its own substrate. And every substrate moves. Prompts that worked in March stop working in April. Schemas drift. Switch harnesses and you're starting over on different ground. The only way I find out is by watching outputs.

That's not a complaint about any single vendor. It's a structural property of building on LLMs. The fix is harness eval: your prompts, your schemas, your expected behaviours, gated in CI. Many ship without it. That's the swamp. If the swamp is winning, here's how to start building your foundation.

Exhibit A: the Anthropic postmortem

In April 2026, Anthropic published a postmortem explaining why Claude Code had felt off for several weeks. Three separate changes had shipped through their normal process. Each affected different users at different times. The combined effect looked like Claude was randomly getting worse. Their internal evals missed all of it.

The first was a March change that made Claude think less to respond faster. The second was a caching fix with a bug that kept wiping Claude's memory of earlier decisions. The third was an April system prompt tweak that capped output length and hurt coding quality. The bug passed code review, unit tests, end-to-end tests, and dogfooding. The prompt change passed weeks of internal testing and every eval they had.

The eval suite stayed silent through all of it. When Anthropic eventually ran broader tests, one of them showed a 3% drop. They only ran those tests because users had been complaining for weeks.

Look at the fix list. Per-model eval suites for every prompt change. Testing each prompt line on its own. Time to soak before rolling wider. Broader eval coverage. Gradual rollouts. The fix for a model provider's quality crisis was better testing.

If the team building the model needs that, every team building on top of it needs it more.

Exhibit B: a smarter model can break your harness

When Anthropic released Opus 4.7, the migration guide flagged that the model interprets prompts more literally, calibrates response length to task complexity, and tokenises text differently from Opus 4.6. The docs recommended a prompt and harness review as part of migration. A team that swaps claude-opus-4-6 for claude-opus-4-7 and ships gets a smarter model behaving differently. Without a harness, you cannot tell whether your application improved, regressed, or quietly shifted in ways your users will notice next week.

Exhibit C: silent provider updates

You can also sit still and lose ground. Most major providers offer model aliases that point to whatever the current best version is. The point is convenience: you do not have to update your code to get improvements. The cost is that the model behind the alias can change without your code noticing. Even pinning to a specific dated snapshot just buys time. Providers deprecate snapshots eventually, on a schedule you do not control. Your prompts have not moved. Your code has not moved. The substrate has.
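One cheap defence is to make the substrate observable: record a fingerprint of everything your harness depends on and diff it in CI. A minimal sketch, with all names and values invented for illustration:

```python
import hashlib
import json

# Sketch: pin what you can, and make the rest observable. Record a
# fingerprint of everything the harness depends on, so a silent change
# in any of it shows up as a diff in CI. All values are illustrative.
CONFIG = {
    "model": "claude-opus-4-6",  # a dated snapshot, not a floating alias
    "system_prompt_sha": hashlib.sha256(b"You are a helpful assistant.").hexdigest(),
    "schema_version": 3,
}

def fingerprint(config: dict) -> str:
    """Stable hash of the harness configuration, suitable for storing
    alongside each eval run and comparing across runs."""
    canonical = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()
```

This does not stop a provider from deprecating your snapshot, but it turns "the substrate moved" from a mystery into a one-line diff.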

What harness eval actually is

A harness eval has four parts. A golden dataset of inputs paired with expected outputs or expected behaviours. An output contract that says what shape the response must take. A scorer that compares actual output to expected, deterministically where possible and with a model-graded fallback where not. And a CI gate that fails the build when the score drops below threshold.

The dataset is the asset. Everything else is plumbing. A good harness fails loudly when something changes, even when the change looks like an improvement. That is the whole point. The harness is the early warning, not the verdict.
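As a minimal sketch of the first two parts, a golden dataset and an output contract might look like this. Every field name and case here is invented for illustration:

```python
# A golden dataset: inputs paired with expected outputs, covering the
# happy path and an edge case you have already seen break.
GOLDEN_DATASET = [
    {"input": "Summarise: revenue rose 10% in Q3.",
     "expected": {"summary": "Revenue rose 10% in Q3.", "sentiment": "positive"}},
    {"input": "Summarise: ",
     "expected": {"summary": "", "sentiment": "neutral"}},
]

def check_contract(output: dict) -> bool:
    """The output contract: what shape every response must take,
    regardless of its content."""
    return (
        isinstance(output, dict)
        and isinstance(output.get("summary"), str)
        and output.get("sentiment") in {"positive", "neutral", "negative"}
    )
```

The contract check is deliberately dumb. It catches the class of regression where the model starts returning prose instead of structure, before any scorer even runs.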

What good looks like

Start with one workflow. Pick the highest-stakes prompt in your application, the one whose output you would notice if it shifted. Write enough examples to cover the cases you actually care about catching: the happy path, the edge cases you have already seen break, and the failure modes you fear. Make the dataset small enough that you finish it in an afternoon. That is your golden dataset. It is small. It is yours. It is enough to start.

Add a scorer. Exact match where the output is structured. Model-graded with a fixed grader prompt where it is not. Wire it into CI. Fail the build if the score drops below the threshold you set today.
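A sketch of that scorer and gate for a structured output, where exact match works. The dataset, the stand-in model call, and the threshold are all invented for illustration:

```python
import sys

# Invented example: each case pairs an input with the exact structured
# output we expect. Exact match is the scorer because the output is structured.
GOLDEN = [
    ("route: reset my password", {"intent": "password_reset"}),
    ("route: cancel my plan",    {"intent": "cancellation"}),
]

def run_prompt(text: str) -> dict:
    # Stand-in for the real model call; replace with your harness.
    return {"intent": "password_reset"} if "password" in text else {"intent": "cancellation"}

def score(dataset) -> float:
    """Fraction of golden cases where actual output exactly matches expected."""
    hits = sum(1 for inp, expected in dataset if run_prompt(inp) == expected)
    return hits / len(dataset)

THRESHOLD = 0.95  # the threshold you set today

if __name__ == "__main__":
    s = score(GOLDEN)
    print(f"harness score: {s:.2f}")
    if s < THRESHOLD:
        sys.exit(1)  # non-zero exit fails the CI build
```

Run it as a CI step; the non-zero exit code is the gate. Where output is free text, the only change is swapping `score` for a model-graded comparison with a fixed grader prompt.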

That is the whole foundation. A starter dataset and a CI gate. You have not solved harness eval. You have made it impossible to silently regress on the workflow you cared most about.

Then add the second workflow. Then a third. Treat the dataset as code, reviewed in pull requests, owned by the team that owns the prompt. The dataset is the asset. The asset will rot. Prevention is the next step.


The model providers will keep shipping. The substrate will keep moving. A harness is your foundation: every regression you have seen, written down so it cannot surprise you twice. Acting on what it tells you, without stopping when the ground shifts, is the next problem.

Swift: Initialising a 2D Array

I have a struct called Tile, which has (for now) a position defined as a tuple:

struct Tile {
    let pos: (Int, Int)
}

And a class called Board, which has a 2D array of Tile objects:

class Board {
    // Example dimensions; the original post references these without
    // defining them, so the values here are placeholders.
    static let rows = 8
    static let columns = 8

    let tiles: [[Tile]]

    init() {
        var tilesArray = [[Tile]]()
        for row in 0..<Board.rows {
            var rowTiles = [Tile]()
            for column in 0..<Board.columns {
                let tile = Tile(pos: (column, row))
                rowTiles.append(tile)
            }
            tilesArray.append(rowTiles)
        }

        tiles = tilesArray
    }
}

This works, though it feels a little messy... I'll have to come back and look at this again.

Xcode 7 and Swift 2: Unit Testing (again)

Some follow up from creating a new project and adding tests.

This turned out to be important...

I hadn't really noticed in the last post, but I hadn't added the new classes to the test target, as I would have under Obj-C. In Swift 2 there's a new @testable keyword. I found it via a blog post by Natasha the Robot when I started looking into why no code coverage was showing up for my classes.

Then I started wondering why I was getting Undefined Symbol errors. I could resolve them by including the classes in the test target, but then I wouldn't get coverage, and everything I'd read about @testable assured me I didn't need to include them. Finally, I remembered I'd been a bit click-happy earlier: I'd disabled Allow testing Host Application APIs.

One checkbox later and I'm a happy camper...

Okay, not a lot done tonight but I feel like a few pieces fell into place.

Xcode Plugins

Install the Alcatraz (http://alcatraz.io) package manager to get these.  

https://github.com/neonichu/BBUFullIssueNavigator
Shows the whole error/warning in the issue navigator instead of a single line.

https://github.com/yuhua-chen/MCLog
Allows you to filter the console by a regular expression. 

https://github.com/markohlebar/Peckham
Add imports from anywhere in the code base - press Cmd+Ctrl+P to pop up a window which has autocompletion for your headers. 
https://github.com/onevcat/VVDocumenter-Xcode
Fill in a quick documentation template.
https://github.com/kinwahlai/XcodeRefactoringPlus
Additional refactoring tools. Still not as full-featured as what some IDEs offer, but it's a case where every little helps.