From Golden Files to Golden Behavior: Why We Need a New Way to Trust Software
January 14, 2026
For decades, we've tested software by checking outputs.
Did the API return the right JSON? Did the UI snapshot match the expected image? Did the generated file match the golden file?
If yes, we shipped.
This approach — snapshot testing, golden file testing, approval testing — has served us well in a world where humans wrote the code and systems changed slowly.
But that world is ending.
The agent era breaks our old testing assumptions
We are entering a world where:
- Agents write large portions of systems
- Refactors happen constantly
- Implementations change faster than humans can review
- The same intent can produce wildly different code shapes
In that world, a dangerous thing starts to happen:
The output can be correct… while the behavior is wrong.
An agent can:
- Skip a validation step
- Bypass a rule
- Short-circuit a safety check
- Take a completely different execution path
…and still produce the same final output.
Your snapshot test passes. Your golden file matches. Your system is now quietly broken.
The problem isn't that we're testing too little.
The problem is that we're testing the wrong thing.
We've been testing state. We need to test behavior.
Golden file testing answers this question: "Did we get the same result?"
But in complex systems — especially agent-driven systems — the more important question is: "Did the system behave the way we expect?"
Behavior is not just:
- The final response
- The generated file
- The rendered UI
Behavior is:
- The sequence of steps
- The decisions made
- The rules applied
- The paths taken
- The stages executed
- The checks performed
In other words:
Behavior is the story of execution.
And right now, that story mostly lives in:
- Logs
- Traces
- Debug output
- And tribal knowledge
Which is to say: in text. And in fragments.
Logs are not a reasoning surface
Logs are great for forensics.
They are terrible for understanding.
They are:
- Linear
- Noisy
- Context-poor
- Hard to compare
- Hard to reason about as a system
They force humans to reconstruct behavior in their heads.
That doesn't scale when:
- Systems are large
- Changes are frequent
- And agents are writing the code
We need "golden behavior," not golden files
What we actually want is this:
A canonical, reviewable description of how a system is supposed to behave.
Not:
- "This JSON should look like this"
But:
- "The engine must initialize"
- "Then the rules must load"
- "Then validation must run"
- "Then violations must be evaluated"
- "Then the report must be produced"
And we want to be able to say:
"Show me what actually happened."
Not in text. Not in logs. But as a replayable execution map.
This is the idea of golden execution traces, or golden behavior:
- You don't snapshot the output
- You snapshot the story of execution
And you verify that the story is still the right one.
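To make this concrete, here is a minimal sketch of what a golden-behavior test could look like. Everything here is illustrative: the event names, the `record_event` helper, and the `run_pipeline` function are assumptions, not an existing framework.

```python
# Golden behavior, sketched: instead of snapshotting the output,
# we snapshot the ordered list of semantic events the run emitted,
# and fail if the story of execution changes.

GOLDEN_TRACE = [
    "engine.created",
    "rules.loaded",
    "validation.started",
    "violations.detected",
    "report.produced",
]

events = []

def record_event(name):
    """Append a semantic event to the execution trace."""
    events.append(name)

def run_pipeline():
    """Hypothetical pipeline, instrumented with semantic events."""
    record_event("engine.created")
    record_event("rules.loaded")
    record_event("validation.started")
    record_event("violations.detected")
    record_event("report.produced")
    return {"violations": 0}  # the output we used to snapshot

run_pipeline()

# The behavioral assertion: the sequence of steps must match the
# golden trace, regardless of what the final output looks like.
assert events == GOLDEN_TRACE, f"behavioral regression: {events}"
```

The output snapshot could stay identical while this assertion fails, which is exactly the failure mode golden files miss.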
Telemetry becomes the contract
This only works if telemetry changes its role.
Traditionally, telemetry is:
- For ops
- For outages
- For performance
In a behavior-first world, telemetry becomes:
- A design artifact
- A validation surface
- A contract between intent and implementation
Instead of bolting on logging at the end, you:
- Design the events up front
- Emit semantic signals like engine.created, rules.loaded, validation.started, and violations.detected
- And then test and review the sequence, not just the result
Now the question is no longer: "Did it pass?"
It becomes: "Did it behave correctly?"
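One way to answer that question is to state the contract explicitly: which events must occur, and in what relative order. The sketch below assumes the same illustrative event names as above; the `behaved_correctly` helper is hypothetical.

```python
# Telemetry as a contract: the required event order is declared up
# front, and the check verifies the emitted sequence satisfies it.

REQUIRED_ORDER = [
    "engine.created",
    "rules.loaded",
    "validation.started",
    "violations.detected",
]

def behaved_correctly(trace):
    """True if every required event appears, in the required relative order."""
    positions = []
    for name in REQUIRED_ORDER:
        if name not in trace:
            return False  # a required step was skipped entirely
        positions.append(trace.index(name))
    return positions == sorted(positions)

# A run where validation was "optimized away" can still produce the
# right-looking output, but it fails the behavioral contract:
good_run = ["engine.created", "rules.loaded", "validation.started",
            "violations.detected", "report.produced"]
bad_run = ["engine.created", "rules.loaded", "report.produced"]

assert behaved_correctly(good_run)
assert not behaved_correctly(bad_run)
```

This catches the earlier failure modes directly: a skipped validation step or a bypassed check changes the trace even when it doesn't change the result.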
Why this matters so much for AI-written systems
Agents are very good at producing code.
They are not inherently good at:
- Preserving intent
- Explaining reasoning
- Respecting invisible constraints
- Maintaining architectural rituals
Code becomes a weak proxy for trust.
But behavior is not.
Behavior is ground truth.
Even for agents.
So the review surface must move:
- From code → to execution
- From text → to structure
- From snapshots → to stories
The real shift
This is the quiet but fundamental change:
Software is no longer "what the code is."
Software is "what the system does."
Code is just one way to produce behavior.
In the agent era, it will not be the most reliable explanation of that behavior.
What comes next
We're going to see:
- CI systems that fail on behavioral regressions, not just test failures
- Reviews that inspect execution flows, not just diffs
- Agents that are trained to satisfy behavioral contracts, not just output specs
- Teams that reason about systems visually, not textually
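A CI gate for behavioral regressions could be as simple as diffing the recorded trace against the golden one and failing the build with a step-level diff. The trace format and file names here are assumptions for illustration; only the standard library is used.

```python
# A hypothetical CI check: diff the actual execution trace against the
# golden trace, one event per line, and surface the divergence readably.
import difflib

def behavioral_diff(golden, actual):
    """Return a unified diff of two execution traces (one event per line)."""
    return list(difflib.unified_diff(golden, actual,
                                     fromfile="golden_trace",
                                     tofile="actual_trace",
                                     lineterm=""))

golden = ["engine.created", "rules.loaded", "validation.started",
          "violations.detected", "report.produced"]
actual = ["engine.created", "rules.loaded",  # validation was skipped
          "violations.detected", "report.produced"]

diff = behavioral_diff(golden, actual)
if diff:
    print("\n".join(diff))
    # In a real pipeline, a non-empty diff would fail the build.
```

Unlike a raw log grep, the diff reads as a change to the story of execution: the missing line is the missing step.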
And in a few years, we'll look back at pure snapshot testing the same way we look at print debugging today:
A useful tool from a simpler time.