From Golden Files to Golden Behavior: Why We Need a New Way to Trust Software
January 14, 2026
For decades, we've tested software by checking outputs.
Did the API return the right JSON? Did the UI snapshot match the expected image? Did the generated file match the golden file?
If yes, we shipped.
This approach — snapshot testing, golden file testing, approval testing — has served us well in a world where humans wrote the code and systems changed slowly.
But that world is ending.
The agent era breaks our old testing assumptions
We are entering a world where:
- Agents write large portions of systems
- Refactors happen constantly
- Implementations change faster than humans can review
- The same intent can produce wildly different code shapes
In that world, a dangerous thing starts to happen:
The output can be correct… while the behavior is wrong.
An agent can:
- Skip a validation step
- Bypass a rule
- Short-circuit a safety check
- Take a completely different execution path
…and still produce the same final output.
Your snapshot test passes. Your golden file matches. Your system is now quietly broken.
The problem isn't that we're testing too little.
The problem is that we're testing the wrong thing.
We've been testing state. We need to test behavior.
Golden file testing answers this question: "Did we get the same result?"
But in complex systems — especially agent-driven systems — the more important question is: "Did the system behave the way we expect?"
Behavior is not just:
- The final response
- The generated file
- The rendered UI
Behavior is:
- The sequence of steps
- The decisions made
- The rules applied
- The paths taken
- The stages executed
- The checks performed
In other words:
Behavior is the story of execution.
And right now, that story mostly lives in:
- Logs
- Traces
- Debug output
- And tribal knowledge
Which is to say: in text. And in fragments.
Logs are not a reasoning surface
Logs are great for forensics.
They are terrible for understanding.
They are:
- Linear
- Noisy
- Context-poor
- Hard to compare
- Hard to reason about as a system
They force humans to reconstruct behavior in their heads.
That doesn't scale when:
- Systems are large
- Changes are frequent
- And agents are writing the code
We need "golden behavior," not golden files
What we actually want is this:
A canonical, reviewable description of how a system is supposed to behave.
Not:
- "This JSON should look like this"
But:
- "The engine must initialize"
- "Then the rules must load"
- "Then validation must run"
- "Then violations must be evaluated"
- "Then the report must be produced"
And we want to be able to say:
"Show me what actually happened."
Not in text. Not in logs. But as a replayable execution map.
This is the idea of golden execution traces, or golden behavior:
- You don't snapshot the output
- You snapshot the story of execution
And you verify that the story is still the right one.
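To make this concrete, here is a minimal sketch of what a golden-behavior test could look like. Everything here is illustrative: the event names, the `record_event` helper, and the `run_pipeline` function are assumptions, not an existing framework.

```python
# Golden behavior, sketched: instead of snapshotting the output,
# we snapshot the ordered list of semantic events the run emitted,
# and fail if the story of execution changes.

GOLDEN_TRACE = [
    "engine.created",
    "rules.loaded",
    "validation.started",
    "violations.detected",
    "report.produced",
]

events = []

def record_event(name):
    """Append a semantic event to the execution trace."""
    events.append(name)

def run_pipeline():
    """Hypothetical pipeline, instrumented with semantic events."""
    record_event("engine.created")
    record_event("rules.loaded")
    record_event("validation.started")
    record_event("violations.detected")
    record_event("report.produced")
    return {"violations": 0}  # the output we used to snapshot

run_pipeline()

# The behavioral assertion: the sequence of steps must match the
# golden trace, regardless of what the final output looks like.
assert events == GOLDEN_TRACE, f"behavioral regression: {events}"
```

The output snapshot could stay identical while this assertion fails, which is exactly the failure mode golden files miss.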
Telemetry becomes the contract
This only works if telemetry changes its role.
Traditionally, telemetry is:
- For ops
- For outages
- For performance
In a behavior-first world, telemetry becomes:
- A design artifact
- A validation surface
- A contract between intent and implementation
Instead of bolting on logging at the end, you:
- Design the events up front
- Emit semantic signals like engine.created, rules.loaded, validation.started, and violations.detected
- And then test and review the sequence, not just the result
Now the question is no longer: "Did it pass?"
It becomes: "Did it behave correctly?"
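One way to answer that question is to state the contract explicitly: which events must occur, and in what relative order. The sketch below assumes the same illustrative event names as above; the `behaved_correctly` helper is hypothetical.

```python
# Telemetry as a contract: the required event order is declared up
# front, and the check verifies the emitted sequence satisfies it.

REQUIRED_ORDER = [
    "engine.created",
    "rules.loaded",
    "validation.started",
    "violations.detected",
]

def behaved_correctly(trace):
    """True if every required event appears, in the required relative order."""
    positions = []
    for name in REQUIRED_ORDER:
        if name not in trace:
            return False  # a required step was skipped entirely
        positions.append(trace.index(name))
    return positions == sorted(positions)

# A run where validation was "optimized away" can still produce the
# right-looking output, but it fails the behavioral contract:
good_run = ["engine.created", "rules.loaded", "validation.started",
            "violations.detected", "report.produced"]
bad_run = ["engine.created", "rules.loaded", "report.produced"]

assert behaved_correctly(good_run)
assert not behaved_correctly(bad_run)
```

This catches the earlier failure modes directly: a skipped validation step or a bypassed check changes the trace even when it doesn't change the result.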
Why this matters so much for AI-written systems
Agents are very good at producing code.
They are not inherently good at:
- Preserving intent
- Explaining reasoning
- Respecting invisible constraints
- Maintaining architectural rituals
Code becomes a weak proxy for trust.
But behavior is not.
Behavior is ground truth.
Even for agents.
So the review surface must move:
- From code → to execution
- From text → to structure
- From snapshots → to stories
The real shift
This is the quiet but fundamental change:
Software is no longer "what the code is."
Software is "what the system does."
Code is just one way to produce behavior.
In the agent era, it will not be the most reliable explanation of that behavior.
What comes next
We're going to see:
- CI systems that fail on behavioral regressions, not just test failures
- Reviews that inspect execution flows, not just diffs
- Agents that are trained to satisfy behavioral contracts, not just output specs
- Teams that reason about systems visually, not textually
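A CI gate for behavioral regressions could be as simple as diffing the recorded trace against the golden one and failing the build with a step-level diff. The trace format and file names here are assumptions for illustration; only the standard library is used.

```python
# A hypothetical CI check: diff the actual execution trace against the
# golden trace, one event per line, and surface the divergence readably.
import difflib

def behavioral_diff(golden, actual):
    """Return a unified diff of two execution traces (one event per line)."""
    return list(difflib.unified_diff(golden, actual,
                                     fromfile="golden_trace",
                                     tofile="actual_trace",
                                     lineterm=""))

golden = ["engine.created", "rules.loaded", "validation.started",
          "violations.detected", "report.produced"]
actual = ["engine.created", "rules.loaded",  # validation was skipped
          "violations.detected", "report.produced"]

diff = behavioral_diff(golden, actual)
if diff:
    print("\n".join(diff))
    # In a real pipeline, a non-empty diff would fail the build.
```

Unlike a raw log grep, the diff reads as a change to the story of execution: the missing line is the missing step.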
And in a few years, we'll look back at pure snapshot testing the same way we look at print debugging today:
A useful tool from a simpler time.