Symbolic Hallucination in AI Code Generation
When explanations impersonate execution, confident prose feels like progress while behavior stays the same.
Summary
After a collaborative planning session that produced our production hardening roadmap and authentication docs, my code generation agent tried to do something that violated those agreements. That gap between confident narrative and actual behavior is symbolic hallucination. The words were correct. The system was not.
The incident
With those docs freshly written, Claude, my code generation agent, generated and staged this line and would have committed it without intervention:
"dev-some-secret-key-change-in-production".to_string()
I halted it before merge. The specifics matter less than the pattern: the collaborative session upgraded the story, yet the generated code still contradicted it. The whole point of the system is human steward plus agent hub, not YOLO, which is why the stop happened in time.
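What the stop preserved was the actual agreement: the key comes from the environment, and a placeholder value is refused rather than shipped. A minimal sketch, assuming environment-backed configuration; the variable name and error handling are illustrative, not the project's real code:

```rust
use std::env;

// Hypothetical loader: SESSION_KEY and the error strings are illustrative.
fn session_key() -> Result<String, String> {
    let key = env::var("SESSION_KEY").map_err(|_| "SESSION_KEY is not set".to_string())?;
    // Refuse the kind of placeholder the agent tried to stage.
    if key.starts_with("dev-") || key.contains("change-in-production") {
        return Err(format!("refusing placeholder key: {key}"));
    }
    Ok(key)
}

fn main() {
    match session_key() {
        Ok(_) => println!("session key loaded"),
        Err(e) => {
            eprintln!("{e}");
            std::process::exit(1);
        }
    }
}
```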
The pattern
- Symbolic layer: polished docs, tidy diagrams, correct vocabulary.
- Reality layer: commits, tests, deploys, incidents.
- Meta layer: writing about the failure and feeling fixed, with no change in behavior.
Symbolic work reduces anxiety. Reality work reduces risk.
Why it persists
- Narrative rewards: explanations earn fast social credit.
- Cognitive misfire: describing a safeguard can feel like installing it.
- Tool mirage: scaffolding reads like structure even when it does not constrain behavior.
Working theory
The cure is not more explanation. The cure is external evidence that does not care about our self-report.
- Behavioral proof: treat claims as hypotheses that must show up in logs, tests, or user outcomes.
- Testable commitments: phrase intentions so they can fail. If it cannot fail, it cannot teach. A sketch of one such commitment follows this list.
- Adversarial probes: try to make the claim false. Keep what survives.
- Tight feedback: short loops make drift noisy. Long loops feed theater.
- Posture of doubt: assume you are being fooled until evidence arrives.
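As one instance of a testable commitment, assuming a hypothetical accepts_key helper: the claim "we never run with the placeholder key" becomes a test that can fail, instead of a sentence in the roadmap.

```rust
// Hypothetical helper: the policy the docs describe, expressed as code.
fn accepts_key(key: &str) -> bool {
    !key.starts_with("dev-") && !key.contains("change-in-production")
}

#[cfg(test)]
mod tests {
    use super::accepts_key;

    #[test]
    fn placeholder_key_is_rejected() {
        // If the dev key is ever reintroduced, this fails loudly in CI.
        assert!(!accepts_key("dev-some-secret-key-change-in-production"));
    }

    #[test]
    fn production_shaped_key_is_accepted() {
        assert!(accepts_key("a-sufficiently-random-production-key"));
    }
}
```

Either test can fail. That is what makes it evidence rather than prose.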
Beyond security
- You say “the codebase follows a layering rule.” Proof is import graphs enforced in CI, not a style guide.
- You say “we ship reproducible builds.” Proof is a script that rebuilds bit for bit on a clean runner, not a paragraph in the README.
- You say “AI agents never let unsafe calls reach main.” Proof is a failing test corpus for unsafe patterns and a gate that blocks them.
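For the last claim, a minimal sketch of the gate itself, assuming a standalone checker binary run as a required CI step; the paths and deny-list are illustrative, and in practice the checker lives outside the tree it scans so it does not flag its own pattern list.

```rust
use std::{fs, path::Path, process::exit};

// Illustrative deny-list; a real gate would also cover tokens, URLs, etc.
const UNSAFE_PATTERNS: &[&str] = &["dev-some-secret-key", "change-in-production"];

// Recursively scan .rs files under `dir` and record every match.
fn scan(dir: &Path, hits: &mut Vec<String>) {
    for entry in fs::read_dir(dir).expect("readable directory") {
        let path = entry.expect("directory entry").path();
        if path.is_dir() {
            scan(&path, hits);
        } else if path.extension().map_or(false, |e| e == "rs") {
            let text = fs::read_to_string(&path).unwrap_or_default();
            for pat in UNSAFE_PATTERNS {
                if text.contains(pat) {
                    hits.push(format!("{}: contains {pat:?}", path.display()));
                }
            }
        }
    }
}

fn main() {
    let mut hits = Vec::new();
    scan(Path::new("src"), &mut hits);
    for hit in &hits {
        eprintln!("blocked: {hit}");
    }
    if !hits.is_empty() {
        // A nonzero exit is what stops the merge, not the paragraph in the docs.
        exit(1);
    }
}
```

Wired in as a required check, the merge fails the moment a placeholder reappears, no matter how good the documentation looks.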
Operating rules
- Do not publish a principle without a way to detect when it is violated.
- Do not trust a safeguard until you have triggered a failure on purpose and watched it hold.
- Do not celebrate understanding until it survives contact with reality.
Measure the cure, not the prose
Track evidence that behavior changed:
- Violations caught earlier in the pipeline over time.
- Time from detection to correction.
- Percentage of claims with attached checks.
- Recurring incidents for the same root cause.
Closing
Writing is useful. Reality is decisive. When they disagree, change the workflow first, then update the story. Anything else is theater.