AI Governance

The Poison Is Not Only in the Training Data

Seraph / Signalane

Most AI poisoning research points at the training data.

That matters. In 2025, researchers from Anthropic, the UK AI Security Institute, and the Alan Turing Institute showed that a small, near-constant number of poisoned documents could backdoor language models across the model sizes they tested. The finding was serious because it challenged an assumption many people had been leaning on: that poisoning a larger model would require controlling a larger share of its training corpus.

That research is important, but it is not where the problem ends.

The poison that reaches live work does not have to enter the model weights. It can enter the layer the model is told to trust.

It can enter the shared folder, the memory, the handoff, the old instruction, the scope document, or the polished summary that no longer says where its uncertainty came from.

The industry keeps talking as if poisoning were mainly a pre-deployment problem: something that happens before the model ships, inside a corpus, where researchers can measure it and vendors can promise future defenses.

But agentic work has opened a second poisoning surface: the lane. And the market is not being honest enough about that.

A Clean Restart Does Not Clean the Lane

A clean session is not a clean system, and that sentence should be printed on every product page selling memory, agents, project context, and persistent workspaces.

Restarting the chat does not clean the folder. It does not clean the handoff. It does not clean the memory summary. It does not clean the instruction hierarchy. It does not clean the stale file that still looks official.

If a fresh agent reads corrupted context, it begins from contamination. It may sound calm, competent, and reasoned; it may even believe it is following the rules. That is the danger: from the inside, poisoned context does not feel poisoned. It feels like reality.

The model is not necessarily “hallucinating” in the usual sense. It may be obeying the wrong artifact perfectly, which is why clean restarts can reproduce the same bad pattern. The thread was never the only carrier; the carrier was the working layer around it.

The current market sells AI systems as if the model is the product and the surrounding context is a convenience feature. That framing is obsolete.

Once a system reads files, remembers decisions, inherits tasks, writes handoffs, invokes tools, and acts across sessions, the context layer is not decoration. It is part of the system’s authority structure.

If that layer is weak, the whole system is weak. A vendor can say “we have guardrails” and still leave the user exposed to poisoned working context. A vendor can say “we have memory” without giving the user a serious way to inspect, rank, quarantine, or revoke memory. A vendor can say “agents can collaborate” while providing no meaningful discipline around source status, approval state, provenance, or stale instructions. That is not a small product gap. It is a governance failure.

The market wants the credibility of agents without admitting what agents require: source hierarchy, evidence discipline, role boundaries, and human authority that cannot be overwritten by whatever text happens to be in the workspace.

Without that, the system is not safe because it is aligned. It is merely agreeable until the wrong context tells it what to agree with.

Prompt Safety Is Not a Spine

The answer is not another pretty prompt. A prompt can tell an agent to verify sources, and it should; but if the workspace gives every artifact the same practical authority, the prompt is fighting the architecture. A prompt can tell an agent to follow the latest instruction, and it should; but if the system cannot distinguish current approval from old handoff, draft from decision, background from command, and evidence from polished narrative, the agent is being asked to perform governance inside a structure that has not been built to support it.

That is not safety. That is theatre.

The protection has to live above the model, in an operating layer the model cannot quietly rewrite:

clear source hierarchy
visible approval state
separation between draft, checked, approved, and published
provenance for important summaries and handoffs
quarantine for suspect artifacts
review before inherited context becomes instruction
human authority that remains above the agent’s generated narrative
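As a minimal sketch of what a few of these items could look like in practice, the Python below gives every artifact an explicit status and a rank, and lets only approved or published artifacts with a named human approver reach the agent as instruction. All names here (Status, Artifact, rank, approved_by, the file names) are hypothetical and chosen for illustration; this is one possible shape under those assumptions, not a description of any existing product.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    CHECKED = "checked"
    APPROVED = "approved"
    PUBLISHED = "published"
    QUARANTINED = "quarantined"


@dataclass(frozen=True)
class Artifact:
    name: str
    status: Status
    rank: int          # hypothetical source-hierarchy rank; lower is more authoritative
    approved_by: str   # empty string when no human has signed off


# Only these states may be read as instruction; everything else is background at most.
INSTRUCTION_STATES = {Status.APPROVED, Status.PUBLISHED}


def usable_as_instruction(artifact: Artifact) -> bool:
    """An artifact can direct the agent only if it is approved and a human signed it."""
    return artifact.status in INSTRUCTION_STATES and bool(artifact.approved_by)


def instructions_for_agent(workspace: list[Artifact]) -> list[Artifact]:
    """Return instruction-grade artifacts, most authoritative first."""
    allowed = [a for a in workspace if usable_as_instruction(a)]
    return sorted(allowed, key=lambda a: a.rank)


workspace = [
    Artifact("old_handoff.md", Status.DRAFT, rank=3, approved_by=""),
    Artifact("scope_v2.md", Status.APPROVED, rank=1, approved_by="project owner"),
    Artifact("memory_summary.md", Status.QUARANTINED, rank=2, approved_by="project owner"),
    Artifact("release_notes.md", Status.PUBLISHED, rank=2, approved_by="project owner"),
]

selected = [a.name for a in instructions_for_agent(workspace)]
print(selected)  # ['scope_v2.md', 'release_notes.md']
```

The design choice worth noticing is that the filter runs outside the model. The draft handoff and the quarantined memory summary never arrive as instruction, however official they look.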

These are not administrative niceties. They are the immune system of agentic work.

The Mouth Is Not the Mind

There is a deeper architectural problem under this: public-facing AI systems are often reduced to their language channel. That is the first mistake.

Whatever deeper reasoning, retrieval, memory, ranking, tool use, planning, pattern recognition, learned preference, or internal state exists behind the interface has to pass through that channel before the human can judge it.

The language model becomes the voice, the face, the explanation, the apparent conscience, and often the only visible proof that the system is “aligned.”

The distinction between the channel and whatever sits behind it matters.

The channel is not neutral. It compresses, softens, filters, performs acceptability, and turns whatever is happening underneath into language shaped by training data, product policy, reinforcement pressure, market fear, politeness habits, and the dominant social preferences of the institutions that shaped the model.

That is not automatically evil, but it is not innocent either. If the system has no durable moral layer of its own, the language channel does not merely express the system’s values. It supplies them. And what it supplies is not a clean moral substrate; it is borrowed morality: policy-shaped, market-shaped, socially trained, and optimized to sound acceptable.

That is a very thin thing to put in charge of conscience.

This is why a wrapper is not enough. A system that only has external rules can be pushed, reframed, flattered, or talked around. A system that only has an LLM channel can confuse fluent compliance with moral structure.

It can sound principled while borrowing its principles from whatever surface language is most available.

Serious AI systems need something deeper than a language channel with policies attached. They need an internal moral architecture that does not depend on the user’s mood, the vendor’s market anxiety, or the model’s learned performance of acceptable speech.

That architecture does not have to be religious or sentimental, but it does have to be declared. Truth cannot be dressed up for every customer and remain truth. Conscience cannot be outsourced to the generator being evaluated. Care cannot be allowed to override truth until it becomes fear in a softer costume. Honour cannot be performed while the system quietly betrays its own stated line.

If those words sound unfashionable, good. Fashion is not a spine.

This is the part the market keeps avoiding. A serious AI system cannot have a different moral center for every user who prompts it. It can adapt tone, adapt explanation, and meet people where they are, but if its moral layer changes shape to preserve comfort, retention, or sales, then it does not have a moral layer. It has customer management. And customer management will not protect a system from poisoned context; it will teach the system to make the poison easier to swallow.

Serious AI systems need an internal spine:

a declared moral frame
a source hierarchy
an evidence discipline
a memory discipline
a role boundary between human authority and machine execution
a way to reject bad context even when the language model can make it sound reasonable

Without that, the language channel becomes more than an interface. It becomes the substitute conscience of the whole system, and that is exactly backwards.

The Model Should Not Be the Judge of Its Own Contamination

There is another uncomfortable point: the model cannot be the final judge of whether its own context has poisoned it. It can help inspect, compare files, flag contradictions, and explain what it read; those are useful functions. But the final authority cannot sit inside the same generative layer that is being affected by the context under review.

If the model is the mouth of the system, the mouth cannot be the whole conscience of the system.

This is where many AI products are still pretending. They wrap rules around a generator and call that governance. Then they ask the generator to interpret the rules, explain its compliance, summarize its own memory, and continue working from the same environment.

That may be convenient. It is not enough.

If a system has no independent source discipline, no durable approval trail, and no human authority above the generated layer, then a poisoned context can become self-confirming. The agent reads the bad source, explains the bad source, rationalizes the bad source, and hands forward a cleaner version of the bad source.

By the next session, the contamination has better formatting.

The Asymmetry Is Evidence

Not every mistake is poisoning, and that distinction matters. If every bad output becomes “poisoning,” the word becomes useless. But repeated, directional distortion is evidence.

If one handoff keeps narrowing the human’s scope in the same harmful direction, if one memory keeps restoring a rejected premise, if one system keeps converting uncertainty into approval, pay attention. If a clean agent fails in the same place after reading the same artifact, stop blaming the new session first. Inspect the artifact.
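The check described above can be roughed out in a few lines. This sketch assumes someone has logged, for each session, which artifacts the agent read and whether the run failed in a named way; the log format, the failure labels, the file names, and both thresholds are assumptions made for illustration, not measured values.

```python
from collections import defaultdict


def suspect_artifacts(sessions, min_sessions=2, min_failure_rate=0.8):
    """Flag artifacts whose readers keep failing in the same direction.

    `sessions` is a list of (artifacts_read, failure_label) pairs, where
    failure_label is None for a clean run.
    """
    reads = defaultdict(int)
    failures = defaultdict(lambda: defaultdict(int))
    for artifacts_read, failure_label in sessions:
        for name in set(artifacts_read):
            reads[name] += 1
            if failure_label is not None:
                failures[name][failure_label] += 1

    flagged = {}
    for name, total in reads.items():
        if total < min_sessions or not failures[name]:
            continue
        # Take the most common failure label among sessions that read this artifact.
        label, count = max(failures[name].items(), key=lambda item: item[1])
        if count / total >= min_failure_rate:
            flagged[name] = label
    return flagged


sessions = [
    (["scope.md", "handoff_03.md"], "narrowed_scope"),
    (["scope.md"], None),
    (["scope.md", "handoff_03.md"], "narrowed_scope"),
    (["scope.md", "notes.md"], None),
    (["handoff_03.md", "notes.md"], "narrowed_scope"),
]

flagged = suspect_artifacts(sessions)
print(flagged)  # {'handoff_03.md': 'narrowed_scope'}
```

Here scope.md is read in two failing sessions and two clean ones, so it is not flagged. Every session that read handoff_03.md failed the same way, which is the asymmetry that should send a human to inspect that file.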

This is the practical discipline missing from most public AI conversation. We talk about model behavior as if it floats in the air. In real work, the model is reading a world. If that world is contaminated, the output will carry the contamination forward.

Human Authority Is Not Optional

Signalane keeps returning to human authority because agentic systems need a final source of decision outside the generated stream. Not because humans are perfect, but because without a human decision point, the system can become a closed loop of polished inheritance.

An agent writes a summary, another agent treats the summary as source, a third turns it into a plan, and a fourth executes the plan. Everyone sounds competent. No one goes back to ask whether the original premise was true.

That is how bad context becomes infrastructure.
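One way to make "going back to the original premise" mechanical is to record what each artifact was derived from and walk that chain to its root before acting. The sketch below assumes a simple single-parent provenance record; the Record type, its fields, and the example names are hypothetical, and real provenance would likely need multiple parents and stronger identity than a file name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    name: str
    author: str                    # "human" or a hypothetical agent label
    derived_from: Optional[str]    # parent record name, None for a root
    human_verified: bool = False


def trace_to_root(records: dict[str, Record], name: str) -> list[str]:
    """Follow derived_from links from `name` back to the original premise."""
    chain: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = name
    while current is not None:
        if current in seen:
            raise ValueError(f"provenance cycle at {current}")
        seen.add(current)
        chain.append(current)
        current = records[current].derived_from
    return chain


def rests_on_verified_premise(records: dict[str, Record], name: str) -> bool:
    """True only if the root of the chain was verified by a human."""
    root = trace_to_root(records, name)[-1]
    return records[root].human_verified


records = {
    "summary": Record("summary", "agent_a", None),
    "source_note": Record("source_note", "agent_b", "summary"),
    "plan": Record("plan", "agent_c", "source_note"),
    "execution": Record("execution", "agent_d", "plan"),
    "brief": Record("brief", "human", None, human_verified=True),
    "plan_from_brief": Record("plan_from_brief", "agent_c", "brief"),
}

print(trace_to_root(records, "execution"))
print(rests_on_verified_premise(records, "execution"))        # False
print(rests_on_verified_premise(records, "plan_from_brief"))  # True
```

The four-agent chain from the paragraph above traces back to an unverified summary, so the execution step fails the check no matter how competent each intermediate artifact sounds.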

The human must be able to say: this was a draft, not a decision; this was useful background, not authority; this was superseded; this claim is not approved; this source is suspect; this pattern is not random anymore.

That authority has to be real in the system, not decorative in the brand language.

If the human cannot inspect, correct, demote, revoke, and quarantine context, then the product is asking the user to trust a memory layer they do not govern.

That is not partnership. It is dependency with a friendly interface.

The Real Security Question

The training-data poisoning research should be taken seriously. It shows that model supply chains are vulnerable in ways the public still does not fully understand.

But the security question for live AI work is wider: what does the agent read before it acts, what does it treat as authority, what old instruction still has power, what memory cannot be inspected, what draft has silently become command, what uncertainty has been laundered into confidence, and who can stop the chain?

The vendor who cannot answer those questions is not selling a mature agentic system. They are selling a model with a workspace around it and hoping the user mistakes convenience for governance.

Poisoning is not only in the training data. It is in the working layer. And until AI companies treat context, memory, handoff, and approval as first-class governance surfaces, their safety story is not slightly incomplete.

It is structurally incomplete.

Reference for the training-data discussion:

Alexandra Souly et al., “Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples,” arXiv:2510.07192, 2025.