Separate small attackers can poison AI training data together, study finds

A team of academic researchers has shown that attackers trying to sabotage an artificial intelligence model do not have to work alone. Several attackers tampering with different parts of the training process can quietly combine into a much bigger threat than any of them poses individually. The finding, laid out in a new paper on sequential data poisoning in large language model (LLM) post-training, challenges a common assumption built into how companies vet the data they use to build and tune AI systems.

Modern chatbots and AI assistants are not trained in a single step. After an initial round of learning, they go through post-training, where the model is first refined on curated examples (a stage called supervised fine-tuning, or SFT) and then nudged toward preferred answers using preference data. That second step either optimizes on the preference data directly (direct preference optimization, or DPO) or trains a reward model on it and uses that model to guide reinforcement learning (proximal policy optimization, or PPO). Each of these stages often pulls data from a different supplier, and any of those suppliers could be compromised or malicious. Data poisoning is the practice of slipping carefully crafted bad examples into that training data so the finished model behaves the way the attacker wants.

What the researchers found

Most security analyses look at one poisoned dataset at a time, and on that basis a small amount of tampering usually looks negligible. The researchers call this the single-attacker illusion. When they instead modeled several adversaries, each poisoning a different stage, the picture changed sharply.

SFT to DPO pipelines: the effects are additive. Splitting a fixed poisoning budget across both the fine-tuning data and the preference data was more effective than spending it all in one place.
SFT to PPO pipelines: the effects are complementary. Poisoning the fine-tuning data alone failed, and poisoning the reward model alone failed, yet doing both together succeeded.

In other words, two interventions that each pass a security review as harmless can join up to reliably corrupt the final model. The authors, Jack Sanderson, Yihan Wang, Xiaoqian Lu, Gautam Kamath, and Yiwei Lu, frame this as a blind spot in how AI supply chain risk is currently measured, and they released code alongside the original paper.
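
To make that blind spot concrete, here is a minimal Python sketch. It is not taken from the paper or its released code. It assumes a hypothetical audit that estimates the fraction of suspicious examples in each stage; the StageAudit type, both limits, and the idea of summing fractions across stages are illustrative assumptions, and the numbers are invented for the example rather than drawn from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageAudit:
    """Hypothetical audit result for one post-training stage."""
    stage: str               # e.g. "sft" or "preference"
    suspect_fraction: float  # estimated share of suspicious examples


def passes_per_stage(audits, per_stage_limit):
    """Isolated review: every stage is judged only against its own limit."""
    return all(a.suspect_fraction <= per_stage_limit for a in audits)


def passes_pipeline(audits, pipeline_limit):
    """Pipeline-wide review: suspicious fractions are summed across stages.

    Summation is an illustrative assumption, not a rule from the paper.
    """
    return sum(a.suspect_fraction for a in audits) <= pipeline_limit


# Invented numbers: each stage sits just under a 0.5% per-stage limit.
audits = [
    StageAudit(stage="sft", suspect_fraction=0.004),
    StageAudit(stage="preference", suspect_fraction=0.004),
]

isolated_ok = passes_per_stage(audits, per_stage_limit=0.005)
combined_ok = passes_pipeline(audits, pipeline_limit=0.005)
print(f"per-stage review passes: {isolated_ok}")
print(f"pipeline-wide review passes: {combined_ok}")
```

Summing is a crude stand-in. It loosely mirrors the additive SFT to DPO case, but it would not by itself capture the complementary SFT to PPO case, where neither poisoned stage has a visible effect in isolation.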

Why it matters

Organizations increasingly build products on top of LLMs they fine-tune themselves, often using datasets, preference labels, and reward models sourced from third parties or scraped from the open web. The research suggests that auditing each of those sources in isolation gives a false sense of safety, because a determined adversary, or several uncoordinated ones, can spread a poisoning campaign thinly across multiple stages and stay under the radar of any single check.

What you should do

The paper does not offer a turnkey fix and treats the result mainly as a warning to defenders. For teams training or fine-tuning their own models, there are three practical takeaways: treat the entire post-training pipeline as one trust boundary rather than a series of separate ones; track provenance for SFT data, preference data, and reward models together; and test finished models for unexpected behavior rather than relying only on per-dataset screening. Because the staged nature of the attack is the whole point, defenses that inspect just one stage will tend to miss it.
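
One way to track provenance for all three inputs together is a single manifest covering the whole pipeline. The Python sketch below is an assumption about how a team might do that, not something the paper prescribes. The ArtifactRecord type, the stage labels, and the supplier names are hypothetical; only the standard-library hashlib and json modules are used.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ArtifactRecord:
    """One input to the post-training pipeline (hypothetical schema)."""
    stage: str     # "sft", "preference", or "reward_model"
    supplier: str  # hypothetical supplier label
    sha256: str    # content hash of the dataset or model file


def record_artifact(stage, supplier, content: bytes) -> ArtifactRecord:
    """Hash the raw bytes of an artifact and attach stage and supplier."""
    return ArtifactRecord(stage, supplier, hashlib.sha256(content).hexdigest())


def pipeline_fingerprint(records) -> str:
    """Return one hash over every record, independent of input order.

    A change to any single stage changes the fingerprint, so a finished
    model can be tied to the exact combination of inputs that produced it.
    """
    ordered = sorted(records, key=lambda r: (r.stage, r.supplier, r.sha256))
    payload = json.dumps([asdict(r) for r in ordered], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder bytes stand in for real dataset and model files.
manifest = [
    record_artifact("sft", "vendor-a", b"sft examples"),
    record_artifact("preference", "vendor-b", b"preference pairs"),
    record_artifact("reward_model", "vendor-c", b"reward model weights"),
]
print(pipeline_fingerprint(manifest))
```

Storing the fingerprint next to each model checkpoint lets behavioral test results be traced back to the full set of suppliers at once, rather than to one dataset at a time. It records what went into a model but does not by itself detect poisoning.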

This briefing is provided by IntelFusions for informational and defensive purposes only. It is based on sources assessed to be reliable at the time of writing, and analytic judgments carry the confidence levels indicated. Indicators of compromise are defanged; re-arm them only in controlled environments. IntelFusions is not affiliated with the organizations named and makes no warranty as to completeness or accuracy.