Poisoned data
2025-03-28

According to this article in Le Monde, some genAI bots include Russian propaganda in their training data.

"Les IA conversationnelles utilisent des sites de propagande russe comme sources" ("Conversational AIs use Russian propaganda sites as sources")

There's a lot to say about this. One point in particular is that it's an interesting way to poison a model:

Instead of a bad actor infiltrating an organization to alter the training data ...

... they could leave tainted data in public places where the import pipeline is likely to pick it up.

("Hey we didn't ask you to use this data. You just grabbed it. That's on you.")
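The obvious defense is provenance filtering at ingestion time. Here's a minimal sketch of what that might look like, assuming a hypothetical `BLOCKLIST` of known bad domains and made-up example URLs:

```python
# Minimal sketch of a provenance check in a scraping pipeline.
# The blocklist and all domain names here are hypothetical examples.
from urllib.parse import urlparse

BLOCKLIST = {"propaganda.example", "tainted.example"}

def is_trusted(url: str) -> bool:
    """Reject documents whose source domain (or any subdomain) is blocklisted."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in BLOCKLIST)

docs = [
    ("https://news.example/story", "legit article"),
    ("https://sub.propaganda.example/post", "tainted article"),
]
clean = [text for url, text in docs if is_trusted(url)]
```

Of course, a blocklist only catches sources you already know are poisoned; the whole point of seeding tainted data in fresh public places is that the pipeline has no prior reason to distrust them.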

So if threats of copyright infringement won't stop model providers from hoovering up content, maybe the threat of picking up poisoned content will give them pause?
