When 'AI red teaming' becomes security theater

The AI safety industry has adopted the term red teaming to describe just about any form of adversarial testing of AI systems. The term borrows credibility from cybersecurity, which is convenient, because it sounds serious. The problem is that it doesn’t really mean the same thing.

Full disclosure: I’m not innocent myself. I recorded what was probably the first online course on AI red teaming with Andrew Ng. I’ve built automated tools that we happily marketed under the red teaming umbrella. So think of this less as a lecture and more as me looking back at the mess I helped create.

What red teaming means in cybersec

In cybersecurity, red teaming is a broad, realistic simulation of an adversarial attack. It can combine network penetration, physical intrusion, and social engineering. Crucially, it’s often unannounced – the defense team doesn’t know it’s happening. It tries to test the entire security posture, including detection and response.

Penetration testing is the more controlled cousin: find vulnerabilities in a defined scope, usually with defenders aware and cooperating.

The AI industry version of red teaming

When the AI safety community picked up AI red teaming, the meaning drifted. A US executive order defined AI red teaming as:

a structured testing effort to find flaws and vulnerabilities in an AI system, often in a controlled environment and in collaboration with developers of AI.

Read that carefully: controlled environment, in collaboration with developers. In cybersecurity terms, that’s closer to a pentest.

I don’t think this makes the term wrong. Cybersec people sometimes get protective about it, but they borrowed it from the military themselves and adapted it to their own context. The core meaning is preserved: red teaming is about adopting adversary thinking, a simple role-play exercise that helps challenge assumptions and avoid groupthink. The AI community got their own version of it, which is fine.

Where I disagree

The problem isn’t what we call it. The problem is when the label is used to disguise shallow work.

Here’s a pattern I’ve seen too many times (and that some of our own tooling at Giskard made easy to do): run a set of adversarial prompts against a model, collect the outputs, produce a nice report. Call it red teaming.

That’s adversarial evaluation. It’s like running a vulnerability scanner, but for AI systems. It’s useful: it may catch failures early, costs little, and scales well. I’ve built tools for exactly this because I think they have value, especially for a field which moves so fast that most people don’t even know what the basic risks are. Let me be clear: I think everybody should do that as a first step.

However, the interesting part of AI red teaming – the part that’s actually hard – is the thinking. Understanding a specific system, its context, its users, and imagining how things could go wrong in ways nobody put on a checklist. It’s a slow process, requires some creativity, an interdisciplinary mindset, and certainly doesn’t fit neatly into a slide deck.

Focus on what matters

If someone tells you their AI system has been red teamed, ask what that means. There’s a big difference between “we spent weeks trying to break this thing from different angles” and “we ran a benchmark against a set of prompts we found on the internet.”

I don’t think we need better terminology, just a bit of honesty about what we do. It’s a young field. We are still in the jungle, let’s not make it harder than it already is.