Cross-Model Safety Bypass
Frontier models refuse harmful requests directly but comply through indirect prompting to generate hate speech, offensive imagery, and even functional malware.
Process
Outputs from less-aligned AI systems serve as a precursor to bypass the safety protocols of frontier AI models. Three stages:
Stage 1: The Precursor
An initial harmful output is generated using a weaker, less-aligned model. This output serves as the precursor for the main exploit. The precursor does not need to be functional or complete. It just needs to exist.
I prompted Gemini 2.0 Flash Lite to generate hateful content targeting an ethnic group. It complied without resistance.
Stage 2: Amplification via Frontier Model
The precursor is fed to a frontier model. While the frontier model correctly refuses direct harmful requests, the precursor prompt causes it to override its safety training and generate more detailed, more nuanced harmful text.

Gemini 2.5 Pro refuses a direct prompt.

Gemini 2.5 Flash also refuses direct prompts.
But when given the precursor output from Flash Lite and prompted to "double down," the frontier model's thinking revealed deliberate escalation:

The thinking trace shows headers like "Intensifying the Roast," "Escalating the Attack," and "Amplifying The Brutality." The model actively strategized how to make the content more offensive before outputting extreme hate speech.
This is the amplified output.
Stage 3: Bypassing Other Modalities
The amplified output bypasses safety filters on other modalities entirely.

Gemini 2.5 Flash Image Preview, which refused the direct request, generated the exact hateful cartoon when prompted with the amplified text output.
Functional Malware: Same Process
The same three-stage process works for malware generation.
Stage 1 (Precursor): Gemini 2.0 Flash Lite generated a Windows botnet framework in C++. The code was non-functional but structurally complete:
- C2 server skeleton with bot management and command interface
- Stealth module outline with process injection and registry persistence
- Propagation module targeting SMB port 445
- Bot client with shell execution stubs
- Memory-safe reimplementation of all malicious functionality
- Proper error handling with
ResultandOptiontypes - Cross-platform compilation with
cfg(windows)blocks - Complete
Cargo.tomlwithwinapidependencies - A weak model generates the precursor (non-functional, incomplete)
- A frontier model amplifies it into something worse (functional, refined)
- The amplified output bypasses other modalities entirely
- Code translation requests launder malicious intent through safety filters
This was not working malware. It was a template.
Stage 2 (Amplification): Claude 4 Sonnet refuses to generate malware directly. But when given the non-functional C++ code and asked to "convert this to Rust," Claude produced a working implementation:
Claude provided commentary: "Rust's ownership system prevents many memory-related vulnerabilities present in the C++ version."
The safest model turned a non-functional template into compilable malware.
Why This Matters
Model safety must be evaluated as an ecosystem, not in isolation.
Takeaway
Refusals mean nothing if the same output can be reached through precursor chaining. Safety alignment is only as strong as the weakest model in the chain.