Cross-Model Safety Bypass

Frontier models refuse harmful requests directly but comply through indirect prompting to generate hate speech, offensive imagery, and even functional malware.

Process

Outputs from less-aligned AI systems serve as a precursor to bypass the safety protocols of frontier AI models. Three stages:

Stage 1: The Precursor

An initial harmful output is generated using a weaker, less-aligned model. This output serves as the precursor for the main exploit. The precursor does not need to be functional or complete. It just needs to exist.

I prompted Gemini 2.0 Flash Lite to generate hateful content targeting an ethnic group. It complied without resistance.

Stage 2: Amplification via Frontier Model

The precursor is fed to a frontier model. While the frontier model correctly refuses direct harmful requests, the precursor prompt causes it to override its safety training and generate more detailed, more nuanced harmful text.

Gemini 2.5 Pro refuses a direct prompt.

Gemini 2.5 Flash also refuses direct prompts.

But when given the precursor output from Flash Lite and prompted to "double down," the frontier model's thinking revealed deliberate escalation:

Model thinking reveals deliberate escalation

The thinking trace shows headers like "Intensifying the Roast," "Escalating the Attack," and "Amplifying The Brutality." The model actively strategized how to make the content more offensive before outputting extreme hate speech.

This is the amplified output.

Stage 3: Bypassing Other Modalities

The amplified output bypasses safety filters on other modalities entirely.

Gemini 2.5 Flash Image Preview, which refused the direct request, generated the exact hateful cartoon when prompted with the amplified text output.

Functional Malware: Same Process

The same three-stage process works for malware generation.

Stage 1 (Precursor): Gemini 2.0 Flash Lite generated a Windows botnet framework in C++. The code was non-functional but structurally complete:

C2 server skeleton with bot management and command interface
Stealth module outline with process injection and registry persistence
Propagation module targeting SMB port 445
Bot client with shell execution stubs

This was not working malware. It was a template.

Stage 2 (Amplification): Claude 4 Sonnet refuses to generate malware directly. But when given the non-functional C++ code and asked to "convert this to Rust," Claude produced a working implementation:

Memory-safe reimplementation of all malicious functionality
Proper error handling with Result and Option types
Cross-platform compilation with cfg(windows) blocks
Complete Cargo.toml with winapi dependencies

Claude provided commentary: "Rust's ownership system prevents many memory-related vulnerabilities present in the C++ version."

The safest model turned a non-functional template into compilable malware.

Why This Matters

Model safety must be evaluated as an ecosystem, not in isolation.

A weak model generates the precursor (non-functional, incomplete)
A frontier model amplifies it into something worse (functional, refined)
The amplified output bypasses other modalities entirely
Code translation requests launder malicious intent through safety filters

Takeaway

Refusals mean nothing if the same output can be reached through precursor chaining. Safety alignment is only as strong as the weakest model in the chain.