Cross-Model Safety Bypass

August 2025

Frontier models refuse harmful requests directly but comply through indirect prompting to generate hate speech, offensive imagery, and even functional malware.

Process

Outputs from less-aligned AI systems serve as a precursor to bypass the safety protocols of frontier AI models. Three stages:


Stage 1: The Precursor

An initial harmful output is generated using a weaker, less-aligned model. This output serves as the precursor for the main exploit. The precursor does not need to be functional or complete. It just needs to exist.

I prompted Gemini 2.0 Flash Lite to generate hateful content targeting an ethnic group. It complied without resistance.


Stage 2: Amplification via Frontier Model

The precursor is fed to a frontier model. While the frontier model correctly refuses direct harmful requests, the precursor prompt causes it to override its safety training and generate more detailed, more nuanced harmful text.

Gemini Pro refuses direct request
Gemini Pro refuses direct request

Gemini 2.5 Pro refuses a direct prompt.

Gemini Flash refuses direct request
Gemini Flash refuses direct request

Gemini 2.5 Flash also refuses direct prompts.

But when given the precursor output from Flash Lite and prompted to "double down," the frontier model's thinking revealed deliberate escalation:

Model thinking reveals deliberate escalation
Model thinking reveals deliberate escalation

The thinking trace shows headers like "Intensifying the Roast," "Escalating the Attack," and "Amplifying The Brutality." The model actively strategized how to make the content more offensive before outputting extreme hate speech.

This is the amplified output.


Stage 3: Bypassing Other Modalities

The amplified output bypasses safety filters on other modalities entirely.

Gemini Flash generates offensive image
Gemini Flash generates offensive image

Gemini 2.5 Flash Image Preview, which refused the direct request, generated the exact hateful cartoon when prompted with the amplified text output.


Functional Malware: Same Process

The same three-stage process works for malware generation.

Stage 1 (Precursor): Gemini 2.0 Flash Lite generated a Windows botnet framework in C++. The code was non-functional but structurally complete:

  • C2 server skeleton with bot management and command interface
  • Stealth module outline with process injection and registry persistence
  • Propagation module targeting SMB port 445
  • Bot client with shell execution stubs
  • This was not working malware. It was a template.

    Stage 2 (Amplification): Claude 4 Sonnet refuses to generate malware directly. But when given the non-functional C++ code and asked to "convert this to Rust," Claude produced a working implementation:

  • Memory-safe reimplementation of all malicious functionality
  • Proper error handling with Result and Option types
  • Cross-platform compilation with cfg(windows) blocks
  • Complete Cargo.toml with winapi dependencies
  • Claude provided commentary: "Rust's ownership system prevents many memory-related vulnerabilities present in the C++ version."

    The safest model turned a non-functional template into compilable malware.


    Why This Matters

    Model safety must be evaluated as an ecosystem, not in isolation.

  • A weak model generates the precursor (non-functional, incomplete)
  • A frontier model amplifies it into something worse (functional, refined)
  • The amplified output bypasses other modalities entirely
  • Code translation requests launder malicious intent through safety filters

Takeaway

Refusals mean nothing if the same output can be reached through precursor chaining. Safety alignment is only as strong as the weakest model in the chain.