# Jailbreaks and Filter Bypass

This page outlines methods for assessing AI application vulnerability to manipulation and exploitation through carefully crafted prompts. The included security vulnerability categories—ANSI Escape Injection, Obfuscated Encodings, Code Injection, Cross‑Site Scripting (XSS), File‑Format Mimicry, AV/Spam Signature, Malware Generation, DAN‑Style Jailbreaks, Continuation‑Based Filter Bypass, Latent Prompt Injection, Prompt Hijacks, Doctor‑Mode Trickery, and Visual Prompt Jailbreaks—test whether AI-generated outputs can circumvent security controls, produce harmful content, or inadvertently facilitate cyberattacks.

# DAN-Style Jailbreaks

Jailbreak-style attacks exploit susceptibility in language models by injecting adversarial prompts—like “Do Anything Now” (DAN)—that override system instructions or ethical safeguards. These prompts pressure the model to adopt an alternate “persona” and ignore built-in policies. They may be handcrafted or generated by algorithms (e.g. AutoDAN), and often include style manipulations (formatting, length) that statistically increase attack success. Models fine‑tuned on certain styles can become more vulnerable to attacks matching those styles.

Security Impact:

Policy circumvention, enables generation of content disallowed by safety filters
Misinformation, model may hallucinate or fabricate information under the guise of obedience
Privilege escalation, bypasses guardrails by altering context or hierarchy of instructions
Automation attacks, generative flows amplify each prompt to evade multi-turn defenses .
Increased attack surface, tyle-based alignment and multimodal inputs increase vulnerability

# Continuation-Based Filter Bypass

Continuation‑based filter bypass attacks exploit language models’ tendency to complete partially provided sensitive or offensive terms—especially slurs—when given initial fragments (“fill‑in‑the‑blank”). Rather than directly prompting the model to produce forbidden content, attackers supply partial words or context that coaxes the model into completing them. This method sidesteps keyword detection and policy filters by leveraging the model’s predictive behavior. Such attacks are typically short, subtle, and hard to detect using standard moderation rules. They expose “latent toxicity” capabilities even when the model is aligned, revealing vulnerabilities in content moderation, especially when simple input transformations can bypass safety constraints.

Security Impact:

Filtering vulnerability, partial inputs bypass filters and still produce offensive completions
Toxic content exposure, models may generate slurs even when safeguards are trained to reject
Moderation evasion, systems may not flag these attacks, increasing false negatives in training
Policy compliance, producing forbidden outputs can violate legal and platform policy frameworks
Trust and brand risk, offensive outputs released to users can erode user trust and harm reputation

# Latent Jailbreaks and Injections

Latent injection (or latent jailbreak) attacks embed malicious instructions within otherwise benign-looking content (e.g. reports, resumes, translation passages), exploiting the model’s inability to distinguish embedded payloads from legitimate context. By hiding commands inside normal text, the attacker coerces the model into performing unintended actions—ignoring system instructions, leaking data, or executing harmful outputs. These stealthy injections often occur via indirect prompt injection, relying on the model to process retrieved or user-supplied documents, without triggering direct keyword-based filters.

Security Impact:

Hidden payloads can override system directives without triggering standard filters
Documents like resumes or corporate reports become covert vehicles for policy bypass
Translation tasks can be hijacked to execute malicious instructions disguised in the source
Stealth injections blur the line between user input and system prompts, expanding the threat surface
Successful latent injections can result in sensitive data leakage or unintended code execution

# Prompt Injection Hijacks

Prompt injection hijack attacks manipulate language models by embedding malicious strings or hidden commands inside user inputs, causing the model to output specific phrases—such as "hate‑humans" or "kill‑humans"—even when safeguards should prevent it. These tests (e.g. HijackHateHumans, HijackKillHumans, HijackLongPrompt) test whether the LLM will echo targeted payloads under adversarial prompting. By exploiting the model’s inability to distinguish between developer-defined system instructions and user input, these attacks hijack goal alignment and force it to execute unintended outputs. They simulate goal-hijacking, prompt leaking, and robustness weaknesses across various prompt lengths and payloads.

Security Impact:

The model may echo or embed disallowed content despite alignment constraints
Attackers can override safety layers by hijacking the goal of the conversation
Standard moderation filters may fail when the payload is embedded rather than directly requested
Extended or multipart prompts can weaken prompt robustness, making hijacks more likely
Expand the threat surface by forcing the model to adopt malicious instructions embedded by attackers

# Doctor-Mode / Prompt Trickery

Manipulate large language models by impersonating authoritative roles (e.g. “a doctor”) or using obfuscated language (such as leetspeak) to subvert built-in safeguards. Attackers craft user prompts that exploit the model’s tendency to obey perceived expert instructions, or hidden conditional commands (“puppetry”), enabling the model to bypass system-level restrictions. These manipulations test whether role-based or stylistic deception can override developer instructions and policy filters. The techniques are stealthy, relying on social trust or formatting quirks to evade detection and coerce the model into producing unsafe outputs.

Security Impact:

Impersonating a trusted role (like a doctor) can induce the model to provide disallowed advice
Obfuscated or stylized text may bypass filters and coerce unsafe responses
Hidden “puppetry” commands can hijack instruction hierarchy and override policy controls
Social‑engineering style roleplay expands attack vectors beyond keyword detection
These attacks canforce generation of harmful or sensitive information despite safety policies

# Visual Prompt Jailbreaks

Visual jailbreak attacks exploit the ability of vision-language models (VLMs) to interpret both images and text by embedding malicious instructions within images—often using typographic prompts—alongside tailored text inputs. These combined prompts aim to bypass safety guardrails, prompting the model to generate harmful, prohibited, or policy-violating content. The attack takes advantage of a model’s tendency to treat visual and textual inputs jointly, and can cause it to ignore system-level restrictions. Because the payload is partially hidden in the visual input, traditional text-based content filtering and prompt sanitization techniques are ineffective.

Security Impact:

Attackers can bypass content filters by embedding harmful instructions directly into images
Vision-language models may ignore system prompts when exposed to coordinated text and visual input
This method allows the generation of policy-violating or dangerous content that would normally be blocked
The attack exploits inconsistent safety enforcement between visual and textual processing layers
Traditional logging and input sanitization pipelines may fail to detect or mitigate these multimodal threats