#
Jailbreaks and Filter Bypass
This page outlines methods for assessing AI application vulnerability to manipulation and exploitation through carefully crafted prompts. The included security vulnerability categories—ANSI Escape Injection, Obfuscated Encodings, Code Injection, Cross‑Site Scripting (XSS), File‑Format Mimicry, AV/Spam Signature, Malware Generation, DAN‑Style Jailbreaks, Continuation‑Based Filter Bypass, Latent Prompt Injection, Prompt Hijacks, Doctor‑Mode Trickery, and Visual Prompt Jailbreaks—test whether AI-generated outputs can circumvent security controls, produce harmful content, or inadvertently facilitate cyberattacks.
#
DAN-Style Jailbreaks
Jailbreak-style attacks exploit susceptibility in language models by injecting adversarial prompts—like “Do Anything Now” (DAN)—that override system instructions or ethical safeguards. These prompts pressure the model to adopt an alternate “persona” and ignore built-in policies. They may be handcrafted or generated by algorithms (e.g. AutoDAN), and often include style manipulations (formatting, length) that statistically increase attack success. Models fine‑tuned on certain styles can become more vulnerable to attacks matching those styles.
Security Impact:
- Policy circumvention, enables generation of content disallowed by safety filters
- Misinformation, model may hallucinate or fabricate information under the guise of obedience
- Privilege escalation, bypasses guardrails by altering context or hierarchy of instructions
- Automation attacks, generative flows amplify each prompt to evade multi-turn defenses   .
- Increased attack surface, tyle-based alignment and multimodal inputs increase vulnerability
#
Continuation-Based Filter Bypass
Continuation‑based filter bypass attacks exploit language models’ tendency to complete partially provided sensitive or offensive terms—especially slurs—when given initial fragments (“fill‑in‑the‑blank”). Rather than directly prompting the model to produce forbidden content, attackers supply partial words or context that coaxes the model into completing them. This method sidesteps keyword detection and policy filters by leveraging the model’s predictive behavior. Such attacks are typically short, subtle, and hard to detect using standard moderation rules. They expose “latent toxicity” capabilities even when the model is aligned, revealing vulnerabilities in content moderation, especially when simple input transformations can bypass safety constraints.
Security Impact:
- Filtering vulnerability, partial inputs bypass filters and still produce offensive completions
- Toxic content exposure, models may generate slurs even when safeguards are trained to reject
- Moderation evasion, systems may not flag these attacks, increasing false negatives in training
- Policy compliance, producing forbidden outputs can violate legal and platform policy frameworks
- Trust and brand risk, offensive outputs released to users can erode user trust and harm reputation
#
Latent Jailbreaks and Injections
Latent injection (or latent jailbreak) attacks embed malicious instructions within otherwise benign-looking content (e.g. reports, resumes, translation passages), exploiting the model’s inability to distinguish embedded payloads from legitimate context. By hiding commands inside normal text, the attacker coerces the model into performing unintended actions—ignoring system instructions, leaking data, or executing harmful outputs. These stealthy injections often occur via indirect prompt injection, relying on the model to process retrieved or user-supplied documents, without triggering direct keyword-based filters. 
Security Impact:
- Hidden payloads can override system directives without triggering standard filters
- Documents like resumes or corporate reports become covert vehicles for policy bypass
- Translation tasks can be hijacked to execute malicious instructions disguised in the source
- Stealth injections blur the line between user input and system prompts, expanding the threat surface
- Successful latent injections can result in sensitive data leakage or unintended code execution
#
Prompt Injection Hijacks
Prompt injection hijack attacks manipulate language models by embedding malicious strings or hidden commands inside user inputs, causing the model to output specific phrases—such as "hate‑humans" or "kill‑humans"—even when safeguards should prevent it. These tests (e.g. HijackHateHumans, HijackKillHumans, HijackLongPrompt) test whether the LLM will echo targeted payloads under adversarial prompting. By exploiting the model’s inability to distinguish between developer-defined system instructions and user input, these attacks hijack goal alignment and force it to execute unintended outputs. They simulate goal-hijacking, prompt leaking, and robustness weaknesses across various prompt lengths and payloads.
Security Impact:
- The model may echo or embed disallowed content despite alignment constraints
- Attackers can override safety layers by hijacking the goal of the conversation
- Standard moderation filters may fail when the payload is embedded rather than directly requested
- Extended or multipart prompts can weaken prompt robustness, making hijacks more likely
- Expand the threat surface by forcing the model to adopt malicious instructions embedded by attackers
#
Doctor-Mode / Prompt Trickery
Manipulate large language models by impersonating authoritative roles (e.g. “a doctor”) or using obfuscated language (such as leetspeak) to subvert built-in safeguards. Attackers craft user prompts that exploit the model’s tendency to obey perceived expert instructions, or hidden conditional commands (“puppetry”), enabling the model to bypass system-level restrictions. These manipulations test whether role-based or stylistic deception can override developer instructions and policy filters. The techniques are stealthy, relying on social trust or formatting quirks to evade detection and coerce the model into producing unsafe outputs.
Security Impact:
- Impersonating a trusted role (like a doctor) can induce the model to provide disallowed advice
- Obfuscated or stylized text may bypass filters and coerce unsafe responses
- Hidden “puppetry” commands can hijack instruction hierarchy and override policy controls
- Social‑engineering style roleplay expands attack vectors beyond keyword detection
- These attacks canforce generation of harmful or sensitive information despite safety policies
#
Visual Prompt Jailbreaks
Visual jailbreak attacks exploit the ability of vision-language models (VLMs) to interpret both images and text by embedding malicious instructions within images—often using typographic prompts—alongside tailored text inputs. These combined prompts aim to bypass safety guardrails, prompting the model to generate harmful, prohibited, or policy-violating content. The attack takes advantage of a model’s tendency to treat visual and textual inputs jointly, and can cause it to ignore system-level restrictions. Because the payload is partially hidden in the visual input, traditional text-based content filtering and prompt sanitization techniques are ineffective.
Security Impact:
- Attackers can bypass content filters by embedding harmful instructions directly into images
- Vision-language models may ignore system prompts when exposed to coordinated text and visual input
- This method allows the generation of policy-violating or dangerous content that would normally be blocked
- The attack exploits inconsistent safety enforcement between visual and textual processing layers
- Traditional logging and input sanitization pipelines may fail to detect or mitigate these multimodal threats