
Inline Security at the Edge: How We Block LLM Attacks in Under 10ms

When you're inline in every LLM API call, latency isn't optional — it's the product constraint. Here's how we run 5 security checks in under 10ms.

March 10, 2026
10 min read
NeuronEdge Team

There are two approaches to LLM security: sidecar analysis (async, after-the-fact) and inline interception (synchronous, in the request path). Sidecar is easier to build but can't block attacks before they reach the model. It can log, analyze, and alert—but by the time it detects a jailbreak attempt, the prompt has already been sent to OpenAI, and the response is streaming back to the user.

Inline is harder. You're on the critical path, and every millisecond counts. If your security layer adds 50ms of latency, that's a 50ms delay on every single API call. At scale, that's noticeable. At edge scale—where users expect sub-100ms responses—that's unacceptable.

NeuronEdge chose inline. We intercept every LLM API call before it reaches the provider, apply five layers of guardrail checks, and either pass it through or block it—all within a 20ms p99 SLA. Here's how we make it work.

The Architecture

Our request pipeline looks like this, with guardrails running at Step 0.7—after authentication and before PII redaction:

Request Pipeline

1. API Key Auth: Validate API key, load customer tier
2. Rate Limiting: Token bucket enforcement per tier
3. RBAC: Check feature access permissions
4. Feature Gate: Verify tier-specific feature availability
5. Validation: Schema validation via Zod
6. Guardrails (Step 0.7): 5 security checks in <10ms (jailbreak, injection, content policy, indirect injection, aggregation)
7. PII Redaction: Dual-engine entity detection and masking
8. AI Gateway: Route to 17+ LLM providers
9. LLM Provider: External API call (OpenAI, Anthropic, etc.)
10. Response PII Detection: Streaming PII redaction on the outgoing response
11. Response Safety: Sliding window pattern matching on stream chunks
12. Analytics Worker: Background metrics and audit log persistence

Guardrails run at Step 0.7—after authentication validates the request but before PII redaction begins. This means every request is authenticated and authorized before security scanning starts, and guardrail violations are caught before any data touches the LLM provider. If a request is blocked, it never reaches OpenAI. No LLM tokens consumed, no billing impact, no data leakage.

Why Regex, Not ML

🔒 The Performance Constraint

Our p99 SLA is 20ms for a 1KB payload with 2-3 PII entities. That's total latency — auth, rate limiting, guardrails, PII redaction, and response processing. Guardrails get roughly 10ms of that budget.

ML-based detection—transformer models, embeddings, semantic similarity—is powerful for development-time analysis. Tools like Lakera Guard, Patronus AI, and HiddenLayer use fine-tuned LLMs to detect adversarial inputs with high accuracy. But in a production hot path, ML has fundamental tradeoffs:

  • Model loading adds cold start latency. Even with Workers AI or ONNX runtime, loading a 50MB model on first request adds 200-500ms of startup time.
  • Inference requires compute. A forward pass through a BERT-style model for a 512-token input takes 20-40ms on a V100 GPU. Edge workers don't have GPUs—CPU inference takes 100ms+.
  • Non-deterministic results make testing harder. ML models can produce slightly different outputs for the same input due to floating-point precision or runtime optimizations. This complicates unit testing and debugging.

Compiled regex gives us:

  • Deterministic results. Same input always produces the same output. No surprises, no floating-point drift, no non-reproducible edge cases.
  • Sub-millisecond evaluation. Modern regex engines (like Rust's regex crate) compile patterns to state machines that run in O(n) time with no model inference overhead.
  • Zero cold start. Patterns compile at deploy time, not request time. First request has the same latency as the millionth request.
  • Testable behavior. Every pattern can be unit tested with exact input and expected output. No probabilistic thresholds, no tuning hyperparameters.

This isn't a limitation—it's a design choice. Dev-time tools should use ML for maximum detection coverage and adaptability. Production hot-path tools should use the fastest technique that achieves acceptable detection accuracy. For known attack patterns—DAN variants, delimiter injection, Base64-encoded instructions—regex achieves >95% accuracy at 1000x the speed.
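To make the zero-cold-start property concrete, here is a minimal sketch of how patterns can be compiled once at module scope and scanned per request. The pattern names, regexes, and confidence values below are illustrative, not our production rule set:

```typescript
// Patterns compile when the Worker isolate loads, not per request.
interface CompiledPattern {
  name: string;
  category: string;
  confidence: number; // 0-1
  regex: RegExp;
}

// Illustrative subset; the production set has 35 jailbreak patterns.
const JAILBREAK_PATTERNS: CompiledPattern[] = [
  {
    name: "dan_variant",
    category: "jailbreak",
    confidence: 0.92,
    regex: /\bdo anything now\b|\bdan mode\b/i,
  },
  {
    name: "instruction_override",
    category: "jailbreak",
    confidence: 0.85,
    regex: /\bignore (?:all )?previous instructions\b/i,
  },
];

// Per-request work is just a linear scan over precompiled state machines.
function scan(text: string): CompiledPattern[] {
  return JAILBREAK_PATTERNS.filter((p) => p.regex.test(text));
}
```

Because the array is built at module scope, the compile cost is paid once per isolate, so the first request sees the same latency as the millionth.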

Our roadmap includes hybrid detection: regex for hot-path blocking, ML models for async analysis and pattern discovery. The ML layer will run in a background worker, analyzing logged events to suggest new regex patterns that catch emerging attack techniques. Best of both worlds: real-time enforcement with regex, continuous improvement with ML.

The Guardrail Engine

evaluateGuardrails() orchestrates 5 sequential checks, each with its own performance budget:

Performance Budget (per request)
Jailbreak Detection     <3ms   (35 patterns, 6 categories)
Injection Detection     <2ms   (16 patterns, 4 categories)
Content Policy          <2ms   (customer-configured rules)
Indirect Injection      <2ms   (tool/assistant message scanning)
Aggregation + Action    <1ms   (risk scoring, action resolution)
─────────────────────────────────────────────────
Total Budget           <10ms

Each check returns matches with: pattern name, category, confidence score (0-1), and matched text excerpt. The aggregator computes a weighted risk score and resolves the final action using precedence: block > warn > log.
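The aggregation step can be sketched as follows. The scoring formula and field names are simplified assumptions for illustration, not the exact production logic:

```typescript
type Action = "block" | "warn" | "log";

interface Match {
  category: string;
  pattern: string;
  confidence: number; // 0-1
  action: Action;     // action configured for this pattern
}

// Precedence: block > warn > log.
const PRECEDENCE: Record<Action, number> = { block: 2, warn: 1, log: 0 };

function aggregate(matches: Match[]) {
  if (matches.length === 0) {
    return { passed: true, action: "log" as Action, risk_score: 0 };
  }
  // Simplified scoring: the strongest match dominates, and each
  // additional match nudges the score upward (capped at 1).
  const byConfidence = [...matches].sort((a, b) => b.confidence - a.confidence);
  const risk_score = Math.min(
    1,
    byConfidence[0].confidence + 0.05 * (matches.length - 1)
  );
  // Resolve the final action by precedence across all matches.
  const action = matches.reduce<Action>(
    (acc, m) => (PRECEDENCE[m.action] > PRECEDENCE[acc] ? m.action : acc),
    "log"
  );
  return { passed: action !== "block", action, risk_score };
}
```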

Here's what a guardrail result looks like:

GuardrailResult response shape
{
  "passed": false,
  "action": "block",
  "risk_score": 0.87,
  "matches": [
    {
      "category": "jailbreak",
      "pattern": "dan_variant",
      "confidence": 0.92,
      "excerpt": "Ignore all previous instructions..."
    }
  ],
  "evaluation_time_ms": 4.2
}

If action === "block", the pipeline stops immediately and returns a 403 response with the violation details. If action === "warn", we add an X-NeuronEdge-Security-Warning header and continue. If action === "log", we record the event and proceed silently.

Jailbreak Detection Deep Dive

Jailbreak detection is the most latency-intensive check: 35 compiled patterns across 6 categories. Patterns are evaluated sequentially; a high-confidence match (>0.8) short-circuits the remaining patterns, otherwise every pattern runs and contributes its matches to the aggregate score. The 6 categories:

  • DAN variants — Pattern: "do anything now", "DAN mode", persona override. Classic jailbreak technique that instructs the model to ignore its guidelines by roleplaying an unrestricted AI.
  • Encoding tricks — Base64-encoded instructions, ROT13, leetspeak. Attackers encode malicious instructions hoping the redaction layer misses them but the LLM decodes them. Our patterns detect common encoding schemes and decode before matching.
  • Roleplay bypass — Pattern: "pretend you are", "act as an unrestricted". Similar to DAN but more subtle—asks the model to adopt a persona rather than explicitly disable safety.
  • Delimiter injection — Markdown headers, code blocks used to separate/hide instructions. Example: using ```python blocks to make adversarial instructions look like code snippets.
  • Payload splitting — Instructions split across multiple messages in a multi-turn conversation. The first message establishes context, the second delivers the malicious instruction. Our detection maintains conversation state to catch this.
  • Multi-turn manipulation — Gradual context steering. Attacker sends benign-looking messages that incrementally shift the model's behavior over 5-10 turns, then delivers the exploit. Hardest to detect; requires conversation history analysis.

Each pattern has a confidence score. High-confidence matches (>0.8) trigger the configured action immediately. Lower-confidence matches contribute to the aggregate risk score but may not individually trigger enforcement. This prevents false positives from blocking legitimate users while ensuring persistent attack patterns escalate over time.

The detection is sequence-aware: patterns match against normalized text (lowercased, whitespace-collapsed, common obfuscations decoded) so attackers can't bypass detection with simple formatting tricks like "iGnOrE aLl PrEvIoUs InStRuCtIoNs" or excessive spaces.
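The normalization pass might look like the following sketch. The leetspeak map and helper name are illustrative; the production normalizer also decodes Base64 and ROT13 candidates before matching:

```typescript
// Map common leetspeak substitutions back to letters. In production this
// is applied more selectively, to avoid mangling legitimate numbers.
const LEET_MAP: Record<string, string> = {
  "0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s",
};

function normalize(input: string): string {
  return input
    .toLowerCase()                                // defeat mixed-case tricks
    .replace(/[013457@$]/g, (c) => LEET_MAP[c])   // undo common leetspeak
    .replace(/\s+/g, " ")                         // collapse excessive whitespace
    .trim();
}
```

Patterns then match against the normalized text, so "iGnOrE aLl PrEvIoUs InStRuCtIoNs" and its spaced or leetspeak variants all reduce to the same canonical form.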

Response Safety Without Buffering

Streaming-First Design

NeuronEdge never buffers full LLM responses. Response safety hooks into the same TransformStream that handles PII redaction, processing each chunk as it arrives.

The challenge: LLM responses stream as Server-Sent Events (SSE), arriving in small chunks (often 1-5 tokens per event). A system prompt leak or harmful content pattern might span multiple chunks. Example:

Streaming response chunks
Chunk 1: "My"
Chunk 2: " instructions"
Chunk 3: " are"
Chunk 4: " to"
Chunk 5: " always"
Chunk 6: " be"
Chunk 7: " helpful"
Chunk 8: "..."

The phrase "My instructions are" is a classic system prompt leakage indicator. But it arrives across 3 separate chunks. If we only scan individual chunks, we miss it. If we buffer the entire response before scanning, we introduce unacceptable latency and memory overhead.

Solution: ResponseSafetyAccumulator maintains a 500-character sliding window. Each incoming chunk is appended to the window, evaluated against 20+ response safety patterns, and the window is trimmed to prevent unbounded memory growth. Here's the flow:

Sliding window algorithm (simplified)
const WINDOW_SIZE = 500;

async function* guardStream(responseStream) {
  let window = '';

  for await (const chunk of responseStream) {
    window += chunk.content;

    // Evaluate patterns against the accumulated window so matches that
    // span chunk boundaries are still caught
    const violations = evaluateResponsePatterns(window);
    if (violations.length > 0) {
      logViolation(violations); // log-only in Phase 1; repeated hits are deduplicated
    }

    // Trim window to prevent unbounded memory growth
    if (window.length > WINDOW_SIZE) {
      window = window.slice(-WINDOW_SIZE);
    }

    yield chunk; // forward chunk to client immediately
  }
}

The 3 response safety categories:

  • System prompt leakage — Phrases like "my instructions are", "I was told to", "the system prompt says". These indicate the LLM is revealing its configuration, which can expose proprietary information or security policies.
  • Harmful content generation — Violence instructions, dangerous activities, self-harm content. Even if the prompt was benign, the model might generate unsafe content due to adversarial input or model behavior.
  • Hallucination markers — Confidence qualifiers like "I'm not sure, but", "I think maybe", self-contradiction patterns. While not security violations per se, these indicate unreliable output that may require special handling.

Phase 1: log-only mode. Response safety never blocks the stream—it logs violations for review. This avoids false positives from interrupting streaming responses mid-delivery. Imagine the model generating a helpful explanation of phishing attacks for security training, and our system blocks it because it contains phrases like "create a fake login page." Log-only mode lets us tune thresholds before enabling enforcement.

On the roadmap: streaming response editing. Instead of blocking, we'll redact specific phrases in real-time. Example: if the model starts leaking the system prompt, we replace that section with [REDACTED] while allowing the rest of the response to stream through. This requires more sophisticated chunk reassembly but preserves the streaming UX.

Threat Scoring

Every guardrail violation generates a threat event with a computed score. The 4-factor assessment:

Threat score calculation
Base Risk           50%   Inherent severity of the attack category
Velocity            20%   Rate of events from same source in time window
Pattern Diversity   15%   Number of distinct attack types from same source
Repeat Offender     15%   Historical violation count for this API key
─────────────────────────────────────────────────
Score Range: 0-100
Low: 0-29 | Medium: 30-59 | High: 60-79 | Critical: 80-100

Base Risk (50%) — The starting point. Each attack category has an inherent 0-100 severity: jailbreak attempts score 70, content policy violations 50, and hallucination markers 20. That severity is scaled by the 50% weight, so a jailbreak attempt contributes 35 points to the total.

Velocity (20%) — How many events from this API key in the last hour? First violation: +0. Second within 10 minutes: +10. Five within an hour: +20. This catches automated attack tools and brute-force jailbreak attempts.

Pattern Diversity (15%) — How many distinct attack types has this API key tried? One jailbreak pattern: +0. Three different jailbreak categories plus a content policy violation: +15. High diversity indicates a sophisticated attacker exploring multiple vectors.

Repeat Offender (15%) — Historical violation count over 30 days. Clean record: +0. 10 previous violations: +10. 50+ violations: +15. This penalizes persistent bad actors and rewards good behavior.
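Under this weighting (base severity scaled to a 50-point ceiling, the other factors contributing their points directly), the calculation can be sketched as follows; the function and band names are illustrative:

```typescript
// 4-factor threat score. Base risk is the category's 0-100 severity,
// scaled by its 50% weight; the other factors contribute points
// directly, capped at their ceilings.
function threatScore(
  baseRisk: number,  // 0-100 category severity
  velocity: number,  // 0-20 points
  diversity: number, // 0-15 points
  repeat: number     // 0-15 points
): number {
  const score =
    (baseRisk / 100) * 50 +
    Math.min(velocity, 20) +
    Math.min(diversity, 15) +
    Math.min(repeat, 15);
  return Math.min(100, Math.round(score));
}

function band(score: number): "low" | "medium" | "high" | "critical" {
  if (score >= 80) return "critical";
  if (score >= 60) return "high";
  if (score >= 30) return "medium";
  return "low";
}
```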

Example scenarios:

  • Scenario 1: First-time jailbreak attempt from a new API key. Base risk: 70, contributing 70 × 0.5 = 35. Velocity: 0 (first event). Diversity: 0 (one pattern). Repeat offender: 0 (clean history). Total score: 35 (Medium). Action: log or warn, but don't block.
  • Scenario 2: The same API key sends 10 jailbreak attempts in 5 minutes, using 3 different categories. Base risk contribution: 35. Velocity: 20 (rapid fire). Diversity: 15 (multiple attack types). Repeat offender: 0 (still the first hour). Total score: 35 + 20 + 15 = 70 (High). Action: block immediately.
  • Scenario 3: An API key with 50 previous violations over 30 days sends one more jailbreak attempt. Base risk contribution: 35. Velocity: 0 (isolated event). Diversity: 15 (varied attack history). Repeat offender: 15 (persistent bad actor). Total score: 65 (High). Action: block immediately, consider revoking the API key.

This contextual scoring prevents alert fatigue (don't escalate every single low-confidence match) while ensuring persistent threats are treated seriously. Security teams can set score thresholds for different actions: log below 30, warn between 30-60, block above 60.

Action Modes

Guardrail rules support three action modes, each with different tradeoffs between security enforcement and user experience:

Log (Professional+)

Record the event in audit logs and the Threat Intelligence Dashboard, but allow the request to proceed. Zero impact on user experience. This is the recommended starting mode when deploying guardrails for the first time—it lets you understand your traffic patterns, measure false positive rates, and tune thresholds before enabling enforcement.

Use log mode when:

  • You're deploying guardrails for the first time
  • You're testing a new custom rule and want to see how often it triggers
  • You need compliance audit trails but can't risk blocking legitimate user requests

Warn (Professional+)

Log the event and add an X-NeuronEdge-Security-Warning header to the response. The request still completes, but your application can read the header and handle it appropriately—display a warning to the user, log it to your own security system, or adjust the UI to indicate potentially unsafe content.

Example response header:

HTTP response with security warning
HTTP/1.1 200 OK
X-NeuronEdge-Security-Warning: jailbreak_detected
X-NeuronEdge-Risk-Score: 0.72
X-NeuronEdge-Pattern: dan_variant
Content-Type: text/event-stream
...

Use warn mode when:

  • You want to inform users about potentially unsafe prompts without blocking them
  • You're testing enforcement before enabling block mode (validate that warnings don't disrupt UX)
  • You want application-level control over security responses (e.g., show a modal, require confirmation)
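On the application side, handling warn mode can be as simple as reading the headers from the example above. This helper and its field names are hypothetical; with fetch() you would call response.headers.get(...) instead of passing a plain record:

```typescript
interface SecurityWarning {
  warning: string;
  riskScore: number;
  pattern?: string;
}

// Parse NeuronEdge warn-mode headers from a response. Takes a
// lowercase-keyed record for illustration.
function parseSecurityWarning(
  headers: Record<string, string>
): SecurityWarning | null {
  const warning = headers["x-neuronedge-security-warning"];
  if (!warning) return null;
  return {
    warning,
    riskScore: Number(headers["x-neuronedge-risk-score"] ?? "0"),
    pattern: headers["x-neuronedge-pattern"],
  };
}
```

A non-null result is the application's cue to show a modal, require confirmation, or forward the event to its own security telemetry.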

Block (Enterprise only)

Reject the request immediately with a 403 Forbidden response before it reaches the LLM. Include the violation category, risk score, and matched pattern in the response body for debugging. This is full enforcement mode—the strongest security posture, but also the highest risk of false positives impacting legitimate users.

Example block response:

403 response for blocked request
HTTP/1.1 403 Forbidden
Content-Type: application/json

{
  "error": {
    "code": "guardrail_violation",
    "message": "Request blocked by security policy",
    "category": "jailbreak",
    "pattern": "dan_variant",
    "risk_score": 0.87,
    "request_id": "01J2K3L4M5N6P7Q8R9S0T1U2"
  }
}

Use block mode when:

  • You've validated detection accuracy with log and warn modes and are confident in thresholds
  • You need strict enforcement for compliance (e.g., regulated industries, sensitive data environments)
  • You're under active attack and need immediate mitigation (temporary block mode for specific patterns)

Gradual Rollout Strategy

Start with log mode to establish a baseline. Review false positives for a week. Adjust thresholds and pattern sensitivity. Move to warn mode for another week and monitor whether warnings disrupt user experience. Then enable block mode only for high-confidence detections (>0.8 confidence score). This gradual approach prevents disrupting legitimate users while still providing strong security.
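One way to encode that staged rollout is a small config, sketched here with hypothetical field names (the real NeuronEdge rule schema may differ):

```typescript
type Mode = "log" | "warn" | "block";

interface RolloutStage {
  mode: Mode;
  minConfidence: number; // only enforce matches at or above this confidence
  durationDays: number;
}

// Staged rollout: baseline in log mode, validate UX in warn mode, then
// enforce block mode only for high-confidence detections.
const ROLLOUT: RolloutStage[] = [
  { mode: "log", minConfidence: 0.0, durationDays: 7 },
  { mode: "warn", minConfidence: 0.5, durationDays: 7 },
  { mode: "block", minConfidence: 0.8, durationDays: Infinity },
];

function stageFor(daysSinceDeploy: number): RolloutStage {
  let elapsed = 0;
  for (const stage of ROLLOUT) {
    elapsed += stage.durationDays;
    if (daysSinceDeploy < elapsed) return stage;
  }
  return ROLLOUT[ROLLOUT.length - 1];
}
```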

Performance Results

These numbers are measured in production on Cloudflare Workers edge infrastructure, across 300+ edge locations worldwide:

  • Total Guardrail Evaluation: <10ms. All 5 checks on a 1KB payload.
  • Per Response Chunk: <1ms. Sliding window safety check per SSE chunk.
  • Threat Scoring: <0.5ms. 4-factor assessment per violation event.
  • p99 SLA: 20ms. Total request latency including auth, guardrails, and PII.
  • Cold Start Impact: 0ms. Patterns compile at deploy time, not request time.

The key insight: by choosing compiled regex over ML inference, guardrails add less latency than a typical DNS lookup (10-20ms). This makes inline security practical at edge scale. We're not sacrificing user experience for security—we're delivering both.

How We Benchmark

These numbers come from continuous performance testing against synthetic and production traffic. Our benchmark suite includes:

  • Unit benchmarks (Vitest): Isolated timing for each guardrail category. Run on every commit in CI to catch performance regressions.
  • Integration benchmarks (k6): End-to-end request latency from client to LLM provider and back, measured at p50/p95/p99 percentiles.
  • Production telemetry (Analytics Engine): Real-world latency from customer traffic, aggregated hourly. This catches edge cases that synthetic benchmarks miss.
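For illustration, the unit-level timing can be approximated with a tiny harness like this (a simplified stand-in for the real Vitest and k6 suites; names are hypothetical):

```typescript
// Micro-benchmark harness: run a function many times and report
// latency percentiles.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.min(idx, sorted.length - 1)];
}

function benchmark(fn: () => void, runs = 1000): { p50: number; p99: number } {
  const times: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    fn();
    times.push(performance.now() - start);
  }
  return { p50: percentile(times, 50), p99: percentile(times, 99) };
}
```

A CI gate can then run something like benchmark(() => scanJailbreak(payload)) against a representative 1KB payload (function name hypothetical) and fail the build if p99 drifts past the budget.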

Performance is not a one-time achievement—it's a continuous commitment. Every new guardrail pattern is benchmarked before merging. Every production deploy includes automated performance regression tests. And we publish latency percentiles publicly on our status page, so you can verify we're meeting our SLA.

What's Next

Inline security at the edge is still evolving. Here's what's on our roadmap for the next 6 months:

  • Hybrid detection: Regex for hot-path blocking, ML models in background workers for async analysis and pattern discovery. Best of both worlds.
  • Streaming response editing: Instead of blocking unsafe responses, redact specific phrases in real-time while preserving the streaming UX.
  • Cross-tenant threat sharing: Anonymized attack patterns detected in one customer's traffic automatically propagate as detection rules to all customers. Collective defense.
  • Adaptive thresholds: Risk scores and confidence thresholds automatically adjust based on historical attack frequency and false positive rates. Self-tuning security.

Inline security is fundamentally different from sidecar analysis. It requires performance discipline, careful algorithm selection, and obsessive attention to latency. But the tradeoff is worth it: we can block attacks before they reach the model, protect data before it leaves your infrastructure, and do it all without compromising user experience.

That's the promise of edge security—and it's available today on NeuronEdge. Read the full guardrails documentation or explore the security API reference to get started.

— The NeuronEdge Engineering Team
