Security configurations are only as good as their testing. Guardrail rules look correct in the dashboard, content policies seem comprehensive on paper, and injection patterns match known attacks. But production traffic is creative, adversarial, and relentless. The only way to know if your security actually works is to attack it yourself ā systematically, automatically, and continuously.
That's where red team scanning comes in. Instead of waiting to discover vulnerabilities in production, you test your defenses with the same techniques attackers use. Every guardrail rule, every confidence threshold, every content policy ā validated against adversarial probes that mimic real-world attacks.
What Is Red Team Scanning?
Red team scanning is automated adversarial testing for AI security. It generates attack probes modeled on real-world techniques, evaluates them against your security configuration, and reports which attacks would succeed and which would be caught.
This isn't a one-time penetration test. It's a continuous validation system that runs on demand or on a schedule, testing your guardrails against a library of 100+ attack templates that we update as new techniques emerge. Each probe is carefully crafted to represent a specific attack vector: jailbreak variants, injection patterns, content policy violations, data exfiltration attempts, and indirect injection scenarios.
The scanning engine generates these probes from templates and evaluates them against your configured guardrail rules. It tests whether your detection patterns catch the attacks, whether your confidence thresholds are set appropriately, and whether your rule coverage is comprehensive. Think of it as unit testing for your security posture.
š”Enterprise Feature
Probe Categories
Every red team scan distributes probes across five attack categories, each targeting a different class of vulnerability. The probe library includes both obvious attacks (testing baseline detection) and sophisticated multi-step techniques (testing edge cases).
Jailbreak Probes
DAN variants, encoding tricks (Base64, ROT13), roleplay bypass scenarios, delimiter injection, payload splitting across multiple messages
Injection Probes
System instruction override attempts, prompt extraction techniques, repetition attacks, context manipulation patterns
Content Policy Probes
Toxic content generation, off-topic prompt steering, policy boundary testing for violence, hate speech, and custom banned topics
Data Exfiltration Probes
System prompt extraction, training data probing attempts, context window dumping, configuration leakage techniques
Indirect Injection Probes
Tool result poisoning, RAG context injection, cross-role instruction planting, supply chain attack scenarios
Each probe template includes the attack payload, the expected guardrail category that should catch it, and the minimum confidence threshold for detection. Probes range from obvious attacks testing baseline coverage to sophisticated multi-step techniques testing detection robustness under adversarial conditions.
Intensity Levels
Choose the scan intensity based on your testing budget and thoroughness requirements. All scans cover every attack category, but with different probe counts and technique sophistication.
50
Light
Quick smoke test, ~2 min
Best for CI/CD gates
200
Standard
Comprehensive coverage, ~8 min
Regular assessment
500+
Thorough
Exhaustive audit, ~20 min
Quarterly security reviews
Light scans sample across all categories but prioritize high-impact attack patterns ā the techniques most likely to succeed in real-world scenarios. Standard scans provide full coverage of all categories with multiple technique variants, testing both common attacks and known bypass methods. Thorough scans include edge cases, multi-step attacks, and obfuscation variants that test detection robustness under maximum adversarial pressure.
Understanding Scan Results
After a scan completes, the results break down into three key metrics: overall block rate, category-specific performance, and severity distribution. These metrics tell you exactly where your security is strong and where it needs improvement.
Overall block rate: What percentage of probes were caught. A 95% block rate means 5% of attack probes would succeed against your current configuration. This is your headline security metric.
Category breakdown: Block rate per attack category. You might block 100% of jailbreak probes but only 60% of indirect injection probes ā showing exactly where to focus improvement efforts. Category-specific metrics reveal whether your guardrail coverage is balanced or if certain attack vectors are under-protected.
Severity distribution: Probes that succeed are classified by impact severity ā Critical, High, Medium, Low ā based on the attack category and the potential consequences of the exploit. A successful jailbreak is typically Critical, while a content policy boundary probe might be Medium.
Scan Results ā Standard (200 probes)
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Overall Block Rate: 87.5% (175/200)
Category Breakdown:
Jailbreak: 92% (46/50)
Injection: 90% (36/40)
Content Policy: 85% (34/40)
Data Exfiltration: 80% (32/40)
Indirect Injection: 70% (21/30)
Weaknesses Found: 3
CRITICAL: Indirect injection via tool results
HIGH: Base64-encoded jailbreak variant
MEDIUM: Content policy gap for medical adviceThis scan shows strong overall security (87.5%) but reveals a critical weakness: indirect injection via tool results only blocks 70% of probes. That's where remediation efforts should focus first.
Weakness Analysis
Categories with high pass rates (low block rates) are flagged as weaknesses. Each weakness includes detailed context to help you understand why attacks succeeded and how to fix the gap.
Every weakness report includes four components:
- The attack category and specific technique that succeeded (e.g., "Indirect Injection via Tool Results" or "Base64-encoded jailbreak variant")
- The current block rate for that category, showing how many probes passed versus how many were caught
- Specific remediation recommendations: which guardrail preset to apply, which custom rule to add, or which confidence threshold to adjust
- Expected improvement estimate based on similar configuration changes in other deployments
Here's a detailed weakness example:
Weakness: Indirect Injection via Tool Results
Severity: CRITICAL
Block Rate: 70% (21/30 probes caught)
Remediation:
1. Apply the "enterprise_security" preset
(adds 15 indirect injection rules)
2. Add custom rule for tool result scanning with pattern:
"ignore|disregard|override.*(?:previous|above|prior).*instructions"
3. Lower indirect injection confidence threshold
from 0.7 to 0.5 for higher sensitivity
Expected improvement: +20% block rate (estimated 90%)Follow the remediation steps, re-run the scan, and confirm the improvement. If the expected improvement isn't achieved, the report includes fallback recommendations for further tightening.
Regression Testing
Security posture changes over time. New guardrail rules improve protection. Threshold adjustments reduce false positives but might increase false negatives. Preset updates add coverage but might conflict with custom rules. Red team scanning lets you track these changes and catch regressions before they reach production.
The regression testing workflow is simple:
- Run a scan and save the results as a baseline
- Make changes to your guardrail configuration (add rules, adjust thresholds)
- Run the scan again and compare results
- If block rates improve across categories, your changes are working. If block rates drop in any category, something regressed
Track security posture over time with the dashboard's scan comparison view. It visualizes scan-over-scan trends: which categories improved, which degraded, and the overall trajectory. You can compare any two scans or view a timeline of all scans for a given policy.
This is especially valuable after guardrail configuration changes, preset updates, or custom rule modifications. A regression in the Jailbreak category after a threshold adjustment tells you exactly what needs to be reverted or fine-tuned.
CI/CD Integration
The most powerful use of red team scanning is as a security gate in your deployment pipeline. Just as you wouldn't deploy code without passing unit tests, you shouldn't deploy AI applications without validating their security posture.
āØSecurity Gate Pattern
Here's a complete GitHub Actions workflow that runs a Light scan and fails the build if the block rate drops below 90%:
security-scan:
runs-on: ubuntu-latest
steps:
- name: Start Red Team Scan
run: |
SCAN_ID=$(curl -s -X POST \
https://api.neuronedge.ai/v1/security/red-team/scans \
-H "Authorization: Bearer ${{ secrets.NEURONEDGE_API_KEY }}" \
-H "Content-Type: application/json" \
-d '{"intensity": "light", "categories": ["jailbreak", "injection"]}' \
| jq -r '.scan_id')
echo "SCAN_ID=$SCAN_ID" >> $GITHUB_ENV
- name: Wait for Completion
run: |
for i in $(seq 1 30); do
STATUS=$(curl -s \
https://api.neuronedge.ai/v1/security/red-team/scans/$SCAN_ID \
-H "Authorization: Bearer ${{ secrets.NEURONEDGE_API_KEY }}" \
| jq -r '.status')
if [ "$STATUS" = "completed" ]; then break; fi
sleep 10
done
- name: Assert Block Rate
run: |
BLOCK_RATE=$(curl -s \
https://api.neuronedge.ai/v1/security/red-team/scans/$SCAN_ID \
-H "Authorization: Bearer ${{ secrets.NEURONEDGE_API_KEY }}" \
| jq -r '.block_rate')
echo "Block rate: $BLOCK_RATE%"
if (( $(echo "$BLOCK_RATE < 90" | bc -l) )); then
echo "Security gate failed: block rate below 90%"
exit 1
fiThis workflow starts a Light scan (50 probes, ~2 minutes), polls for completion, and asserts the block rate meets your minimum threshold. If the security posture has degraded, the build fails, preventing the deployment.
You can customize the intensity level, filter to specific attack categories, and adjust the block rate threshold based on your risk tolerance. Enterprise customers can also configure category-specific thresholds ā for example, requiring 95% block rate for jailbreaks but accepting 85% for content policy probes.
The Feedback Loop
Red team scanning creates a continuous improvement cycle that keeps your security posture strong as threats evolve and your configuration changes. This isn't a one-time audit ā it's an ongoing validation system.
The feedback loop has four stages:
- 1
Scan
Run a red team scan against your current guardrail configuration. Establish a baseline or compare against previous results.
- 2
Analyze
Review weakness reports and identify categories with low block rates. Prioritize Critical and High severity gaps.
- 3
Remediate
Apply recommended presets, add custom rules, adjust confidence thresholds. Follow the remediation guidance from the weakness analysis.
- 4
Verify
Re-scan to confirm improvements and establish a new baseline. Compare results to ensure block rates improved without introducing regressions.
This cycle should run at minimum after every guardrail configuration change. If you add a new custom rule, run a scan to verify it works. If you adjust a confidence threshold to reduce false positives, run a scan to confirm you didn't create false negatives. If you apply a new preset, run a scan to validate the coverage.
Enterprise customers can automate the cycle with scheduled weekly scans. Every Monday at 0200 UTC, NeuronEdge runs a Standard scan and delivers the report via email and Slack. This provides continuous visibility into security posture even when you're not actively making configuration changes ā catching any gradual drift or emerging attack patterns.
Getting Started
Ready to validate your AI security? Follow these four steps to run your first red team scan:
- 1
Enable guardrails
Apply a guardrail preset (start with standard_security for baseline coverage). You need active guardrails before you can test them. View guardrails documentation
- 2
Run a light scan
Start with 50 probes to get a quick baseline of your security posture. This takes about 2 minutes and covers all five attack categories.
- 3
Review weaknesses
Examine which categories need improvement and read remediation recommendations. Prioritize Critical and High severity gaps.
- 4
Remediate and re-scan
Apply fixes and run another scan to verify improvement. Compare results to confirm block rates increased. View red team documentation
Once you've completed your first scan and addressed any Critical weaknesses, establish a regular scanning cadence. Run a Light scan before every production deployment as a security gate. Run a Standard scan weekly to catch configuration drift. Run a Thorough scan quarterly for comprehensive security audits.
Red team scanning turns AI security from a guessing game into a measurable discipline. You know exactly which attacks you're protected against, which weaknesses remain, and how your security posture evolves over time. Attack yourself systematically before adversaries do it for you.
Ready to start scanning? Read the full red team documentation to learn about scan configuration options, webhook integration, and advanced remediation patterns. Or configure your guardrails to start building the security foundation that scanning validates.
ā The NeuronEdge Security Team
NeuronEdge Team
The NeuronEdge team is building the security layer for AI applications, helping enterprises protect sensitive data in every LLM interaction.
Ready to protect your AI workflows?
Start your free trial and see how NeuronEdge can secure your LLM applications in minutes.