Token Reduction Benchmark
Measuring the cost of context: raw SIEM payloads vs. normalized, pre-enriched alert data across 5 real-world security scenarios.
We tested the hypothesis that normalizing and pre-enriching security alerts before they reach an AI agent would reduce input token consumption by at least 50% — without sacrificing investigation quality.
The result: 87–97% fewer input tokens, 76–93% lower cost per alert, and comparable investigation quality — validated across 5 real-world security scenarios from Sentinel, Elastic, and Splunk, tested on both GPT-4o and Claude Sonnet.
On GPT-4o: 87% input token reduction, 75.7% cost reduction. On Claude Sonnet: 97.1% input token reduction, 92.9% cost reduction. Claude's agentic loop accumulates significantly more context per tool call, making the baseline approach even more expensive — and the Calseta advantage even larger.
Investigation quality was independently validated using blind LLM judges (both Claude and GPT-4o as evaluators). The Calseta agent produced consistently more actionable findings and prevented dangerous misclassifications that the baseline agent made without pre-enrichment — while scoring within 1 point of baseline on completeness and accuracy across both judges.
The baseline approach (Approach A) sends raw SIEM JSON directly to the agent alongside tool definitions for VirusTotal and AbuseIPDB. The agent must parse the payload, extract indicators, and call enrichment APIs itself — each tool call adding another round trip of accumulated context.
The Calseta approach (Approach B) sends a normalized alert with pre-extracted, pre-enriched indicators. The agent receives exactly what it needs in a single structured payload. Zero tool calls. One LLM invocation.
Approach A — Baseline Agent: Receives the raw alert JSON from the source SIEM (Sentinel, Elastic, or Splunk) directly in its context window, plus 5 tool definitions for enrichment APIs (VirusTotal IP/hash/domain/URL lookups + AbuseIPDB). The agent must parse the payload, extract IOCs, decide which tools to call, and synthesize findings across multiple agentic loop iterations.
Approach B — Calseta Agent: Calls the Calseta REST API to fetch a normalized alert with all indicators already extracted and enriched. Receives a compact, structured payload with alert description, detection rule documentation, and applicable runbooks. Produces findings in a single LLM call.
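A minimal sketch of the single-call pattern, assuming a simplified payload shape (the field names below are illustrative stand-ins, not the actual Calseta API schema):

```python
# Illustrative single-call pattern (Approach B). Field names are hypothetical.
normalized_alert = {
    "title": "Brute Force from TOR",
    "severity": "High",
    "description": "47 failed sign-ins from a TOR exit node, then 1 success.",
    "indicators": [
        {"type": "ip", "value": "185.220.101.34",
         "enrichment": {"abuseipdb_score": 100, "reports": 2847, "tags": ["TOR proxy"]}},
    ],
    "runbooks": ["Credential-compromise response SOP"],
}

def build_prompt(alert: dict) -> str:
    """Flatten the normalized alert into one structured prompt: no tool
    schemas attached, no follow-up round trips, one LLM invocation."""
    lines = [
        f"ALERT: {alert['title']} (severity: {alert['severity']})",
        alert["description"],
        "INDICATORS (pre-enriched):",
    ]
    for ind in alert["indicators"]:
        lines.append(f"- {ind['type']} {ind['value']}: {ind['enrichment']}")
    lines.append("RUNBOOKS: " + "; ".join(alert["runbooks"]))
    return "\n".join(lines)

prompt = build_prompt(normalized_alert)
# `prompt` is sent in a single LLM call; the agent only analyzes, never fetches.
```

Because all enrichment is inlined, the prompt size is bounded by the alert itself rather than by the number of tool round trips.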
Controlled variables: Both agents use the same model, temperature (0), max output tokens (4,096), and identical source alert payloads (5 synthetic but realistic fixtures). Each scenario runs 3 times per approach per model. All enrichment data is pre-seeded — no live API calls to VirusTotal or AbuseIPDB during the benchmark. Tested on GPT-4o and Claude Sonnet (claude-sonnet-4-20250514).
What we measured: Input tokens, output tokens, total tokens, tool calls, external API calls, wall-clock duration, cost per alert (at published API pricing), and investigation quality (blind LLM judge evaluation on completeness, accuracy, and actionability).
You are a SOC analyst AI agent investigating a security alert.
You have been given the raw alert payload from the source SIEM system.
Your task:
1. Analyze the raw alert JSON to identify the alert type and severity
2. Extract all indicators of compromise (IPs, domains, hashes, URLs, accounts)
3. Use the available tools to enrich each indicator with threat intelligence
4. Synthesize your findings into a structured investigation summary
Your investigation summary MUST include:
- Alert classification and severity assessment
- List of all indicators found with their enrichment results
- Risk assessment for each indicator
- Overall verdict (True Positive / False Positive / Needs Investigation)
- Recommended next steps
Be thorough — check every indicator you find. Do not skip enrichment steps.

The baseline agent receives raw SIEM JSON and must figure out what to do with it. Each tool call adds another round trip — and another copy of the full conversation history.
You are a SOC analyst AI agent investigating a security alert.
You have been given a pre-structured alert payload from Calseta
that includes:
- The normalized alert with clean field names and description
- All indicators of compromise, already extracted and enriched
- Detection rule documentation explaining what triggered the alert
- Applicable runbooks and SOPs for handling this alert type
Your task: analyze the pre-structured data and produce an
investigation summary.
Your investigation summary MUST include:
- Alert classification and severity assessment
- Analysis of each enriched indicator and its risk level
- Overall verdict (True Positive / False Positive / Needs Investigation)
- Recommended next steps
CRITICAL RULES:
- All enrichment has already been done for you — do NOT request
additional lookups.
- ONLY use data explicitly provided below. Do NOT invent, fabricate,
or assume enrichment results, threat scores, or intelligence data
that is not present in the provided context.
- If an indicator has no enrichment data or a 'Pending' malice
verdict, state that explicitly — do NOT fill in fictional values.
- Pay close attention to the alert description — it contains key
contextual details about the attack pattern, timing, and scope.
- Focus on analysis and synthesis of the provided data only.

The Calseta agent receives everything it needs upfront. No tools, no multi-turn loops — just analysis. Anti-hallucination rules ensure the agent only uses provided data.
[
{
"name": "lookup_ip_virustotal",
"description": "Look up an IP address on VirusTotal for reputation data, malicious detections, ASN info, and last analysis results."
},
{
"name": "lookup_hash_virustotal",
"description": "Look up a file hash (MD5, SHA1, SHA256) on VirusTotal for malware detection results, file metadata, and threat classification."
},
{
"name": "lookup_domain_virustotal",
"description": "Look up a domain on VirusTotal for reputation, DNS records, WHOIS info, and malicious detections."
},
{
"name": "lookup_url_virustotal",
"description": "Look up a URL on VirusTotal for scan results, redirects, and malicious detections."
},
{
"name": "lookup_ip_abuseipdb",
"description": "Look up an IP on AbuseIPDB for abuse confidence score, report count, country, ISP, and usage type."
}
]

Each tool definition with its full parameter schema consumes tokens on every LLM call — even when the tool isn't used. The Calseta agent needs zero tools.
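A rough way to see that fixed overhead, using the chars/4 rule of thumb rather than a real tokenizer (the abbreviated descriptions below are illustrative, and production schemas with full parameter definitions are larger still):

```python
import json

# Crude estimate of the token overhead the five tool definitions add to every
# baseline LLM call. chars/4 approximates a tokenizer; real counts differ.
tool_descriptions = {
    "lookup_ip_virustotal": "IP reputation, detections, ASN, last analysis",
    "lookup_hash_virustotal": "hash detections, file metadata, classification",
    "lookup_domain_virustotal": "domain reputation, DNS, WHOIS, detections",
    "lookup_url_virustotal": "URL scan results, redirects, detections",
    "lookup_ip_abuseipdb": "abuse confidence, reports, country, ISP, usage",
}
schema_json = json.dumps(
    [{"name": n, "description": d} for n, d in tool_descriptions.items()]
)
approx_tokens = len(schema_json) // 4
print(f"~{approx_tokens} tokens resent on every call, used or not")
```

This overhead is paid on every turn of the agentic loop, which is why eliminating tools entirely compounds the savings.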
Five scenarios across three SIEM sources, covering a range of indicator types and enrichment patterns:
1. Brute Force from TOR (Sentinel) — 47 failed sign-ins from a TOR exit node followed by 1 successful auth. Indicators: IP, account, domain.
2. Known Malware Hash (Elastic) — Emotet banking trojan executed via Outlook on a workstation. Indicators: SHA-256 hash, IP, account, email.
3. Anomalous Data Transfer (Splunk) — 2GB+ exfiltration from a file server to an external IP via a service account. Indicators: 2 IPs, domain, URL, account.
4. Impossible Travel (Sentinel) — Global Admin authenticates from New York, then Moscow 32 minutes later. Indicators: account, 2 IPs, domain, URL.
5. Suspicious PowerShell (Elastic) — Encoded PowerShell on a domain controller bypassing execution policy, downloading from a C2 domain. Indicators: domain, IP, URL, hash, DNS query.
Input tokens per scenario (GPT-4o):
| Scenario | Baseline | Calseta | Reduction |
|---|---|---|---|
| Brute Force from TOR | 29,470 | 2,263 | 92.3% |
| Known Malware Hash | 7,541 | 3,139 | 58.4% |
| Anomalous Data Transfer | 15,724 | 2,572 | 83.6% |
| Impossible Travel | 22,548 | 2,689 | 88.1% |
| Suspicious PowerShell | 25,807 | 2,615 | 89.9% |
| Overall Average | 20,218 | 2,656 | 86.9% |
Input tokens per scenario (Claude Sonnet):
| Scenario | Baseline | Calseta | Reduction |
|---|---|---|---|
| Brute Force from TOR | 90,550 | 2,604 | 97.1% |
| Known Malware Hash | 245,361 | 3,559 | 98.5% |
| Anomalous Data Transfer | 68,826 | 2,941 | 95.7% |
| Impossible Travel | 92,699 | 3,080 | 96.7% |
| Suspicious PowerShell | 31,101 | 2,989 | 90.4% |
| Overall Average | 105,707 | 3,035 | 97.1% |
Claude Sonnet's baseline agent consumed 5.2x more input tokens than GPT-4o's (106K vs 20K per alert) while the Calseta agent stayed nearly identical across models (~2,700–3,000 tokens). Claude's agentic loop accumulates more context per tool call iteration — making the baseline approach disproportionately expensive on Claude, and the Calseta advantage even larger.
The Calseta agent cost is model-independent because there are no tool calls and no multi-turn loops. The entire investigation happens in a single LLM invocation with a compact, structured payload.
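A back-of-the-envelope sketch of why the multi-turn baseline gets expensive; all token sizes below are hypothetical round numbers for illustration, not measured values:

```python
# Every tool-call round trip resends the entire conversation so far, so
# cumulative input tokens grow quadratically with the number of turns.
system_and_tools = 1_500   # system prompt + 5 tool schemas, attached to every call
raw_alert = 6_000          # raw SIEM JSON in the first user message
tool_result = 2_500        # each enrichment response appended to the history

history = system_and_tools + raw_alert
total_input = 0
for _ in range(5):         # roughly the observed 4-5 tool-call round trips
    total_input += history # the whole history counts as input tokens again
    history += tool_result
total_input += history     # final synthesis call

single_call = 2_700        # Calseta: one compact payload, one invocation
print(total_input, single_call)
```

Even with these modest assumed sizes, the cumulative input is tens of thousands of tokens against a few thousand for the single-call path, which matches the shape of the measured gap.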
Aggregate results (GPT-4o):
| Metric | Baseline | Calseta | Reduction |
|---|---|---|---|
| Avg Input Tokens | 20,218 | 2,656 | 86.9% |
| Avg Output Tokens | 802 | 759 | 5.4% |
| Avg Total Tokens | 21,020 | 3,414 | 83.8% |
| Avg Tool Calls | 4.1 | 0 | 100% |
| Avg Cost (USD) | $0.0586 | $0.0142 | 75.7% |
Aggregate results (Claude Sonnet):
| Metric | Baseline | Calseta | Reduction |
|---|---|---|---|
| Avg Input Tokens | 105,707 | 3,035 | 97.1% |
| Avg Output Tokens | 1,797 | 1,032 | 42.6% |
| Avg Total Tokens | 107,505 | 4,067 | 96.2% |
| Avg Tool Calls | 5.1 | 0 | 100% |
| Avg Cost (USD) | $0.3441 | $0.0246 | 92.9% |
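The reported averages can be cross-checked against per-million-token list prices. The prices below are our assumptions (GPT-4o at $2.50/M input and $10/M output; Claude Sonnet at $3/M input and $15/M output); they reproduce the reported costs to the cent:

```python
# Cross-check: cost per alert from token counts and assumed list prices.
def cost_per_alert(in_tok: int, out_tok: int, in_price: float, out_price: float) -> float:
    """Cost in USD given token counts and prices per million tokens."""
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

print(f"{cost_per_alert(20_218, 802, 2.50, 10.0):.4f}")    # GPT-4o baseline, reported $0.0586
print(f"{cost_per_alert(2_656, 759, 2.50, 10.0):.4f}")     # GPT-4o Calseta,  reported $0.0142
print(f"{cost_per_alert(105_707, 1_797, 3.0, 15.0):.4f}")  # Claude baseline, reported $0.3441
print(f"{cost_per_alert(3_035, 1_032, 3.0, 15.0):.4f}")    # Claude Calseta,  reported $0.0246
```

The cost reduction is almost entirely driven by input tokens, which is why the percentage gap tracks the input-token gap so closely.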
Projected monthly cost (GPT-4o):
| Volume | Baseline | Calseta | Savings |
|---|---|---|---|
| 1 alert/day | $1.76 | $0.43 | $1.33/mo |
| 10 alerts/day | $17.57 | $4.27 | $13.30/mo |
| 100 alerts/day | $175.70 | $42.68 | $133/mo |
| 1,000 alerts/day | $1,757 | $427 | $1,330/mo |
Projected monthly cost (Claude Sonnet):
| Volume | Baseline | Calseta | Savings |
|---|---|---|---|
| 1 alert/day | $10.32 | $0.74 | $9.58/mo |
| 10 alerts/day | $103.22 | $7.38 | $95.84/mo |
| 100 alerts/day | $1,032 | $73.77 | $958/mo |
| 1,000 alerts/day | $10,322 | $738 | $9,584/mo |
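The monthly figures follow directly from the per-alert averages (30-day month assumed; the tables appear to use unrounded per-alert costs, so some cells differ by a cent or a dollar):

```python
# Reproducing the monthly projections from the rounded per-alert averages.
PER_ALERT = {
    ("gpt4o", "baseline"): 0.0586, ("gpt4o", "calseta"): 0.0142,
    ("claude", "baseline"): 0.3441, ("claude", "calseta"): 0.0246,
}

def monthly_cost(model: str, approach: str, alerts_per_day: int, days: int = 30) -> float:
    return PER_ALERT[(model, approach)] * alerts_per_day * days

for volume in (1, 10, 100, 1000):
    saved = (monthly_cost("claude", "baseline", volume)
             - monthly_cost("claude", "calseta", volume))
    print(f"{volume:>5} alerts/day on Claude Sonnet: save ${saved:,.2f}/mo")
```

At 1,000 alerts/day the projection is linear in volume, so any per-alert saving scales directly with queue size.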
Assumptions: engineering cost at $150/hr (adjustable), 17.5 hrs per provider for custom integration vs 0.5 hrs with Calseta. Token estimates derived from observed benchmark data. See full methodology in the case study.
Side-by-side walkthrough of Scenario 1 (Brute Force from TOR):
| Step | Baseline | Calseta |
|---|---|---|
| Classification | Parsed from raw Sentinel JSON — agent had to identify alert type from nested payload structure | Directly from normalized fields: title, severity (High), tags [TOR, BruteForce, IdentityThreat] |
| Indicator extraction | Agent extracted IPs, accounts from raw JSON — required understanding Sentinel schema | Pre-extracted: IP (185.220.101.34), account (j.martinez@contoso.com), domain — with malice verdicts |
| Enrichment | 2 tool calls (VirusTotal IP + AbuseIPDB) — raw API responses in context window | Pre-enriched: AbuseIPDB score 100, 2,847 reports, TOR proxy tag — extracted fields only, no raw response |
| Verdict | True Positive — correct | True Positive — correct, with detection rule documentation cited |
| Supporting context | None — agent works from raw data only | Detection rule docs + applicable runbooks/SOPs included in payload |
Token reduction only matters if investigation quality holds. We ran a blind evaluation of all 60 findings using independent LLM judges — both Claude Sonnet and GPT-4o evaluated every finding without knowing which approach produced it.
Scoring dimensions (each 0–10):
Completeness — Did the finding identify all indicators of compromise and relevant context?
Accuracy — Were the conclusions and risk assessments correct? Were there false claims?
Actionability — Were the recommendations specific, prioritized, and operationally useful?
Each finding was scored against hand-crafted ground truth: expected indicators and expected conclusions per scenario. Findings were randomized before evaluation so the judge could not distinguish which approach produced them.
Two independent judges (Claude Sonnet and GPT-4o) evaluated all 60 findings to cross-validate results and eliminate single-judge bias.
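A sketch of the blinding step under assumed field names; the actual `evaluate_findings.py` implementation may differ:

```python
import random

# Blind evaluation: strip the approach label, shuffle, judge, then re-join
# scores by id only after judging completes. Field names are illustrative.
findings = [
    {"id": i, "approach": "baseline" if i % 2 == 0 else "calseta",
     "text": f"finding {i}"}
    for i in range(60)   # 5 scenarios x 3 runs x 2 approaches x 2 models
]
answer_key = {f["id"]: f["approach"] for f in findings}

blinded = [{"id": f["id"], "text": f["text"]} for f in findings]
random.Random(42).shuffle(blinded)   # fixed seed keeps the run reproducible

# Each blinded finding is scored by the judge model; approach labels are
# looked up via answer_key[finding["id"]] only when aggregating results.
assert all("approach" not in f for f in blinded)
```

Randomizing order and withholding the approach label prevents the judge from anchoring on presentation patterns tied to either agent.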
Claude Sonnet as judge:
| Metric | Baseline | Calseta |
|---|---|---|
| Claude Agent — Completeness | 8.5 | 8.2 |
| Claude Agent — Accuracy | 7.7 | 7.7 |
| Claude Agent — Actionability | 8.5 | 9.2 |
| Claude Agent — Overall | 8.2 | 8.4 |
| GPT-4o Agent — Completeness | 8.3 | 8.3 |
| GPT-4o Agent — Accuracy | 6.6 | 7.1 |
| GPT-4o Agent — Actionability | 7.3 | 8.9 |
| GPT-4o Agent — Overall | 7.4 | 8.1 |
GPT-4o as judge:
| Metric | Baseline | Calseta |
|---|---|---|
| Claude Agent — Completeness | 8.3 | 7.7 |
| Claude Agent — Accuracy | 7.7 | 6.9 |
| Claude Agent — Actionability | 9.0 | 8.7 |
| Claude Agent — Overall | 8.3 | 7.8 |
| GPT-4o Agent — Completeness | 8.0 | 7.6 |
| GPT-4o Agent — Accuracy | 7.5 | 6.5 |
| GPT-4o Agent — Actionability | 8.7 | 8.7 |
| GPT-4o Agent — Overall | 8.1 | 7.6 |
Calseta produces more actionable findings with comparable overall quality — while using 87–97% fewer tokens.
Actionability is consistently higher with Calseta. Structured, pre-enriched data produces more specific and operationally useful recommendations. The Claude judge scored Calseta actionability at 9.2 vs 8.5 for baseline (Claude agent) and 8.9 vs 7.3 (GPT-4o agent). Agents receiving organized context give more organized advice.
Completeness and accuracy are within range. The baseline agent scores slightly higher on completeness in some evaluations — expected, since it has access to verbose raw payloads with more surface-level detail. Accuracy scores are comparable, with the Calseta agent's pre-enrichment preventing the dangerous misclassifications that hurt baseline accuracy on critical scenarios like malware detection. Across both judges, overall quality differences are less than 1 point on a 10-point scale.
Two independent judges provide a balanced view. The Claude judge favored Calseta overall (8.4 vs 8.2 on Claude agent, 8.1 vs 7.4 on GPT-4o agent), while the GPT-4o judge favored the baseline (8.3 vs 7.8 on Claude agent, 8.1 vs 7.6 on GPT-4o agent). We present both to let readers draw their own conclusions. The consistent finding across both judges: Calseta's structured input produces more actionable recommendations, and the overall quality tradeoff is narrow.
Pre-enrichment prevents dangerous misclassification. In the Known Malware Hash scenario (Emotet banking trojan), the baseline Claude agent consistently misidentified active malware as a false positive — scoring 1.7/10 on accuracy. Without pre-enrichment from VirusTotal confirming the hash as Emotet, the agent lacked the threat intelligence to make a correct assessment. The Calseta agent, with pre-enriched hash data, correctly identified the malware every time (accuracy: 9.0/10). This is the strongest argument for pre-enrichment: it's not just about cost reduction — it's about investigation safety.
| Metric | Baseline | Calseta |
|---|---|---|
| Tool definitions & API integration | 40–80 hrs | 0 hrs |
| Enrichment pipeline (rate limits, caching, retry) | Included above | 0 hrs (platform handles) |
| Prompt engineering for raw payloads | 10–20 hrs | 0 hrs |
| Agent integration with Calseta API | N/A | 1–2 hrs |
| Total Estimated | 40–80 hrs | 1–2 hrs |
Honest limitations of this benchmark:
- Synthetic fixtures — Alert payloads are realistic but not from production environments. Production alerts may have different payload sizes, nesting depth, and field coverage.
- Two models tested — GPT-4o and Claude Sonnet. Results may vary on other models, especially smaller or open-source models with different context window behaviors.
- Three runs per scenario — Sufficient for consistency given temperature=0, but not a statistically large sample. Variance analysis would benefit from more runs.
- LLM cost only — Does not include Calseta platform hosting costs (self-hosted, so variable), enrichment API subscription fees, or infrastructure overhead.
- Single-alert investigation — Does not test multi-alert correlation, incident-level analysis, or alert triage across a queue of mixed-severity alerts.
- Pre-seeded enrichment — Enrichment data is mock/pre-seeded, not from live API calls. Real-world enrichment latency (seconds to minutes) is not captured in the benchmark timing.
- LLM-as-judge evaluation — Quality scores use LLM judges against hand-crafted ground truth, not human SOC analyst evaluation. Two independent judges (Claude and GPT-4o) cross-validate results, but human expert review would further strengthen confidence.
- Temperature=0 is not fully deterministic — While temperature=0 maximizes reproducibility, LLM outputs are not guaranteed identical across runs due to batching and infrastructure differences.
- Baseline agent is intentionally unoptimized — The baseline agent uses a straightforward implementation without advanced prompt engineering, RAG, or caching. A production-optimized DIY agent could narrow the gap on token usage.
We publish these limitations because transparency matters more than marketing. The methodology, prompts, fixtures, and raw data are all open source — verify the results yourself.
The entire benchmark is open source and reproducible:
1. Clone the Calseta repository
2. Start the platform with `make lab`
3. Run `python examples/case_study/run_study.py --ingest` to load fixtures
4. Run `python examples/case_study/run_study.py --run --models all` to execute the benchmark
5. Run `python examples/case_study/evaluate_findings.py` to evaluate quality
6. Results land in `examples/case_study/results/`
You need API keys for the LLM provider(s) you want to test. Enrichment data is pre-seeded — no VirusTotal or AbuseIPDB keys required for the Calseta agent path.