Token Reduction Benchmark
Measuring the cost of context: raw SIEM payloads vs. normalized, pre-enriched alert data across 5 real-world security scenarios.
We tested the hypothesis that normalizing and pre-enriching security alerts before they reach an AI agent would reduce input token consumption by at least 50% — without sacrificing investigation quality.
The result: 87–97% fewer input tokens, 76–93% lower cost per alert, and comparable investigation quality — validated across 5 real-world security scenarios from Sentinel, Elastic, and Splunk, tested on both GPT-4o and Claude Sonnet.
On GPT-4o: 87% input token reduction, 75.7% cost reduction. On Claude Sonnet: 97.1% input token reduction, 92.9% cost reduction. Claude's agentic loop accumulates significantly more context per tool call, making the baseline approach even more expensive — and the Calseta advantage even larger.
Investigation quality was independently validated using blind LLM judges (both Claude and GPT-4o as evaluators). The Calseta agent produced consistently more actionable findings and prevented dangerous misclassifications that the baseline agent made without pre-enrichment — while scoring within 1 point of baseline on completeness and accuracy across both judges.
The baseline approach (Approach A) sends raw SIEM JSON directly to the agent alongside tool definitions for VirusTotal and AbuseIPDB. The agent must parse the payload, extract indicators, and call enrichment APIs itself — each tool call adding another round trip of accumulated context.
The Calseta approach (Approach B) sends a normalized alert with pre-extracted, pre-enriched indicators. The agent receives exactly what it needs in a single structured payload. Zero tool calls. One LLM invocation.
Approach A — Baseline Agent: Receives the raw alert JSON from the source SIEM (Sentinel, Elastic, or Splunk) directly in its context window, plus 5 tool definitions for enrichment APIs (VirusTotal IP/hash/domain/URL lookups + AbuseIPDB). The agent must parse the payload, extract IOCs, decide which tools to call, and synthesize findings across multiple agentic loop iterations.
Approach B — Calseta Agent: Calls the Calseta REST API to fetch a normalized alert with all indicators already extracted and enriched. Receives a compact, structured payload with alert description, detection rule documentation, and applicable runbooks. Produces findings in a single LLM call.
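A minimal sketch of the single-call pattern, assuming a simplified payload shape (the field names below are illustrative stand-ins, not the actual Calseta API schema):

```python
# Illustrative single-call pattern (Approach B). Field names are hypothetical.
normalized_alert = {
    "title": "Brute Force from TOR",
    "severity": "High",
    "description": "47 failed sign-ins from a TOR exit node, then 1 success.",
    "indicators": [
        {"type": "ip", "value": "185.220.101.34",
         "enrichment": {"abuseipdb_score": 100, "reports": 2847, "tags": ["TOR proxy"]}},
    ],
    "runbooks": ["Credential-compromise response SOP"],
}

def build_prompt(alert: dict) -> str:
    """Flatten the normalized alert into one structured prompt: no tool
    schemas attached, no follow-up round trips, one LLM invocation."""
    lines = [
        f"ALERT: {alert['title']} (severity: {alert['severity']})",
        alert["description"],
        "INDICATORS (pre-enriched):",
    ]
    for ind in alert["indicators"]:
        lines.append(f"- {ind['type']} {ind['value']}: {ind['enrichment']}")
    lines.append("RUNBOOKS: " + "; ".join(alert["runbooks"]))
    return "\n".join(lines)

prompt = build_prompt(normalized_alert)
# `prompt` is sent in a single LLM call; the agent only analyzes, never fetches.
```

Because all enrichment is inlined, the prompt size is bounded by the alert itself rather than by the number of tool round trips.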
Controlled variables: Both agents use the same model, temperature (0), max output tokens (4,096), and identical source alert payloads (5 synthetic but realistic fixtures). Each scenario runs 3 times per approach per model. All enrichment data is pre-seeded — no live API calls to VirusTotal or AbuseIPDB during the benchmark. Tested on GPT-4o and Claude Sonnet (claude-sonnet-4-20250514).
What we measured: Input tokens, output tokens, total tokens, tool calls, external API calls, wall-clock duration, cost per alert (at published API pricing), and investigation quality (blind LLM judge evaluation on completeness, accuracy, and actionability).
You are a SOC analyst AI agent investigating a security alert.
You have been given the raw alert payload from the source SIEM system.
Your task:
1. Analyze the raw alert JSON to identify the alert type and severity
2. Extract all indicators of compromise (IPs, domains, hashes, URLs, accounts)
3. Use the available tools to enrich each indicator with threat intelligence
4. Synthesize your findings into a structured investigation summary
Your investigation summary MUST include:
- Alert classification and severity assessment
- List of all indicators found with their enrichment results
- Risk assessment for each indicator
- Overall verdict (True Positive / False Positive / Needs Investigation)
- Recommended next steps
Be thorough — check every indicator you find. Do not skip enrichment steps.

The baseline agent receives raw SIEM JSON and must figure out what to do with it. Each tool call adds another round trip — and another copy of the full conversation history.
You are a SOC analyst AI agent investigating a security alert.
You have been given a pre-structured alert payload from Calseta
that includes:
- The normalized alert with clean field names and description
- All indicators of compromise, already extracted and enriched
- Detection rule documentation explaining what triggered the alert
- Applicable runbooks and SOPs for handling this alert type
Your task: analyze the pre-structured data and produce an
investigation summary.
Your investigation summary MUST include:
- Alert classification and severity assessment
- Analysis of each enriched indicator and its risk level
- Overall verdict (True Positive / False Positive / Needs Investigation)
- Recommended next steps
CRITICAL RULES:
- All enrichment has already been done for you — do NOT request
additional lookups.
- ONLY use data explicitly provided below. Do NOT invent, fabricate,
or assume enrichment results, threat scores, or intelligence data
that is not present in the provided context.
- If an indicator has no enrichment data or a 'Pending' malice
verdict, state that explicitly — do NOT fill in fictional values.
- Pay close attention to the alert description — it contains key
contextual details about the attack pattern, timing, and scope.
- Focus on analysis and synthesis of the provided data only.

The Calseta agent receives everything it needs upfront. No tools, no multi-turn loops — just analysis. Anti-hallucination rules ensure the agent only uses provided data.
[
{
"name": "lookup_ip_virustotal",
"description": "Look up an IP address on VirusTotal for reputation data, malicious detections, ASN info, and last analysis results."
},
{
"name": "lookup_hash_virustotal",
"description": "Look up a file hash (MD5, SHA1, SHA256) on VirusTotal for malware detection results, file metadata, and threat classification."
},
{
"name": "lookup_domain_virustotal",
"description": "Look up a domain on VirusTotal for reputation, DNS records, WHOIS info, and malicious detections."
},
{
"name": "lookup_url_virustotal",
"description": "Look up a URL on VirusTotal for scan results, redirects, and malicious detections."
},
{
"name": "lookup_ip_abuseipdb",
"description": "Look up an IP on AbuseIPDB for abuse confidence score, report count, country, ISP, and usage type."
}
]

Each tool definition with its full parameter schema consumes tokens on every LLM call — even when the tool isn't used. The Calseta agent needs zero tools.
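A rough way to see that fixed overhead, using the chars/4 rule of thumb rather than a real tokenizer (the abbreviated descriptions below are illustrative, and production schemas with full parameter definitions are larger still):

```python
import json

# Crude estimate of the token overhead the five tool definitions add to every
# baseline LLM call. chars/4 approximates a tokenizer; real counts differ.
tool_descriptions = {
    "lookup_ip_virustotal": "IP reputation, detections, ASN, last analysis",
    "lookup_hash_virustotal": "hash detections, file metadata, classification",
    "lookup_domain_virustotal": "domain reputation, DNS, WHOIS, detections",
    "lookup_url_virustotal": "URL scan results, redirects, detections",
    "lookup_ip_abuseipdb": "abuse confidence, reports, country, ISP, usage",
}
schema_json = json.dumps(
    [{"name": n, "description": d} for n, d in tool_descriptions.items()]
)
approx_tokens = len(schema_json) // 4
print(f"~{approx_tokens} tokens resent on every call, used or not")
```

This overhead is paid on every turn of the agentic loop, which is why eliminating tools entirely compounds the savings.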
Five scenarios across three SIEM sources, covering a range of indicator types and enrichment patterns:
1. Brute Force from TOR (Sentinel) — 47 failed sign-ins from a TOR exit node followed by 1 successful auth. Indicators: IP, account, domain.
2. Known Malware Hash (Elastic) — Emotet banking trojan executed via Outlook on a workstation. Indicators: SHA-256 hash, IP, account, email.
3. Anomalous Data Transfer (Splunk) — 2GB+ exfiltration from a file server to an external IP via a service account. Indicators: 2 IPs, domain, URL, account.
4. Impossible Travel (Sentinel) — Global Admin authenticates from New York, then Moscow 32 minutes later. Indicators: account, 2 IPs, domain, URL.
5. Suspicious PowerShell (Elastic) — Encoded PowerShell on a domain controller bypassing execution policy, downloading from a C2 domain. Indicators: domain, IP, URL, hash, DNS query.
Input tokens per scenario (GPT-4o):
| Scenario | Baseline | Calseta | Reduction |
|---|---|---|---|
| Brute Force from TOR | 29,470 | 2,263 | 92.3% |
| Known Malware Hash | 7,541 | 3,139 | 58.4% |
| Anomalous Data Transfer | 15,724 | 2,572 | 83.6% |
| Impossible Travel | 22,548 | 2,689 | 88.1% |
| Suspicious PowerShell | 25,807 | 2,615 | 89.9% |
| Overall Average | 20,218 | 2,656 | 86.9% |
Input tokens per scenario (Claude Sonnet):
| Scenario | Baseline | Calseta | Reduction |
|---|---|---|---|
| Brute Force from TOR | 90,550 | 2,604 | 97.1% |
| Known Malware Hash | 245,361 | 3,559 | 98.5% |
| Anomalous Data Transfer | 68,826 | 2,941 | 95.7% |
| Impossible Travel | 92,699 | 3,080 | 96.7% |
| Suspicious PowerShell | 31,101 | 2,989 | 90.4% |
| Overall Average | 105,707 | 3,035 | 97.1% |
Claude Sonnet's baseline agent consumed 5.2x more input tokens than GPT-4o's (106K vs 20K per alert) while the Calseta agent stayed nearly identical across models (~2,700–3,000 tokens). Claude's agentic loop accumulates more context per tool call iteration — making the baseline approach disproportionately expensive on Claude, and the Calseta advantage even larger.
The Calseta agent cost is model-independent because there are no tool calls and no multi-turn loops. The entire investigation happens in a single LLM invocation with a compact, structured payload.
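A back-of-the-envelope sketch of why the multi-turn baseline gets expensive; all token sizes below are hypothetical round numbers for illustration, not measured values:

```python
# Every tool-call round trip resends the entire conversation so far, so
# cumulative input tokens grow quadratically with the number of turns.
system_and_tools = 1_500   # system prompt + 5 tool schemas, attached to every call
raw_alert = 6_000          # raw SIEM JSON in the first user message
tool_result = 2_500        # each enrichment response appended to the history

history = system_and_tools + raw_alert
total_input = 0
for _ in range(5):         # roughly the observed 4-5 tool-call round trips
    total_input += history # the whole history counts as input tokens again
    history += tool_result
total_input += history     # final synthesis call

single_call = 2_700        # Calseta: one compact payload, one invocation
print(total_input, single_call)
```

Even with these modest assumed sizes, the cumulative input is tens of thousands of tokens against a few thousand for the single-call path, which matches the shape of the measured gap.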
Aggregate results (GPT-4o):
| Metric | Baseline | Calseta | Reduction |
|---|---|---|---|
| Avg Input Tokens | 20,218 | 2,656 | 86.9% |
| Avg Output Tokens | 802 | 759 | 5.4% |
| Avg Total Tokens | 21,020 | 3,414 | 83.8% |
| Avg Tool Calls | 4.1 | 0 | 100% |
| Avg Cost (USD) | $0.0586 | $0.0142 | 75.7% |
Aggregate results (Claude Sonnet):
| Metric | Baseline | Calseta | Reduction |
|---|---|---|---|
| Avg Input Tokens | 105,707 | 3,035 | 97.1% |
| Avg Output Tokens | 1,797 | 1,032 | 42.6% |
| Avg Total Tokens | 107,505 | 4,067 | 96.2% |
| Avg Tool Calls | 5.1 | 0 | 100% |
| Avg Cost (USD) | $0.3441 | $0.0246 | 92.9% |
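The reported averages can be cross-checked against per-million-token list prices. The prices below are our assumptions (GPT-4o at $2.50/M input and $10/M output; Claude Sonnet at $3/M input and $15/M output); they reproduce the reported costs to the cent:

```python
# Cross-check: cost per alert from token counts and assumed list prices.
def cost_per_alert(in_tok: int, out_tok: int, in_price: float, out_price: float) -> float:
    """Cost in USD given token counts and prices per million tokens."""
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

print(f"{cost_per_alert(20_218, 802, 2.50, 10.0):.4f}")    # GPT-4o baseline, reported $0.0586
print(f"{cost_per_alert(2_656, 759, 2.50, 10.0):.4f}")     # GPT-4o Calseta,  reported $0.0142
print(f"{cost_per_alert(105_707, 1_797, 3.0, 15.0):.4f}")  # Claude baseline, reported $0.3441
print(f"{cost_per_alert(3_035, 1_032, 3.0, 15.0):.4f}")    # Claude Calseta,  reported $0.0246
```

The cost reduction is almost entirely driven by input tokens, which is why the percentage gap tracks the input-token gap so closely.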
Projected monthly cost (GPT-4o):
| Volume | Baseline | Calseta | Savings |
|---|---|---|---|
| 1 alert/day | $1.76 | $0.43 | $1.33/mo |
| 10 alerts/day | $17.57 | $4.27 | $13.30/mo |
| 100 alerts/day | $175.70 | $42.68 | $133/mo |
| 1,000 alerts/day | $1,757 | $427 | $1,330/mo |
Projected monthly cost (Claude Sonnet):
| Volume | Baseline | Calseta | Savings |
|---|---|---|---|
| 1 alert/day | $10.32 | $0.74 | $9.58/mo |
| 10 alerts/day | $103.22 | $7.38 | $95.84/mo |
| 100 alerts/day | $1,032 | $73.77 | $958/mo |
| 1,000 alerts/day | $10,322 | $738 | $9,584/mo |
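The monthly figures follow directly from the per-alert averages (30-day month assumed; the tables appear to use unrounded per-alert costs, so some cells differ by a cent or a dollar):

```python
# Reproducing the monthly projections from the rounded per-alert averages.
PER_ALERT = {
    ("gpt4o", "baseline"): 0.0586, ("gpt4o", "calseta"): 0.0142,
    ("claude", "baseline"): 0.3441, ("claude", "calseta"): 0.0246,
}

def monthly_cost(model: str, approach: str, alerts_per_day: int, days: int = 30) -> float:
    return PER_ALERT[(model, approach)] * alerts_per_day * days

for volume in (1, 10, 100, 1000):
    saved = (monthly_cost("claude", "baseline", volume)
             - monthly_cost("claude", "calseta", volume))
    print(f"{volume:>5} alerts/day on Claude Sonnet: save ${saved:,.2f}/mo")
```

At 1,000 alerts/day the projection is linear in volume, so any per-alert saving scales directly with queue size.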
Assumptions: engineering cost at $150/hr (adjustable), 17.5 hrs per provider for custom integration vs 0.5 hrs with Calseta. Token estimates derived from observed benchmark data. See full methodology in the case study.
Side-by-side walkthrough of Scenario 1 (Brute Force from TOR):
| Step | Baseline | Calseta |
|---|---|---|
| Classification | Parsed from raw Sentinel JSON — agent had to identify alert type from nested payload structure | Directly from normalized fields: title, severity (High), tags [TOR, BruteForce, IdentityThreat] |
| Indicator extraction | Agent extracted IPs, accounts from raw JSON — required understanding Sentinel schema | Pre-extracted: IP (185.220.101.34), account (j.martinez@contoso.com), domain — with malice verdicts |
| Enrichment | 2 tool calls (VirusTotal IP + AbuseIPDB) — raw API responses in context window | Pre-enriched: AbuseIPDB score 100, 2,847 reports, TOR proxy tag — extracted fields only, no raw response |
| Verdict | True Positive — correct | True Positive — correct, with detection rule documentation cited |
| Supporting context | None — agent works from raw data only | Detection rule docs + applicable runbooks/SOPs included in payload |
Token reduction only matters if investigation quality holds. We ran a blind evaluation of all 60 findings using independent LLM judges — both Claude Sonnet and GPT-4o evaluated every finding without knowing which approach produced it.
Scoring dimensions (each 0–10):
Completeness — Did the finding identify all indicators of compromise and relevant context?
Accuracy — Were the conclusions and risk assessments correct? Were there false claims?
Actionability — Were the recommendations specific, prioritized, and operationally useful?
Each finding was scored against hand-crafted ground truth: expected indicators and expected conclusions per scenario. Findings were randomized before evaluation so the judge could not distinguish which approach produced them.
Two independent judges (Claude Sonnet and GPT-4o) evaluated all 60 findings to cross-validate results and eliminate single-judge bias.
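A sketch of the blinding step under assumed field names; the actual `evaluate_findings.py` implementation may differ:

```python
import random

# Blind evaluation: strip the approach label, shuffle, judge, then re-join
# scores by id only after judging completes. Field names are illustrative.
findings = [
    {"id": i, "approach": "baseline" if i % 2 == 0 else "calseta",
     "text": f"finding {i}"}
    for i in range(60)   # 5 scenarios x 3 runs x 2 approaches x 2 models
]
answer_key = {f["id"]: f["approach"] for f in findings}

blinded = [{"id": f["id"], "text": f["text"]} for f in findings]
random.Random(42).shuffle(blinded)   # fixed seed keeps the run reproducible

# Each blinded finding is scored by the judge model; approach labels are
# looked up via answer_key[finding["id"]] only when aggregating results.
assert all("approach" not in f for f in blinded)
```

Randomizing order and withholding the approach label prevents the judge from anchoring on presentation patterns tied to either agent.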
Claude Sonnet as judge:
| Metric | Baseline | Calseta |
|---|---|---|
| Claude Agent — Completeness | 8.5 | 8.2 |
| Claude Agent — Accuracy | 7.7 | 7.7 |
| Claude Agent — Actionability | 8.5 | 9.2 |
| Claude Agent — Overall | 8.2 | 8.4 |
| GPT-4o Agent — Completeness | 8.3 | 8.3 |
| GPT-4o Agent — Accuracy | 6.6 | 7.1 |
| GPT-4o Agent — Actionability | 7.3 | 8.9 |
| GPT-4o Agent — Overall | 7.4 | 8.1 |
GPT-4o as judge:
| Metric | Baseline | Calseta |
|---|---|---|
| Claude Agent — Completeness | 8.3 | 7.7 |
| Claude Agent — Accuracy | 7.7 | 6.9 |
| Claude Agent — Actionability | 9.0 | 8.7 |
| Claude Agent — Overall | 8.3 | 7.8 |
| GPT-4o Agent — Completeness | 8.0 | 7.6 |
| GPT-4o Agent — Accuracy | 7.5 | 6.5 |
| GPT-4o Agent — Actionability | 8.7 | 8.7 |
| GPT-4o Agent — Overall | 8.1 | 7.6 |
Calseta produces more actionable findings with comparable overall quality — while using 87–97% fewer tokens.
Actionability is consistently higher with Calseta. Structured, pre-enriched data produces more specific and operationally useful recommendations. The Claude judge scored Calseta actionability at 9.2 vs 8.5 for baseline (Claude agent) and 8.9 vs 7.3 (GPT-4o agent). Agents receiving organized context give more organized advice.
Completeness and accuracy are within range. The baseline agent scores slightly higher on completeness in some evaluations — expected, since it has access to verbose raw payloads with more surface-level detail. Accuracy scores are comparable, with the Calseta agent's pre-enrichment preventing the dangerous misclassifications that hurt baseline accuracy on critical scenarios like malware detection. Across both judges, overall quality differences are less than 1 point on a 10-point scale.
Two independent judges provide a balanced view. The Claude judge favored Calseta overall (8.4 vs 8.2 on Claude agent, 8.1 vs 7.4 on GPT-4o agent), while the GPT-4o judge favored the baseline (8.3 vs 7.8 on Claude agent, 8.1 vs 7.6 on GPT-4o agent). We present both to let readers draw their own conclusions. The consistent finding across both judges: Calseta's structured input produces more actionable recommendations, and the overall quality tradeoff is narrow.
Pre-enrichment prevents dangerous misclassification. In the Known Malware Hash scenario (Emotet banking trojan), the baseline Claude agent consistently misidentified active malware as a false positive — scoring 1.7/10 on accuracy. Without pre-enrichment from VirusTotal confirming the hash as Emotet, the agent lacked the threat intelligence to make a correct assessment. The Calseta agent, with pre-enriched hash data, correctly identified the malware every time (accuracy: 9.0/10). This is the strongest argument for pre-enrichment: it's not just about cost reduction — it's about investigation safety.
| Metric | Baseline | Calseta |
|---|---|---|
| Tool definitions & API integration | 40–80 hrs | 0 hrs |
| Enrichment pipeline (rate limits, caching, retry) | Included above | 0 hrs (platform handles) |
| Prompt engineering for raw payloads | 10–20 hrs | 0 hrs |
| Agent integration with Calseta API | N/A | 1–2 hrs |
| Total Estimated | 40–80 hrs | 1–2 hrs |
Honest limitations of this benchmark:
- Synthetic fixtures — Alert payloads are realistic but not from production environments. Production alerts may have different payload sizes, nesting depth, and field coverage.
- Two models tested — GPT-4o and Claude Sonnet. Results may vary on other models, especially smaller or open-source models with different context window behaviors.
- Three runs per scenario — Sufficient for consistency given temperature=0, but not a statistically large sample. Variance analysis would benefit from more runs.
- LLM cost only — Does not include Calseta platform hosting costs (self-hosted, so variable), enrichment API subscription fees, or infrastructure overhead.
- Single-alert investigation — Does not test multi-alert correlation, incident-level analysis, or alert triage across a queue of mixed-severity alerts.
- Pre-seeded enrichment — Enrichment data is mock/pre-seeded, not from live API calls. Real-world enrichment latency (seconds to minutes) is not captured in the benchmark timing.
- LLM-as-judge evaluation — Quality scores use LLM judges against hand-crafted ground truth, not human SOC analyst evaluation. Two independent judges (Claude and GPT-4o) cross-validate results, but human expert review would further strengthen confidence.
- Temperature=0 is not fully deterministic — While temperature=0 maximizes reproducibility, LLM outputs are not guaranteed identical across runs due to batching and infrastructure differences.
- Baseline agent is intentionally unoptimized — The baseline agent uses a straightforward implementation without advanced prompt engineering, RAG, or caching. A production-optimized DIY agent could narrow the gap on token usage.
We publish these limitations because transparency matters more than marketing. The methodology, prompts, fixtures, and raw data are all open source — verify the results yourself.
The entire benchmark is open source and reproducible:
1. Clone the Calseta repository
2. Start the platform with `make lab`
3. Run `python examples/case_study/run_study.py --ingest` to load fixtures
4. Run `python examples/case_study/run_study.py --run --models all` to execute the benchmark
5. Run `python examples/case_study/evaluate_findings.py` to evaluate quality
6. Results land in `examples/case_study/results/`
You need API keys for the LLM provider(s) you want to test. Enrichment data is pre-seeded — no VirusTotal or AbuseIPDB keys required for the Calseta agent path.