TraceOps
Record, replay, and analyze LLM agent traces for deterministic regression testing.
Behavioral analysis sub-package, new in v0.6.0: `PatternDetector`, `GapAnalyzer`, `SkillsGenerator`, and `PRFetcher` — inspired by agent-pr-replay. See the Pattern Detector and Gap Analyzer sections.
What is TraceOps?
TraceOps brings the VCR.py pattern to LLM agents — but at the SDK level, not the HTTP level. It intercepts openai.chat.completions.create, anthropic.messages.create, tool calls, and agent decisions, recording the full execution trace as a YAML cassette.
On replay, it injects recorded responses without making any real API calls — giving you zero-cost, millisecond-execution, fully deterministic agent tests.
Why not just use VCR.py?
VCR.py records HTTP traffic. TraceOps records agent behavior:
| Capability | VCR.py | TraceOps |
|---|---|---|
| Records at | HTTP layer | SDK layer |
| Understands agent semantics | ❌ | ✔ LLM calls, tools, decisions |
| Trajectory tracking | ❌ | ✔ |
| Semantic diff | Binary match | ✔ “model changed”, “new tool” |
| Cost tracking | ❌ | ✔ per-call tokens + USD |
| RAG + MCP recording | ❌ | ✔ |
| Behavioral analysis | ❌ | ✔ PatternDetector, GapAnalyzer |
Key features
🏙 SDK-Level Recording
Intercepts OpenAI, Anthropic, LiteLLM, LangChain, CrewAI — not raw HTTP.
▶ Deterministic Replay
Zero API calls, millisecond execution, fully deterministic CI tests.
🔍 Semantic Diff
Detect model changes, new tools, extra LLM calls — not just binary match.
💸 Budget Assertions
Guard against cost overruns, token bloat, and infinite tool loops.
📚 RAG Recording
Capture retrieval queries, chunks, scores, and drift across versions.
📈 Pattern Analysis
Tool n-gram heatmaps, model stats, error rates across entire cassette libraries.
🔴 Gap Analyzer
Compare agent vs golden baseline — auto-detect inflation, missing tools, model mismatch.
📝 AGENTS.md Gen
Auto-generate steering guidance from behavioral gaps.
Supported providers
OpenAI, Anthropic, LiteLLM, LangChain / LangGraph, and CrewAI. See the integration sections below for setup details.
Quickstart
Go from zero to a recorded, replayed, and tested agent in under 5 minutes.
Install TraceOps with `pip install traceops`. Requires Python ≥ 3.10.
1. Install
```bash
pip install traceops
```
2. Record your first trace
Wrap your agent code in a Recorder context manager. TraceOps intercepts all LLM calls and saves them to a YAML cassette.
```python
from trace_ops import Recorder
import openai

client = openai.OpenAI()

with Recorder(save_to="cassettes/test_math.yaml") as rec:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    print(response.choices[0].message.content)

print(f"Recorded {rec.trace.total_llm_calls} LLM call(s)")
print(f"Tokens used: {rec.trace.total_tokens}")
print(f"Cost: ${rec.trace.total_cost_usd:.4f}")
```
3. Replay deterministically
Use a Replayer context to inject recorded responses. No API calls are made.
```python
from trace_ops import Replayer

with Replayer("cassettes/test_math.yaml"):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    assert response.choices[0].message.content
```
4. Add to your test suite
Use the built-in cassette pytest fixture. It auto-records on the first run and replays on every subsequent run.
```python
# Records on first run, replays automatically after that
def test_summarize(cassette):
    result = agent.run("Summarize the quarterly report")
    assert "revenue" in result.lower()
```

```bash
pytest --record   # first time
pytest            # replays, zero cost
```
Add budget assertions, explore trace diffing, or try behavioral pattern analysis.
Installation
TraceOps requires Python ≥ 3.10. Install the core package or add optional provider extras.
Basic install
```bash
pip install traceops
```
Optional extras
```bash
pip install traceops[openai]     # OpenAI SDK
pip install traceops[anthropic]  # Anthropic SDK
pip install traceops[langchain]  # LangChain + LangGraph
pip install traceops[crewai]     # CrewAI
pip install traceops[rag]        # RAG scoring (ragas, deepeval)
pip install traceops[all]        # Everything
```
The `trace_ops.analysis` and `trace_ops.github` sub-packages use only the Python standard library — no additional packages required.
Recording
Use Recorder as a context manager to capture all LLM calls, tool invocations, and agent decisions in a portable YAML cassette.
Synchronous recording
```python
from trace_ops import Recorder

with Recorder(save_to="cassettes/run.yaml") as rec:
    result = agent.run("Do the task")

print(rec.trace.total_llm_calls, rec.trace.total_cost_usd)
```
Async recording
```python
async with Recorder(save_to="cassettes/async_run.yaml") as rec:
    result = await async_agent.run("Do the task")
```
Streaming
TraceOps automatically captures streaming chunks and reassembles them into a single recorded event. On replay, chunks are re-emitted at the same rate.
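Reassembly here means concatenating the streamed content deltas into one complete message. A minimal plain-Python sketch of that idea (a hypothetical helper shown for intuition, not TraceOps internals):

```python
def reassemble_chunks(chunks: list[dict]) -> str:
    """Join streaming content deltas into the full message text."""
    return "".join(c.get("delta", "") for c in chunks)

# Deltas as they might arrive from a streaming LLM call
chunks = [{"delta": "Hel"}, {"delta": "lo, "}, {"delta": "world"}, {"delta": "!"}]
print(reassemble_chunks(chunks))  # Hello, world!
```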
Replay
Use Replayer to inject recorded responses without any real API calls. Execution takes under a millisecond.
Basic replay
```python
from trace_ops import Replayer

with Replayer("cassettes/run.yaml"):
    result = agent.run("Do the task")  # zero API calls, <1 ms
```
Strict vs lenient mode
By default, Replayer is in strict mode — it raises ReplayMismatchError if the agent tries to make a call that wasn’t recorded. Pass strict=False to fall through to real API calls for unrecorded events.
```python
Replayer("cassettes/run.yaml", strict=False)
```
Trace Diffing
Compare two traces semantically — detect trajectory changes, new tools, model upgrades, and token inflation.
Compare two cassettes
```python
from trace_ops import diff_traces, load_cassette

old = load_cassette("cassettes/v1.yaml")
new = load_cassette("cassettes/v2.yaml")

diff = diff_traces(old, new)
print(diff.summary())
```
```
⚠ TRAJECTORY CHANGED
  Old: llm_call:gpt-4o → tool:search → llm_call:gpt-4o
  New: llm_call:gpt-4o → tool:browse → tool:search → llm_call:gpt-4o
⚠ TOKENS INCREASED by 23% (1,200 → 1,476)
ℹ MODEL UNCHANGED (gpt-4o)
```
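A trajectory change like the one above reduces to comparing ordered event sequences. A rough illustration in plain Python (not the actual diff engine):

```python
# Each trajectory is an ordered list of "kind:name" event labels
old = ["llm_call:gpt-4o", "tool:search", "llm_call:gpt-4o"]
new = ["llm_call:gpt-4o", "tool:browse", "tool:search", "llm_call:gpt-4o"]

def trajectory_changed(a: list[str], b: list[str]) -> bool:
    """A trajectory 'changes' whenever the ordered event sequences differ."""
    return a != b

print(trajectory_changed(old, new))      # True
print([s for s in new if s not in old])  # ['tool:browse']
```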
Via CLI
```bash
traceops diff cassettes/v1.yaml cassettes/v2.yaml
```
Assertions
Guard against cost overruns, token bloat, and infinite tool loops with built-in assertion helpers and the @budget pytest marker.
Assertion helpers
```python
from trace_ops import Recorder
from trace_ops.assertions import (
    assert_cost_under,
    assert_tokens_under,
    assert_max_llm_calls,
    assert_no_loops,
)

with Recorder() as rec:
    agent.run("Analyze this document")

assert_cost_under(rec.trace, max_usd=0.50)
assert_tokens_under(rec.trace, max_tokens=10_000)
assert_max_llm_calls(rec.trace, max_calls=5)
assert_no_loops(rec.trace, max_consecutive_same_tool=3)
```
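The loop guard is the least obvious of these: it flags runs of consecutive calls to the same tool. Its check can be approximated as follows (assumed behavior, shown for intuition only):

```python
def has_tool_loop(tool_calls: list[str], max_consecutive: int = 3) -> bool:
    """True if any tool is invoked more than `max_consecutive` times in a row."""
    run = 1
    for prev, curr in zip(tool_calls, tool_calls[1:]):
        run = run + 1 if curr == prev else 1
        if run > max_consecutive:
            return True
    return False

print(has_tool_loop(["search", "search", "search", "search"]))  # True
print(has_tool_loop(["search", "read", "search", "read"]))      # False
```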
@budget marker
Use the @pytest.mark.budget marker for inline budget constraints:
```python
import pytest

@pytest.mark.budget(max_usd=0.50, max_tokens=10_000, max_llm_calls=5)
def test_agent_budget(cassette):
    agent.run("Analyze this document")
```
pytest Plugin
The built-in pytest plugin provides the cassette fixture plus --record and --record-mode flags.
cassette fixture
Any test that takes a cassette parameter automatically gets record/replay behavior. Cassettes are stored alongside the test file in a cassettes/ subdirectory.
```python
def test_agent(cassette):
    result = agent.run("What is the capital of France?")
    assert "Paris" in result
```
CLI flags
```bash
pytest --record            # Record all cassettes
pytest                     # Replay from cassettes (zero API calls)
pytest --record-mode=all   # Re-record everything
pytest --record-mode=new   # Record only missing cassettes
```
RAG Recording & Scoring
Record retrieval queries, chunks, and vector store events alongside LLM calls. Assert on chunk count, relevance scores, and retrieval drift between versions.
Install optional scoring deps: `pip install traceops[rag]`
Recording retrieval events
```python
from trace_ops import Recorder
from trace_ops.rag import assert_chunk_count, assert_min_relevance_score

with Recorder(save_to="cassettes/rag_test.yaml") as rec:
    answer = rag_pipeline.run("What is the refund policy?")

assert_chunk_count(rec.trace, min_chunks=3)
assert_min_relevance_score(rec.trace, min_score=0.75)
```
Retrieval drift detection
```python
from trace_ops import Recorder, load_cassette
from trace_ops.rag import assert_no_retrieval_drift

baseline = load_cassette("cassettes/rag_baseline.yaml")

with Recorder() as rec:
    rag_pipeline.run("What is the refund policy?")

assert_no_retrieval_drift(baseline, rec.trace, tolerance=0.1)
```
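One common way to quantify retrieval drift is the fraction of baseline chunks missing from the new run. The snippet below sketches that idea; the metric TraceOps actually uses is not specified here, so treat this as an assumption:

```python
def retrieval_drift(baseline_ids: set, new_ids: set) -> float:
    """Fraction of baseline chunks that the new run no longer retrieves."""
    if not baseline_ids:
        return 0.0
    return len(baseline_ids - new_ids) / len(baseline_ids)

# One of four baseline chunks dropped out of the new retrieval
print(retrieval_drift({"c1", "c2", "c3", "c4"}, {"c1", "c2", "c3", "c9"}))  # 0.25
```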
Scoring integrations
```python
from trace_ops.rag import RagasScorer

scores = RagasScorer().score(rec.trace)
# {"faithfulness": 0.92, "answer_relevancy": 0.88, "context_precision": 0.85}
```
MCP Tool Recording
Record Model Context Protocol server connections, tool calls, and results as first-class trace events.
MCP recording is enabled by default when the MCP SDK is installed.
Recording MCP calls
```python
with Recorder(intercept_mcp=True) as rec:
    result = await mcp_client.call_tool(
        "search_files", {"query": "config.py"}
    )

print(rec.trace.mcp_events)
# [MCPEvent(tool="search_files", duration_ms=12, result=[...])]
```
Semantic Regression Detection
Catch meaning-level regressions that exact-string diffs miss — using embedding cosine similarity.
assert_semantic_similarity
```python
from trace_ops.semantic import assert_semantic_similarity

assert_semantic_similarity(baseline_trace, rec.trace, threshold=0.85)
# Raises SemanticRegressionError if cosine similarity < threshold
```
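Cosine similarity itself is a short computation over the two response embeddings. For reference (the vectors here are made up):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```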
Fine-Tune Export
Export recorded traces as OpenAI or Anthropic fine-tuning JSONL datasets. Cassettes become training data.
OpenAI format
```python
from trace_ops.export.finetune import export_finetune_jsonl, ExportFormat

export_finetune_jsonl("cassettes/", "dataset.jsonl", fmt=ExportFormat.OPENAI)
```
Anthropic format
```python
export_finetune_jsonl("cassettes/", "dataset.jsonl", fmt=ExportFormat.ANTHROPIC)
```
Pattern Detector
Analyze tool n-gram heatmaps, model usage statistics, and error rates across an entire cassette directory.
No extra dependencies required.
Basic usage
```python
from trace_ops.analysis import PatternDetector

detector = PatternDetector(window_size=3, top_n=10)
report = detector.analyze_dir("cassettes/")

print(report.summary())
# Analyzed 47 traces | Avg: 3.2 LLM calls, 1,450 tokens, $0.012/run
```
Tool sequences
```python
for seq in report.top_tool_sequences:
    print(f"  {' -> '.join(seq.sequence)} ×{seq.count}")
# search -> read_file -> write_file ×31
# search -> read_file ×9
```

Export the full report as JSON:

```python
import json

with open("patterns.json", "w") as f:
    json.dump(report.to_dict(), f)
```
Constructor parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `window_size` | int | 3 | N-gram window size for tool sequences |
| `top_n` | int | 10 | Number of top sequences to return |
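The windowing is a sliding n-gram count over each trace's tool sequence. A plain-Python sketch of the idea (not the library's internals):

```python
from collections import Counter

def tool_ngrams(tools: list[str], window_size: int = 3) -> Counter:
    """Count every contiguous window of `window_size` tool calls."""
    return Counter(
        tuple(tools[i : i + window_size])
        for i in range(len(tools) - window_size + 1)
    )

calls = ["search", "read_file", "write_file", "search", "read_file"]
print(tool_ngrams(calls, window_size=2).most_common(1))
# [(('search', 'read_file'), 2)]
```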
Gap Analyzer
Compare agent traces against a golden baseline to surface systematic behavioral differences.
The CLI command traceops gap-report exits with code 1 if any critical gaps are found — making it a drop-in CI gate.
Basic comparison
```python
from pathlib import Path

from trace_ops import load_cassette
from trace_ops.analysis import GapAnalyzer

golden = [(p.name, load_cassette(p)) for p in Path("golden/").glob("*.yaml")]
agent = [(p.name, load_cassette(p)) for p in Path("runs/").glob("*.yaml")]

report = GapAnalyzer().compare(golden, agent)
print(report.summary())
# Found 3 behavioral gap(s): 1 critical, 2 warnings
```
Detected gap types
| Category | Severity | Description |
|---|---|---|
| `token_inflation` | critical | Agent uses significantly more tokens |
| `cost_inflation` | critical | Agent costs significantly more |
| `missing_tool` | warning | Golden uses a tool the agent rarely calls |
| `extra_tool` | warning | Agent uses a tool not in golden |
| `model_mismatch` | info | Agent and golden use different models |
| `error_rate` | critical | Agent has significantly higher error rate |
| `llm_call_inflation` | warning | Agent makes many more LLM calls per task |
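Inflation-style gaps reduce to a ratio test between the agent's average and the golden average. A sketch assuming a simple 1.5× threshold (the real thresholds are not documented here):

```python
def token_inflation(golden_avg: float, agent_avg: float, threshold: float = 1.5) -> bool:
    """Flag a gap when the agent averages more than `threshold`x the golden tokens."""
    return agent_avg > golden_avg * threshold

print(token_inflation(1200, 2100))  # True  (1.75x the golden average)
print(token_inflation(1200, 1300))  # False (within threshold)
```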
Skills Generator
Auto-generate AGENTS.md or CLAUDE.md steering guidance from gap or pattern reports.
From a gap report
```python
from trace_ops.analysis import GapAnalyzer, SkillsGenerator

report = GapAnalyzer().compare(golden, agent)

gen = SkillsGenerator()
md = gen.from_gap_report(
    report,
    output_path="AGENTS.md",
    title="Agent Behavioral Guidance",
)
print(md)
```
From a pattern report
```python
from trace_ops.analysis import PatternDetector, SkillsGenerator

report = PatternDetector().analyze_dir("cassettes/")
SkillsGenerator().from_pattern_report(report, output_path="PATTERNS.md")
```
GitHub PR Fetcher
Fetch merged GitHub PRs as human-validated golden baselines via the GitHub REST API. Uses only Python stdlib — no extra deps.
Set GITHUB_TOKEN in your environment or pass token= explicitly. Without a token you get 60 unauthenticated requests/hour.
Fetch a single PR
```python
from trace_ops.github import PRFetcher

fetcher = PRFetcher()  # reads GITHUB_TOKEN env var automatically
pr = fetcher.fetch("https://github.com/owner/repo/pull/123")

print(pr.title)                  # "Fix: agent skips write step on empty search"
print(pr.total_additions)        # 33
print(pr.extract_task_prompt())  # plain-English task description
```
Fetch recent merged PRs
```python
recent = fetcher.fetch_recent(
    "https://github.com/owner/repo",
    limit=10,
)
print(f"Fetched {len(recent)} golden PRs")
```
OpenAI Integration
TraceOps auto-intercepts openai.chat.completions.create for sync, async, and streaming.
Just wrap your code in Recorder() — TraceOps patches the OpenAI SDK automatically.
```bash
pip install traceops[openai]
```
```python
import openai
from trace_ops import Recorder

client = openai.OpenAI()

with Recorder(save_to="cassettes/openai_test.yaml") as rec:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hi"}],
    )
```
Anthropic Integration
Full support for Anthropic messages, tool_use blocks, and streaming delta events.
```bash
pip install traceops[anthropic]
```
```python
import anthropic
from trace_ops import Recorder

client = anthropic.Anthropic()

with Recorder() as rec:
    msg = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Hi"}],
    )
```
LangChain / LangGraph
Intercept LangChain BaseChatModel calls and LangGraph Pregel graph execution with a single flag.
```python
with Recorder(intercept_langchain=True, intercept_langgraph=True):
    result = graph.invoke({"messages": [...]})
```
CrewAI Integration
Record CrewAI task pipelines and agent outputs end-to-end.
```bash
pip install traceops[crewai]
```
```python
from trace_ops import Recorder

with Recorder(intercept_crewai=True) as rec:
    crew.kickoff(inputs={"topic": "AI agents"})
```
CLI Reference
The traceops command provides tools for inspecting cassettes, running analysis, and generating reports.
Cassette commands
```bash
traceops inspect cassettes/test.yaml           # Inspect a cassette
traceops diff old.yaml new.yaml                # Compare two cassettes
traceops export test.yaml --format json        # Export to JSON / JSONL
traceops ls cassettes/                         # List all cassettes with stats
traceops stats cassettes/                      # Aggregate statistics
traceops prune cassettes/ --older-than 30d     # Delete stale cassettes
traceops validate cassettes/test.yaml          # Validate integrity
traceops report cassettes/test.yaml -o r.html  # HTML cost report
traceops debug cassettes/test.yaml             # Time-travel debugger
```
Analysis commands
These commands require TraceOps ≥ 0.6.0.
```bash
# Pattern analysis across all cassettes
traceops analyze cassettes/ --window 3 --top 10
traceops analyze cassettes/ -o patterns.json --skills AGENTS.md

# Gap analysis vs golden baseline (exits 1 on critical gaps)
traceops gap-report golden/ runs/
traceops gap-report golden/ runs/ --skills AGENTS.md --json

# Fetch a GitHub PR as golden baseline
traceops pr-diff https://github.com/owner/repo/pull/123
traceops pr-diff https://github.com/owner/repo/pull/123 --task
traceops pr-diff https://github.com/owner/repo/pull/123 --files
```
Changelog
v0.6.0 — Current
- Add `trace_ops.analysis`: `PatternDetector`, `GapAnalyzer`, `BehavioralGap`, `GapReport`, `SkillsGenerator`
- Add `trace_ops.github`: `PRFetcher`, `PRDiff`, `PRFile`
- 3 new CLI commands: `traceops analyze`, `traceops gap-report`, `traceops pr-diff`
- 714 tests passing

v0.5.0
- RAG recording + scoring (ragas, DeepEval integrations)
- Semantic regression detection (`assert_semantic_similarity`)
- MCP tool call recording and replay
- Fine-tune dataset export (OpenAI + Anthropic JSONL)

v0.3.0 — v0.4.0
- LangGraph Pregel interceptor
- Anthropic `tool_use` block support
- Provider-agnostic response normalization
- Cost dashboard reporter

v0.2.0
- Async + streaming support
- Budget assertions + `@budget` marker
- Time-travel debugger (`traceops debug`)
- HTML report generator
- GitHub Action integration

v0.1.0
- Initial release: record/replay for OpenAI, Anthropic, LiteLLM
- pytest plugin with `cassette` fixture
- Semantic diff engine
- Core CLI