TraceOps

Record, replay, and analyze LLM agent traces for deterministic regression testing.

🆕 What’s new in v0.6

Behavioral analysis sub-package: PatternDetector, GapAnalyzer, SkillsGenerator, and PRFetcher — inspired by agent-pr-replay. See the Behavioral Analysis section.

What is TraceOps?

TraceOps brings the VCR.py pattern to LLM agents — but at the SDK level, not the HTTP level. It intercepts openai.chat.completions.create, anthropic.messages.create, tool calls, and agent decisions, recording the full execution trace as a YAML cassette.

On replay, it injects recorded responses without making any real API calls — giving you zero-cost, millisecond-execution, fully deterministic agent tests.
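A cassette is a plain YAML file, so it diffs cleanly in code review. The exact schema isn't shown in these docs; a single recorded call might look roughly like this (field names are illustrative, not the real format):

```yaml
# Hypothetical cassette layout — field names are illustrative only
version: 1
events:
  - type: llm_call
    provider: openai
    model: gpt-4o
    request:
      messages:
        - {role: user, content: "What is 2+2?"}
    response:
      content: "2 + 2 = 4."
      usage: {prompt_tokens: 13, completion_tokens: 8}
    cost_usd: 0.0002
```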

Why not just use VCR.py?

VCR.py records HTTP traffic. TraceOps records agent behavior:

| Capability | VCR.py | TraceOps |
| --- | --- | --- |
| Records at | HTTP layer | SDK layer |
| Understands agent semantics | ✘ | ✔ LLM calls, tools, decisions |
| Trajectory tracking | ✘ | ✔ |
| Semantic diff | Binary match | ✔ “model changed”, “new tool” |
| Cost tracking | ✘ | ✔ per-call tokens + USD |
| RAG + MCP recording | ✘ | ✔ |
| Behavioral analysis | ✘ | ✔ PatternDetector, GapAnalyzer |

Key features

🏙 SDK-Level Recording

Intercepts OpenAI, Anthropic, LiteLLM, LangChain, CrewAI — not raw HTTP.

▶ Deterministic Replay

Zero API calls, millisecond execution, fully deterministic CI tests.

🔍 Semantic Diff

Detect model changes, new tools, extra LLM calls — not just binary match.

💸 Budget Assertions

Guard against cost overruns, token bloat, and infinite tool loops.

📚 RAG Recording

Capture retrieval queries, chunks, scores, and drift across versions.

📈 Pattern Analysis

Tool n-gram heatmaps, model stats, error rates across entire cassette libraries.

🔴 Gap Analyzer

Compare agent vs golden baseline — auto-detect inflation, missing tools, model mismatch.

📝 AGENTS.md Gen

Auto-generate steering guidance from behavioral gaps.

Supported providers

OpenAI Anthropic LiteLLM LangChain LangGraph CrewAI MCP

Quickstart

Go from zero to a recorded, replayed, and tested agent in under 5 minutes.

✓ Prerequisites

Install TraceOps with pip install traceops. Requires Python ≥ 3.10.

1. Install

bash
pip install traceops

2. Record your first trace

Wrap your agent code in a Recorder context manager. TraceOps intercepts all LLM calls and saves them to a YAML cassette.

python
from trace_ops import Recorder
import openai

client = openai.OpenAI()

with Recorder(save_to="cassettes/test_math.yaml") as rec:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    print(response.choices[0].message.content)

print(f"Recorded {rec.trace.total_llm_calls} LLM call(s)")
print(f"Tokens used: {rec.trace.total_tokens}")
print(f"Cost: ${rec.trace.total_cost_usd:.4f}")

3. Replay deterministically

Use a Replayer context to inject recorded responses. No API calls are made.

python
from trace_ops import Replayer

with Replayer("cassettes/test_math.yaml"):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    assert response.choices[0].message.content

4. Add to your test suite

Use the built-in cassette pytest fixture. It auto-records on the first run and replays on every subsequent run.

python — test_agent.py
# Records on first run, replays automatically after that
def test_summarize(cassette):
    result = agent.run("Summarize the quarterly report")
    assert "revenue" in result.lower()

# pytest --record   (first time)
# pytest             (replays, zero cost)
✓ Next steps

Add budget assertions, explore trace diffing, or try behavioral pattern analysis.

Installation

TraceOps requires Python ≥ 3.10. Install the core package or add optional provider extras.

Basic install

bash
pip install traceops

Optional extras

bash
pip install traceops[openai]       # OpenAI SDK
pip install traceops[anthropic]    # Anthropic SDK
pip install traceops[langchain]    # LangChain + LangGraph
pip install traceops[crewai]       # CrewAI
pip install traceops[rag]          # RAG scoring (ragas, deepeval)
pip install traceops[all]          # Everything
ℹ No extra dependencies for analysis

The trace_ops.analysis and trace_ops.github sub-packages use only Python stdlib — no additional packages required.

Recording

Use Recorder as a context manager to capture all LLM calls, tool invocations, and agent decisions in a portable YAML cassette.

Synchronous recording

python
from trace_ops import Recorder

with Recorder(save_to="cassettes/run.yaml") as rec:
    result = agent.run("Do the task")

print(rec.trace.total_llm_calls, rec.trace.total_cost_usd)

Async recording

python
async with Recorder(save_to="cassettes/async_run.yaml") as rec:
    result = await async_agent.run("Do the task")

Streaming

TraceOps automatically captures streaming chunks and reassembles them into a single recorded event. On replay, chunks are re-emitted at the same rate.
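The reassembly idea is simple: buffer each streamed delta with its arrival time, then store the concatenation as one event. A minimal sketch of that bookkeeping (the real Recorder's internals may differ):

```python
# Sketch: buffer streaming deltas, then reassemble into one recorded event.
from dataclasses import dataclass, field

@dataclass
class StreamRecording:
    # Each entry is (delta_text, elapsed_ms) so replay can re-emit at pace.
    chunks: list = field(default_factory=list)

    def add(self, delta: str, elapsed_ms: float) -> None:
        self.chunks.append((delta, elapsed_ms))

    def full_text(self) -> str:
        # The single recorded event stores the concatenated deltas.
        return "".join(delta for delta, _ in self.chunks)

rec = StreamRecording()
for delta, t in [("2 + 2", 5.0), (" = 4", 12.0)]:
    rec.add(delta, t)
print(rec.full_text())  # -> 2 + 2 = 4
```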

Replay

Use Replayer to inject recorded responses without any real API calls. Execution takes under a millisecond.

Basic replay

python
from trace_ops import Replayer

with Replayer("cassettes/run.yaml"):
    result = agent.run("Do the task")  # zero API calls, <1ms

Strict vs lenient mode

By default, Replayer is in strict mode — it raises ReplayMismatchError if the agent tries to make a call that wasn’t recorded. Pass strict=False to fall through to real API calls for unrecorded events.

python
Replayer("cassettes/run.yaml", strict=False)
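Under the hood, replay has to match each incoming call against the recorded ones. One plausible scheme — keying on a hash of model plus messages — is sketched below; the library's actual matching rules are not documented here:

```python
# Sketch of replay lookup: hash the request, look it up, and either raise
# (strict) or return None so the caller falls through to the real API (lenient).
import hashlib
import json

def call_key(model: str, messages: list) -> str:
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

class ReplayMismatchError(Exception):
    pass

def replay(recorded: dict, model: str, messages: list, strict: bool = True):
    key = call_key(model, messages)
    if key in recorded:
        return recorded[key]
    if strict:
        raise ReplayMismatchError(f"no recorded response for {model}")
    return None  # lenient mode: caller makes a real call instead

msgs = [{"role": "user", "content": "Hi"}]
recorded = {call_key("gpt-4o", msgs): "Hello!"}
print(replay(recorded, "gpt-4o", msgs))  # -> Hello!
```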

Trace Diffing

Compare two traces semantically — detect trajectory changes, new tools, model upgrades, and token inflation.

Compare two cassettes

python
from trace_ops import diff_traces, load_cassette

old = load_cassette("cassettes/v1.yaml")
new = load_cassette("cassettes/v2.yaml")
diff = diff_traces(old, new)
print(diff.summary())
output
⚠ TRAJECTORY CHANGED
  Old: llm_call:gpt-4o → tool:search → llm_call:gpt-4o
  New: llm_call:gpt-4o → tool:browse → tool:search → llm_call:gpt-4o
⚠ TOKENS INCREASED by 23%  (1,200 → 1,476)
ℹ MODEL UNCHANGED (gpt-4o)
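The trajectory comparison above is essentially a sequence diff over event labels. A minimal reimplementation of that idea with stdlib difflib (diff_traces itself reports much more):

```python
# Sketch: diff two trajectories as label sequences and report insertions
# and deletions, the core of a "TRAJECTORY CHANGED" finding.
from difflib import SequenceMatcher

old = ["llm_call:gpt-4o", "tool:search", "llm_call:gpt-4o"]
new = ["llm_call:gpt-4o", "tool:browse", "tool:search", "llm_call:gpt-4o"]

changes = []
for op, i1, i2, j1, j2 in SequenceMatcher(a=old, b=new).get_opcodes():
    if op == "insert":
        changes.append(f"+ {' '.join(new[j1:j2])}")
    elif op == "delete":
        changes.append(f"- {' '.join(old[i1:i2])}")

print(changes)  # -> ['+ tool:browse']
```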

Via CLI

bash
traceops diff cassettes/v1.yaml cassettes/v2.yaml

Assertions

Guard against cost overruns, token bloat, and infinite tool loops with built-in assertion helpers and the @budget pytest marker.

Assertion helpers

python
from trace_ops import Recorder
from trace_ops.assertions import (
    assert_cost_under,
    assert_tokens_under,
    assert_max_llm_calls,
    assert_no_loops,
)

with Recorder() as rec:
    agent.run("Analyze this document")

assert_cost_under(rec.trace, max_usd=0.50)
assert_tokens_under(rec.trace, max_tokens=10_000)
assert_max_llm_calls(rec.trace, max_calls=5)
assert_no_loops(rec.trace, max_consecutive_same_tool=3)
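The loop check reduces to bounding the longest run of identical back-to-back tool calls. A standalone sketch of that logic (not the library's actual implementation):

```python
# Sketch: the check behind assert_no_loops — find the longest run of the
# same tool called consecutively and compare it to the allowed maximum.
def longest_same_tool_run(tool_calls: list[str]) -> int:
    best = run = 0
    prev = None
    for tool in tool_calls:
        run = run + 1 if tool == prev else 1
        best = max(best, run)
        prev = tool
    return best

calls = ["search", "search", "search", "search", "write_file"]
# A run of 4 would trip max_consecutive_same_tool=3.
print(longest_same_tool_run(calls))  # -> 4
```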

@budget marker

Use the @pytest.mark.budget marker for inline budget constraints:

python
import pytest

@pytest.mark.budget(max_usd=0.50, max_tokens=10_000, max_llm_calls=5)
def test_agent_budget(cassette):
    agent.run("Analyze this document")

pytest Plugin

The built-in pytest plugin provides the cassette fixture plus --record and --record-mode flags.

cassette fixture

Any test that takes a cassette parameter automatically gets record/replay behavior. Cassettes are stored alongside the test file in a cassettes/ subdirectory.

python
def test_agent(cassette):
    result = agent.run("What is the capital of France?")
    assert "Paris" in result

CLI flags

bash
pytest --record            # Record all cassettes
pytest                     # Replay from cassettes (zero API calls)
pytest --record-mode=all   # Re-record everything
pytest --record-mode=new   # Record only missing cassettes

RAG Recording & Scoring

Record retrieval queries, chunks, and vector store events alongside LLM calls. Assert on chunk count, relevance scores, and retrieval drift between versions.

ℹ Available since v0.5

Install optional scoring deps: pip install traceops[rag]

Recording retrieval events

python
from trace_ops import Recorder
from trace_ops.rag import assert_chunk_count, assert_min_relevance_score

with Recorder(save_to="cassettes/rag_test.yaml") as rec:
    answer = rag_pipeline.run("What is the refund policy?")

assert_chunk_count(rec.trace, min_chunks=3)
assert_min_relevance_score(rec.trace, min_score=0.75)

Retrieval drift detection

python
from trace_ops.rag import assert_no_retrieval_drift
from trace_ops import load_cassette

baseline = load_cassette("cassettes/rag_baseline.yaml")
with Recorder() as rec:
    rag_pipeline.run("What is the refund policy?")
assert_no_retrieval_drift(baseline, rec.trace, tolerance=0.1)
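One natural way to quantify drift is set overlap between the chunk IDs retrieved by the baseline and the new run. The library's actual metric is not documented here; a Jaccard-based sketch:

```python
# Sketch: retrieval drift as 1 - Jaccard overlap of retrieved chunk-ID sets.
# This is one plausible definition, not necessarily the library's.
def retrieval_drift(baseline_ids: set[str], new_ids: set[str]) -> float:
    union = baseline_ids | new_ids
    if not union:
        return 0.0
    return 1.0 - len(baseline_ids & new_ids) / len(union)

drift = retrieval_drift({"c1", "c2", "c3", "c4"}, {"c1", "c2", "c3", "c5"})
print(round(drift, 2))  # 2 of 5 unique chunks differ -> 0.4
```

With a tolerance of 0.1, a drift of 0.4 like the one above would fail the assertion.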

Scoring integrations

python
from trace_ops.rag import RagasScorer

scores = RagasScorer().score(rec.trace)
# {"faithfulness": 0.92, "answer_relevancy": 0.88, "context_precision": 0.85}

MCP Tool Recording

Record Model Context Protocol server connections, tool calls, and results as first-class trace events.

ℹ Available since v0.5

MCP recording is enabled by default when the MCP SDK is installed.

Recording MCP calls

python
with Recorder(intercept_mcp=True) as rec:
    result = await mcp_client.call_tool(
        "search_files", {"query": "config.py"}
    )

print(rec.trace.mcp_events)
# [MCPEvent(tool="search_files", duration_ms=12, result=[...])]

Semantic Regression Detection

Catch meaning-level regressions that exact-string diffs miss — using embedding cosine similarity.

assert_semantic_similarity

python
from trace_ops.semantic import assert_semantic_similarity

assert_semantic_similarity(baseline_trace, rec.trace, threshold=0.85)
# Raises SemanticRegressionError if cosine similarity < threshold
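The underlying comparison is cosine similarity between embedding vectors of the two outputs. A pure-Python sketch of the math (the library obtains the embeddings from a model; the vectors here are made up):

```python
# Sketch: cosine similarity between two embedding vectors, the metric
# behind assert_semantic_similarity. Vectors are illustrative.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

baseline_vec = [1.0, 0.0, 1.0]
new_vec = [1.0, 0.1, 0.9]
sim = cosine_similarity(baseline_vec, new_vec)
if sim < 0.85:
    raise AssertionError(f"semantic regression: similarity {sim:.2f}")
```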

Fine-Tune Export

Export recorded traces as OpenAI or Anthropic fine-tuning JSONL datasets. Cassettes become training data.

OpenAI format

python
from trace_ops.export.finetune import export_finetune_jsonl, ExportFormat

export_finetune_jsonl("cassettes/", "dataset.jsonl", fmt=ExportFormat.OPENAI)

Anthropic format

python
export_finetune_jsonl("cassettes/", "dataset.jsonl", fmt=ExportFormat.ANTHROPIC)
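Conceptually, the export flattens each recorded request/response pair into one training example per JSONL line. A sketch for the OpenAI chat format, with an illustrative event shape:

```python
# Sketch: turn recorded request/response pairs into OpenAI chat-format
# JSONL training examples. The event dicts here are illustrative.
import json

events = [
    {"messages": [{"role": "user", "content": "What is 2+2?"}],
     "response": "2 + 2 = 4."},
]

lines = []
for ev in events:
    example = {"messages": ev["messages"]
               + [{"role": "assistant", "content": ev["response"]}]}
    lines.append(json.dumps(example))

print(lines[0])
```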

Pattern Detector

Analyze tool n-gram heatmaps, model usage statistics, and error rates across an entire cassette directory.

ℹ Available since v0.6

No extra dependencies required.

Basic usage

python
from trace_ops.analysis import PatternDetector

detector = PatternDetector(window_size=3, top_n=10)
report = detector.analyze_dir("cassettes/")

print(report.summary())
# Analyzed 47 traces | Avg: 3.2 LLM calls, 1,450 tokens, $0.012/run

Tool sequences

python
for seq in report.top_tool_sequences:
    print(f"  {' -> '.join(seq.sequence)} x{seq.count}")
# search -> read_file -> write_file   x31
# search -> read_file                 x9

import json
json.dump(report.to_dict(), open("patterns.json", "w"))
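The core of the detector is n-gram counting over each trace's tool sequence with a sliding window. A self-contained sketch of that idea (not PatternDetector's actual code):

```python
# Sketch: count tool-call n-grams with a sliding window — the idea behind
# PatternDetector's sequence heatmap.
from collections import Counter

def tool_ngrams(tools: list[str], window: int = 3) -> Counter:
    return Counter(
        tuple(tools[i:i + window])
        for i in range(len(tools) - window + 1)
    )

trace = ["search", "read_file", "write_file",
         "search", "read_file", "write_file"]
counts = tool_ngrams(trace, window=3)
print(counts.most_common(1))
# -> [(('search', 'read_file', 'write_file'), 2)]
```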

Constructor parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| window_size | int | 3 | N-gram window size for tool sequences |
| top_n | int | 10 | Number of top sequences to return |

Gap Analyzer

Compare agent traces against a golden baseline to surface systematic behavioral differences.

⚠ CI-friendly exit codes

The CLI command traceops gap-report exits with code 1 if any critical gaps are found — making it a drop-in CI gate.

Basic comparison

python
from pathlib import Path
from trace_ops import load_cassette
from trace_ops.analysis import GapAnalyzer

golden = [(p.name, load_cassette(p)) for p in Path("golden/").glob("*.yaml")]
agent  = [(p.name, load_cassette(p)) for p in Path("runs/").glob("*.yaml")]

report = GapAnalyzer().compare(golden, agent)
print(report.summary())
# Found 3 behavioral gap(s): 1 critical, 2 warnings

Detected gap types

| Category | Severity | Description |
| --- | --- | --- |
| token_inflation | critical | Agent uses significantly more tokens |
| cost_inflation | critical | Agent costs significantly more |
| missing_tool | warning | Golden uses a tool the agent rarely calls |
| extra_tool | warning | Agent uses a tool not in golden |
| model_mismatch | info | Agent and golden use different models |
| error_rate | critical | Agent has a significantly higher error rate |
| llm_call_inflation | warning | Agent makes many more LLM calls per task |
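An inflation check like token_inflation boils down to comparing mean usage against the baseline with a relative threshold. A sketch of that logic — the 25% cutoff is an assumption, not the library's documented default:

```python
# Sketch: flag token inflation when the agent's mean tokens per run exceed
# the golden baseline by more than a threshold (25% here, an assumption).
from statistics import mean

def token_inflation(golden_tokens, agent_tokens, threshold=0.25):
    golden_avg, agent_avg = mean(golden_tokens), mean(agent_tokens)
    ratio = (agent_avg - golden_avg) / golden_avg
    return ratio if ratio > threshold else None  # None = no gap

gap = token_inflation([1000, 1200, 1100], [1500, 1600, 1700])
print(f"{gap:.0%}")  # agent uses ~45% more tokens -> critical gap
```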

Skills Generator

Auto-generate AGENTS.md or CLAUDE.md steering guidance from gap or pattern reports.

From a gap report

python
from trace_ops.analysis import GapAnalyzer, SkillsGenerator

report = GapAnalyzer().compare(golden, agent)
gen = SkillsGenerator()
md = gen.from_gap_report(
    report,
    output_path="AGENTS.md",
    title="Agent Behavioral Guidance",
)
print(md)

From a pattern report

python
from trace_ops.analysis import PatternDetector, SkillsGenerator

report = PatternDetector().analyze_dir("cassettes/")
SkillsGenerator().from_pattern_report(report, output_path="PATTERNS.md")

GitHub PR Fetcher

Fetch merged GitHub PRs as human-validated golden baselines via the GitHub REST API. Uses only Python stdlib — no extra deps.

ℹ Authentication

Set GITHUB_TOKEN in your environment or pass token= explicitly. Without a token you get 60 unauthenticated requests/hour.

Fetch a single PR

python
from trace_ops.github import PRFetcher

fetcher = PRFetcher()  # reads GITHUB_TOKEN env var automatically
pr = fetcher.fetch("https://github.com/owner/repo/pull/123")

print(pr.title)                  # "Fix: agent skips write step on empty search"
print(pr.total_additions)        # 33
print(pr.extract_task_prompt())  # plain-English task description

Fetch recent merged PRs

python
recent = fetcher.fetch_recent(
    "https://github.com/owner/repo",
    limit=10,
)
print(f"Fetched {len(recent)} golden PRs")
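The fetcher's first step is turning a PR URL into the owner, repo, and number that the REST endpoint (GET /repos/{owner}/{repo}/pulls/{number}) expects. PRFetcher presumably does something similar internally; a stdlib sketch:

```python
# Sketch: parse a GitHub PR URL into the pieces the REST API needs.
import re

def parse_pr_url(url: str) -> tuple[str, str, int]:
    m = re.match(r"https://github\.com/([^/]+)/([^/]+)/pull/(\d+)", url)
    if not m:
        raise ValueError(f"not a GitHub PR URL: {url}")
    owner, repo, number = m.groups()
    return owner, repo, int(number)

owner, repo, number = parse_pr_url("https://github.com/owner/repo/pull/123")
print(f"GET /repos/{owner}/{repo}/pulls/{number}")
# -> GET /repos/owner/repo/pulls/123
```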

OpenAI Integration

TraceOps auto-intercepts openai.chat.completions.create for sync, async, and streaming.

✓ Zero config

Just wrap your code in Recorder() — TraceOps patches the OpenAI SDK automatically.

bash
pip install traceops[openai]
python
import openai
from trace_ops import Recorder

client = openai.OpenAI()
with Recorder(save_to="cassettes/openai_test.yaml") as rec:
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": "Hi"}]
    )

Anthropic Integration

Full support for Anthropic messages, tool_use blocks, and streaming delta events.

bash
pip install traceops[anthropic]
python
import anthropic
from trace_ops import Recorder

client = anthropic.Anthropic()
with Recorder() as rec:
    msg = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": "Hi"}],
    )

LangChain / LangGraph

Intercept LangChain BaseChatModel calls and LangGraph Pregel graph execution with a single flag.

python
with Recorder(intercept_langchain=True, intercept_langgraph=True):
    result = graph.invoke({"messages": [...]})

CrewAI Integration

Record CrewAI task pipelines and agent outputs end-to-end.

bash
pip install traceops[crewai]
python
from trace_ops import Recorder

with Recorder(intercept_crewai=True) as rec:
    crew.kickoff(inputs={"topic": "AI agents"})

CLI Reference

The traceops command provides tools for inspecting cassettes, running analysis, and generating reports.

Cassette commands

bash
traceops inspect cassettes/test.yaml          # Inspect a cassette
traceops diff old.yaml new.yaml               # Compare two cassettes
traceops export test.yaml --format json       # Export to JSON / JSONL
traceops ls cassettes/                        # List all cassettes with stats
traceops stats cassettes/                     # Aggregate statistics
traceops prune cassettes/ --older-than 30d    # Delete stale cassettes
traceops validate cassettes/test.yaml         # Validate integrity
traceops report cassettes/test.yaml -o r.html # HTML cost report
traceops debug cassettes/test.yaml            # Time-travel debugger

Analysis commands

ℹ v0.6

These commands require TraceOps ≥ 0.6.0.

bash
# Pattern analysis across all cassettes
traceops analyze cassettes/ --window 3 --top 10
traceops analyze cassettes/ -o patterns.json --skills AGENTS.md

# Gap analysis vs golden baseline (exits 1 on critical gaps)
traceops gap-report golden/ runs/
traceops gap-report golden/ runs/ --skills AGENTS.md --json

# Fetch a GitHub PR as golden baseline
traceops pr-diff https://github.com/owner/repo/pull/123
traceops pr-diff https://github.com/owner/repo/pull/123 --task
traceops pr-diff https://github.com/owner/repo/pull/123 --files

Changelog

v0.6.0 — Current

  • Add trace_ops.analysis: PatternDetector, GapAnalyzer, BehavioralGap, GapReport, SkillsGenerator
  • Add trace_ops.github: PRFetcher, PRDiff, PRFile
  • 3 new CLI commands: traceops analyze, traceops gap-report, traceops pr-diff
  • 714 tests passing

v0.5.0

  • RAG recording + scoring (ragas, DeepEval integrations)
  • Semantic regression detection (assert_semantic_similarity)
  • MCP tool call recording and replay
  • Fine-tune dataset export (OpenAI + Anthropic JSONL)

v0.3.0 — v0.4.0

  • LangGraph Pregel interceptor
  • Anthropic tool_use block support
  • Provider-agnostic response normalization
  • Cost dashboard reporter

v0.2.0

  • Async + streaming support
  • Budget assertions + @budget marker
  • Time-travel debugger (traceops debug)
  • HTML report generator
  • GitHub Action integration

v0.1.0

  • Initial release: Record/replay for OpenAI, Anthropic, LiteLLM
  • pytest plugin with cassette fixture
  • Semantic diff engine
  • Core CLI