Experiments
How experiments fit into the loop
To systematically understand and improve how your system behaves, you need a way to isolate cause and effect. That's what experiments give you. You pick a variable or a set of variables, run your dataset through two versions of your system, and compare what comes out. The result tells you whether a change actually helped, and by how much. To quantify how much, you also need to evaluate your experiment outputs (see Evaluate); this section covers the systematic experimentation that comes before evaluation.
The anatomy of an experiment
Every experiment has four components.
| Component | What it is |
|---|---|
| Baseline | Your current production system — the control condition everything else gets measured against. Keep it fixed while you vary one thing. |
| Dataset | The inputs you run both conditions against. Keep the same dataset across experiments so results are comparable over time. |
| Variable | The configuration you're changing — model, prompt, context, tool access, or agent architecture. See Variables below. |
| Outputs to compare | What your system produces under each condition. Comparing these is the actual work of running an experiment. |
It often helps to change only one variable at a time. However, variables interact, and some configurations require changing multiple variables at once.
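As a sketch, the four components might be represented as a pair of conditions, a shared dataset, and a place to collect outputs. The class and field names below are illustrative assumptions, not a specific tool's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Condition:
    """One configuration of the system: the baseline or the candidate under test."""
    name: str
    model: str
    prompt: str
    # Other levers (context strategy, tool access, architecture) would live here too.


@dataclass
class Experiment:
    """Bundles the four components: baseline, candidate (the variable), dataset, and outputs."""
    baseline: Condition
    candidate: Condition                      # identical to baseline except for the variable under test
    dataset: list[str]                        # inputs run against both conditions
    outputs: dict[str, list[str]] = field(default_factory=dict)  # condition name -> outputs


# Hypothetical experiment: swap the model, hold prompt and dataset fixed.
exp = Experiment(
    baseline=Condition("baseline", "prod-model-v1", "Answer concisely."),
    candidate=Condition("new-model", "new-model-preview", "Answer concisely."),
    dataset=["How do I reset my password?", "Why was I charged twice?"],
)
```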
Variables
- Model. The AI model you are using. Reasoning-heavy, cheap, and fast models each trade off result quality, speed, and cost differently.
- Prompt. The most common lever. Before running a prompt experiment, ask: is the failure a specification problem (an ambiguous or incomplete prompt) or a generalization problem (the model applies clear instructions inconsistently)? The latter is what an experiment measures.
- Context. What information you include in the prompt: retrieved documents, conversation history, user metadata.
- Tool access. Adding or removing tools changes what paths your system can take.
- Agent architecture. Single agent vs. multi-agent, which framework, how tasks are decomposed. The biggest bets, the hardest to isolate.
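One lightweight way to keep the single-variable discipline visible in code is to define the baseline as a plain config and derive the candidate from it, overriding exactly one key. The field values and model names below are illustrative assumptions, not any particular framework's schema.

```python
# Baseline: the current production configuration, kept fixed.
baseline = {
    "model": "prod-model-v1",        # placeholder model name
    "prompt": "You are a support assistant. Answer concisely.",
    "context": "retrieved_docs",
    "tools": ["search", "calculator"],
}

# Candidate: identical except for the one variable under test (the model).
candidate = {**baseline, "model": "new-model-preview"}

# Sanity check: exactly one key differs between the two conditions.
changed = {k for k in baseline if baseline[k] != candidate[k]}
assert changed == {"model"}, f"expected one changed variable, got {changed}"
```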
How experiments are used
The core flow: pick a variable and form a hypothesis, run both conditions against your dataset, compare outputs, learn something, and repeat.
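A rough sketch of that loop, assuming a run_system function that stands in for calling your application; everything here, including the model names and dataset, is illustrative.

```python
def run_system(config: dict, user_input: str) -> str:
    """Stand-in for your application: call the model or agent with this config.

    In a real setup this would invoke your pipeline; here it just echoes."""
    return f"[{config['model']}] response to: {user_input}"


def run_experiment(baseline: dict, candidate: dict, dataset: list[str]) -> list[dict]:
    """Run every input through both conditions and collect paired outputs."""
    results = []
    for user_input in dataset:
        results.append({
            "input": user_input,
            "baseline_output": run_system(baseline, user_input),
            "candidate_output": run_system(candidate, user_input),
        })
    return results


# Hypothesis: the new model improves answer quality on our support questions.
baseline = {"model": "prod-model-v1"}
candidate = {"model": "new-model-preview"}
dataset = ["How do I reset my password?", "Why was I charged twice?"]

for row in run_experiment(baseline, candidate, dataset):
    print(row["input"], "->", row["baseline_output"], "|", row["candidate_output"])
```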
Typical questions an experiment can answer include:
- A new model is released: will it improve the performance of our system?
- Does my prompt change improve the output quality of our system?
- Does our new agent harness architecture produce better results than our multi-agent system?
Start qualitative: same input, both conditions, traces side by side. That is how you learn what "better" means for your app; without reading real outputs regularly, metrics are easy to misread.
Scores then make the comparison concrete: win rates, whether wins are spread across inputs or concentrated, and cost or latency tradeoffs. Quality, price, and speed rarely move together; experiments make those tradeoffs visible in your own data rather than in the abstract.
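A small sketch of how those numbers might be computed from pairwise judgments. The winners, costs, and latencies below are placeholder values; how you obtain the per-input winner (human review or an evaluator) is up to you.

```python
from collections import Counter

# Placeholder pairwise judgments: for each input, which condition won
# ("baseline", "candidate", or "tie"), plus (baseline, candidate) cost and latency.
comparisons = [
    {"input": "reset password", "winner": "candidate", "cost_usd": (0.002, 0.009), "latency_s": (1.1, 2.4)},
    {"input": "double charge",  "winner": "candidate", "cost_usd": (0.002, 0.010), "latency_s": (1.0, 2.8)},
    {"input": "export data",    "winner": "tie",       "cost_usd": (0.003, 0.011), "latency_s": (1.3, 2.5)},
    {"input": "refund policy",  "winner": "baseline",  "cost_usd": (0.002, 0.008), "latency_s": (0.9, 2.2)},
]

wins = Counter(c["winner"] for c in comparisons)
decided = wins["candidate"] + wins["baseline"]
win_rate = wins["candidate"] / decided if decided else 0.0

avg_cost = [sum(c["cost_usd"][i] for c in comparisons) / len(comparisons) for i in (0, 1)]
avg_latency = [sum(c["latency_s"][i] for c in comparisons) / len(comparisons) for i in (0, 1)]

print(f"candidate win rate: {win_rate:.0%} ({wins['candidate']} wins, {wins['tie']} ties)")
print(f"avg cost    baseline ${avg_cost[0]:.4f} vs candidate ${avg_cost[1]:.4f}")
print(f"avg latency baseline {avg_latency[0]:.1f}s vs candidate {avg_latency[1]:.1f}s")
# Also worth checking: are the wins spread across many inputs, or concentrated
# in one category? A single cluster of wins can hide regressions elsewhere.
```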
Where to start
Start with a small, manual comparison before building more infrastructure. A few examples with traces side by side will teach you more in the first hour than a week of setup work.
- Get 20–30 real examples. Pull them from production traces or write realistic ones yourself. They don't need to cover everything, just a real slice of what your application handles.
- Change your configuration and run both versions. Keep everything else identical.
- Read traces side by side. No evaluator needed yet. Just read. What's different? Which one is actually better and why? Pay attention to the type of failure — is the prompt unclear, or is the model applying clear instructions inconsistently? That distinction tells you what kind of fix to try next.
- Add an evaluator once you have intuition. After a few manual rounds you'll know what you're looking for. Encode it. Now you can scale.
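As an illustration of that last step, here is a minimal sketch of encoding a rubric learned from manual reading into a first evaluator. The specific rule (answers should cite a source and stay concise) is a made-up example of something you might learn from traces, not a recommended metric.

```python
def passes_manual_rubric(output: str) -> bool:
    """A first evaluator encoding what manual reading taught us to look for.

    Hypothetical rubric: the answer must reference a source document and
    must not exceed roughly 120 words."""
    cites_source = "source:" in output.lower()
    concise = len(output.split()) <= 120
    return cites_source and concise


def score_condition(outputs: list[str]) -> float:
    """Fraction of outputs that pass the rubric, so conditions can be compared."""
    if not outputs:
        return 0.0
    return sum(passes_manual_rubric(o) for o in outputs) / len(outputs)


baseline_outputs = ["Your plan renews monthly. (source: billing-faq)"]
candidate_outputs = ["It renews monthly."]
print("baseline:", score_condition(baseline_outputs))
print("candidate:", score_condition(candidate_outputs))
```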
What comes next
To see whether your experiment led to an improvement, you need to evaluate your results. Learn more about evaluation methods in the next section.