A visual guide to six strategies for building confidence when your system's output is different every time.
Traditional testing assumes the same input always produces the same output. But LLMs, stochastic simulations, and randomized algorithms break that assumption entirely.
When you call add(2, 3), you expect 5. Every time. But when you ask an LLM "summarize this document," you can get a different summary on every call. So how do you write a test for that?
The key insight: stop testing for exact outputs. Start testing for behavioral contracts.
Fix the seed. Set temperature to zero. Eliminate variance at the source.
Many non-deterministic systems have a controllable source of randomness. For RNGs, you fix the seed. For LLMs, you set temperature=0 and use a fixed seed parameter where the provider supports one. This won't give you perfect reproducibility (floating-point operation order and GPU scheduling can still introduce drift), but it gets you close.
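A minimal sketch of the idea using Python's standard RNG — the sample_retry_jitter function is a hypothetical stand-in for any randomized component; for an LLM you'd pass temperature=0 and a seed instead:

```python
import random

def sample_retry_jitter(rng: random.Random, base: float = 1.0) -> float:
    # "Random" jitter, but fully determined by the RNG's seed.
    return base * rng.uniform(0.5, 1.5)

def test_jitter_is_reproducible():
    # Two RNGs with the same seed produce identical sequences,
    # so the "non-deterministic" function becomes testable exactly.
    a = [sample_retry_jitter(random.Random(42)) for _ in range(3)]
    b = [sample_retry_jitter(random.Random(42)) for _ in range(3)]
    # Wait — same seed per call means identical values; compare sequences too:
    rng1, rng2 = random.Random(7), random.Random(7)
    seq1 = [sample_retry_jitter(rng1) for _ in range(3)]
    seq2 = [sample_retry_jitter(rng2) for _ in range(3)]
    assert a == b
    assert seq1 == seq2

test_jitter_is_reproducible()
```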
When to use: CI pipelines where you need reproducible snapshots. Great as a baseline, but don't rely on it alone — model updates will still break your tests.
Assert what must always be true, regardless of the specific output.
Instead of assert output == "expected", define invariants — properties that every valid output must satisfy. The output varies, but the contract doesn't.
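For example, a summarizer's contract might be "non-empty, shorter than the input, no meta-commentary boilerplate." Here's a sketch — fake_summarize is a hypothetical stand-in for the real LLM call, included only so the contract check can run:

```python
def check_summary_contract(document: str, summary: str) -> None:
    # Invariants every valid summary must satisfy, whatever its exact wording.
    assert summary.strip(), "summary must not be empty"
    assert len(summary) < len(document), "summary must be shorter than the source"
    assert "as an ai" not in summary.lower(), "no meta-commentary boilerplate"

def fake_summarize(document: str) -> str:
    # Hypothetical stand-in for an LLM call; returns the first sentence.
    return document.split(".")[0] + "."

doc = "Property tests assert invariants. They catch regressions without brittleness."
check_summary_contract(doc, fake_summarize(doc))
```

The same check_summary_contract runs unchanged against any model or prompt version, because it never mentions the expected text.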
When to use: Always. This is the foundational layer. Every non-deterministic system should have property tests as guardrails. They catch regressions without being brittle.
Use a second model to evaluate the first — with structured rubrics.
Some quality dimensions (coherence, helpfulness, factual accuracy) are hard to reduce to code. So you use another LLM to grade the output against a rubric. This introduces its own variance, but running multiple evaluations and taking the consensus mitigates it.
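A sketch of the consensus step — make_noisy_judge is a hypothetical grader that's right 80% of the time (a real one would call a second model with a rubric prompt), and the majority vote damps its noise:

```python
import random
from collections import Counter

def consensus_grade(judge, output: str, n: int = 5) -> str:
    # Run the judge n times and take the majority label,
    # so a single noisy grading doesn't decide the test.
    votes = Counter(judge(output) for _ in range(n))
    return votes.most_common(1)[0][0]

def make_noisy_judge(seed: int):
    # Hypothetical judge: correct 80% of the time, to mimic grader variance.
    rng = random.Random(seed)
    def judge(output: str) -> str:
        correct = "pass" if "summary" in output else "fail"
        if rng.random() < 0.8:
            return correct
        return "fail" if correct == "pass" else "pass"
    return judge

assert consensus_grade(make_noisy_judge(0), "a coherent summary") == "pass"
```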
When to use: Quality regression testing, comparing model versions, evaluating subjective dimensions. Tools like Braintrust, Promptfoo, and DeepEval formalize this pattern.
Run it N times. Assert over the distribution, not any single run.
Treat non-determinism as a first-class citizen. Run the same input through the system multiple times and make assertions about the aggregate: pass rates, score distributions, variance bounds.
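A minimal sketch of a pass-rate gate — flaky_system is a hypothetical pipeline that succeeds about 90% of the time, and the test asserts on the aggregate rather than any single call:

```python
import random

def flaky_system(rng: random.Random) -> bool:
    # Stand-in for a non-deterministic pipeline that succeeds ~90% of the time.
    return rng.random() < 0.9

def test_pass_rate(n: int = 500, threshold: float = 0.8) -> None:
    rng = random.Random(7)
    passes = sum(flaky_system(rng) for _ in range(n))
    # Assert over the distribution, not any single run.
    assert passes / n >= threshold, f"pass rate {passes / n:.2f} below {threshold}"

test_pass_rate()
```

Picking n and the threshold is a statistics problem: n large enough that sampling noise can't flip the verdict, and a threshold with headroom below the rate you actually observe in practice.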
When to use: Nightly eval suites, pre-deployment gates, comparing prompt versions. Expensive to run (N × cost), so typically reserved for scheduled pipelines rather than every PR.
Like snapshot tests, but with a fuzzy match using embedding distance.
Store a reference output. On each test run, compare the new output to the reference using cosine similarity on embeddings or edit distance. If it drifts beyond a threshold, the test fails — alerting you to unexpected behavioral changes.
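A sketch of the comparison step — the toy vectors here stand in for real embeddings of the golden and new outputs, which you'd get from an embedding model:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assert_close_to_snapshot(new_vec, golden_vec, threshold: float = 0.85) -> None:
    # Fuzzy snapshot: fail only if the new output drifts past the threshold.
    sim = cosine_similarity(new_vec, golden_vec)
    assert sim >= threshold, f"output drifted: similarity {sim:.3f} < {threshold}"

# Toy vectors standing in for embeddings of the golden and new outputs.
golden = [0.9, 0.1, 0.3]
new = [0.85, 0.15, 0.35]
assert_close_to_snapshot(new, golden)
```

The threshold is the knob: too tight and routine paraphrasing fails the test, too loose and real regressions slip through. Calibrate it against a handful of known-good and known-bad outputs.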
When to use: Detecting regressions after prompt changes, model upgrades, or system config changes. Especially useful when you have a "golden" output you want to stay close to.
Shrink the non-deterministic surface. Test the deterministic parts normally.
Most systems aren't 100% non-deterministic. The prompt template, the parsing logic, the tool selection, the output formatter — all of these are deterministic and can be tested with standard unit tests. Isolate them, and reserve the fuzzy strategies for the truly random core.
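For instance, the prompt builder and the output parser on either side of the model call are plain deterministic functions — the hypothetical helpers below sketch how they get ordinary unit tests:

```python
def build_prompt(document: str, max_words: int) -> str:
    # Deterministic: same inputs, same prompt. Plain unit tests apply.
    return f"Summarize the following in at most {max_words} words:\n\n{document}"

def parse_bullets(raw: str) -> list[str]:
    # Deterministic post-processing of whatever text the model returned.
    return [line.lstrip("- ").strip()
            for line in raw.splitlines()
            if line.strip().startswith("-")]

# Ordinary assertions — no seeds, no thresholds, no judges needed.
assert "at most 50 words" in build_prompt("some text", 50)
assert parse_bullets("- first\n- second\nnoise") == ["first", "second"]
```

Only the model call in the middle needs the fuzzy strategies above; everything before and after it stays in the fast unit-test lane.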
The smaller you make the non-deterministic surface area, the more of your system you can test with fast, cheap, reliable unit tests — and the more targeted your expensive eval strategies can be.
Mature teams combine all six strategies into a testing pipeline.
The mindset shift: from "assert exact correctness" to "assert behavioral contracts hold within acceptable tolerance."