🎲 🎲 🎲

How Do You Deterministically Test a Non-Deterministic System?

A visual guide to six strategies for building confidence when your system's output is different every time.


Traditional testing assumes the same input always produces the same output. But LLMs, stochastic simulations, and randomized algorithms break that assumption entirely.

When you call add(2, 3), you expect 5. Every time. But when you ask an LLM "summarize this document," you get a different summary on every call. So how do you write a test for that?

Deterministic

add(2, 3)
5
5
5
5

Non-Deterministic

summarize(doc)
"The report covers…"
"This document outlines…"
"Key findings include…"
"In summary, the…"

The key insight: stop testing for exact outputs. Start testing for behavioral contracts.

Strategy 01

Pin the Randomness

Fix the seed. Set temperature to zero. Eliminate variance at the source.

Many non-deterministic systems have a controllable source of randomness. For RNGs, you fix the seed. For LLMs, you set temperature=0 and use a fixed seed parameter. This won't give you perfect reproducibility (floating-point order, GPU scheduling), but it gets you close.

Input
"Summarize X"
+
Seed
42
+
Temp
0.0
Output
≈ Same
Python

# OpenAI-style API with seed pinning
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,
    seed=42,  # pin randomness
    messages=[{"role": "user", "content": prompt}],
)

# Check system_fingerprint to verify the same backend served the request
assert response.system_fingerprint == expected_fingerprint

When to use: CI pipelines where you need reproducible snapshots. Great as a baseline, but don't rely on it alone — model updates will still break your tests.
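For plain RNG-driven code, the simpler case mentioned above, seed pinning is exact rather than approximate. A minimal sketch (`noisy_score` is a made-up stand-in for any randomized component, not a real API):

```python
import random

def noisy_score(doc_length: int, rng: random.Random) -> float:
    # Toy stochastic function standing in for any randomized component
    return doc_length * 0.1 + rng.gauss(0, 1)

# Same seed, same inputs -> bit-identical results, every time
a = noisy_score(100, random.Random(42))
b = noisy_score(100, random.Random(42))
assert a == b
```

Unlike a temperature=0 LLM call, a seeded software RNG is truly deterministic, so exact-output assertions are safe here.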

Strategy 02

Test Properties, Not Values

Assert what must always be true, regardless of the specific output.

Instead of assert output == "expected", define invariants — properties that every valid output must satisfy. The output varies, but the contract doesn't.

📐
Valid JSON
✓ schema match
📏
Length ≤ 200 tokens
✓ 142 tokens
🚫
No PII leaked
✓ clean
🏷️
Contains key entities
✓ 3/3 found
🌡️
Sentiment positive
✓ score: 0.82
🔤
Language = English
✓ en (0.99)
Python

import json
from jsonschema import validate

def test_summary_properties(llm_output):
    # Structure
    data = json.loads(llm_output)         # must be valid JSON
    validate(data, summary_schema)        # must match schema

    # Content guardrails
    assert len(data["summary"]) <= 500    # bounded length
    assert detect_pii(data) == []         # no PII leakage
    assert data["language"] == "en"       # correct language

    # Semantic checks
    for entity in required_entities:
        assert entity in data["summary"]  # key facts present

When to use: Always. This is the foundational layer. Every non-deterministic system should have property tests as guardrails. They catch regressions without being brittle.

Strategy 03

LLM-as-Judge

Use a second model to evaluate the first — with structured rubrics.

Some quality dimensions (coherence, helpfulness, factual accuracy) are hard to reduce to code. So you use another LLM to grade the output against a rubric. This introduces its own variance, but running multiple evaluations and taking the consensus mitigates it.

🤖 Model Under Test
📄 Output
⚖️ Judge Model
Score: 4.2 / 5
Python

rubric = """
Score the following summary from 1-5 on each dimension:
- Accuracy: Are all facts correct?
- Completeness: Are key points covered?
- Coherence: Does it read well?
- Conciseness: Is it appropriately brief?
Respond as JSON: {"accuracy": N, "completeness": N, ...}
"""

# Run the judge 3 times, take the median score per dimension
scores = [judge(rubric, output) for _ in range(3)]
median_scores = median_per_key(scores)
assert all(v >= 3.5 for v in median_scores.values())

When to use: Quality regression testing, comparing model versions, evaluating subjective dimensions. Tools like Braintrust, Promptfoo, and DeepEval formalize this pattern.
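The `median_per_key` helper used above is not a library function; a minimal implementation might look like:

```python
from statistics import median

def median_per_key(score_dicts):
    # Collapse a list of per-run score dicts into a per-dimension median
    return {k: median(d[k] for d in score_dicts) for k in score_dicts[0]}

runs = [
    {"accuracy": 4, "coherence": 5},
    {"accuracy": 5, "coherence": 4},
    {"accuracy": 4, "coherence": 4},
]
assert median_per_key(runs) == {"accuracy": 4, "coherence": 4}
```

Taking the median rather than the mean keeps the gate robust to a single erratic judge run.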

Strategy 04

Statistical Assertions

Run it N times. Assert over the distribution, not any single run.

Treat non-determinism as a first-class citizen. Run the same input through the system multiple times and make assertions about the aggregate: pass rates, score distributions, variance bounds.

Runs 1–10, each scored against a 90% pass threshold
Python

from statistics import mean, pvariance

N = 20
results = [evaluate(prompt, expected_props) for _ in range(N)]

pass_rate = sum(r.passed for r in results) / N
avg_score = mean([r.score for r in results])
variance = pvariance([r.score for r in results])

assert pass_rate >= 0.90, f"Pass rate {pass_rate:.0%} below 90%"
assert avg_score >= 0.85, f"Avg score {avg_score:.2f} below 0.85"
assert variance <= 0.05, f"Variance {variance:.3f} too high"

When to use: Nightly eval suites, pre-deployment gates, comparing prompt versions. Expensive to run (N × cost), so typically reserved for scheduled pipelines rather than every PR.
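One caveat worth quantifying: the gate itself is probabilistic. Assuming runs are independent, the chance of clearing a pass-rate threshold is a binomial tail, computable with the standard library (`p_gate_passes` is an illustrative name, not from any tool):

```python
from math import ceil, comb

def p_gate_passes(p: float, n: int = 20, threshold: float = 0.90) -> float:
    # P(at least ceil(threshold * n) of n i.i.d. runs pass), per-run pass prob p
    k_min = ceil(threshold * n)
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# A system that truly passes 95% of runs clears a 90%-of-20 gate only ~92% of the time
assert 0.92 < p_gate_passes(0.95) < 0.93
```

So a flaky gate can fail a healthy system; pick N and the threshold with that false-alarm rate in mind.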

Strategy 05

Snapshot Testing with Similarity

Like snapshot tests, but with a fuzzy match using embedding distance.

Store a reference output. On each test run, compare the new output to the reference using cosine similarity on embeddings or edit distance. If it drifts beyond a threshold, the test fails — alerting you to unexpected behavioral changes.

Reference (stored)
"The quarterly report shows revenue growth of 12% driven by expansion in APAC markets and the launch of Product X."
0.94 cosine
New output
"Revenue grew 12% this quarter, primarily due to APAC expansion and the new Product X launch."
Python

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

ref_embedding = model.encode(reference_output)
new_embedding = model.encode(new_output)
similarity = util.cos_sim(ref_embedding, new_embedding).item()

assert similarity >= 0.85, (
    f"Output drifted: similarity {similarity:.2f} < 0.85"
)

When to use: Detecting regressions after prompt changes, model upgrades, or system config changes. Especially useful when you have a "golden" output you want to stay close to.

Strategy 06

Decompose & Isolate

Shrink the non-deterministic surface. Test the deterministic parts normally.

Most systems aren't 100% non-deterministic. The prompt template, the parsing logic, the tool selection, the output formatter — all of these are deterministic and can be tested with standard unit tests. Isolate them, and reserve the fuzzy strategies for the truly random core.

Input validation & parsing → Unit tests
Prompt template rendering → Unit tests
Tool / function selection → Unit tests
LLM inference → Property + Eval tests
Output parsing & formatting → Unit tests
Response validation → Unit tests

The smaller you make the non-deterministic surface area, the more of your system you can test with fast, cheap, reliable unit tests — and the more targeted your expensive eval strategies can be.
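As a sketch of what testing a deterministic layer looks like (`render_summary_prompt` and `MAX_DOC_CHARS` are hypothetical names, not from any framework):

```python
MAX_DOC_CHARS = 4000  # truncation limit for the prompt context

def render_summary_prompt(doc: str, max_tokens: int = 200) -> str:
    # Deterministic: same doc in, same prompt out, so ordinary unit tests apply
    if not doc.strip():
        raise ValueError("empty document")
    return (
        f"Summarize the following document in at most {max_tokens} tokens:\n\n"
        + doc[:MAX_DOC_CHARS]
    )

# Fast, exact assertions: no LLM call, no flakiness
prompt = render_summary_prompt("Quarterly revenue grew 12%.")
assert prompt.startswith("Summarize the following document")
assert "Quarterly revenue grew 12%." in prompt
assert len(render_summary_prompt("x" * 10_000)) <= MAX_DOC_CHARS + 100
```

Everything on either side of the model call can be pinned down this tightly; only the inference step in the middle needs the fuzzier strategies above.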

Putting It All Together

A Layered Testing Strategy

Mature teams combine all six strategies into a testing pipeline.

1. Every commit → Unit tests on deterministic layers
2. Every PR → Property tests with pinned seeds
3. Merge to main → Snapshot similarity checks
4. Nightly → Statistical evals + LLM judge sweeps

The mindset shift: from "assert exact correctness" to "assert behavioral contracts hold within acceptable tolerance."

Toolbox

Tools That Help

Promptfoo
Open-source LLM eval framework. Define test cases, assertions, and run against multiple providers.
Braintrust
Eval platform with scoring, tracing, and dataset management. Great for LLM-as-judge workflows.
DeepEval
Pytest-style LLM testing. Built-in metrics for hallucination, bias, toxicity, and relevancy.
Hypothesis
Python property-based testing library. Generate random inputs and check invariants hold.
RAGAS
Framework for evaluating RAG pipelines. Measures faithfulness, relevancy, and context precision.
LangSmith
LangChain's tracing and eval platform. Track chains, compare runs, annotate with human feedback.