A visual guide to six strategies for building confidence when your system's output is different every time.
Traditional testing assumes the same input always produces the same output. But LLMs, stochastic simulations, and randomized algorithms break that assumption entirely.
When you call add(2, 3), you expect 5. Every time. But when you ask an LLM "summarize this document," you can get a different summary on every call. So how do you write a test for that?
The key insight: stop testing for exact outputs. Start testing for behavioral contracts.
Fix the seed. Set temperature to zero. Eliminate variance at the source.
Many non-deterministic systems have a controllable source of randomness. For RNGs, you fix the seed. For LLMs, you set temperature=0 and use a fixed seed parameter where the provider supports one. This won't give you perfect reproducibility (floating-point operation order and GPU scheduling can still introduce drift), but it gets you close.
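A minimal sketch of the idea using Python's standard RNG — the sample_retry_jitter function is a hypothetical stand-in for any randomized component; for an LLM you'd pass temperature=0 and a seed instead:

```python
import random

def sample_retry_jitter(rng: random.Random, base: float = 1.0) -> float:
    # "Random" jitter, but fully determined by the RNG's seed.
    return base * rng.uniform(0.5, 1.5)

def test_jitter_is_reproducible():
    # Two RNGs with the same seed produce identical sequences,
    # so the "non-deterministic" function becomes testable exactly.
    a = [sample_retry_jitter(random.Random(42)) for _ in range(3)]
    b = [sample_retry_jitter(random.Random(42)) for _ in range(3)]
    # Wait — same seed per call means identical values; compare sequences too:
    rng1, rng2 = random.Random(7), random.Random(7)
    seq1 = [sample_retry_jitter(rng1) for _ in range(3)]
    seq2 = [sample_retry_jitter(rng2) for _ in range(3)]
    assert a == b
    assert seq1 == seq2

test_jitter_is_reproducible()
```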
When to use: CI pipelines where you need reproducible snapshots. Great as a baseline, but don't rely on it alone — model updates will still break your tests.
Assert what must always be true, regardless of the specific output.
Instead of assert output == "expected", define invariants — properties that every valid output must satisfy. The output varies, but the contract doesn't.
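For example, a summarizer's contract might be "non-empty, shorter than the input, no meta-commentary boilerplate." Here's a sketch — fake_summarize is a hypothetical stand-in for the real LLM call, included only so the contract check can run:

```python
def check_summary_contract(document: str, summary: str) -> None:
    # Invariants every valid summary must satisfy, whatever its exact wording.
    assert summary.strip(), "summary must not be empty"
    assert len(summary) < len(document), "summary must be shorter than the source"
    assert "as an ai" not in summary.lower(), "no meta-commentary boilerplate"

def fake_summarize(document: str) -> str:
    # Hypothetical stand-in for an LLM call; returns the first sentence.
    return document.split(".")[0] + "."

doc = "Property tests assert invariants. They catch regressions without brittleness."
check_summary_contract(doc, fake_summarize(doc))
```

The same check_summary_contract runs unchanged against any model or prompt version, because it never mentions the expected text.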
When to use: Always. This is the foundational layer. Every non-deterministic system should have property tests as guardrails. They catch regressions without being brittle.
Use a second model to evaluate the first — with structured rubrics.
Some quality dimensions (coherence, helpfulness, factual accuracy) are hard to reduce to code. So you use another LLM to grade the output against a rubric. This introduces its own variance, but running multiple evaluations and taking the consensus mitigates it.
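A sketch of the consensus step — make_noisy_judge is a hypothetical grader that's right 80% of the time (a real one would call a second model with a rubric prompt), and the majority vote damps its noise:

```python
import random
from collections import Counter

def consensus_grade(judge, output: str, n: int = 5) -> str:
    # Run the judge n times and take the majority label,
    # so a single noisy grading doesn't decide the test.
    votes = Counter(judge(output) for _ in range(n))
    return votes.most_common(1)[0][0]

def make_noisy_judge(seed: int):
    # Hypothetical judge: correct 80% of the time, to mimic grader variance.
    rng = random.Random(seed)
    def judge(output: str) -> str:
        correct = "pass" if "summary" in output else "fail"
        if rng.random() < 0.8:
            return correct
        return "fail" if correct == "pass" else "pass"
    return judge

assert consensus_grade(make_noisy_judge(0), "a coherent summary") == "pass"
```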
When to use: Quality regression testing, comparing model versions, evaluating subjective dimensions. Tools like Braintrust, Promptfoo, and DeepEval formalize this pattern.
Run it N times. Assert over the distribution, not any single run.
Treat non-determinism as a first-class citizen. Run the same input through the system multiple times and make assertions about the aggregate: pass rates, score distributions, variance bounds.
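A minimal sketch of a pass-rate gate — flaky_system is a hypothetical pipeline that succeeds about 90% of the time, and the test asserts on the aggregate rather than any single call:

```python
import random

def flaky_system(rng: random.Random) -> bool:
    # Stand-in for a non-deterministic pipeline that succeeds ~90% of the time.
    return rng.random() < 0.9

def test_pass_rate(n: int = 500, threshold: float = 0.8) -> None:
    rng = random.Random(7)
    passes = sum(flaky_system(rng) for _ in range(n))
    # Assert over the distribution, not any single run.
    assert passes / n >= threshold, f"pass rate {passes / n:.2f} below {threshold}"

test_pass_rate()
```

Picking n and the threshold is a statistics problem: n large enough that sampling noise can't flip the verdict, and a threshold with headroom below the rate you actually observe in practice.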
When to use: Nightly eval suites, pre-deployment gates, comparing prompt versions. Expensive to run (N × cost), so typically reserved for scheduled pipelines rather than every PR.
Like snapshot tests, but with a fuzzy match using embedding distance.
Store a reference output. On each test run, compare the new output to the reference using cosine similarity on embeddings or edit distance. If it drifts beyond a threshold, the test fails — alerting you to unexpected behavioral changes.
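A sketch of the comparison step — the toy vectors here stand in for real embeddings of the golden and new outputs, which you'd get from an embedding model:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def assert_close_to_snapshot(new_vec, golden_vec, threshold: float = 0.85) -> None:
    # Fuzzy snapshot: fail only if the new output drifts past the threshold.
    sim = cosine_similarity(new_vec, golden_vec)
    assert sim >= threshold, f"output drifted: similarity {sim:.3f} < {threshold}"

# Toy vectors standing in for embeddings of the golden and new outputs.
golden = [0.9, 0.1, 0.3]
new = [0.85, 0.15, 0.35]
assert_close_to_snapshot(new, golden)
```

The threshold is the knob: too tight and routine paraphrasing fails the test, too loose and real regressions slip through. Calibrate it against a handful of known-good and known-bad outputs.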
When to use: Detecting regressions after prompt changes, model upgrades, or system config changes. Especially useful when you have a "golden" output you want to stay close to.
Shrink the non-deterministic surface. Test the deterministic parts normally.
Most systems aren't 100% non-deterministic. The prompt template, the parsing logic, the tool selection, the output formatter — all of these are deterministic and can be tested with standard unit tests. Isolate them, and reserve the fuzzy strategies for the truly random core.
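For instance, the prompt builder and the output parser on either side of the model call are plain deterministic functions — the hypothetical helpers below sketch how they get ordinary unit tests:

```python
def build_prompt(document: str, max_words: int) -> str:
    # Deterministic: same inputs, same prompt. Plain unit tests apply.
    return f"Summarize the following in at most {max_words} words:\n\n{document}"

def parse_bullets(raw: str) -> list[str]:
    # Deterministic post-processing of whatever text the model returned.
    return [line.lstrip("- ").strip()
            for line in raw.splitlines()
            if line.strip().startswith("-")]

# Ordinary assertions — no seeds, no thresholds, no judges needed.
assert "at most 50 words" in build_prompt("some text", 50)
assert parse_bullets("- first\n- second\nnoise") == ["first", "second"]
```

Only the model call in the middle needs the fuzzy strategies above; everything before and after it stays in the fast unit-test lane.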
The smaller you make the non-deterministic surface area, the more of your system you can test with fast, cheap, reliable unit tests — and the more targeted your expensive eval strategies can be.
Mature teams combine all six strategies into a testing pipeline.
The mindset shift: from "assert exact correctness" to "assert behavioral contracts hold within acceptable tolerance."