LLM-as-a-Judge
Semantic Similarity
Claude evaluates whether an AI response semantically matches the expected output.
How this works
Mechanism
Claude acts as a semantic judge, not a string comparator. You provide an expected output and an actual output; Claude scores their semantic similarity on a 0–1 scale and explains where they diverge. A similarity score ≥ 0.75 is a PASS.
// Judge system prompt (excerpt from /api/validate)
You are an expert evaluator assessing AI output quality.
Compare the expected and actual outputs semantically.
Return JSON: { similarity_score: 0-1, verdict: "pass"|"fail",
reasoning: string, differences: string[] }
Threshold for pass: similarity_score >= 0.75

QA use cases
Use this to validate AI-generated test assertions, chatbot responses, or any system where exact string matching would be too brittle. The 3 built-in examples show: (1) semantically equivalent error messages that should PASS, (2) different outcomes that should FAIL, and (3) button label variations that should PASS.
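As a minimal sketch of the consuming side, the helper below parses a judge response in the JSON schema shown in the prompt excerpt and enforces the 0.75 pass threshold. The function name and the canned response are illustrative; in practice the JSON string would come from a real model call.

```python
import json

PASS_THRESHOLD = 0.75  # pass threshold stated in the judge prompt

def assert_semantic_match(judge_response: str) -> dict:
    """Parse the judge's JSON verdict and enforce the pass threshold.

    `judge_response` is the raw JSON string returned by the judge;
    field names follow the schema in the prompt excerpt above.
    """
    result = json.loads(judge_response)
    if result["similarity_score"] < PASS_THRESHOLD:
        raise AssertionError(
            f"semantic mismatch (score={result['similarity_score']}): "
            f"{result['reasoning']}"
        )
    return result

# Example: two equivalent error messages judged as a pass.
# (Canned response standing in for a live judge call.)
canned = json.dumps({
    "similarity_score": 0.91,
    "verdict": "pass",
    "reasoning": "Both messages report invalid credentials.",
    "differences": ["word order", "politeness"],
})
verdict = assert_semantic_match(canned)
print(verdict["verdict"])  # pass
```

Raising `AssertionError` on a sub-threshold score lets the helper drop straight into an existing pytest suite in place of a brittle `==` assertion.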
Why this matters for QE: traditional assertions break on word order, synonyms, or whitespace. LLM-as-a-Judge catches regressions that matter — when the meaning changes — while ignoring cosmetic differences.
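The request side can be sketched as follows. The chat-style payload shape (a system prompt plus one user turn carrying the expected/actual pair) is an assumption for illustration, not the actual wire format of the `/api/validate` endpoint; adapt the field names to whatever API serves the judge.

```python
# Judge system prompt, reproduced from the excerpt above.
JUDGE_SYSTEM_PROMPT = """\
You are an expert evaluator assessing AI output quality.
Compare the expected and actual outputs semantically.
Return JSON: { similarity_score: 0-1, verdict: "pass"|"fail",
reasoning: string, differences: string[] }"""

def build_judge_request(expected: str, actual: str) -> dict:
    """Assemble a chat-style request body for the judge model.

    The system/messages shape is assumed; only the prompt text and
    the expected/actual pairing come from the document.
    """
    return {
        "system": JUDGE_SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": (
                f"Expected output:\n{expected}\n\n"
                f"Actual output:\n{actual}"
            ),
        }],
    }

# Two phrasings of the same failure: exact string equality rejects
# them, while the judge is asked to compare them semantically.
req = build_judge_request(
    "Error: invalid credentials provided.",
    "The credentials you entered are not valid.",
)
print(req["messages"][0]["content"].startswith("Expected output:"))
```

Keeping the system prompt fixed and varying only the user turn means the judge's scoring rubric stays identical across every test case.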