LLM-as-a-Judge
Semantic Similarity
Claude evaluates whether an AI response semantically matches the expected output.
How this works
Mechanism
Claude acts as a semantic judge, not a string comparator. You provide an expected output and an actual output; Claude scores their semantic similarity on a 0–1 scale and explains where they diverge. A similarity score ≥ 0.75 is a PASS.
// Judge system prompt (excerpt from /api/validate)
You are an expert evaluator assessing AI output quality.
Compare the expected and actual outputs semantically.
Return JSON: { similarity_score: 0-1, verdict: "pass"|"fail",
reasoning: string, differences: string[] }
Threshold for pass: similarity_score >= 0.75

QA use cases
Use this to validate AI-generated test assertions, chatbot responses, or any system where exact string matching would be too brittle. The 3 built-in examples show: (1) semantically equivalent error messages that should PASS, (2) different outcomes that should FAIL, and (3) button label variations that should PASS.
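As a minimal sketch of the consuming side, the helper below parses a judge response in the JSON schema shown in the prompt excerpt and enforces the 0.75 pass threshold. The function name and the canned response are illustrative; in practice the JSON string would come from a real model call.

```python
import json

PASS_THRESHOLD = 0.75  # pass threshold stated in the judge prompt

def assert_semantic_match(judge_response: str) -> dict:
    """Parse the judge's JSON verdict and enforce the pass threshold.

    `judge_response` is the raw JSON string returned by the judge;
    field names follow the schema in the prompt excerpt above.
    """
    result = json.loads(judge_response)
    if result["similarity_score"] < PASS_THRESHOLD:
        raise AssertionError(
            f"semantic mismatch (score={result['similarity_score']}): "
            f"{result['reasoning']}"
        )
    return result

# Example: two equivalent error messages judged as a pass.
# (Canned response standing in for a live judge call.)
canned = json.dumps({
    "similarity_score": 0.91,
    "verdict": "pass",
    "reasoning": "Both messages report invalid credentials.",
    "differences": ["word order", "politeness"],
})
verdict = assert_semantic_match(canned)
print(verdict["verdict"])  # pass
```

Raising `AssertionError` on a sub-threshold score lets the helper drop straight into an existing pytest suite in place of a brittle `==` assertion.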
Why this matters for QE: traditional assertions break on word order, synonyms, or whitespace. LLM-as-a-Judge catches regressions that matter — when the meaning changes — while ignoring cosmetic differences.
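The request side can be sketched as follows. The chat-style payload shape (a system prompt plus one user turn carrying the expected/actual pair) is an assumption for illustration, not the actual wire format of the `/api/validate` endpoint; adapt the field names to whatever API serves the judge.

```python
# Judge system prompt, reproduced from the excerpt above.
JUDGE_SYSTEM_PROMPT = """\
You are an expert evaluator assessing AI output quality.
Compare the expected and actual outputs semantically.
Return JSON: { similarity_score: 0-1, verdict: "pass"|"fail",
reasoning: string, differences: string[] }"""

def build_judge_request(expected: str, actual: str) -> dict:
    """Assemble a chat-style request body for the judge model.

    The system/messages shape is assumed; only the prompt text and
    the expected/actual pairing come from the document.
    """
    return {
        "system": JUDGE_SYSTEM_PROMPT,
        "messages": [{
            "role": "user",
            "content": (
                f"Expected output:\n{expected}\n\n"
                f"Actual output:\n{actual}"
            ),
        }],
    }

# Two phrasings of the same failure: exact string equality rejects
# them, while the judge is asked to compare them semantically.
req = build_judge_request(
    "Error: invalid credentials provided.",
    "The credentials you entered are not valid.",
)
print(req["messages"][0]["content"].startswith("Expected output:"))
```

Keeping the system prompt fixed and varying only the user turn means the judge's scoring rubric stays identical across every test case.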