While building our platform for automating unstructured data workflows, we faced an interesting technical challenge: How can we efficiently validate output data across diverse domains, formats, and types? Rule-based validation systems are effective at catching obvious errors, but struggle with context-dependent issues. LLMs offer a promising solution.
With their nuanced understanding of language and context, LLMs can evaluate data quality in ways that closely mimic human judgment.
Our goal was to combine traditional checks with the adaptability and contextual understanding of LLMs into a hybrid approach as illustrated below.
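Conceptually, the hybrid flow runs cheap deterministic checks first and hands only the surviving records to an LLM judge. A minimal sketch of that flow (runRuleChecks is a hypothetical placeholder; evaluateOutput is the judge function built up in the steps below):

async function validateOutput(output, schema, referenceAnswer) {
  // Cheap, deterministic rule checks catch the obvious errors first
  const ruleIssues = runRuleChecks(output, schema); // hypothetical rule-based validator
  if (ruleIssues.length > 0) {
    return { pass: false, source: 'rules', issues: ruleIssues };
  }

  // Context-dependent issues are handed off to the LLM judge
  const verdict = await evaluateOutput(output, schema, referenceAnswer);
  return {
    pass: Object.values(verdict).every((criterion) => criterion.pass),
    source: 'llm-judge',
    verdict,
  };
}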
Here's a step-by-step guide to implementing an LLM-as-a-judge system. The first step is defining clear evaluation criteria:
const criteria = {
  coherence:
    'Evaluate the overall structure and logical format of the output data.',
  consistency: 'Check for factual alignment and consistency across records.',
  accuracy:
    'Assess how well the output matches the schema and follows the pattern of the expected data.',
  completeness:
    'Ensure all required fields are present and populated appropriately.',
}
The prompt is crucial and should include the criteria definitions, the output data to be evaluated, the expected schema, a reference answer for comparison, and the required response format (a pass/fail verdict with a short explanation for each criterion).
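For illustration, the prompt could be assembled along these lines (buildPrompt is a simplified stand-in, not our exact production prompt):

function buildPrompt(criteria, llmOutput, schema, referenceAnswer) {
  // Turn the criteria object into a readable bullet list for the judge
  const criteriaList = Object.entries(criteria)
    .map(([name, description]) => `- ${name}: ${description}`)
    .join('\n');

  return `You are a strict data quality judge. Evaluate the output below against each criterion.

Criteria:
${criteriaList}

Expected schema:
${JSON.stringify(schema, null, 2)}

Reference answer:
${JSON.stringify(referenceAnswer, null, 2)}

Output to evaluate:
${JSON.stringify(llmOutput, null, 2)}

Respond with JSON only, in the form:
{ "<criterion>": { "pass": true or false, "explanation": "<one or two sentences>" } }`;
}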
Here's a simplified version of our implementation:
async function evaluateOutput(llmOutput, schema, referenceAnswer) {
  // Construct prompt
  // Make API call to LLM
  // Parse and return the response
}
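Filled in, the function might look like the sketch below. It assumes the official OpenAI Node SDK, JSON mode, and the buildPrompt helper from above; the model name is a placeholder, and any chat-completion API would work just as well.

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function evaluateOutput(llmOutput, schema, referenceAnswer) {
  // Construct the evaluation prompt from the criteria, schema, reference answer, and output
  const prompt = buildPrompt(criteria, llmOutput, schema, referenceAnswer);

  // Make the API call, requesting a JSON object so the verdict is machine-readable
  const response = await openai.chat.completions.create({
    model: 'gpt-4o', // placeholder model name
    temperature: 0,
    response_format: { type: 'json_object' },
    messages: [{ role: 'user', content: prompt }],
  });

  // Parse and return the per-criterion pass/fail verdicts
  return JSON.parse(response.choices[0].message.content);
}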
To ensure reliability, implement a consistency check that runs the evaluation multiple times and verifies that the verdicts agree:
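A minimal sketch of such a check, reusing evaluateOutput from above (the run count and majority rule are arbitrary choices to tune for your workload):

async function evaluateWithConsistencyCheck(llmOutput, schema, referenceAnswer, runs = 3) {
  // Run the same evaluation several times
  const evaluations = await Promise.all(
    Array.from({ length: runs }, () => evaluateOutput(llmOutput, schema, referenceAnswer))
  );

  // For each criterion, take the majority pass/fail verdict and record how strongly the runs agree
  const result = {};
  for (const criterion of Object.keys(criteria)) {
    const votes = evaluations.map((evaluation) => evaluation[criterion]?.pass === true);
    const passVotes = votes.filter(Boolean).length;
    result[criterion] = {
      pass: passVotes > runs / 2,
      agreement: Math.max(passVotes, runs - passVotes) / runs, // 1.0 = unanimous
      explanation: evaluations[0][criterion]?.explanation,
    };
  }
  return result;
}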
Use the output to make decisions about your data's validity and quality.
Sample result:
{
  "link": "https://www.kadoa.com/workflow/XYZ",
  "duration": 164.236,
  "result": {
    "coherence": {
      "pass": true,
      "explanation": "The data is structured as an array of objects, each containing the required fields. The format is consistent across all entries, maintaining a logical and readable structure."
    },
    "consistency": {
      "pass": true,
      "explanation": "The data is consistent in terms of field types and expected content. Each entry follows the same pattern, and there are no discrepancies in the data format or logical flow."
    },
    "accuracy": {
      "pass": false,
      "explanation": "While most fields match the expected data types, the 'author' field in one entry is empty, which does not align with the schema's expectation of a non-empty string. This affects the accuracy of the data."
    },
    "completeness": {
      "pass": false,
      "explanation": "All required fields are present in each entry, but the 'author' field is not populated in one entry, which violates the requirement for all fields to be appropriately populated."
    }
  }
}
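Downstream, these verdicts can drive simple gating logic, for example accepting a record only when every criterion passes and flagging everything else for review (the actions here are illustrative):

function decideAction(result) {
  // Collect the criteria the judge failed, along with its explanations
  const failed = Object.entries(result)
    .filter(([, verdict]) => !verdict.pass)
    .map(([criterion, verdict]) => ({ criterion, reason: verdict.explanation }));

  return failed.length === 0
    ? { action: 'accept', failed: [] }
    : { action: 'flag-for-review', failed };
}

Applied to the result object of the sample above, the record would be flagged because accuracy and completeness failed.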
We initially used numeric scores (e.g. a 1-10 scale) for each criterion. While numeric scores offer more granularity, our first experiments exposed several limitations:
Inconsistency: LLMs often struggle to maintain consistent interpretations of numeric scales across different evaluations, and the score distribution was usually very uneven.
Reasoning limitations: LLMs may have difficulty justifying small differences between close numeric scores (e.g., 7 vs 8).
Calibration issues: Different LLMs or even different versions of the same LLM may interpret numeric scales differently, making comparisons difficult.
In contrast, a binary scale is less ambiguous, more consistent across evaluations, and easier to interpret and compare. Given these considerations, we opted for binary pass/fail verdicts.
An interesting Twitter thread offers a more in-depth analysis: numerical judges can still work, but they require more time spent designing and tuning the eval. The creator of Prometheus (a small judge LLM) notes that scoring decisions can be made more precise by adding explanations for each score or range and letting the model generate language feedback before scoring.
This approach improved our data validation and error detection and reduced data cleaning time. We'll continue to refine our use of LLMs as judges and might even fine-tune the models on domain-specific data to improve accuracy further.