While building our platform for automating unstructured data workflows, we faced an interesting technical challenge: How can we efficiently validate output data across diverse domains, formats, and types? Rule-based validation systems are effective at catching obvious errors, but struggle with context-dependent issues. LLMs offer a promising solution.
With their nuanced understanding of language and context, LLMs can evaluate data quality in ways that closely mimic human judgment.
Our goal was to combine traditional checks with the contextual understanding of LLMs into a hybrid approach as illustrated below.
Here's a step-by-step guide to implementing an LLM-as-a-judge system. Start by defining the evaluation criteria:
const criteria = {
  coherence: 'Evaluate the overall structure and logical format of the output data.',
  consistency: 'Check for factual alignment and consistency across records.',
  accuracy: 'Assess how well the output matches the schema and follows the pattern of the expected data.',
  completeness: 'Ensure all required fields are present and populated appropriately.',
}
The prompt is crucial and should include:
The evaluation criteria defined above.
The output data to be judged, together with the expected schema and a reference answer when one is available.
Clear instructions to return a structured verdict (pass/fail plus a short explanation) for each criterion.
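As a minimal sketch, such a prompt can be assembled from the pieces above. The buildEvaluationPrompt name and the exact wording are illustrative, not our production prompt:

```typescript
function buildEvaluationPrompt(
  llmOutput: unknown,
  schema: unknown,
  referenceAnswer: unknown,
): string {
  // Turn each criterion into an explicit instruction for the judge model
  const criteriaList = Object.entries(criteria)
    .map(([name, description]) => `- ${name}: ${description}`)
    .join('\n')

  return [
    'You are a strict data quality judge.',
    `Evaluate the output below against these criteria:\n${criteriaList}`,
    `Expected schema:\n${JSON.stringify(schema, null, 2)}`,
    `Reference answer:\n${JSON.stringify(referenceAnswer, null, 2)}`,
    `Output to evaluate:\n${JSON.stringify(llmOutput, null, 2)}`,
    'Respond with JSON: { "<criterion>": { "pass": boolean, "explanation": string } } for every criterion.',
  ].join('\n\n')
}
```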
Here's a simplified version of our implementation:
async function evaluateOutput(llmOutput, schema, referenceAnswer) {
  // Construct prompt
  // Make API call to LLM
  // Parse and return the response
}
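Filled in, the function could look roughly like this. The OpenAI Node SDK, the gpt-4o model choice, and the buildEvaluationPrompt helper from the sketch above are assumptions for illustration, not necessarily what runs in production:

```typescript
import OpenAI from 'openai'

const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

async function evaluateOutput(llmOutput: unknown, schema: unknown, referenceAnswer: unknown) {
  // Construct the judge prompt from the criteria, schema, reference answer, and output
  const prompt = buildEvaluationPrompt(llmOutput, schema, referenceAnswer)

  // Ask the judge model for a structured JSON verdict
  const response = await openai.chat.completions.create({
    model: 'gpt-4o', // hypothetical choice of judge model
    messages: [{ role: 'user', content: prompt }],
    response_format: { type: 'json_object' },
    temperature: 0,
  })

  // Parse and return the per-criterion pass/fail verdicts
  return JSON.parse(response.choices[0].message.content ?? '{}')
}
```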
To ensure reliability, run the same evaluation multiple times and check that the verdicts agree, as sketched below.
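A minimal sketch of such a check, building on the evaluateOutput function above; the number of runs and the majority rule are illustrative choices:

```typescript
async function evaluateWithConsistencyCheck(
  llmOutput: unknown,
  schema: unknown,
  referenceAnswer: unknown,
  runs = 3,
) {
  // Run the same evaluation several times
  const results = await Promise.all(
    Array.from({ length: runs }, () => evaluateOutput(llmOutput, schema, referenceAnswer)),
  )

  // For each criterion, take the majority pass/fail verdict and flag disagreement
  const verdict: Record<string, { pass: boolean; consistent: boolean }> = {}
  for (const criterion of Object.keys(criteria)) {
    const passes = results.filter((r) => r[criterion]?.pass).length
    verdict[criterion] = {
      pass: passes > runs / 2,
      consistent: passes === 0 || passes === runs,
    }
  }
  return verdict
}
```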
Finally, use the judge's output to decide what to do with the data: accept it, flag it for review, or reject it (a small decision sketch follows the sample result below).
Sample result:
{
  "link": "https://www.kadoa.com/workflow/XYZ",
  "duration": 164.236,
  "result": {
    "coherence": {
      "pass": true,
      "explanation": "The data is structured as an array of objects, each containing the required fields. The format is consistent across all entries, maintaining a logical and readable structure."
    },
    "consistency": {
      "pass": true,
      "explanation": "The data is consistent in terms of field types and expected content. Each entry follows the same pattern, and there are no discrepancies in the data format or logical flow."
    },
    "accuracy": {
      "pass": false,
      "explanation": "While most fields match the expected data types, the 'author' field in one entry is empty, which does not align with the schema's expectation of a non-empty string. This affects the accuracy of the data."
    },
    "completeness": {
      "pass": false,
      "explanation": "All required fields are present in each entry, but the 'author' field is not populated in one entry, which violates the requirement for all fields to be appropriately populated."
    }
  }
}
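To make the decision step concrete, here is a small sketch that consumes a result like the one above. Treating accuracy and completeness failures as review-worthy rather than fatal is an illustrative policy, not a fixed rule:

```typescript
type CriterionResult = { pass: boolean; explanation: string }
type EvaluationResult = Record<string, CriterionResult>

function decide(result: EvaluationResult): 'accept' | 'review' | 'reject' {
  // Collect the names of all failed criteria
  const failed = Object.entries(result)
    .filter(([, r]) => !r.pass)
    .map(([name]) => name)

  if (failed.length === 0) return 'accept'
  // Field-level problems are worth a human look; structural failures are rejected
  if (failed.every((name) => name === 'accuracy' || name === 'completeness')) return 'review'
  return 'reject'
}

// With the sample result above, accuracy and completeness fail, so this returns 'review'.
```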
We initially used numeric scores (e.g., 1-10 scales) for each criterion and ran systematic tests across various LLMs, injecting controlled errors into the data and asking each model to output a numeric score reflecting error severity. Instead of a smooth, predictable gradient, the scores clustered randomly.
The distributions also varied wildly between different models or even different score ranges (e.g., -1 to 1, 0 to 1, or 1 to 10).
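A rough sketch of the kind of harness this implies; the corruptRecords helper, the severity prompt, and the model choice are hypothetical stand-ins, not our actual test code:

```typescript
// Hypothetical harness: corrupt a known fraction of records, then ask the judge
// for a 1-10 severity score and see whether scores track the error rate.
async function scoreVersusErrorRate(cleanRecords: object[]) {
  const results: { errorRate: number; score: number }[] = []

  for (const errorRate of [0, 0.05, 0.1, 0.2, 0.4]) {
    const corrupted = corruptRecords(cleanRecords, errorRate) // hypothetical helper
    const prompt =
      `Rate the severity of data quality issues in the following records ` +
      `on a scale of 1 (no issues) to 10 (unusable):\n${JSON.stringify(corrupted)}`

    // Reuses the openai client from the earlier sketch
    const response = await openai.chat.completions.create({
      model: 'gpt-4o', // hypothetical judge model
      messages: [{ role: 'user', content: prompt }],
      temperature: 0,
    })

    results.push({ errorRate, score: Number(response.choices[0].message.content) })
  }
  return results // ideally monotonic in errorRate
}
```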
While numeric scores offer more granularity, these first experiments exposed several limitations:
Inconsistency: LLMs often struggle to maintain consistent interpretations of numeric scales across different evaluations. The distribution was usually very uneven.
Reasoning limitations: LLMs may have difficulty justifying small differences between close numeric scores (e.g., 7 vs 8).
Calibration issues: Calibrating consistent numeric output across models and score ranges is challenging because LLMs interpret scoring instructions differently. One model might label nearly everything a 10 ("worst") even when the injected error rates differed only slightly.
In contrast, a binary scale (pass/fail) is less ambiguous, more consistent across evaluations, and easier to interpret and compare.
Numeric scoring might become more reliable if you:
Define precisely what each score means, ideally anchoring every level with example outputs.
Keep the scale small so that adjacent scores remain meaningfully different.
Calibrate and test the scale separately for each domain and each model you use.
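For example, an anchored rubric for a single criterion might look like this sketch; the level definitions are assumptions, not a validated rubric:

```typescript
// Each numeric score is tied to an explicit, testable description so the
// judge model has less room to interpret the scale on its own.
const accuracyRubric = `
Score the accuracy of the output from 1 to 4 using exactly these definitions:
1 - Matches the schema and the reference answer on every field.
2 - Matches the schema; minor deviations from the reference answer.
3 - One or more fields violate the schema or contradict the reference answer.
4 - Most fields are missing, malformed, or contradict the reference answer.
Return only the number.
`
```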
Numeric scoring is tempting but often less stable for production use. A binary pass/fail approach is more robust unless you invest the time into meticulously defining and testing numeric thresholds for a specific domain.
This hybrid approach improved our data validation and error detection and reduced data cleaning time. We'll continue to refine our use of LLMs as judges and may fine-tune the models on domain-specific data to improve accuracy further.