
Using LLM as a Judge for Evaluation

Adrian Krebs, Co-Founder & CEO of Kadoa

Why Use an LLM as a Judge?

While building our platform for automating unstructured data workflows, we faced an interesting technical challenge: How can we efficiently validate output data across diverse domains, formats, and types? Rule-based validation systems are effective at catching obvious errors, but struggle with context-dependent issues. LLMs offer a promising solution.

With their nuanced understanding of language and context, LLMs can evaluate data quality in ways that closely mimic human judgment.

Our goal was to combine traditional checks with the adaptability and contextual understanding of LLMs into a hybrid approach as illustrated below.

[Diagram: hybrid data validation combining rule-based checks with an LLM judge]
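In code terms, the hybrid flow might look roughly like the sketch below. validateRecords and passesRuleChecks are hypothetical helper names, the schema shape is an assumption, and evaluateOutput is the LLM judge function implemented later in this post; the rule shown is just one example of a deterministic check.

async function validateRecords(records, schema, referenceAnswer) {
  // Cheap, deterministic rule checks run first and catch the obvious errors
  const ruleFailures = records.filter((record) => !passesRuleChecks(record, schema));
  if (ruleFailures.length > 0) {
    return { pass: false, stage: 'rules', failures: ruleFailures };
  }

  // Only records that pass the rules are escalated to the slower, costlier LLM judge
  const judgement = await evaluateOutput(records, schema, referenceAnswer);
  const pass = Object.values(judgement).every((criterion) => criterion.pass);
  return { pass, stage: 'llm', judgement };
}

// Example rule: every required field must be present and non-empty
function passesRuleChecks(record, schema) {
  return Object.entries(schema).every(
    ([field, spec]) => !spec.required || (record[field] !== undefined && record[field] !== '')
  );
}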

Implementing LLM as a Judge

Here's a step-by-step guide to implementing an LLM-as-a-judge system:

  1. Define Your Evaluation Criteria
  • Start by clearly defining what you're validating. In our case, we use the following criteria:
const criteria = {
  coherence:
    'Evaluate the overall structure and logical format of the output data.',
  consistency: 'Check for factual alignment and consistency across records.',
  accuracy:
    'Assess how well the output matches the schema and follows the pattern of the expected data.',
  completeness:
    'Ensure all required fields are present and populated appropriately.',
}
  2. Prepare your inputs. You'll need the following (an illustrative example follows this list):
  • The data to be validated
  • A schema or structure definition
  • A reference answer (if applicable)
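For illustration, the three inputs for a hypothetical article-extraction workflow might look like this (field names and values are made up):

// The data to be validated (here: the extraction output)
const llmOutput = [
  { title: 'Q3 earnings beat expectations', author: 'Jane Doe', date: '2024-10-01' },
  { title: 'New product launch announced', author: '', date: '2024-10-02' },
];

// Schema / structure definition the output must follow
const schema = {
  title: { type: 'string', required: true },
  author: { type: 'string', required: true },
  date: { type: 'string', required: true },
};

// Reference answer to compare against, if one is available
const referenceAnswer = [
  { title: 'Q3 earnings beat expectations', author: 'Jane Doe', date: '2024-10-01' },
];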
  3. Prompt engineering

The prompt is crucial and should include the following (a sketch of one way to assemble it follows the list):

  • Clear instructions
  • The evaluation criteria
  • All necessary reference materials
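A minimal sketch of one possible prompt builder, reusing the criteria object from step 1 (the wording is illustrative, not our production prompt):

// Build the judge prompt from instructions, criteria, schema, and reference material
function buildPrompt(llmOutput, schema, referenceAnswer, criteria) {
  const criteriaList = Object.entries(criteria)
    .map(([name, description]) => `- ${name}: ${description}`)
    .join('\n');

  return [
    'You are a strict data-quality judge.',
    'Evaluate the output data against each criterion and respond with JSON of the form',
    '{ "<criterion>": { "pass": true|false, "explanation": "..." } }.',
    `Evaluation criteria:\n${criteriaList}`,
    `Schema:\n${JSON.stringify(schema, null, 2)}`,
    referenceAnswer ? `Reference answer:\n${JSON.stringify(referenceAnswer, null, 2)}` : '',
    `Output data to evaluate:\n${JSON.stringify(llmOutput, null, 2)}`,
  ].join('\n\n');
}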
  4. Implement the validation function

Here's a simplified version of our implementation:

async function evaluateOutput(llmOutput, schema, referenceAnswer) {
  // Construct prompt
  // Make API call to LLM
  // Parse and return the response
}
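Fleshed out, the function could look roughly like this. The sketch assumes the OpenAI Node SDK, JSON-mode output, and the buildPrompt helper and criteria object from the previous steps; the model name is just an example, so swap in whatever client and model you actually use.

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function evaluateOutput(llmOutput, schema, referenceAnswer) {
  // Construct prompt
  const prompt = buildPrompt(llmOutput, schema, referenceAnswer, criteria);

  // Make API call to LLM; JSON mode keeps the verdict machine-readable
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    temperature: 0,
    response_format: { type: 'json_object' },
    messages: [{ role: 'user', content: prompt }],
  });

  // Parse and return the response
  return JSON.parse(response.choices[0].message.content);
}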
  5. Add consistency checks

To ensure reliability, run the evaluation multiple times and check that the verdicts agree, as in the sketch below:
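One possible approach: run the judge several times and take a per-criterion majority vote (the number of runs and the vote rule are illustrative choices):

// Run the judge multiple times and take a per-criterion majority vote
async function evaluateWithConsistency(llmOutput, schema, referenceAnswer, runs = 3) {
  const results = await Promise.all(
    Array.from({ length: runs }, () => evaluateOutput(llmOutput, schema, referenceAnswer))
  );

  const verdict = {};
  for (const criterion of Object.keys(results[0])) {
    const passes = results.filter((r) => r[criterion].pass).length;
    verdict[criterion] = {
      pass: passes > runs / 2, // majority vote across runs
      agreement: Math.max(passes, runs - passes) / runs, // how unanimous the runs were
      explanations: results.map((r) => r[criterion].explanation),
    };
  }
  return verdict;
}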

  6. Interpret and act on results

Use the output to make decisions about your data's validity and quality.

Sample result:

{
  "link": "https://www.kadoa.com/workflow/XYZ",
  "duration": 164.236,
  "result": {
    "coherence": {
      "pass": true,
      "explanation": "The data is structured as an array of objects, each containing the required fields. The format is consistent across all entries, maintaining a logical and readable structure."
    },
    "consistency": {
      "pass": true,
      "explanation": "The data is consistent in terms of field types and expected content. Each entry follows the same pattern, and there are no discrepancies in the data format or logical flow."
    },
    "accuracy": {
      "pass": false,
      "explanation": "While most fields match the expected data types, the 'author' field in one entry is empty, which does not align with the schema's expectation of a non-empty string. This affects the accuracy of the data."
    },
    "completeness": {
      "pass": false,
      "explanation": "All required fields are present in each entry, but the 'author' field is not populated in one entry, which violates the requirement for all fields to be appropriately populated."
    }
  }
}
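Acting on a result like the one above can be as simple as gating on the per-criterion flags; the routing below is illustrative:

// Illustrative gating logic on the per-criterion verdicts (the `result` object above)
function handleJudgement(result) {
  const failed = Object.entries(result)
    .filter(([, criterion]) => !criterion.pass)
    .map(([name, criterion]) => `${name}: ${criterion.explanation}`);

  if (failed.length === 0) {
    return { action: 'accept' };
  }
  // In the sample above, accuracy and completeness fail because of an empty 'author' field
  return { action: 'flag_for_review', reasons: failed };
}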

Binary Scale vs Numeric Scores

We started with numeric scores (e.g. a 1-10 scale) for each criterion. While numeric scores offer more granularity, our first experiments revealed several limitations:

  1. Inconsistency: LLMs often struggle to maintain consistent interpretations of numeric scales across different evaluations. The score distribution was usually very uneven.

  2. Reasoning limitations: LLMs may have difficulty justifying small differences between close numeric scores (e.g., 7 vs 8).

  3. Calibration issues: Different LLMs or even different versions of the same LLM may interpret numeric scales differently, making comparisons difficult.

In contrast, a binary scale (pass/fail) is less ambiguous, more consistent across evaluations, and easier to interpret and compare.

Given these considerations, we opted for a binary pass/fail scale.

There is an interesting Twitter thread with a more in-depth analysis:

Numerical judges might still work, but they require more time spent designing and tuning the eval.

The creator of Prometheus (a small judge LLM) mentions that scoring decisions could be made more precise by adding explanations for each score or range and letting the model generate language feedback before scoring.

Best Practices and Considerations

  • Use a Binary Scale: We've found that a pass/fail system for each criterion leads to more consistent results than nuanced scales.
  • Provide Detailed Reference Materials: The more context you give the LLM, the more accurate its judgments will be.
  • Balance Temperature: A lower temperature tends to produce more consistent scoring results, but usually leads to less meaningful feedback.
  • Implement Error Handling: LLMs can occasionally produce unexpected outputs, so robust error handling is crucial (see the sketch after this list).
  • Combine with Rule-Based Checks: Integrate LLM judges with traditional rule-based systems for a hybrid approach.
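For the error-handling point, a minimal sketch (the retry count, the expected criteria list, and the fallback behaviour are assumptions):

// Retry the judge call a few times and fall back to manual review on malformed output
async function evaluateWithRetries(llmOutput, schema, referenceAnswer, maxRetries = 2) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const result = await evaluateOutput(llmOutput, schema, referenceAnswer);
      // Guard against malformed verdicts (missing criteria, non-boolean pass flags)
      const valid = ['coherence', 'consistency', 'accuracy', 'completeness'].every(
        (c) => result[c] && typeof result[c].pass === 'boolean'
      );
      if (valid) return result;
    } catch (err) {
      // JSON parse failures or transient API errors land here; try again
    }
  }
  return null; // caller routes the record to manual review
}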

What's Next?

We were able to improve our data validation and error detection and to reduce data cleaning time. We'll continue to refine our use of LLMs as judges and might even fine-tune the models on domain-specific data to improve accuracy further.