Evaluators

Zenbase provides optimizers to enhance LLM performance across different tasks. Evaluators measure this performance by comparing LLM outputs against expected results, both before and after optimization, enabling a quantitative assessment of how effective an optimization was. Zenbase currently supports three types of evaluators:

Exact Match Evaluator (Default)

The Exact Match Evaluator is the default evaluator for all optimizers. It performs a strict comparison between the function’s response and the expected response, requiring them to be identical.
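
For intuition, the comparison behaves like this minimal Python sketch (an illustration of the semantics only, not Zenbase's internal code):

def exact_match(actual, expected):
    # The entire response must be identical to the expected output.
    return actual == expected

exact_match({"answer": "Paris"}, {"answer": "Paris"})          # True
exact_match({"answer": "Paris, France"}, {"answer": "Paris"})  # False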

Partial Field Match Evaluator

The Partial Field Match Evaluator compares specific fields between the actual and expected outputs. Instead of requiring the entire response to match, it checks only designated fields for exact matches.
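
The check is equivalent to this minimal Python sketch (again an illustration, not Zenbase's internal code):

def partial_field_match(actual, expected, partial_fields):
    # Only the designated fields must match exactly; all other fields are ignored.
    return all(actual.get(field) == expected.get(field) for field in partial_fields)

partial_field_match(
    {"answer": "Paris", "reasoning": "Capital of France."},
    {"answer": "Paris", "reasoning": "Paris is the French capital."},
    ["answer"],
)  # True: only "answer" is compared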

Example

Here is an example of how to define a custom evaluator when configuring an optimizer. To learn more about optimizers, refer to the Optimizer page.

import requests
import json

BASE_URL = "https://orch.zenbase.ai/api"
API_KEY = "YOUR ZENBASE API KEY"

# Placeholders: replace with the IDs of the function and datasets you created earlier.
function_id = "YOUR_FUNCTION_ID"
train_dataset_id = "YOUR_TRAIN_DATASET_ID"
validation_dataset_id = "YOUR_VALIDATION_DATASET_ID"
test_dataset_id = "YOUR_TEST_DATASET_ID"

def api_call(method, endpoint, data=None):
    # Helper for authenticated calls to the Zenbase API.
    url = f"{BASE_URL}/{endpoint}"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Api-Key {API_KEY}"
    }
    # Serialize the payload to JSON only when a body is provided.
    response = requests.request(method, url, headers=headers, data=json.dumps(data) if data else None)
    return response


optimizer_data = {
    "function": function_id,
    "train_set": train_dataset_id,
    "validation_set": validation_dataset_id,
    "test_set": test_dataset_id,
    "parameters": {
        "shots": 5,
        "samples": 5,
        "model_keywords": {
            "temperature": 0.5
        },
        "custom_evaluator": {
            "type": "partial",
            "partial_fields": ["answer"]
        },
    },
    "schedule": {
        "cron": "*/5 * * * *"
    },
    "api_key": API_KEY,
    "model": "MODEL_NAME",
    "optimizer_type": "fewshot",
}
optimizer = api_call("POST", "optimizer-configurations/", optimizer_data)
optimizer_id = optimizer.json()['id']

In this example, the evaluator only compares the answer field between the actual and expected outputs, ignoring other fields.
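
Note that api_call returns the raw requests.Response, so a failed request only surfaces later as a KeyError when reading ['id']. A slightly more defensive variant using the standard requests API:

response = api_call("POST", "optimizer-configurations/", optimizer_data)
response.raise_for_status()  # raise a clear HTTPError instead of failing later on a missing 'id'
optimizer_id = response.json()["id"]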

LLM as Judge Evaluator

This evaluator uses a language model to determine if outputs are semantically equivalent, even when they don’t match exactly. It’s particularly useful for evaluating natural language responses where multiple valid formulations are possible.

Example

Here is an example of how to set llm_as_judge as the custom evaluator when configuring an optimizer. To learn more about optimizers, refer to the Optimizer page.

import requests
import json

BASE_URL = "https://orch.zenbase.ai/api"
API_KEY = "YOUR ZENBASE API KEY"

# Placeholders: replace with the IDs of the function and datasets you created earlier.
function_id = "YOUR_FUNCTION_ID"
train_dataset_id = "YOUR_TRAIN_DATASET_ID"
validation_dataset_id = "YOUR_VALIDATION_DATASET_ID"
test_dataset_id = "YOUR_TEST_DATASET_ID"

def api_call(method, endpoint, data=None):
    # Helper for authenticated calls to the Zenbase API.
    url = f"{BASE_URL}/{endpoint}"
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Api-Key {API_KEY}"
    }
    # Serialize the payload to JSON only when a body is provided.
    response = requests.request(method, url, headers=headers, data=json.dumps(data) if data else None)
    return response


optimizer_data = {
    "function": function_id,
    "train_set": train_dataset_id,
    "validation_set": validation_dataset_id,
    "test_set": test_dataset_id,
    "parameters": {
        "shots": 5,
        "samples": 5,
        "model_keywords": {
            "temperature": 0.5
        },
        "custom_evaluator": {
            "type": "llm_as_judge",
            "prompt": "Compare the model's output with the expected output. Determine if they are semantically equivalent, even if worded differently. Respond in JSON with 'reasoning' (list of points) and 'passed' (boolean).",
        },
    },
    "schedule": {
        "cron": "*/5 * * * *"
    },
    "api_key": API_KEY,
    "model": "MODEL_NAME",
    "optimizer_type": "fewshot",
}
optimizer = api_call("POST", "optimizer-configurations/", optimizer_data)
optimizer_id = optimizer.json()['id']

When to use LLM as Judge Evaluator

The LLM evaluator uses the provided prompt to analyze outputs and determine whether they pass or fail based on comparison with the correct output. This evaluator is especially useful when valid outputs can be expressed in different formats or variations.

Example cases where exact_match fails but llm_as_judge passes:

Case 1:
Question: Who was the first person to walk on the moon?
Expected output: Neil Armstrong
Actual output: Neil Alden Armstrong

Case 2:
Question: What caused World War I?
Expected output: The assassination of Archduke Franz Ferdinand triggered World War I.
Actual output: World War I began after Archduke Franz Ferdinand was killed in Sarajevo in 1914.
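
Given the prompt above, a judge verdict for Case 1 would follow the requested JSON shape. The response below is a hypothetical illustration (shown as a Python dict), not output captured from the API:

judge_response = {
    "reasoning": [
        "The expected output names Neil Armstrong.",
        "The actual output, Neil Alden Armstrong, refers to the same person with his middle name included.",
    ],
    "passed": True,  # semantically equivalent, so the evaluation passes
}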