Benchmarking tests your population’s accuracy by comparing its predictions against known real-world answers to the same questions. This is essential for validating population quality and understanding prediction reliability.

Why benchmark?

Benchmarking helps you:
  • Validate population accuracy for questions similar to your use case
  • Compare different populations to choose the best one for your needs
  • Build confidence in predictions before making important decisions
  • Identify weaknesses in population seed data

Basic usage

from semilattice import Semilattice

semilattice = Semilattice()

result = semilattice.answers.benchmark(
    population_id="d670f351-8567-4586-9bba-b81add1bebe3",
    answers={
        "question": "What's your primary role?",
        "question_options": {"question_type": "single-choice"},
        "answer_options": ["Engineer", "Manager", "Designer", "Product"],
        "ground_answer_counts": {
            "Engineer": 45,
            "Manager": 12,
            "Designer": 18,
            "Product": 25
        },
        "ground_answer_sample_size": 100
    }
)

Required parameters

Ground truth data

For benchmarking, you must provide the real-world answer distribution:

ground_answer_counts

A dictionary mapping each answer option to the number of real people who selected it:
{
    "Very satisfied": 34,
    "Somewhat satisfied": 41,
    "Neutral": 15,
    "Somewhat dissatisfied": 7,
    "Very dissatisfied": 3
}

ground_answer_sample_size

The total number of people who responded to the question. This is crucial for multiple-choice questions, where respondents can select multiple options.

Single-choice example:
  • 100 people responded
  • Each person selected exactly one option
  • ground_answer_sample_size: 100
Multiple-choice example:
  • 100 people responded
  • People could select multiple options
  • Total selections across all options: 180
  • ground_answer_sample_size: 100 (the number of people, not total selections)
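The distinction can be made concrete with a small sketch. This is plain Python, not part of the SDK; the helper name and the multiple-choice counts are illustrative, with the counts chosen to match the 100-people / 180-selections example above:

```python
# Illustrative helper, not part of the SDK: convert ground truth counts
# into per-option fractions of respondents.
def ground_percentages(counts: dict, sample_size: int) -> dict:
    return {option: count / sample_size for option, count in counts.items()}

# Single-choice: every respondent picks exactly one option, so the
# counts sum to the sample size and the fractions sum to 1.0.
single = ground_percentages(
    {"Junior": 25, "Mid-level": 35, "Senior": 30, "Staff": 10}, 100
)
print(round(sum(single.values()), 2))  # 1.0

# Multiple-choice: 100 people made 180 selections in total, so the
# fractions sum to more than 1.0 (here 1.8) while the sample size
# stays 100 -- the number of people, not the number of selections.
multi = ground_percentages({"Git": 80, "Docker": 60, "AWS": 40}, 100)
print(round(sum(multi.values()), 2))  # 1.8
```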

Question types

Single-choice Benchmarking

result = semilattice.answers.benchmark(
    population_id="population-id",
    answers={
        "question": "What's your experience level?",
        "question_options": {"question_type": "single-choice"},
        "answer_options": ["Junior", "Mid-level", "Senior", "Staff"],
        "ground_answer_counts": {
            "Junior": 25,
            "Mid-level": 35,
            "Senior": 30,
            "Staff": 10
        },
        "ground_answer_sample_size": 100
    }
)

Multiple-choice Benchmarking

result = semilattice.answers.benchmark(
    population_id="population-id",
    answers={
        "question": "Which tools do you use? (Select all that apply)",
        "question_options": {"question_type": "multiple-choice"},
        "answer_options": ["Git", "Docker", "Kubernetes", "AWS"],
        "ground_answer_counts": {
            "Git": 85,        # 85 out of 100 people selected Git
            "Docker": 60,     # 60 out of 100 people selected Docker
            "Kubernetes": 30, # 30 out of 100 people selected Kubernetes
            "AWS": 45         # 45 out of 100 people selected AWS
        },
        "ground_answer_sample_size": 100  # 100 people total responded
    }
)

Handling async results

Like simulations, benchmarks run asynchronously:
import time

answer_id = result.data[0].id

while result.data[0].status != "Predicted":
    time.sleep(1)
    result = semilattice.answers.get(answer_id)

print("Benchmark complete!")
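For production code you may want bounded waiting rather than an open-ended loop. A sketch of a more defensive variant follows; the only status value taken from these docs is "Predicted", so any other terminal statuses (for example a failure state), and the timeout and interval defaults, are assumptions to check against the API reference:

```python
import time

# Poll a benchmark until it reaches "Predicted" or a deadline passes.
# `client` is a Semilattice instance; `answer_id` comes from the
# benchmark response, as in the loop above.
def wait_for_benchmark(client, answer_id, timeout=300, interval=2):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = client.answers.get(answer_id)
        if result.data[0].status == "Predicted":
            return result
        time.sleep(interval)
    raise TimeoutError(f"Benchmark {answer_id} did not complete within {timeout}s")
```

Calling `wait_for_benchmark(semilattice, answer_id)` then replaces the open-ended while loop.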

Understanding benchmark results

Once complete, benchmark results include both predictions and accuracy metrics:
{
    "data": {
        "id": "benchmark-answer-uuid",
        "status": "Predicted",
        "question": "What's your primary role?",
        "answer_options": ["Engineer", "Manager", "Designer", "Product"],
        
        // Predicted distribution
        "simulated_answer_percentages": {
            "Engineer": 0.42,
            "Manager": 0.15,
            "Designer": 0.19,
            "Product": 0.24
        },
        
        // Actual distribution (calculated from your ground truth data)
        "ground_answer_percentages": {
            "Engineer": 0.45,
            "Manager": 0.12,
            "Designer": 0.18,
            "Product": 0.25
        },
        
        // Accuracy metrics for this specific question
        "accuracy": 87.5,
        "root_mean_squared_error": 14.2,
        "normalised_kullback_leibler_divergence": 0.08,
        "kullback_leibler_divergence": 0.12
    }
}
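Once a result is complete, a side-by-side view of the two distributions is often easier to read than the raw payload. A sketch using the example percentages above (plain dictionaries stand in here for the response's `simulated_answer_percentages` and `ground_answer_percentages` fields):

```python
# Example distributions copied from the response shown above; in real
# code these would be read from the completed benchmark result.
simulated = {"Engineer": 0.42, "Manager": 0.15, "Designer": 0.19, "Product": 0.24}
ground = {"Engineer": 0.45, "Manager": 0.12, "Designer": 0.18, "Product": 0.25}

# Print predicted vs actual, plus the signed error, per answer option.
for option in ground:
    error = simulated[option] - ground[option]
    print(
        f"{option:<10} predicted={simulated[option]:>4.0%} "
        f"actual={ground[option]:>4.0%} error={error:+.0%}"
    )
```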

Interpreting accuracy metrics

Question-level accuracy

Unlike population-level metrics, benchmark results give you accuracy for this specific question:

Mean absolute error

  • Field: accuracy (reported as 1 - MAE, expressed as a percentage)
  • Range: 0 to 100% (higher is better)
  • Meaning: Overall prediction accuracy as a percentage for this specific question
  • Example: 87.5% accuracy means predictions differed from actual results by ~12.5 percentage points on average

Root mean squared error

  • Field: root_mean_squared_error
  • Range: 0 to 100% (lower is better)
  • Meaning: Penalises large prediction errors more heavily than small ones for this question
  • Use case: Sensitive to outlier errors; higher values indicate less consistent predictions

Normalised Kullback-Leibler divergence

  • Field: normalised_kullback_leibler_divergence
  • Range: 0 to 1+ (lower is better)
  • Meaning: Measures how different the predicted distribution is from reality, normalised by number of answer options
  • Use case: Best overall measure of prediction quality for this question - values below 0.1 are considered good

Kullback-Leibler divergence

  • Field: kullback_leibler_divergence
  • Range: Varies by question (lower is better)
  • Meaning: Raw divergence measure without normalisation
  • Use case: Question-specific analysis - not comparable between questions with different numbers of options
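The metrics above can be sketched with common textbook definitions. The exact formulas the API uses (log base, direction of the KL divergence, how the normalisation is done) are not documented here, so treat this as an illustration of the concepts rather than the service's implementation; the example distributions are made up:

```python
import math

def mae(predicted, actual):
    # mean absolute error across answer options
    return sum(abs(predicted[k] - actual[k]) for k in actual) / len(actual)

def rmse(predicted, actual):
    # root mean squared error: penalises large errors more heavily than MAE
    return math.sqrt(sum((predicted[k] - actual[k]) ** 2 for k in actual) / len(actual))

def kl_divergence(actual, predicted):
    # KL(actual || predicted), natural log; undefined if predicted assigns
    # zero probability to an option that actually occurred
    return sum(p * math.log(p / predicted[k]) for k, p in actual.items() if p > 0)

def normalised_kl(actual, predicted):
    # one plausible normalisation: divide by log(number of options);
    # the API's actual normalisation may differ
    return kl_divergence(actual, predicted) / math.log(len(actual))

# Made-up example distributions, just to exercise the definitions.
predicted = {"Yes": 0.6, "No": 0.4}
actual = {"Yes": 0.7, "No": 0.3}

accuracy = (1 - mae(predicted, actual)) * 100  # 90.0 for this example
```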

Best practices

Data quality

  • Representative samples: Ensure your ground truth data comes from the same user profile as your target population.
  • Sufficient sample size: Use at least 25-50 responses for reliable benchmarking results.