Benchmarking tests your population’s accuracy by comparing its predictions against known real-world answers to the same questions. This is essential for validating population quality and understanding prediction reliability.

Why benchmark?

Benchmarking helps you:
  • Validate population accuracy for questions similar to your use case
  • Compare different populations to choose the best one for your needs
  • Build confidence in predictions before making important decisions
  • Identify weaknesses in population seed data

Basic usage

from semilattice import Semilattice

semilattice = Semilattice()

result = semilattice.answers.benchmark(
    population_id="d670f351-8567-4586-9bba-b81add1bebe3",
    answers={
        "question": "What's your primary role?",
        "question_options": {"question_type": "single-choice"},
        "answer_options": ["Engineer", "Manager", "Designer", "Product"],
        "ground_answer_counts": {
            "Engineer": 45,
            "Manager": 12,
            "Designer": 18,
            "Product": 25
        },
        "ground_answer_sample_size": 100
    }
)

Required parameters

Ground truth data

For benchmarking, you must provide the real-world answer distribution:

ground_answer_counts

A dictionary mapping each answer option to the number of real people who selected it:
{
    "Very satisfied": 34,
    "Somewhat satisfied": 41,
    "Neutral": 15,
    "Somewhat dissatisfied": 7,
    "Very dissatisfied": 3
}

ground_answer_sample_size

The total number of people who responded to the question. This is crucial for multiple-choice questions, where respondents can select multiple options.

Single-choice example:
  • 100 people responded
  • Each person selected exactly one option
  • ground_answer_sample_size: 100
Multiple-choice example:
  • 100 people responded
  • People could select multiple options
  • Total selections across all options: 180
  • ground_answer_sample_size: 100 (the number of people, not total selections)
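The distinction can be made concrete with a small sketch. This is plain Python, not part of the SDK; the helper name and the multiple-choice counts are illustrative, with the counts chosen to match the 100-people / 180-selections example above:

```python
# Illustrative helper, not part of the SDK: convert ground truth counts
# into per-option fractions of respondents.
def ground_percentages(counts: dict, sample_size: int) -> dict:
    return {option: count / sample_size for option, count in counts.items()}

# Single-choice: every respondent picks exactly one option, so the
# counts sum to the sample size and the fractions sum to 1.0.
single = ground_percentages(
    {"Junior": 25, "Mid-level": 35, "Senior": 30, "Staff": 10}, 100
)
print(round(sum(single.values()), 2))  # 1.0

# Multiple-choice: 100 people made 180 selections in total, so the
# fractions sum to more than 1.0 (here 1.8) while the sample size
# stays 100 -- the number of people, not the number of selections.
multi = ground_percentages({"Git": 80, "Docker": 60, "AWS": 40}, 100)
print(round(sum(multi.values()), 2))  # 1.8
```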

Question types

Single-choice Benchmarking

result = semilattice.answers.benchmark(
    population_id="population-id",
    answers={
        "question": "What's your experience level?",
        "question_options": {"question_type": "single-choice"},
        "answer_options": ["Junior", "Mid-level", "Senior", "Staff"],
        "ground_answer_counts": {
            "Junior": 25,
            "Mid-level": 35,
            "Senior": 30,
            "Staff": 10
        },
        "ground_answer_sample_size": 100
    }
)

Multiple-choice Benchmarking

result = semilattice.answers.benchmark(
    population_id="population-id",
    answers={
        "question": "Which tools do you use? (Select all that apply)",
        "question_options": {"question_type": "multiple-choice"},
        "answer_options": ["Git", "Docker", "Kubernetes", "AWS"],
        "ground_answer_counts": {
            "Git": 85,        # 85 out of 100 people selected Git
            "Docker": 60,     # 60 out of 100 people selected Docker
            "Kubernetes": 30, # 30 out of 100 people selected Kubernetes
            "AWS": 45         # 45 out of 100 people selected AWS
        },
        "ground_answer_sample_size": 100  # 100 people total responded
    }
)

Handling async results

Like simulations, benchmarks run asynchronously:
import time

answer_id = result.data[0].id

while result.data[0].status != "Predicted":
    time.sleep(1)
    result = semilattice.answers.get(answer_id)

print("Benchmark complete!")
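For production code you may want bounded waiting rather than an open-ended loop. A sketch of a more defensive variant follows; the only status value taken from these docs is "Predicted", so any other terminal statuses (for example a failure state), and the timeout and interval defaults, are assumptions to check against the API reference:

```python
import time

# Poll a benchmark until it reaches "Predicted" or a deadline passes.
# `client` is a Semilattice instance; `answer_id` comes from the
# benchmark response, as in the loop above.
def wait_for_benchmark(client, answer_id, timeout=300, interval=2):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = client.answers.get(answer_id)
        if result.data[0].status == "Predicted":
            return result
        time.sleep(interval)
    raise TimeoutError(f"Benchmark {answer_id} did not complete within {timeout}s")
```

Calling `wait_for_benchmark(semilattice, answer_id)` then replaces the open-ended while loop.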

Understanding benchmark results

Once complete, benchmark results include both predictions and accuracy metrics:
{
    "data": {
        "id": "benchmark-answer-uuid",
        "status": "Predicted",
        "question": "What's your primary role?",
        "answer_options": ["Engineer", "Manager", "Designer", "Product"],
        
        // Predicted distribution
        "simulated_answer_percentages": {
            "Engineer": 0.42,
            "Manager": 0.15,
            "Designer": 0.19,
            "Product": 0.24
        },
        
        // Actual distribution (calculated from your ground truth data)
        "ground_answer_percentages": {
            "Engineer": 0.45,
            "Manager": 0.12,
            "Designer": 0.18,
            "Product": 0.25
        },
        
        // Accuracy metrics for this specific question
        "accuracy": 87.5,
        "root_mean_squared_error": 14.2,
        "normalised_kullback_leibler_divergence": 0.08,
        "kullback_leibler_divergence": 0.12
    }
}
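Once a result is complete, a side-by-side view of the two distributions is often easier to read than the raw payload. A sketch using the example percentages above (plain dictionaries stand in here for the response's `simulated_answer_percentages` and `ground_answer_percentages` fields):

```python
# Example distributions copied from the response shown above; in real
# code these would be read from the completed benchmark result.
simulated = {"Engineer": 0.42, "Manager": 0.15, "Designer": 0.19, "Product": 0.24}
ground = {"Engineer": 0.45, "Manager": 0.12, "Designer": 0.18, "Product": 0.25}

# Print predicted vs actual, plus the signed error, per answer option.
for option in ground:
    error = simulated[option] - ground[option]
    print(
        f"{option:<10} predicted={simulated[option]:>4.0%} "
        f"actual={ground[option]:>4.0%} error={error:+.0%}"
    )
```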

Interpreting accuracy metrics

Question-level accuracy

Unlike population-level metrics, benchmark results give you accuracy for this specific question:

Mean absolute error

  • Field: accuracy (reported as 1 - MAE, expressed as a percentage)
  • Range: 0 to 100% (higher is better)
  • Meaning: Overall prediction accuracy as a percentage for this specific question
  • Example: 87.5% accuracy means predictions differed from actual results by ~12.5 percentage points on average

Root mean squared error

  • Field: root_mean_squared_error
  • Range: 0 to 100% (lower is better)
  • Meaning: Penalises large prediction errors more heavily than small ones for this question
  • Use case: Sensitive to outlier errors; higher values indicate less consistent predictions

Normalised Kullback-Leibler divergence

  • Field: normalised_kullback_leibler_divergence
  • Range: 0 to 1+ (lower is better)
  • Meaning: Measures how different the predicted distribution is from reality, normalised by number of answer options
  • Use case: Best overall measure of prediction quality for this question - values below 0.1 are considered good

Kullback-Leibler divergence

  • Field: kullback_leibler_divergence
  • Range: Varies by question (lower is better)
  • Meaning: Raw divergence measure without normalisation
  • Use case: Question-specific analysis - not comparable between questions with different numbers of options
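The metrics above can be sketched with common textbook definitions. The exact formulas the API uses (log base, direction of the KL divergence, how the normalisation is done) are not documented here, so treat this as an illustration of the concepts rather than the service's implementation; the example distributions are made up:

```python
import math

def mae(predicted, actual):
    # mean absolute error across answer options
    return sum(abs(predicted[k] - actual[k]) for k in actual) / len(actual)

def rmse(predicted, actual):
    # root mean squared error: penalises large errors more heavily than MAE
    return math.sqrt(sum((predicted[k] - actual[k]) ** 2 for k in actual) / len(actual))

def kl_divergence(actual, predicted):
    # KL(actual || predicted), natural log; undefined if predicted assigns
    # zero probability to an option that actually occurred
    return sum(p * math.log(p / predicted[k]) for k, p in actual.items() if p > 0)

def normalised_kl(actual, predicted):
    # one plausible normalisation: divide by log(number of options);
    # the API's actual normalisation may differ
    return kl_divergence(actual, predicted) / math.log(len(actual))

# Made-up example distributions, just to exercise the definitions.
predicted = {"Yes": 0.6, "No": 0.4}
actual = {"Yes": 0.7, "No": 0.3}

accuracy = (1 - mae(predicted, actual)) * 100  # 90.0 for this example
```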

Best practices

Data quality

  • Representative samples: Ensure your ground truth data comes from the same user profile as your target population.
  • Sufficient sample size: Use at least 25-50 responses for reliable benchmarking results.