Evaluation

Population evaluation measures how accurately your custom population can predict responses to new questions. This testing process provides confidence metrics that help you understand the reliability of your API predictions.

Why test populations?

When you create a population, you get a model built on your specific data. But without testing, you don’t know:
  • How accurate predictions will be for new questions
  • Which types of questions the population handles well
  • Whether your seed data was sufficient
Population testing answers these questions by measuring performance against held-out data.

Testing methodologies

Semilattice uses two types of testing to evaluate population model accuracy:

Population test

The default testing method uses your original seed data:
  1. Question removal: Temporarily removes each question and its answers from the model
  2. Prediction: Asks the model to predict answers for that removed question
  3. Comparison: Compares predictions to the actual responses from your seed data
  4. Iteration: Repeats this process across all questions in your dataset
  5. Averaging: Calculates average accuracy metrics across all questions
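The steps above amount to a leave-one-out evaluation over your seed questions. Here is a minimal sketch of that idea, where `build_model`, `predict`, and `error` are hypothetical callables standing in for the model internals (this is not Semilattice's implementation):

```python
# Illustrative leave-one-out evaluation over seed questions.
def leave_one_out_error(seed_questions, build_model, predict, error):
    errors = []
    for i, held_out in enumerate(seed_questions):
        # Steps 1-2: remove the question, rebuild, and predict it
        remaining = seed_questions[:i] + seed_questions[i + 1:]
        model = build_model(remaining)
        predicted = predict(model, held_out["question"])
        # Step 3: compare the prediction to the actual seed answers
        errors.append(error(predicted, held_out["answers"]))
    # Steps 4-5: repeat for every question, then average
    return sum(errors) / len(errors)
```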

Benchmarking test

You can also test against separate data:
  • Uses completely separate test questions from the same target audience
  • Provides accuracy estimates for real-world performance
  • Currently available via API (UI support coming soon)
Most accuracy scores in the product come from population tests. The degree to which these represent real-world performance depends on how well your seed data represents the types of questions you’ll ask in production.

Open-ended questions cannot be tested

The API does not currently support simulating open-ended outputs, so it is not possible to test the accuracy of population models against open-ended ground truth data. If your population model contains open-ended seed data, these questions will be skipped during population tests: they will be reported with status: "Not Tested", and population accuracy metrics will be calculated from the remaining questions.
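To illustrate the effect on aggregate scores, the sketch below averages per-question errors while skipping untested questions. The list shape and field names here are illustrative, not the exact API response format:

```python
# Illustrative only: average accuracy over tested questions, skipping
# open-ended questions whose status is "Not Tested".
questions = [
    {"status": "Tested", "mean_absolute_error": 0.12},
    {"status": "Not Tested", "mean_absolute_error": None},  # open-ended
    {"status": "Tested", "mean_absolute_error": 0.18},
]

tested = [q for q in questions if q["status"] == "Tested"]
avg_mae = sum(q["mean_absolute_error"] for q in tested) / len(tested)
print(f"Average MAE over {len(tested)} tested questions: {avg_mae:.2f}")
```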

Running population tests

You can trigger population testing using the test method:
# Trigger population testing
population = semilattice.populations.test(population_id="your_population_id")
population_id = population.data.id
You need to poll for completion as testing runs asynchronously:
import time

# Poll until testing reaches a terminal state
while population.data.status not in ("Tested", "Action Required"):
    time.sleep(1)
    population = semilattice.populations.get(population_id=population_id)

print(f"Testing finished with status: {population.data.status}")
The status field will progress through:
  1. “Testing”: Population test is currently running
  2. “Tested”: Testing completed successfully, metrics are available
If something goes wrong, the status will be “Action Required”, indicating that testing failed due to a simulation error on one or more questions.

Evaluation metrics

Three key accuracy metrics

Semilattice calculates three population-level metrics by averaging individual answer test results:

Average mean absolute error

  • Field: avg_mean_absolute_error
  • Range: 0 to 1 (lower is better)
  • Meaning: Average percentage difference between predicted and actual answer distributions across all test questions
  • Example: An Average MAE of 0.1472 means predicted answer percentages differ from actual results by roughly 14.7 percentage points on average

Average mean squared error

  • Field: avg_mean_squared_error
  • Range: 0 to 1+ (lower is better)
  • Meaning: Penalises large prediction errors more heavily than small ones, averaged across test questions
  • Use case: Identifies populations that occasionally make very wrong predictions

Average normalised Kullback-Leibler divergence

  • Field: avg_normalised_kullback_leibler_divergence
  • Range: 0 to 1+ (lower is better)
  • Meaning: Measures how different predicted distributions are from reality, normalised by number of answer options and averaged across test questions
  • Use case: Best overall measure of population prediction quality
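To make the three metrics concrete for a single question, here is a sketch of each computed on a predicted vs actual answer distribution. The formulas are the standard definitions; normalising the KL divergence by the log of the number of answer options is our assumption about the scheme, based on the description above:

```python
import math

def mean_absolute_error(pred, actual):
    # Average absolute gap between predicted and actual answer shares
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)

def mean_squared_error(pred, actual):
    # Squaring penalises large errors more heavily than small ones
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred)

def normalised_kl_divergence(pred, actual, eps=1e-12):
    # KL(actual || pred), normalised by log of the number of answer options
    kl = sum(a * math.log((a + eps) / (p + eps)) for p, a in zip(pred, actual))
    return kl / math.log(len(pred))

predicted = [0.60, 0.30, 0.10]  # model's predicted answer distribution
actual = [0.50, 0.35, 0.15]     # observed answer distribution
print(round(mean_absolute_error(predicted, actual), 3))
```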

Interpreting results

Based on Semilattice’s benchmarking data, here are the thresholds for good performance:

Good performance thresholds

  • Average MAE: Below 0.15 (lower values indicate better accuracy)
  • Average MSE: Below 0.25 (lower values indicate more consistent predictions)
  • Average Normalised KLD: Below 0.1 (lower values indicate better distribution matching)
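These thresholds can be wrapped in a small helper that flags whether a tested population clears all three. The function name is ours; the metric field names match the API fields described above:

```python
# Thresholds from the "Good performance thresholds" list above.
THRESHOLDS = {
    "avg_mean_absolute_error": 0.15,
    "avg_mean_squared_error": 0.25,
    "avg_normalised_kullback_leibler_divergence": 0.1,
}

def is_reliable(metrics: dict) -> bool:
    """True if every metric is below its 'good' threshold."""
    return all(metrics[name] < limit for name, limit in THRESHOLDS.items())

example = {
    "avg_mean_absolute_error": 0.13,
    "avg_mean_squared_error": 0.21,
    "avg_normalised_kullback_leibler_divergence": 0.0569,
}
print(is_reliable(example))  # True
```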

Benchmarking context

These thresholds come from extensive benchmarking work:
Metric                 | Good Threshold | Benchmarking Average | Benchmarking Range
Average MAE            | Below 0.15     | 0.13                 | 0.10 - 0.18
Average MSE            | Below 0.25     | 0.21                 | 0.15 - 0.30
Average Normalised KLD | Below 0.1      | 0.0569               | 0.0244 - 0.1006
We increasingly find Average Normalised KLD to be the best single metric, but usually expect scores across all three metrics to be “good” to consider a population model reliable.

Accessing evaluation results

Once testing is complete, metrics are available in API responses:
# Get population with evaluation metrics
population = semilattice.populations.get(population_id=population_id)

print(f"Average MAE: {population.data.avg_mean_absolute_error}")
print(f"Average MSE: {population.data.avg_mean_squared_error}")
print(f"Average Normalised KLD: {population.data.avg_normalised_kullback_leibler_divergence}")