Population evaluation measures how accurately your custom population can predict responses to new questions. This testing process provides confidence metrics that help you understand the reliability of your API predictions.

Why test populations?

When you create a population, you get a model built on your specific data. But without testing, you don’t know:

  • How accurate predictions will be for new questions
  • Which types of questions the population handles well
  • Whether your seed data was sufficient

Population testing answers these questions by measuring performance against held-out data.

Testing methodologies

Semilattice uses two types of testing to evaluate population model accuracy:

Population test (Internal)

The default testing method uses your original seed data:

  1. Question removal: Temporarily removes each question and its answers from the model
  2. Prediction: Asks the model to predict answers for that removed question
  3. Comparison: Compares predictions to the actual responses from your seed data
  4. Iteration: Repeats this process across all questions in your dataset
  5. Averaging: Calculates average accuracy metrics across all questions
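In pseudocode-ish Python, the loop looks roughly like this. The predict_fn parameter stands in for Semilattice's internal prediction step, which isn't exposed here, so treat this as a conceptual sketch rather than the actual implementation:

# Conceptual sketch of the internal population test (leave-one-out over seed questions).
# `predict_fn` is a stand-in for Semilattice's prediction step: given the remaining seed
# questions plus one held-out question, it returns a predicted answer distribution.

def population_test(seed_questions, predict_fn):
    """seed_questions: list of dicts with keys 'question', 'options', and 'actual'
    (where 'actual' is the observed answer distribution, summing to 1)."""
    per_question_mae = []

    for held_out in seed_questions:
        # 1. Question removal: exclude the held-out question from the model's inputs
        remaining = [q for q in seed_questions if q is not held_out]

        # 2. Prediction: predict an answer distribution for the held-out question
        predicted = predict_fn(remaining, held_out["question"], held_out["options"])

        # 3. Comparison: mean absolute error between predicted and actual distributions
        actual = held_out["actual"]
        mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
        per_question_mae.append(mae)

    # 4-5. Repeat for every question, then average across questions
    return sum(per_question_mae) / len(per_question_mae)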

Benchmarking test (External)

You can also test against separate data:

  • Uses completely separate test questions from the same target audience
  • Provides accuracy estimates for real-world performance
  • Currently available via API (UI support coming soon)

Most accuracy scores in the product come from population tests (internal). The degree to which these represent real-world performance depends on how well your seed data represents the types of questions you’ll ask in production.
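If you do run a benchmarking test, the scoring mirrors the internal test, just with your own held-out questions as ground truth. The sketch below uses a hypothetical predict_for_population helper in place of whichever API call you use to get predictions; it is not a documented SDK method:

# Hedged sketch of an external benchmarking test. `predict_for_population` is a
# hypothetical placeholder, not a documented Semilattice SDK method.

def benchmark_population(population_id, test_questions, predict_for_population):
    """test_questions: list of dicts with 'question', 'options', and 'actual'
    (the answer distribution observed from the same target audience)."""
    errors = []
    for q in test_questions:
        predicted = predict_for_population(population_id, q["question"], q["options"])
        actual = q["actual"]
        # Per-question mean absolute error, as in the internal test
        errors.append(sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual))
    return sum(errors) / len(errors)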

Evaluation metrics

Three key accuracy metrics

Semilattice calculates three population-level metrics by averaging individual answer test results:

Average mean absolute error

  • Field: avg_mean_absolute_error
  • Range: 0 to 1 (lower is better)
  • Meaning: Average absolute difference between predicted and actual answer-option percentages, averaged across all test questions
  • Example: 0.1472 means predicted answer percentages are typically off by around 14.7 percentage points
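For a single test question, the per-question value behind this average is the standard mean absolute error between two answer distributions, along these lines:

# Mean absolute error between a predicted and an actual answer distribution
# (standard formula, shown for illustration; not Semilattice's exact code).
predicted = [0.40, 0.35, 0.25]   # predicted share for each answer option
actual    = [0.50, 0.30, 0.20]   # observed share for each answer option

mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
print(mae)  # ~0.067: predictions are off by ~6.7 percentage points on average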

Average mean squared error

  • Field: avg_mean_squared_error
  • Range: 0 to 1+ (lower is better)
  • Meaning: Penalises large prediction errors more heavily than small ones, averaged across test questions
  • Use case: Identifies populations that occasionally make very wrong predictions
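The squared-error version of the same comparison looks like this; squaring means one badly missed answer option raises the score more than several small misses:

# Mean squared error over the same example distributions: squaring penalises
# a single large miss more heavily than several small ones.
predicted = [0.40, 0.35, 0.25]
actual    = [0.50, 0.30, 0.20]

mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
print(mse)  # 0.005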

Average normalised Kullback-Leibler divergence

  • Field: avg_normalised_kullback_leibler_divergence
  • Range: 0 to 1+ (lower is better)
  • Meaning: Measures how different predicted distributions are from reality, normalised by number of answer options and averaged across test questions
  • Use case: Best overall measure of population prediction quality
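Here is a sketch of the per-question calculation, assuming divergence is measured from the actual distribution to the predicted one and normalised by the number of answer options; the exact direction and smoothing Semilattice uses may differ:

import math

# Kullback-Leibler divergence from the actual to the predicted distribution,
# normalised by the number of answer options (per the description above).
# The divergence direction and epsilon smoothing are assumptions, not
# Semilattice's documented implementation.
predicted = [0.40, 0.35, 0.25]
actual    = [0.50, 0.30, 0.20]

eps = 1e-9  # guard against log(0) and division by zero
kld = sum(a * math.log((a + eps) / (p + eps)) for a, p in zip(actual, predicted))
normalised_kld = kld / len(actual)
print(normalised_kld)  # ~0.0069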

Interpreting results

Based on Semilattice’s benchmarking data, here are the thresholds for good performance:

Good performance thresholds

  • Average MAE: Below 0.15 (lower values indicate better accuracy)
  • Average MSE: Below 0.25 (lower values indicate more consistent predictions)
  • Average Normalised KLD: Below 0.1 (lower values indicate better distribution matching)

Benchmarking context

These thresholds come from extensive benchmarking work:

Metric                  | Good Threshold | Benchmarking Average | Benchmarking Range
Average MAE             | Below 0.15     | 0.13                 | 0.10 - 0.18
Average MSE             | Below 0.25     | 0.21                 | 0.15 - 0.30
Average Normalised KLD  | Below 0.1      | 0.0569               | 0.0244 - 0.1006

We increasingly find Average Normalised KLD to be the best single metric, but usually expect scores across all three metrics to be “good” to consider a population model reliable.
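Put together, a simple reliability check against these thresholds might look like this (the function is illustrative; the cut-offs are the ones quoted above):

# Illustrative reliability check using the "good performance" thresholds above.
def looks_reliable(avg_mae, avg_mse, avg_nkld):
    """True if all three population-level metrics clear the quoted thresholds."""
    return avg_mae < 0.15 and avg_mse < 0.25 and avg_nkld < 0.1

print(looks_reliable(0.1472, 0.21, 0.0569))  # True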

Accessing evaluation results

Once testing is complete, metrics are available in API responses:

# Get population with evaluation metrics
# (assumes an initialised Semilattice client named `semilattice` and the ID of a
# population whose testing has finished)
population = semilattice.populations.get(population_id)

print(f"Average MAE: {population.data.avg_mean_absolute_error}")
print(f"Average MSE: {population.data.avg_mean_squared_error}")
print(f"Average Normalised KLD: {population.data.avg_normalised_kullback_leibler_divergence}")