Accuracy evaluation
Understanding population testing and accuracy metrics
Population evaluation measures how accurately your custom population can predict responses to new questions. This testing process provides confidence metrics that help you understand the reliability of your API predictions.
Why test populations?
When you create a population, you get a model built on your specific data. But without testing, you don’t know:
- How accurate predictions will be for new questions
- Which types of questions the population handles well
- Whether your seed data was sufficient
Population testing answers these questions by measuring performance against held-out data.
Testing methodologies
Semilattice uses two types of testing to evaluate population model accuracy:
Population test (Internal)
The default testing method uses your original seed data:
- Question removal: Temporarily removes each question and its answers from the model
- Prediction: Asks the model to predict answers for that removed question
- Comparison: Compares predictions to the actual responses from your seed data
- Iteration: Repeats this process across all questions in your dataset
- Averaging: Calculates average accuracy metrics across all questions
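As a rough illustration, this procedure amounts to a leave-one-out loop over the seed questions. The sketch below is a simplified local analogue, with `fit` and `predict` as hypothetical stand-ins for Semilattice's model-building and prediction steps (which actually run server-side), scoring each held-out question with mean absolute error only:

```python
from typing import Callable

Distribution = dict[str, float]  # answer option -> share of respondents (0-1)

def population_test(
    seed: dict[str, Distribution],
    fit: Callable[[dict[str, Distribution]], object],  # hypothetical: build a model from seed data
    predict: Callable[[object, str], Distribution],    # hypothetical: predict one question's answers
) -> float:
    """Leave-one-out test: average mean absolute error across all seed questions."""
    errors = []
    for qid, actual in seed.items():
        remaining = {k: v for k, v in seed.items() if k != qid}  # question removal
        model = fit(remaining)                                   # rebuild the model without it
        predicted = predict(model, qid)                          # prediction
        mae = sum(abs(predicted[a] - actual[a]) for a in actual) / len(actual)  # comparison
        errors.append(mae)                                       # iteration
    return sum(errors) / len(errors)                             # averaging
```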
Benchmarking test (External)
You can also test against separate data:
- Uses completely separate test questions from the same target audience
- Provides accuracy estimates for real-world performance
- Currently available via API (UI support coming soon)
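A hedged sketch of how a benchmarking run might look from the client side: predict each held-out question through the API and score the result against responses collected from the same audience. The base URL, route, payload, and response field below are placeholders, not the documented Semilattice API; check the API reference for the real endpoint.

```python
import requests

BASE_URL = "https://api.semilattice.example/v1"   # placeholder, not the real base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def benchmark_mae(population_id: str, test_set: dict[str, dict[str, float]]) -> float:
    """Average MAE across external test questions the population has never seen."""
    errors = []
    for question, actual in test_set.items():
        resp = requests.post(
            f"{BASE_URL}/populations/{population_id}/predictions",  # placeholder route
            json={"question": question, "options": list(actual)},   # placeholder payload
            headers=HEADERS,
            timeout=60,
        )
        resp.raise_for_status()
        predicted = resp.json()["distribution"]                     # placeholder field
        errors.append(sum(abs(predicted[o] - actual[o]) for o in actual) / len(actual))
    return sum(errors) / len(errors)
```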
Evaluation metrics
Three key accuracy metrics
Semilattice calculates three population-level metrics by averaging the test results for individual questions:
Average mean absolute error
- Field: `avg_mean_absolute_error`
- Range: 0 to 1 (lower is better)
- Meaning: Average absolute difference between the predicted and actual answer distributions, averaged across all test questions
- Example: 0.1472 means predicted answer percentages are typically within ~14.7 percentage points of the actual results
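For a single test question, the arithmetic is just the average absolute gap between the predicted and actual answer shares. A minimal illustration of that arithmetic (not Semilattice's implementation):

```python
def mean_absolute_error(predicted: dict[str, float], actual: dict[str, float]) -> float:
    """Average absolute gap between two answer distributions (shares expressed as 0-1)."""
    return sum(abs(predicted[a] - actual[a]) for a in actual) / len(actual)

# A two-option question where each answer share is off by 15 percentage points:
print(mean_absolute_error({"Yes": 0.55, "No": 0.45}, {"Yes": 0.70, "No": 0.30}))  # 0.15
```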
Average mean squared error
- Field: `avg_mean_squared_error`
- Range: 0 to 1+ (lower is better)
- Meaning: Penalises large prediction errors more heavily than small ones, averaged across test questions
- Use case: Identifies populations that occasionally make very wrong predictions
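To see why MSE flags occasional large misses, compare two predictions with the same MAE: concentrating the error in one answer option raises the MSE. A minimal illustration of the arithmetic:

```python
def mean_squared_error(predicted: dict[str, float], actual: dict[str, float]) -> float:
    """Average squared gap between two answer distributions; squaring amplifies big misses."""
    return sum((predicted[a] - actual[a]) ** 2 for a in actual) / len(actual)

actual = {"A": 0.40, "B": 0.30, "C": 0.20, "D": 0.10}
spread = {"A": 0.45, "B": 0.25, "C": 0.25, "D": 0.05}        # four small errors of 0.05
concentrated = {"A": 0.50, "B": 0.20, "C": 0.20, "D": 0.10}  # two larger errors of 0.10

# Both predictions have the same MAE (0.05), but concentrating the error doubles the MSE.
print(mean_squared_error(spread, actual))        # 0.0025
print(mean_squared_error(concentrated, actual))  # 0.005
```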
Average normalised Kullback-Leibler divergence
- Field: `avg_normalised_kullback_leibler_divergence`
- Range: 0 to 1+ (lower is better)
- Meaning: Measures how different predicted distributions are from reality, normalised by number of answer options and averaged across test questions
- Use case: Best overall measure of population prediction quality
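Semilattice's exact normalisation isn't spelled out here; one common approach, shown below purely as an assumption, is to divide the KL divergence by the log of the number of answer options so that questions with more options aren't penalised simply for having more ways to diverge:

```python
import math

def normalised_kl_divergence(predicted: dict[str, float], actual: dict[str, float],
                             eps: float = 1e-9) -> float:
    """KL(actual || predicted) divided by log(number of options).

    The normalisation is an assumption for illustration, not Semilattice's
    published formula; it requires at least two answer options.
    """
    kld = sum(p * math.log((p + eps) / (predicted[a] + eps))
              for a, p in actual.items() if p > 0)
    return kld / math.log(len(actual))

print(normalised_kl_divergence({"Yes": 0.55, "No": 0.45}, {"Yes": 0.70, "No": 0.30}))  # ~0.068
```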
Interpreting results
Based on Semilattice’s benchmarking data, here are the thresholds for good performance:
Good performance thresholds
- Average MAE: Below 0.15 (lower values indicate better accuracy)
- Average MSE: Below 0.25 (lower values indicate more consistent predictions)
- Average Normalised KLD: Below 0.1 (lower values indicate better distribution matching)
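If you read test results programmatically, the thresholds above translate into a simple check against the documented metric fields. A minimal sketch:

```python
GOOD_THRESHOLDS = {
    "avg_mean_absolute_error": 0.15,
    "avg_mean_squared_error": 0.25,
    "avg_normalised_kullback_leibler_divergence": 0.1,
}

def meets_thresholds(metrics: dict[str, float]) -> dict[str, bool]:
    """Flag, per metric, whether a population test falls in the 'good' range."""
    return {name: metrics[name] < limit for name, limit in GOOD_THRESHOLDS.items()}

# Using the benchmarking averages quoted in the table below:
print(meets_thresholds({
    "avg_mean_absolute_error": 0.13,
    "avg_mean_squared_error": 0.21,
    "avg_normalised_kullback_leibler_divergence": 0.0569,
}))  # every metric clears its threshold
```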
Benchmarking context
These thresholds come from extensive benchmarking work:
| Metric | Good Threshold | Benchmarking Average | Benchmarking Range |
| --- | --- | --- | --- |
| Average MAE | Below 0.15 | 0.13 | 0.10 - 0.18 |
| Average MSE | Below 0.25 | 0.21 | 0.15 - 0.30 |
| Average Normalised KLD | Below 0.1 | 0.0569 | 0.0244 - 0.1006 |
Accessing evaluation results
Once testing is complete, metrics are available in API responses:
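For illustration, the snippet below pulls the three documented metric fields out of a parsed test result. The surrounding object shape, and how you fetch it, depend on the endpoint you call, so treat everything except the field names as a placeholder:

```python
# `test_result` stands in for the parsed JSON of a completed test; only the
# three metric field names are documented in this section.
test_result = {
    "avg_mean_absolute_error": 0.1472,
    "avg_mean_squared_error": 0.21,
    "avg_normalised_kullback_leibler_divergence": 0.0569,
}

mae = test_result["avg_mean_absolute_error"]
mse = test_result["avg_mean_squared_error"]
kld = test_result["avg_normalised_kullback_leibler_divergence"]
print(f"MAE={mae}  MSE={mse}  Normalised KLD={kld}")
```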