Basic usage
Create a test by calling the `tests` method in a similar way to how you create a prediction, but provide `ground_answer_counts` and `ground_answer_sample_size` data.
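A minimal sketch of what this call might look like with a Python client is shown below. The client object, the `create` method, and the `question`/`answer_options` parameters are assumptions for illustration; `ground_answer_counts` and `ground_answer_sample_size` are the documented test inputs.

```python
# Hypothetical Python client call -- the client construction and method
# signature are assumptions; the ground truth inputs are as documented.
test = client.tests.create(
    population_model_id="POPULATION_MODEL_ID",      # see "Choosing population models"
    question="Which drink do you prefer?",           # illustrative question
    answer_options=["Tea", "Coffee", "Neither"],     # illustrative options
    ground_answer_counts={"Tea": 45, "Coffee": 40, "Neither": 15},
    ground_answer_sample_size=100,
)
```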
Tests are asynchronous and take ~20 seconds to run. Please see the section on handling async results for more details.
Choosing population models
Tests require a specific population model ID. Call the `list` method on `populations` to get a list of population models available for simulation.
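For example, with a hypothetical Python client (the `list` method on `populations` is as described above; the client object and response shape are assumptions):

```python
# List the population models available for simulation and inspect each
# entry to find the population model ID you need.
population_models = client.populations.list()
for model in population_models:
    print(model)
```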

Ground truth parameters
Testing requires the target population's true answer distribution for the question in order to calculate evaluation scores.
ground_answer_counts
The `ground_answer_counts` input provides a count of people for each answer option:
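For example, for a question with three answer options, the counts might look like this (the dictionary-of-counts shape and the option labels are assumptions based on the description above):

```python
# One count per answer option -- here 100 respondents in total
ground_answer_counts = {
    "Tea": 45,
    "Coffee": 40,
    "Neither": 15,
}
```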
ground_answer_sample_size
The `ground_answer_sample_size` input provides the total number of people who responded to the question. This is crucial for multiple-choice questions, where respondents can select multiple options.
Single-choice example:
- 100 people responded
- Each person selected exactly one option
- `ground_answer_sample_size`: 100

Multiple-choice example:
- 100 people responded
- People could select multiple options
- Total selections across all options: 180
- `ground_answer_sample_size`: 100 (the number of people, not total selections)
Question types
Single-choice Benchmarking
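A sketch of the ground truth inputs for a single-choice question, where the counts sum to the sample size (the values and option labels are illustrative, and the payload shape is an assumption):

```python
# Single-choice: each of the 100 respondents selected exactly one option,
# so the counts sum to ground_answer_sample_size.
ground_answer_counts = {"Yes": 60, "No": 30, "Unsure": 10}  # sums to 100
ground_answer_sample_size = 100
```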
Multiple-choice Benchmarking
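And the equivalent for a multiple-choice question, matching the worked example above: 100 respondents made 180 selections in total, so the counts sum to 180 while the sample size stays at 100 (again, values and labels are illustrative and the payload shape is an assumption):

```python
# Multiple-choice: respondents could pick several options, so the counts
# sum to the total number of selections (180), not the number of people.
ground_answer_counts = {"Price": 80, "Quality": 70, "Brand": 30}  # sums to 180
ground_answer_sample_size = 100  # number of people, not total selections
```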
Handling async results
Simulations run asynchronously. The initial response has a status of "Test Queued", and you need to poll for completion.
Initial response
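Conceptually, the create call returns straight away with a queued status. A sketch, assuming the Python client from the basic usage example and assuming the response exposes a `status` field:

```python
test = client.tests.create(...)  # as in the basic usage example above
print(test.status)               # "Test Queued" -- the test has not run yet
```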
Polling for results
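A polling sketch, assuming the client exposes some way to re-fetch a test by ID (the `retrieve` method name and the `id`/`status` fields are assumptions):

```python
import time

# Poll until the test reaches a terminal status.
while True:
    test = client.tests.retrieve(test.id)  # hypothetical retrieval method
    if test.status in ("Tested", "Test Failed"):
        break
    time.sleep(2)  # tests typically finish in under ~20 seconds

print(test.status)
```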
The simulation will progress through these statuses: `Test Queued` → `Test Running` → `Tested` (or potentially `Test Failed`). Tests typically take less than 20 seconds.
Evaluation metrics
Once complete, test results include both predictions and evaluation metrics:
Accuracy
- Field: `accuracy`
- Range: 0 to 1 (higher is better)
- Calculation: The mean absolute error (MAE) between the predicted and ground truth answer distributions is calculated and then subtracted from 1 to convert it from an error measure to a crude accuracy measure.
- Example: 0.8721
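Based on that description, the calculation is presumably of the form below, with $p$ the predicted distribution, $q$ the ground truth distribution, and $K$ the number of answer options:

$$\text{accuracy} = 1 - \frac{1}{K}\sum_{k=1}^{K}\lvert p_k - q_k \rvert$$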
Squared Error
- Field: `squared_error`
- Range: 0 to 1 (lower is better)
- Calculation: The root mean squared error (RMSE) between the predicted and ground truth answer distributions is calculated.
- Example: 0.1607
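Using the same notation:

$$\text{squared\_error} = \sqrt{\frac{1}{K}\sum_{k=1}^{K}(p_k - q_k)^2}$$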
Information Loss
- Field: `information_loss`
- Range: 0 to 1+ (lower is better)
- Calculation: The Kullback–Leibler (KL) divergence (also called relative entropy) between the predicted and ground truth answer distributions is calculated.
- Example: 0.0063
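The standard KL divergence between the two distributions is shown below; the direction (ground truth relative to prediction, or vice versa) and the logarithm base aren't specified here, so treat this as the general form rather than the exact implementation:

$$D_{\mathrm{KL}}(q \,\|\, p) = \sum_{k=1}^{K} q_k \log\frac{q_k}{p_k}$$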
Information loss scales with the number of answer options in a question. This means it cannot be used to compare accuracy between different questions (unless they have the same number of answer options). See normalised information loss below for a metric which can be used for comparisons.
Normalised Information Loss
- Field: `normalised_information_loss`
- Range: 0 to 1+ (lower is better)
- Calculation: The Kullback–Leibler (KL) divergence (also called relative entropy) between the predicted and ground truth answer distributions is calculated and then normalised to the number of answer options in the question.
- Example: 0.0063
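The exact normaliser isn't specified here; the literal reading of "normalised to the number of answer options" would be dividing by $K$, though dividing by $\log K$ (the maximum possible entropy) is another common convention:

$$\text{normalised\_information\_loss} = \frac{D_{\mathrm{KL}}(q \,\|\, p)}{K}$$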
Interpreting results
Based on all of our benchmarking to date, we have some heuristics on what “good” looks like.
Good performance thresholds
- Accuracy: Above 0.85 (higher values indicate better accuracy)
- Squared Error: Below 0.18 (lower values indicate more consistent predictions)
- Normalised Information Loss: Below 0.1 (lower values indicate better distribution matching)
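These thresholds can also be checked programmatically. A sketch, assuming the three metric values have already been read from the test result (the threshold values are the ones listed above):

```python
def meets_good_thresholds(accuracy, squared_error, normalised_information_loss):
    """Return True if all three metrics fall within the 'good' ranges above."""
    return (
        accuracy > 0.85
        and squared_error < 0.18
        and normalised_information_loss < 0.1
    )
```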
Benchmarking context
These thresholds come from extensive benchmarking work:

| Metric | Good Threshold | Benchmarking Average | Benchmarking Range |
| --- | --- | --- | --- |
| Accuracy | Above 0.85 | 0.87 | 0.82 - 0.92 |
| Squared Error | Below 0.18 | 0.16 | 0.11 - 0.24 |
| Normalised Information Loss | Below 0.1 | 0.0569 | 0.0244 - 0.1006 |
We increasingly find Normalised Information Loss to be the best single metric, but usually expect scores across all three metrics to be “good” to consider a population model reliable.