Basic usage
Create a test by calling the `tests` method in a similar way to how you create a prediction, but provide `ground_answer_counts` and `ground_answer_sample_size` data.
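A minimal sketch of what this call might look like with a Python client is shown below. The client object, the `create` method, and the `question`/`answer_options` parameters are assumptions for illustration; `ground_answer_counts` and `ground_answer_sample_size` are the documented test inputs.

```python
# Hypothetical Python client call -- the client construction and method
# signature are assumptions; the ground truth inputs are as documented.
test = client.tests.create(
    population_model_id="POPULATION_MODEL_ID",      # see "Choosing population models"
    question="Which drink do you prefer?",           # illustrative question
    answer_options=["Tea", "Coffee", "Neither"],     # illustrative options
    ground_answer_counts={"Tea": 45, "Coffee": 40, "Neither": 15},
    ground_answer_sample_size=100,
)
```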
Tests are asynchronous and take ~20 seconds to run. Please see the section on handling async results for more details.
Choosing population models
Tests require a specific population model ID. Call the `list` method on `populations` to get a list of population models available for simulation.
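For example, with a hypothetical Python client (the `list` method on `populations` is as described above; the client object and response shape are assumptions):

```python
# List the population models available for simulation and inspect each
# entry to find the population model ID you need.
population_models = client.populations.list()
for model in population_models:
    print(model)
```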

Ground truth parameters
Testing requires the target population's true answer distribution for the question in order to calculate evaluation scores.
ground_answer_counts
The `ground_answer_counts` input provides a count of people for each answer option:
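For example, for a question with three answer options, the counts might look like this (the dictionary-of-counts shape and the option labels are assumptions based on the description above):

```python
# One count per answer option -- here 100 respondents in total
ground_answer_counts = {
    "Tea": 45,
    "Coffee": 40,
    "Neither": 15,
}
```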
ground_answer_sample_size
The `ground_answer_sample_size` input provides the total number of people who responded to the question. This is crucial for multiple-choice questions, where respondents can select multiple options.
Single-choice example:
- 100 people responded
- Each person selected exactly one option
- `ground_answer_sample_size`: 100

Multiple-choice example:
- 100 people responded
- People could select multiple options
- Total selections across all options: 180
- `ground_answer_sample_size`: 100 (the number of people, not total selections)
Question types
Single-choice Benchmarking
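A sketch of the ground truth inputs for a single-choice question, where the counts sum to the sample size (the values and option labels are illustrative, and the payload shape is an assumption):

```python
# Single-choice: each of the 100 respondents selected exactly one option,
# so the counts sum to ground_answer_sample_size.
ground_answer_counts = {"Yes": 60, "No": 30, "Unsure": 10}  # sums to 100
ground_answer_sample_size = 100
```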
Multiple-choice Benchmarking
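And the equivalent for a multiple-choice question, matching the worked example above: 100 respondents made 180 selections in total, so the counts sum to 180 while the sample size stays at 100 (again, values and labels are illustrative and the payload shape is an assumption):

```python
# Multiple-choice: respondents could pick several options, so the counts
# sum to the total number of selections (180), not the number of people.
ground_answer_counts = {"Price": 80, "Quality": 70, "Brand": 30}  # sums to 180
ground_answer_sample_size = 100  # number of people, not total selections
```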
Handling async results
Simulations run asynchronously. The initial response has a status of "Test Queued", and you need to poll for completion.
Initial response
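Conceptually, the create call returns straight away with a queued status. A sketch, assuming the Python client from the basic usage example and assuming the response exposes a `status` field:

```python
test = client.tests.create(...)  # as in the basic usage example above
print(test.status)               # "Test Queued" -- the test has not run yet
```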
Polling for results
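A polling sketch, assuming the client exposes some way to re-fetch a test by ID (the `retrieve` method name and the `id`/`status` fields are assumptions):

```python
import time

# Poll until the test reaches a terminal status.
while True:
    test = client.tests.retrieve(test.id)  # hypothetical retrieval method
    if test.status in ("Tested", "Test Failed"):
        break
    time.sleep(2)  # tests typically finish in under ~20 seconds

print(test.status)
```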
The simulation will progress through these statuses: `Test Queued` → `Test Running` → `Tested` (or potentially `Test Failed`). Tests typically take less than 20 seconds.
Evaluation metrics
Once complete, test results include both predictions and evaluation metrics:
Accuracy
- Field: `accuracy`
- Range: 0 to 1 (higher is better)
- Calculation: The mean absolute error (MAE) between the predicted and ground truth answer distributions is calculated and then subtracted from 1 to convert it from an error measure to a crude accuracy measure.
- Example: 0.8721
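Based on that description, the calculation is presumably of the form below, with $p$ the predicted distribution, $q$ the ground truth distribution, and $K$ the number of answer options:

$$\text{accuracy} = 1 - \frac{1}{K}\sum_{k=1}^{K}\lvert p_k - q_k \rvert$$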
Squared Error
- Field: `squared_error`
- Range: 0 to 1 (lower is better)
- Calculation: The root mean squared error (RMSE) between the predicted and ground truth answer distributions is calculated.
- Example: 0.1607
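Using the same notation:

$$\text{squared\_error} = \sqrt{\frac{1}{K}\sum_{k=1}^{K}(p_k - q_k)^2}$$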
Information Loss
- Field: `information_loss`
- Range: 0 to 1+ (lower is better)
- Calculation: The Kullback–Leibler (KL) divergence (also called relative entropy) between the predicted and ground truth answer distributions is calculated.
- Example: 0.0063
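The standard KL divergence between the two distributions is shown below; the direction (ground truth relative to prediction, or vice versa) and the logarithm base aren't specified here, so treat this as the general form rather than the exact implementation:

$$D_{\mathrm{KL}}(q \,\|\, p) = \sum_{k=1}^{K} q_k \log\frac{q_k}{p_k}$$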
Information loss scales with the number of answer options in a question. This means it cannot be used to compare accuracy between different questions (unless they have the same number of answer options). See normalised information loss below for a metric which can be used for comparisons.
Normalised Information Loss
- Field: `normalised_information_loss`
- Range: 0 to 1+ (lower is better)
- Calculation: The Kullback–Leibler (KL) divergence (also called relative entropy) between the predicted and ground truth answer distributions is calculated and then normalised to the number of answer options in the question.
- Example: 0.0063
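The exact normaliser isn't specified here; the literal reading of "normalised to the number of answer options" would be dividing by $K$, though dividing by $\log K$ (the maximum possible entropy) is another common convention:

$$\text{normalised\_information\_loss} = \frac{D_{\mathrm{KL}}(q \,\|\, p)}{K}$$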
Interpreting results
Based on all of our benchmarking to date, we have some heuristics on what “good” looks like.
Good performance thresholds
- Accuracy: Above 0.85 (higher values indicate better accuracy)
- Squared Error: Below 0.18 (lower values indicate more consistent predictions)
- Normalised Information Loss: Below 0.1 (lower values indicate better distribution matching)
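These thresholds can also be checked programmatically. A sketch, assuming the three metric values have already been read from the test result (the threshold values are the ones listed above):

```python
def meets_good_thresholds(accuracy, squared_error, normalised_information_loss):
    """Return True if all three metrics fall within the 'good' ranges above."""
    return (
        accuracy > 0.85
        and squared_error < 0.18
        and normalised_information_loss < 0.1
    )
```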
Benchmarking context
These thresholds come from extensive benchmarking work:

| Metric | Good Threshold | Benchmarking Average | Benchmarking Range |
| --- | --- | --- | --- |
| Accuracy | Above 0.85 | 0.87 | 0.82 - 0.92 |
| Squared Error | Below 0.18 | 0.16 | 0.11 - 0.24 |
| Normalised Information Loss | Below 0.1 | 0.0569 | 0.0244 - 0.1006 |
We increasingly find Normalised Information Loss to be the best single metric, but usually expect scores across all three metrics to be “good” to consider a population model reliable.