Basic usage

Create a test by calling the create method on tests, in a similar way to how you create a prediction, but additionally provide ground_answer_counts and ground_answer_sample_size data.
Tests are asynchronous and take ~20 seconds to run. Please see the section on handling async results for more details.
from semilattice import Semilattice

semilattice = Semilattice()

response = semilattice.tests.create(
    population_id="population-id", # Replace with specific ID
    tests={
        "question": "What's your primary role?",
        "question_options": {"question_type": "single-choice"},
        "answer_options": ["Engineer", "Manager", "Designer", "Product"],
        "ground_answer_counts": {
            "Engineer": 45,
            "Manager": 12,
            "Designer": 18,
            "Product": 25
        },
        "ground_answer_sample_size": 100
    }
)

Choosing population models

Tests require a specific population model ID. Call the list method on populations to get a list of the population models available for simulation.
# List the population models available for simulation
response = semilattice.populations.list()
populations = response.data  # each population's ID can be passed as population_id
Alternatively, you can navigate to the populations page on your dashboard and select a population model to use. Copy the ID from the population’s metadata or from the address bar.

[Screenshot: click to copy the population model's ID]

Ground truth parameters

Testing requires the target population’s true answer distribution for the question in order to calculate evaluation scores.

ground_answer_counts

The ground_answer_counts input provides the count of people who selected each answer option:
{
    "Very satisfied": 34,
    "Somewhat satisfied": 41,
    "Neutral": 15,
    "Somewhat dissatisfied": 7,
    "Very dissatisfied": 3
}

ground_answer_sample_size

The ground_answer_sample_size provides the total number of people who responded to the question. This is crucial for multiple-choice questions, where respondents can select multiple options and the counts can therefore sum to more than the number of respondents. Single-choice example:
  • 100 people responded
  • Each person selected exactly one option
  • ground_answer_sample_size: 100
Multiple-choice example:
  • 100 people responded
  • People could select multiple options
  • Total selections across all options: 180
  • ground_answer_sample_size: 100 (the number of people, not total selections)
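As a minimal sketch (plain Python with made-up counts, not an SDK call), each option's percentage is its count divided by ground_answer_sample_size, which is why multiple-choice percentages can sum to more than 1:

# Illustrative only: hypothetical counts for a multiple-choice question
# answered by 100 people who made 180 selections in total.
ground_answer_counts = {"Option A": 70, "Option B": 60, "Option C": 50}
ground_answer_sample_size = 100  # number of people, not selections

# Each option's share is count / sample size ...
percentages = {
    option: count / ground_answer_sample_size
    for option, count in ground_answer_counts.items()
}

print(sum(ground_answer_counts.values()))  # 180 total selections
print(percentages)                         # {'Option A': 0.7, 'Option B': 0.6, 'Option C': 0.5}
print(sum(percentages.values()))           # 1.8 -- can exceed 1 for multiple-choice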

Question types

Single-choice Benchmarking

response = semilattice.tests.create(
    population_id="population-id", # Replace with specific ID
    tests={
        "question": "What's your experience level?",
        "answer_options": ["Junior", "Mid-level", "Senior", "Staff"],
        "question_options": {"question_type": "single-choice"},
        "ground_answer_counts": {
            "Junior": 25,
            "Mid-level": 35,
            "Senior": 30,
            "Staff": 10
        },
        "ground_answer_sample_size": 100
    }
)

Multiple-choice Benchmarking

response = semilattice.tests.create(
    population_id="population-id", # Replace with specific ID
    tests={
        "question": "Which tools do you use? (Select all that apply)",
        "answer_options": ["Git", "Docker", "Kubernetes", "AWS"],
        "question_options": {"question_type": "multiple-choice"},
        "ground_answer_counts": {
            "Git": 85,        # 85 out of 100 people selected Git
            "Docker": 60,     # 60 out of 100 people selected Docker
            "Kubernetes": 30, # 30 out of 100 people selected Kubernetes
            "AWS": 45         # 45 out of 100 people selected AWS
        },
        "ground_answer_sample_size": 100  # 100 people total responded
    }
)

Handling async results

Tests run asynchronously. The initial response has a status of "Test Queued", and you need to poll until the test completes.

Initial response

{
    "id": "84a92e29-54e6-4a60-862d-26cd2a78421e",
    "status": "Test Queued",
    "question": "What's your experience level?",
    "answer_options": ["Junior", "Mid-level", "Senior", "Staff"],
    "predicted_answer_percentages": null,
    "accuracy": null,
    // ... other fields
}

Polling for results

The test will progress through these statuses: Test Queued → Test Running → Tested (or potentially Test Failed). Tests typically take less than 20 seconds.
import time

test = response.data[0]

# Poll until the test reaches a terminal status
while test.status not in ("Tested", "Test Failed"):
    time.sleep(1)
    response = semilattice.tests.get(test.id)
    test = response.data

if test.status == "Tested":
    print("Test complete!")
else:
    print("Test failed")

Evaluation metrics

Once complete, test results include both predictions and evaluation metrics:
{
    "id": "prediction-id",
    "created_at": "2025-09-17T11:11:45.493588Z",
    "population": "population-id",
    "population_name": "Developers",
    "batch": "batch-id", // or null
    "status": "Predicted",
    "question": "Which is worse?",
    "answer_options": ["Tech debt", "Unclear error messages"],
    
    // Predicted answer distribution
    "predicted_answer_percentages": { "Tech debt": 0.38, "Unclear error messages": 0.62 },
    
    // Ground truth provided when creating the test
    "ground_answer_counts": { "Tech debt": 55, "Unclear error messages": 45 },
    "ground_answer_percentages": { "Tech debt": 0.47, "Unclear error messages": 0.53 },
    
    // Accuracy metrics (see below)
    "accuracy": 0.91,
    "squared_error": 0.08999999999999997,
    "information_loss": 0.0167,
    "normalised_information_loss": 0.008388683920526128,

    // Other details
    "question_options": { "question_type": "single-choice" },
    "simulation_engine": "answers-1",
    "test_started_at": "2025-09-17T11:11:45.887311Z",
    "test_finished_at": "2025-09-17T11:11:47.506565Z",
    "public": false,
}
For individual test predictions, the API calculates four evaluation metrics:

Accuracy

  • Field: accuracy
  • Range: 0 to 1 (higher is better)
  • Calculation: The mean absolute error (MAE) between the predicted and ground truth answer distributions is calculated and then subtracted from 1 to convert it from an error measure to a crude accuracy measure.
  • Example: 0.8721
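As a rough sketch of this calculation in plain Python (using the predicted and ground truth percentages from the example result above; this is not an SDK call):

# Accuracy = 1 - mean absolute error (MAE) between the two distributions
predicted = {"Tech debt": 0.38, "Unclear error messages": 0.62}
ground = {"Tech debt": 0.47, "Unclear error messages": 0.53}

mae = sum(abs(predicted[k] - ground[k]) for k in ground) / len(ground)
accuracy = 1 - mae
print(round(accuracy, 4))  # 0.91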

Squared Error

  • Field: squared_error
  • Range: 0 to 1 (lower is better)
  • Calculation: The root mean squared error (RMSE) between the predicted and ground truth answer distributions is calculated.
  • Example: 0.1607
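Sketched the same way (again using the example distributions above, not the SDK):

# Squared error reported as root mean squared error (RMSE)
predicted = {"Tech debt": 0.38, "Unclear error messages": 0.62}
ground = {"Tech debt": 0.47, "Unclear error messages": 0.53}

rmse = (sum((predicted[k] - ground[k]) ** 2 for k in ground) / len(ground)) ** 0.5
print(round(rmse, 4))  # 0.09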

Information Loss

  • Field: information_loss
  • Range: 0 to 1+ (lower is better)
  • Calculation: The Kullback–Leibler (KL) divergence (also called relative entropy) between the predicted and ground truth answer distributions is calculated.
  • Example: 0.0063
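A sketch of the same idea, assuming natural logarithms and the ground truth as the reference distribution (the exact convention used by the API is an assumption here):

import math

# KL divergence of the predicted distribution from the ground truth
predicted = {"Tech debt": 0.38, "Unclear error messages": 0.62}
ground = {"Tech debt": 0.47, "Unclear error messages": 0.53}

kl = sum(ground[k] * math.log(ground[k] / predicted[k]) for k in ground)
print(round(kl, 4))  # 0.0168, close to the 0.0167 in the example result above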
Information loss scales with the number of answer options in a question. This means it cannot be used to compare accuracy between different questions (unless they have the same number of answer options). See normalised information loss below for a metric which can be used for comparisons.

Normalised Information Loss

  • Field: normalised_information_loss
  • Calculation: The information loss (KL divergence) is normalised to account for the number of answer options in the question, so scores can be compared across questions with different numbers of answer options.
  • Example: 0.0084

Interpreting results

Based on all of our benchmarking to date, we have some heuristics for what “good” looks like.

Good performance thresholds

  • Accuracy: Above 0.85 (higher values indicate better accuracy)
  • Squared Error: Below 0.18 (lower values indicate more consistent predictions)
  • Normalised Information Loss: Below 0.1 (lower values indicate better distribution matching)
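As a quick sketch of applying these thresholds (assuming a completed test object like the one returned by the polling loop above; looks_good is a hypothetical helper, not part of the SDK):

def looks_good(test) -> bool:
    """Rough check of a completed test against the thresholds above."""
    return (
        test.accuracy > 0.85
        and test.squared_error < 0.18
        and test.normalised_information_loss < 0.1
    )

print(looks_good(test))  # True if the test clears all three thresholds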

Benchmarking context

These thresholds come from extensive benchmarking work:
Metric                        Good Threshold   Benchmarking Average   Benchmarking Range
Accuracy                      Above 0.85       0.87                   0.82 - 0.92
Squared Error                 Below 0.18       0.16                   0.11 - 0.24
Normalised Information Loss   Below 0.1        0.0569                 0.0244 - 0.1006
We increasingly find Normalised Information Loss to be the best single metric, but usually expect scores across all three metrics to be “good” to consider a population model reliable.