Basic usage
Create a test batch by passing an object to the batch field in a tests create request. The create method's tests field accepts either a single test object or a list of test objects, but a batch is only created when batch details are provided.
Test predictions are asynchronous and take ~20 seconds each to run. Please see the section on handling async results for more details.
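For concreteness, here is a minimal sketch of such a request using Python's requests library. The base URL, endpoint path, auth header, and the shape of the test object are illustrative assumptions, not the documented API surface; only the tests and batch fields come from the description above.

```python
# Hypothetical sketch of a tests create request that also creates a batch.
# The base URL, endpoint path, auth header, and test payload shape are
# illustrative assumptions; only the tests/batch fields come from the docs.
import requests

API_BASE = "https://api.example.com"                 # hypothetical base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}   # hypothetical auth scheme

payload = {
    # The tests field accepts a single test object or a list of them.
    "tests": [
        {"population_id": "POPULATION_MODEL_ID", "question": "Example question?"},
    ],
    # Including a batch object is what causes a batch to be created.
    "batch": {"name": "My first test batch"},
}

response = requests.post(f"{API_BASE}/tests", json=payload, headers=HEADERS)
response.raise_for_status()
created = response.json()
```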
Fetch a batch
Test responses always contain a batch field; when batch details were provided and a batch was created, it holds the batch's ID. Grab this ID and then fetch the batch. The batch response contains both the batch object and the batch's tests.
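Continuing the sketch above, fetching the batch might look like the following; the batch endpoint path and the exact location of the batch ID in the create response are assumptions.

```python
# Continuing the earlier sketch (reuses API_BASE, HEADERS, and `created`).
# Assumed response shape: created tests under a "tests" key, each carrying
# the batch ID in its batch field.
batch_id = created["tests"][0]["batch"]

batch_response = requests.get(f"{API_BASE}/batches/{batch_id}", headers=HEADERS)
batch_response.raise_for_status()
batch = batch_response.json()

print(batch["status"])   # the batch object's overall status
print(batch["tests"])    # the batch's tests, each with its own status
```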
Choosing population models
Tests require a specific population model ID. Call the list method on populations to get the population models available for simulation.

Click to copy the population model's ID
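If you prefer to fetch the ID programmatically, a list call might look like the sketch below; the endpoint path and response shape are assumptions, reusing the API_BASE and HEADERS placeholders from the earlier snippet.

```python
# Sketch of listing population models available for simulation. The endpoint
# path and response shape are illustrative assumptions.
populations = requests.get(f"{API_BASE}/populations", headers=HEADERS).json()

for population in populations:
    print(population.get("id"), population.get("name"))
```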
Handling async batch results
Batch test simulations run asynchronously. A batch object has a status field capturing the overall status of the batch, and each individual test within the batch has its own status field capturing that test's status. The initial batch status will be "Test Queued", and you need to poll for completion.
Initial response
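As an illustration only, a newly created batch might look something like this; apart from the "Test Queued" status values, the field names are assumptions.

```python
# Illustrative shape of a just-created batch (not a real API response).
# The "Test Queued" statuses come from the docs; other fields are assumed.
initial_batch = {
    "id": "BATCH_ID",
    "status": "Test Queued",
    "tests": [
        {"id": "TEST_ID_1", "status": "Test Queued"},
        {"id": "TEST_ID_2", "status": "Test Queued"},
    ],
}
```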
Polling for results
The batch will progress through these statuses: Test Queued → Test Running → Tested (or potentially Test Failed). Tests typically take less than 20 seconds each, so a batch should take roughly N × 20 seconds, where N is the number of tests in the batch.
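A simple polling loop, continuing the assumed endpoints from the sketches above, might look like this:

```python
import time

# Poll the batch until it reaches a terminal status. The status values come
# from the docs; the endpoint path and response shape are assumptions.
TERMINAL_STATUSES = {"Tested", "Test Failed"}

while True:
    batch = requests.get(f"{API_BASE}/batches/{batch_id}", headers=HEADERS).json()
    if batch["status"] in TERMINAL_STATUSES:
        break
    time.sleep(20)   # tests take ~20 seconds each, so a modest interval is enough

print(batch["status"])
for test in batch["tests"]:
    print(test["status"])   # individual test statuses
```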
Evaluation metrics
Once complete, batch results include aggregate evaluation metrics for all tests in the batch:
Average Accuracy
- Field: average_accuracy
- Range: 0 to 1 (higher is better)
- Calculation: The mean absolute error (MAE) between the predicted and ground truth answer distributions for each question is calculated, averaged, and then subtracted from 1 to convert it from an error measure to a crude accuracy measure.
- Example: 0.8721
Average Squared Error
- Field: average_squared_error
- Range: 0 to 1 (lower is better)
- Calculation: The root mean squared error (RMSE) between the predicted and ground truth answer distributions for each question is calculated and then averaged.
- Example: 0.1607
Average Normalised Information Loss
- Field: average_normalised_information_loss
- Range: 0 to 1+ (lower is better)
- Calculation: The Kullback–Leibler (KL) divergence (also called relative entropy) between the predicted and ground truth answer distributions for each question is calculated, normalised to the number of answer options in the question, and then averaged.
- Example: 0.0063
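To make the calculations concrete, here is a toy sketch of how these aggregates could be computed from per-question answer distributions. It follows the descriptions above; the direction of the KL divergence and the exact normalisation "to the number of answer options" are not fully specified here, so those choices (and the log(n) divisor) are assumptions.

```python
import math

def question_metrics(predicted, actual):
    """Per-question metrics for two probability distributions over the same answer options."""
    n = len(predicted)
    mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / n
    rmse = math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
    eps = 1e-12  # guard against log(0)
    # KL divergence D(actual || predicted); the direction and the log(n)
    # normalisation are assumptions about the exact definition.
    kl = sum(a * math.log((a + eps) / (p + eps)) for p, a in zip(predicted, actual) if a > 0)
    return mae, rmse, kl / math.log(n)

# Two toy questions: (predicted distribution, ground truth distribution).
questions = [
    ([0.60, 0.30, 0.10], [0.55, 0.35, 0.10]),
    ([0.20, 0.50, 0.30], [0.25, 0.45, 0.30]),
]

maes, rmses, nils = zip(*(question_metrics(p, a) for p, a in questions))
average_accuracy = 1 - sum(maes) / len(maes)                  # MAE averaged, subtracted from 1
average_squared_error = sum(rmses) / len(rmses)               # RMSE averaged
average_normalised_information_loss = sum(nils) / len(nils)   # normalised KL averaged
```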
Interpreting results
Based on all of our benchmarking to date, we have some heuristics on what “good” looks like.
Good performance thresholds
- Average Accuracy: Above 0.85 (higher values indicate better accuracy)
- Average Squared Error: Below 0.18 (lower values indicate more consistent predictions)
- Average Normalised Information Loss: Below 0.1 (lower values indicate better distribution matching)
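As a quick sanity check in code, you could compare a batch's aggregate metrics against these thresholds. The metric field names come from the definitions above; exactly where they live on the batch results object is an assumption.

```python
def looks_good(results):
    """Heuristic check of aggregate batch metrics against the thresholds above."""
    return (
        results["average_accuracy"] > 0.85
        and results["average_squared_error"] < 0.18
        and results["average_normalised_information_loss"] < 0.1
    )

# Using the example values quoted above:
print(looks_good({
    "average_accuracy": 0.8721,
    "average_squared_error": 0.1607,
    "average_normalised_information_loss": 0.0063,
}))  # True
```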
Benchmarking context
These thresholds come from extensive benchmarking work:

| Metric | Good Threshold | Benchmarking Average | Benchmarking Range |
| --- | --- | --- | --- |
| Average Accuracy | Above 0.85 | 0.87 | 0.82 - 0.92 |
| Average Squared Error | Below 0.18 | 0.16 | 0.11 - 0.24 |
| Average Normalised Information Loss | Below 0.1 | 0.0569 | 0.0244 - 0.1006 |