Batches are sets of tests grouped under a specific name and optional description. The name parameter lets you record the purpose of the tests, for example “Q4 feature preference testing”.

Basic usage

Create a test batch by passing an object to the batch field in a tests create request. The create method’s tests field accepts either a single test object or a list of test objects, but a batch is only created if batch details are set.
Test predictions are asynchronous and take ~20 seconds each to run. Please see the section on handling async results for more details.
from semilattice import Semilattice

semilattice = Semilattice()

response = semilattice.tests.create(
    population_id="population-id", # Replace with specific ID
    batch={
        "name": "Q4 feature preference testing",
        "description": "Benchmarking from recent user survey",
    },
    tests=[
        {
            "question": "What feature would have the biggest impact on your daily workflow?",
            "answer_options": [
                "Advanced search and filtering", 
                "Real-time collaboration tools", 
                "Mobile app improvements", 
                "API integrations"
            ],
            "question_options": {"question_type": "single-choice"},
            "ground_answer_counts": {
                "Advanced search and filtering": 45,
                "Real-time collaboration tools": 28,
                "Mobile app improvements": 18,
                "API integrations": 9
            },
            "ground_answer_sample_size": 100
        },
        {
            "question": "Which type of enhancement should we prioritize next quarter?",
            "answer_options": [
                "Performance optimizations", 
                "New dashboard features", 
                "Enhanced reporting capabilities", 
                "User interface redesign"
            ],
            "question_options": {"question_type": "single-choice"},
            "ground_answer_counts": {
                "Performance optimizations": 42,
                "New dashboard features": 25,
                "Enhanced reporting capabilities": 18,
                "User interface redesign": 15
            },
            "ground_answer_sample_size": 100
        }
    ]
)

Fetch a batch

Test responses always contain a batch field, but a batch is only created if batch details were provided in the create request.
# Each test response will have a batch field containing the batch id
tests = response.data
batch_id = tests[0].batch

If batch details were provided, each test in the batch will have the same batch ID in its batch field. Grab this ID and then fetch the batch. The batch response will contain both the batch object and the batch’s tests.
response = semilattice.tests.get_batch(batch_id=batch_id)
batch = response.data.batch
batch_tests = response.data.tests

Choosing population models

Creating tests requires a specific population model ID. Call the list method on populations to get the population models available for simulation.
response = semilattice.populations.list()
populations = response.data
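To choose one programmatically, loop over the returned population models and inspect their details. A minimal sketch, assuming each population object exposes id and name attributes (check the fields on the response objects in your SDK version):
# Print each available population model (attribute names are assumptions)
for population in populations:
    print(population.id, population.name)
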
Alternatively, you can navigate to the populations page on your dashboard and select a population model to use. Copy the ID from the population’s metadata or from the address bar.

Click to copy the population model's ID

Handling async batch results

Batch test simulations run asynchronously. A batch object has a status field that captures the overall status of the batch, and each individual test within the batch has its own status field capturing that test’s status. The initial batch status will be "Test Queued", and you need to poll for completion.

Initial response

{
    "id": "ac9c798e-87b8-46d0-b824-8d8d8d5dca03",
    "status": "Test Queued",
    "name": "Q4 feature preference testing",
    "description": "Benchmarking from recent user survey",
    // ... other fields
}

Polling for results

The batch will progress through these statuses: Test Queued → Test Running → Tested (or potentially Test Failed). Tests typically take less than 20 seconds, so a batch should take N * ~20 seconds, where N is the number of tests in the batch.
import time

# Poll until the batch reaches a terminal status
while batch.status not in ("Tested", "Test Failed"):
    time.sleep(1)
    response = semilattice.tests.get_batch(batch_id=batch.id)
    batch = response.data.batch

print(f"Batch finished with status: {batch.status}")

Evaluation metrics

Once complete, batch results include aggregate evaluation metrics for all tests in the batch:
{
    "id": "batch-id",
    "created_at": "2025-09-17T11:11:45.493588Z",
    "name": "Q4 feature preference testing",
    "description": "Benchmarking from recent user survey",
    "population": "population-id",
    "status": "Tested",
    "batch_type": "test",
    
    // Aggregate evaluation metrics
    "average_accuracy": 0.6908609559394638,
    "average_normalised_information_loss": 0.1541559753180729,
    "average_squared_error": 0.34008355133987767,
    
    // Other details
    "simulation_engine": "answers-1",
    "test_started_at": "2025-09-17T11:11:45.887311Z",
    "test_finished_at": "2025-09-17T11:11:47.506565Z",
    "effective_date": null,
    "data_source": null,
    "public": false,
}
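These aggregate fields can be read from the batch object returned by get_batch, assuming they are exposed as attributes in the same way as status and id:
# Read the batch-level evaluation metrics once the batch has been tested
print("Average accuracy:", batch.average_accuracy)
print("Average squared error:", batch.average_squared_error)
print("Average normalised information loss:", batch.average_normalised_information_loss)
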
For test batches, the API calculates the same three evaluation metrics as for population cross-validation tests:

Average Accuracy

  • Field: average_accuracy
  • Range: 0 to 1 (higher is better)
  • Calculation: The mean absolute error (MAE) between the predicted and ground truth answer distributions for each question is calculated, averaged, and then subtracted from 1 to convert it from an error measure to a crude accuracy measure.
  • Example: 0.8721

Average Squared Error

  • Field: average_squared_error
  • Range: 0 to 1 (lower is better)
  • Calculation: The root mean squared error (RMSE) between the predicted and ground truth answer distributions for each question is calculated and then averaged.
  • Example: 0.1607

Average Normalised Information Loss

  • Field: average_normalised_information_loss
  • Range: 0 to 1+ (lower is better)
  • Calculation: The Kullback–Leibler (KL) divergence (also called relative entropy) between the predicted and ground truth answer distributions for each question is calculated, normalised to the number of answer options in the question, and then averaged.
  • Example: 0.0063
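To make these calculations concrete, here is an illustrative per-question sketch based on the descriptions above. It is not the API’s implementation: in particular, the direction of the KL divergence and the exact normalisation are assumptions.
import math

# Illustrative per-question metrics for two answer distributions, each given
# as a dict mapping answer option -> probability (values sum to 1).
def question_metrics(predicted, ground):
    options = list(ground.keys())
    errors = [predicted[option] - ground[option] for option in options]

    # Accuracy: 1 minus the mean absolute error between the distributions
    accuracy = 1 - sum(abs(e) for e in errors) / len(options)

    # Squared error: root mean squared error between the distributions
    squared_error = math.sqrt(sum(e ** 2 for e in errors) / len(options))

    # Information loss: KL divergence between the distributions, divided by
    # the number of answer options (direction and normalisation are assumptions)
    kl_divergence = sum(
        ground[option] * math.log(ground[option] / predicted[option])
        for option in options
        if ground[option] > 0 and predicted[option] > 0
    )
    information_loss = kl_divergence / len(options)

    return accuracy, squared_error, information_loss

For the first question in the batch above, the ground distribution is simply the ground_answer_counts divided by the sample size of 100, i.e. 0.45, 0.28, 0.18, and 0.09.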

Interpreting results

Based on all of our benchmarking to date, we have some heuristics on what “good” looks like, which you can check programmatically as shown below.

Good performance thresholds

  • Average Accuracy: Above 0.85 (higher values indicate better accuracy)
  • Average Squared Error: Below 0.18 (lower values indicate more consistent predictions)
  • Average Normalised Information Loss: Below 0.1 (lower values indicate better distribution matching)
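As a simple programmatic check, a tested batch’s aggregate metrics can be compared against these thresholds. A minimal sketch, assuming the metric fields are exposed as attributes on the batch object:
# Compare the batch's aggregate metrics against the "good" thresholds above
good = (
    batch.average_accuracy > 0.85
    and batch.average_squared_error < 0.18
    and batch.average_normalised_information_loss < 0.1
)
print("Good batch performance" if good else "Below benchmark thresholds")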

Benchmarking context

These thresholds come from extensive benchmarking work:
Metric                                 Good Threshold   Benchmarking Average   Benchmarking Range
Average Accuracy                       Above 0.85       0.87                   0.82 - 0.92
Average Squared Error                  Below 0.18       0.16                   0.11 - 0.24
Average Normalised Information Loss    Below 0.1        0.0569                 0.0244 - 0.1006