Population model creation is a paid feature on the Play and Launch pricing plans. Visit the account page in the dashboard to upgrade.
Seed Data

Semilattice needs a small sample of data from the target population to make predictions. This page explains the data requirements and the core concepts that will help you create good population models.

Requirements

Populations require tabular, individual-level question-answer pair data from the group of people the population aims to model. For example: data from a survey, poll, questionnaire, or onboarding form where each row is an individual’s answers to a set of questions.

Question types

The API supports three question types in seed datasets:
  • single-choice: individuals choose a single answer from a finite list of options.
  • multiple-choice: individuals choose multiple answers from a finite list of options, sometimes with a limit on the number of options they can choose.
  • open-ended: individuals answer in their own words.
Open-ended questions cannot be simulated, but open-ended question data is used by population models to simulate single-choice and multiple-choice questions. See the technical requirements below for how these question types should be formatted in your seed data file.

Technical requirements

The API accepts .csv files which meet the following requirements:
  • the first column must be titled sim_id
  • the sim_id column should contain a unique number in each row
  • there should be between 4 and 999 question columns
  • questions must be between 3 and 999 characters long
  • there should be between 1 and 99,999 respondent rows
  • answer cells can be empty
  • non-empty answers must be no more than 999 characters long
  • answers cannot contain newline characters
Question type requirements:
  • there must be at least one single-choice or multiple-choice question column in your seed data file because open-ended questions cannot be tested
  • for single-choice and multiple-choice questions, there should be between 1 and 500 unique answer choices
  • non-empty answer cells for single-choice and open-ended questions must be valid strings representing the respondent’s answer e.g. “Somewhat Agree” or “I usually go to the beach”
  • non-empty answer cells for multiple-choice questions must be valid lists of quoted strings representing the respondent's set of answer choices e.g. "['Chocolate', 'Vanilla', 'Lemon sorbet']"
While the API supports very small and very large datasets, most of our modelling and benchmarking work to date has been done with datasets containing ~5-20 questions from ~50-3,000 respondents and fewer than 5% empty cells, because that is where ground truth data is most available. Dataset size is a modelling decision, and we encourage experimentation: run population tests and benchmark simulations to find the best subset for your goals.
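You can check most of these structural requirements locally before uploading. The sketch below is a hypothetical pre-flight check based on the limits listed above, not the API's own validator:

```python
import csv

# Hypothetical pre-flight check for a seed data CSV, based on the
# documented limits. A local sketch, not the API's own validation.
def check_seed_csv(path):
    errors = []
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    if header[0] != "sim_id":
        errors.append("first column must be titled sim_id")
    questions = header[1:]
    if not 4 <= len(questions) <= 999:
        errors.append("need between 4 and 999 question columns")
    for q in questions:
        if not 3 <= len(q) <= 999:
            errors.append(f"question length out of range: {q!r}")
    if not 1 <= len(data) <= 99_999:
        errors.append("need between 1 and 99,999 respondent rows")
    sim_ids = [row[0] for row in data]
    if len(set(sim_ids)) != len(sim_ids):
        errors.append("sim_id values must be unique")
    for row in data:
        for cell in row[1:]:  # empty cells are allowed
            if len(cell) > 999:
                errors.append("answer longer than 999 characters")
            if "\n" in cell:
                errors.append("answers cannot contain newline characters")
    return errors
```

Running a check like this catches structural problems early; the question-type rules (such as the number of unique answer choices) still need checking by hand.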

Example

This very small CSV is valid, containing answers to 4 questions from 10 respondents. The first two question columns are single-choice questions, the third is multiple-choice, and the fourth is open-ended.
sim_id,"What's your title?","How many years have you been coding?","Which languages do you know?","How do you feel about AI?"
1,"Software Engineer","3-5 years","['Java']","optimistic!"
2,"Senior Software Engineer","5-8 years","['JavaScript', 'Python']",""
3,"Data Scientist","1-2 years","['Python', 'R']","dunno"
4,"Software Engineer","3-5 years","['JavaScript', 'Java']","I love it, use it all the time"
5,"Software Engineer","1-2 years","['Rust']","not sure but very curious"
6,"DevOps Engineer","3-5 years","['JavaScript', 'Lisp']","it cant do devops yet"
7,"Software Engineer","8+ years","['Go', 'Rust', 'C']",""
8,"Back-End Engineer","5-8 years","['Python', 'JavaScript']","scared/excited"
9,"Back-End Engineer","1-2 years","['Rust', 'C']",""
10,"Software Engineer","3-5 years","['Go', 'JavaScript', 'Java']","we shall seeeee"
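If you are exporting seed data programmatically, the standard csv module handles the quoting for you; the only format-specific step is rendering each multiple-choice answer set as a quoted-list string. A minimal sketch using the column names from the example above (row content is illustrative):

```python
import csv

def choices_cell(options):
    # Render a multiple-choice answer set as a quoted-list string,
    # e.g. ["JavaScript", "Python"] -> "['JavaScript', 'Python']"
    return "[" + ", ".join(f"'{opt}'" for opt in options) + "]"

# One respondent row; the csv module adds the outer quoting on write.
rows = [
    {
        "sim_id": 1,
        "What's your title?": "Software Engineer",
        "How many years have you been coding?": "5-8 years",
        "Which languages do you know?": choices_cell(["JavaScript", "Python"]),
        "How do you feel about AI?": "optimistic!",
    },
]

with open("seed.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```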

Qualitative requirements & guidelines

The API doesn’t check if your seed data follows these requirements and guidelines, but they are essential to creating effective populations.
Semilattice assumes column headers are questions or propositions: pieces of text posed to people who can respond with an answer. For example, “What is your favourite colour?” is a question while “You believe reality is…” is a proposition.
Questions can’t assume the respondent has any knowledge of the other questions in the dataset. For example, the question “How strongly do you feel about your answer to the last question?” will not work because, for simulated respondents, there is no equivalent of a last question.
Answer options cannot refer to the question’s other answer options. For example, the answer option “None of the above” is invalid because its meaning depends on the other answer options. However, “None” is valid as its meaning is independent.
Question-answer pair datasets often record questions and answers as descriptors, identifiers, or codes which require lookup in a data dictionary. Semilattice only accepts a single .csv, so questions and answers must be decoded into human-readable values first.
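In practice, decoding is a lookup-table pass over the coded export before you write the final .csv. The codes and labels below are illustrative, not from any particular survey tool:

```python
# Illustrative data dictionary: code -> human-readable value.
QUESTIONS = {
    "Q1": "What's your title?",
    "Q2": "How many years have you been coding?",
}
ANSWERS = {
    "Q1": {"1": "Software Engineer", "2": "Data Scientist"},
    "Q2": {"1": "1-2 years", "2": "3-5 years"},
}

def decode_row(coded):
    # {"Q1": "1", "Q2": "2"} -> {"What's your title?": "Software Engineer", ...}
    return {QUESTIONS[q]: ANSWERS[q][a] for q, a in coded.items()}
```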
Mixing data from different samples is not recommended: Semilattice assumes all seed data come from a single source.
Mixing answers from different people is not recommended: Semilattice assumes all the answers in a row come from the same person.

Good to know

These considerations are useful to bear in mind.
Questions which were originally posed as single-choice, multiple-choice, or open-ended questions are recommended, but if necessary other types of questions can be reformatted to meet the technical requirements. Caution is advised though: the more you reformat, the more your seed data will diverge from reality.
The order of the question columns in your seed data has no effect. Questions will be used in various ways, in various orders.
Answer options will be presented in random order by default. Answer options with semantic order (e.g. “Strongly Agree” to “Strongly Disagree”) are perfectly valid, but answer options which assume an order (e.g. “None of the above”) are not. Answer options like “None of the above” should be reformatted to stand alone, e.g. “None”.
You can filter large datasets down to specific segments, subsets of questions, or the right number of respondents for Semilattice. However, just like with a real-world dataset, heavy filtering can reduce the quality of the resultant sample and can lead to lower accuracy population models.
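A minimal sketch of that kind of filtering with the standard library — the column names and segment condition are illustrative:

```python
import csv
import random

def filter_seed(in_path, out_path, keep_questions, segment, n_respondents, seed=0):
    # Keep only the listed question columns, rows matching the segment
    # predicate, and a random sample of at most n_respondents rows.
    with open(in_path, newline="") as f:
        rows = list(csv.DictReader(f))
    rows = [r for r in rows if segment(r)]
    random.Random(seed).shuffle(rows)
    rows = rows[:n_respondents]
    fields = ["sim_id"] + keep_questions
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```

Random sampling (rather than taking the first N rows) helps avoid accidentally biasing the subset toward whatever ordering the original export used.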

Seed data are not the only data

A population’s seed data do not represent all of the information Semilattice uses to make predictions. Most of the information comes from the Large Language Model (LLM) used by the Simulation Engine, with the seed data effectively selecting the relevant information from the LLM. This means you should think about the predictive power of the questions and answers in your seed data, not only the information they literally contain.

Predictive power

A population’s accuracy when predicting the answer to a new question is dependent on the predictive power of the questions and answers in its seed data. This predictive power varies depending on the question being predicted and is a function of how those seed data work with the Simulation Engine.

Predictive power cannot be predicted

Due to the black box nature of the LLMs used by Semilattice, it’s not yet possible to tell with certainty which seed data have more predictive power.

Building intuition around predictive power

An intuitive way of estimating the predictive power of a question is to think about how informative someone’s answer to that question would be. For example, knowing someone’s age tells you a lot more about them than knowing whether they prefer ketchup or mayonnaise. However, it’s also important to consider the purpose of the population you are building. If you work in R&D at Heinz, the latter question will have more predictive power for the questions you plan to predict.
Predictive power is an area of research for Semilattice. If you want to go deeper, it maps broadly to the concept of feature selection in machine learning & data mining.

Question selection best practices

The subset of questions you include in your seed dataset has the biggest impact on the accuracy and predictive reach of your population. These best practices will help you achieve the best results.

Maximise predictive power

Pick questions with the most predictive power for your research objective.

Minimise subjective overlap

Pick a diverse set of questions, minimising questions which reveal similar information. For example, the questions “Do you prefer ketchup, mayonnaise, or mustard?” and “What is your favourite condiment on french fries?” have a high degree of subjective overlap.
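One rough way to spot overlap in an existing dataset is to measure the association between pairs of single-choice columns; highly associated pairs are candidates for dropping one. A sketch using Cramér's V, a standard association measure for categorical data (this heuristic is our suggestion, not part of the Semilattice API):

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    # Cramér's V for two categorical columns:
    # 0.0 = no association, 1.0 = one column fully determines the other.
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    chi2 = 0.0
    for x in px:
        for y in py:
            expected = px[x] * py[y] / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(px), len(py)) - 1
    return sqrt(chi2 / (n * k)) if k else 0.0
```

Values near 1.0 for a pair of question columns suggest high subjective overlap; one of the pair is probably adding little predictive power.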

Maximise the number of questions and responses

While not strictly required, a dataset with 20 questions and 3,000 responses is recommended. Datasets of 1,000 responses and ~10 questions can also work well.