Seed data requirements
CSV format and data quality requirements for population creation
Population model creation is a paid feature on the Play and Launch pricing plans. Visit the account page in the dashboard to upgrade.
This page outlines the technical requirements and best practices for preparing CSV data to create custom populations. Your data quality directly impacts prediction accuracy, so following these guidelines is essential.
Overview
Your CSV file should contain QA pair data: each row represents one respondent, and each column represents a multiple-choice question, with the cells holding that respondent's answers. This data is used to create a population model that makes accurate predictions for similar groups.
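For illustration, a minimal seed CSV could look like the following. The respondent IDs, questions, and answers here are hypothetical, and a real file needs 5-20 question columns and at least 25 rows (see the requirements below):

```csv
sim_id,What is your favourite colour?,How important is work-life balance to you?
r_001,Blue,Very important
r_002,Green,Somewhat important
r_003,Red,Very important
```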
CSV format requirements
Your CSV file must meet these technical specifications:
| Requirement | Value | Notes |
|---|---|---|
| First column | `sim_id` | Unique identifier for each respondent |
| Question columns | 5-20 | Each column header is the question text |
| Respondent rows | 25-3,000 | More respondents = better accuracy |
| Question length | 3-300 characters | Clear, concise questions work best |
| Answer length | Max 200 characters | Short, distinct answer options |
| Missing data | Max 5% empty cells | Minimal gaps in responses |
| Answer options | 1-15 per question | Multiple choice options |
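If you want to sanity-check a file before uploading it, a rough pre-upload check against these limits might look like the sketch below. The file name is a placeholder, and the 5% missing-data limit is interpreted here as applying across all question cells, which is an assumption:

```python
# Rough pre-upload check against the technical requirements above.
# Assumptions: "seed.csv" is a placeholder file name, and the 5% limit on
# missing data is measured across all question cells.
import pandas as pd

df = pd.read_csv("seed.csv", dtype=str)

assert df.columns[0] == "sim_id", "First column must be sim_id"
assert df["sim_id"].is_unique, "sim_id values must be unique"

questions = df.columns[1:]
assert 5 <= len(questions) <= 20, "Need 5-20 question columns"
assert 25 <= len(df) <= 3000, "Need 25-3,000 respondent rows"

for q in questions:
    assert 3 <= len(q) <= 300, f"Question length out of range: {q!r}"
    answers = df[q].dropna()
    assert answers.str.len().max() <= 200, f"Answer over 200 characters in {q!r}"
    assert 1 <= answers.nunique() <= 15, f"Need 1-15 answer options in {q!r}"

missing_fraction = df[questions].isna().to_numpy().mean()
assert missing_fraction <= 0.05, "More than 5% of cells are empty"

print("CSV passes the basic checks")
```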
Data quality guidelines
While the API accepts any CSV meeting technical requirements, following these guidelines ensures better population accuracy:
Headers should be clear questions
Column headers become the question text in API responses. Use complete questions: “What is your favourite colour?” rather than just “Colour”.
Each question should be self-contained
Questions can’t reference other questions. Avoid “How strongly do you feel about your previous answer?” Instead, make each question standalone: “How important is work-life balance to you?”
Answer options should be independent
Avoid answer options like “None of the above” or “All of the above” since their meaning depends on other options. Use specific alternatives: “None” or “Other”.
Use human-readable text
Convert survey codes to full text. Instead of “Q1_A3”, use “Very satisfied”. This ensures accurate API predictions since the AI understands natural language.
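For example, a column exported as internal survey codes could be recoded into readable text before upload. The file name, column name, codes, and labels below are all made up for illustration:

```python
# Hypothetical recoding of survey codes into human-readable question and
# answer text before creating a population.
import pandas as pd

df = pd.read_csv("raw_export.csv", dtype=str)

# Made-up mapping from export codes to full answer text.
satisfaction_labels = {
    "Q1_A1": "Very dissatisfied",
    "Q1_A2": "Neutral",
    "Q1_A3": "Very satisfied",
}

df["How satisfied are you with your current role?"] = df["Q1"].map(satisfaction_labels)
df = df.drop(columns=["Q1"])
```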
All questions and answers should come from the same survey, poll, or questionnaire
Mixing data from different samples is not recommended. Semilattice assumes seed data come from a single survey, poll, or questionnaire.
Answers in a row should all come from the same person
Mixing answers from different people is not recommended. Semilattice assumes all the answers in a row come from the same person.
Good to know
These considerations are useful to bear in mind.
The unique values in a column define the question's answer options
The set of unique values in a column defines the answer options for the question in that column. For example, the unique values under the question header “What is your favourite colour?” may be: “Red”, “Green”, “Blue”, “Yellow”, “Purple”, “Orange”, “Teal”, “Pink”, “Brown”, “Indigo”.
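A quick way to see which answer options each column will define is to list the unique values per question; the file name here is a placeholder:

```python
# Print the answer options implied by each question column.
import pandas as pd

df = pd.read_csv("seed.csv", dtype=str)

for question in df.columns[1:]:  # skip sim_id
    options = sorted(df[question].dropna().unique())
    print(f"{question}: {options}")
```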
Other types of questions can be reformatted to be compatible, with caution
Questions originally posed as single-option multiple-choice questions are recommended, but other question types can be reformatted into a compatible form if necessary. For example, multiple-selection questions (e.g. “Choose all that apply…”) can be reformatted into one binary question per selection option (e.g. “Do you do X?”, “Do you do Y?”, …), and answers to free-text questions can be grouped and recoded into a single-option multiple-choice question. Caution is advised though: the more you reformat, the more your seed data will diverge from reality.
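As a sketch of the multiple-selection case, the file name, column name, options, and separator below are assumptions about how such an export might look:

```python
# Reformat a "Choose all that apply" column into one binary question per option.
import pandas as pd

df = pd.read_csv("raw_export.csv", dtype=str)

multi = "Which of these activities do you do? (Choose all that apply)"
options = ["Running", "Cycling", "Swimming"]  # hypothetical selection options

for option in options:
    # Assumes selections are stored as a "; "-separated string in one cell.
    df[f"Do you do {option.lower()}?"] = df[multi].fillna("").apply(
        lambda cell: "Yes" if option in cell.split("; ") else "No"
    )

df = df.drop(columns=[multi])
```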
Question order doesn't matter
The order of the question columns in your seed data has no effect. Questions will be used in various ways, in various orders.
Answer option order is always random
Answer options will be presented in random order by default. Answer options with semantic order (e.g. “Strongly Agree” to “Strongly Disagree”) are perfectly valid, but answer options which assume an order (e.g. “None of the above”) are not. Answer options like “None of the above” should be reformatted to stand alone, e.g. “None”.
Bigger datasets can be filtered, with caution
You can filter large datasets down to specific segments, subsets of questions, or the right number of respondents for Semilattice. However, just like with a real-world survey, heavy filtering of a dataset can reduce the quality of the resultant sample and can lead to lower accuracy Populations in Semilattice.
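A filtering step might look like the following; the file names, column name, segment value, and sample size are hypothetical:

```python
# Filter a larger dataset to one segment and sample it down to a seed file.
import pandas as pd

df = pd.read_csv("full_survey.csv", dtype=str)

# Keep one segment of respondents...
segment = df[df["What is your current role?"] == "Software developer"]

# ...and sample down to a manageable number of rows (the cap is 3,000).
seed = segment.sample(n=min(len(segment), 1000), random_state=42)
seed.to_csv("developer_seed.csv", index=False)
```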
How population creation works
Your CSV data creates the population model, but predictions also leverage Semilattice’s underlying AI. The seed data acts as a “filter” that selects relevant knowledge from the AI model.
This means your questions should have predictive power - they should reveal information that correlates with how your target group would answer new questions.
Choosing seed questions
The questions in your seed data determine prediction accuracy. Better seed questions lead to more accurate API responses.
What makes a good seed question?
Think about how informative each answer would be:
- High predictive power: Age, profession, education level, core values
- Low predictive power: Favourite colour, random preferences unrelated to your use case
- Context-dependent: Food preferences matter for restaurant surveys but not for software feedback
Understanding predictive power
A population’s accuracy when predicting the answer to a new question depends on the predictive power of the questions and answers in its seed data. This predictive power varies depending on the question being predicted and is a function of how those seed data work with the simulation engine.
Predictive power cannot be predicted
Due to the black box nature of the LLMs used by Semilattice, it’s not yet possible to tell with certainty which seed data have more predictive power.
Building intuition around predictive power
An intuitive way of estimating the predictive power of a question is to think about how informative someone’s answer to that question would be. For example, knowing someone’s age tells you a lot more about them than knowing whether they prefer ketchup or mayonnaise. However, it’s also important to consider the purpose of the population you are building. If you work in R&D at Heinz, the latter question will have more predictive power for the questions you plan to predict.
Predictive power is an area of research for Semilattice. If you want to go deeper, it maps broadly to the concept of feature selection in machine learning & data mining.
Example: Developer population
Seed questions such as those in the sketch below would help predict developer attitudes toward new frameworks, tools, or methodologies.
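As a hypothetical sketch, seed data for a developer population might include columns like these:

```csv
sim_id,What is your current role?,How many years have you been programming?,Which language do you use most at work?,How do you prefer to learn new tools?
dev_001,Backend engineer,10+,Go,Official documentation
dev_002,Frontend engineer,3-5,TypeScript,Video tutorials
dev_003,Data scientist,6-10,Python,Blog posts
```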
Best practices for API-ready populations
- Target your API use case: Choose seed questions that relate to the types of predictions you’ll make via API. Building a customer feedback population? Include questions about satisfaction, preferences, and user profiles.
- Avoid redundant questions: Don’t include multiple questions that reveal the same information. “Preferred condiment” and “Favourite fries topping” are too similar.
- Maximize data quality: Aim for 20 questions and 1,000+ responses. More seed data typically means better API prediction accuracy.
- Test with real questions: After creating your population, test it with questions similar to your API use case to validate accuracy.