Population model creation is a paid feature on the Play and Launch pricing plans. Visit the account page in the dashboard to upgrade.

This page outlines the technical requirements and best practices for preparing CSV data to create custom populations. Your data quality directly impacts prediction accuracy, so following these guidelines is essential.

Overview

Your CSV file should contain QA pair data where each row represents one respondent and each column represents their answer to a multiple-choice question. Semilattice uses this data to build a population model that predicts how similar groups would answer new questions.

CSV format requirements

Your CSV file must meet these technical specifications:

| Requirement | Value | Notes |
| --- | --- | --- |
| First column | sim_id | Unique identifier for each respondent |
| Question columns | 5-20 | Each column header is the question text |
| Respondent rows | 25-3,000 | More respondents = better accuracy |
| Question length | 3-300 characters | Clear, concise questions work best |
| Answer length | Max 200 characters | Short, distinct answer options |
| Missing data | Max 5% empty cells | Minimal gaps in responses |
| Answer options | 1-15 per question | Multiple choice options |

For example, the first rows of a valid file:
sim_id,"What's your primary role?","How many years of experience?","Preferred development environment?"
1,"Software Engineer","3-5 years","VS Code"
2,"Product Manager","1-2 years","Notion"
3,"Data Scientist","5+ years","Jupyter"

Data quality guidelines

While the API accepts any CSV meeting the technical requirements, following these guidelines improves population accuracy:


How population creation works

Your CSV data creates the population model, but predictions also leverage Semilattice’s underlying AI. The seed data acts as a “filter” that selects relevant knowledge from the AI model.

This means your seed questions should have predictive power: they should reveal information that correlates with how your target group would answer new questions.

Choosing seed questions

The questions in your seed data determine prediction accuracy. Better seed questions lead to more accurate API responses.

What makes a good seed question?

Think about how informative each answer would be:

  • High predictive power: Age, profession, education level, core values
  • Low predictive power: Favourite colour, random preferences unrelated to your use case
  • Context-dependent: Food preferences matter for restaurant surveys but not for software feedback

Understanding predictive power

A population’s accuracy when predicting the answer to a new question depends on the predictive power of the questions and answers in its seed data. That power varies with the question being predicted and with how the seed data interact with the simulation engine.

Predictive power cannot be predicted

Due to the black box nature of the LLMs used by Semilattice, it’s not yet possible to tell which seed data have more predictive power with certainty.

Building intuition around predictive power

An intuitive way of estimating the predictive power of a question is to think about how informative someone’s answer to that question would be. For example, knowing someone’s age tells you a lot more about them than knowing whether they prefer ketchup or mayonnaise. However, it’s also important to consider the purpose of the population you are building. If you work in R&D at Heinz, the latter question will have more predictive power for the questions you plan to predict.

Predictive power is an area of research for Semilattice. If you want to go deeper, it maps broadly to the concept of feature selection in machine learning & data mining.
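As a rough illustration of the feature-selection analogy, mutual information measures how much knowing one answer reduces uncertainty about another. The toy example below is a sketch with made-up data; Semilattice's engine does not work this way, but it formalises the "how informative is an answer?" intuition.

```python
# Toy mutual-information estimate between a seed question and a target
# question, computed from co-occurrence counts (illustrative only).
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# "profession" perfectly determines the target here; "condiment" is noise.
target     = ["VS Code", "Notion", "VS Code", "Notion"]
profession = ["Engineer", "PM", "Engineer", "PM"]
condiment  = ["Ketchup", "Ketchup", "Mayo", "Mayo"]

print(mutual_information(profession, target))  # 1.0 bit: highly predictive
print(mutual_information(condiment, target))   # 0.0 bits: uninformative
```

In feature-selection terms, a good seed question is one with high mutual information against the questions you plan to predict.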

Example: Developer population

sim_id,"Years of experience?","Primary language?","Team size?","Remote work preference?"
1,"5+ years","Python","10+ people","Fully remote"
2,"1-2 years","JavaScript","2-5 people","Hybrid"

These questions would help predict developer attitudes toward new frameworks, tools, or methodologies.

Best practices for API-ready populations

  • Target your API use case: Choose seed questions that relate to the types of predictions you’ll make via API. Building a customer feedback population? Include questions about satisfaction, preferences, and user profiles.
  • Avoid redundant questions: Don’t include multiple questions that reveal the same information. “Preferred condiment” and “Favourite fries topping” are too similar.
  • Maximize data quality: Aim for 20 questions and 1,000+ responses. More seed data typically means better API prediction accuracy.
  • Test with real questions: After creating your population, test it with questions similar to your API use case to validate accuracy.
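The redundancy point above can also be checked mechanically. The sketch below (standard library only, invented data) flags a pair of seed columns as redundant when one question's answer almost always determines the other's:

```python
# Sketch: flag seed-question pairs whose answers are near-perfectly aligned,
# which suggests the two questions carry redundant information.
from collections import Counter, defaultdict

def redundancy(col_a, col_b):
    """Fraction of rows explained by the most common b-answer per a-answer."""
    by_a = defaultdict(Counter)
    for a, b in zip(col_a, col_b):
        by_a[a][b] += 1
    explained = sum(c.most_common(1)[0][1] for c in by_a.values())
    return explained / len(col_a)

condiment = ["Ketchup", "Ketchup", "Mayo", "Mayo", "Ketchup"]
topping   = ["Ketchup", "Ketchup", "Mayo", "Mayo", "Ketchup"]
team_size = ["2-5", "10+", "2-5", "10+", "2-5"]

print(redundancy(condiment, topping))    # 1.0: fully redundant pair
print(redundancy(condiment, team_size))  # 0.6: questions differ
```

Scores near 1.0 suggest dropping one of the two questions and spending the column on something more informative.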