Seed Data - Semilattice

Semilattice needs a small sample of data from the target population to make predictions. This page explains the data requirements and the core concepts that will help you create effective Populations.

Requirements

Populations require individual-level survey, poll, or questionnaire response data from the group of people the Population aims to model. After the first column, sim_id, each column should contain the group’s responses to one single-option multiple-choice question, meaning a question where each respondent picks exactly one answer.

Technical requirements

Semilattice accepts .csv files which meet the following requirements:

the first column must be titled sim_id
the sim_id column should contain a unique number in each row
there should be between 5 and 20 question columns
questions must be between 3 and 300 characters long
there should be between 25 and 3,000 respondent rows
answers must be no more than 200 characters long
answer cells can be empty, but no more than 5% of them across all columns
there should be between 1 and 15 unique answers per column
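
The technical requirements above can be checked locally before uploading a file. The following is a minimal sketch using only Python's standard library; it is not Semilattice's own validation. The function name validate_seed_csv is hypothetical, and two readings of the rules are assumptions: that a "unique number" means a non-negative integer, and that the 5% limit applies to all answer cells taken together.

```python
import csv
import io


def validate_seed_csv(text):
    """Return a list of problems found; an empty list means every check passed."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []

    if not header or header[0] != "sim_id":
        problems.append("first column must be titled sim_id")

    questions = header[1:]
    if not 5 <= len(questions) <= 20:
        problems.append(f"expected 5 to 20 question columns, found {len(questions)}")
    for question in questions:
        if not 3 <= len(question) <= 300:
            problems.append(f"question must be 3 to 300 characters: {question!r}")

    if not 25 <= len(body) <= 3000:
        problems.append(f"expected 25 to 3,000 respondent rows, found {len(body)}")

    # Assumption: a "unique number" is read here as a non-negative integer.
    ids = [row[0].strip() if row else "" for row in body]
    if len(set(ids)) != len(ids) or not all(i.isdigit() for i in ids):
        problems.append("sim_id values must be unique numbers")

    empty_cells = 0
    for col, question in enumerate(questions, start=1):
        answers = [row[col].strip() if col < len(row) else "" for row in body]
        empty_cells += answers.count("")
        filled = {a for a in answers if a}
        if any(len(a) > 200 for a in filled):
            problems.append(f"answer longer than 200 characters in {question!r}")
        if not 1 <= len(filled) <= 15:
            problems.append(f"expected 1 to 15 unique answers in {question!r}")

    # Assumption: the 5% limit is measured over all answer cells combined.
    total_cells = len(body) * len(questions)
    if total_cells and empty_cells / total_cells > 0.05:
        problems.append("more than 5% of answer cells are empty")

    return problems
```

Calling validate_seed_csv(open("seed.csv", encoding="utf-8").read()) on a file of your own (the file name is only an example) returns the list of problems, so an empty list suggests the file meets the limits as interpreted here.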

Qualitative requirements & guidelines

Semilattice doesn’t check whether your seed data follow these requirements and guidelines, but they are essential to creating effective Populations.

Column headers should be questions or propositions

Questions cannot assume knowledge of other questions

Answer options cannot refer to other answer options

Questions and answer options must be human-readable

All questions and answers should come from the same survey, poll, or questionnaire

Answers in a row should all come from the same person

Good to know

These considerations are useful to bear in mind.

The unique values in a column define the question's answer options

Other types of questions can be reformatted to be compatible, with caution

Question order doesn't matter

Answer option order is always random

Bigger datasets can be filtered, with caution
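
Because the unique values in a column define that question's answer options, it can help to list them before uploading. The sketch below is one way to do that with Python's standard library; the function name answer_options is hypothetical and this is not part of any Semilattice tooling.

```python
import csv
import io


def answer_options(csv_text):
    """Map each question column to the sorted unique non-empty answers it contains."""
    options = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for question, answer in row.items():
            # Skip the id column and any overflow cells from malformed rows.
            if question in (None, "sim_id") or not isinstance(answer, str):
                continue
            answer = answer.strip()
            if answer:
                options.setdefault(question, set()).add(answer)
    return {question: sorted(values) for question, values in options.items()}


sample = "sim_id,Do you own a car?\n1,Yes\n2,No\n3,yes\n4,\n"
print(answer_options(sample))
```

In this sample, "Yes" and "yes" are distinct values, so they would show up as two separate answer options. Listing the options this way makes inconsistent spelling or capitalisation easy to spot and clean up.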

Seed data are not the only data

A Population’s seed data do not represent all of the information which Semilattice uses to make predictions. Most of the information comes from the Large Language Model (LLM) used by the Simulation Engine, with the seed data effectively selecting the relevant information from the LLM.

This means that you should think about the predictive power of the questions and answers in the seed data rather than the actual information contained in those questions.

Predictive power

A Population’s accuracy when predicting the answer to a new question depends on the predictive power of the questions and answers in its seed data. This predictive power varies with the question being predicted and is a function of how those seed data work with the Simulation Engine.

Predictive power cannot be predicted

Due to the black-box nature of the LLMs used by Semilattice, it’s not yet possible to tell with certainty which seed data have more predictive power.

Building intuition around predictive power

An intuitive way of estimating the predictive power of a question is to think about how informative someone’s answer to that question would be. For example, knowing someone’s age tells you a lot more about them than knowing whether they prefer ketchup or mayonnaise. However, it’s also important to consider the purpose of the Population you are building. If you work in R&D at Heinz, the latter question will have more predictive power for the questions you plan to predict.

Predictive power is an area of research for Semilattice. If you want to go deeper, it maps broadly to the concept of feature selection in machine learning & data mining.

Question selection best practices

The subset of questions you include in your seed dataset has the biggest impact on the accuracy and predictive reach of your Population. These best practices will help you achieve the best results.

Maximise predictive power

Pick questions with the most predictive power for your research objective.

Minimise subjective overlap

Pick a diverse set of questions, minimising questions which reveal similar information. For example, the questions “Do you prefer ketchup, mayonnaise, or mustard?” and “What is your favourite condiment on french fries?” have a high degree of subjective overlap.
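
Subjective overlap is a judgement about what questions reveal, so no formula settles it. As a rough statistical proxy only, you could measure how strongly the answers to two questions are associated in your own data. The sketch below computes Cramér's V, a standard association measure for categorical data that ranges from 0 (no association) to 1 (one answer fully determines the other). Using it for this purpose is an assumption of this example, not a method described by Semilattice, and the function name cramers_v is hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equal-length lists of categorical answers."""
    # Ignore respondents who left either question empty.
    pairs = [(x, y) for x, y in zip(xs, ys) if x and y]
    n = len(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    k = min(len(x_counts), len(y_counts)) - 1
    if n == 0 or k < 1:
        return 0.0  # undefined when a column has a single answer; treated as 0 here

    observed = Counter(pairs)
    chi2 = 0.0
    for x, x_n in x_counts.items():
        for y, y_n in y_counts.items():
            expected = x_n * y_n / n
            chi2 += (observed[(x, y)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))
```

Pairs of question columns with a value close to 1 are candidates for a closer look, since one of them may add little information. The uncorrected statistic tends to run high on small samples, so treat it as a prompt for review rather than a rule.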

Maximise the number of questions and responses

While not strictly required, a dataset with 20 questions and 3,000 responses is recommended. Datasets of 1,000 responses and ~10 questions can also work well.
