In 2015, our Chief Scientist Chris Potts was part of the team that created the Stanford Natural Language Inference (SNLI) dataset. At the time, it was one of the largest human annotation efforts ever undertaken: over 570,000 human-written sentences. SNLI won a Best Dataset Award and was instrumental in reorienting the field toward natural language understanding.
It was later discovered that SNLI has a number of, shall we say, quirks. In short, many content words – “Dogs”, “camera”, “sitting” – turn out to be “cheaters”: they convey a lot of information about the correct label even though, in principle, they should not.
How did this happen? The root cause is that annotation is very hard work, especially when the task involves something as creative as writing sentences from scratch. An annotator might experience a moment of writer’s block, glance out the window, see a dog, and go on to write the next 10 sentences about dogs. In aggregate, this leads to a pretty unusual set of examples.
All human-created datasets will have such biases in one form or another. You might try to sidestep this issue by curating naturalistic datasets. Excellent! However, these too will embed biases stemming from whatever natural process created them: the specific user population, the time period of collection, the state of the system, and so forth. These datasets might not have artifacts like SNLI’s, but they will contain gaps that could leave you totally unprepared for what comes next.
Our core claim is that synthetic data – data generated by a GenAI model – is a compelling counterpart to both human annotation and naturalistic data collection. Synthetic data is inexpensive to generate. It is also straightforward to ensure that it covers core cases and edge cases, friendly situations and adversarial ones. You can simulate diverse users and contexts, including far-fetched ones for stress-testing. If you are worried about artifacts stemming from a specific model, use lots of models. If you have an existing dataset, you can ask models to imitate it, with variations that fill its gaps. The resulting synthetic datasets can guide system development, and, with some human oversight, they can also be part of robust evaluations.
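To make this concrete, here is a minimal sketch of one way this can look in code, assuming the official `openai` Python client; the model names, personas, scenarios, and support-chat framing are all illustrative placeholders rather than a recipe.

```python
# A minimal sketch of multi-model synthetic data generation.
# Assumes the official `openai` Python client (>= 1.0) and an
# OPENAI_API_KEY in the environment; everything else is illustrative.
import itertools
import json

from openai import OpenAI

client = OpenAI()

# Vary the generator model so no single model's artifacts dominate.
MODELS = ["gpt-4o", "gpt-4o-mini"]  # placeholder choices

# Simulate diverse users and contexts, including adversarial ones.
PERSONAS = [
    "a first-time user",
    "a power user in a hurry",
    "a hostile user probing for failures",
]
SCENARIOS = [
    "a routine request",
    "an ambiguous edge case",
    "a request the system should refuse",
]

def generate_example(model: str, persona: str, scenario: str) -> dict:
    """Ask one model for one synthetic example, returned as a record."""
    prompt = (
        f"You are {persona}. Write a single realistic support-chat message "
        f"representing {scenario}. Respond with just the message text."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,  # higher temperature for more varied outputs
    )
    return {
        "model": model,
        "persona": persona,
        "scenario": scenario,
        "text": resp.choices[0].message.content,
    }

# Take the cross-product of models, personas, and scenarios for coverage.
dataset = [
    generate_example(m, p, s)
    for m, p, s in itertools.product(MODELS, PERSONAS, SCENARIOS)
]
print(json.dumps(dataset[:2], indent=2))
```

Because the provider call is isolated in one function, swapping in other models or providers is a small change, which is part of what makes “use lots of models” cheap advice to follow.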
In academic research, there is a lingering sense that using synthetic data is bad. In our view, it’s very hard to defend this position in light of what we have learned about human biases, and in light of what is possible with GenAI models today.
With Bigpsin, we make it easy to flexibly create synthetic data, using lots of different models. As a result, developers get to see what happens in an incredibly wide range of scenarios, and the systems they create are, in turn, much better prepared to deal with the complexity of the wider world.