While the world’s attention is focused on the advantages and risks of generative AIs and LLMs displacing human expertise, one company is growing by focusing on the verifiably human aspects of research.
Nine-year-old behavioral insights start-up Prolific has been named a winner at Deloitte’s Technology Fast 50 Awards for high-growth tech companies. It puts high-level researchers – over 30,000 to date – in touch with a diverse mix of over 120,000 vetted research participants in 38 countries, via what it describes as a proprietary, managed data pool.
The model has caught the attention of clients such as Yale, Oxford, UCL, Stanford, King's College London, Google, Meta, Cancer Research UK, Kickstarter, the European Commission, and the World Bank – an impressive list.
However, one of the aims is to enable organizations to source high-quality, verified human data to better train the next generation of AI systems – Nugget AI is another client, for example.
In this way, Prolific’s approach to data quality and diversity is a counterpoint to the business models of vendors that simply scrape the Web for a mass of data, regardless of its provenance, accuracy, or cultural and linguistic richness.
Recently, for example, a longitudinal study conducted via Prolific’s platform helped Cambridge Professor Sander van der Linden research a book, Foolproof, on how to inoculate users against online misinformation.
Phelim Bradley is co-founder and CEO of Prolific. He tells me:
That's the core problem we solve now: fast access to high-quality people. And it was the motivation for founding Prolific. The use case that we started with was academic behavioural research and social psychology. And it stemmed from three core problems.
One, the tools that researchers were using at the time to collect this type of data, namely places like Amazon Mechanical Turk, gave pretty poor data quality. Second, they were very difficult for researchers to use, and limited their degrees of freedom in terms of how they collected data from people. And third, the way that participants were treated was not consistent and they weren't always fairly rewarded.
Yet in a world in which AI is in the ascendant, has that founding purpose changed over the past decade? Not a bit of it, he says. In many ways, the rise of AI has given that purpose new meaning and context:
In growing the company, we realised that the core problem of getting fast access to reliable, trustworthy people is not only a problem in research, but also in AI training. For example, in reinforcement learning from human feedback, and in stress-testing models. Also, in understanding the impacts of AI.
So, our ambition and scope have changed, but the core problem we are solving has stayed exactly the same since inception.
VP of Product Sara Saab adds:
We've built up this pedigree of being a trustworthy participant in the human discovery space. And this has allowed us to attract, through network effects and word of mouth, people who want to take part in meaningful, interesting, well-intentioned research.
We protect them and make sure that they're treated and paid well – we have a wellness programme, for example.
And diversity is key, she explains. But Prolific’s duty of care also ensures that, while verified, participants are never named or otherwise identified:
In the screeners section of the system we have a knowledge tree of all the various participants, and their numbers, against various criteria we hold. But they are completely anonymized. We don't allow personal data to leak into that.
However, if 2023 has taught us anything, it is the bewildering speed with which organizations have adopted generative AIs and LLMs – albeit often via individuals’ use of shadow IT. Moreover, many are trusting AI to provide authoritative information. Has Prolific suffered from this generational shift towards machine intelligence – or perceived machine intelligence?
CEO Bradley says:
It affects our business in a lot of different ways. First and foremost, I'd say that because these tools are nascent and still developing, there's still a big need for human input to them. So, we've experienced a lot of growth via people looking to source the human inputs necessary to make their tools better.
Second, there's this whole problem in AI of model collapse. If you continue to train a model on data that's generated only by AI, it tends towards the model getting worse and worse. So, you always need novel human data on the margins. And that's what we are able to provide.
We have a strong affinity for being on the side of people in this economy – making sure that humans are represented and have a meaningful part to play in training AI. Especially when humans are the data source for science, or things that have policy implications.
For example, I'm not sure I would trust a policy paper that was generated on the back of research that was itself generated on the back of synthetic AI-generated populations.
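The model-collapse dynamic Bradley describes can be seen in a toy simulation – a hypothetical sketch, not Prolific's code or any production training loop. Here, a trivial "model" (a fitted Gaussian) is repeatedly retrained on its own generated samples; with no fresh human data added, the diversity of the data it produces steadily drains away:

```python
import random
import statistics

# Toy illustration of model collapse: a model retrained only on its own
# outputs loses diversity across generations. Hypothetical demo code.
random.seed(0)

def fit(samples):
    # "Train" a trivial model: estimate the mean and stdev of the data.
    return statistics.mean(samples), statistics.stdev(samples)

def generate(mean, stdev, n):
    # "Generate" synthetic data by sampling from the fitted model.
    return [random.gauss(mean, stdev) for _ in range(n)]

# Generation 0: "human" data with genuine variety.
data = [random.gauss(0.0, 1.0) for _ in range(10)]
stdevs = []
for generation in range(200):
    mean, stdev = fit(data)
    stdevs.append(stdev)
    data = generate(mean, stdev, 10)  # retrain only on model output

print(f"diversity (stdev) at generation 0:   {stdevs[0]:.4f}")
print(f"diversity (stdev) at generation 199: {stdevs[-1]:.6f}")
```

Because each generation's spread is estimated from a small sample of the previous generation's output, estimation noise compounds and the spread drifts towards zero – the statistical analogue of Bradley's point that "novel human data on the margins" is what keeps models from degrading.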
But how is Prolific using AI itself in this business model? It describes itself as a technology, rather than human-network, company. Saab explains:
We have our own suite of AI detection and machine learning tools. They are not a hard gatekeeper, but we do deploy them based on risk-scoring, and in certain scenarios to ensure real humans are answering Prolific studies.
What happens to the data collected via the Prolific platform? Does the company have access to it? Might it even use it to create its own trusted data sets? Far from it, says Bradley.
Ultimately, we provide the connection between researchers and participants. We allow researchers to collect anonymized data, but without them needing to gain access to personal information about the participants. And on the flipside, we don't have access to the research data ourselves. So, we're not a data aggregator or data controller.
Consent is critical, he adds:
Without getting political, consent is everything. Participants are opting in, understanding what their data is going to be used for. That is a core principle.
One prediction I will make is that sourcing and auditing where data comes from, and who was involved in the development of datasets, are both going to become increasingly important.
If you look at academic research, political polling, or market research, the audience that goes into those studies is a really important part of publication, and of whether or not the results are trustworthy and replicable. I believe that’s going to be an increasingly important part of the next phase of AI. Who was involved in these data sets? And how were they treated?
We are watching the lawsuits [US copyright class actions against AI companies] with interest. Vendors argue that data-scraping is fair use, but I think the trend will be towards data collected with consent.
That idea of trawling the internet and plugging it into a model… those doors are slowly closing. A lot of data sources are getting wise to it. So, it will be increasingly difficult to access data that used to be freely available on the Web, because people are going to be wary of copyright.
An intriguing perspective: that one unforeseen consequence of AI companies exploiting grey areas in copyright might be data being pulled, wholesale, from the Web, in order to protect and monetize it. As Saab puts it:
We call it provenance. You should be able to audit-trail everything that's gone into making an AI model, and prove it's coming from an ethical place. We strongly believe that's going to be important.
People’s data should not be stolen, scraped, or laundered. That’s not OK.