The demand for high-quality data from AI labs and AI startups right now is unprecedented. Leading AI labs will now pay $2k+ for sophisticated reasoning traces in advanced domains (e.g. legal, healthcare), and companies like Surge and Mercor have grown revenue at remarkable speed serving this market.
For the most part, the data startups I have seen attempting to serve this market have focused on finding efficient ways to procure labor and then assign annotation tasks to it. For example, Scale will go to OpenAI and say “we have a huge pool of neurologists who can do annotation tasks for you”. This has, of course, worked well for many of these companies given the demand for these sorts of annotation tasks, but it also means that many of them look more like labor-arbitrage services businesses than true software businesses.
This plays out in both margin profile and repeatability. Many of these labeling companies end up having to constantly adapt their labor pools to match the annotation task du jour. For example, Scale invested huge resources into AV (autonomous vehicle) labeling in the early days, and then had to redo many aspects of labor sourcing, labor evaluation, and internal tooling to adapt to RLHF annotation tasks for foundation model companies.
I suspect there is an opportunity for more creative business models in this space, particularly those that play with gamification or clever incentive structures to create high-quality data as an implicit output. These would look more like scalable, software-driven approaches that passively generate annotation data, often as a byproduct of something users already want to do.
reCAPTCHA is a good historical illustration of what this might look like. It is the anti-bot measure you have likely seen when logging into online accounts, where it asks you to do things like “click the squares with traffic lights”.
reCAPTCHA is offered as a free service because there is a secondary benefit to all these users clicking squares in an image: it produces free image segmentation and annotation data for Google.
Other examples of this concept that I find interesting include Foldit, which turned protein folding annotation into something that essentially felt like a game and, as a result, crowdsourced one of the largest protein folding datasets, and Kaggle, which created a competition dynamic around modeling tasks.
I strongly suspect that this idea of offering a free service as a way to indirectly generate high volumes of annotation data could be applied to some of the very high-value foundation model annotation tasks that exist today, such as reasoning trace generation or RLHF data.
Chatbot Arena is one such example. If you’re not familiar, Chatbot Arena essentially turns comparative evaluation of language models into a simple game, letting users compare two models’ responses to a prompt and judge which model was better or worse. This not only creates an interesting leaderboard that is much more aligned with user preferences than many standard eval benchmarks, but also essentially creates a high-quality RLHF dataset.
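To make the "one game, two outputs" point concrete, here is a minimal sketch of how a log of head-to-head votes could yield both a leaderboard and preference pairs for RLHF-style training. Everything in it is an assumption for illustration: the `Vote` record, the field names, and the plain online Elo update are hypothetical and are not a description of how Chatbot Arena actually stores votes or computes its rankings.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    """One head-to-head comparison (hypothetical schema)."""
    prompt: str
    model_a: str
    response_a: str
    model_b: str
    response_b: str
    winner: str  # "a" or "b"


def to_preference_pair(vote: Vote) -> dict:
    """Turn a vote into a (prompt, chosen, rejected) training record."""
    if vote.winner not in ("a", "b"):
        raise ValueError(f"unexpected winner: {vote.winner!r}")
    if vote.winner == "a":
        chosen, rejected = vote.response_a, vote.response_b
    else:
        chosen, rejected = vote.response_b, vote.response_a
    return {"prompt": vote.prompt, "chosen": chosen, "rejected": rejected}


def elo_ratings(votes, k: float = 32.0, base: float = 1000.0) -> dict:
    """Fold the same votes into per-model ratings with an online Elo update."""
    ratings = defaultdict(lambda: base)
    for v in votes:
        if v.winner not in ("a", "b"):
            raise ValueError(f"unexpected winner: {v.winner!r}")
        ra, rb = ratings[v.model_a], ratings[v.model_b]
        # Probability that model_a wins under the standard Elo logistic model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        score_a = 1.0 if v.winner == "a" else 0.0
        delta = k * (score_a - expected_a)
        ratings[v.model_a] = ra + delta
        ratings[v.model_b] = rb - delta  # zero-sum: the loser gives up what the winner gains
    return dict(ratings)
```

Calling `elo_ratings(votes)` on the vote log gives the leaderboard, while `[to_preference_pair(v) for v in votes]` gives the dataset. The point of the sketch is that the user only ever plays the comparison game; both artifacts fall out of the same log.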
I think there is a lot of room for creative ideas that take this concept much further. For example:
Could you offer free tools that integrate into the systems of engagement that certain key knowledge workers use (e.g. lawyers, doctors), in return for capturing the usage data from that worker? There are huge amounts of latent thinking and reasoning data implicitly generated in systems like EHRs and CRMs.
Could you create games or gamified experiences that implicitly generate unique reasoning traces?
Could you create marketplaces with unique incentive or compensation structures that help match certain types of long-tail, specialized labor with annotation tasks? (analogous to HackerOne in security)
Could you build free utilities in high-value domains where there is very little public data? For example, a free podcasting utility to create multi-channel audio data, or a free 3D editing utility or Blender plugin for 3D data.
Done properly, such a business could have very compelling dynamics. It could scale more like a software business, and it could have strong network effects, because most of these ideas implicitly have both a demand side and a supply side and/or look like marketplaces to some extent.
I think this type of business-model insight is under-appreciated in the data annotation space, and I’d love to connect with teams thinking creatively along these lines.
I feel like there could be some blockchain parallels here as well (not that I am a big web3 person): for example, some kind of system with a proof-of-intelligence (i.e. a reasoning trace, as opposed to the traditional hash-based proof-of-work) that creates a financial incentive to solve some set of hard, verifiable problems as fast as possible.