Generative AI & the Shift from Creation to Validation
Why simulators, compilers, experimentation, and similar validation techniques are increasingly critical in a world of generative AI
One obvious impact of foundation models proliferating is that the cost of generating new ideas is converging to zero. If you want an image for a blog post, you can enter a prompt and generate hundreds of options instantly. If you want to refactor your code, an AI copilot can come up with countless plans for you to evaluate. If you want a title for your blog post, an LLM can suggest thousands.
In a world where idea generation is cheap and no longer the bottleneck in many creative and design-oriented endeavors, the new bottleneck quickly becomes validation - out of so many options, which is correct or optimal? Correspondingly, many job functions which previously spent a lot of time on idea generation - imagine a script writer, a copywriter, or even a tattoo artist - will now spend most of their time on idea evaluation, selection, and refinement.
This dynamic greatly increases the importance of tools which assist us with understanding, evaluating, and selecting among an abundance of ideas, and may in many cases shift the way that the core systems of engagement we use are designed. This blog explores a few examples of the ways I see this already happening, and what it might mean for the future.
Experimentation & A/B Testing
Experimentation techniques have been widely adopted in the technology industry for quite some time, but I think their importance goes up dramatically in a world where all job functions in a company (software engineering, product, design, etc.) make greater use of AI.
Consider landing page optimization. Traditionally, a growth team might spend time brainstorming a few different permutations of copy, page layout, and images, and test those permutations against click-through or conversion rate. Today, you could likely have an AI system continuously identify new ideas worth testing across all of these dimensions and feed them into a system that constantly A/B tests the resulting permutations, with the results flowing back into the model for further improvement. This is essentially a self-reinforcing system: the AI generates ideas, and the experimentation layer provides a validation signal to the AI, which then influences the subsequent ideas generated. Coframe is one cool startup exploring ideas like this.
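To make this concrete, here is a minimal sketch of such a loop in Python, assuming a placeholder `generate_variants` function standing in for the LLM call and a `simulate_click` function standing in for real user traffic - the allocation logic is a simple Thompson-sampling bandit, not Coframe's actual system:

```python
import random

def generate_variants(n, feedback=None):
    """Stand-in for an LLM call that proposes landing-page copy variants.
    `feedback` would carry the best/worst performers from the last round."""
    return [f"variant-{random.randint(0, 10**6)}" for _ in range(n)]

def simulate_click(variant):
    """Stand-in for a real user session; returns 1 on a click, 0 otherwise."""
    return 1 if random.random() < 0.1 else 0

# Thompson-sampling bandit: each variant keeps Beta(successes, failures) counts
stats = {v: [1, 1] for v in generate_variants(5)}

for _ in range(10_000):
    # Sample a plausible CTR for each variant and show the best draw
    chosen = max(stats, key=lambda v: random.betavariate(*stats[v]))
    if simulate_click(chosen):
        stats[chosen][0] += 1
    else:
        stats[chosen][1] += 1

# Periodically you would retire weak variants and ask the generator for
# replacements, passing the winners back as context for the next round.
best = max(stats, key=lambda v: stats[v][0] / sum(stats[v]))
print("current leader:", best)
```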
You can take this idea much further though, for example:
Email campaigns could be dynamically adjusted via a combination of statistical response-rate signals and LLMs modulating the email content
If an AI can design entire websites for you, perhaps you should just A/B test top website variations that you like, instead of only being able to fiddle with small components like calls-to-action or images?
I suspect almost all programmatic advertising campaigns will become self-optimizing systems like this
Previously, A/B testing would not have been worth it in many of these markets because there were not enough good ideas to test and/or not enough human labor to generate sufficient variants. Because this has changed, I think many, many more companies will adopt platforms like Eppo, and experimentation will be natively baked into far more vertical products.
Predictive Modeling
“Classical” deep learning models which predict the outcome of an action or event will also become more valuable in a world of AI.
Anyword is a good example of this. Anyword layers on top of LLM-based copy generation tools and provides a highly accurate model that predicts the engagement rate of performance marketing copy, taking into consideration the audience the company is targeting and its brand guidelines, policies, and tone of voice. This model was trained on a massive corpus of historical, well-labeled performance marketing data.
Many large companies use Anyword in conjunction with tools like ChatGPT or Copy.ai because marketing teams quickly run into a new problem in a world of AI - it is now easy to generate a lot of content ideas, but no one knows what will perform or convert.
This form of labeled/supervised “reward modeling” will likely need to become much more widespread, allowing knowledge workers who utilize AI to develop more informed opinions about what content suggested by AI is actually good.
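As a rough illustration of what this kind of supervised reward model looks like, here is a toy Python sketch - the tiny labeled dataset and the scikit-learn pipeline are stand-ins for the large historical corpus and proprietary model a product like Anyword would actually use:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for a labeled corpus of past campaigns (1 = converted well)
historical_copy = [
    "Start your free trial today",
    "Learn more about our product",
    "Save 20% when you sign up now",
    "Read our latest whitepaper",
]
converted = [1, 0, 1, 0]

reward_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
reward_model.fit(historical_copy, converted)

# Rank LLM-generated candidates by predicted conversion probability
candidates = [
    "Unlock 20% off - start your free trial",
    "Our whitepaper explains everything",
    "Sign up now and save instantly",
]
scores = reward_model.predict_proba(candidates)[:, 1]
for copy, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {copy}")
```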
This is especially important given the fact that in many cases, AI is democratizing who can do certain tasks well beyond those who have developed “intuition” or “taste” for them. For example - most professional photographers know about the “Rule of Thirds”, but I suspect the vast majority of people generating photographic content with diffusion models do not. In Anyword’s case, most professional copywriters know various heuristics for producing high-converting copy, but the hundreds of other people who can now produce marketing copy with ChatGPT don’t.
While some of these concepts can likely be baked into large pre-trained models via well curated training sets (e.g. Midjourney has clearly baked in many aspects of “good” cinematic imagery into its product), in many cases I think you will want secondary models which help evaluate outputs and teach a user what is good or bad about different generated ideas.
Simulation
Large, pre-trained AI models are starting to gain widespread adoption in many chemical, biological, and physical markets where we either do not understand the physical world well enough to deterministically simulate behavior, or where the computational complexity of simulation is so enormous that it is unfeasible to simulate at scale.
Good examples of this include protein folding, molecular property prediction, and differential equations/physics applications such as thermodynamics.
While many people think about simulators in such environments as a way to generate training data for foundation models, it is also interesting to think about the converse - using foundation models to “sample” the right regions to explore, and then classical simulation techniques to more rigorously dive into them or evaluate them.
For example - consider an AI Copilot in a CAD application that might assist you in creating design variants. A natural next step might be to rapidly simulate these design variants using simulation tools such as Simulink, Collimator, Solidworks, Ansys, or similar, depending on the nature of what you are building or designing.
While this workflow of “design then simulate” exists today, I suspect that the increasing capabilities of generative models will structurally change these markets. Simulation will become a more critical part of the workflow and will capture more value. Correspondingly, it will become more important for simulation to be natively integrated into AI-assisted design & editing tools to allow for this form of “simulation-in-the-loop” design optimization.
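A hedged sketch of what this kind of simulation-in-the-loop optimization might look like, with `propose_designs` standing in for a generative CAD copilot and `simulate` standing in for a real solver such as Ansys or Simulink:

```python
import random

def propose_designs(n, best_so_far=None):
    """Stand-in for a generative copilot proposing design parameters.
    In practice this would condition on the best designs from prior rounds."""
    base = best_so_far or {"thickness_mm": 5.0, "rib_count": 4}
    return [
        {k: v * random.uniform(0.8, 1.2) for k, v in base.items()}
        for _ in range(n)
    ]

def simulate(design):
    """Stand-in for a physics simulation (e.g. an FEA run) returning a
    score to maximize - say, stiffness per unit mass."""
    return design["rib_count"] / (design["thickness_mm"] ** 0.5 + random.random())

best = None
for round_ in range(10):
    candidates = propose_designs(20, best)
    scored = sorted(candidates, key=simulate, reverse=True)
    best = scored[0]  # feed the winner back into the next proposal round

print("best design found:", best)
```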
Navier is a good example of a next generation simulation startup thinking in this way. MindsEye is a good paper demonstrating how a physics simulation engine can complement a large pre-trained model.
Similarly, many startups working on large pre-trained models in molecular property prediction utilize density functional theory as a simulation step to further validate new material compositions that their models suggest. Density functional theory is too computationally intensive to run over the entire search space of possible new molecules one might want to explore - a single simulation of a single molecule might take days - but it is fantastic as a feedback loop for a model which comes up with probable things to try.
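In other words, this is a two-stage screen: a cheap learned predictor scores the whole candidate pool, and the expensive simulation only ever sees the shortlist. The sketch below is purely illustrative - `cheap_model_score` and `run_dft` are stand-ins for a trained property model and a real DFT package:

```python
import random

candidates = [f"molecule-{i}" for i in range(100_000)]

def cheap_model_score(molecule):
    """Stand-in for a fast learned property predictor run over every candidate."""
    return random.random()

def run_dft(molecule):
    """Stand-in for a density functional theory calculation - far too slow
    to run on the full pool, so it only sees the shortlist."""
    return random.random()

# Stage 1: screen the whole space with the cheap model
shortlist = sorted(candidates, key=cheap_model_score, reverse=True)[:50]

# Stage 2: validate the shortlist with the expensive simulation; the
# (molecule, simulated property) pairs can also become new training data
validated = {m: run_dft(m) for m in shortlist}
print("best validated candidate:", max(validated, key=validated.get))
```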
One of my favorite examples of this AI + simulator architecture is Quilter, which uses generative adversarial networks to automate the layout of printed circuit boards. Because you can then use physics simulations to evaluate the properties of any given printed circuit board, they are able to create a closed optimization loop between the GAN and the physics simulator, allowing the system to rapidly converge on high quality design layouts despite the computational complexity of the problem.
Optimization & Operations Research
In some domains, it will be possible to use operations research & classical mathematical solver/optimization techniques to complement generative models.
This blog by Doordash is a good example of what this looked like with classical predictive models, though I think the idea cleanly extends into use cases involving large pre-trained generative models.
The problem Doordash was trying to solve is dispatch optimization - essentially, deciding how to get each order from the store to the customer as efficiently as possible. They solve this by combining a suite of predictive models which predict a number of attributes about the dispatch - e.g. how long will it take for the store to produce the food, how long will it take for the driver to go from the restaurant to the consumer - with a global solver that then optimizes the dispatch & routing given all the known and predicted constraints.
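The shape of this "predict then optimize" pattern is easy to sketch - the numbers below are made up, and the assignment formulation is a big simplification of Doordash's real dispatch problem, but it shows how predicted quantities become the cost matrix for a classical solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Predicted food-prep time for each order (minutes) - output of one model
prep_time = np.array([12.0, 7.0, 20.0])

# Predicted travel time from each driver to each order's restaurant (minutes)
travel_time = np.array([
    [5.0, 14.0, 9.0],   # driver 0
    [11.0, 4.0, 16.0],  # driver 1
    [8.0, 10.0, 6.0],   # driver 2
])

# Cost of assigning driver i to order j: how long the food sits waiting
# for the driver after it is ready (a driver arriving early costs nothing
# in this toy objective)
cost = np.maximum(travel_time - prep_time, 0)

# Global solver: minimize total food-waiting time across all assignments
drivers, orders = linear_sum_assignment(cost)
for d, o in zip(drivers, orders):
    print(f"driver {d} -> order {o} (food waits ~{cost[d, o]:.0f} min)")
```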
Compilation & Execution
Code generation via LLMs is a particularly interesting example of the power of validation techniques. Because code can be statically analyzed, compiled, and executed, there is a range of direct feedback signals you can get about AI-generated code.
For example, Espresso, which uses AI to optimize SQL queries, uses formal verification techniques to guarantee that its AI generated code is functionally equivalent to the baseline code - the AI generates a range of possible options, and mathematical analysis selects only generated outputs which are logically equivalent to the baseline. Clover out of Stanford is an interesting exploration of similar ideas. Many companies in the CoPilot space make extensive use of static analysis & compiler techniques to improve completion accuracy and reduce hallucinations.
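As a toy illustration of the equivalence-checking idea (not how Espresso or Clover actually work), an SMT solver such as z3 can prove that an AI-suggested rewrite of an expression matches the baseline on every possible input:

```python
from z3 import Ints, Solver, Not, unsat

x, y = Ints("x y")

baseline  = x * (y + 1)   # original expression
candidate = x * y + x     # AI-suggested rewrite

s = Solver()
s.add(Not(baseline == candidate))  # look for any input where they differ

if s.check() == unsat:
    print("rewrite is equivalent to the baseline - safe to accept")
else:
    print("counterexample found:", s.model())
```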
This dynamic stems from two fundamental traits of code which are true in many other areas as well - code is structured and it is executable. Any kind of structured output which is generated via AI can be analyzed via secondary mechanisms - e.g. a generated movie script could be analyzed relative to patterns you would expect to see in a movie script. Any kind of executable output can then have its execution observed or analyzed, which again provides a secondary feedback mechanism.
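Here is a minimal sketch of using "structured and executable" as a filter: parse each candidate, run it, and keep only those that pass a test suite. The hard-coded candidates stand in for model output:

```python
import ast

candidates = [
    "def median(xs): return sorted(xs)[len(xs) // 2]",  # wrong for even-length input
    (
        "def median(xs):\n"
        "    s = sorted(xs)\n"
        "    n = len(s)\n"
        "    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2"
    ),
]

tests = [([1, 3, 2], 2), ([1, 2, 3, 4], 2.5)]

def passes(source):
    try:
        ast.parse(source)            # structural check: is it valid code?
        namespace = {}
        exec(source, namespace)      # executable check: does it run?
        fn = namespace["median"]
        return all(fn(xs) == expected for xs, expected in tests)
    except Exception:
        return False

accepted = [c for c in candidates if passes(c)]
print(f"{len(accepted)} of {len(candidates)} candidates passed validation")
```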
Any domain where the core thing being created by people has these properties is likely a great fit for AI copilots or assistants. Some good examples of this that come to mind are spreadsheets, financial models, and proof assistants such as Isabelle.
Broader Considerations
This is meant to be an illustrative, not exhaustive, list of ways in which verification techniques can be combined with generative AI, and the power that can stem from systems built in this way. Simply put - this combination can allow for self-reinforcing systems which teach and improve themselves at a very different scale than those which have existed in the past.
From a startup perspective, I think this trend has a few interesting implications:
Some of the best markets for applied AI startups will be those with natural validation or evaluation mechanisms (e.g. code)
The bottleneck of many design processes is going to shift from idea generation to idea validation & refinement. This will likely structurally alter where value accrues in many markets, and may be a vector of disruption for startups to displace incumbents
It will be more important than ever for validation systems to be natively integrated into design workflows, rather than these being disjoint or separate systems (as you, for example, see in many areas of CAD today)
AI startups benefit greatly from having expertise and team DNA in simulation/validation techniques. E.g. all AI code startups should have compiler experts, all companies training molecular property prediction models should have experts in density functional theory, etc.