Synthetic Product Feedback
What if you had a synthetic UXR study, A/B test, and user survey on every pull request?
In recent years, many startups have focused on using large language models (LLMs) to automate the functional testing of code - checking whether it works as intended and is bug-free. Companies like Greptile and Graphite tackle code review; others like Ranger focus on QA automation.
However, there is a large class of software testing that does not fall under this functional testing paradigm, but rather is focused on the qualitative effectiveness of the software. Is the new feature understandable by users? Does it improve conversion? Is it easy to use?
These sorts of questions are historically answered by an entirely different set of processes, such as:
User Experience Testing - Have people try to use the product, often with different structured goals in mind, and get their reaction
A/B Testing - Roll out the new version of the product to a % of users, and compare those users against a baseline using key product metrics
Product Instrumentation - Measure how people are using the new product/feature with descriptive analytics such as conversion rates or user funnels to see if your new code drove the right change in behavior
User Interviews - Show users mocks or a demo of your new feature, and gauge their reaction. Is this compelling? Is this how you would want it to work?
Surveys - Collect aggregate data from different public pools of people in your ICP on something new you have built or are launching
Design Review - A designer goes through the workflow and critiques it from a usability and user experience perspective
What is interesting is that LLMs and agents are clearly capable of automating, or producing synthetic versions of, almost all of these forms of qualitative feedback, yet almost no startups have focused on this.
What if LLMs made it possible to collect usable, directional qualitative feedback on every pull request, with no marginal cost? I will give a few examples of what this might look like:
Synthetic A/B Testing
Imagine if, on every pull request, you spun up hundreds or thousands of preview environments - half with the previous version of the code (control), half with the new version of the code (variant). In each preview environment, run a computer use agent that is tasked with performing the key product goal associated with the given code change. Prompt these computer use agents so that, as a group, they mirror the distribution of user personas in your user base.
If you did this, you would in theory have run a synthetic A/B test, and you would get a synthetic measure of whether or not your code change improved the user flow.
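The loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a real harness: `run_agent_session` is a hypothetical stand-in that simulates an agent run with made-up completion probabilities (a real version would launch a computer use agent in a preview environment), the persona names and weights are invented, and the comparison uses a standard two-proportion z-test.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class ArmResult:
    sessions: int
    completions: int

    @property
    def rate(self) -> float:
        return self.completions / self.sessions


def run_agent_session(persona: str, arm: str, rng: random.Random) -> bool:
    """Hypothetical stand-in for one agent run in a preview environment.

    The probabilities below are made up purely so the sketch runs.
    """
    base = {"new_user": 0.30, "power_user": 0.60}[persona]
    lift = 0.05 if arm == "variant" else 0.0
    return rng.random() < base + lift


def synthetic_ab_test(persona_weights, sessions_per_arm, seed=0):
    """Run both arms, sampling a persona per session from the given weights."""
    rng = random.Random(seed)
    personas = list(persona_weights)
    weights = [persona_weights[p] for p in personas]
    results = {}
    for arm in ("control", "variant"):
        completions = 0
        for _ in range(sessions_per_arm):
            persona = rng.choices(personas, weights=weights, k=1)[0]
            completions += run_agent_session(persona, arm, rng)
        results[arm] = ArmResult(sessions_per_arm, completions)
    return results


def two_proportion_z(control: ArmResult, variant: ArmResult):
    """Return (z, two-sided p-value) for the difference in completion rates."""
    total = control.sessions + variant.sessions
    pooled = (control.completions + variant.completions) / total
    se = math.sqrt(pooled * (1 - pooled) * (1 / control.sessions + 1 / variant.sessions))
    if se == 0:
        return 0.0, 1.0
    z = (variant.rate - control.rate) / se
    return z, math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Assumed persona mix; in practice this would come from product analytics.
    arms = synthetic_ab_test({"new_user": 0.7, "power_user": 0.3}, sessions_per_arm=1000)
    z, p = two_proportion_z(arms["control"], arms["variant"])
    print(f"control={arms['control'].rate:.3f} variant={arms['variant'].rate:.3f} z={z:.2f} p={p:.4f}")
```

The statistics are the easy part. The open question is whether the simulated sessions behave like real users, which is why the persona weights are an explicit input rather than something buried in the agent prompt.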
This is not just conceptual - I have now seen multiple people experiment with this idea with some success. Agent A/B is a recent research paper that explores this direction and shows that this type of synthetic experiment can in fact predict real world user behavior on eCommerce sites. A friend of mine, Greg Dale, ran some simulations of this sort of idea and also found that synthetic versions of experiments seem predictive in cases where we have ground truth real world experiments. I have included one of his results below.

What is particularly interesting about synthetic A/B testing is that it solves what I would consider the biggest challenge in the A/B testing market - sample size. Historically, only the largest marketplace and prosumer/consumer companies have been able to regularly A/B test ideas because without sufficient sample size, it was impossible to draw statistically significant conclusions. But, such a synthetic approach would be applicable to everyone.
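To make the sample size problem concrete, here is a small sketch of the textbook sample size approximation for comparing two conversion rates. The function name and the example numbers (a 5% baseline and a 10% relative lift) are illustrative assumptions, not figures from any real product.

```python
import math
from statistics import NormalDist


def required_sessions_per_arm(baseline, relative_lift, alpha=0.05, power=0.8):
    """Approximate sessions needed per arm for a two-sided two-proportion test."""
    if relative_lift == 0:
        raise ValueError("relative_lift must be non-zero")
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Illustrative inputs: 5% baseline conversion, 10% relative lift.
    print(required_sessions_per_arm(0.05, 0.10))
```

With these illustrative inputs the formula asks for roughly 31,000 sessions per arm, which is far more traffic than most small products see on a single flow in a reasonable time window.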
While there are numerous gotchas here - particularly how to evaluate and benchmark synthetic experiments, as well as how to properly mirror the user distribution especially in more “niche” product workflows - the technique is certainly intriguing.
Synthetic User Research and UXR Testing
Imagine if, on every major pull request, two things happened:
An agent analyzed Gong, Zendesk, and similar data sources to identify users who had given the most feedback related to that code change, and then proceeded to email them saying something like “Hey, we are working on improving X thing which we noticed you had complained about. We’d love to preview the change to you and see if it addressed your feedback”
A fleet of browser agents prompted to act like world-class UX designers and PMs were spun up and tasked to use the feature and provide qualitative feedback. This feedback would then be summarized, and you could also inspect each agent’s stream of thought as it used the feature, much as you can for existing browser agents trying to complete a task.

Existing browser agent products, like Browserbase, already provide a monologue as they try to achieve a task. It is not a long shot to consider them being prompted to specifically identify or highlight UX or product issues as they do so.
This is the sort of work that might take a PM or UXR days to do traditionally - and whose cost is so high that it is typically reserved only for the biggest feature releases. What if it existed by default for every single code change?
The Bigger Picture
These two examples highlight what I would consider to be the two broad classes of techniques worth considering here:
LLMs can be used to personify or impersonate humans, and that can be a source of synthetic human feedback
LLMs can be used to automate the execution of collecting qualitative human feedback - such as reaching out to people and conducting interviews or user studies
In conjunction, these two techniques would allow you to automate a version of every single qualitative testing modality I described at the start of this blog.
Obviously, the quality of “synthetic” feedback will likely be worse than if it were real human feedback, or a human were actually collecting it. But this misses what is interesting about these LLM-based approaches.
Doing any of these qualitative feedback methods well takes a lot of time & effort, and in many cases is still not feasible. 90% of companies don’t have a large enough user base to effectively run A/B tests consistently. Most companies barely even have a single full time user researcher on staff to run UXR studies. Even companies with the right resources only have the capacity to do these sorts of things on the biggest of changes - a given enterprise PM only has so much time to “preview” new features to their customers, and can likely only focus on the biggest customers.
But, if the cost & effort to run these sorts of analyses fall to zero such that they are “default available” on every pull request, then even if the predictiveness is only 50-75% of what you would get doing things the traditional way, that tradeoff is likely worth it. You can, of course, still do the traditional qualitative feedback workflows in key moments or for big decisions where you want to be certain.
I strongly suspect a startup could be built that essentially offers “synthetic feedback by default” on every pull request. There is now quite a bit of prior art suggesting that these methods work better than you might expect - e.g. market research startups like Evidenza, which use LLMs to mock brand feedback, see correlation coefficients of 90%+ when backtesting against historical surveys that brands have run. When you combine this with how capable computer use models are becoming, I see almost no reason why it shouldn’t be possible to collect extremely high fidelity synthetic product usage feedback.
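For readers unfamiliar with what backtesting against historical results means in practice, a minimal sketch follows. It computes a Pearson correlation between past real-world outcomes and the synthetic predictions for the same experiments. The numbers are made up for illustration and are not Evidenza's or anyone else's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical, made-up lifts from five past experiments (real vs. synthetic).
historical_lift = [0.04, -0.01, 0.10, 0.00, 0.06]
synthetic_lift = [0.03, 0.00, 0.08, -0.01, 0.05]

if __name__ == "__main__":
    print(f"backtest correlation: {pearson_r(historical_lift, synthetic_lift):.2f}")
```

A high correlation on past experiments is only suggestive. It says the synthetic method tracked reality on the cases you happened to have ground truth for, not that it will do so on a new kind of change.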
Such a startup should focus on selling primarily to companies who have never had access to these sorts of techniques. E.g. sell synthetic A/B testing to the companies who have never been able to run an A/B test in their life. You’ve suddenly given them a “superpower”, and their comparison point is doing nothing, rather than doing a real A/B test.
Key challenges in this space, and therefore areas to build technical depth, include:
Data integration - How do you get the right data to effectively model the company’s userbase and user distribution? This will be the determining factor in how good vs. bad the synthetic techniques are
Trust building & evaluation - How do you know when to trust synthetic product feedback vs. not? When does it hallucinate or get things wrong? If the user has no way to back-test the results, how do they learn to trust this?
When to do the traditional stuff - I think a key trait of a successful product here would be informing you of when a synthetic approach is unlikely to be predictive, and helping you do things the traditional way when it makes sense to
Modalities and Sequencing - Which analysis modalities should one focus on vs. not? Which are easier to get right today vs. in the future? For each modality, what is the critical problem to solve to make it work well?
Cost - Some of the techniques I describe would be very, very expensive to do today. While model cost is going down exponentially, you will need to be clever about how to handle this in the short term - particularly because I think it is essential that this can just be applied by default to every code change or PR, rather than be something someone has to think about
Taste - How do you get models to accurately reflect good vs. bad UX design? This is easier said than done. A good analogy is the vibe coding space, where so few products actually produce tasteful outputs from a design and usability perspective (e.g. see MagicPatterns), and most produce AI slop.
I am extremely interested in talking to anyone who wants to build something in this space - davis @ innovationendeavors.com


Lots of thoughts about this. But I'll share one insight that I think gets at the core of why this is hard.
The hard part of UXR work is not generating evidence, it's convincing people to act on it. I've seen many teams ship experiences over the strenuous objections of their UXR partners. A heuristic evaluation, like you discuss here as taking "days to do," is (in my experience) routinely dismissed by teams as "not sufficient evidence." It's an "N of 1" study. Similarly with survey results. If the data confirms the team's prior, great. If it does not? They tend to undermine it and de-scope it.
What IS easier to do is to convince teams not to ship something with a negative A/B test result.
Where would evidence of this sort land?
My expectation is it lands firmly on the UXR-side of the spectrum. If it confirms people's priors, great. They cite it as a reason to ship. If it questions their work, they work past it. "Oh it's just a hallucination." If teams are happy to overrule a human, they will even more happily reject a post-commit bot that has thoughts about the UX.
On top of that, we have a problem of insight coherence. Part of what research (and data) teams offer is a single, evidence-backed POV. It is well known that you don't want separate data teams because you get a phenomenon where leaders are forced to reconcile disparate views on the question from different data teams that lob methods critiques at each other. An LLM-driven insights world is even more pluralistic and even harder to drill down. Why does my agent say this UI is great and your agent says it's bad? Is it just noise? Did the minor change in the instructions for the task to complete flip it from shippable to "needs revision?" We run headlong into LLM explainability for moment-to-moment product decision-making.
So, I am not bullish about this sort of automation. I AM bullish about tools that enable humans to collect data faster. But the matter of turning data into insights is one that I think should remain the domain of humans because they have organizational power and can make sure the "truth" (such as it is) gets heard and acted on.
Hey David, building in this space since B.C. (Before ChatGPT)
check out https://www.syntheticusers.com/