Autonomous product loops
There are now a number of high-profile startups like Resolve and Firetiger that aim to create some type of autonomous observability loop. The basic idea is that an agent continuously monitors production data, identifies errors or concerning changes in metrics, and then acts like an on-call engineer, submitting PRs for code changes to fix such issues. Assuming it works well, you end up with a “self-healing” product.
I think an identical pattern can be applied to product loops - the cycle of collecting usage data on your product, identifying gaps & areas of improvement, and then having a coding agent submit changes to hill climb against those gaps. What’s interesting is that while there are probably 10+ companies doing observability loops, almost no one is doing this for product.
Analogous to how observability agents monitor logs, metrics, and traces, the signals to consider here would be:
Instrumentation - Event data on product usage, including things like conversion funnels
Session replay - Recorded user sessions
Experimental data - Results from A/B tests or other forms of experiments and staged rollouts
What’s interesting is that AI can, I think, heavily influence the way we have traditionally thought about setting up and monitoring all of these signals.
The classic issue with instrumentation is that it is a pain to set up and manage. The PM is often the one who wants to instrument things, but the engineer has to write the instrumentation logic. Features get shipped all the time without someone remembering to instrument them. And instrumentation logic becomes sprawling - with tons of duplicate event types being created for the same logical concept.
What one could instead imagine is a coding agent that sits on every PR, automatically identifies the right way to instrument it given the existing codebase, the way existing features are instrumented, and the goal/purpose of the feature, and automatically adds the instrumentation logic. Basically - CodeRabbit but for product instrumentation.
The classic issue with session replay is analysis - if you store tens of thousands of session replays, how do you draw holistic conclusions? How do you separate signal from noise without someone manually watching certain replays?
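One small piece of this, catching duplicate event types before they land, can be sketched without any AI at all. The snippet below is a minimal illustration, not a description of how any existing tool works: the function names and the sample event names are hypothetical, and it only catches duplicates that differ in casing or separators. An agent would presumably go further and reason about semantic duplicates as well.

```python
import re
from collections import defaultdict


def normalize_event_name(name: str) -> str:
    """Reduce an event name to a canonical snake_case key (hypothetical helper)."""
    # Split camelCase boundaries: "checkoutStarted" -> "checkout Started"
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    # Treat any run of non-alphanumeric characters as a separator
    tokens = re.split(r"[^A-Za-z0-9]+", spaced)
    return "_".join(t.lower() for t in tokens if t)


def find_duplicate_events(names):
    """Group event names that collapse to the same canonical key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_event_name(name)].append(name)
    # Keep only keys that more than one raw name maps to
    return {key: raw for key, raw in groups.items() if len(raw) > 1}


# Hypothetical event names, as they might be scattered across a codebase
existing = ["checkoutStarted", "Checkout Started", "checkout-started", "signup_completed"]
duplicates = find_duplicate_events(existing)
```

A PR-time check could run something like this against the events a diff introduces and suggest reusing the existing name instead of creating a new one.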
VLMs can play a massive role in changing this - semantically analyzing large sets of replays and turning them into structured, qualitative insights like “Users who visit the checkout page tend to get confused by the Apple Pay feature”. This means you are no longer sitting on a repository of replays which your PMs can watch 1/1000th of, but rather a set of constantly refreshed insights, as if you had a UXR team watching every user use your product 24/7.
The classic issue with experimentation is experimental setup & analysis - it is difficult to go through the end-to-end process of setting up the right feature flags with the right randomization and then analyzing the data properly, and the work is often bottlenecked by the 1-2 data scientists on your team who are familiar with experimentation. But coding agents are very capable of adding feature flag logic to your features, and then running data analysis on the results. Similar to instrumentation, you can imagine this happening pseudo-automatically, suggested by an agent during PRs and then analyzed asynchronously as the data comes in. You can also imagine the agent running simulated “experiments” synthetically to handle low-sample-size environments.
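To make "the right randomization" and "analyzing the data properly" a bit more concrete, here is a minimal sketch of the two pieces an agent would need to generate: deterministic variant assignment and a basic significance check. It assumes a simple two-variant test on a conversion metric and uses a pooled two-proportion z-test, which is one common choice rather than the only valid one. The function names, experiment name, and counts are all made up for illustration.

```python
import hashlib
from math import sqrt
from statistics import NormalDist


def assign_variant(user_id: str, experiment: str, variants=("control", "treatment")) -> str:
    """Deterministically bucket a user: the same inputs always give the same variant."""
    # Hashing experiment + user keeps assignments independent across experiments
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both arms converted at exactly 0% or 100%: no detectable difference
        return 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical results: 100/1000 conversions in control, 150/1000 in treatment
variant = assign_variant("user-123", "merged-onboarding")
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

In practice most of the difficulty sits around this code (choosing the metric, avoiding peeking, handling multiple comparisons), which is exactly the judgment that tends to be bottlenecked on a data scientist.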
If you tie these things together, you can imagine a form factor roughly like the following:
An agent constantly monitors all product data in the background, deriving clusters of insights from product telemetry, experiments, and session replays. Users can query these insights via MCP/API in tools like Cursor as they develop new features.
As obvious areas for improvement emerge based on the product data, the agent submits PRs to the team with proposed fixes and their associated rationale. For example - the agent might submit a PR suggesting that two pages in the onboarding flow be merged because the recent addition of a new onboarding page drastically lowered activation rate.
An agent runs on every PR, analyzing whether the PR merits changes in how product data is captured and whether the functional change is aligned with what the product telemetry suggests. This could look like suggesting that an instrumentation event be added, or saying that the premise of a new feature is inconsistent with the data from a bunch of recent session replays, with links to those specific session replays.
One could build this directly on top of whatever tools the customer is already using for product measurement - Statsig, PostHog, Amplitude, etc - though I suspect there will end up being an opportunity to rethink the storage layers of some of these systems to better account for agent-based analysis. But the easiest entry point is an agent with API access to whatever tools you use, which can basically automate these “reasoning” layers on top of the data.
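As a rough illustration of the background monitoring piece, the sketch below compares step-to-step funnel conversion before and after a change and flags any step whose rate dropped by more than a threshold. Everything in it is an assumption made for the example: the step names, the counts, the 10-point threshold, and the idea that funnel counts arrive as simple lists. A real system would pull these from whatever analytics tool is in use and would need to account for noise and seasonality.

```python
def step_conversion(counts):
    """Conversion rate from each funnel step to the next."""
    return [nxt / cur if cur else 0.0 for cur, nxt in zip(counts, counts[1:])]


def flag_regressions(steps, before, after, threshold=0.10):
    """Return (transition, rate_before, rate_after) for steps that dropped by >= threshold."""
    flagged = []
    rates = zip(step_conversion(before), step_conversion(after))
    for i, (rate_before, rate_after) in enumerate(rates):
        if rate_before - rate_after >= threshold:
            flagged.append((f"{steps[i]} -> {steps[i + 1]}", rate_before, rate_after))
    return flagged


# Hypothetical onboarding funnel, user counts per step before and after a release
steps = ["signup", "profile", "invite", "activated"]
before = [1000, 800, 600, 480]
after = [1000, 800, 400, 320]

regressions = flag_regressions(steps, before, after)
```

The flagged transition is the sort of evidence the agent could attach as the rationale for a proposed PR, alongside links to the relevant session replays.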
If done correctly, I think this makes the entire product analytics & experimentation category essentially headless - there will almost never be a need to visit a UI to see conversion funnels or similar, because the system should simply surface the relevant data as it is needed during feature development and/or play the role of the person who analyses the data to determine changes that should be made.
I have almost no doubt this will work to at least some extent, because it already works in observability. And while I think observability is slightly easier to get right because production errors are black and white things that must be fixed, whereas improving your product tends to be more subjective, at their core they are the same concept, just with different signals that need to be considered.
PostHog Code is the closest thing I have seen to this so far, though I don’t love the IDE-like form factor. I think the better insertion points for something like this are in the PR, and in existing coding agent tools via MCP/CLI. And then I think there is a lot of value in starting tool-agnostic.

