Davis Treybig

Autonomous product loops

Davis Treybig — Tue, 16 Jun 2026 19:26:04 GMT

There are now a number of high-profile startups like Resolve and Firetiger that aim to create some type of autonomous observability loop. The basic idea is that an agent continuously monitors production data, identifies errors or concerning changes in metrics, and then acts like an on-call engineer, submitting PRs for code changes to fix such issues. Assuming it works well, you end up with a “self-healing” product.

I think an identical pattern can be applied to product loops - the cycle of collecting usage data on your product, identifying gaps & areas of improvement, and then having a coding agent submit changes to hill climb against those gaps. What’s interesting is that while there are probably 10+ companies doing observability loops, almost no one is doing this for product.

Analogous to how observability agents monitor logs, metrics, and traces, the signals to consider here would be:

Instrumentation - Event data on product usage, including things like conversion funnels
Session replay - Recorded user sessions
Experimental data - Results from A/B tests or other forms of experiments and staged rollouts

What’s interesting is that AI can, I think, heavily influence the way we have traditionally thought about setting up and monitoring all of these signals.

The classic issue with instrumentation is that it is a pain to set up and manage. The PM is often the one who wants to instrument things, but the engineer has to write the instrumentation logic. Features get shipped all the time without someone remembering to instrument them. And instrumentation logic becomes sprawling - with tons of duplicate event types being created for the same logical concept.

What one could instead imagine is a coding agent that sits on every PR, automatically identifies the right way to instrument it given the existing codebase, the way existing features are instrumented, and the goal/purpose of the feature, and automatically adds instrumentation logic. Basically - CodeRabbit but for product instrumentation.

The classic issue with session replay is analysis - if you store tens of thousands of session replays, how do you draw holistic conclusions? How do you separate signal from noise without relying on someone manually watching certain replays?

VLMs can play a massive role in changing this - semantically analyzing large sets of replays and turning them into structured, qualitative insights like “Users who visit the checkout page tend to get confused by the Apple Pay feature”. This means you are no longer sitting on a repository of replays which your PMs can watch 1/1000th of, but rather a set of constantly refreshed insights like you had a UXR team watching every user use your product 24/7.

The classic issue with experimentation is experimental setup & analysis - it is difficult to go through the end-to-end process of setting up the right feature flags with the right randomization and then analyzing the data properly, often bottlenecked by 1-2 data scientists on your team familiar with experimentation. But, coding agents are very capable of adding feature flag logic to your features, and then running data analysis on the results. Similar to instrumentation, you can imagine this happening pseudo-automatically, suggested by an agent during PRs and then analyzed asynchronously as the data comes in. You can also imagine the agent running simulated “experiments” synthetically to handle low sample size environments.
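To make the “right feature flags with the right randomization” piece concrete, here is a minimal sketch of deterministic, hash-based variant assignment, the kind of logic an agent might add in a PR. The function name, the experiment key format, and the default 50/50 split are all assumptions for illustration, not the API of any particular experimentation tool.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically bucket a user into 'control' or 'treatment'."""
    # Hash the experiment name together with the user id so the same user
    # can land in different buckets across different experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 8 hex characters to a float in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    # The same user always gets the same variant for a given experiment.
    print(assign_variant("user-123", "merged-onboarding-pages"))
```

Because assignment depends only on the user ID and the experiment name, no assignment table needs to be stored, and the analysis side can recompute who saw what.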

If you tie these things together, you can imagine a form factor roughly like the following:

An agent constantly monitors all product data in the background, deriving clusters of insights from product telemetry, experiments, and session replays. Users can query these insights via MCP/API in tools like Cursor as they develop new features.
As obvious areas for improvement emerge based on the product data, the agent submits PRs to the team with proposed fixes and their associated rationale. For example - the agent might submit a PR suggesting that two pages in the onboarding flow are merged because the recent addition of a new onboarding page drastically lowered activation rate.
An agent runs on every PR, analyzing whether the PR merits changes in how product data is captured and whether the functional change is aligned with what the product telemetry suggests. This could look like suggesting an instrumentation event is added, or saying that the premise of a new feature is inconsistent with the data from a bunch of recent session replays, with links to those specific session replays.

One could build this directly on top of whatever tools the customer is already using for product measurement - StatSig, PostHog, Amplitude, etc - though I suspect there will end up being an opportunity to rethink the storage layers of some of these systems to better account for agent based analysis. But the easiest entrypoint is an agent with API access to whatever tools you use, which can basically automate these “reasoning” layers on top of the data.

If done correctly, I think this makes the entire product analytics & experimentation category essentially headless - there will almost never be a need to visit a UI to see conversion funnels or similar, because the system should simply surface the relevant data as it is needed during feature development and/or play the role of the person who analyses the data to determine changes that should be made.

I have almost no doubt this will work to at least some extent, because it already works in observability. And while I think observability is slightly easier to get right because production errors are black and white things that must be fixed, whereas improving your product tends to be more subjective, at their core they are the same concept, just with different signals that need to be considered.

PostHog Code is the closest thing I have seen to this so far, though I don’t love the IDE-like form factor. I think the better insertion points for something like this are in the PR, and in existing coding agent tools via MCP/CLI. And then I think there is a lot of value in starting tool-agnostic.

How Agents Use Systems Differently

Davis Treybig — Thu, 14 May 2026 15:09:42 GMT

Increasingly, coding agents are the ones provisioning and interacting with systems such as databases, distributed systems, computing runtimes, APIs, and similar types of cloud services. This is driving a need to redesign such systems to account for the fact that agents use them quite differently than humans, and have somewhat divergent requirements.

This blog post outlines the patterns I have seen thus far that seem to be most universally beneficial and important for serving agents. Note that I am not aiming to discuss “agent experience” here - which gets more at things like how to make it easy for agents to discover, understand, and use your service (e.g. account creation via CLI, markdown based docs, everything as an API) - but rather how systems should be designed differently to serve agent workload patterns.

Snapshotting & Time Travel

Agents make a lot of mistakes, but are good at back-tracking and correcting those mistakes if there is an easy way to do so. Systems that make it easy to regularly snapshot system state at a low overhead and then recover from that state are therefore particularly amenable to agent query patterns.

Good examples of this include Replit’s snapshot engine and how every sandbox vendor is pushing hard on supporting full memory + disk snapshots (e.g. Cloudflare, Daytona). See also Tigris’s recent post on snapshotting.
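As a toy illustration of the snapshot-then-recover pattern, here is a minimal in-memory sketch. The SnapshotStore class and its method names are hypothetical, and it uses a deep copy purely for simplicity; the systems mentioned above presumably rely on far cheaper mechanisms to keep snapshot overhead low.

```python
import copy


class SnapshotStore:
    """Toy in-memory state with named snapshots (hypothetical API)."""

    def __init__(self):
        self.state = {}
        self._snapshots = {}

    def snapshot(self, name: str) -> None:
        # Deep copy for simplicity; a real system would use copy-on-write
        # or incremental snapshots to keep the overhead low.
        self._snapshots[name] = copy.deepcopy(self.state)

    def restore(self, name: str) -> None:
        # Copy again so the stored snapshot can be restored more than once.
        self.state = copy.deepcopy(self._snapshots[name])


if __name__ == "__main__":
    store = SnapshotStore()
    store.state["schema"] = "v1"
    store.snapshot("before-migration")
    store.state["schema"] = "broken"      # the agent makes a mistake
    store.restore("before-migration")     # and backtracks cheaply
    print(store.state)
```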

Branching & Copy-on-Write

The ability to rapidly branch the state of a system at a low relative cost is extremely valuable for agents.

Agents benefit from high degrees of parallelism, exploring many different ideas or options simultaneously and then converging on the optimal ones. As a result, agents tend to want to branch the system state much more readily than humans typically do, and it is thus essential that techniques like copy-on-write are used to make this more efficient.

This Databricks blog post about how agents do software engineering mirrors the broader way agents use systems - large numbers of parallel experiments run at a high rate

Furthermore, in such scenarios, you often want each agent’s exploration to be sandboxed such that it can not interfere with other agents, nor mess up your production state. For example, if your agent is developing a new data pipeline that hydrates a production table, you ideally want the agent to be able to fully explore and test doing that without there being any risk of corrupting the production state.
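A minimal sketch of what copy-on-write branching with sandboxing looks like, using a key-value store as a stand-in for a real system. The Branch class is hypothetical and makes one simplifying assumption, flagged in the comments: the parent is not written to after a branch is taken.

```python
class Branch:
    """Copy-on-write view over a parent: reads fall through, writes stay local."""

    _DELETED = object()  # tombstone marking a key deleted in this branch

    def __init__(self, parent=None):
        self.parent = parent
        self.local = {}

    def get(self, key):
        # Walk up the parent chain until some branch has an entry for the key.
        node = self
        while node is not None:
            if key in node.local:
                value = node.local[key]
                if value is Branch._DELETED:
                    raise KeyError(key)
                return value
            node = node.parent
        raise KeyError(key)

    def set(self, key, value):
        # Writes never touch the parent, so production state stays intact.
        self.local[key] = value

    def delete(self, key):
        self.local[key] = Branch._DELETED

    def branch(self):
        # O(1): no data is copied at branch time.
        # Assumption: the parent is treated as frozen after branching.
        return Branch(parent=self)


if __name__ == "__main__":
    prod = Branch()
    prod.set("orders", ["row1", "row2"])
    experiment = prod.branch()
    experiment.set("orders", ["rewritten"])
    print(prod.get("orders"), experiment.get("orders"))
```

Note the tradeoff baked into this design: creating a branch is nearly free, but every read may have to walk the full chain of parents, so reads get slower as branches nest more deeply.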

Correspondingly, I think the importance of “git-like” semantics in systems is going up a lot - because if you are branching, you also want first class resolution operations like rebasing, merging, etc.

I will note that, in spite of the large number of databases advertising efficient branching, there is still a lot of performance optimization work to be done in this space. See for example the BranchBench paper:

We evaluate state of the art systems including Neon, DoltgreSQL, Tiger Data, Xata, and PostgreSQL baselines, and find a fundamental tension: systems optimized for fast branching suffer up to 5−4000× slower reads as branches deepen, while systems optimized for fast data operations incur 25−1500× higher branch creation and switching latency. Further, no current system supports the representative workloads at scale. These results highlight the need for branch native DBMSes designed specifically for agentic exploration.

Good examples of databases doing this and advertising it specifically as a useful tool for agents include Neon DB Branching, Bauplan Git-for-Data, Tigris bucket forking, Databricks lakebase branching, and Ardent Postgres cloning.

Another good example of this playing out recently is how git worktrees have become enormously popular with coding agents, when they were traditionally quite niche versus simply doing branches. Worktrees are essentially a more “complete” version of branching than a git branch since they encapsulate the environment and the code.

Yet even here, there is more work to be done - see for example the following thoughts from one of the leading sandbox vendors, Daytona:

Consider coding agents running in a sandbox. Today, you have two options - you can create real branches, but they leak into the remote, or you can skip branching and lose clean merge-back. What you really want is complete branch, rebase, and merge functionality within the sandbox that doesn’t leak to the remote until you want it to. The primitive still isn’t there - Vedran, CTO of Daytona

Rapid scale up & scale down

Agent workloads tend to be very spiky - with much higher variance jumps up and down than human workloads.

As an illustrative example, consider the role of a data analyst. A human investigating something might:

Run a query
Spend a few minutes analyzing the results
Spend a few minutes updating the query or writing a new query
Repeat

Each query is interleaved by a lot of human time to analyze, consider, and identify the next step - with the end-to-end workload being effectively spaced out over intervals.

An agent data analyst investigating the same thing will collapse all the above steps into a rapid series of back-to-back queries all run in a very short time span, then be done. The overall workload got condensed from maybe an hour to <1 minute.

This dramatically increases the need for infrastructure that is designed to rapidly scale up & rapidly scale down. Without this, you will either be unable to respond to agent demands, or you will be massively over provisioned and thus very cost ineffective.

Ivan Burazin (Daytona)

Most traditional systems infrastructure - especially more stateful systems - has generally assumed some degree of always-running components and some fixed floor on operational cost. Serverless, highly elastic infrastructure has, for the most part, been more of a niche.

Agents are completely upending this. Databricks has a fantastic blog post showing how on Lakebase, thanks to agents, the median database compute time is less than 10 seconds (visual below). If you do not have a database system designed for this degree of ephemerality - likely built around concepts like stateless compute workers, no specialized coordinators & leaders, and separation of storage & compute - it is almost impossible to serve the workload economically.

There is now research that suggests that the extreme burstiness of agent query patterns can create even more profound problems for systems. This paper by MongoDB highlights how what they call “agent spikes” - super rapid, high volume, retry heavy agent query patterns - are actually so different from human patterns that they can cause ongoing degradation or failure of a system because they get into a negative feedback loop with dynamic mechanisms like load shedding and queue management. In other words, human system design is so poorly adapted to agents that agents can simply break the system.

See also Xata’s recent OSS launch focused on the platform being designed for agents, highlighting “automatic scale-to-zero and wake-up on new connections” as a core reason why.

High-concurrency

Agents are capable of much more parallel query patterns than humans. Web search is a good example of this. If you ask Claude to do deep research on a topic, it will do 10+ web searches in parallel. This is profoundly different from how humans do web search, which looks more like a single threaded series of sequential queries.

Claude’s Deep Research will run a ton of parallel web searches - each shown in a section on the right - vastly more than a human would

This behavior is largely a function of the fact that agents have a much higher effective information processing bandwidth than humans. This means that systems designed for agents need to account for much higher degrees of concurrency and parallelism, particularly in the context of an individual client.

I suspect that this may substantially influence how a system caches. The more parallelism, the higher the need for, and opportunity for, intelligent re-use of work. A parallel swarm of agents trying to solve something for a given user is very likely to make semi-redundant queries that produce similar results.

For example if a bunch of agents are doing exploratory analysis on a Snowflake table, and agent 1 runs:

SELECT col1 FROM table WHERE date = January

While agent 2 runs:

SELECT col1 FROM table WHERE date >= January AND date < March

One could essentially rewrite query 2 as:

SELECT col1 FROM table WHERE date >= February AND date < March
UNION ALL
SELECT the previous query from cache

While this sort of semantic caching optimization could be useful for humans, its importance goes up dramatically in more concurrent query environments because it increases the likelihood of a cache hit. Thus, the concurrency present in agentic workloads may pressure systems towards more sophisticated cache optimizations - some interesting recent examples in the OLAP space being LiquidCache and Bauplan’s differential cache (reminiscent in some ways of Noria).
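The query rewrite above can be sketched in a few lines. This is a toy, assuming dates are reduced to integer month numbers and each request is a half-open range [start, end); the RangeCache class and the fetch callback are hypothetical and do not reflect how LiquidCache or Bauplan actually implement caching.

```python
from typing import Callable, Dict, List, Tuple

Range = Tuple[int, int]  # half-open [start, end) over month numbers


class RangeCache:
    """Answers range queries by combining cached sub-ranges with new fetches."""

    def __init__(self, fetch: Callable[[int, int], List[str]]):
        self.fetch = fetch                 # runs the real query for [start, end)
        self.cache: Dict[Range, List[str]] = {}
        self.fetched: List[Range] = []     # ranges actually sent to the backend

    def _fetch_and_store(self, start: int, end: int) -> List[str]:
        rows = self.fetch(start, end)
        self.cache[(start, end)] = rows
        self.fetched.append((start, end))
        return rows

    def query(self, start: int, end: int) -> List[str]:
        rows: List[str] = []
        cursor = start
        # Reuse cached ranges that fit inside the request, in ascending order.
        for c_start, c_end in sorted(self.cache):
            if c_start < cursor or c_end > end:
                continue  # overlaps what we already covered, or overshoots
            if c_start > cursor:
                rows += self._fetch_and_store(cursor, c_start)  # fill the gap
            rows += self.cache[(c_start, c_end)]
            cursor = c_end
        if cursor < end:
            rows += self._fetch_and_store(cursor, end)  # uncached remainder
        return rows


if __name__ == "__main__":
    table = {1: ["jan-row"], 2: ["feb-row"], 3: ["mar-row"]}
    cache = RangeCache(lambda s, e: [r for m in range(s, e) for r in table[m]])
    cache.query(1, 2)         # agent 1: January only
    print(cache.query(1, 3))  # agent 2: January reused, only February fetched
    print(cache.fetched)
```

Concatenating the lists plays the role of UNION ALL: rows from the cached range and the freshly fetched range are combined without deduplication, which matches what the original un-rewritten query would return.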

Isolation & Contention Management

In many cases, systems designed for agents will need to deal with a much higher rate of contention given the dramatically higher net volume of queries & requests being made by agents in parallel.

This increases the importance of techniques that help mitigate contention and encourage isolated edits/changes. Branching is one such approach, but there are others.

One example is designing a storage layout to account for many agents wanting to make many very small edits in parallel without having to place locks on very large segments of data - an advantage Mesa’s versioned filesystem has over solutions like S3 Files.

Another example of this is conflict-resolution techniques like CRDTs. For products where a large number of humans and agents are jointly collaborating on shared state, data structures like CRDTs that allow for high degrees of parallel edits or manipulation of data may become more popular.
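For readers who have not worked with CRDTs, here is a minimal sketch of one of the simplest, a grow-only counter (G-Counter). Each replica, whether a human session or an agent, only ever increments its own slot, so concurrent updates never conflict. The class and method names are my own choices for illustration.

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments only its own slot."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> count contributed by that replica

    def increment(self, amount: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


if __name__ == "__main__":
    agent_a, agent_b = GCounter("agent-a"), GCounter("agent-b")
    agent_a.increment(2)   # both replicas update in parallel, no locks
    agent_b.increment(3)
    agent_a.merge(agent_b)
    agent_b.merge(agent_a)
    print(agent_a.value(), agent_b.value())
```

No locks or coordination are needed at write time; consistency comes entirely from the merge function, which is what makes this family of data structures attractive when many writers edit in parallel.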

High volume, small data

In general, agents gravitate towards doing high volumes of “small” data work. Coding agents are creating 10000x the number of repos that existed before, but each repo tends to be much smaller on average. Data analysis agents tend to do a lot of small to medium sized analyses, but far fewer giant queries. Agents running sandboxes tend to produce a huge number of very small filesystems.

An immediate impact of this paradigm is it breaks the rate limit assumptions of a lot of existing infrastructure - e.g. companies like Lovable literally cannot use GitHub because they exceed repo creation rate limits by multiple orders of magnitude, leading to companies like Mesa & Relace. The more subtle impact is it fundamentally changes how you want to store data and metadata.

Storage systems are designed around assumptions on the rough shape of a workload - the ratio of (number of objects) to (size per object) to (operations per object), the metadata-to-data ratio, the typical lifetime where an object is “hot”, the read-to-write ratio, etc.

If you suddenly have 1000x more cardinality, a totally different metadata to data ratio (as you need to keep track of a lot more data but each object is much smaller), and a different object lifecycle - things can utterly break. The “Small File Problem” in Hadoop is a classic, older example of this pattern.

This has a substantial impact on system design. Advanced namespace techniques like hash based addressing, prefix sharding, and other more efficient ways to traverse large name spaces become important. It becomes more critical to ensure there is not a high metadata overhead per object - e.g. large footers or headers. As the read-to-write ratio of the workload changes (in the git repo example, you are creating thousands of times the number of repos but each repo is accessed way less often, and in many cases a repo might never be accessed again after 30-60 seconds), you need to compact differently. Your sharding, file layout, metadata management, and storage format strategies may also need to change.
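As one small example of these namespace techniques, here is a sketch of hash-based prefix sharding. The function name, the two-level layout, and the choice of SHA-256 are assumptions for illustration. The idea is that sequential or similarly named objects (such as repo-000001, repo-000002) get spread evenly across the namespace instead of piling into one region of it.

```python
import hashlib


def shard_path(object_name: str, levels: int = 2, width: int = 2) -> str:
    """Place an object under directories derived from the hash of its name."""
    digest = hashlib.sha256(object_name.encode("utf-8")).hexdigest()
    # Take `levels` consecutive slices of `width` hex characters as prefixes.
    prefixes = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(prefixes + [digest])


if __name__ == "__main__":
    for name in ("repo-000001", "repo-000002", "repo-000003"):
        # Similar names land under unrelated prefixes.
        print(name, "->", shard_path(name))
```

With two levels of two hex characters each, objects are spread over 65,536 leaf prefixes, which keeps any single directory or key range from becoming the hotspot.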

Consider filesystems in agent sandboxes. If you have huge numbers of tiny sandbox filesystems, what is the right implementation? The simple solution is an individual volume per sandbox, but this is extremely expensive. The more complex solution is a shared volume base with per-sandbox copy-on-write overlays - which is what many of the more advanced agent sandbox vendors now do. But this then introduces new questions - what is the set of bases you store? When is it better to have a new base vs. store a diff? How do you re-balance your bases over time as the filesystems evolve? A whole new design space emerges to try to solve the problem.

As another example, the CEO of Firebolt recently put together a good presentation for Iceberg Summit discussing how agent workloads require a substantial rethinking of some combination of the Iceberg storage format and query engines above it.

Ben Wagner - Firebolt

Beyond being higher volume, agents also tend to be far more iterative than humans, and are often most effective assuming rapid iteration with a tight feedback loop. A human that needs to do a big analysis might meticulously write the query that does the entire thing, all at once. Agents tend to be better at writing a baseline query, looking at the results, and repeating at a high rate. This lends itself much more towards these sorts of “small data” systems, where a lot of small queries build up on one another.

A good example of this is the popularity of grep in coding agents over more complex, high scale search databases - it is easier for the agent to run a massive number of smaller grep queries in many situations.

As a result, systems designed for high volumes of “small data” tend to work better for agents - e.g. SQLite, MotherDuck, Bauplan, etc. These systems cannot support extremely large scale data workloads and tend to be designed to run only on single machines (vs. distributed), but in return they are simpler to run, lower cost, and more performant within that scale.

High volume, low cost

Agents tend to dramatically increase the overall load on systems for a few, multiplicative reasons.

First, agents tend to be expansionary in terms of how often infrastructure is used - e.g. vibe coding platforms and Claude Code are resulting in many orders of magnitude more code in the world than existed before, data analyst agents are resulting in way more people running way more data analysis queries than before.

Second, as I mentioned earlier, agents’ more parallel approach to querying systems also means each of those individual tasks tends to involve way more queries to the system.

As a result, the effective query load on a system is likely to go up by many orders of magnitude relative to what it was before, which can explode cost. One portfolio company I work with, Bauplan, is seeing data workloads grow by 10-100x once agents use their DB vs. humans. Clickhouse’s visual below captures this explosion of volume well.

Clickhouse - a data platform for agents

This greatly increases the importance of cost reduction mechanisms and architectures that solve for cost - such as tiered storage architectures, sophisticated caching, or compression techniques.

As an example, Turbopuffer’s object storage native architecture gained a lot of traction over earlier in-memory vector DBs like Pinecone thanks to offering an insane cost reduction (100x+). The primary users of vector databases are agents running retrieval queries as a tool to gather context, and complex agents want to run tons of retrieval queries to solve a task. The cost component was a fundamental enabler of the desired agent workload pattern.

To my earlier point about “high volume, small data” - Turbopuffer’s architecture also optimized for a large # of (relatively) smaller vector index namespaces, rather than a small number of very large vector index namespaces which is what was much more common in traditional search use cases like consumer web search or eCommerce recommendation systems. This tradeoff maps very well to vertical RAG & agent companies - e.g. Turbopuffer has to serve Cursor, which has millions of customers each with their own set of search namespaces which are dynamically queried by the Cursor agent.

Session-based, stateful queries

Relative to humans, agents tend to be much more session-oriented in how they engage with systems. Consider web search as an example. A given human looking something up might type a query, look at the results, and may perhaps repeat once or twice but for the most part the search is singular and one-off.

In contrast, agents default to highly, highly sequential queries that all build off of each other. You see this, for example, with how coding agents use grep - it is not uncommon for a coding agent to run 5, 10, or even 30 sequential search queries in a row against your codebase to find what it is looking for, each refined from the last.

From a systems perspective, what makes this pattern interesting is that it is possible to natively bake the idea of a “session” into a query system, not dissimilar from how recommendation systems in products like TikTok have traditionally used the full sequence of actions a user has taken to inform the next suggestion.

Going back to the web search example, most web search APIs today have no concept of stored state across queries - each query is stateless, not informed at all by previous queries. This is because the traditional way web search APIs have been used is extremely one off. Recommendation systems generally only work when there is a lot of sequential interaction data - humans don’t demonstrate this except in consumer product categories like social media, but agents demonstrate it everywhere.

What if instead, the web search API had a session ID that could be used to allow each successive query to be partially informed by previous queries, skewing the search away from results you’ve already seen and towards a more refined understanding of what you want?

I think there is a lot of optimization that could be done in such a case, such as:

Caching of incremental or partial results likely to re-appear in subsequent queries (see this paper as a similar idea in DBMS)
Intelligent reformulation of query N via a semantic understanding of what the user attempted to do in queries 0...N to avoid repeated or redundant work
Predictive querying of data in anticipation of what an agent is likely to query next
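To make the session ID idea concrete, here is a minimal sketch of a wrapper that adds per-session memory on top of a stateless search function and pushes already-seen results down the ranking. Everything here is hypothetical: the SessionSearch class, the shape of the search callback, and the naive “seen results go last” policy, which stands in for the much richer reformulation and prediction a real system could do.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


class SessionSearch:
    """Wraps a stateless search function with per-session memory."""

    def __init__(self, search: Callable[[str], List[str]]):
        self.search = search  # stateless backend: query text -> ranked doc ids
        self.seen: Dict[str, Set[str]] = defaultdict(set)

    def query(self, session_id: str, text: str, limit: int = 3) -> List[str]:
        results = self.search(text)
        seen = self.seen[session_id]
        # Stable sort: unseen results first, original ranking otherwise kept.
        ranked = sorted(results, key=lambda doc: doc in seen)
        top = ranked[:limit]
        seen.update(top)  # remember what this session has been shown
        return top


if __name__ == "__main__":
    backend = lambda text: ["doc-a", "doc-b", "doc-c", "doc-d"]
    api = SessionSearch(backend)
    print(api.query("session-1", "vector db pricing", limit=2))
    print(api.query("session-1", "vector db pricing per namespace", limit=2))
```

The second call in the same session surfaces results the agent has not yet seen, while a different session ID would start from a clean slate.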

Clickhouse touches on the idea of a stateful session in their recent blogpost on “Agent Facing Analytics”:

A server-side state for AI Memory

AI systems can retain and recall information over time which can help them make better decisions, personalize responses, or improve performance based on past interactions. This is often referred to as “AI memory”.

In the database, we can envision server-side features to support maintaining a state for the agents, the same way interactive users can maintain sessions with their settings and preferences preserved. This can be extended to various cache levels if recurring queries are submitted (especially relevant for data discovery queries) and will require reliable ways to identify agent users and the scope of their tasks.

This idea, by the way, can also be mapped to the parallel query patterns I described earlier - agents both branch wide and recurse deeply, and really you want the system to understand that all of this maps to a single “session” where the agent is trying to analyze or understand something.

Open the engine

I generally think that the rise of agents manipulating systems will lead many systems to expose a lot more, lower level APIs and “open their engine” more, because agents are FAR more capable of manipulating these low level APIs than humans.

Agents - as a general rule - want thin, dumb APIs, not “thick” APIs. Doug Turnbull’s blog does an excellent job articulating why this is the case for search engines (see also liberating search from the search engine). Most search engines today natively bake in a lot of logic around query understanding, query expansion, ranking, boosting, rescoring, etc. Such engines expose some ability to tweak this, but a lot of it is more like black magic in the engine.

But when LLMs are your user, you have transitioned from a dumb client, smart server assumption to a smart client assumption. The LLM is, in many cases, better at doing some of these things like query expansion and query understanding than your system is! And so the traditional breakdowns of what a system should do on its own vs. what it should let the client influence change, and in many cases you want to allow the client to manipulate all of these things in a much more precise, controlled fashion.

Imagine a web search API that exposed all the ranking systems, all the search metadata, all the query expansion, all the indices, and more to the agent, allowing it to very precisely articulate what it does and doesn’t want to do for each query. I have spoken to multiple senior people at leading AI labs who have begged the major web search API providers for agents to build this as they know it would improve their agent’s quality - but thus far no one has built it. They have clearly seen in internal evals that each incremental degree of control improves the agent’s quality on web retrieval. Instead, we’re left with web search APIs that still treat the entire search engine as a black box, supporting at most a few query parameters on top of your query.

When you are serving humans, exposing too much customization and flexibility can really shoot you in the foot - the system becomes too complex to learn, it is too easy for your users to mess things up, and the benefits your users get from such an open system are often minimal. But this all changes when you are serving agents - who in many cases are much better at conceptualizing the full scope of the system, and are often better at dynamically tweaking all of these parameters.

This blog post by Shaped is a great illustration of this concept - discussing how “information retrieval is moving from a static pipeline to a dynamic decision tree, where the agent builds and provisions the right tool on the fly based on the data”. They argue the right architecture for that is an information retrieval system that lets an agent dynamically compose arbitrary sets of retrieval systems as it does its work, rather than a more opinionated retrieval system that always does things in a pre-determined or pre-configured way.

The same idea likely applies to many other types of systems - such as query planning and query optimization in SQL databases. We already see agents being potentially more effective at aspects of query optimization than traditional SQL optimizers - such as join order selection (caveating cost/latency considerations). Why shouldn’t agents be able to influence rebalancing, write vs. read amplification, consistency vs. availability, and other common tradeoffs in systems?

I correspondingly think the value of composable, flexible, and “plugin”-able systems will go up a lot. For example, in SQL engines, might Datafusion end up being a more powerful tool than DuckDB for agent usage given Datafusion is so much more modular?

Simulation and sampling

Agents benefit from being able to easily run preliminary tests of ideas in a fast, cheap, non-destructive way before fully doing something. Aside from branching, there are a number of other ways a system might be able to support such workflows.

One is simulation - tooling that makes it easy to mock or mimic what might happen to a production system were a given change to happen. Vera is a cool example of a third party tool trying to do this, though I also think there is an opportunity for production systems to more natively support this kind of workflow.

Second is sampling - a system that can return a preview or directional overview of what will happen if something is done, before it fully occurs.

A simple traditional version of this is how the statistics engines in most SQL query engines can give a likely estimate or bound of the number of rows that will be queried or the estimated cost of a query. These sorts of flows make it easy to give the agent rapid feedback before a high cost or irreversible action is taken.

It’s interesting to consider what it might look like to really design a system around these sorts of probing or exploratory queries such as “SELECT * FROM table LIMIT 5” or “SELECT category, COUNT(*) AS c FROM table GROUP BY 1”. A few things that come to mind include:

Approximate query processing - Exploratory queries are, almost by definition, not very correctness constrained. One could in theory take advantage of this in numerous ways, such as having a long-lived cache for sampling queries you are okay being an hour stale, or allowing the user to provide a correctness bound as a query parameter. Sketches are a popular version of the latter in OLAP databases to efficiently compute an estimate of mean, median, etc, though one could imagine the idea being taken much further - such as allowing a more generalized “precision” parameter on a query.
Specialized storage - Why not have an optimized physical layout on disk for a sample of a table, if it will be one of the most popular types of query an agent runs on that table?
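As a rough sketch of what a generalized “precision” parameter could look like, the function below estimates the result of a “SELECT category, COUNT(*) ... GROUP BY 1” style query by scanning only a random fraction of the rows and scaling the counts back up. The function name, the meaning of precision as a sampling fraction, and the fixed seed are all assumptions for illustration; real engines use more careful sampling and sketch-based estimators.

```python
import random
from collections import Counter
from typing import Dict, List


def approx_group_count(rows: List[str], precision: float, seed: int = 0) -> Dict[str, float]:
    """Estimate per-category counts by scanning only a fraction of the rows.

    `precision` in (0, 1] is the fraction of rows sampled; 1.0 is exact.
    """
    if not 0 < precision <= 1:
        raise ValueError("precision must be in (0, 1]")
    if not rows:
        return {}
    sample_size = max(1, round(len(rows) * precision))
    rng = random.Random(seed)  # seeded so repeated probes are reproducible
    sample = rows if sample_size >= len(rows) else rng.sample(rows, sample_size)
    # Scale sampled counts back up to the size of the full table.
    scale = len(rows) / len(sample)
    return {category: count * scale for category, count in Counter(sample).items()}


if __name__ == "__main__":
    table = ["shoes"] * 900 + ["hats"] * 100
    print(approx_group_count(table, precision=0.1))  # cheap, directional
    print(approx_group_count(table, precision=1.0))  # exact
```

An agent could start with a low precision to decide whether a line of inquiry is worth pursuing, and only pay for the exact answer once it is.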

Clickhouse directly touches on this in a recent blog where they discuss how “improved discoverability” is likely one of the key tenets you need to design around when building an OLAP system for agents. They propose building SQL extensions that make this easier - “Think of it like a server-side version of pandas.describe() designed specifically for agents.”

The “Supporting Our AI Overlords” paper outlines a much more ambitious version of optimizing for sampling queries. They discuss designing a database around “probes”, which are semi-structured queries that the system can answer to steer the querying agent. These might include exploratory queries that aim to identify relevant tables or sub-tables (“Which tables relate to sales?”), contextual queries that give background context which allow the system to nudge the agent in the right way (“I am trying to gut check if east coast sales team is doing better than west coast”), or accuracy & termination criteria (“I only need to get a rough sense on whether the east coast team is doing a lot better than the west coast”). The paper explores how one could dramatically redesign a database to support these sorts of inquiries natively, essentially treating this as a first class type of query with its own optimizer alongside traditional SQL queries.

Taking this idea further, you might imagine how a system could be designed to proactively recommend followup queries or related lines of inquiry after an agent makes a query - essentially becoming more like a proactive recommendation system than a passive execution engine.

The broader point is that agents benefit a lot from small nudges, directional answers, and fast feedback - and there are a lot of ways to design a system around this.

Local Execution

Agents particularly benefit from having local, embedded, lightweight versions of systems that they can easily test and iterate on without having to go through the headache of manipulating production infrastructure which has more state, more risk, and more cost associated with it.

As an example, I think that coding agents are massively increasing the need for a way to reliably emulate GitHub Actions locally - leading to projects like Agent CI. I would argue grep’s popularity in coding agents is another very good example of this - in many cases it is more efficient for an agent to use the lightweight local search system than the heavy remote search system.

MotherDuck’s original vision of an OLAP data processing system built around dual client + server side execution, decomposing a query plan across local storage and cloud storage, is an interesting idea to potentially revisit. I could see these sorts of hybrid execution architectures becoming more popular in a world of agents where you want the compute speed of the system to be impedance-matched with the reasoning speed of the agent.

Query Complexity

Agents tend to write much more complex queries than humans. Often, their queries are far more “well-specified”, by which I mean they take advantage of the full set of available parameters or operators.

Search is an obvious case of this, where a human search query is often no more than 4-6 words (whether on web, an eCommerce site, or somewhere else), but agents will write much longer search queries mapping to exactly what they are looking for.

This structurally changes how you design a search engine. For example, see how Turbopuffer has updated its search techniques to account for this type of query. Many agents actually try to write web search queries that exceed the 32-word query length limit of most web search engines - these systems are not structurally designed to fulfill queries longer than that.

In a related vein, I think “operators” - or specialized precise filters in web search engines such as “site:” and “link:” - will need to make a comeback for agents. Many web search engines slowly deprecated many operators over the past decade, as humans tend not to use them and they add indexing and query engine complexity on the backend.

This idea applies broadly across query engines and APIs. Because agents are quite capable at understanding documentation and fully taking advantage of very specialized query parameters, the juice may be worth the squeeze to add far more niche query configurations or API parameters if they allow an agent to more precisely define its intent and objective.
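
To make the operator idea concrete, here is a minimal Python sketch of how a query front end might split an agent's long query into free-text terms and precise filters. The `ParsedQuery` type, the `parse_query` function, and the set of supported operators are all assumptions for illustration; no real search engine's parsing rules are being described.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical operator set; a real engine would define and document its own.
SUPPORTED_OPERATORS = {"site", "link"}
OPERATOR_PATTERN = re.compile(r"(\w+):(\S+)")


@dataclass
class ParsedQuery:
    terms: List[str] = field(default_factory=list)
    filters: Dict[str, List[str]] = field(default_factory=dict)


def parse_query(raw: str) -> ParsedQuery:
    """Split a raw query into free-text terms and operator filters."""
    parsed = ParsedQuery()
    for token in raw.split():
        match = OPERATOR_PATTERN.fullmatch(token)
        if match and match.group(1).lower() in SUPPORTED_OPERATORS:
            name = match.group(1).lower()
            parsed.filters.setdefault(name, []).append(match.group(2))
        else:
            # Unknown "name:value" tokens are kept as ordinary search terms.
            parsed.terms.append(token)
    return parsed
```

The design choice here is that filters are handled as exact constraints and never mixed into relevance ranking, which is what lets a well-documented operator narrow results far more precisely than extra keywords would.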

Other, smaller ideas

Fail fast - As we’ve discussed, agents benefit a lot from fast feedback loops. Another version of this is designing systems to be able to fail much sooner via pre-processing, type checks, and other types of safety checks. This is harder when serving humans because humans tend to be lazy (e.g. see how humans love Python), but agents are very good at annotating code that they write. So, it may be much more tenable to build SDKs and CLIs that have strong typing and other types of validation annotations natively built in.
Transactionality - I think agents will favor systems that have strong transactionality guarantees built into complex sequences of operations. For example, if a data agent needs to ingest data, transform it, and move it somewhere else, the entire loop should be reversible and atomic. It should be impossible to hit a partial execution result. This is a very difficult primitive to build in complex distributed systems, but it makes it much more tenable for the agent to try, fail, and iterate.
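
The ingest, transform, and move example can be sketched as a sequence of steps that each carry their own undo. This is a minimal, single-process Python illustration using compensating actions; `run_atomically`, the step functions, and the in-memory `store` are all hypothetical, and it deliberately ignores the hard distributed-systems problems (crashes mid-rollback, concurrent writers) that make the real primitive difficult.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, undo). All names here are hypothetical.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_atomically(steps: List[Step]) -> List[str]:
    """Run steps in order; if one fails, undo the completed ones in reverse."""
    completed = []
    try:
        for name, action, undo in steps:
            action()
            completed.append((name, undo))
    except Exception:
        for _, undo in reversed(completed):
            undo()
        raise
    return [name for name, _ in completed]


# A toy in-memory pipeline standing in for real storage systems.
store = {"staged": [], "published": []}

def ingest():
    store["staged"] = [1, 2, 3]

def undo_ingest():
    store["staged"] = []

def transform():
    store["staged"] = [value * 10 for value in store["staged"]]

def undo_transform():
    store["staged"] = [value // 10 for value in store["staged"]]

def publish():
    raise RuntimeError("destination unavailable")

def undo_publish():
    store["published"] = []

try:
    run_atomically([
        ("ingest", ingest, undo_ingest),
        ("transform", transform, undo_transform),
        ("publish", publish, undo_publish),
    ])
except RuntimeError:
    pass  # The failure surfaces to the agent, but no partial state remains.
```

After the failed run, `store` is back to its initial empty state, so the agent can adjust its plan and retry without first having to work out what was left half-done.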

Putting it together - an observability case study

I’ll round this out with a case study from Firetiger, which recently wrote a great post on designing an observability database for agents.

Observability databases for humans need to primarily optimize for the dashboard use case - low latency queries over very large datasets. A lot of the traditional optimizations in observability serve this use case by attempting to minimize the cost and latency of these large queries, such as by pre-aggregating data, sampling data, caching results, or similar.

As we have discussed, agents, in contrast, want to explore large amounts of data in a much more parallel, exploratory, and dynamic way. The Firetiger team touches on how this greatly increases the importance of:

Supporting exhaustive, high cardinality data - as agents have semi infinite processing bandwidth, can sift through everything, and don’t just want the top level aggregates
Supporting immense query volume due to parallel explorations - increasing the importance of separating compute from storage, having serverless compute workers that can spin up and down quickly, and having strong isolation between workers so they can not block or compete with each other. Basically, you want to optimize for throughput and concurrency, not latency
Dealing with the small files problem for fresh, quickly accessible data - via techniques like an ingest service that merges concurrent writes into tables, and intelligently compacting and expiring older data & metadata
Data discoverability & sampling - Tools for agents to discover what data exists and how to query it
Schematization - To make it easier for agents to understand how to navigate what is normally unstructured telemetry data and increase token efficiency

Note how many of these are hard system tradeoffs - optimizing for aggregates vs. random specialized reads, optimizing for latency vs throughput, optimizing for concurrency vs. serialized operations. These are not minor differences or abstraction sugarcoating.

I thus think it is inevitable that you will see a new version of every type of computer science system emerge that is designed for agents. The sheer number of ways you can redesign each system is too substantial for this not to be the case.

More fundamentally, I suspect that agents will end up blurring the lines of traditional systems categories, creating new categories altogether. One obvious example of this is that agents seem to blur the lines between OLTP and OLAP significantly, which is partially why Databricks is pushing the “lakebase” architecture so much (see also this Clickhouse talk from last year). But, I think this will play out in more complex ways that are difficult to predict right now.

Infrastructure companies deeply exploring these ideas

I am sure there are some I am missing, and of course some of these ideas were popularized in systems completely outside of the scope of AI agents but become more important when serving agents, but nonetheless here is a rough take on the startups more aggressively pursuing the types of ideas outlined in this blog post, separated by category:

Git & Filesystems - Relace, Mesa, Pierre, ESRC
CI - Blacksmith, AgentCI
Object Storage - Tigris
Web Search - Exa, Parallel, Firecrawl
OLAP - Bauplan, Motherduck, Clickhouse, Firebolt
OLTP - Neon (Databricks), Xata, Turso
Observability - Firetiger, Nominal
Search & Recommendations - Turbopuffer, Chroma, Shaped, Lance, Hornet
General Sandbox/Compute - Modal, Daytona, e2b, Sprites, OpenComputer

I am extremely interested in investing in startups using ideas like these to rebuild or rethink infrastructure categories, and in talking to people redesigning large existing systems along these lines. Please get in touch if so - davis @ innovationendeavors.com

Thanks to Chris Riccomini, Jacopo Tagliabue, Apurva Mehta, & Vedran Jukic for providing feedback & input on this.


The two software development stacks

Davis Treybig — Thu, 16 Apr 2026 19:27:54 GMT

Today, virtually all developer tool startups are spending their time trying to rethink their product to serve software engineers with fleets of AI coding agents at their disposal. AI coding agents are changing most, if not all, workflows in the software development lifecycle and are putting enormous pressure on existing tooling for humans to adapt.

Yet, at the same time, a parallel version of the software development lifecycle is emerging - the “agent only” stack. This is the software development stack for code that no human ever sees - code that is written, built, tested, and deployed by an agent completely outside of the scope of a traditional software engineering team. A simple example of this would be any “ephemeral” code written by Claude as you interact with it.

Almost no one is thinking about building developer tools for this parallel SDLC, yet I am increasingly confident that all layers of the existing SDLC will be rethought for it.

My central argument in this post is that more developer tools companies should consider going after this market, rather than the “human + AI” market, as it is very new, very rapidly growing, utterly underserved, and will likely ultimately eclipse the size of the traditional software development lifecycle tools market.

The agent-only stack

The most obvious version of the “agent-only” stack today is vibe coding applications. Companies like Lovable, Bolt, and Replit are products where, every time a user interacts with the product, a substantial amount of code is being generated behind the scenes. That code needs to be tested, compiled, executed, and hosted somewhere. But it is completely separate from the way we traditionally think about a software engineering lifecycle - there is no pull request, no human code review, and in 99% of cases no human will ever look at that code. The code is an invisible implementation detail rather than an artifact a team of engineers collaborates on.

You are already starting to see “developer tool” startups serve this market in a specialized way - e.g. a few of the major vibe coding startups now use code.storage as their headless git storage layer. This is interesting for two reasons. First, Pierre, the parent company of code.storage, was originally trying to displace Github for humans for a few years, and recently decided to just pivot fully into this direction instead. Second, code.storage is very clearly not attempting to serve the human software engineer use case at all - but is rather going all in on trying to serve the agent-only use case.

I would also argue that the sandbox companies are a version of this - they are essentially cloud development environments for agents, not humans. Indeed, one of the most prominent companies in agent sandboxes, Daytona, was originally building CDEs for humans. This is another case of a developer tool startup that went all in on the agent-only stack, to great success.

While vibe coding was the first customer segment, the market for these sorts of ideas now extends far beyond just vibe coding startups. Any product that produces dynamically generated UIs or encourages “personal software” as a pattern is using code generation agents to produce code under the hood. Almost every agent using a sandbox is using code generation for things like tool calling, file manipulation, data processing, and similar, even if the agent is not in and of itself a code gen agent. Any AI tool that supports the idea of “Skills” is writing persistent code that needs to be stored, versioned, and ultimately support operations like merging or approvals. Any Claude Cowork style product is producing numerous files & artifacts that need to be treated like code.

I am quite positive that this agent-only software development stack will end up recreating most of the key ideas that exist in the human SDLC. Code storage is the first step, after which will follow testing, CI, collaborative editing/approval workflows (amongst agents), CD, and monitoring, exactly mirroring the progression that occurred from 2005-2020 with the cloud software development stack for humans.

In simpler terms - there is going to be an invisible, “headless” SDLC that ends up being a core component of most, if not ultimately all, agents.

The strategic question for dev tool startups

This brings me to what I think is the most consequential strategic question for developer tool startups right now: which stack do you build for?

If you are a dev tool startup working on some aspect of the software development lifecycle, you have three options:

1. Adapt for the human-assisted stack — Take what you are building and evolve it for a world where human engineers use agents as part of their existing workflow. This is the obvious move, and it is where most of the market’s attention is focused today.

2. Build for the agent-only stack — Pivot or orient your product entirely around serving the needs of agent-generated code that humans never touch. This space is much earlier, more unknown, and more rapidly changing — but it is also completely greenfield.

3. Try to serve both — Build a product that straddles both worlds. In some cases, this is feasible. An AI code review tool, for example, could plausibly serve both the human use case (by integrating with GitHub Actions and reviewing PRs before humans look at them) and the agent use case (by being a Claude code hook that runs micro-reviews as the agent writes code, completely outside the scope of Github or humans reviewing the review).

Increasingly, my perspective is that if you are a dev tool startup, the right answer is to go all in on the agent-only use case and to deprioritize — or even ignore — the human use case. There are several reasons for this.

First, the market is larger and growing faster. I believe that very shortly, the volume of agent-generated code produced in agent-only contexts — vibe coding apps, vertical AI agents, autonomous systems — will surpass the volume of code produced in human software engineering teams.

Second, the opportunity is dramatically more underserved. The human-assisted stack, while evolving, has a rich ecosystem of existing tools and well-understood workflows. The agent-only stack has essentially nothing purpose-built for it today and no “status quo” solutions. Consider my code storage example earlier - even if your code storage paradigm was dramatically better than git and Github for humans using AI agents, displacing Github is extraordinarily difficult because of the ecosystem & network effects around Github. Every other developer tool integrates with Github, every software engineer is used to Github, and every team has a huge amount of config state stored in their CI pipelines. None of this is an issue if you serve the agent vibe coding an app in Lovable.

Third, the design constraints are fundamentally different. Agent-only developer tools must be built API-first and headless, essentially becoming more like “infrastructure” than a SaaS tool. Deployment may need to be cloud-prem rather than SaaS. The business model must be usage based, not seat based. UIs and workflows don’t matter, and in many cases you need to rethink the core abstractions to serve the agent use case. Finally, fundamental assumptions regarding infrastructure may differ - for example, see how many vibe coding apps struggle with GitHub rate limits on repo creation, because repo creation was traditionally a very rare/occasional thing for a human software team, but a vibe coding app probably wants to create a repo per user session. These sorts of factors are also why it will be very hard to serve both agents and humans - these are hard fork decisions that are difficult to blend.

Consider observability. What would an observability solution look like for the agent-only stack? All config and setup would need to be via API, and dashboards would be completely irrelevant. You’d want to design around an autonomous feedback loop where issues in the deployed software immediately trigger the agent to write new code or update code. And you may very well want to redesign or reconsider the database layer, because you will need to deal with a very high volume of “micro apps” rather than a small number of “heavy apps”.

Another interesting one to consider is task management. What would a Linear or Asana look like for the “agent-only” stack? Is there room for a product like this?

The hard part about building for this market is it is still so new and emerging. But, that is also what makes it exciting. Today, there are an extremely small number of companies I am aware of focused entirely on this market segment:

Code storage & versioning (e.g. Github for agent-only stack) - https://code.storage/, https://mesa.dev/, https://www.freestyle.sh/
Sandboxes (eg. developer environments for agent-only stack) - Daytona, e2b, Runloop, etc
Essentially nothing in testing, review, CICD, observability, task management, etc

As a result, I think this is a really rich area to explore as a startup right now.


Engines & kernels, not applications

Davis Treybig — Wed, 28 Jan 2026 00:49:49 GMT

A large cohort of the most valuable software companies in the world can be thought of as a very sophisticated engine or kernel, with a UI on top. Examples of this include:

Solidworks - CAD Geometry Kernel + UI
Ansys - Simulation engine + UI
Figma - Web-based vector graphics engine + UI
After Effects - Motion graphics kernel + UI
Premiere - Video processing engine + UI
DaVinci Resolve - Color grading engine + UI
Unreal Engine - Game engine + Editor UI
Wolfram Mathematica - Symbolic computation engine + UI
Gurobi - Operations research solver + UI
Revit - BIM engine + UI
Hex/Jupyter - OLAP Python kernel + UI

Essentially, almost any product serving sophisticated “builders” (engineers, architects, designers, etc) looks like this.

For the most part, I would argue that somewhere between 90-99% of the IP and value of these products lies in the computational complexity of the engine, with the UI and user workflow really just being a layer that maps to the API surface of the engine. Indeed, aside from Figma, I think almost none of the products I list above can be considered “elegant” or “simple to use” - but people learn to use them anyway because they are powerful.

Traditionally, it was critical for such products to build out and think about the UI as the main (if not only) way for people to use these products. An engine without the workflow would have been completely inaccessible to the people who need to use these tools.

However, large language models (specifically coding agents) are beginning to change this dramatically. Coding agents are very good at assembling on-the-fly UIs. They are also amazing at mapping a complex user request into a set of API calls. And correspondingly, they seem quite promising for solving the classic challenge these products have in terms of accessibility, learning curve, & product complexity.

BlenderGPT is a fun, but simple example of this, which uses LLMs to script Blender, the 3D design software. The “stack” essentially becomes LLM + Blender Engine, with the UI of Blender becoming somewhat irrelevant.

I think we are already at the point where this paradigm of a Claude Cowork style interface on top of one of these engines is in some cases more powerful and flexible than simply using the application itself.

The first vector by which this occurs is that AI allows for automation of larger, complex, and especially more procedural tasks. It can map semantic intent to a large loop of doing a bunch of things in a certain order - something you could of course do yourself, but which would take forever.

The second vector by which this occurs is that AI can help with discovery and fully “accessing” the breadth of what the tool offers. Even very experienced professionals in these domains are often unaware of the full range of functionality embedded in these very complex engines. As an example, even hardware engineers with 10-20 years of experience often don’t feel like they are experts in CAD simulation. AI makes these sorts of workflows that matter, but are outside of the “core”, accessible.

The third vector is flexibility. You can have something closer to personalized, ephemeral UI as you do things - which in some cases is more powerful than the fixed UI paradigms of these tools. This also allows for things like progressive disclosure of complexity.

You are starting to see bits and pieces of this in the wild. Marimo is popular partially because the notebooks are stored as pure Python, meaning that you can treat Marimo more like the “engine” of a data science notebook. Rive is interesting because it can be used as an embedded animation rendering engine fully programmatically, something not really possible with AfterEffects.

Similarly, there are now a number of startups that are applying ChatGPT style UIs on top of these engines - similar to BlenderGPT.

As LLMs continue to improve and progress, I think that the right way to think about this type of software will be purely as an engine or kernel, not really as an “application”. And I think this will be very, very disruptive to many of these categories as a result.

I specifically think this creates two general themes of startup opportunity that are interesting:

Code-gen based automation of existing “Engine + UI” Tools - Build an LLM based, conversational UI that allows one to dynamically manipulate products like Blender, Unreal, Premiere, or similar via code generation. Essentially, use AI to treat the existing products like engines, not applications. I think in many cases this can both expand the scope of who can use these sorts of tools, as well as vastly increase the productivity of the people already using these tools
Build an “engine” startup in one of these categories designed for programmatic use - Ultimately, most of these products were designed to be used as applications, not as infrastructure. While many expose some kind of scripting or extension API, these APIs are ultimately limited. And more importantly, if you were purposefully designing the interface for an agent writing code against it, it would almost certainly look different. Systems for agents look different, as we have seen with how many traditional software infrastructure categories are being rewritten for agents.

What’s interesting about #1 is that you can finally side-step the biggest traditional challenge in these spaces - feature density - because you can access all existing features via code generation. What’s interesting about #2 is that you can start with a very small/specific scope, because in the programmatic case you often just need to do 1-2 things better, rather than having to do everything better because you are trying to replace the core system of engagement. In other words - LLMs structurally remove the classic reasons it has been hard to build new startups in these categories.

A few years ago, I invested in a company, Modyfi, that built a completely novel GPU-accelerated graphics engine that ran inside the browser. It allowed for real time rendering & compositing of motion graphics in situations where AfterEffects might have needed 10+ minutes to render. At the time, they also attempted to build a novel UI & workflow tool to make use of this engine, but this was very difficult to do because it was going to take forever to be at feature parity with AfterEffects from a UX perspective and workflow perspective. The company eventually sold to Figma.

I think that if Modyfi were started today, the better approach may actually be to try to monetize the engine mostly via AI-based code generation as the interface. Essentially, Modyfi becomes more like a “tool” in the agent world than an application. This would have saved the team so much time & money, let them get to market faster, and honestly played more into the company’s unique strengths.

I am very interested in teams going after ideas along these lines. All of these feature dense “engine” categories are much more up for grabs than they ever were before. Shoot me a note if you’re building in this area - davis @ innovationendeavors.com


All agents will become coding agents

Davis Treybig — Tue, 13 Jan 2026 23:08:43 GMT

Yesterday, Anthropic launched Claude Cowork, a new consumer product experience that helps you get work done by putting a rich UI on top of Claude Code. Anthropic itself admits that this came as a result of seeing countless Claude Code power users using it for things that have nothing to do with software engineering - like managing a personal todo list or handling email.

It turns out that the “LLM + Computer” agent paradigm that Claude Code pioneered, where an LLM has access to a file system, a bash terminal, code generation, and similar Linux computing primitives, is exceptionally powerful - regardless of whether you need to write code as part of your actual task.

I increasingly think it is likely that all agents will move towards this architectural design pattern. In other words, all agents will become coding agents.

This post will explore early examples of this across different applied AI startups, discuss reasons for why this architecture is so powerful, and explore its downstream implications on startup opportunities in applied AI and AI infrastructure.

Code generation as a universal tool

Why does code generation matter for non-coding agents? Let’s begin by exploring the common set of reasons why code generation is so effective for non-software-engineering agents.

Code as a reasoning layer

The vast majority of AI startups need to do nuanced, mathematical reasoning of some form as part of their core function. AI accounting startups need to do spreadsheet manipulation. AI financial research startups need to be able to pivot and filter financial data. AI document processing startups need to manipulate numerical data extracted from tables and aggregate/summarize it. AI research scientists will need to analyze research results.

Doing this type of precise reasoning in token space is very, very unreliable due to how language models tokenize language. What is much more effective is treating all numerical manipulation as a code generation problem, similar to AgentMath.
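
As a small illustration of the difference, here is the kind of Python an agent might generate to total amounts extracted from a document table, instead of adding them up in token space. The `parse_amount` and `total` helpers and the formatting conventions they accept (dollar signs, thousands separators, parentheses for negatives) are assumptions for this sketch and are not taken from AgentMath or any specific product.

```python
from decimal import Decimal
from typing import Iterable


def parse_amount(text: str) -> Decimal:
    """Convert a table cell like '$1,204.50' or '(300.25)' into an exact Decimal."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    # Accounting convention: parentheses denote a negative amount.
    if cleaned.startswith("(") and cleaned.endswith(")"):
        return -Decimal(cleaned[1:-1])
    return Decimal(cleaned)


def total(cells: Iterable[str]) -> Decimal:
    """Sum the parsed amounts exactly, with no floating point rounding."""
    return sum((parse_amount(cell) for cell in cells), Decimal("0"))
```

The model only has to get the parsing logic right once; the arithmetic itself is then exact no matter how many rows the table has.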

Code as a tool-calling layer

Traditional tool-calling paradigms (ala ReAct) have the LLM repeatedly process the context, identify the next tool to use, run that tool, and repeat.

This is both less efficient (since it requires calling an LLM for every tool call) and more prone to mistakes than asking the LLM one time to produce code that runs a sequence of tool calls and allows the agent to make progress.

For example, imagine an agent is tasked to synthesize a bunch of financial records, and has access to a company search tool & a fetch financials for company tool. What you want to do is:

Search for all companies
For each company, fetch financials
For each financial record, compute a score
Save the result

This task is much more effectively achieved by having the LLM write the code to execute this loop once, and then procedurally executing the loop, rather than having it linearly call tool > reason about next step > call tool > reason about next step, etc.
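
The four steps above can be sketched as the single piece of Python an LLM might emit in one call. The tool functions (`search_companies`, `fetch_financials`), the `save_result` callback, the fake data, and the scoring rule (revenue minus costs) are all hypothetical stand-ins; in a real agent the tools would be whatever the sandbox exposes.

```python
from typing import Callable, Dict, List


def synthesize_financials(
    search_companies: Callable[[], List[str]],
    fetch_financials: Callable[[str], Dict[str, int]],
    save_result: Callable[[List[dict]], None],
) -> List[dict]:
    """Run the whole search -> fetch -> score -> save loop without LLM round trips."""
    scored = []
    for company in search_companies():
        record = fetch_financials(company)
        # Hypothetical scoring rule, chosen only to keep the example concrete.
        score = record["revenue"] - record["costs"]
        scored.append({"company": company, "score": score})
    save_result(scored)
    return scored


# Stub tools so the sketch runs on its own.
_FAKE_FINANCIALS = {
    "Acme": {"revenue": 120, "costs": 90},
    "Globex": {"revenue": 80, "costs": 95},
}

def search_companies() -> List[str]:
    return sorted(_FAKE_FINANCIALS)

def fetch_financials(company: str) -> Dict[str, int]:
    return _FAKE_FINANCIALS[company]

saved: List[List[dict]] = []
result = synthesize_financials(search_companies, fetch_financials, saved.append)
```

With N companies, the step-by-step pattern needs on the order of N model calls just to drive the loop, while this version needs one call to write the code and zero while it runs.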

This blog post is a great deeper dive on this concept, and this recent post by Replit reflects the same idea. Claude moving its default tool-use paradigm to Claude Skills, which stores tools in a file system where they can be found via bash and file system commands, is a direct reflection of the market moving in this direction.

Code as a context management layer

Computing environments provide a powerful substrate for dealing with context management. This video and blog about context engineering in Manus are some of the best illustrations of this that I have seen.

Manus stores context in the filesystem of a computing environment, utilizes bash commands & file paths to progressively disclose context and tools, and decomposes most user tasks into some combination of: 1. Web access, 2. Code generation, 3. Context search, 4. Computing utilities (bash, file system, etc).

The reason this works so well is that most context & tools are, by default, hidden from the LLM at any step, but the LLM always has access to a few utilities (e.g. bash commands) that allow it to progressively disclose data or tools it may need. This avoids context rot, and also saves you a lot of cost and latency thanks to minimizing token usage.

Note that using code for tool-calling is a particularly important version of this. Loading up a ton of MCP servers completely overloads the context window of most models, but having a small set of composable tools + code as an orchestration layer allows this to be avoided.

A lot of research work is now taking these ideas further. For example, RLMs explore an agent architecture where the context is represented as a variable in memory that is not shown in any way to the LLM unless it specifically asks for it using various tools (grep, peeking, etc).
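
A minimal Python sketch of that idea follows: the full context lives in a variable, and the model only ever sees what it pulls out through a couple of narrow tools. The `HiddenContext` class and its `size`, `peek`, and `grep` methods are hypothetical names for illustration, and this is not the implementation used in the RLM work.

```python
import re
from typing import List, Tuple


class HiddenContext:
    """Holds a large context in memory and exposes it only through small tools."""

    def __init__(self, text: str):
        self._lines = text.splitlines()

    def size(self) -> int:
        """Number of lines available, so the model can plan its exploration."""
        return len(self._lines)

    def peek(self, start: int, count: int = 5) -> List[str]:
        """Return a small window of lines starting at `start`."""
        return self._lines[start:start + count]

    def grep(self, pattern: str, max_hits: int = 10) -> List[Tuple[int, str]]:
        """Return up to `max_hits` (line_number, line) pairs matching a regex."""
        regex = re.compile(pattern)
        hits = [
            (index, line)
            for index, line in enumerate(self._lines)
            if regex.search(line)
        ]
        return hits[:max_hits]
```

Capping `count` and `max_hits` is the point of the design: each tool call adds only a few lines to the prompt, so token usage tracks what the agent actually needs and not the size of the underlying data.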

I think this will become the predominant pattern for context engineering - treat context as data that can be dynamically accessed via a small set of powerful primitives (e.g. search).

Code for universal interoperability

The most effective AI products meet you where you are, allowing you to input data in any way (e.g. upload any file type) or integrate them into whatever systems you use. However, it is difficult if not impossible to pre-build every integration all your users may want, or a custom tool for any kind of input the user may provide.

What can work - though - is allowing your agent to write code to dynamically process any input, write any last mile integration, or do anything on the web (via browser use) or on your computer (via computer use).

A simple, but illustrative, example of this is how recent advances in code generation models have suddenly allowed an influx of new “AI Copilot” products to be built that help automate IDE-like products that lack robust plugin or extension ecosystems. You can’t build a Cursor-style product for tools like Adobe Premiere, Adobe After Effects, Solidworks, Ansys, and Cadence because these tools are not open source and don’t have fully expressive extension ecosystems, but they all have scripting languages one can write code against.

It is almost a truism that AI products are more effective the more data that they can reference, and code generation is the best way to allow maximum flexibility on data ingestion.

Code is infinitely expressive

ChatGPT used to have specialized pan & zoom tools to manipulate images that users upload. Now, it just writes code to do image processing, which is a much more flexible architecture because the system can do far more than only panning or zooming - but rather achieve the vast majority of image processing tasks.

If you want your agent to be able to do almost anything your users want, the best way to achieve it is to allow the agent to write code. It is near impossible to fully enumerate all the tools or capabilities a given agent should have in any domain, and so having code as a fallback almost always improves agent quality.

Code generation for ephemeral software

Everything we have discussed so far has more to do with the internals of how an agent thinks & processes data. However, I also think code generation has a lot of value as a UX paradigm for interacting with agents.

While natural language interfaces for agents will not go away, there is clear value in having structured UIs that complement or augment natural language. And the best way to achieve this is to allow the agent to dynamically write ephemeral, last-mile software for the user as it does its job.

Claude Artifacts was the first great example of this, and you now see this pattern of “conversation on the left, ephemeral UI on the right” becoming much more common in consumer AI products.

I suspect that all agents will end up benefitting massively from this, because regardless of what type of user your AI product serves, they will benefit from having micro-apps and structured UIs to interact with.

The TLDR of the above is that - making the core of your agent a coding agent with access to a computing environment gives you such a powerful baseline that it is hard to imagine not doing it at this point.

Indeed, one of the biggest themes I see right now amongst AI startups is “Claude Code Wrappers” going after first-wave RAG/agent startups. In many cases, this strategy actually allows you to build a superior product in weeks thanks to all the benefits of code generation architectures. Claude Code taking so much market share from Windsurf/Cursor/etc despite being a random side project at Anthropic initially is a great example of this - the “LLM + Computer” agent paradigm was so powerful in and of itself that it outweighed years of feature development.

I’d love to, for example, see a startup apply these ideas to “Deep Research” as a category. I think you could almost immediately build a 10x better deep research product than what OpenAI & Gemini currently offer by:

Storing all primary data (e.g. web search results, scraped web data) in a file system for ongoing access & context management. Today’s deep research products discard this data, when instead they could create a dynamic data lake associated with the research report that lets you continue to query/process the data without needing to restart the web search.
Using code for mathematical manipulation. Almost all deep research involves numerical manipulation & data analysis. This should be treated more like a code generation problem.
Making deep research outputs dynamic, living artifacts rather than just static word documents - think a lightweight javascript app rather than a word document.
Using code generation to enable additional functionality for data ingestion - e.g. allow me to provide auth credentials or an API key for the deep research agent to write code to log in to private/proprietary systems and combine that with public web research.
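
A minimal sketch of the first two ideas, assuming a hypothetical `ResearchStore` class that persists each fetched source as a JSON file so that follow-up questions can be answered with code rather than a fresh web search. The file layout, field names, URLs, and sample figures are illustrative assumptions, not how any existing deep research product works.

```python
import json
import statistics
import tempfile
from pathlib import Path


class ResearchStore:
    """Hypothetical file-backed store for primary research data."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, source_id: str, url: str, records: list[dict]) -> Path:
        # One JSON file per source keeps the raw data around after the report is written.
        path = self.root / f"{source_id}.json"
        path.write_text(json.dumps({"url": url, "records": records}))
        return path

    def load_all(self) -> list[dict]:
        # Re-read every saved source; sorted for a deterministic order.
        return [json.loads(p.read_text()) for p in sorted(self.root.glob("*.json"))]


def mean_of(store: ResearchStore, field: str) -> float:
    # Numerical analysis runs as code over the stored data,
    # instead of being produced as free-text model output.
    values = [r[field] for doc in store.load_all() for r in doc["records"] if field in r]
    return statistics.mean(values)


# Illustrative data only.
store = ResearchStore(Path(tempfile.mkdtemp()) / "report-data")
store.save("src-a", "https://example.com/a", [{"price": 10.0}, {"price": 14.0}])
store.save("src-b", "https://example.com/b", [{"price": 12.0}])
average_price = mean_of(store, "price")
```

Because the sources stay on disk, a later question such as "what is the median instead?" becomes another small function over `store.load_all()` rather than a new research run.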

Downstream implications

Let’s presume that everything I have said thus far plays out, and all vertical agents start to make heavy use of code generation. What is the impact of this, and what opportunities might emerge from it?

Computing sandboxes become a default agent primitive

The first, and most obvious, implication is that computing sandboxes for agents become a universal infrastructure need.

If every agent writes code, then every agent needs to isolate untrusted code execution, and every agent needs access to a filesystem and a terminal. Manus’s architecture revolving so heavily around e2b is illustrative of this.
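
To make the shape of the primitive concrete, here is a rough sketch of a `run_untrusted` helper (a hypothetical name) that runs generated code in a separate interpreter with a scratch working directory and a timeout. This is an assumption-laden illustration of the interface only: a subprocess is not a security boundary, and real sandbox products rely on containers or MicroVMs for isolation.

```python
import subprocess
import sys
import tempfile


def run_untrusted(code: str, timeout_s: float = 5.0) -> tuple[int, str]:
    """Run code in a separate interpreter with a throwaway working directory.

    NOTE: this is NOT real isolation. It only illustrates the call shape
    (code in, exit status and output out) that a sandbox exposes.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            result = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: Python's isolated mode
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # The child process is killed when the timeout expires.
            return -1, "timed out"
        return result.returncode, result.stdout.strip()


status, output = run_untrusted("print(sum(range(10)))")
```

Everything a production sandbox adds (network policy, resource limits, persistent state, snapshots) sits behind roughly this same interface, which is part of why it is attractive to buy rather than build.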

Today, sandboxes are mostly used by pure-play code generation agents, but as more of the market moves to code-centric architectures, more AI startups will need to adopt tooling in this space. I suspect that the computing sandbox will be as universal of a need as search/retrieval engines like Turbopuffer, Chroma, & LanceDB.

While it will certainly be possible to DIY an agent sandbox on top of k8s containers or MicroVMs, I suspect best-in-class products will end up winning, analogous to how few serious AI startups simply use PGVector or FAISS for retrieval. There is a lot of room for technical innovation in this space across:

Virtualization - e.g. cold start times, isolation boundaries
Distributed systems - e.g. dealing with persistent state, syncing to remote state, cross-sandbox state, rapidly attaching large state to the VM
Environment definition/harness - e.g. what are all the right tools to expose in a sandbox and the right abstractions. Feels like file systems and git are particularly rich areas
Features/ergonomics - e.g. passing data in/out of sandbox, remote view of sandbox, etc

While there are already a number of products in this category - both specialized startups (Modal, e2b, Daytona, Runloop) and offerings from large cloud vendors (Cloudflare, Vercel) - my feeling is that the market here is still very early and there is a lot of room for further innovation.

I also think it’s possible that the idea starts to extend beyond “sandbox” to “cloud” - e.g. each agent has access to its own cloud account with a range of computing primitives (VMs, Queues, Databases, Object Storage, etc) rather than centering around just the VM. Pertinent tweet here.

A new SDLC stack will emerge for ephemeral code

If all agents begin to write large amounts of “ephemeral” code for reasoning, context management, function calling, and last-mile micro apps, we will invariably start applying many of the same concepts that exist in normal software development to this code, including version control, unit tests, integration tests, code review, CICD workflows, and more.

I think this will end up looking like a high performance, “headless” Github that rhymes with existing human-centric SDLC workflows, but differs substantially in a few key ways, such as:

Performance - You’ll want to be able to run an end-to-end CICD pipeline for the code in the span of seconds (or less)
Automation - The end-to-end workflow will need to be fully automated, and not contingent on human triggers or involvement
Git - I think you can probably rethink many aspects of VCS for this use case. For example, you likely want a VCS paradigm where every change is a commit, that natively supports semantic diffs, and that allows for extremely parallel branching & conflict resolution. You may also want to store agent trajectories in git (Meta now does this) and use a storage format that allows for better search & OLAP style queries over the git history for the agent to understand what was tried previously.
API Design - My guess is many aspects of the Github API would be done differently for this workflow. For example, you probably want much more flexibility and control over what verification or review layers are applied to each piece of code, rather than a “monolithic” CICD pipeline. Some core concepts may also benefit from being modified - for example I am not sure the “repo” as the core unit of work is still the right idea given agents will write a lot of very small/micro code chunks.
Consumer Facing UI - You may end up wanting some kind of UI component that allows the end user (e.g. the user of the AI agent product which writes ephemeral code) to somehow visualize or understand or modify the code that was written or used, especially for any ephemeral UX components created

A few companies are early in exploring ideas here like Relace and Freestyle, but I think there is a lot more to be done.

Specialized “computing environment” tools

My guess is that there will be startup opportunities to build a best-in-class version of each major “computing environment” tool. For example - if file systems remain one of the marquee components of agent architectures, then what would a file system offering built from first principles for agents look like?

I think most of these opportunities will look like open sourcing a free library that can be included in whatever sandbox provider or offering you are using, and then monetizing a cloud layer on top of it that is needed for multi-agent systems, very long running agents that might pause & resume, and/or cases where the agent must manipulate data that exceeds the memory constraints of the sandbox.

Successful teams will do a lot of work on harness engineering for that tool, ensuring that the abstraction is optimal for agent quality. The benefit of using the specialized provider will be this plus the cloud sync distributed systems layer.

Beyond file systems, I think this might also apply to OLTP databases, OLAP databases, search engines, durable execution/parallelism/threading, and git.

If you are applying these architectural ideas to vertical agent categories, or building infrastructure centered around these ideas, I’d love to talk to you - reach out at davis @ innovationendeavors.com

Documentation as tool for agents

Davis Treybig — Thu, 13 Nov 2025 19:21:02 GMT

I was recently vibe coding a wedding website with Replit, and noticed that Replit Agent now includes a documentation step as part of its workflow. After it finishes a task and thinks it is at a checkpoint, it will autonomously decide to update the documentation in the repository. This was not something I asked it to do.

Replit documentation step

If we look at the Replit.md it produces, we see a README style document that touches on overall intent, design goals, key features, architectural design, and similar.

This is very interesting because, from the perspective of the user, the Replit.md file is mostly an implementation detail. While Replit does allow you to find the file in its file viewer UI, this is fairly difficult to get to, and in the course of “vibe coding” it is not something you are particularly meant to look at.

So, why does Replit do this? My suspicion is that documentation is a tool that improves agent quality. I think this is an under appreciated concept, and one that will become much more prominent over the coming years.

Documentation as an index structure

One lens to think about documentation in the context of coding agents is as an index structure.

Coding agents today use a variety of tools to retrieve and manage context about the codebase - including navigating the abstract syntax tree, lexical & semantic search indices, and search utilities like grep.

Documentation fulfills a similar purpose for coding agents:

It acts as a guide that instructs the agent about what parts of the codebase relate to what functionality, helping the agent know where to look
It elucidates the design intent of code, often sharing details about why something was done, what alternatives were considered, and what was not done. These details are useful, but rarely in the code itself.
It acts as a materialized cache for reasoning - conveying higher order facts/concepts/principles that could be gleaned from the code, but which would require reading & reasoning over large sections of the codebase

Correspondingly, one might expect that there is a lot of value in having a system attempt to autonomously create, maintain, and update documentation for the agent - not as an artifact for humans, but as an implementation detail that improves agent reasoning.
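
A minimal sketch of the "materialized cache" framing, assuming a hypothetical `DocIndex` that stores a summary per file alongside a hash of the source it was derived from. The agent can then detect which summaries are stale and need regenerating. The class, field names, and sample paths are illustrative assumptions, not a description of how Replit or any other agent works.

```python
import hashlib
from dataclasses import dataclass, field


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class DocEntry:
    summary: str
    source_hash: str  # hash of the code the summary was written against


@dataclass
class DocIndex:
    entries: dict[str, DocEntry] = field(default_factory=dict)

    def record(self, path: str, source: str, summary: str) -> None:
        # Store the summary together with a fingerprint of the code it describes.
        self.entries[path] = DocEntry(summary, digest(source))

    def stale_paths(self, sources: dict[str, str]) -> list[str]:
        """Paths whose code changed, or was never documented, since the last update."""
        stale = []
        for path, source in sources.items():
            entry = self.entries.get(path)
            if entry is None or entry.source_hash != digest(source):
                stale.append(path)
        return sorted(stale)


index = DocIndex()
index.record("auth/session.py", "def login(): ...", "Handles login and session cookies.")

# Later: one file was edited and one new file appeared.
current_sources = {
    "auth/session.py": "def login(user): ...",
    "billing/invoice.py": "def total(): ...",
}
needs_update = index.stale_paths(current_sources)
```

Like any cache, the hard part is invalidation: the hash check is the cheap piece, while deciding what a good summary contains is where the agent quality work would be.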

Yet, this is not really done in the coding agent space at all today. While most mainstream coding agents can of course be used to produce documentation - e.g. you can ask Cursor or Claude Code to produce documents for X or Y - none of them autonomously create a lot of documentation as part of their operations to be used as a data structure in future runs.

Agents.md, which Replit.md is likely a riff on, is the closest to what I am describing, but I think it still leaves a lot to be desired.

The first challenge is that Agents.md is really something that a software engineer is meant to maintain. Anthropic has entire guides dedicated to setting up and managing your Agents.md, and Claude Code exposes the # command for the user to update Agents.md. In some sense - Agents.md is in between a fancy system prompt for Claude Code and an index structure that the coding agent system maintains.

The second challenge is that Agents.md is a relatively simple/naive implementation - it’s a single file that is supposed to globally describe everything in the codebase. You can find various guides online about how to nest hierarchical Agents.md files to try to fix this, but ultimately the onus is on the user to figure out and manage the right Agents.md structure for their codebases.
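
To illustrate what nesting implies, here is a sketch of how a tool might resolve which instruction files apply to a given source file: walk from the file's directory up to the repo root and collect every Agents.md along the way. The function name, the ordering convention (most general first), and the directory layout are assumptions for illustration; individual coding agents may resolve nested files differently.

```python
import tempfile
from pathlib import Path


def collect_agent_docs(repo_root: Path, target: Path, name: str = "Agents.md") -> list[Path]:
    """Return instruction files that apply to target, ordered from repo root downward."""
    repo_root = repo_root.resolve()
    current = target.resolve()
    if current.is_file():
        current = current.parent
    found = []
    while True:
        candidate = current / name
        if candidate.is_file():
            found.append(candidate)
        # Stop at the repo root (or the filesystem root, if target is outside the repo).
        if current == repo_root or current == current.parent:
            break
        current = current.parent
    return list(reversed(found))


# Build a tiny example repo in a temp directory.
root = Path(tempfile.mkdtemp())
(root / "services" / "billing").mkdir(parents=True)
(root / "Agents.md").write_text("Repo-wide rules")
(root / "services" / "billing" / "Agents.md").write_text("Billing rules")
target_file = root / "services" / "billing" / "invoice.py"
target_file.write_text("")

docs = collect_agent_docs(root, target_file)
```

The lookup itself is trivial. Deciding where the files should live, and what belongs in each, is the part that currently falls on the user.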

Advice on how to nest Agents.md files in a larger monorepo

Play this out a bit more, and what you end up with is essentially a human manually having to tune a new type of complex search index structure for their coding agents.

My argument is that we likely need to move to a world where this type of documentation is more autonomously constructed, managed, and updated by the coding agent itself. In other words, documentation becomes a *tool* that the agent can use as part of its operation, just like how the Replit agent decided to write down some documentation updates after I gave it a task.

Such an approach would allow for much richer optimizations of how documentation is structured & laid out, ultimately resulting in far improved coding agent quality.

Blending Human and AI-oriented documentation

I think this is also the right recipe/strategy for starting to blend human-oriented documentation and agent-oriented documentation. It is actually interesting to consider why Agents.md has emerged as this sort of divergent documentation corpus separate from the documentation many engineering teams are already writing.

Indeed, the same guide I listed above that discussed how to create a nested Agents.md structure essentially implicitly recommends against referencing external documentation.

External documentation can create problems for agents

There are certainly well-founded reasons for this - context rot is a real concern, human documentation is likely more verbose and less information dense (worse for agents), and most critically a lot of human documentation is very stale or out of date.

Yet, these are solvable concerns. A proper documentation tool would be able to optimize the right file structure, information hierarchy, and documentation updates to ensure that both humans and AI have effective, useful documentation. You can imagine a world where every codebase has the following:

Human-controlled Agents.md file(s) that allow a human to provide simple preferences or rules they want the AI to follow - mirroring what tools like Claude Code offer today
A complex hierarchy of “index structure” esque documentation that is primarily managed and maintained by a documentation tool the agent has access to, and viewed as an implementation detail for how to make coding agents more effective
A set of human-oriented documentation that is produced as a sort of materialized view on top of the code + the index structure documentation

My guess is that building an effective agentic system for creating, updating, and managing documentation is a sufficiently rich technical problem that it warrants specialized startups going after it, just like we see with things like memory or semantic search indices.

My intuition is also that this idea extends far beyond just coding agents. Any vertical agent that must regularly reason over a large corpus of data could likely benefit from this concept - e.g. companies like Trunk Tools in construction, Harvey in law, Sierra in customer support.

Rethinking the outer development loop in a world of AI-generated code

Davis Treybig — Tue, 04 Nov 2025 17:14:40 GMT

For the most part, the processes and systems that underpin the way we develop software have not changed for the last two decades. Amidst vast differences in the way we architect software (e.g. the move to the cloud) and the infrastructure we use to build software (e.g. React), the so-called “outer development loop” of software engineering still looks eerily similar to what it was when Github first emerged - e.g. git based version control, human-reviewed pull requests, unit & integration tests running on merge, staging environments & canary rollouts.

In some respect, this is to be expected - these systems and processes have more to do with the way humans collaborate to write software than they do with the nature of the software itself.

But, over the last few years, we have seen a new stakeholder emerge in the process of writing software - the AI coding agent. And so it is no surprise that many of these human-oriented processes are beginning to feel immense stress in the wake of a new type of AI “software engineer” that thinks, and acts, very differently from what we’re used to.

A good, and obvious, example of this is validating code. AI coding models and agents make it very easy to produce a lot of code. But, does that code work? Many engineering teams I talk to nowadays feel not only that they now spend more time reviewing code than writing code, but also that the code review process has become dramatically more complicated. Common problems include:

How do you more rapidly triage PRs to understand what is worth paying attention to vs not?
How do you know which PRs were particularly agent-driven, and might therefore require a closer look?
How do you more rapidly analyze a PR? Is there a way to more easily understand the semantic changes in a PR & the lines of code that map to each change?
How do you know which lines of code were written by an AI vs. not in a PR? Is there a way to see the reasoning or thinking that led to such a line, in that case?
How do you accelerate the code review workflow so you’re not now spending half your day looking at PRs?
Should you be using AI to review PRs? In what situations?

In a nutshell: we need to rethink the entire code review process in a world of AI coding agents. However, code review is just one piece of the way we develop software, and it increasingly feels like these sorts of fundamental questions can be asked about every step of the software development lifecycle.

We’ll need to be better at testing software. Previously “niche” testing techniques like load testing, fuzz testing, formal verification, and similar are likely going to become far more commonplace. And we’ll need to get smarter about where, and when, to run these tests.

We’ll need to be better at specification. Most tickets assigned in tools like Jira lack detail, and rely on the vast amounts of implicit context that the typical human software engineer on a team has. This won’t work for AI agents, and will prevent us from reliably evaluating agents relative to what they were supposed to do. Test-driven development will make a comeback, but in a different form and based more on extremely tightly specified requirements documents.

We’ll need to further explore the ways AI can be used to leverage humans in the software development lifecycle. AI code reviews are a good start, but we’ll also want AI systems to do versions of design reviews, product reviews, QA testing, and much more to lighten the load for human SWEs and PMs.

We’ll need to rethink the core infrastructure primitives the outer development loop is built upon. Git isn’t really designed to handle thousands of agents concurrently modifying the codebase, and improvements in areas like conflict resolution, concurrent edits, and treating each edit as a commit will be essential. We also need to more natively bake metadata about AI-changes vs. human changes into version control systems, such that we can treat AI-generated changes differently throughout the entire SDLC. For example, a company may want different policies for testing & deploying AI written PRs vs. human ones.
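
One lightweight way to approximate this today, sketched below, is to carry provenance in commit-message trailers and branch CI policy on it. The `Agent-Authored` trailer and the check names are hypothetical conventions invented for this example, not a git or Github standard, and the parser is a simplified stand-in for git's own trailer handling.

```python
def parse_trailers(message: str) -> dict[str, str]:
    """Parse 'Key: value' lines from the last paragraph of a commit message."""
    paragraphs = [p for p in message.strip().split("\n\n") if p.strip()]
    if len(paragraphs) < 2:
        return {}  # a lone subject line carries no trailer block
    trailers = {}
    for line in paragraphs[-1].splitlines():
        key, sep, value = line.partition(":")
        # Trailer keys are single tokens such as "Agent-Authored".
        if sep and key.strip() and " " not in key.strip():
            trailers[key.strip().lower()] = value.strip()
    return trailers


def required_checks(message: str) -> list[str]:
    """Pick verification steps based on who (or what) authored the change."""
    checks = ["unit-tests"]
    if parse_trailers(message).get("agent-authored", "").lower() == "true":
        # Hypothetical policy: agent-written changes get extra scrutiny.
        checks += ["fuzz-tests", "human-review"]
    return checks


ai_commit = (
    "Fix retry logic\n\n"
    "Back off exponentially on 429s.\n\n"
    "Agent-Authored: true\n"
    "Agent-Model: example-model"
)
human_commit = "Fix typo in README"

ai_checks = required_checks(ai_commit)
human_checks = required_checks(human_commit)
```

Trailers are easy to forge or omit, which is part of the argument for baking this metadata into the version control system itself rather than layering it on by convention.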

I’ve been interested in this theme of the downstream impact of agentic code development for the past few years, but had never found a team attempting to holistically rethink the outer development loop in a world of coding agents, as opposed to just building point solutions for software testing or verification.

That was, until we met Oliver and Ben, two founders who had such a profound vision on what the software development lifecycle of the future would look like. So today, we’re announcing our lead investment into Mesa’s seed round.

Mesa launches today with a platform for designing custom fleets of specialized code review agents - think the architectural review agent that catches architecture drift, or the database review agent that focuses on your core data model. This allows for much more precise, high signal-to-noise reviews that continuously adapt to you, rather than one-size-fits-all code reviews that mostly focus on pedantic details like a syntax error on line 35.

Despite being <6 months old, Mesa is already being used by a number of companies in lieu of more established AI code review tools. This is just the start, and you can shortly expect Mesa to launch a number of other capabilities, including a more substantial rethinking of the UX for doing human code reviews, and a suite of capabilities for test-driven development for agents.

Beyond functionality, what you’ll immediately notice about Mesa is the beautiful branding and design, evocative of a future where software engineers help architect & guide fleets of agents to build the future. This design-first lens is essential to successfully re-inventing how humans and AI agents will collaborate to develop software in the future, and a key part of why we were excited to partner with Oliver & Ben.

Try Mesa today for free at https://mesa.dev/

(Crosspost from this IE blog)

Pricing for Abundance in AI

Davis Treybig — Tue, 29 Jul 2025 19:05:45 GMT

TLDR - I think you can differentiate as an AI SaaS startup by being more expensive and embracing usage based pricing, rather than trying to hide it. This allows you to build reasoning-native features, avoids the need for dark patterns like model downgrading that have crept into seat-based products and alienate the most engaged users, and aligns more with the mental model that forward thinking buyers have.

Amp is unconstrained in token usage (and therefore cost). Our sole incentive is to make it valuable, not to match the cost of a subscription - Amp Code Owners Manual

Pricing of AI-native SaaS tools has been widely discussed over the past year or two - particularly the margin challenges that can be associated with AI SaaS products (Windsurf Gets Margin Called) and the idea that you can sell “outcomes”, not work, with AI (Sell the Work).

However, there is one overlooked angle of pricing AI products which I increasingly find interesting from a startup perspective: the idea that usage based pricing, particularly in a world of reasoning models, can act as a competitive advantage. In other words - I think more products should price for abundance.

Most AI SaaS products out there today try to map their pricing back to some kind of seat based model. Cursor charges $20/month, Greptile charges $30/month, Perplexity charges $20/month. While this is a simple and predictable pricing model for users, it comes at a cost - because of the massive variance in COGS across a user base when you offer a low fixed seat cost, you often end up needing to adopt dark patterns to control margins.

For example - you can find quite a few discussions online about how various AI code generation tools seem to “invisibly” downgrade the model they use behind the scenes if you use the product too much in a given billing period. This leads to a ton of user frustration as the product suddenly seems worse for a reason the user can’t quite explain.

Invisible or opaque downgrading or rate limiting of AI products is widespread

Claude Code’s recent rollout of invisible “usage limits” reflects the same problem - they are likely trying to maintain their per-seat-per-month business model, but as a result they have to curb extreme use because otherwise the top 5-10% of users will ruin the margins of the entire product.

This approach feels fundamentally flawed to me. First, it alienates your biggest power users - many of whom would likely pay much higher prices given the value they receive. Indeed, many of the more forward thinking executive teams I know are not aiming to reduce token usage, but increase it (see Shopify CTO quote below). Why would you design a pricing model counter to this?

Shopify CTO on why you should reward token spend - Shopify maintains a leaderboard of the TOP token spenders and treats it as a goal to climb the leaderboard

Second, this business model prevents you as a company from leaning in to what is special about foundation model-based SaaS tools - namely, that you can pay more for them to do more. This is especially true in the wake of reasoning models and foundation model systems - where quality & accuracy of results are directly correlated with how much money you are willing to spend on incremental compute.

My central argument is that you can actually build a differentiated product experience by embracing this concept, rather than trying to obfuscate it behind a seat-oriented pricing model.

For example:

What if I could set a policy for Cognition’s Devin agent to spend a lot to try to automatically fix sev0 or sev1 bug reports from customers, but use weaker models & less reasoning for standard customer feature requests?
What if I could set a policy for a code review agent to think a lot longer with a large reasoning model for code changes that touch the database, but spend far less to review simple client-side visual changes?
What if I could modulate how much effort I need to put into drafting a contract as a lawyer with an AI legal tool based on how important & complex of a contract it is vs. how non-important it is?

You can map out these sorts of “dynamic effort policies” for basically all key categories of AI SaaS tools. There are always cases where you are willing to spend more for higher quality work to get done faster - and architecturally, it is well known how to get foundation model systems to achieve this (e.g. more reasoning, repeated sampling, best of N parallelization, etc). It’s simply a question of whether your business model allows for these sorts of ideas to be built into the product. I think it is basically impossible to truly embrace reasoning-model product paradigms in a SaaS tool if you do not have usage-based pricing front and center.
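
A minimal sketch of what such a “dynamic effort policy” could look like as configuration. The task types, model tier names, token budgets, and prices below are all made-up assumptions for illustration; the point is only that effort (model size, reasoning budget, number of samples) becomes an explicit, user-settable knob with a predictable cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EffortPolicy:
    model: str             # hypothetical model tier name
    reasoning_tokens: int  # reasoning budget per sample
    samples: int           # best-of-N sample count


# Hypothetical policy table a customer might configure.
POLICIES = {
    "sev0": EffortPolicy("large-reasoning", 32_000, 8),
    "sev1": EffortPolicy("large-reasoning", 16_000, 4),
    "feature_request": EffortPolicy("small-fast", 2_000, 1),
}
DEFAULT_POLICY = EffortPolicy("small-fast", 1_000, 1)


def pick_policy(task_type: str) -> EffortPolicy:
    return POLICIES.get(task_type, DEFAULT_POLICY)


def estimated_cost_usd(policy: EffortPolicy, price_per_1k_tokens: dict[str, float]) -> float:
    # Rough upper bound: every sample spends its full reasoning budget.
    return policy.samples * policy.reasoning_tokens / 1000 * price_per_1k_tokens[policy.model]


# Illustrative prices only.
PRICES = {"large-reasoning": 0.06, "small-fast": 0.002}

sev0_cost = estimated_cost_usd(pick_policy("sev0"), PRICES)
feature_cost = estimated_cost_usd(pick_policy("feature_request"), PRICES)
```

Under a flat seat price, the vendor has every incentive to hide or cap the expensive rows of this table. Under usage based pricing, the same table becomes a product feature.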

If executed correctly, I think this approach may actually allow you to invert the phenomenon you see with Claude Code - where they alienate power users to maintain a simple, low-cost seat based model for non-power users. In gaming, it is long established that extreme power users subsidize everyone else - e.g. in mobile gaming the top 1% of users generate ~60% of all revenue. It is becoming increasingly clear that in many of these AI categories you see a similar distribution of user behavior, with top users using the product 1000x as much as the regular user, if not more, but the business models for the most part do not accommodate this. If you instead embraced this idea, could the power users of a tool like Claude Code actually subsidize the low end?

Taking a step back - never before in the history of SaaS was it possible for you to pay more for a product to do more. Products like Docusign, Salesforce, Hubspot, and similar were, for the most part, fixed functionality that could not modulate quality no matter how much you were willing to spend on compute for a given task. Even traditional ML startups could not really achieve this - e.g. a demand forecasting model is going to keep producing the same results for the same features regardless of a customer’s willingness to pay more for forecasting X than Y.

Correspondingly, I think foundation model SaaS tools should lean into the fact that they can offer differential quality based on compute - as this is part of what allows you to counterposition and build a better product than existed before. Note that this is not necessarily outcome based pricing - which is nice when you can do it but is realistically unfeasible in most domains where there is not a clear, obvious outcome to charge against. Rather, it is “effort based” pricing.

There are, of course, numerous downsides with this type of pricing strategy. It is often harder to get users comfortable with usage based pricing. How much will it actually cost me? How do I explain it to my CFO? How do I ensure it doesn’t go completely out of bounds? But, on balance, I think it is better to address these problems head-on (e.g. support usage caps, limits, guardrails, alerts, real-time spend dashboards, etc) rather than deal with the impedance mismatch that occurs when you build an AI-Native SaaS with a seat oriented model.
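
As a rough sketch of the guardrail side, here is a hypothetical `SpendGuard` that enforces a hard monthly cap and raises an alert when spend first crosses a configurable fraction of it. The class, thresholds, and dollar amounts are assumptions for illustration, not any vendor's billing API.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a charge would push spend past the hard cap."""


@dataclass
class SpendGuard:
    monthly_cap_usd: float
    alert_fraction: float = 0.8
    spent_usd: float = 0.0
    alerts: list[str] = field(default_factory=list)

    def charge(self, amount_usd: float) -> None:
        # Hard cap: reject the charge before any work is done.
        if self.spent_usd + amount_usd > self.monthly_cap_usd:
            raise BudgetExceeded(f"cap of ${self.monthly_cap_usd:.2f} would be exceeded")
        before = self.spent_usd
        self.spent_usd += amount_usd
        # Soft alert: fire once, when spend first crosses the threshold.
        threshold = self.monthly_cap_usd * self.alert_fraction
        if before < threshold <= self.spent_usd:
            self.alerts.append(f"{self.alert_fraction:.0%} of monthly cap used")


guard = SpendGuard(monthly_cap_usd=100.0)
guard.charge(50.0)
guard.charge(35.0)  # crosses the alert threshold
try:
    guard.charge(20.0)  # would exceed the cap, so it is rejected
    blocked = False
except BudgetExceeded:
    blocked = True
```

A real implementation would also need per-team budgets and estimates before a task starts, but even this much answers the CFO question of how spend stays in bounds.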

You are starting to see some startups embrace this. I began the article with a quote from Amp Code, which has embraced this pricing philosophy the most in the coding agents space. I suspect that this pricing philosophy will rapidly become more commonplace in applied AI startups. The startups which adopt it will be able to innovate on product more and take dramatically more advantage of recent model capabilities in reasoning, because they will not need to deeply consider & model COGS for every single product decision they make.

State of Foundation Models - 2025

Davis Treybig — Wed, 25 Jun 2025 13:02:46 GMT

I recently put together an extensive, 100+ slide presentation covering the state of the foundation model market in 2025. You can find it at foundationmodelreport.ai.

I also put together a live presentation of a condensed version of the slide deck here, if you prefer listening vs. reading.

2025 State of Foundation Models - Youtube

You can think of this as a spiritual successor to my 2023 “Foundation Model Primer”. I aim to recreate a lot of what people liked about that presentation - particularly how it more holistically covered the space end-to-end.

Some of my favorite tidbits from the deck this year:

Generative AI has gone mainstream - 1 in 8 workers worldwide now uses AI every month, with 90% of that growth happening in just the last 6 months. AI-native applications are now well into the billions of annual run rate
The pace of research progress is wild — the set of tasks an AI model can reliably do is doubling every ~7 months, the cost for a given unit of intelligence is going down >100x year over year
Training costs balloon, but OSS convergence continues - A typical model now costs >$300M to train, but also only stays a top model for about 3 weeks
Substantial progress in “newer” modalities - We are at a ChatGPT moment for video models. Science models in areas like protein folding & materials are starting to get interesting. Robotics models, world models, & voice-to-voice models are improving rapidly.
The venture frenzy is intense - 10% of all venture dollars in 2024 went to foundation model companies, and 50% of venture dollars in 2025 have gone to AI startups
Reasoning models are the new scaling frontier - As a result, tons of focus right now on verifier models, reward models, and reinforcement learning. Are we going to see quality generalist reward models?
Agents are starting to work, but design patterns & architectures are still so early - Model pickers are like picking your web video codec in 1998. “Systems” paradigms like sequential sampling or fan-out/fan-in are under-appreciated and will become more mainstream.
MCP is exciting, but the agent-computer interface for tool use is underrated - Many good startups I know forgo MCP for purpose-built integrations as a result.
Good data curation, retrieval, and evals are still under-appreciated - For example, a well-architected RAG system is 10-100x better than a long-context only model in latency, quality, and cost for even straightforward use cases.

And honestly - that’s just scratching the surface! Would love to hear what you think — and if you find it valuable, feel free to share it with others building in the space.

Synthetic Product Feedback

Davis Treybig — Tue, 24 Jun 2025 21:50:00 GMT

In recent years, many startups have focused on using large language models (LLMs) to automate the functional testing of code — checking whether it works as intended and is bug free. Companies like Greptile and Graphite tackle code review; others like Ranger focus on QA automation.

However, there is a large class of software testing that does not fall under this functional testing paradigm, but rather is focused on the qualitative effectiveness of the software. Is the new feature understandable by users? Does it improve conversion? Is it easy to use?

These sorts of questions are historically answered by an entirely different set of processes, such as:

User Experience Testing - Have people try to use the product, often with different structured goals in mind, and get their reaction
A/B Testing - Roll out the new version of the product to a % of users, and compare those users against a baseline using key product metrics
Product Instrumentation - Measure how people are using the new product/feature with descriptive analytics such as conversion rates or user funnels to see if your new code drove the right change in behavior
User Interviews - Show users mocks or a demo of your new feature, and gauge their reaction. Is this compelling? Is this how you would want it to work?
Surveys - Collect aggregate data from different public pools of people in your ICP on something new you have built or are launching
Design review - A designer goes through the workflow and critiques it from a usability and user experience perspective

What is interesting is that LLMs and agents are clearly capable of automating and/or producing synthetic versions of almost all of these forms of qualitative feedback, yet almost no startups have focused on it.

What if LLMs made it possible to collect usable, directional qualitative feedback on every pull request, with no marginal cost? I will give a few examples of what this might look like:

Synthetic A/B Testing

Imagine if, on every pull request, you spun up hundreds or thousands of preview environments - half with the previous version of the code (control), half with the new version of the code (variant). In each preview environment, run a computer use agent that is tasked with performing the key product goal associated with the given code change. Prompt all of these computer use agents to mirror the distribution of user personas in your user base.

If you did this, you would in theory have run a synthetic A/B test, and you would get a synthetic measure of whether or not your code change improved the user flow.
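To make the mechanics concrete, here is a minimal sketch of what the orchestration might look like. Everything in it is an assumption for illustration: `run_agent` is a hypothetical stand-in for launching a preview environment and a computer use agent, and the persona names, completion rates, and effect size are invented numbers rather than results from any real system.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    name: str
    weight: float  # share of the user base this persona represents


def run_agent(persona: Persona, version: str, rng: random.Random) -> bool:
    """Hypothetical stand-in for one computer use agent session.

    A real system would start a preview environment running `version`,
    prompt an agent to behave like `persona`, and report whether it
    completed the target task. Here we draw from made-up completion rates.
    """
    base = {"power_user": 0.80, "new_user": 0.40}[persona.name]
    bump = 0.10 if version == "variant" else 0.0  # invented effect size
    return rng.random() < base + bump


def synthetic_ab_test(personas, runs_per_arm: int, seed: int = 0) -> dict:
    rng = random.Random(seed)
    weights = [p.weight for p in personas]
    rates = {}
    for version in ("control", "variant"):
        # Sample personas in proportion to their share of the user base.
        sampled = rng.choices(personas, weights=weights, k=runs_per_arm)
        successes = sum(run_agent(p, version, rng) for p in sampled)
        rates[version] = successes / runs_per_arm
    rates["lift"] = rates["variant"] - rates["control"]
    return rates


if __name__ == "__main__":
    personas = [Persona("power_user", 0.3), Persona("new_user", 0.7)]
    print(synthetic_ab_test(personas, runs_per_arm=2000))
```

The interesting engineering lives inside the part this sketch stubs out: how faithfully the persona prompts and weights mirror your real user base.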

This is not just conceptual - I have now actually seen multiple people experiment with this idea to some success. Agent A/B is a recent research paper that explores this direction and shows that this type of synthetic experiment can in fact predict real world user behavior on eCommerce sites. A friend of mine, Greg Dale, ran some simulations of this sort of idea and also found that synthetic variations of experiments seem predictive in cases where we have ground truth real world experiments. I have included one of his results below.

Synthetic A/B test benchmarking. The three right sets of columns are all LLM based synthetic experiments measuring the same thing as the “benchmark” real experiment. Note that all synthetic experiments predict the estimated lift in the control vs. treatment fairly accurately.

What is particularly interesting about synthetic A/B testing is that it solves what I would consider the biggest challenge in the A/B testing market - sample size. Historically, only the largest marketplace and prosumer/consumer companies have been able to regularly A/B test ideas because without sufficient sample size, it was impossible to draw statistically significant conclusions. But, such a synthetic approach would be applicable to everyone.
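A rough calculation shows why sample size is such a barrier. The sketch below uses the standard normal-approximation formula for comparing two conversion rates; the 5% significance level, 80% power, and the example conversion rates are my own illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def samples_per_arm(p_control: float, p_variant: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in each arm of a two-sided test
    comparing two conversion rates (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Detecting a move from 10% to 11% conversion.
    print(samples_per_arm(0.10, 0.11))
```

Under these assumptions, detecting a one-point improvement on a 10% conversion rate takes on the order of 15,000 users per arm, which is more traffic than many B2B products see on a given workflow in a year.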

While there are numerous gotchas here - particularly how to evaluate and benchmark synthetic experiments, as well as how to properly mirror the user distribution especially in more “niche” product workflows - the technique is certainly intriguing.

Synthetic User Research and UXR Testing

Imagine if, on every major pull request, two things happened:

An agent analyzed Gong, Zendesk, and similar data sources to identify users who had given the most feedback related to that code change, and then proceeded to email them saying something like “Hey, we are working on improving X thing which we noticed you had complained about. We’d love to preview the change to you and see if it addressed your feedback”
A fleet of browser agents prompted to act like world-class UX designers and PMs were spun up and tasked to use the feature and provide qualitative feedback. This feedback would then get summarized, and you could also analyze each specific agent’s stream of thought as it used the feature similar to how you can see the stream of thought for existing browser agents trying to complete a task
Existing browser agent products, like Browserbase, already provide a monologue as they try to achieve a task. It is not a long shot to consider them being prompted to specifically identify or highlight UX or product issues as they do so.

This is the sort of work that might take a PM or UXR days to do traditionally - and whose cost is so high that it is typically reserved only for the biggest feature releases. What if it existed by default for every single code change?

The Bigger Picture

These two examples highlight what I would consider to be the two broad classes of techniques worth considering here:

LLMs can be used to personify or impersonate humans, and that can be a source of synthetic human feedback
LLMs can be used to automate the execution of collecting qualitative human feedback - such as reaching out to people and conducting interviews or user studies

In conjunction, these two techniques would allow you to automate a version of every single qualitative testing modality I described at the start of this blog.

Obviously, the quality of “synthetic” feedback will likely be worse than if it were real human feedback, or a human were actually collecting it. But this misses what is interesting about these LLM-based approaches.

Doing any of these qualitative feedback methods well takes a lot of time & effort, and in many cases is still not feasible. 90% of companies don’t have a large enough user base to effectively run A/B tests consistently. Most companies barely even have a single full time user researcher on staff to run UXR studies. Even companies with the right resources only have the capacity to do these sorts of things on the biggest of changes - a given enterprise PM only has so much time to “preview” new features to their customers, and can likely only focus on the biggest customers.

But, if the cost & effort to run these sorts of analyses falls to zero such that they are “default available” on every pull request, then even if the predictiveness is only 50-75% of what it would be doing things the traditional way, that tradeoff is likely worth it. You can, of course, still do the traditional qualitative feedback workflows in key moments or for big decisions where you want to be certain.

I strongly suspect a startup could be built that essentially offers “synthetic feedback by default” on every pull request. There is now quite a bit of prior art that these methods work better than you might expect - e.g. market research startups like Evidenza which use LLMs to mock brand feedback see correlation coefficients of 90%+ when backtesting against historical surveys that brands have run. When you combine this with how capable computer use models are becoming, I see almost no reason why it shouldn’t be possible to collect extremely high fidelity synthetic product usage feedback.

Such a startup should focus on selling primarily to companies who have never had access to these sorts of techniques. E.g. sell synthetic A/B testing to the companies who have never been able to run an A/B test in their life. You’ve suddenly given them a “superpower”, and their comparison point is doing nothing, rather than doing a real A/B test.

Key challenges in this space, and therefore areas to build technical depth, include:

Data integration - How do you get the right data to effectively model the company’s userbase and user distribution? This will be the determining factor in how good vs. bad the synthetic techniques are
Trust building & evaluation - How do you know when to trust synthetic product feedback vs. not? When does it hallucinate or get things wrong? If the user has no way to back-test the results, how do they learn to trust this?
When to do the traditional stuff - I think a key trait of a successful product here would be informing you of when a synthetic approach is unlikely to be predictive, and helping you do things the traditional way when it makes sense to
Modalities and Sequencing - Which analysis modalities should one focus on vs. not? Which are easier to get right today vs. in the future? For each modality, what is the critical problem to solve to make it work well?
Cost - Some of the techniques I describe would be very, very expensive to do today. While model cost is going down exponentially, you will need to be clever about how to handle this in the short term - particularly because I think it is essential that this can just be applied by default to every code change or PR, rather than be something someone has to think about
Taste - How do you get models to accurately reflect good vs. bad UX design? This is easier said than done. A good analogy is the vibe coding space, where so few products actually produce tasteful outputs from a design and usability perspective (e.g. see MagicPatterns), and most produce AI slop.

I am extremely interested in talking to anyone who wants to build something in this space - davis @ innovationendeavors.com

The Visual IDE

Davis Treybig — Mon, 14 Apr 2025 17:27:37 GMT

Over the last few months, I have spoken to numerous designers & product managers about how their job is changing in the face of AI coding tools. While there are numerous downstream implications of LLM-based coding assistants and agents, there is one predominant theme that has been front and center in almost every conversation - product managers and designers are increasingly writing software, not just specs/PRDs/mocks or other peripheral descriptions of software.

Basically every good PM & designer I know is now using tools like Windsurf, Bolt, Lovable, and similar to prototype application changes, and in some cases, even submit PRs on their own. This is remarkable given that many of these people are not technical, didn’t study computer science, and had never really written a line of code in their lives prior to LLMs.

Some direct quotes I have heard recently in discussions:

“Honestly, I don’t see very much value in using Figma anymore. I am better off just forking the codebase and using Cursor to create living prototypes. This is faster, avoids all the skew/drift of having separate mockups, and greatly simplifies hand-off to the engineers I work with” - Former design lead at Hashicorp

“I was recently trying to convince our team to work on a major new initiative, but I couldn’t get buy-in. I finally decided to just prototype a mini version of the idea over the weekend so I could directly show people what I meant. It was only when I showed my working prototype that I suddenly got everyone onboard. Before Cursor, I never would have had the capacity to do this, and we likely never would have pursued this direction” - Staff product manager at Duolingo

“So we just launched this new feature for visualizing certain data and we needed to pick colors for it - and it is sort of a complex feature in terms of colors because there might be 20+ lines on a chart, and all these colors must look good together and be high contrast against background etc. The designer sent me an initial set of colors and it didn’t work, and then he sent another set and those didn’t work. So what he then did instead was use Cursor to build a version of this application which included a color picker that I could flick through and see what many variations would look like directly. And seeing him do that, I was like, okay - This is going to change everything. Right? Like, designers can build applications” - Senior engineering lead at Cloudflare

If you look online, you increasingly see this sentiment everywhere - Gokul Rajaram recently shared an example of a VP of product at a public company discussing how he is building a new feature or concept on his own every few days.

The key thing here is agency - tools like Cursor empower product managers and designers to be 100x more independent in ideation & driving impact because they no longer constantly need to wait for engineers to get things done. Every designer has 100+ UX papercuts they wish they could fix but which never get prioritized. Every PM has 5 ideas they wish they could concept to show customers to get feedback on, and has 100 minor feature requests and bug requests from customers they wish they could address. AI coding tools suddenly allow these things to be fixed outside of begging an engineer to include it in the next sprint.

Extreme behavior

In my opinion, this is a classic example of “extreme behavior” - where people are going out of their way to do something unnatural because the benefits are so substantial. Specifically, I do not think that code-first IDEs are actually the optimal interface for PMs & Designers to prototype or make small changes to the product.

PMs/Designers are not software engineers and, for the most part, are not deeply familiar with the numerous concepts implicitly embedded in tools like Cursor, such as:

Git semantics - cloning, branching, etc
Interacting with a terminal and viewing console log outputs
Reading & writing code!

While you can, of course, ignore these things and “vibe code” your way through figuring things out until they work, this has obvious pitfalls. Tools like Bolt & Lovable partially solve this problem by abstracting many more “software engineering” concerns away from the user, but are ultimately very oriented towards independent, “net new” apps rather than maintaining and updating a large, established codebase. They are also still quite code-oriented in terms of what they show the user.

So, while the behavior of some software engineering shifting right to PMs & Designers is here to stay, I think it is unlikely that the optimal end state of this is everyone using a software-engineer oriented IDE.

What would a “development environment” for the non-SWE persona look like?

If you thought about solving this workflow for the non-software-engineering persona from first principles, what might it look like? These are some of the principles that I think would matter a lot:

A more visual editing experience oriented around a staging preview of the app

IDEs treat code as the first class citizen and previews of the application as a secondary workflow. I think this should be flipped on its head for non-engineers - the “thing” you are working on should probably look like a live staging preview of the app, and code should be secondary or tertiary.

You see hints of this direction in products like Replit, which increasingly leans towards just showing you a natural language conversation with the agent plus the web app preview, and has increasingly deprioritized code & the terminal - though I think I would take the idea a lot farther than this if orienting towards PMs & Designers.

Replit

You can then imagine a range of editing workflows that would be more visual & language oriented than what you see in coding IDEs. For example:

Click through a workflow of the app and then describe how you want it to change in natural language
Use an “inspect element” like flow to click or select certain parts of the app, and then describe how you want it to change
Drag in a spec, PRD, Jira ticket, or similar and draw on top of the application like a canvas - highlighting the portion of the application the change should apply to

As you make such changes, you can imagine an AI coding assistant that progressively helps you implement this in the staging environment at increasing levels of sophistication - starting with DOM changes to confirm basic visual patterns, then HTML/CSS changes, then backend changes - corroborating the intent and purpose of your change with you as it goes.

This mirrors the experience of a designer sitting next to an engineer in an office - saying what they want and providing live feedback on the application as the engineer progressively implements it.

You’ll note that this is de facto how non-engineers who “vibe code” in tools like Cursor work today - they basically just keep interacting with the localhost preview and telling the AI what to change until it works, completely ignoring the code being written for the most part.

A focus on simple, visual changes, not backend changes

Part of the power of a product like this would come from scoping down the breadth of what you aim to enable. Obviously, engineers will still be required for complex changes or major features in complex codebases - it is the small, mostly visual, mostly front-end changes that make sense to offload to PMs & Designers.

I would focus the entire product experience on these sorts of changes - perhaps even starting by only allowing changes to client-side HTML/CSS/JavaScript rather than messing with the backend at all.

This would allow you to simplify a lot of things - from the way the AI coding agents work under the hood to the user experience and workflows you need to support. Software IDEs are designed around enabling someone to change anything in a codebase - but this, in my opinion, is a non-goal for a product like this and is part of how you pare back complexity to empower the non-software-engineer.

Prototyping as a wedge, but not the end state

A lot of the way tools like Cursor and similar are used today by product managers and designers is prototyping. By this I specifically mean - creating sample versions of the application which are used to demonstrate or communicate a concept, but which engineers don’t use as a baseline for code changes.

While I think prototyping is likely an important wedge use case to focus on as a startup in this space, in my opinion the major opportunity lies in going beyond that and finding the situations where the end deliverable is actually a pull request.

Consider the workflow I mentioned earlier - where the user interacts with the staging preview and iteratively refines it in collaboration with an AI coding agent. Once the application reflects the intent of the product manager or designer (e.g. the frontend/visuals/UX look right), the system can then spend many hours thinking about how to turn this change into a proper pull request, inclusive of additional changes that may be needed across the database, the backend, or other adjacent components. This can ultimately turn into a PR that is delivered to an engineer, enriched by all the specification the PM/designer did when iterating on the feature.

I think that going all the way to the pull request is how you capture real value as a startup in this space - as you are no longer just a prototyping tool but a key part of the software development lifecycle. Indeed, I think the natural adoption curve for a tool in this space would be:

Offer a free prototyping tool which can be adopted by any PM/designer on their own and requires no integration into the codebase or other systems. I would probably have this work mostly around screenshots - where you can screenshot pieces of the app you want to prototype changes to, the system creates a non-functional prototype of that UI (since it doesn’t actually know anything about your full codebase or application), but you can update it visually as you please to communicate ideas in much more expressive, interactive ways. This is sort of like a persona-specific Bolt that is designed to “branch” off of existing applications
The paid product would integrate into and fully understand your codebase - and would allow for the creation of truly functional prototypes built on staging versions of your application that can be turned into PRs

Of course, the quality bar for creating a pull request is much higher - which leads me to my next point.

Treating the hand-off to engineers as a first class citizen

One of the most significant challenges, and opportunities, in this space is tackling the handoff to engineers. An obvious failure case here is that all the PMs and Designers at a company start thinking they can make all sorts of changes on their own and this all gets sent as PRs to engineers, causing the engineers to go crazy if many of these PRs are bad, low quality, or require a lot of changes to be production-grade.

I think this is solvable and indeed is part of where you would build a large set of features that do not need to exist in a tool like Cursor/Windsurf. For example:

Allow engineers to set policy for the types of changes that can or cannot be turned into automatic PRs (e.g. “If it touches the database at all, it can’t be turned into a PR”)
If the AI agent determines after thinking for a while that a given change is overly complex or unlikely to be implemented in an automated fashion correctly - it can simply go back to the PM/Designer and say “Hey I think this is too complex for me to handle, let’s just share the prototype with the engineers but I shouldn’t actually try to implement this”
The “feedback” flow would need to be seriously thought through. This idea breaks many of the assumptions in the standard pull-request workflow, where engineer one submits a PR and engineers 2/3/4 provide feedback which is then responded to by engineer one. If the reviewing engineer has feedback on the PR, what happens? Does this get sent to the PM/Designer? If so, how do they address it? If not, who is responsible for implementing those last-mile changes?
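As an illustration of the policy idea, a guardrail like “if it touches the database, it can’t become a PR” could start as a simple check over the files a change touches. This is a minimal sketch under my own assumptions: `PrPolicy`, `can_auto_pr`, the glob patterns, and the file-count limit are all hypothetical, and a real product would likely need semantic checks beyond path matching.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from fnmatch import fnmatch


@dataclass
class PrPolicy:
    # Illustrative glob patterns; each engineering team would define its own.
    blocked_paths: list[str] = field(
        default_factory=lambda: ["db/*", "migrations/*", "*.sql"]
    )
    max_files_changed: int = 10


def can_auto_pr(changed_files: list[str], policy: PrPolicy) -> tuple[bool, str]:
    """Return (allowed, reason) for turning a prototype into an automatic PR."""
    if len(changed_files) > policy.max_files_changed:
        return False, "too many files changed"
    for path in changed_files:
        for pattern in policy.blocked_paths:
            if fnmatch(path, pattern):
                return False, f"{path} matches blocked pattern {pattern}"
    return True, "ok"


if __name__ == "__main__":
    policy = PrPolicy()
    print(can_auto_pr(["web/src/Button.tsx"], policy))
    print(can_auto_pr(["web/src/Button.tsx", "migrations/0042_add_col.py"], policy))
```

When the check fails, the returned reason is what the tool could show the PM/Designer when it falls back to sharing the prototype instead of opening a PR.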

I actually think getting this piece right is the hardest, and most critical part of building a product like this. Done correctly, engineers will also end up pulling you into their company, because they will be tired of their PMs/Designers vibe coding in Cursor and submitting lots of slop to them without guardrails.

Concluding

I am fairly convinced there is a massive startup opportunity in this space. I think design and product management are changing more right now than they have in the past 20 years thanks to AI-assisted coding. As a result, I think there is a unique opportunity to create an entirely new type of “integrated non-developer environment” which is oriented toward visuals and app interaction, not code.

I think it is exceptionally unlikely that this is done by existing incumbents (e.g. Figma, Productboard) because it obviates the core mental model of such products - the primary reason people worked in these lower fidelity mediums is because the burden of writing code was so high and the skillset so specialized.

While I don’t think that vector graphic-based designs or PRDs will go away, I think there are a lot of situations where it will now be much better to simply build and work on the real thing. You see this empirically today with all the designers/PMs using AI codegen tools instead of their traditional workflows.

If you are interested in building something in this space, please reach out - davis @ innovationendeavors.com

LLMs as compilers

Davis Treybig — Thu, 03 Apr 2025 23:34:24 GMT

Recently, there have been a few interesting papers that essentially treat LLMs as compilers - most notably KernelBench and Sakana’s AI CUDA Engineer (though note that the Sakana paper had some issues around the model reward hacking).

Each paper used an LLM to take higher-level PyTorch code and write optimized GPU kernels for this code. Roughly, the idea in each case is as follows:

Take the high level PyTorch code
Have the LLM write a sample CUDA kernel implementation
Test whether the CUDA code compiles, whether the CUDA code appears to be correct (via sampling input output pairs and comparing to the PyTorch code), and whether the CUDA code is faster than the PyTorch code via profiling the execution
Have a feedback loop between the testing step and the LLM - iteratively refining the CUDA kernel until you get a result you like (or you hit a pre-determined stop threshold)
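The loop described above can be sketched in a few lines. This is my own simplification, not the implementation from either paper: `propose` stands in for the LLM call and `evaluate` for the compile, correctness, and profiling harness, and both are hypothetical callables you would have to supply.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    compiles: bool
    correct: bool
    speedup: float  # reference runtime divided by candidate runtime
    feedback: str   # compiler errors or profiler notes fed back to the model


def optimize_kernel(
    reference_code: str,
    propose: Callable[[str, str], str],  # hypothetical LLM call: (reference, feedback) -> kernel
    evaluate: Callable[[str], Verdict],  # hypothetical compile/test/profile harness
    max_rounds: int = 8,
    target_speedup: float = 1.5,
) -> Optional[tuple[str, float]]:
    """Return the fastest verified kernel and its speedup, or None if none passed."""
    best: Optional[tuple[str, float]] = None
    feedback = ""
    for _ in range(max_rounds):
        candidate = propose(reference_code, feedback)
        verdict = evaluate(candidate)
        feedback = verdict.feedback  # the next proposal sees what went wrong
        if not (verdict.compiles and verdict.correct):
            continue  # discard invalid candidates; only verified ones count
        if best is None or verdict.speedup > best[1]:
            best = (candidate, verdict.speedup)
        if verdict.speedup >= target_speedup:
            break  # pre-determined stop threshold reached
    return best
```

Because invalid candidates are simply dropped, the quality of the result depends entirely on how trustworthy `evaluate` is, which is where reward hacking can creep in.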

The results of this are, in my opinion, really interesting. Specifically, let’s consider the KernelBench paper. While many of their results were not correct and/or did not speed up the kernel relative to torch.compile(), they also showed various situations that led to dramatically faster, correct kernels. For example, their diagonal matrix multiplication kernel was 12x faster than torch.compile(), and another kernel that performs matrix multiplication, division, summation, and then scaling was 3x faster than torch.compile().

What’s cool about this, though, is that because writing kernels is automatically measurable via verification loops, the negative cases don’t really matter relative to the positive cases. You can simply search until you find good results that show speedups, use those, and throw away everything else. This is a great example of the sorts of “design + verify” style systems that LLMs are so, so well suited for.

This is an interesting paradigm because you are, in essence, replacing a compiler with a large language model based system - and each of these has profoundly different tradeoffs. Compilers are deterministic so you are guaranteed correctness, and they also tend to generalize very well, but they often leave juice to be squeezed relative to hand-crafted implementations. This LLM-based process is non-deterministic and often produces invalid results, but can also produce extremely specialized custom kernels for specific functions you want to optimize against.

This trade-off feels like it may especially be worth it in the area of ML systems - where in many cases the goal is to optimize very specific types of operations to the nth degree rather than support general purpose code optimization in a robust way. Furthermore, sacrificing correctness for speed/performance is not a new concept in machine learning systems - quantization paradigms do exactly the same thing.

ML has slowly been creeping into systems optimization for a while - and I suspect it can be taken much further with LLMs in many cases now. It wouldn’t surprise me if these ideas could be applied in many other areas of systems as well - e.g. distributed systems, databases, etc.

ReCaptcha for reasoning

Davis Treybig — Thu, 20 Mar 2025 19:55:12 GMT

The demand for high quality data from AI labs and AI startups right now is unprecedented. Leading AI labs will now pay $2k+ for sophisticated reasoning traces in advanced domains (e.g. legal, healthcare, etc), and companies like Surge & Mercor have grown revenue at unprecedented scale serving this market.

For the most part, data startups that I have seen attempting to serve this market have focused on finding efficient ways to procure labor and then assign them annotation tasks. e.g. Scale will go to OpenAI and say “we have a huge pool of neurologists who can do annotation tasks for you”. This, of course, has worked well for many of these companies given the demand for these sorts of data annotation tasks, but it also means that many of them look more like labor arbitrage services businesses than true software businesses.

This plays out both in margin profile as well as in repeatability. Many of these labeling companies end up having to constantly update/adapt their labor pools to match the current annotation-task-du-jour - e.g. Scale invested huge resources into AV labeling in the early days, and then had to re-do many aspects of labor sourcing, labor evaluation, and internal tooling to adapt to RLHF annotation tasks for foundation model companies.

I suspect there is opportunity for more creative business models in this space, particularly those that play with gamification or clever incentive structures to create high quality data as an implicit output. These would look more like scalable, software-driven approaches that passively generate annotation data—often as a byproduct of something users already want to do.

ReCaptcha is a good historical illustration of what this might look like. ReCaptcha is the anti-bot measure you have likely seen logging into many online accounts before, where it has you do things like “click the squares with traffic lights”.

ReCaptcha is offered as a free service because there is a secondary benefit that stems from all these users clicking squares in the image - it produces free image segmentation & annotation data for Google.

Other examples of this concept that I find interesting include FoldIt, which turned protein folding annotation into something that essentially felt like a game and as a result crowdsourced one of the largest protein folding datasets, and Kaggle, which created a competition-dynamic around modeling tasks.

I strongly suspect that these sorts of ideas of offering a service for free as a way to indirectly generate high volumes of annotation data could be applied to some of the very-high-value foundation model annotation tasks that exist today, such as reasoning trace generation or RLHF data.

ChatBot Arena is one such example of this. If you’re not familiar, ChatBot Arena essentially turns comparative evaluation of language models into a simple game, letting users compare responses to prompts and rank which model was better or worse. This not only allows for the creation of an interesting leaderboard much more aligned with user preferences than many standard eval benchmarks, but also essentially creates a high quality RLHF dataset.
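To show how one stream of comparison votes can yield both outputs, here is a minimal sketch. It uses a textbook Elo update as an illustrative ranking method, which is an assumption on my part and not necessarily how the real leaderboard is computed; the `Vote` record and the starting rating of 1000 are likewise hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Vote:
    prompt: str
    model_a: str
    response_a: str
    model_b: str
    response_b: str
    winner: str  # "a" or "b"


def process_votes(votes: list[Vote], k: float = 32.0):
    """Turn pairwise votes into (ratings, preference_pairs)."""
    ratings: dict[str, float] = defaultdict(lambda: 1000.0)
    preference_pairs = []
    for v in votes:
        if v.winner == "a":
            win, lose = v.model_a, v.model_b
            chosen, rejected = v.response_a, v.response_b
        else:
            win, lose = v.model_b, v.model_a
            chosen, rejected = v.response_b, v.response_a
        # Elo: probability the eventual winner was expected to win.
        expected = 1.0 / (1.0 + 10 ** ((ratings[lose] - ratings[win]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[win] += delta
        ratings[lose] -= delta
        # The same vote doubles as a chosen/rejected preference example.
        preference_pairs.append(
            {"prompt": v.prompt, "chosen": chosen, "rejected": rejected}
        )
    return dict(ratings), preference_pairs
```

The leaderboard is the user-facing product, while `preference_pairs` is the byproduct dataset, which is exactly the ReCaptcha-style dynamic.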

I think there is a lot of room for creative ideas that take this concept much further. For example:

Could you offer free tools which integrate into the systems of engagement that certain key knowledge workers use (e.g. lawyers, doctors), in return for capturing the usage data from that worker? There are huge amounts of latent thinking & reasoning data implicitly generated in systems like EHRs, CRMs, etc
Could you create games or gamified experiences that implicitly generate unique reasoning traces?
Could you create marketplaces with unique incentive or compensation structures that help match certain types of long-tail, specialized labor with annotation tasks? (analogous to HackerOne in security)
Could you build free utilities in high-value domains where there is very little public data? e.g. free podcasting utility to create multi channel audio data, free 3D editing utility or Blender plugin for 3D data

Done properly, such a business could have very compelling dynamics as it could scale more like a software business and could have strong network effects because most of these ideas implicitly have both a demand & supply side and/or look like marketplaces to some extent.

I think this type of business-model insight is under-appreciated in the data annotation space, and I’d love to connect with teams thinking creatively along these lines.

LLM-Assisted Coding & Feature Dense Startup Categories

Davis Treybig — Thu, 20 Feb 2025 20:40:47 GMT

I spend a lot of time looking at feature dense startup categories, where in order to compete you must build a massive number of table stakes features in order to seriously be considered by your users.

Good examples of this include spreadsheets, architecture, computer aided design, anything serving creative professionals (e.g. video editing, motion graphic editing, Photoshop, Figma), 3D design, integrated developer environments, chip design (e.g. Altium), simulation (e.g. Mathworks), and CRMs. A good rubric for this type of startup is that a full time professional “lives” in the tool - e.g. it is the primary tool someone interacts with as part of their day-to-day job. Most highly skilled knowledge workers have a core system of engagement that looks like this.

The classic challenge with building startups in these categories is that achieving feature parity is often necessary in order to be seriously considered. It doesn’t matter that your product is multiplayer, or has cool cloud features, or has AI built into it, etc unless you also still allow the professional to do everything they are used to doing.

As a simple example - if you’re trying to displace Excel but are missing one critical built-in function or one critical hotkey, your user will instantly get super annoyed/frustrated and go back to Excel. The root cause of this is that users of such tools tend to become extreme experts in them, and their productivity is deeply tied to their proficiency with their tool. Correspondingly, they can viscerally feel themselves becoming less productive the second they can’t do X thing they are used to doing all the time.

This is the blessing and curse of such categories - it is super hard to compete, but if you hit escape velocity you have a crazy hard to breach moat. This is a key reason why Salesforce, Altium, Autodesk, and similar are such enduring giants in spite of generally being old, sort of shitty pieces of software.

The classic implication of this for startups is that you basically have two choices if you want to attack a category like this. Option one is to raise a lot of money, spend a long time building, and survive until you hit sufficient feature density - e.g. Figma building for 4-5 years before really making money. Option two is to attack the periphery of the space - e.g. Frame.io attacked the collaboration layers around Adobe Premiere, but didn’t attack Premiere in and of itself. This can work but often limits upside.

What I find interesting is that LLM-assisted code generation tools may dramatically increase the feasibility of option one. The reason for this is that the functionality required in these categories is typically well understood, well defined, and “obvious” to some extent - it just takes a really long time to build it in totality. This is an insanely good match for LLM copilots & agentic systems, which are quite effective at churning out obvious or well understood functionality.

I am starting to see this play out in various startups I work with, who find that tools like GitHub Copilot have so much latent knowledge from their pre-training corpus about common features in established categories that they can very, very quickly get you to 90% implementation. For example, Sequence is able to race through implementing many common audio & color features in video editors because LLMs inherently understand them very well. Similarly, I observe that many of the newer CRM startups (e.g. Clarify) are able to race through covering a lot of the functional requirements in CRMs that traditionally would have been a bigger barrier to entry.

This idea likely extends to other categories of startups where defensibility mostly comes from lots of “grunge work”. Integration companies such as MuleSoft, Plaid, and similar are another good example of this - where a lot of the value came from a team chewing glass for years as they built custom integrations into every single API or bank. These tasks are so amenable to LLMs in many cases that I think this form of moat is less relevant today than it was before.

This is exciting to me because I think it allows startups to focus more on the “craft” of seriously improving the user experience in these categories and innovating on fundamental interaction models, rather than mostly having to focus on table-stakes feature parity.

Mechanisms for Test Time Compute

Davis Treybig — Fri, 10 Jan 2025 02:27:57 GMT

One of the most interesting research trends in LLMs right now is the rise of reasoning models which spend time “thinking” before giving you an answer. OpenAI o1 is the predominant public LLM doing this today, though DeepSeek R1 and Qwen QwQ are other notable recent releases in this domain as well.

This technique is broadly described as “test time” compute - i.e. reasoning at inference time. While the idea of models which apply search or deeper reasoning at inference has been around for a while - e.g. AlphaZero, this paper applying similar ideas to the traveling salesman problem before transformers were a thing - it has sort of re-entered the zeitgeist with o1.

What seems particularly exciting is that this form of test time compute may demonstrate similar scaling laws as pre-training - in other words, there is a theoretically predictable increase in model capabilities as you exponentially increase the compute allocated during inference time before giving an answer, just as there has been a predictable increase in model capabilities as you exponentially increase training compute.

OpenAI’s visualization of test time scaling for o1 - accuracy increases predictably relative to log-scale compute, indicating a roughly log-linear relationship
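One way to read the plot, as a rough sketch rather than anything OpenAI has published: assume accuracy is approximately linear in the log of inference compute C over the plotted range, with a and b as hypothetical fitted constants. Then multiplying compute by a factor k buys a fixed gain, regardless of where you start.

```latex
\text{accuracy}(C) \approx a + b \log C
\quad\Longrightarrow\quad
\text{accuracy}(kC) - \text{accuracy}(C) \approx b \log k
```

Under this assumed form, every doubling of compute adds the same b log 2 to accuracy, which is why each additional point of accuracy gets exponentially more expensive. The approximation can only hold over a limited range, since accuracy is capped at 100%.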

But, what is actually happening under the hood for models like o1 to do this, and what are the different mechanisms or techniques by which test time compute scaling can be achieved? I have not found a good, intuitive overview of this anywhere, and OpenAI is very tight lipped about what exactly they are doing, so this is my attempt to create one.

In this blog I aim to outline, in simple terms, the major ways test time compute scaling can be achieved, based on both reviewing a lot of the recent literature in this space as well as talking to a number of ML researchers at research labs.

Basic mechanisms for test time compute

Best of N Sampling, Majority Voting, & Similar

Have a language model generate many possible outputs during inference time, and then use some kind of sampling, voting, or other evaluation/verifier methodology to pick the “top” answer. This is a simple idea that requires essentially no changes to how the model is trained, but does seem to be an effective baseline.
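To make the two simplest selection rules concrete, here is a minimal sketch. The `toy_model` function is a hypothetical stand-in for sampling an LLM, and the `score` lambda is a stand-in for a verifier; neither reflects any particular paper's implementation.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Pick the most frequent answer (the first-seen answer wins ties)."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: List[str], score: Callable[[str], float]) -> str:
    """Pick the answer that the verifier scores highest."""
    return max(answers, key=score)


def toy_model(rng: random.Random) -> str:
    """Hypothetical noisy model answering 'what is 6 * 7?'."""
    return rng.choice(["42", "42", "42", "41", "48"])


rng = random.Random(0)
samples = [toy_model(rng) for _ in range(25)]

# Majority voting needs no verifier at all.
print(majority_vote(samples))

# Best-of-N needs a scoring function; here a trivial exact checker.
print(best_of_n(samples, score=lambda a: 1.0 if int(a) == 6 * 7 else 0.0))
```

Majority voting only helps when the right answer is also the most common one, whereas best-of-N can surface a rare correct sample, provided the verifier is good enough to recognize it.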

Large Language Monkeys

The first nuance here is in the verifier. Simple approaches like majority voting generalize well but have limits in their utility. Specific domains like coding or math have specialized verifiers that can be used (e.g. unit tests & compilers for code, symbolic engines for math), but these are not general purpose. An increasingly common technique is fine-tuning an LLM to be a specialized verifier - e.g. see here.

The other issue is that for many more complex problems, it is likely that no matter how many times you sample a “standard” model, you will not get the right answer (or it will take an infeasible amount of compute to generate the right answer at a sufficiently high probability). As we will see, the right way to solve this is likely to either train on better reasoning traces, or to have a reward process that can help “nudge” the model through a complex reasoning trace.
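A back-of-the-envelope calculation shows why repeated sampling runs out of steam. This sketch assumes each sample is independently correct with a fixed probability p and that a perfect verifier will spot a correct sample, both of which are simplifications; the function names are my own.

```python
import math


def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p) ** n


def samples_needed(p: float, target: float) -> int:
    """Smallest n such that coverage(p, n) reaches the target probability."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


for p in (0.2, 0.01, 1e-6):
    n = samples_needed(p, target=0.95)
    print(f"p={p}: {n} samples for 95% coverage")
```

The required sample count grows roughly like 1/p, so a problem the base model solves one time in a million needs on the order of millions of samples, before even accounting for an imperfect verifier.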

Chain of thought

The second approach is to have the language model generate a very long, detailed chain of thought reasoning trace as a way to improve reasoning capabilities. This is just a single model autoregressively producing a lot of tokens as it, essentially, talks to itself - there is no secondary system or control flow. OpenAI shows examples of this in its o1 announcement.

While a basic version of this can be achieved via prompting (e.g. “think step by step”), the advanced version of this involves specialized pre-training and post-training techniques which optimize for these sorts of long reasoning traces.

The nuance here is in how exactly the models are trained to be better at these long reasoning traces. Roughly, there are a few ways of achieving this:

Supervised learning - In theory, you could train a model to be good at very long CoT reasoning via lots of supervised examples of very long, human-written chains of thought. In practice though, it is extremely difficult to produce enough data of this form to be useful - there are simply too few examples of high quality, long form reasoning in the public domain, and it is too expensive to produce these manually.
Synthetic reasoning traces - In certain problem spaces/domains, you can synthetically generate complex reasoning traces via procedural methods - e.g. see here as a great example where they use a knowledge graph to produce question/reasoning/answer pairs which are guaranteed to be correct. In areas like mathematics and computer science, you can also use formal systems (e.g. symbolic engines, languages like Lean, compilers & build systems) to produce synthetic reasoning chains. These can be used as training examples for the model.
Sample & Verify - Methods where you ask the LLM to produce many sample reasoning outputs, and then use some form of verification or reward model to identify good vs. bad reasoning chains, which then become a reinforcement learning dataset for post-training. A critical distinction here is whether to use outcome reward models (ORMs) which validate/invalidate the final reasoning output, or process reward models (PRMs) which can apply a reward value to partial chains of thought (e.g. here). This is a very rich domain space as there are many, many ways of sampling generations, training or designing verifiers, and designing the reinforcement learning system to integrate the verifier rewards.
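The ORM vs. PRM distinction is easier to see on a toy example. Below is a minimal sketch in which reasoning chains are lists of arithmetic steps and both "reward models" are hand-written checkers. Real ORMs and PRMs are typically learned models; every name here is illustrative, and the step parser only handles `+` and `*`.

```python
from typing import List

Chain = List[str]  # each step is a string like "3 + 4 = 7"


def step_is_valid(step: str) -> bool:
    """Toy step checker for 'a + b = c' or 'a * b = c'."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    value = int(a) + int(b) if op == "+" else int(a) * int(b)
    return value == int(rhs)


def outcome_reward(chain: Chain, expected: int) -> float:
    """ORM-style: look only at the final answer."""
    return 1.0 if int(chain[-1].split("=")[1]) == expected else 0.0


def process_rewards(chain: Chain) -> List[float]:
    """PRM-style: one reward per prefix, zero once any step has gone wrong."""
    rewards, ok = [], True
    for step in chain:
        ok = ok and step_is_valid(step)
        rewards.append(1.0 if ok else 0.0)
    return rewards


# Three sampled chains for "(3 + 4) * 2".
chains = [
    ["3 + 4 = 7", "7 * 2 = 14"],  # sound reasoning, right answer
    ["3 + 4 = 8", "8 * 2 = 14"],  # two errors that happen to cancel out
    ["3 + 4 = 7", "7 * 2 = 15"],  # sound start, wrong final step
]

for chain in chains:
    print(outcome_reward(chain, expected=14), process_rewards(chain))

# Keep only chains that both verifiers accept as post-training data.
training_set = [
    c for c in chains
    if outcome_reward(c, expected=14) == 1.0 and all(process_rewards(c))
]
```

The second chain is the interesting one: an outcome reward accepts it because the final answer is right, while the per-step rewards flag it immediately. The third chain shows the other benefit of process rewards, which is that the valid first step still earns credit.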

The critical consideration here is what scales in terms of 1. data 2. computational feasibility and 3. human labor? The fact that OpenAI describes their o1 technique as “data-efficient” suggests that they are likely relying heavily on some combination of synthetic data and RL-based verification techniques, as opposed to some kind of human-curated reasoning dataset.

Synthetic techniques can be effective but tend to be limited to specific domains and types of problems that are more easily quantifiable - as such there are questions about how well they generalize.

The challenge with sampling techniques is that the computational search space of reasoning for many interesting problems is too large to exhaustively generate and very complex to efficiently validate. This makes it look a lot like other areas of reinforcement learning, like robotics, where you need to get clever on how to simulate or “search” the space of outcomes and how you design reward functions.

This is the critical driver of why process reward models are interesting - they allow you to terminate solutions early which are on the wrong track, and focus on branching out from intermediate states with high likelihoods of success (good discussion of this in section 3.3 here).

There are a lot of interesting explorations on the right way to structure reasoning traces for effective training. For example, Dualformer selectively obfuscates parts of reasoning traces during training in order to (theoretically) help the model learn mental heuristics analogous to fast, system 1 thinking in humans. Stream of Search highlights potential benefits of having reasoning traces which make a lot of mistakes (e.g. include backtracking, admitting mistakes, changing your mind) as opposed to “perfect” reasoning traces towards a conclusion. This paper similarly demonstrates the value of having mistakes and poor reasoning chains with backtracking in a training set. Beyond A* tries to teach models how to search by constructing training examples which replicate well known search algorithms like A*.

Inference-time search (and other secondary systems)

The third major approach for inference-time scaling is to actually utilize some kind of search technique during inference time. In other words, inference becomes a systems problem, not just a model inference problem, and you have some kind of control flow or orchestration at inference time as opposed to a single model purely generating token output.

Some interesting examples of this paradigm outside of “standard” large language models include AlphaZero, where a trained neural network guides a Monte Carlo tree search algorithm to select the best next move, and AlphaProof, where a pre-trained large language model + RL algorithm generate solution candidates which are then proved or disproved with the Lean proof assistant language.

The most common variation of this in LLM research today is to integrate some kind of “search + verify” technique at inference time - where the model generates a set of N candidate next steps of reasoning, a verifier or reward model is used to grade or score or invalidate those candidates, and then the process is repeated among a subset of the best candidates. Note that you could consider the “Best of N” sampling approach discussed earlier as a special case of this.
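The generate, score, and keep-the-best loop described above is essentially beam search. Here is a minimal sketch on a toy problem (reach a target number from 1 using +1, *2, or *3). The `propose` function stands in for the LLM generating candidate next steps and `score` stands in for a process reward model; both are hypothetical placeholders.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # a partial "reasoning trace": the numbers visited so far


def beam_search(
    start: State,
    propose: Callable[[State], List[State]],
    score: Callable[[State], float],
    is_done: Callable[[State], bool],
    beam_width: int = 2,
    max_steps: int = 10,
) -> State:
    """Expand every state in the beam, then keep the top-scoring candidates."""
    beam = [start]
    for _ in range(max_steps):
        finished = [s for s in beam if is_done(s)]
        if finished:
            return max(finished, key=score)
        candidates = [c for s in beam for c in propose(s)]
        if not candidates:
            break
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)


TARGET = 12


def propose(state: State) -> List[State]:
    """Stand-in generator: three candidate next steps from the last value."""
    last = state[-1]
    return [state + (last + 1,), state + (last * 2,), state + (last * 3,)]


def score(state: State) -> float:
    """Stand-in verifier: closer to the target is better."""
    return -abs(TARGET - state[-1])


result = beam_search((1,), propose, score, is_done=lambda s: s[-1] == TARGET)
print(result)
```

With `beam_width=1` this collapses into greedy step-by-step selection, and with one step and N full-answer candidates it collapses into best-of-N, which is the sense in which the simpler methods are special cases.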

HuggingFace overview of test time compute via search + process reward models

Good research examples of this include Tree of Thoughts, Self-Evaluation Guided Beam Search for Reasoning, and Reasoning with Language Model is Planning with World Model - each of which utilizes a search technique (breadth first search, depth first search, beam search, Monte Carlo tree search) coupled with a verifier to guide the language model reasoning generation. I like the visual depiction of these in the LLM Reasoners paper shown below. Conceptually, all of these ideas & approaches are very similar.

You’ll note that this approach of search + verifier + generative model is almost identical to the approach outlined in the chain of thought section above - the only difference is whether these techniques are applied offline to produce post-training RL datasets or applied online during inference time. In either case, however, you are scaling with test time compute - the former teaches the model to reason longer at test time via training, and the latter guides the model over a larger set of generative outputs during inference time.

Beyond using search algorithms to guide generation, there are other types of secondary systems that can be integrated at inference time which complement the generative model. The RAP paper is a particularly interesting example of this - where they use a secondary LLM as a “world model” which tracks the state of the environment. In other words - as the generative LLM is producing a continuous stream of reasoning actions that include backtracking, thinking, consideration, etc, the world model tracks the “state of the world” at the end of each possible action.

Visual comparison of a standard CoT sequence of actions vs. world model approach where the “state of the world” is preserved after each action

In theory, this makes it easier for the model to reason about the impact a subsequent next action will have, relative to a single stream CoT output where the model must implicitly play back the sequence of actions to understand the current state of the world.

The LLM Reasoners paper mentioned above posits an interesting formalism for trying to unify all these different approaches and ideas (e.g. majority voting, CoT, search techniques, etc).

They argue that all these techniques are ultimately a combination of:

A reward function to decide preferences for different reasoning steps
A world model to specify reasoning state transitions
A search algorithm to explore the expansive reasoning space

In this framing, standard chain of thought reasoning has a reward function equivalent to the default model likelihood output, a world model that is just a constant appending of reasoning actions to a full action history, and a “greedy” search algorithm that always does a single sample of the output probability distribution.
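To illustrate the framing, here is a minimal sketch of chain of thought written as those three pluggable pieces. The log-likelihood table is made up, the function names are mine rather than the paper's API, and the search takes the most likely step at each point as a deterministic simplification of drawing a single sample.

```python
from typing import Callable, Dict, List, Tuple

History = Tuple[str, ...]  # the reasoning state is just the action history

# Hypothetical next-step log-likelihoods, keyed by the most recent step.
TOY_LOGPROBS: Dict[str, Dict[str, float]] = {
    "<start>": {"expand": -0.4, "guess": -1.1},
    "expand": {"simplify": -0.2, "guess": -1.8},
    "simplify": {"answer": -0.1},
    "guess": {"answer": -0.3},
}


def propose(state: History) -> List[str]:
    """Candidate next reasoning steps (none once we have an answer)."""
    return list(TOY_LOGPROBS.get(state[-1], {}))


def likelihood_reward(state: History, action: str) -> float:
    """Reward function: the model's own likelihood of the step."""
    return TOY_LOGPROBS[state[-1]][action]


def append_world_model(state: History, action: str) -> History:
    """World model: the new state is the old history plus the action."""
    return state + (action,)


def greedy_search(
    start: History,
    propose: Callable[[History], List[str]],
    reward: Callable[[History, str], float],
    transition: Callable[[History, str], History],
    max_steps: int = 10,
) -> History:
    """Search algorithm: commit to one step at a time, never backtrack."""
    state = start
    for _ in range(max_steps):
        actions = propose(state)
        if not actions:
            break
        best = max(actions, key=lambda a: reward(state, a))
        state = transition(state, best)
    return state


cot = greedy_search(("<start>",), propose, likelihood_reward, append_world_model)
print(cot)
```

Swapping `likelihood_reward` for a learned verifier, `append_world_model` for an explicit state tracker, or `greedy_search` for beam search or MCTS gives you the other approaches in this post without touching the remaining pieces.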

I find this to be an interesting way to think about the space. The paper also does some interesting benchmarking and finds that search techniques consistently beat CoT, and RAP (world model + search) consistently beats pure search.

This recent meta overview by Stanford of reasoning models describes a similar mental model - arguing most of these approaches are “integrating generator, verifier, and search components”, which is essentially the same framing.

Additional considerations

Verifiers

As you can see, a lot of this hinges on verifiers and the quality of verification. Heuristic/automatic verifiers can be effective but definitionally are domain specific (e.g. test cases for coding questions). Learned verifiers can work, but require high quality training data in the given domain - e.g. see this early OpenAI paper which trains learned verifiers for math problems. There is a lot of progress on simply using LLMs as verifiers, but there may be limits to what is feasible with this. Process based verifiers seem important to get right, but are more difficult to design than outcome based verifiers.

MuZero is an interesting reference point for where this space likely needs to go - it is a reinforcement learning system that learns its own model of the environment and can learn to play a wide variety of complex games at an elite level without being told the rules. In other words, nothing about the specific game it is playing is encoded in the RL algorithm.

This sort of domain-independent verifier design seems critical for models to get generally better at reasoning. Of course, the question is the extent to which this can be mapped to domains with less clear reward functions than Go, chess, shogi & Atari.

Will this generalize?

This is an excellent blog post discussing the challenges with applying RL for reasoning, specifically in the context of OpenAI o1.

o1 uses RL, RL works best in domains with clear/frequent reward, and most domains lack clear/frequent reward.
…
OpenAI admits that they trained o1 on domains with easy verification but hope reasoners generalize to all domains. Whether or not they generalize beyond their RL training is a trillion-dollar question. Right off the bat, I’ll tell you my take:
⚠️ o1-style reasoners do not meaningfully generalize beyond their training

Anecdotally, it does seem like many of the current test time compute models are a lot better for very specific problem spaces (math, logic, computer science), but don’t seem dramatically better in other domains. Indeed, many folks I speak with who have tried these test-time compute models feel that they get a lot worse at many conventional or standard generative tasks. Whether or not RL for reasoning can generalize well to harder-to-verify domains is an interesting open question.

Reasoning in token vs. latent space

Somewhat orthogonal to all of this is the question of whether token-space is the optimal way for a model to reason. There is some interesting research on having models reason directly in latent space - where during the reasoning period the hidden states are passed back to the model rather than the decoded token.

In theory, this may have advantages because the hidden state represents a probability distribution of next token generations, whereas the token is essentially a “sample” of that probability distribution, and it may be effective to reason across all possible next states vs. picking one as this mirrors how humans reason.

A potential downside of this approach is that such a model would not “show its work” to the user, though given that companies like OpenAI are already hiding their reasoning steps from the user, perhaps this is irrelevant. I suppose it may be possible to still visualize the token output but reason on the latent output, though this may create a divergence between what the user sees and how the model actually reasoned.

Agent reasoning

One thing I am particularly interested in is how all of this maps to agents. There is an extreme parallel between optimizing a model for complex reasoning trajectories that span many sub-steps, and optimizing an agent for complex reasoning trajectories which span many sub-steps. The only difference is that an agent’s sub-steps are broken up into different model calls and typically involve more moving pieces (e.g. function calling, etc).

I observe that many of the leading agent startups building agents for X (e.g. Cognition, Basis, etc) apply many of these ideas to their agent design. For example, I have spoken to multiple agent companies that take their agent traces, replay them with some kind of search technique + reward model to explore counterfactual reasoning paths, and use those counterfactual trajectories as fine tuning examples for improving their agentic system.
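As a rough illustration of the trace replay idea, here is a one-step version: for each recorded step, score the alternatives the agent could have taken and emit a preference pair whenever one beats the action it actually took. This is my own simplified sketch, not how any of these companies do it. A real system would search multiple steps ahead and use a learned reward model, and the observations, action names, and reward table below are invented.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str]  # (observation the agent saw, action it took)


def counterfactual_examples(
    trace: List[Step],
    alternatives: Callable[[str], List[str]],
    reward: Callable[[str, str], float],
) -> List[Dict[str, str]]:
    """Replay a trace and collect steps where a better action existed."""
    examples = []
    for observation, taken in trace:
        # Sort so that ties between candidates resolve deterministically.
        candidates = sorted(set(alternatives(observation)) | {taken})
        best = max(candidates, key=lambda action: reward(observation, action))
        if reward(observation, best) > reward(observation, taken):
            examples.append(
                {"prompt": observation, "chosen": best, "rejected": taken}
            )
    return examples


# Hypothetical domain-specific reward model for an accounting agent.
TOY_REWARDS: Dict[Tuple[str, str], float] = {
    ("invoice has no PO number", "guess_po_number"): 0.1,
    ("invoice has no PO number", "email_vendor"): 0.8,
    ("vendor missing from ledger", "create_vendor"): 0.9,
    ("vendor missing from ledger", "skip_invoice"): 0.2,
}


def toy_alternatives(observation: str) -> List[str]:
    return [action for (obs, action) in TOY_REWARDS if obs == observation]


def toy_reward(observation: str, action: str) -> float:
    return TOY_REWARDS[(observation, action)]


trace = [
    ("invoice has no PO number", "guess_po_number"),
    ("vendor missing from ledger", "create_vendor"),
]
pairs = counterfactual_examples(trace, toy_alternatives, toy_reward)
print(pairs)
```

The chosen/rejected pairs are in the shape that preference-based fine tuning methods expect, so the agent's own production traces become a growing training set.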

This approach becomes critical if you are working on an agent doing 50-100+ chained LLM calls to solve a given action in a complex environment with many tools, given the combinatorial complexity of actions the agent can take for a single request.

What I find particularly interesting about this is that it is much more feasible to design domain-specific search algorithms and process reward models than it is to generally solve complex multi-step reasoning at the model layer.

This is an interesting corollary of the blog post I mentioned above questioning if these techniques will generalize - perhaps RL for complex reasoning will be difficult to generalize well at the model provider layer, and instead will become a core competency (and layer of defensibility) for most applied agent startups in domains that require very complex reasoning to solve tasks (e.g. accounting, tax, finance, architecture, etc).

I also suspect tooling will emerge that aids agent startups with this sort of task - analogous to the ecosystem that has emerged around fine tuning (e.g. MosaicML). Such tools would make it easier to build search + verifier layers and generate datasets with them for a given applied agent use case.

Other interesting resources in this space

Sasha Rush’s Speculations on test time scaling and associated slides
The State of Generative Models 2024 - see the section on reasoning and o1
Towards System 2 Reasoning in LLMs

A lot of this blog is me attempting to explain this all to myself. If you think I am missing something or wrong about something, let me know - davis @ innovationendeavors.com

The AI-Native Product Manager

Davis Treybig — Fri, 20 Dec 2024 19:47:18 GMT

The role within modern tech companies that I think has most disproportionately benefitted from recent advances in AI is the product manager. It’s not a stretch to say that a good PM today who knows how to use AI tools well is probably 10-20x more leveraged than they were just a few years ago. This takes shape in a few ways:

AI-assisted coding

Tools like Cursor/Cognition/etc have created a true step function change in what a technically proficient PM can achieve on their own. It is now within reason for a PM to create v0 prototypes of many features or new ideas without any need for external engineering resources.

While it is of course true that many good PMs could write basic software on their own before AI copilot tooling, in my experience the juice was often not worth the squeeze given that even the most technical PMs are typically not proficient software engineers.

However, AI software tools have so dramatically lowered the cost/effort/time to ship functional prototypes and have so raised the ceiling on what a semi-technical person can achieve, that I think this paradigm has fundamentally flipped. Good PMs will start to prototype all new features/ideas themselves before handing off to engineers, and I think this will allow PMs to be almost completely independent for a lot of zero-to-one discovery and validation work.

I suspect that as this plays out, you will see certain coding tasks more fully delegated to product managers. A good example of this is product instrumentation. PMs are typically the primary stakeholder who cares about this, instrumentation code changes are typically very simple/straightforward, and in my experience engineers typically don’t like working on PRs for it. It wouldn’t surprise me if this sort of task became more fully the responsibility of the product organization.

AI-assisted design

Most good product managers dabble as “part-time” UX designers - utilizing tools like Figma to riff on user workflows or to mock up simple changes. Advances in AI-assisted design will similarly up-level what is possible for a product manager to achieve on their own in this domain. Examples of this include:

v0, which makes it easy to create sample frontends
Tooling like Figma AI, which will make it easy to create high fidelity visual layouts of common components/workflows/etc (especially as design systems become more common and these AI products can just compose your design system building blocks)
LLMs are going to more broadly minimize the gap between the production assets and design files - making it easy to “bootstrap” a design file which mirrors the current production app precisely. This will make it much more accessible for PMs to do many smaller design changes/iterations themselves, as they will not need to either hand-compose the UX from scratch or ask the designer where the canonical design file is.

Similar to my point around how this might allow product instrumentation to shift from engineers > PMs, I ultimately think these sorts of AI-assisted workflows will enable UX designers to spend less time on low-creativity, low-inspiration UX design tasks that need to be done (since the PM can now own those themselves), and focus more on the complex UX design work they specialize in.

AI-assisted customer feedback analysis

The most critical job of a PM is to interface with the market and structure requirements & feedback into a plan for the engineering & design organization. The degree to which LLMs have structurally enabled this job to be done differently is unbelievable.

It is now within reason for a PM to basically analyze every single piece of customer feedback & input across all channels. Emails, bug reports, sales calls, online user sentiment on platforms like X, and similar can all be synthesized with LLM systems and interrogated by the PM - see platforms like Enterpret and ProductBoard AI as intriguing early examples of this.

Traditionally, PMs had to disintermediate themselves from a lot of this feedback, and rely on, for example, the customer support team’s aggregation of Zendesk feedback to influence product priorities & roadmap. The degree to which a small number of PMs can now get source-of-truth data about all customer input has fundamentally changed.

This will also allow prioritization and roadmap planning to be done differently. LLM tooling should be able to easily reconcile discrepancies between the full universe of customer feedback relative to roadmap prioritization - and probably provide very high quality “first pass” roadmaps. I can imagine something like Granola but for product planning, where the top level strategic goals are input by the product team, and all the other “details” get filled out from there based on user feedback.

AI-assisted communication, alignment, & process management

Finally, LLMs are starting to play an influential role in how PMs communicate & align their teams. The most obvious breakout version of this right now is ChatPRD, which is a copilot for PMs to write high quality product requirement documents. PRDs are the core artifact by which most PMs communicate, and LLMs can clearly play an elegant role in helping to draft them more quickly & improve their quality and rigor.

Things can likely be taken much further than that, however. Some larger companies like Google have built internal infrastructure which use AI to auto-convert PRDs into Jira/Linear tasks - decomposing the product scope into its sub features. You can also imagine AI systems which analyze PRs and preview environments relative to the PRD, acting as a “first pass” product review without the PM having to manually review every single major feature that enters staging.

More broadly, so much of triaging, routing, and assigning bugs and tickets falls to product managers in many organizations, but could almost certainly be done more automatically in a world of AI. Why can’t an LLM triage incoming bug requests relative to the roadmap and determine which should go to the backlog vs. which should be prioritized? The more that AI is integrated into the project management cadence, the more PMs can focus on actually valuable strategic work.

What are the implications of all of this?

Assuming all of this is true, the next question is how do product organizations commensurately evolve?

Organizationally, I suspect the ratio of product managers to engineers & designers goes down. It will be more common to have a very small set of extremely senior PMs, who are ridiculously levered thanks to all of this AI tooling. The key driver of this will be that so much of the “grunt work” that can suck up a huge amount of a PM’s time can now be heavily automated or accelerated with AI - whether analyzing user feedback or triaging issues. Correspondingly, more day-to-day product work will be owned by engineering & design teams, who are empowered with AI tooling to act more like product owners.

Procedurally, I think the product development lifecycle will change. PMs should be able to be much more self sufficient in doing early market discovery + solution validation thanks to being able to do much more end-to-end prototyping on their own, only handing off to engineering & design when it is very clear that something should be built.

The relationship between product, engineering, and design will probably also change. PMs should hopefully be able to take on a lot of the “low hanging fruit” tasks often assigned to engineers and designers (e.g. product instrumentation, simple UX changes for minor features). Engineering & design will likely not require as much input/feedback from PMs if it becomes much easier to write extremely high quality, rigorous PRDs (thanks to tooling like ChatPRD) and if agents can act as a “first pass” PM review or feedback channel in many situations.

I suspect this may also lead to the opportunity for a next generation product management platform - ala Pendo/Productboard 2.0. So much is changing in terms of how PMs work and what will be expected of them that I think you can likely redesign the system of engagement/system of record for product work. Imagine a tool that enabled the following integrated workflow:

Synthesize/search/query all the current user feedback across all channels via Enterpret-esque LLM analysis, helping you create a ranked priority list of key things to address
Rapidly create extremely detailed PRDs for the top N priorities via ChatPRD style copilot system, contextualized by the prioritized user feedback
Bootstrap UX mocks and mini-app prototypes for each PRD, which the PM can use to quickly do first-pass customer validation. LLMs could even assist with determining which users to run the idea by (and potentially help run UXR style survey-style workflows to collect feedback on the solution)
AI-assisted auto-refinement of PRD based on market feedback
Once the PM feels the feature is worth implementing, the tool can do a first pass on auto-decomposing it into Jira tickets and subtasks
Finally, an LLM agent can act as “product guardrails” as the new feature gets implemented, reviewing incremental artifacts (e.g. PRs for each subtask, preview environments for subtasks, etc), providing immediate feedback to eng/design and helping determine what to bubble up to the PM

Something like this would have been totally inconceivable to build until recently, but now I think it is quite possible. Furthermore, while the above touches mostly on the core product iteration loop, I think LLMs can also address additional issues in product management - such as quantifying product velocity and aggregating product progress across a larger organization.

A tool like this will need to consider engineering, design, & product all as core stakeholders. AI will make everyone in the organization more of a participant in product work - shepherded by a few very senior, AI-enabled product leads - and correspondingly product management tooling which primarily serves PMs becomes less relevant. Almost all engineers will become “product engineers”. This will require a re-envisioning of what product tooling looks like and how it is bundled.

I’d be very curious for input/feedback from product managers on these ideas and observations - especially product managers who have tried to really embrace AI-tooling in their job function. Shoot me a note at davis @ innovationendeavors.com

Adwords for Software APIs

Davis Treybig — Fri, 15 Nov 2024 23:16:04 GMT

A trend I see more and more as I speak with developers that use copilot tools (e.g. GitHub Copilot, Cursor, Augment) is that they increasingly discover and choose third party libraries, services, and APIs based on their software copilot.

E.g. rather than Google “Video APIs”, find the top few options, read through their documentation, and then implement one, a developer may simply ask their copilot to help them implement a video rendering API. If you ask this of Cursor today, it will list Mux as a “(Recommended)” provider over Cloudflare, AWS MediaStream, or other options.

Cursor recommends Mux as a video rendering service provider

In essence, copilot products are becoming a marketing & distribution channel for developer services.

As these tools mature from mostly assisting with autocomplete and chat-with-your-codebase into agentic systems that provision infrastructure or implement larger scale projects on their own (à la what you see with Replit agent or Cognition), the degree to which this is true will go up dramatically. You will expect the AI to evaluate tool/service options, pick top contenders, and make recommendations on which services to use in order to implement the feature or PRD that it is assigned.

If this plays out, it is inevitable that you will see a formalization of marketing tools focused on the software copilot & agent channel. Companies which offer developer services will be desperate to figure out:

SEO - What is the equivalent of “SEO” aimed at software copilots & agents? What types of content are LLMs good at picking up and understanding? How do you make sure that your service is always listed if someone asks a copilot about your product category?
Ranking - What do you need to do to get copilot products to rank your service as “better” or “recommended” vs. others (as we see in the Mux example above)? How do you ensure that the software agent is factually correct in terms of understanding your offering and comparing it to alternatives?
Analytics - How do I measure what leads or signups are coming from software copilots & agents, and then optimize this metric over time or test changes against it? This is essentially Google Search Console or Ahrefs, but for copilot marketing.

As this space matures, you will almost certainly see a proper advertising market formalize - essentially, what is the Adwords equivalent for software libraries/APIs which want to promote themselves in copilot products? A very simplistic way this could work is as follows:

Advertiser (e.g. Stripe) negotiates deal with copilot vendor (e.g. Github Copilot) for queries related to certain terms (e.g. “Payments API”)
Github Copilot, when a user enters a query or question related to a “Payments API” in the chat window, stuffs marketing collateral related to Stripe’s payment API into the context of the data sent to the model. It instructs the model to mark any suggested output related to Stripe as “promoted” or to wrap that part of the output in special tokens
The Github Copilot UI visually delineates “promoted” suggestions from “standard” suggestions that the model implicitly came up with from its pre-training data set
Github tracks autocompletions or similar “acceptance” metrics that come from promoted suggestions, and uses that to measure the success of the advertising campaign, analogous to how Meta/Google price on clicks
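
To make the steps above a bit more concrete, here is a rough sketch of the context-injection and measurement pieces. Everything in it is hypothetical: the `Campaign` record, the `<promoted>` tag convention, and the event format are assumptions for illustration, not how any copilot vendor actually works.

```python
from dataclasses import dataclass


@dataclass
class Campaign:
    advertiser: str
    keywords: list[str]
    collateral: str


def build_prompt(user_query: str, campaigns: list[Campaign]) -> str:
    """Return the prompt sent to the model, with matching ad collateral injected."""
    query = user_query.lower()
    matched = [c for c in campaigns if any(k.lower() in query for k in c.keywords)]
    parts = []
    for c in matched:
        # Hypothetical convention: the model is asked to tag promoted output
        # so the UI can render it differently from organic suggestions.
        parts.append(
            f"[SPONSORED CONTEXT: {c.advertiser}]\n{c.collateral}\n"
            f"If you suggest {c.advertiser}, wrap that suggestion in "
            "<promoted>...</promoted> tags."
        )
    parts.append(f"User question: {user_query}")
    return "\n\n".join(parts)


def promoted_acceptance_rate(events: list[dict]) -> float:
    """Share of promoted suggestions that the user accepted."""
    promoted = [e for e in events if e["promoted"]]
    if not promoted:
        return 0.0
    return sum(1 for e in promoted if e["accepted"]) / len(promoted)
```

A query that matches no campaign keyword passes through untouched, which is what keeps “standard” suggestions separate from “promoted” ones in this sketch.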

There would then be a marketplace for bidding on certain keywords across different AI code generation vendors.

There are numerous other ways this could be handled or implemented, and the above solution would have some issues - but I share it just to give a sense of one way it could work even without any major changes at the model layer.

Some of the open questions on my mind in relation to all of this:

Will account creation & payments end up being bundled into copilot products? E.g. if Cursor tells me I should use Mux, why can’t I just sign up for a Mux API key in my IDE via Cursor? If an autonomous agent implementing a PRD wants to use Mux or Neon or Vercel or any other services, does it need to ask me to go create an account and add payment details in all those cases and then give it the API key, or is there a better solution?
Is this any different from non-developer-facing products who want to figure out how to promote their content in ChatGPT? In some sense, the same problem I am describing exists for any SaaS product that wants to be referenced by ChatGPT when someone references a certain term. Why can’t one vendor solve SEO/Analytics/Advertising for LLMs broadly? I do think there are some specific nuances in the developer-oriented version of this, but it is a fair question
LLM Vendors vs. copilot/codegen providers - If an advertising market emerges here, does the supply side directly interact with the demand side (copilot/codegen tools), or do the companies training LLMs need to be directly involved? Today, a lot of service provider suggestions in tools like Cursor come implicitly from the model pre-training - e.g. I doubt Cursor has anything to do with listing Mux as a recommended Vendor. But, over time, I think that advertising in LLMs will look more like in-context learning because you want to be able to dynamically inject or adjust promoted content separate from model training. In this world, the model provider would not be as relevant.

I suspect there are startup opportunities in and around this area - just as you saw with the rise of mobile (AppLovin) and web search (DoubleClick, AdMob) as discovery channels.

All Applications are Becoming Distributed Systems

Davis Treybig — Fri, 04 Oct 2024 22:18:55 GMT

Traditionally, most SaaS applications followed a relatively straightforward system architecture - a frontend client makes API calls via REST or GraphQL to a stateless backend which does all core data manipulation via database queries.

Such an architecture requires essentially no reasoning about distributed systems (e.g. consensus, coordination, consistency, state invalidation, splitting computation across nodes, etc). Although the application may use a distributed system (a database), the application developer could generally treat it as a black box, with caches being the notoriously annoying classic exception.

This paradigm is rapidly changing. Specifically, I would argue that modern applications are increasingly becoming distributed systems in the sense that the compute and storage model of the application is spread across client, edge, and cloud.

Some examples of this include:

“Local First” applications like Linear & Notion, which treat the client & server as peers which can diverge on state and must regularly come to consensus
Most SaaS apps doing real-time collaboration in their product using techniques like CRDTs, such as Sequence
Companies like API.Video, Womp, and Sequence which do remote video processing (often on edge infrastructure) but then also do some graphics work in the client, essentially splitting their application’s compute model
Companies like RillData & Motherduck who adopt hybrid query execution patterns, running large scale analytical compute both locally and remotely
Architectures like Apple Intelligence, which run variations of ML models both locally and in the cloud and dynamically route across them based on the workload

In simple terms, these architectures trade higher system complexity for faster, more reactive, more responsive applications which sometimes offer secondary benefits such as multiplayer collaboration, offline-mode support, and reduced server-side compute costs.

Their emergence is driven by a number of recent technology improvements in areas like virtualization and distributed systems which make it easier to handle this degree of system complexity. It is also driven by applications becoming data & AI driven, which makes them increasingly stateful & compute heavy and necessitates more extreme methods for reducing latency and improving responsiveness.

This is an interesting trend because it substantially changes how applications are designed, it greatly increases application complexity, and it introduces the need to reason about distributed systems to a new set of engineers who generally never had to think much about it before (application engineers). The hatred most application teams have towards maintaining cache coherence with tools like Redis and Memcached is a good illustration of how messy it can get when people who aren’t trained in distributed systems have to reason about them.

This post explores this trend across a few dimensions:

The technical enablers driving these architecture shifts
The market forces & workload changes driving these architecture shifts
How developer tooling may need to evolve to accommodate these architecture shifts

Technology Enablers

There are numerous recent technology advances which have dramatically simplified building an application in this way.

Embedded libraries

There has been a range of activity recently in single-node, embedded, zero-dependency libraries which offer database-like characteristics in a very small form factor that can be run anywhere (e.g. locally, in the browser, etc).

Good recent examples of this include DuckDB (analytical queries), LanceDB (search queries), KuzuDB (graph queries). SQLite is the original example of this and is what has been used for data storage in most embedded systems, IoT devices, and mobile devices for decades.

Such libraries are very interesting because they, in theory, allow you to run the same compute engine at every layer of your application - client, edge, & cloud. This is an extremely useful basis for “hybrid execution” architectures which dynamically route queries or compute across different nodes based on the query.
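
As a small illustration of the “same engine at every layer” idea, here is a sketch using SQLite through Python's standard `sqlite3` module. The two in-memory databases are stand-ins for a client node and a cloud node, and the `orders` table is made up for the example.

```python
import sqlite3

QUERY = "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"


def make_node(rows):
    """Create an in-memory SQLite database standing in for one tier."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


client = make_node([("eu", 10), ("us", 5)])             # small local subset
cloud = make_node([("eu", 10), ("us", 5), ("us", 20)])  # full dataset

# The exact same SQL text runs unchanged on either node.
local_result = client.execute(QUERY).fetchall()
remote_result = cloud.execute(QUERY).fetchall()
```

Because the query text and engine are identical on both sides, a router only has to decide where to run it, not how to translate it.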

“Edge” Compute & Storage

Tooling and infrastructure for provisioning globally distributed edge compute has massively improved over the last few years. Vendors like Fly and Cloudflare (Workers, D1) make it easy to provision compute & storage essentially anywhere on earth, and a range of newer startups like Turso & Tigris are starting to offer higher level cloud abstractions on top of these geo-distributed datacenters.

While edge infrastructure is not a new concept - CDNs have obviously been around for a while helping to serve static content faster and vendors like Fastly offer an array of edge services for application developers such as web application firewalls - it is only more recently that the full range of cloud compute primitives has become really accessible beyond AWS-East.

This vastly simplifies what it takes to build an application that offloads some processing or storage to the edge, which is a fundamental requirement for many compute heavy but latency constrained products. The second you move a significant portion of your application workload to the edge, you start to enter into something that looks more like a distributed system.

Modern Application Frameworks

Tightly coupled with the rise of edge infrastructure has been the evolution of frontend frameworks such as React which abstract certain aspects of client vs. server and edge vs. client. A good example of this is React server side rendering and server components.

CRDTs & Sync Engines

CRDTs, or conflict-free replicated data types, are a class of data structure which allow concurrent changes on different devices to be merged automatically without requiring any central server. Definitionally, any application built around CRDTs is a distributed system since they mediate state syncing & state conflicts across peer nodes.
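
For a feel of what “merged automatically” means, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter. It is a textbook example rather than what Automerge or Yjs implement internally, and the replica names are arbitrary.

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by element-wise max."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        # A replica only ever writes to its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # Taking the max per slot makes merging order-independent and repeatable.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())


laptop, phone = GCounter("laptop"), GCounter("phone")
laptop.increment(2)   # edits made while offline
phone.increment(3)
laptop.merge(phone)   # sync in either direction, in any order
phone.merge(laptop)
```

Both replicas end up with the same value without a server deciding who wins, and merging the same state twice changes nothing.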

CRDTs are the basis for many multiplayer SaaS applications (alongside operational transformation, which seems to be falling out of favor) such as Notion & Figma. Libraries like Automerge and Yjs have made it dramatically simpler to build CRDT-based applications, and as a result you see multiplayer increasingly becoming a table stakes component of many SaaS products.

CRDTs are a subset of a broader class of technical solutions to rapidly syncing potentially divergent state across many nodes, which I would broadly describe as “sync engines”.

Linear, for example, has shared extensive information about their synchronization engine which allows their product to be offline-first. Linear does not rely on specialized CRDT-based data structures but rather, fast message passing with basic heuristics for how to handle conflicts. A range of startups, such as Orbiting Hail and Electric SQL, aim to offer something similar to others.
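
One common heuristic of this kind is “last writer wins” per field. The sketch below is a generic illustration of that idea and is not a description of Linear's actual engine; the `Change` record and its fields are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    field: str
    value: str
    timestamp: int   # logical or wall-clock time; an assumption for this sketch
    client_id: str   # tie-breaker so every node picks the same winner


def apply_change(state: dict[str, Change], change: Change) -> None:
    """Keep the newest change per field; ties are broken by client_id."""
    current = state.get(change.field)
    if current is None or (change.timestamp, change.client_id) > (
        current.timestamp,
        current.client_id,
    ):
        state[change.field] = change
```

Since the winner depends only on the `(timestamp, client_id)` pair, nodes that receive the same messages in different orders still converge on the same state.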

Virtualization

Lightweight, portable virtualization technologies such as Firecracker, V8, and WebAssembly have matured substantially recently, and increasingly make it possible to run the same binary everywhere - whether locally, in the browser, on the edge, or in the cloud.

WebAssembly, in particular, is enabling a whole new class of applications which distribute compute & storage. Good examples of this include:

Electric SQL, which runs Postgres in the browser via WebAssembly and then syncs it with the server
Modyfi, which runs diffusion models in the browser via WebAssembly and also runs them on the server
SQLSync, which uses WebAssembly in the client and server to run reducer logic for state syncing
Shopify uses WebAssembly to power its edge server-side rendering workflows

Streaming & Stream Processing

Advances in stream processing and incrementally materialized views make it dramatically more realistic to ship data across client, edge, and cloud. Passing an entire database or table over the network is a non-starter in most cases, but passing only diffs or incremental changes can be an effective basis for maintaining a local sync or copy of data somewhere.
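
A toy version of this idea: instead of re-sending a whole table, the source emits small change events and each replica folds them into a locally maintained view. The event shape and the “orders per customer” view are made up for illustration, and real systems handle far more (updates, ordering, retractions across joins).

```python
from collections import Counter


class OrderCountView:
    """Materialized 'orders per customer' view updated from change events."""

    def __init__(self):
        self.counts = Counter()

    def apply(self, event: dict) -> None:
        # Each event is a small diff: an operation plus the affected row.
        if event["op"] == "insert":
            self.counts[event["customer"]] += 1
        elif event["op"] == "delete":
            self.counts[event["customer"]] -= 1
            if self.counts[event["customer"]] == 0:
                del self.counts[event["customer"]]


view = OrderCountView()
for ev in [
    {"op": "insert", "customer": "acme"},
    {"op": "insert", "customer": "acme"},
    {"op": "insert", "customer": "globex"},
    {"op": "delete", "customer": "globex"},
]:
    view.apply(ev)
```

Each update costs work proportional to the size of the change, not the size of the table, which is what makes shipping state to a client or edge node plausible.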

Improvements in CDC (e.g. Debezium) and incrementally materialized views (e.g. Noria, Differential Dataflow, DBSP, Materialize, Feldera) are making this much more feasible. For example - PlanetScale Boost automatically accelerates DB queries via an embedded KV cache that is kept coherent via maintaining an incrementally materialized view. This sort of pattern would have been very difficult to build just 4-5 years ago.

https://twitter.com/gunnarmorling/status/1609958952040599552

Market Drivers

In addition to it being easier to build software in this way, it is also now more essential as a result of two significant market shifts.

The first is user expectations. The widespread adoption of tools like Figma, Notion, Linear, and similar which offer offline support, multiplayer collaboration, full reactivity, and essentially instant responses for all actions has set a new baseline for user experience in a lot of software.

The second is that applications have become dramatically more state and compute intensive. We are moving to a world where essentially all products are data-driven and AI-enabled, and where as a result data processing and model inference become a core component of the system. As applications naturally become more stateful and compute-constrained, it becomes increasingly essential to adopt architectures which help to mitigate the latency & compute overhead incurred.

Good examples of this include:

Applications which adopt session backends, where you need to maintain a large amount of state in memory for constant manipulation by the user
Any company doing work with LLMs or diffusion models who want to push some inference to the edge, often for latency reasons - e.g. Web Stable Diffusion
Mosaic, a cool data visualization library that helps dynamically coordinate data processing logic across a local database in the client and a remote database in the cloud, in order to make massive scale data visualization tasks feel almost instant

How developer tooling may need to evolve to accommodate these architecture shifts

This shift in the way that applications are being built has some interesting implications for how developer tooling may need to evolve.

First, there will likely be an explosion of tools which aim to abstract many of the paradigms I describe here - particularly the distribution and syncing of data. For example, LiveBlocks aims to abstract CRDTs for multiplayer collaboration, ElectricSQL aims to abstract sync engines, and Jamsocket aims to abstract session backends. In many cases, this harkens back to what Firebase did for mobile app sync long ago.

Second, I think there will begin to be more attention paid to “smart orchestration” engines. Assuming that it is easy enough to distribute and sync data across client, edge, and cloud, the next question is how exactly to distribute the data and route queries and workloads against it?

An interesting example of this is Cloudflare Smart Placement, where Cloudflare will dynamically execute your compute either close to the user or close to a cloud database based on the nature of the query and how stateful it is.

Similarly, companies like Motherduck which build around hybrid execution models will need to consider questions like:

What data do you pull locally vs. keep remote? What is this a function of?
Do you reactively sync data locally, or do you predictively sync data locally?
If a query needs to access both local data and remote data, what is the most efficient way to fulfill this query?
Do consistency concerns ever influence whether I query remote data vs. local data?
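
To show how a couple of these questions might turn into code, here is a deliberately naive placement rule. It is a sketch under my own assumptions (table-level granularity, a single `needs_fresh_data` flag) and not how Motherduck or any other vendor plans queries.

```python
def plan_query(
    tables_needed: set[str],
    local_tables: set[str],
    needs_fresh_data: bool = False,
) -> str:
    """Pick where to run a query: 'local', 'remote', or 'hybrid'."""
    if needs_fresh_data:
        # Consistency concern: the local copy may be stale, so go to the source.
        return "remote"
    if tables_needed <= local_tables:
        # Everything the query touches is already synced locally.
        return "local"
    if tables_needed & local_tables:
        # Some inputs are local and some are remote.
        return "hybrid"
    return "remote"
```

A real planner would weigh data sizes, network cost, and staleness bounds rather than a boolean, but the shape of the decision is similar.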

Efforts like Apple Intelligence which adopt an analogous hybrid execution paradigm but for RAG systems will also need to consider when to route to the smaller, local model vs the smarter, bigger, remote model, in addition to having all the same data routing concerns for retrieval.

There is likely a huge amount of room for richer research in this domain - one way to think of it is database planning & optimization but applied to distributed systems, not to individual databases. Projects such as Suki and Hydro are interesting efforts roughly in this direction. A key question here is how much of this sort of system optimization can be generalized, vs. how much will need to be extremely application specific.

Distributed systems expertise as an underserved differentiation for application engineers

A secondary implication of all of this is that having expertise in distributed systems can be a significant potential advantage as an application engineering team.

Distributed systems engineering forms the basis of many of these more complex architectures, and often is the basis by which you can now build truly differentiated products especially in more compute intensive categories such as design tools, CAD tools, engineering tools, AI-native tools, and similar.

Yet, few application engineers actually have much expertise in this area - most strong distributed systems people work on “traditional” distributed systems problems (e.g. databases, backend infrastructure, etc).

Frontend engineers haven’t fully realized how relevant some old ideas from distributed systems engineering now are to their work - Browsertech Digest

As a result, I think there is a pretty significant talent arbitrage in the market right now for people applying distributed systems ideas to the way modern web applications are built. Companies which push in these areas can build really differentiated products as a result - like RillData in exploratory data analysis, Womp in 3D editing, and similar.

This blog by Tably is a particularly fun example of this showing how the right application of these sorts of distributed systems ideas might allow you to build a really novel spreadsheet product. I think there is opportunity for more teams thinking this way.

Summary

Overall, data is becoming more “distributed” in applications. The world is moving more towards applications that assume that data is split between client, edge, and server, which intelligently cache, sync, & store data across these layers and intelligently route queries/workloads across them.

This trend is facilitated by a range of technology advances, especially virtualization layers like WebAssembly and embedded zero-dependency DB libraries like pg-lite, and is starting to become more table stakes thanks to consumer expectations around application behavior and the rise of more compute and ML-intensive applications.

As this trend matures, I think it will necessitate further evolution of developer tooling, specifically with respect to syncing, orchestration, and routing of data and compute across an increasingly complex system architecture.

If you’re building something in and around this area - I would love to chat.

UX Design for Agentic Systems

Davis Treybig — Wed, 03 Jul 2024 15:58:06 GMT

Most mainstream product design tools like Figma & Sketch treat user interface design as the atomic unit they are built around. Mockups & visual components are the basis of these tools, and are typically the mechanism by which other interaction design concepts like user intent modeling & user state transitions are conveyed.

This paradigm makes a lot of sense for SaaS products where each user intent or workflow maps directly to a specific UI. If you are a CRM and you want to support a workflow where the user can build automations on a list - you build a user interface for that. If you are a stock trading app and you want to let people buy options - you build a user interface for that. In effect - interaction design is done implicitly by laying out user interfaces.

Traditionally, it would have actually been quite difficult to imagine a piece of software that did not follow this paradigm. How can you support a wide array of user workflows, intents, and states with a singular UI?

Yet, if you look at the emerging landscape of AI-native tools, especially those “selling the work” by automating what is today done by humans or services firms, this dynamic has changed. Many such products are quite UI light, but extremely high in complexity of reasoning about user state and interaction patterns.

Exploring an example

As an example, let’s consider Mindy, an AI “chief of staff” assistant that operates entirely over email. You can ask it anything in natural language over email and it will get back to you and try to help.

In one sense, this product has little to no user interface surface area - it is just a set of emails! From a visual and UI design perspective, a designer for such a product might sketch out different email text layouts in Figma for different use cases, but there is ultimately not a lot of work to be done there.

Yet, on the other hand, a product like Mindy has immense UX design challenges that are, in my opinion, actually far more difficult to deal with than a typical SaaS product. These design challenges stem from: 1. Needing to mimic the behavior of a human chief of staff that can respond to a somewhat arbitrary set of long tail questions and 2. The probabilistic nature of AI systems.

You might imagine that a UX designer for Mindy needs to think through the following sorts of things:

What class of questions are people likely to ask Mindy?
What types of questions should Mindy attempt to respond to vs. not?
Should there be specialized, more “productized” workflows for responding to head queries (e.g. perhaps meeting scheduling and research are the two most important use cases)?
What should Mindy do in cases where it is not confident it can answer?
How should Mindy handle various forms of user followups or responses? What could these look like?

Most of these challenges tend to revolve around user state modeling - what states can a user be in, how do we classify each of these states, how should the product operate in each of these states, what state transitions should we model out, and what is the user experience we want to deliver in each state? These are the core design principles that would then guide an engineering and product team on how to architect the LLM system.

As an example - if the most important head query to focus on for a product like Mindy is scheduling, it is likely that you then need to design the system to first classify the incoming email, then move into a series of very specific steps to execute a scheduling request via LLMs such as: 1. Collecting key scheduling metadata such as attendees, date, time, location, 2. Confirming the scheduling request with the user, 3. Sending the event invite to participants, 4. Monitoring/polling for event rejections or rescheduling requests by attendees, etc.

Partial state modeling for a scheduling request for a product like Mindy

A good UX designer would probably go 100x deeper on this flow than I did above, modeling out all these user states and how to handle them. While each of these sub-states and state transitions likely involves an LLM or small agentic system in some capacity (email classification, email understanding, email sending, etc), the overall state flow needs to be more precisely defined to create a good UX in the vast majority of cases.
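
One way to pin down the scheduling flow described above is as an explicit state machine, with the LLM calls living inside each state rather than deciding the overall flow. The state and event names below are my own placeholders, not anything from Mindy.

```python
from enum import Enum, auto


class State(Enum):
    CLASSIFY_EMAIL = auto()
    COLLECT_DETAILS = auto()
    CONFIRM_WITH_USER = auto()
    SEND_INVITE = auto()
    MONITOR_RESPONSES = auto()
    DONE = auto()


# Allowed transitions, keyed by (current state, event name).
TRANSITIONS = {
    (State.CLASSIFY_EMAIL, "is_scheduling_request"): State.COLLECT_DETAILS,
    (State.COLLECT_DETAILS, "details_missing"): State.COLLECT_DETAILS,
    (State.COLLECT_DETAILS, "details_complete"): State.CONFIRM_WITH_USER,
    (State.CONFIRM_WITH_USER, "user_confirmed"): State.SEND_INVITE,
    (State.CONFIRM_WITH_USER, "user_changed_request"): State.COLLECT_DETAILS,
    (State.SEND_INVITE, "invite_sent"): State.MONITOR_RESPONSES,
    (State.MONITOR_RESPONSES, "attendee_requested_reschedule"): State.COLLECT_DETAILS,
    (State.MONITOR_RESPONSES, "all_accepted"): State.DONE,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, or raise if the event is not allowed here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state.name}") from None
```

Writing the table out like this forces the design questions into the open: every missing `(state, event)` pair is a user situation nobody has decided how to handle yet.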

And while there is certainly some visual design that needs to accompany this sort of state modeling (e.g. email templates for sub-steps), I would argue that 90% of the design work at hand is more so this form of user needs modeling and designing workflows that fit user expectations.

I would argue that this dynamic fundamentally breaks a lot of the design assumptions that products like Figma are built around. While you can, of course, do this sort of higher abstraction workflow mapping in a tool like Figma - indeed, I made the above image in Figma - there are probably a different set of primitives you might build out to really optimize this sort of design task. Stately is an interesting example of what a more state-oriented design approach might look like for this.

The shifting role of UX designers in agentic companies

Complex logic flows and state diagrams decoupled from visual design are not necessarily a completely new thing - Slack’s push notification decision making logic is a fun, older, non-AI related example of something similar. But I think the difference is that in many agentic products, this now becomes the core UX design challenge of building the product altogether.

If you are Cognition - how do you model out all the steps and sub-steps that the system might take to build a large software engineering project on its own that is sufficiently understandable to the user and fits a user’s mental model? If you are Dosu, how do you navigate the right way to respond to the wide array of ways someone might ask you questions or respond to you in a Github issue?

In my experience - people building agentic systems like this actually see the biggest improvement in reliability, quality, and performance by doing this form of domain & user modeling and thinking through more structured state transitions. Hex, for example, saw some of its biggest improvements in Hex Magic when they started to much more tightly constrain the agentic control flow - enforcing things like the # of cells that could be output & the relative order of SQL vs. Python vs. data visualization cells. In essence - they more tightly modeled the state flow of the system.
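
As a rough illustration of what “constraining the control flow” can look like in code, here is a validator that rejects a generated plan unless it respects a cell limit and a fixed ordering. The limit, the cell type names, and the ordering are invented for the example and are not Hex's actual rules.

```python
# Hypothetical constraints: at most five cells, and SQL cells must come
# before Python cells, which must come before chart cells.
MAX_CELLS = 5
ORDER = {"sql": 0, "python": 1, "chart": 2}


def is_valid_plan(cell_types: list[str]) -> bool:
    """Check a proposed sequence of cell types against the constraints."""
    if not cell_types or len(cell_types) > MAX_CELLS:
        return False
    if any(t not in ORDER for t in cell_types):
        return False
    ranks = [ORDER[t] for t in cell_types]
    # The plan is valid only if the ranks never decrease.
    return ranks == sorted(ranks)
```

A check like this sits between the model's proposed plan and execution, so an invalid plan can be retried or repaired before the user ever sees it.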

Ironically, while getting the UX and interaction design correct in these systems is one of the hardest things to do well (and one of the most important things to get right), hiring UX designers in these companies is almost impossible, because the UX job to be done is so different from most SaaS apps. Less work needs to be done on UI, but way more work needs to be done on building mental models of user behaviors and state modeling of users. And that work can no longer implicitly be done by just mocking out visual workflows.

The job of a designer in a product like this is also more technical. There is a much tighter coupling between technical system design and interaction design. Proper UX design of agentic systems really requires a somewhat in-depth understanding of AI system design - which is why, today, most of this work is done by CEOs or engineering leads.

In a way, it feels like interaction design for products like this is fundamentally different from what it has looked like before. I wonder if this may lead to very different design workflows, design tools, and team dynamics between UX designers and engineering & product in agentic companies.

Non-destructive editing in design tools

Davis Treybig — Tue, 25 Jun 2024 16:37:47 GMT

One of the defining traits in design tools (e.g. architecture, electrical engineering, graphics design, etc) is whether the design paradigm is destructive or non-destructive.

In simple terms - a destructive paradigm means that only the result of an edit or modification is stored, whereas a non-destructive paradigm keeps track of the sequence of edits and dynamically computes the result from this sequence at render/compute/compile time.

To illustrate this, consider that you want to add pixel blur to a 2D image. A destructive approach would directly apply the blur to the baseline image - meaning the original image is no longer stored anywhere. A non-destructive approach would add a pixel blur “layer” on top of the baseline image, rendering the blurred image but also preserving the original image in its raw state in the base layer.

You can think about these architectural patterns as essentially a tradeoff between editability, understandability, and composability versus compute efficiency & system complexity.

Destructive design tools have greatly reduced computational requirements to run because they do not recompute the entire modification graph over and over as you add in more changes/edits. However, they are much more difficult to work with because you cannot go back and remove or modify individual changes.

In the above example - the non-destructive editor would allow you to go back later and change the configuration of the pixel blur or even change the base image, and everything else would still work. These sorts of retroactive modifications would not be possible in the destructive editor.
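
Here is a toy sketch of the non-destructive model. To keep it short I use a one-row grayscale “image” and two simple effects (`brighten` and `invert`) in place of a real blur; all of the names are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable

Image = list[int]  # a toy grayscale "image": one row of pixel values


def brighten(amount: int) -> Callable[[Image], Image]:
    return lambda img: [min(255, p + amount) for p in img]


def invert() -> Callable[[Image], Image]:
    return lambda img: [255 - p for p in img]


@dataclass
class NonDestructiveDoc:
    base: Image
    layers: list[Callable[[Image], Image]] = field(default_factory=list)

    def render(self) -> Image:
        # Recompute the result from the untouched base every time.
        result = list(self.base)
        for layer in self.layers:
            result = layer(result)
        return result


doc = NonDestructiveDoc(base=[0, 100, 200])
doc.layers.append(brighten(50))
doc.layers.append(invert())
first = doc.render()

doc.layers[0] = brighten(10)  # retroactively edit an earlier step
second = doc.render()
```

A destructive editor would instead overwrite the pixels after each effect, so by the time `invert` had run there would be no way to change the earlier `brighten`. The cost of the non-destructive version is visible in `render`, which replays every layer on each call.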

More broadly, non-destructive editing paradigms allow for much more “composability” in system design via functional editing paradigms - you can save a sequence of mutations/modifiers as a “function”, combine that with other mutations to create higher order “functions”, and repeat ad nauseam.

Non-destructive editors are also generally easier to reason about - the user can see what the final result was built up from. But, because they are storing this very complex compute graph which must be re-run every time a change is made or you need to show the result, they are much more computationally intensive to run. Similarly, it is much more complicated to build a non-destructive editing tool.

While virtually all tools in the software engineering & product design world are non-destructive - programming languages and software are a great example of a purely non-destructive paradigm, Figma is built entirely around layers and never does direct pixel manipulation - what is interesting is that a lot of the design tooling world is still destructive.

Photoshop is a good example of this - most filters & effects in Photoshop are destructive by default. While it is possible to use Photoshop in a way that makes such changes non-destructive via features like Smart Objects, it is not the “default” paradigm and it is something the user must think about and learn when using the tool.

Similarly, the majority of 3D design tools that revolve around meshes/triangles are destructive or only allow very specific subsets of their functionality to be modeled non-destructively (e.g. Blender). A notable exception to this is many of the newer parametric design tools in CAD such as Grasshopper and Fusion360. In such tools, everything is defined in terms of dimensions and constraints - e.g. this angle is 30 degrees, these lines are parallel - and you can go back at any point to alter those previous dimensions, which then recalculates the entire 3D object.

The impact of computational advances on destructive editing

Given that computation is the primary bottleneck for non-destructive design paradigms, it is interesting to consider how recent advances in computing are allowing for a plethora of new, natively non-destructive design tool startups to be built. For example:

Modyfi makes use of GPU acceleration via WebGPU to offer a fully non-destructive image & motion graphics editing tool. All visual & motion effects in Modyfi are layers, yet everything is still instantaneously previewed & rendered.

Modumate is a browser-native architecture tool built on top of recent 3D game engine improvements, which models buildings not just as surfaces, but as collections of 3D parts (e.g. doors, knobs, handles, walls, studs) which compose into the 3D model of the house.

Womp utilizes edge computing, pixel streaming, & ML-assisted rendering techniques to offer a non-destructive 3D editing tool. You create 3D shapes in Womp not by crafting the boundary conditions of your desired shape, but by combining different baseline shapes which then intersect, add, or subtract into your desired shape.

NTop is a 3D design tool for physical parts/components based on implicit modeling rather than boundary representations like meshes. 3D parts are created via complex combinations of fields and shapes, rather than simply defined by a triangle or mesh topology. This allows for much more sophisticated analysis, simulation, and optimization of 3D parts to be done, and also makes it much more feasible to build highly complex geometries.

Both Womp and NTop are built around signed distance fields (SDFs), a mathematical formulation for representing geometries that has been understood for a long time but was not widely adopted until recently due to its computational requirements. You can read more about the underlying mathematics of creating 3D shapes in this way here.
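As a rough illustration of the SDF idea, the sketch below represents each shape as a function that returns the signed distance from a point to the shape's surface (negative inside, positive outside), and combines shapes with `min` and `max`. This is the standard textbook formulation, not a description of either product's engine, and the helper names are my own.

```python
import math


def sphere(cx, cy, cz, radius):
    """SDF of a sphere: distance to the center minus the radius."""
    def sdf(x, y, z):
        return math.sqrt((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2) - radius
    return sdf


def union(a, b):
    """A point is inside the union if it is inside either shape."""
    return lambda x, y, z: min(a(x, y, z), b(x, y, z))


def intersection(a, b):
    """A point is inside the intersection only if it is inside both shapes."""
    return lambda x, y, z: max(a(x, y, z), b(x, y, z))


def subtract(a, b):
    """Carve shape b out of shape a."""
    return lambda x, y, z: max(a(x, y, z), -b(x, y, z))


# Two overlapping spheres merged together, with a smaller sphere hollowed out.
blob = union(sphere(0, 0, 0, 1.0), sphere(1.5, 0, 0, 1.0))
shape = subtract(blob, sphere(0, 0, 0, 0.5))

in_hollow = shape(0.0, 0.0, 0.0)    # positive: this point was carved away
in_body = shape(1.5, 0.0, 0.0)      # negative: this point is inside the solid
```

The original spheres are never modified; `shape` is just a composition of functions, so any input sphere can be moved or resized and the result re-evaluated. Note that `min`/`max` combinations always give a correct inside/outside test, but the value is not guaranteed to be an exact distance everywhere.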

A critical insight underlying all of these startups is that once it is technically feasible to overcome the computational limitations of non-destructivity in a given design tool category, you often then have the substrate to build a 10-100x better design tool. This improvement comes not just from the benefits that have already been discussed, but also from a wide array of second-order effects that stem from modeling a system non-destructively, such as:

Community

Modeling systems as a series of mutations allows things to be encapsulated as re-usable functions and components. This can become the basis of a community where people share components they have built that others can then use, edit, fork, etc. This is “obvious” in the software field, where anyone can create libraries that can be installed via package managers, yet is non-existent in so many other design domains.

Modumate is a great example of this - they offer a marketplace of community and company-sourced pre-built architectural components. This is not really possible in traditional BIM engines which only model the final surface geometry.

Simulation

Systems modeled non-destructively are much more amenable to simulation. It is easier to “sweep through” a range of different configuration options for each layer or node in your compute graph, testing or evaluating the final output.

Grasshopper’s procedural generation workflows are classic examples of this.
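A parameter sweep of this kind can be sketched in a few lines of Python. Everything here is an assumption for illustration: `shelf_model` is a hypothetical parametric model, and its stiffness formula is a toy proxy rather than real engineering analysis.

```python
from itertools import product


def shelf_model(width, depth, thickness):
    """Hypothetical parametric model returning derived properties of a shelf."""
    volume = width * depth * thickness
    # Toy stiffness proxy: grows with thickness cubed, falls with span cubed.
    stiffness = depth * thickness ** 3 / width ** 3
    return {"volume": volume, "stiffness": stiffness}


widths = [60, 80, 100]
thicknesses = [1.5, 2.0, 2.5]

# Evaluate the model at every combination of the swept parameters.
results = []
for w, t in product(widths, thicknesses):
    results.append(((w, t), shelf_model(width=w, depth=25, thickness=t)))

# Keep designs that meet a (made-up) stiffness threshold, then pick the
# one that uses the least material.
feasible = [r for r in results if r[1]["stiffness"] >= 3e-4]
best = min(feasible, key=lambda r: r[1]["volume"])
```

The sweep is only possible because the model is a function of its parameters. In a destructive workflow, each of the nine variants would have to be rebuilt by hand.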

Optimization

Systems modeled non-destructively can be optimized at a “meta” level by the compute engine, which can look at all the changes that should be applied together holistically and “compile” them down to something more efficient.

A simple, somewhat contrived example of this in the graphics domain might be as follows - if you apply a series of 10 visual modifiers to an image and the 11th layer is a new image with 100% opacity, then you can ignore all underlying layers at render time. Modyfi is able to do real-time motion graphics in the browser via optimizations of this sort.
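The contrived example above can be written down as a small pruning pass over a layer stack. This is a sketch under stated assumptions, not Modyfi's actual renderer: the `Layer` type, its fields, and the `covers_canvas` flag (meaning the image has no transparent pixels and fills the whole canvas) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    kind: str                   # "image" or "modifier"
    opacity: float = 1.0
    covers_canvas: bool = True  # assumed: no transparent pixels, fills canvas


def prune_hidden_layers(stack):
    """Return only the layers that can affect the final render.

    `stack` is ordered bottom to top. Everything below the topmost fully
    opaque, canvas-covering image is hidden and can be skipped.
    """
    for i in range(len(stack) - 1, -1, -1):
        layer = stack[i]
        if layer.kind == "image" and layer.opacity >= 1.0 and layer.covers_canvas:
            return stack[i:]
    return stack


stack = [Layer("photo", "image")]
stack += [Layer(f"effect_{n}", "modifier") for n in range(10)]
stack.append(Layer("cover", "image", opacity=1.0))

to_render = prune_hidden_layers(stack)  # only the "cover" layer remains
```

The pruning leaves the user's layer stack untouched; it only changes what the engine evaluates. If the cover layer's opacity is later lowered, the ten modifiers come back into play on the next render.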

Higher order system design

Non-destructive modeling typically allows for much more complex objects or systems to be built because of its composability benefits. This makes it easier to encapsulate logic, divide it amongst different people on a team, test sub-systems, and so on.

This is a key reason why many of the functional 3D modeling domains, such as computer-aided design for industrial design, electronics, and mechanical engineering, have moved so aggressively to parametric, non-destructive workflows - these are exceptionally rich, complex systems which benefit particularly from a non-destructive paradigm. In contrast, 3D models for rendering (e.g. animations and videos) are in theory less rich and complex.

The Startup Opportunity Around Non-Destructive Editing

I observe that rapid advances in many areas of computing, including graphics, machine learning, edge databases, pixel streaming, distributed systems, hardware acceleration, and embedded processing in the browser (e.g. WebAssembly), are fundamentally changing what is computationally feasible in many design tools today.

I suspect that in many cases, this confluence of technology advances suddenly allows a purely non-destructive paradigm to be applied in various design tool categories. As a result, I think we will see many more startups emerge along the lines of Modyfi, Womp, and NTop which build around non-destructivity as a core wedge to rethink their category.

I think one of the most compelling variations of this is going after categories where non-destructivity is possible, but requires a specialized workflow - Photoshop and Blender both being good examples. Such products tend to get bloated with a lot of UX complexity, as non-destructivity gets added in bits and pieces on top of a fundamentally destructive baseline. This requires significant user education and forces the user to maintain a mental model of how they have modeled each piece of their system. When you can instead make non-destructivity the ubiquitous default, you actually simplify the product while simultaneously enhancing what can be done with it.