Software Systems in a World of LLMs
As LLMs change how software is written, software systems may need to evolve
TLDR
Software systems (languages, frameworks, etc) are likely to change in a world where AI systems read & write a substantial amount of code.
Properties like type safety, debuggability, verifiability, testability, and similar will become dramatically more important.
Expertise in areas like compilers, formal methods, and program analysis is going to become extremely valuable for generative AI products that write software.
Some software systems may actually end up being used more by AI than by humans, and as a result may benefit from being designed for an AI user.
_
LLMs are clearly massively disruptive to how software engineering is done. Copilot products are on their way to becoming a universal tool that essentially every software engineer uses, I increasingly see agent-based startups effectively automating very complex software tasks (e.g. Grit, Dosu), and we are likely just at the beginning of purpose-built code models (e.g. LTM-1).
If you play this forward, it is very likely that AI models become essentially an equal stakeholder to humans in both reading and writing software. In this situation, it would not surprise me to see software systems (e.g. languages, frameworks, libraries, etc) evolve as a result. This essay explores a few of the ways I think this might take shape.
Code Verification & Formal Methods
One of the most foundational challenges to wrestle with in the LLM space is hallucinations - how do you ensure that the output of the LLM is actually doing what you expect it to do?
There are numerous classical methods of analyzing programs to understand their behavior, and it would not surprise me to see many of them become dramatically more important in a world where a lot of code is written by LLMs but validated & refined by humans.
Strong types and type safety are a simple example of this. Type safety has traditionally been seen as a tradeoff: it makes software easier to validate at compile time and code more readable, at the cost of increased boilerplate and “friction” when writing it. LLMs make this downside much less significant, since the LLMs are writing most of the code, but substantially increase the value of the upside, since humans are now spending more time reading and validating the AI’s code. I suspect this will not only shift behavior towards existing languages and frameworks that are strongly typed (e.g. Rust, TypeScript), but could perhaps even eventually lead to purpose-built languages or frameworks designed to better deal with the impact of LLMs (e.g. see the Rusty Types paper for an interesting exploration of even stronger type guarantees in a language).
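As a concrete (and deliberately simplified) illustration in TypeScript, a strongly typed interface shifts a whole class of review work from the human to the compiler; the interface and field names here are invented for the example:

```typescript
// Illustrative only: with a strongly typed interface, the compiler, not a human
// reviewer, catches an LLM that invents a field or passes the wrong shape of data.
interface Invoice {
  id: string;
  amountCents: number; // integer cents, not a float of dollars
  dueDate: Date;
}

function totalDueCents(invoices: Invoice[]): number {
  return invoices.reduce((sum, inv) => sum + inv.amountCents, 0);
}

// If a model emits `totalDueCents([{ id: "a1", amount: 12.5 }])`,
// tsc rejects it at compile time -- no human needs to spot the renamed
// field or the dollars-vs-cents confusion during code review.
```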
Beyond type safety, there is likely significant opportunity to do work in compilers more broadly in the context of LLMs. There are already interesting examples in the LLM research domain of compiler techniques being used to improve LLM robustness; this paper utilizes compiler techniques to make LLMs more robust to semantics-preserving edits in programs. Similarly, many startups I meet using agent-based systems for things like code optimization make extensive use of compiler techniques to prove the correctness of their proposed code changes.
There are numerous other techniques in computer science literature for program analysis, including formal verification and static analysis of abstract syntax trees, all of which may substantially improve our ability to verify the output of LLMs and act as a “check” on their behavior in the context of code generation.
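As a small sketch of what such a “check” can look like in practice, the TypeScript compiler API can walk the AST of generated code and flag constructs we don’t want a model emitting; the specific rule here (banning eval) and the snippet being checked are just illustrations:

```typescript
import * as ts from "typescript";

// A minimal sketch of AST-level static analysis over LLM output: parse the
// generated snippet and flag calls to eval(), which we might disallow.
const generated = `const x = eval(userInput); console.log(x);`;

const source = ts.createSourceFile(
  "llm-output.ts",
  generated,
  ts.ScriptTarget.ES2020,
  /* setParentNodes */ true
);

function findEvalCalls(node: ts.Node): void {
  if (
    ts.isCallExpression(node) &&
    ts.isIdentifier(node.expression) &&
    node.expression.text === "eval"
  ) {
    const { line } = source.getLineAndCharacterOfPosition(node.getStart());
    console.log(`Disallowed eval() call at line ${line + 1}`);
  }
  ts.forEachChild(node, findEvalCalls);
}

findEvalCalls(source);
```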
In a related vein, some programming paradigms & languages are much more suited to testing and verification. Functional programming is a good example of this - functional languages such as Haskell, Clojure, Lisp, & Erlang are much easier to derive formal proofs for because of the way they are structured (a reason why they are often taught in academic settings despite not being widely used in industry). This overview of Lisp gives a deeper dive into this concept.
While I do not expect these languages to suddenly become widely adopted as a result of LLMs, I do wonder whether some of the properties of these more mathematically “pure” languages may become more valuable in a world of LLMs. More broadly, I think properties like debuggability, verifiability, & testability will all become dramatically more important as AI writes more software for us.
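As a rough illustration of why purity helps with testability, a pure function can be checked mechanically with property-based testing (here using the fast-check library in TypeScript); the reverse-twice property is a toy example, but the same pattern could gate LLM-written refactors of pure code:

```typescript
import fc from "fast-check";

// Pure function: no hidden state, so its behavior can be characterized by
// properties rather than hand-picked examples.
const reverse = <T>(xs: T[]): T[] => [...xs].reverse();

// Property: reversing twice is the identity.
fc.assert(
  fc.property(fc.array(fc.integer()), (xs) => {
    const roundTripped = reverse(reverse(xs));
    return (
      roundTripped.length === xs.length &&
      roundTripped.every((v, i) => v === xs[i])
    );
  })
);
```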
Interestingly, many of the best startups I see applying LLMs to different coding tasks have deep expertise within their teams in compilers, formal methods, static analysis, and the other areas I am describing. Ironically, I actually think that expertise in these areas of verification is more valuable than expertise in LLM research for such startups, given it is both rarer skill yet also critical to make AI code generation reliable and robust.
Software Written for AI Models
While verification techniques provide ways of controlling the output of language models in coding, it is also important to consider what is input to language models. To generate code, language models must read and understand your codebase, and it is increasingly clear that certain ways of structuring a codebase are much more amenable to allowing a model to understand it than others.
One startup I work with has directly seen that their agent-based system works dramatically better on “flatter” repositories with much less inheritance than on highly structured, object-oriented repositories with lots of inheritance and logic that spans many files. In other words, some software architectures are more “promptable” than others. Interestingly, there appears to be some tradeoff between what humans find most readable & understandable and what AI models do.
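To make this concrete, here is a toy TypeScript contrast; the payment logic is invented, but the structural difference is the point:

```typescript
// Inheritance-heavy style: to know what charge() does, a model must resolve an
// inheritance chain, which in a real codebase is often spread across several files.
abstract class PaymentProcessor {
  abstract fee(amountCents: number): number;
  charge(amountCents: number): number {
    return amountCents + this.fee(amountCents);
  }
}
class CardProcessor extends PaymentProcessor {
  fee(amountCents: number): number {
    return Math.round(amountCents * 0.029) + 30;
  }
}

// "Flatter" style: identical behavior, but everything the model needs to
// understand fits in a single self-contained function (and prompt window).
function chargeCard(amountCents: number): number {
  const fee = Math.round(amountCents * 0.029) + 30;
  return amountCents + fee;
}
```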
Will “design patterns” emerge for writing code in a way that ensures models are most able to understand it? Taking this a step further, might certain software systems end up being used more by AI systems than humans, and as a result, be written explicitly for language models?
One example of this might be public APIs like the GitHub API. I observe that Copilot is now the default way most developers make API calls like this, because it is difficult to remember the specs of public APIs you rarely use, but the APIs are very well documented, so LLMs are very good at writing calls to them. A good heuristic for this type of use case is any API or library where, traditionally, you would have just copied and pasted the answer from StackOverflow.
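The canonical case looks something like the snippet below: a routine call to the GitHub REST API that few developers write from memory anymore (the wrapper function, its parameters, and the token handling are just illustrative):

```typescript
// The kind of boilerplate API call developers once copied from StackOverflow
// and now typically let Copilot write: listing open issues for a repository.
async function listOpenIssues(owner: string, repo: string, token: string) {
  const res = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/issues?state=open`,
    {
      headers: {
        Accept: "application/vnd.github+json",
        Authorization: `Bearer ${token}`,
      },
    }
  );
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  return res.json();
}
```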
It is very conceivable to me that LLMs and Copilot-like features become the primary way that code to access such APIs is written. If this becomes the case, I wonder: would a different API abstraction emerge, one that is worse for humans trying to understand the API but better for LLMs?
Similar thoughts could be explored for everything from software documentation to other software artifacts like API specs such as Swagger or data formats like protocol buffers. As AI becomes a more central participant in how all of these information sources are digested and utilized, it may be useful to reconsider some aspects of how they convey information. Even things like error messages are interesting to consider - if an agent attempts to write code and it doesn’t compile or there is a runtime error, the quality of the error message is going to play a huge role in whether the agent can correct its own mistake; what might “Agent focused” error messages look like?
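As a purely speculative sketch, an agent-focused error might look less like a sentence and more like structured data the agent can act on directly; every field name below is invented for illustration:

```typescript
// Speculative: a structured error format aimed at an agent rather than a human.
interface AgentError {
  code: string;             // stable, machine-matchable identifier
  message: string;          // the human-readable message we have today
  failingSymbol?: string;   // e.g. the function or config key involved
  suggestedFixes: string[]; // concrete next actions an agent could try
  docsUrl?: string;         // where to read more before retrying
}

const example: AgentError = {
  code: "TYPE_MISMATCH",
  message:
    "Argument of type 'string' is not assignable to parameter of type 'number'.",
  failingSymbol: "totalDueCents",
  suggestedFixes: [
    "Parse the string with Number() before calling totalDueCents",
    "Change the parameter type if string input is actually intended",
  ],
};
```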
From Code to Operations
Many startups I see applying LLMs to coding are now delving beyond just writing code, and getting into running or managing production systems. For example - could an agent-based system act as an “on-call” SRE, serving as a first line of defense for fixing things that break in production? While it is unclear to me how long it will take for these products to get good enough to be trusted for this sort of “mission critical” task, it wouldn’t surprise me to start seeing this in the next few years in a limited capacity.
In a world where LLMs are “operating” production systems in this vein - configuring databases, provisioning infrastructure, etc - it wouldn’t surprise me to see many such systems re-architected around the idea that both AIs and humans are managing them. What would an “AI Agent” role/scope look like in a cloud account? How will such systems more explicitly define what should be done only by a human vs. what should be generally tweakable by an AI? Will it eventually be a competitive advantage to build certain software products entirely around the idea that they can safely be controlled by an AI? These sorts of questions will be interesting to consider moving forward.
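As one purely hypothetical sketch of the first question, an “AI Agent” role might be declared as data that separates what the agent may do autonomously from what requires a human in the loop; the names and structure here are invented and mirror IAM-style policies rather than any real provider’s API:

```typescript
// Hypothetical: a scoped role for an on-call remediation agent in a cloud account.
const agentRole = {
  name: "oncall-remediation-agent",
  allowedActions: [
    "metrics:Read",
    "logs:Read",
    "service:RestartInstance",
    "cache:Flush",
  ],
  requiresHumanApproval: [
    "database:ModifySchema",
    "infra:DeleteResource",
    "iam:ChangePermissions",
  ],
  limits: { maxActionsPerHour: 10, changeWindow: "business-hours" },
};
```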
_
While some of these ideas are perhaps a bit too extreme and out there, I do think there will be dramatic second order effects of LLMs in software engineering beyond the adoption of Copilot-esque tools. Shoot me a note if you have other ideas on how software systems may change in a world where LLMs play a key role in software engineering.