LLMs as compilers
Recently, there have been a few interesting papers that essentially treat LLMs as compilers - most notably KernelBench and Sakana’s AI CUDA Engineer (though note that the Sakana results had some issues with the model reward hacking the evaluation).
Each paper used an LLM to take higher-level PyTorch code and write optimized GPU kernels for it. Roughly, the idea in each case is as follows:
1. Take the high-level PyTorch code
2. Have the LLM write a candidate CUDA kernel implementation
3. Test whether the CUDA code compiles, whether it appears to be correct (by sampling input/output pairs and comparing against the PyTorch code), and whether it is faster than the PyTorch code (by profiling the execution)
4. Have a feedback loop between #3 and the LLM - iteratively refining the CUDA kernel until you get a result you like (or you hit a pre-determined stopping threshold). A rough sketch of this loop is below.
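To make the loop concrete, here is a minimal sketch of what such a generate/verify/profile harness could look like. This is not the actual KernelBench or Sakana harness: `ask_llm_for_kernel` is a hypothetical stand-in for whatever model call produces CUDA source, and the rest uses standard PyTorch tooling (`load_inline` for compilation, random inputs for the correctness check, wall-clock timing for speed).

```python
import time
import torch
from torch.utils.cpp_extension import load_inline

def matches_reference(kernel_fn, ref_fn, trials=10):
    # Step 3a: correctness check by sampling random inputs through both paths.
    for _ in range(trials):
        x = torch.randn(1024, 1024, device="cuda")
        if not torch.allclose(kernel_fn(x), ref_fn(x), rtol=1e-3, atol=1e-3):
            return False
    return True

def mean_runtime(fn, x, iters=100):
    # Step 3b: crude profiling - synchronize around the timed region.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

def search_for_kernel(ref_fn, max_attempts=8):
    x = torch.randn(1024, 1024, device="cuda")
    baseline = mean_runtime(torch.compile(ref_fn), x)
    feedback = None
    for attempt in range(max_attempts):
        # Step 2: hypothetical LLM call that returns CUDA source defining
        # a `forward(torch::Tensor)` function, given the reference + feedback.
        cuda_src = ask_llm_for_kernel(ref_fn, feedback)
        try:
            # Step 3: does it even compile?
            mod = load_inline(
                name=f"candidate_{attempt}",
                cpp_sources="torch::Tensor forward(torch::Tensor x);",
                cuda_sources=cuda_src,
                functions=["forward"],
            )
        except RuntimeError as err:
            feedback = f"compilation failed:\n{err}"
            continue
        if not matches_reference(mod.forward, ref_fn):
            feedback = "kernel compiled but produced incorrect outputs"
            continue
        runtime = mean_runtime(mod.forward, x)
        if runtime < baseline:
            return mod.forward  # keep the win, throw everything else away
        # Step 4: feed the profiling result back and try again.
        feedback = f"correct but slower than torch.compile: {runtime:.2e}s vs {baseline:.2e}s"
    return None
```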
The results of this are, in my opinion, really interesting. Specifically, let’s consider the KernelBench paper. While many of their results were not correct and/or did not speed up the kernel relative to torch.compile(), they also showed various situations that led to dramatically faster, correct kernels. For example, their diagonal matrix multiplication kernel was 12x faster than torch.compile(), and another kernel that performs matrix multiplication, division, summation, and then scaling was 3x faster than torch.compile().
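The diagonal case is also a nice illustration of why a specialized implementation can win by so much. The generated CUDA isn’t reproduced here, but the gist at the PyTorch level (assuming the problem is literally diag(d) @ B) is that the general path does a full O(N^3) matmul against a matrix that is almost entirely zeros, while the specialized version is just a row-wise scale:

```python
import torch

N = 4096
d = torch.randn(N, device="cuda")
B = torch.randn(N, N, device="cuda")

general = torch.diag(d) @ B        # materializes an N x N matrix, then a full matmul
specialized = d.unsqueeze(1) * B   # same result as a simple row-wise scale

assert torch.allclose(general, specialized, rtol=1e-3, atol=1e-3)
```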
What’s cool about this, though, is that because writing kernels is automatically measurable via verification loops, the negative cases don’t really matter relative to the positive cases. You can simply search until you find good results that show speedups, use those, and throw away everything else. This is a great example of the sorts of “design + verify” style systems that LLMs are so, so well suited for.
This is an interesting paradigm because you are, in essence, replacing a compiler with a large language model based system - and the two have profoundly different tradeoffs. Compilers are deterministic, so you are guaranteed correctness, and they also tend to generalize very well, but they often leave juice to be squeezed relative to hand-crafted implementations. This LLM-based process is non-deterministic and often produces invalid results, but it can also produce extremely specialized custom kernels for the specific functions you want to optimize.
This trade-off feels like it may be especially worth it in the area of ML systems - where in many cases the goal is to optimize a small set of very specific operations to the nth degree rather than to support general-purpose code optimization in a robust way. Furthermore, sacrificing some correctness for speed/performance is not a new concept in machine learning systems - quantization does exactly the same thing, trading numerical exactness for throughput and memory.
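As a tiny illustration of that same trade, int8 quantization gives up exact values in exchange for cheaper storage and compute, and we simply accept the bounded error:

```python
import torch

x = torch.randn(4)
q = torch.quantize_per_tensor(x, scale=0.05, zero_point=0, dtype=torch.qint8)

print(x)                                  # original fp32 values
print(q.dequantize())                     # values after the int8 round trip
print((x - q.dequantize()).abs().max())   # the error we agreed to live with
```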
ML has slowly been creeping into systems optimization for a while - and I suspect it can be taken much further with LLMs in many cases now. It wouldn’t surprise me if these ideas could be applied in many other areas of systems as well - e.g. distributed systems, databases, etc.