Trends & Open Questions in Retrieval for Foundation Models
I have been spending a lot of time recently thinking about information retrieval for foundation models. This is an overview of some current observations on the space & some thoughts and questions about where it may go next.
Broad Observations
LLM applications continue to look more and more like traditional recommendation systems. Ideas from information retrieval and RecSys continue to cross-pollinate into LLM startups. Teams with backgrounds in these areas have immense advantages.
Most of the value & IP in applied LLM startups seems to be derived from information retrieval, not AI. This is where most of the “juice to squeeze” is, so to speak.
Building a proper information retrieval pipeline for LLM products is really hard. You have to cobble together a lot of tools and build a lot in-house. Eval, testing, and maintainability are all complex.
Specific Trends & Emerging Architectural Patterns
Hybrid Retrieval - Lots of people are moving to hybrid retrieval architectures that combine vector search with lexical/BM25 search. Today, this often requires combining two distinct products - e.g. Elastic + FAISS or Elastic + Pinecone. The only product that seems to do true hybrid search well is Vespa, but it is really hard to use. The whole vector DB market is slowly realizing that the category is “Search”, not “Vector Database”, and every vendor is trying to reposition as a hybrid search provider.
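As a rough sketch of what the fusion step in a hybrid pipeline can look like, here is reciprocal rank fusion, one common way to merge the two result lists; `vector_search` and `bm25_search` are hypothetical stand-ins for whatever vector index and lexical engine you actually run:

```python
# Minimal sketch of hybrid retrieval with reciprocal rank fusion (RRF).
# `vector_search` and `bm25_search` are hypothetical stand-ins for your
# actual vector index and BM25/lexical engine.

def reciprocal_rank_fusion(result_lists, k=60, top_n=10):
    """Fuse ranked lists of doc ids: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def hybrid_search(query, top_n=10):
    dense_hits = vector_search(query, limit=50)   # ranked doc ids from the vector index
    lexical_hits = bm25_search(query, limit=50)   # ranked doc ids from BM25 search
    return reciprocal_rank_fusion([dense_hits, lexical_hits], top_n=top_n)
```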
Reranking - More and more teams are adding multi-stage rerankers to their retrieval pipelines, using either products like Cohere’s reranking models or simply building their own rerankers with LLMs.
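A minimal sketch of that second stage, assuming a cross-encoder from sentence-transformers as the reranker (an LLM prompted to score relevance slots into the same place):

```python
# Second-stage reranking sketch: score each (query, passage) pair jointly
# with a cross-encoder, then keep the best-scoring passages.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=5):
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_n]]
```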
Index Management - More complex index structures are becoming commonplace. I see many teams creating multiple derived index structures for each chunk in their corpus. You might create a dense vector embedding for the raw chunk, a dense vector embedding for a set of questions that relate to the chunk (produced via an LLM), a sparse embedding of keywords related to the chunk, etc. Different user queries are then routed to these index structures in intelligent ways. I also see interesting explorations around more complex graph-based index structures, especially in code.
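A sketch of what “multiple derived representations per chunk” might look like in code, with `embed_dense`, `generate_questions`, and `extract_keywords` as hypothetical stand-ins for your embedding model, question-generating LLM, and sparse encoder:

```python
# Sketch of derived index structures per chunk; the helpers are hypothetical.
from dataclasses import dataclass, field

@dataclass
class IndexedChunk:
    chunk_id: str
    text: str
    dense_vector: list | None = None                        # embedding of the raw chunk
    question_vectors: list = field(default_factory=list)    # embeddings of LLM-generated questions
    keyword_terms: dict = field(default_factory=dict)       # sparse keyword weights

def build_derived_indexes(chunk_id: str, text: str) -> IndexedChunk:
    chunk = IndexedChunk(chunk_id=chunk_id, text=text)
    chunk.dense_vector = embed_dense(text)
    chunk.question_vectors = [embed_dense(q) for q in generate_questions(text)]
    chunk.keyword_terms = extract_keywords(text)
    return chunk
```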
Algorithms - ColBERT and SPLADE are both gaining steam. Bridge models are also an interesting area of exploration.
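For intuition, ColBERT-style late-interaction scoring (MaxSim) boils down to something like this numpy sketch, assuming L2-normalized token embeddings:

```python
# Each query token is matched to its best document token (MaxSim),
# and the per-token maxima are summed to get the document score.
import numpy as np

def colbert_score(query_token_embs: np.ndarray, doc_token_embs: np.ndarray) -> float:
    # query_token_embs: (num_query_tokens, dim), doc_token_embs: (num_doc_tokens, dim)
    similarity = query_token_embs @ doc_token_embs.T   # token-by-token cosine similarities
    return float(similarity.max(axis=1).sum())         # MaxSim per query token, then sum
```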
Cost is a huge challenge - The cost of retrieval is prohibitively high for a lot of companies that want to implement LLM features at scale. Some of the biggest drivers of this are: most vector DBs using fully in-memory index structures, the use of very high-dimensional embedding models, and retrieval pipelines that use LLMs to create derivative metadata or index structures. It is quite clear that the industry will move to hybrid disk-based ANN algorithms (see Lance, Turbopuffer) as one way to reduce cost. Many people are also exploring lower-dimensional dense embeddings.
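Some back-of-the-envelope math on why fully in-memory indexes get expensive (raw float32 vectors only, before any graph or index overhead):

```python
# Rough memory math for an in-memory vector index.
def index_memory_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    return num_vectors * dim * bytes_per_value / 1e9

print(index_memory_gb(100_000_000, 1536))  # ~614 GB of RAM for 100M x 1536-dim vectors
print(index_memory_gb(100_000_000, 384))   # ~154 GB if you drop to 384 dimensions
```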
Embeddings - Most teams still seem to just pick an embedding model from the MTEB leaderboard, even though embedding models have such an impact on task accuracy. There need to be better ways for teams to find & evaluate specialized embedding models - e.g. see how Voyage AI is doing some cool work with things like voyage-code-2
Retrieve, then expand - Many teams are moving to multi-step retrieval architectures where you start by retrieving relevant chunks of data via some search algorithm, and then you “expand” each chunk by finding related data in the original data structure. For example - you might start by using vector search to find all the code chunks related to a user’s query in some embedding space, but then enrich each retrieved code chunk by adding the two closest functions in the original file.
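A minimal sketch of that pattern, with `vector_search` and `neighboring_functions` as hypothetical helpers over your index and the original source files:

```python
# "Retrieve, then expand": vector search finds the best-matching code chunks,
# then each hit is enriched with neighboring functions from the original file.
def retrieve_and_expand(query, top_k=10, neighbors=2):
    hits = vector_search(query, limit=top_k)
    expanded = []
    for chunk in hits:
        context = neighboring_functions(chunk.file_path, chunk.span, n=neighbors)
        expanded.append({"chunk": chunk.text, "context": [c.text for c in context]})
    return expanded
```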
Chunking - Chunking still feels a bit like a dark art. What granularity should you chunk at? Should you use overlapping chunk windows? Should you chunk at multiple granularities? Most of these things simply seem to require a lot of experimentation.
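For reference, the simplest version of an overlapping-window chunker looks roughly like this; the window and overlap sizes are exactly the knobs that tend to need experimentation:

```python
# Character-based sliding-window chunker; assumes overlap < window.
def chunk_text(text: str, window: int = 1000, overlap: int = 200):
    step = window - overlap
    return [text[i:i + window] for i in range(0, max(len(text) - overlap, 1), step)]

# document_text is whatever raw document you are indexing
chunks = chunk_text(document_text, window=1000, overlap=200)
```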
Query Expansion & Routing - Most products now do a lot of pre-processing and query enrichment before initiating search, reformulating the user’s query to improve result quality. In addition, I also see a lot more products doing various forms of query classification to dynamically route the query to different search systems or index structures.
So many components - If you sum up all of the above, information retrieval pipelines are getting really, really complex. Is it maintainable to be using different services for data extraction & pre-processing, index management, vector search, lexical search, embeddings, rerankers, and models?
Evaluation - Evaluating retrieval systems of this complexity is extremely hard without investing a lot of human time and effort to build golden test sets of queries, datasets, and relevance judgments for each step of the retrieval pipeline.
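If you do build a golden set, the per-query metrics themselves are simple - e.g. recall@k and MRR over a mapping from queries to known-relevant doc ids (`search_fn` here is whatever retrieval pipeline you are testing):

```python
# Evaluate a retrieval function against a golden set of {query: {relevant doc ids}}.
def evaluate(golden_set, search_fn, k=10):
    recalls, reciprocal_ranks = [], []
    for query, relevant_ids in golden_set.items():
        results = search_fn(query, limit=k)
        retrieved_relevant = {doc_id for doc_id in results if doc_id in relevant_ids}
        recalls.append(len(retrieved_relevant) / len(relevant_ids))
        first_hit = next((rank for rank, d in enumerate(results, 1) if d in relevant_ids), None)
        reciprocal_ranks.append(1.0 / first_hit if first_hit else 0.0)
    n = len(golden_set)
    return {"recall@k": sum(recalls) / n, "mrr": sum(reciprocal_ranks) / n}
```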
PG-Vector & FAISS - Postgres + PGVector is probably the ideal developer experience for data management & search in information retrieval pipelines. One database for raw data, vector embeddings, lexical search, & metadata. However, at least today, it doesn’t seem to perform well in the 10-100M+ vector range. I see companies similarly struggle with running OSS vector index libraries such as FAISS at this 10-100M vector scale. This is the point where cost becomes high & distributed systems (sharding & partitioning of vectors) become relevant.
Vector Index Management - The other impetus I often see for moving to “proper” vector databases is demands on read-after-write behavior and freshness of the vector index. If deletes, writes, and updates need to be reflected in the vector index very rapidly, you typically need to move beyond FAISS or similar libraries.
Thoughts & Questions Moving Forward
“Vector databases” as a category will stop being talked about. “Search databases” or “Information Retrieval Systems” are the category. All vendors are “racing to the middle” in terms of supporting full hybrid search with pre-filtering, metadata filters, etc. The opportunity, in my opinion, is “Vespa, but easy to use”. No one has nailed this yet despite how many players are in the space.
Chunking & index management is really important but feels relatively under-tooled to me. I think there will be more companies here.
There are probably opportunities for more of these components to be bundled - e.g. why can’t I give my vector DB an embedding model and have it handle embedding inference for me? It wouldn’t surprise me to see companies doing a lot in retrieval, like Cohere, ultimately offer some kind of bundled retrieval service.
More observability and intelligence is needed to assist developers in building & optimizing retrieval pipelines.
People with information retrieval backgrounds are massively under-valued in applied LLM startups. Any startup doing RAG should hire aggressively for people with this background.
What will “RAG” look like in non-text modalities? For example, a model which takes video files as an input may benefit from a retrieval pipeline that only gives it the key frames of the video that matter for the given query. A model that takes image files as an input may benefit from segmenting or cropping the image to only the relevant piece before it is fed into the model. It is interesting to consider the extent to which such systems will be built in a similar or different way from current RAG pipelines.