Converging Approaches in LLMs

About 6 months ago, I felt very uncertain about the direction practical LLM use would take. It was clearly going to happen fast, but I wasn't sure which techniques most companies would need for applied use cases. I've been reading the literature and playing around with LLMs a lot, both for work and as a hobby. Over the last few weeks, I finally feel like I see some convergence on the approaches companies will take to use LLMs in practice. I am happy people are making foundational improvements, but I am personally interested in how we can apply LLMs to create value for customers and stakeholders. On that front, I do think we finally have a good idea of what it will look like.

At a high level, I think we can expect:

  • Foundation models will consolidate and are not a space for startups. I expect almost no company should try to build its own.
  • RAG (retrieval augmented generation) is likely where most of the effort will go.
  • Fine-tuning will be commonly used, as it really helps improve RAG outputs.
  • Chained prompts will become more commonly used.
  • Human-in-the-loop prompts will dramatically outperform fully automated ones in most workflows.

Foundation Models will consolidate. I think we will see a few more startups try here, but it is hard to see how they can compete long term against AWS offering foundation models (Amazon Bedrock and the recent big investment in Anthropic), Microsoft's heavy investment (both in OpenAI and in their own models), OpenAI's current dominance at the top of the field (GPT-4), Google's existential fight to keep up (Gemini), and Facebook trying to commoditize the rest (with Llama). Foundation models primarily require big compute and big data. Right now I don't see how smaller players keep up.

So I think most people will pick the latest foundation model based on price/performance characteristics. Small open-source models for local-first, cheap inference. Fine-tuned GPT-3.5-like models for good performance at reasonable cost and latency. GPT-4 for good overall performance at higher prices. We will see what a fine-tuned GPT-4 looks like, but I expect it will make sense for the most important use cases.

RAG (Retrieval Augmented Generation) is not my favorite term honestly, but it effectively just means "providing your LLM context in the prompt." For most real-world use cases of LLMs, you don't want them relying solely on what the foundation model learned in training. I expect RAG to effectively always be used in real-world applications.
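To make "providing your LLM context in the prompt" concrete, here is a minimal sketch. The function name, the prompt wording, and the sample passage are all made up for illustration; how the passages get retrieved is left out on purpose, and sending the prompt to a model is up to whatever API you use.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Put retrieved passages into the prompt ahead of the question."""
    # Number each passage so the model (and a human reviewer) can refer to it.
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


if __name__ == "__main__":
    # Illustrative passage, not real policy text.
    passages = ["Subscriptions can be cancelled at any time from the billing page."]
    print(build_rag_prompt("What is your cancellation policy?", passages))
```

The resulting string is what you would send as the prompt; everything else in a RAG system is about choosing which passages go into it.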

Now, the tricky part I have found with RAG is the reliance on embedding models. In my experience, it is very hard to get embeddings to work reliably with plain foundation models. It requires a lot of fine-tuning of the embedded set if you are working with documents. For example, I have taken a set of docs and asked "What is your cancellation policy?", and the LLM often responds with hallucinated answers even when there is an exact keyword match in the embedded documents. It does depend on the use case, but dumping documents into a chunker-embedder has not been a reliable way of getting the LLM to reply with accurate data. I expect most companies will find the solutions to these problems to be the following:

Larger context windows. This increases cost, but expanding the context window allows you to overcome most weaknesses of RAG. It is basically a brute-force approach, but it works! Anthropic's 100k context window could really change the game for most use cases (I have not tried it yet). Even OpenAI's 32k context window makes a big difference. It does feel like context windows today are a bit like RAM through the 2000s. When I got my first computer, the sales rep said 16MB was more RAM than I would ever need. Context windows seem like the kind of thing we would easily find useful up to 1M tokens and beyond, with diminishing returns beyond 10M tokens (outside some specific use cases, such as the law, which may require loading huge volumes of case law). Not that you would use all of this in most applications, just that it would be useful!
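A rough sketch of why a bigger window is a brute-force fix: with more room, you can simply pack more of your ranked chunks into the prompt, so a mediocre ranking hurts less. The function name is hypothetical, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
def pack_context(chunks: list[str], budget_tokens: int) -> list[str]:
    """Take chunks in ranked order until the token budget is used up."""
    packed: list[str] = []
    used = 0
    for chunk in chunks:
        cost = len(chunk.split())  # assumption: word count approximates tokens
        if used + cost > budget_tokens:
            break  # the next chunk would overflow the window
        packed.append(chunk)
        used += cost
    return packed
```

With a small budget, only the top-ranked chunk or two make it in, so the retriever has to be right. With a large budget, the right chunk only has to be somewhere in the list.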

Fine-tuning. Fine-tuning has been very impressive IMO. In my experience it doesn't do a great job without RAG, but it can help tweak models to make better use of RAG, and it seems to embed some small amount of information into the model itself. Not enough to fully teach it something new it wasn't trained on, but enough to make a meaningful difference. I like what I am seeing with fine-tuning, and with OpenAI I think it makes practical sense to use a fine-tuned 3.5 over plain 3.5. Plain 3.5 is just not capable enough for the zero-shot use cases I have come across.

Improved embeddings for RAG. I do think hybrid search, rather than relying solely on embeddings, is important, but improvements to the embedding process for RAG would have a big impact. I expect we will see more embedding pipelines that have intermediary LLMs summarize the data and make intelligent decisions about how to chunk it before it goes into the embedding model. I have seen a few people do this, and it makes sense to me. You pass your document into a large-context-window model so it can "see" the whole thing. You ask it to remove extraneous content and then embed the result using a different model. I could certainly see this all happening in a single step with a new embedding model that replaces what we currently have.
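As a sketch of what hybrid search means here, the ranking below blends an embedding similarity score with a plain keyword-overlap score, so an exact keyword match (like "cancellation policy") still counts even when the embedding misses it. All names are hypothetical, the 50/50 weighting is an arbitrary starting point, and toy_embed is only a stand-in for a real embedding model.

```python
import math
import re
from typing import Callable


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(query: str, doc: str) -> float:
    """Fraction of distinct query terms that appear verbatim in the document."""
    q_terms = set(tokenize(query))
    if not q_terms:
        return 0.0
    return len(q_terms & set(tokenize(doc))) / len(q_terms)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Stand-in for a real embedding model: a hashed bag of words."""
    vec = [0.0] * dim
    for term in tokenize(text):
        vec[sum(map(ord, term)) % dim] += 1.0
    return vec


def hybrid_rank(
    query: str,
    docs: list[str],
    embed: Callable[[str], list[float]] = toy_embed,
    alpha: float = 0.5,
) -> list[str]:
    """Rank docs by alpha * vector similarity + (1 - alpha) * keyword overlap."""
    q_vec = embed(query)
    scored = [
        (alpha * cosine(q_vec, embed(d)) + (1 - alpha) * keyword_score(query, d), d)
        for d in docs
    ]
    return [d for _, d in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

In practice you would pass your real embedding function as embed and probably use a proper keyword ranker in place of the simple overlap score; the point is only that the two signals get combined rather than trusting embeddings alone.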

Prompt Chaining + human-in-the-loop workflows. I think chaining prompts together, and including humans in the loop, will be critical to most real business use cases. AutoGPT got a lot of buzz early in 2023, but I have played with it enough to find no practical use case. There is probably a more rigorous way to explain the issues, but my impression is that the fundamental problem is that the probability of failure is too high at each step. If it is "only" 20% per step, then roughly 90% of AutoGPT runs will have failed by the time they reach 10 steps, since 0.8^10 is about 0.11. In my experience, for most tasks even the "successes" are "close but not fully accurate", so it is likely even worse than that.
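The arithmetic behind that estimate, plus a minimal sketch of a chain with a human checkpoint after every step. It assumes failures at each step are independent, which is a simplification, and the function names are hypothetical.

```python
from typing import Callable


def chain_success_probability(step_failure_rate: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent failures."""
    return (1.0 - step_failure_rate) ** steps


def run_chain(
    steps: list[Callable[[str], str]],
    initial: str,
    review: Callable[[str], str],
) -> str:
    """Run steps in order; a reviewer approves or corrects each output."""
    value = initial
    for step in steps:
        draft = step(value)
        # The reviewer returns the draft unchanged or a corrected version,
        # so an error is caught before it feeds into the next step.
        value = review(draft)
    return value


if __name__ == "__main__":
    p = chain_success_probability(0.20, 10)
    print(f"all 10 steps succeed: {p:.1%}")  # prints "all 10 steps succeed: 10.7%"
```

In a real workflow each step would be an LLM call and review would be a person looking at the draft. The checkpoint is what stops errors from compounding across steps, which is the part fully autonomous agents are missing.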

I suspect each of these approaches will play a part in improving LLM query results, and I think they are going to be the most exciting things to watch over the next year.