Hierarchical Retrieval Augmented Generation (HRAG)

Introduction

Large Language Models (LLMs, e.g., ChatGPT, LLaMA, Mistral, Falcon… and many others) have become increasingly popular in the past few months. They have demonstrated breathtaking capabilities, being able to sustain human-like conversations and understand unstructured text. At their core, LLMs are completion engines: given an input text called the “context”, they output a likely sequence of words. To do so, LLMs are trained on billions of texts. As a “byproduct”, LLMs have a memory of the training dataset: they can answer questions about subjects they have seen during training.

Despite their impressive capabilities, LLMs remain a recent technology and an active field of research, and limitations are being discovered as we use them. Three major limitations are: 

  1. Their memory is frozen in time and limited to the training data. 
  2. Their context (the input text) is limited in size. 
  3. They are prone to hallucination, i.e., making up an answer – usually well written but wrong.  

Currently, the three main strategies to alleviate those issues are: 

  1. “Prompt engineering”, which consists in adding instructions to the context to steer the model in a certain direction. For example, it can reduce hallucinations by explicitly allowing the model to say “I don’t know” instead of making up an answer. 
  2. “Retrieval Augmented Generation” (RAG), which consists in putting in the context some information we want the LLM to focus on. It allows the LLM to learn on the fly about a subject it has not seen before. 
  3. “Fine tuning”, which is training a model on particular tasks, such as conversation (“chat” models) or code generation. 

These techniques can complement each other, prompt engineering being the most accessible and fine tuning the least (it requires a training dataset and compute resources), and each technique is a field of study in itself. 

Retrieval Augmented Generation (RAG)

Given a query, RAG consists in putting relevant information in the context. As this information becomes accessible to the LLM, it can, for this query, learn and reason about something it has not seen before. The “retrieval” part of the system can be any mechanism able to extend the context with a document, under the constraint that it must fit in the context. This size constraint means that we cannot simply put all our documents in the context: we must intelligently segment our documents and extend the context with the relevant parts. A complete RAG system relies on five main components: 

  1. Document segmentation. Properly segmenting a document is paramount for a high-quality RAG, and several decisions must be made. What is the segment size? Do segments overlap, and if so, by how much? Or do we link a segment to its neighbours, so that a selected segment always comes with its surrounding context? Do we segment a document flatly, or following its structure? 
  2. Databases of segmented documents. Depending on the retrieval methods, several parallel databases may exist. Where applicable, semantic segmentation can improve the results. Rich metadata should also be used, both to improve retrieval relevance and to make it easy to locate a segment in its original document. This document pre-processing step is sometimes referred to as the “ingestion” step. 
  3. The query itself. It may be necessary to enrich a query with contextual information: in a conversation, a query can be meaningless outside the context of previous exchanges. The LLM (with a bit of prompt engineering) can be used to produce a new, enriched query. 
  4. The retrieval function. Given the databases and a rich query, the retrieval function aims at retrieving the most relevant document segments, ranked by relevance to the query. The most commonly used pre-LLM ranking technique is probably Okapi BM25. In the LLM era, embedding-based semantic ranking techniques have become the norm. In practice, a mix of techniques can be used: for example, a first selection of segments can be performed with BM25, then refined with semantic ranking (a sketch of this hybrid approach follows the list). Retrieval functions range from simple nearest-neighbour search to complex pipelines relying on rich metadata. 
  5. Context management (for chatbots). In a conversational setting, the limited context size requires evicting some information while preserving what is critical. The LLM itself can be used for information selection and summarisation. Advanced context management techniques are being developed (e.g. on GitHub). 
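As an illustration of the fourth component, below is a minimal sketch of a hybrid retrieval function: a cheap BM25 pre-selection followed by an embedding-based re-ranking. The rank_bm25 and sentence-transformers packages, the toy segments and the candidate counts are illustrative assumptions, not a prescription.

```python
# Hybrid retrieval sketch: BM25 pre-selection, then semantic re-ranking.
# Assumes the rank_bm25 and sentence-transformers packages are installed.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

segments = [
    "Drivers must hold a valid licence for the class of vehicle they operate.",
    "Recreational vessels must be registered in most Australian states.",
    "Builders must hold a licence for residential works above a threshold value.",
]
query = "Do I need a licence to drive a heavy vehicle?"

# 1. Lexical pre-selection with BM25 (cheap, keyword based).
bm25 = BM25Okapi([s.lower().split() for s in segments])
lexical_scores = bm25.get_scores(query.lower().split())
candidates = np.argsort(lexical_scores)[::-1][:2]  # keep the 2 best candidates

# 2. Semantic re-ranking of the candidates with an embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([query] + [segments[i] for i in candidates],
                       normalize_embeddings=True)
query_vec, candidate_vecs = vectors[0], vectors[1:]
similarities = candidate_vecs @ query_vec  # cosine similarity (vectors are normalised)

best_segment = segments[candidates[int(np.argmax(similarities))]]
print(best_segment)
```

In a real system, the BM25 index and the embeddings would of course be pre-computed over the whole database rather than built per query.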

A note on semantic search & embeddings 

“Embeddings” are a semantic vectorisation of some data: semantically similar data are embedded as vectors close to each other. This is how a computer can understand that “kitty” and “cat” are related. To compute them, we rely on embedding models (EMs), which, like LLMs, are trained on large datasets of text. The resulting vectors are high-dimensional (e.g., 1536 values for OpenAI’s embeddings), allowing the EMs to encode relations between concepts. 
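As a small illustration of that “kitty”/“cat” intuition, here is a minimal sketch using the sentence-transformers package (the model name is only an example; any embedding model exposes the same idea):

```python
# Embedding a few words and comparing them with cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["kitty", "cat", "tractor"])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means similar, close to 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors[0], vectors[1]))  # "kitty" vs "cat": high similarity
print(cosine(vectors[0], vectors[2]))  # "kitty" vs "tractor": much lower
```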

Embeddings do not necessarily apply to single words: a RAG relies on the embeddings of segments. Following our five components, a simple RAG architecture with semantic search is (a minimal code sketch follows the list): 

  1. Split the documents into segments, then embed them. 
  2. Store the embeddings with the related text in a vector database (a database optimised to store and search among vectors). 
  3. Embed the query. 
  4. Search the database for the N closest vectors to the query’s embedding. 
  5. Inject the retrieved segments in the context and ask the LLM to answer the query. 
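Here is a minimal sketch of those five steps, where a plain numpy matrix stands in for the vector database and the final LLM call is left as a placeholder (the splitting strategy, model name and document texts are illustrative assumptions):

```python
# Minimal semantic-search RAG: split, embed, store, retrieve, build the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def split(document: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Step 1: naive fixed-size segmentation with overlapping characters."""
    step = size - overlap
    return [document[i:i + size] for i in range(0, len(document), step)]

documents = ["...full text of a regulation...", "...another document..."]
segments = [chunk for doc in documents for chunk in split(doc)]

# Steps 1-2: embed the segments; the matrix plays the role of the vector database.
segment_vectors = model.encode(segments, normalize_embeddings=True)

def retrieve(query: str, n: int = 3) -> list[str]:
    """Steps 3-4: embed the query and return the N closest segments."""
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = segment_vectors @ query_vector  # cosine similarity on normalised vectors
    return [segments[i] for i in np.argsort(scores)[::-1][:n]]

# Step 5: inject the retrieved segments into the context.
query = "What are the registration rules for recreational vessels?"
context = "\n\n".join(retrieve(query))
prompt = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, say \"I don't know\".\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)
# answer = llm(prompt)  # placeholder: call the LLM of your choice here
```

A production system would replace the numpy matrix with an actual vector database and add the metadata and query-enrichment steps discussed above.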

Hierarchical Retrieval Augmented Generation (HRAG)

One of the problems that can arise with RAG is an overabundance of document segments. Assume that you want to build an assistant helping you find and understand any Australian regulation or law, whatever the domain: driving, banking, insurance, construction, sailing… you name it! This assistant would rely on RAG, with its database loaded with thousands of segments from the thousands of documents it needs to ingest. In this setting, a query, even enriched, may return segments from different, unrelated documents. How can we improve that? 

A possibility is to use a “hierarchical RAG”. Such a system relies on the fact that documents (and their segments) do not exist in a vacuum: they relate to a system of thought and are organised accordingly. A HRAG system iteratively navigates increasingly specific topics, leading to a reduced set of documents. In the most favourable cases, this set of documents is small enough to be directly loaded into the context without further selection. At search time, we may introduce subtleties like considering the top N topics instead of the single top topic, or regularly checking with the user that the system is selecting the correct topics. 
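Below is a minimal sketch of that iterative navigation: a toy topic tree whose node descriptions are embedded, with the top N children kept at each level. The Node structure, topics and model are assumptions made for illustration only.

```python
# Navigating a topic hierarchy by embedding similarity, top N children per level.
from dataclasses import dataclass, field
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

@dataclass
class Node:
    topic: str                                           # rich topic description
    children: list["Node"] = field(default_factory=list)
    segments: list[str] = field(default_factory=list)    # leaves carry the segments

def descend(query_vec: np.ndarray, node: Node, top_n: int = 2) -> list[str]:
    """Walk down the hierarchy, keeping only the top N children at each level."""
    if not node.children:
        return node.segments
    child_vecs = model.encode([c.topic for c in node.children],
                              normalize_embeddings=True)
    best = np.argsort(child_vecs @ query_vec)[::-1][:top_n]
    return [seg for i in best for seg in descend(query_vec, node.children[i], top_n)]

root = Node("Australian regulation and law", children=[
    Node("Road rules, driving and vehicle registration", segments=["...driving segments..."]),
    Node("Maritime and sailing regulation", segments=["...sailing segments..."]),
    Node("Building and construction codes", segments=["...construction segments..."]),
])

query = "Do I need to register a small sailing boat?"
query_vec = model.encode([query], normalize_embeddings=True)[0]
relevant_segments = descend(query_vec, root, top_n=1)
```

The confirmation subtlety mentioned above would simply show the selected topics to the user before descending further.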

Building this kind of hierarchy is challenging. A straightforward way is to manually choose and write the topics in a top-down approach, following the structure of the documents (building the topics from the different headings – although headings alone may be semantically too poor). Indeed, a topic must be rich enough to semantically capture the content of all its subtopics; topics can therefore be enriched through content summarisation. The drawback of following the document structure, i.e. pulling apart semantically close segments of different documents, can be overcome with a top-N approach. The resulting hierarchy is, however, easy for a human to follow. 

When the structure of the documents is not easily recoverable (e.g. OCRed text), building the hierarchy from the ground up can be a better approach, while also addressing the drawback of the top-down approach described above. The hierarchy can be established with a hierarchical clustering algorithm, while the topics can be written by an LLM based on the content or the subtopics themselves. Compared to the previous approach, its result may be harder for a human to follow, especially when relying on “exotic” summarisation techniques where we ask an LLM to summarise for itself rather than for a human reader. 
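Below is a minimal sketch of that bottom-up construction, clustering segment embeddings with scikit-learn’s agglomerative clustering; the summarise() helper is a hypothetical stand-in for an LLM call, and the segments are placeholders.

```python
# Bottom-up hierarchy sketch: cluster segment embeddings, then name each cluster.
from sklearn.cluster import AgglomerativeClustering
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

segments = [
    "...OCRed segment about speed limits...",
    "...OCRed segment about licence classes...",
    "...OCRed segment about vessel registration...",
    "...OCRed segment about safety equipment on boats...",
]
vectors = model.encode(segments, normalize_embeddings=True)

# First level of the hierarchy: group semantically close segments.
# (Euclidean distance on normalised vectors behaves like cosine distance.)
labels = AgglomerativeClustering(n_clusters=2).fit_predict(vectors)

def summarise(texts: list[str]) -> str:
    """Placeholder: in practice, ask an LLM to write a topic covering these texts."""
    return "Topic covering: " + " / ".join(t[:30] for t in texts)

topics = {}
for label in set(labels):
    members = [segments[i] for i in range(len(segments)) if labels[i] == label]
    topics[label] = summarise(members)

# Repeating the same clustering + summarisation on the topic embeddings builds
# the next level up, until a single root topic remains.
```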

Closing words

We are witnessing a Cambrian explosion in the domain of LLMs, RAG, and related techniques. There is currently a plethora of ideas circulating around the use of AI (some useful, others not so much), but we expect the most practical methods to soon be adopted into everyday use. Meanwhile, if you end up with a huge RAG, give the hierarchical one a go! 

Matthieu

Matthieu leads the research and work related to Generative AI for onepoint in Asia-Pacific.