How to Improve LLMs with RAG. A beginner-friendly introduction w/… | by Shaw Talebi

Imports

We start by installing and importing necessary Python libraries.

!pip install llama-index
!pip install llama-index-embeddings-huggingface
!pip install peft
!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes
# if not running on Colab ensure transformers is installed too

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor

Setting up Knowledge Base

We can configure our knowledge base by defining our embedding model, chunk size, and chunk overlap. Here, we use the ~33M parameter bge-small-en-v1.5 embedding model from BAAI, which is available on the Hugging Face hub. Other embedding model options are available on this text embedding leaderboard.

# import any embedding model on HF hub
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")Settings.llm = None # we won't use LlamaIndex to set-up LLM
Settings.chunk_size = 256
Settings.chunk_overlap = 25

Next, we load our source documents. Here, I have a folder called “articles,” which contains PDF versions of 3 Medium articles I wrote on fat tails. If running this in Colab, you must download the articles folder from the GitHub repo and manually upload it to your Colab environment.

For each file in this folder, the function below will read the text from the PDF, split it into chunks (based on the settings defined earlier), and store each chunk in a list called documents.

documents = SimpleDirectoryReader("articles").load_data()

Since the blogs were downloaded directly as PDFs from Medium, they resemble a webpage more than a well-formatted article. Therefore, some chunks may include text unrelated to the article, e.g., webpage headers and Medium article recommendations.

In the code block below, I refine the chunks in documents, removing most of the chunks before or after the meat of an article.

print(len(documents)) # prints: 71
for doc in documents:
if "Member-only story" in doc.text:
documents.remove(doc)
continueif "The Data Entrepreneurs" in doc.text:
documents.remove(doc)
if " min read" in doc.text:
documents.remove(doc)
print(len(documents)) # prints: 61

Finally, we can store the refined chunks in a vector database.

index = VectorStoreIndex.from_documents(documents)

Setting up Retriever

With our knowledge base in place, we can create a retriever using LlamaIndex’s VectorIndexRetreiver(), which returns the top 3 most similar chunks to a user query.

# set number of docs to retreive
top_k = 3# configure retriever
retriever = VectorIndexRetriever(
index=index,
similarity_top_k=top_k,
)

Next, we define a query engine that uses the retriever and query to return a set of relevant chunks.

# assemble query engine
query_engine = RetrieverQueryEngine(
retriever=retriever,
node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)

Use Query Engine

Now, with our knowledge base and retrieval system set up, let’s use it to return chunks relevant to a query. Here, we’ll pass the same technical question we asked ShawGPT (the YouTube comment responder) from the previous article.

query = "What is fat-tailedness?"
response = query_engine.query(query)

The query engine returns a response object containing the text, metadata, and indexes of relevant chunks. The code block below returns a more readable version of this information.

# reformat response
context = "Context:\n"
for i in range(top_k):
context = context + response.source_nodes[i].text + "\n\n"print(context)

Context:
Some of the controversy might be explained by the observation that log-
normal distributions behave like Gaussian for low sigma and like Power Law
at high sigma [2].
However, to avoid controversy, we can depart (for now) from whether some
given data fits a Power Law or not and focus instead on fat tails.
Fat-tailedness — measuring the space between Mediocristan
and Extremistan
Fat Tails are a more general idea than Pareto and Power Law distributions.
One way we can think about it is that “fat-tailedness” is the degree to which
rare events drive the aggregate statistics of a distribution. From this point of
view, fat-tailedness lives on a spectrum from not fat-tailed (i.e. a Gaussian) to
very fat-tailed (i.e. Pareto 80 – 20).
This maps directly to the idea of Mediocristan vs Extremistan discussed
earlier. The image below visualizes different distributions across this
conceptual landscape [2].print("mean kappa_1n = " + str(np.mean(kappa_dict[filename])))
print("")
Mean κ (1,100) values from 1000 runs for each dataset. Image by author.
These more stable results indicate Medium followers are the most fat-tailed,
followed by LinkedIn Impressions and YouTube earnings.
Note: One can compare these values to Table III in ref [3] to better understand each
κ value. Namely, these values are comparable to a Pareto distribution with α
between 2 and 3.
Although each heuristic told a slightly different story, all signs point toward
Medium followers gained being the most fat-tailed of the 3 datasets.
Conclusion
While binary labeling data as fat-tailed (or not) may be tempting, fat-
tailedness lives on a spectrum. Here, we broke down 4 heuristics for
quantifying how fat-tailed data are.
Pareto, Power Laws, and Fat Tails
What they don’t teach you in statistics
towardsdatascience.com
Although Pareto (and more generally power law) distributions give us a
salient example of fat tails, this is a more general notion that lives on a
spectrum ranging from thin-tailed (i.e. a Gaussian) to very fat-tailed (i.e.
Pareto 80 – 20).
The spectrum of Fat-tailedness. Image by author.
This view of fat-tailedness provides us with a more flexible and precise way of
categorizing data than simply labeling it as a Power Law (or not). However,
this begs the question: how do we define fat-tailedness?
4 Ways to Quantify Fat Tails

Adding RAG to LLM

We start by downloading the fine-tuned model from the Hugging Face hub.

# load fine-tuned model from hub
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizermodel_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_name,
device_map="auto",
trust_remote_code=False,
revision="main")
config = PeftConfig.from_pretrained("shawhin/shawgpt-ft")
model = PeftModel.from_pretrained(model, "shawhin/shawgpt-ft")
# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

As a baseline, we can see how the model responds to the technical question without any context from the articles. To do this, we create a prompt template using a lambda function, which takes in a viewer comment and returns a prompt for the LLM. For more details on where this prompt comes from, see the previous article of this series.

# prompt (no context)
intstructions_string = f"""ShawGPT, functioning as a virtual data science \
consultant on YouTube, communicates in clear, accessible language, escalating \
to technical depth upon request. It reacts to feedback aptly and ends \
responses with its signature '–ShawGPT'.ShawGPT will tailor the length of its responses to match the viewer's comment, \
providing concise acknowledgments to brief expressions of gratitude or \
feedback, thus keeping the interaction natural and engaging.
Please respond to the following comment.
"""
prompt_template = lambda comment: f'''[INST] {intstructions_string} \n{comment} \n[/INST]'''
comment = "What is fat-tailedness?"
prompt = prompt_template(comment)
print(prompt)

[INST] ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, thus keeping the interaction natural and engaging.Please respond to the following comment.
What is fat-tailedness? 
[/INST]

Next, we can pass this prompt to the model using the code below.

model.eval()inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), 
max_new_tokens=280)
print(tokenizer.batch_decode(outputs)[0])

Here’s the model’s response (no context).

Great question!Fat-tailedness is a statistical property of a distribution. In simple terms, 
it refers to the presence of extreme outliers or heavy tails in the 
distribution.
For instance, consider the distribution of heights in a population. A normal 
distribution would have most people clustered around an average height with a 
few people deviating slightly from the mean. However, in a fat-tailed 
distribution, you would observe a larger number of people being significantly 
taller or shorter than the average.
In finance, fat-tailedness is often used to describe the presence of extreme 
market moves or outliers in financial returns. It's an important concept in 
risk management and financial modeling.
I hope this explanation helps! Let me know if you have any questions.
–ShawGPT

Although the response’s style and formatting are great, the model’s explanation of fat-tailedness differs from how I defined it in my video and blog series.

Let’s see what happens to the model’s response when we include the appropriate context. To do this, we create another prompt template, which can also take in context from the RAG system.

# prompt (with context)
prompt_template_w_context = lambda context, comment: f"""[INST]ShawGPT, functioning as a virtual data science consultant on YouTube, communicates in clear, accessible language, escalating to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, providing concise acknowledgments to brief expressions of gratitude or feedback, \
thus keeping the interaction natural and engaging.{context}
Please respond to the following comment. Use the context above if it is helpful.
{comment}
[/INST]
"""

Next, we pass the prompt with context from the query engine and the view comment to the model.

prompt = prompt_template_w_context(context, comment)inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=280)
print(tokenizer.batch_decode(outputs)[0])

Here’s the new response (with context).

Great question!Fat-tailedness is a property of a distribution that describes how much the 
presence of extreme values (rare events) affects the overall statistics of the 
distribution.
In the context of the video, the author discusses how Mediocristan and 
Extremistan distributions differ in terms of fat-tailedness. Mediocristan 
distributions have a relatively small impact from extreme values, while 
Extremistan distributions are heavily influenced by extreme values.
The author also mentions that fat-tailedness is a more general concept than 
specific distributions like Pareto or Power Laws. Instead, it lives on a 
spectrum ranging from thin-tailed (Gaussian) to very fat-tailed (Pareto 80-20).
I hope that helps clarify things a bit! Let me know if you have any questions.
–ShawGPT

This does a much better job of capturing my explanation of fat tails than the no-context response and even calls out the niche concepts of Mediocristan and Extremistan.

Here, I gave a beginner-friendly introduction to RAG and shared a concrete example of how to implement it using LlamaIndex. RAG allows us to improve an LLM system with updateable and domain-specific knowledge.

While much of the recent AI hype has centered around building AI assistants, a powerful (yet less popular) innovation has come from text embeddings (i.e. the things we used to do retrieval). In the next article of this series, I will explore text embeddings in more detail, including how they can be used for semantic search and classification tasks.

More on LLMs 👇