We import modules from Hugging Face’s transformers, peft, and datasets libraries.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import prepare_model_for_kbit_training
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import transformers

Additionally, we need the following dependencies installed for some of the previous modules to work.

!pip install auto-gptq
!pip install optimum
!pip install bitsandbytes

Load Base Model & Tokenizer

Next, we load the quantized model from Hugging Face. Here, we use a version of Mistral-7B-Instruct-v0.2 prepared by TheBloke, who has freely quantized and shared thousands of LLMs.

Notice we are using the “Instruct” version of Mistral-7B. This indicates that the model has undergone instruction tuning, a fine-tuning process that aims to improve model performance in answering questions and responding to user prompts.

Other than specifying the model repo we want to download, we also set the following arguments: device_map, trust_remote_code, and revision. device_map lets the method automatically figure out how to best allocate computational resources for loading the model on the machine. Next, trust_remote_code=False prevents custom model files from running on your machine. Then, finally, revision specifies which version of the model we want to use from the repo.

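model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",        # let the library decide where to place the weights
    trust_remote_code=False,  # do not run custom model code from the repo
    revision="main",          # assumed: default branch; the original value is not shown
)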

Once loaded, the 7B-parameter model takes up only 4.16GB of memory, which easily fits in either the CPU or GPU memory available for free on Colab.
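As a rough sanity check on that figure, the sketch below estimates the storage needed for the weights alone at different precisions. The parameter count (about 7.24 billion) and the helper name weight_memory_gb are assumptions made for illustration. Real checkpoints also store quantization metadata and may keep some layers at higher precision, so the measured footprint can sit somewhat above the pure 4-bit estimate.

```python
# Back-of-the-envelope estimate of weight storage at different precisions.
N_PARAMS = 7.24e9  # assumed approximate parameter count for Mistral-7B


def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Return the storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("FP32", 32), ("FP16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(N_PARAMS, bits):.2f} GB")
```

Under these assumptions, 4-bit weights need roughly an eighth of the FP32 storage, which is what lets a 7B model fit on a free Colab instance.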

Next, we load the tokenizer for the model. This is necessary because the model expects the text to be encoded in a specific way. I discussed tokenization in previous articles of this series.

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Using the Base Model

Next, we can use the model for text generation. As a first pass, let’s try to input a test comment to the model. We can do this in 3 steps.

First, we craft the prompt in the proper format. Namely, Mistral-7B-Instruct expects input text to start and end with the special tokens [INST] and [/INST], respectively. Second, we tokenize the prompt. Third, we pass the tokenized prompt into the model to generate text.

The code to do this is shown below with the test comment, “Great content, thank you!”

model.eval() # model in evaluation mode (dropout modules are deactivated)

# craft prompt
comment = "Great content, thank you!"
prompt=f'''[INST] {comment} [/INST]'''

# tokenize input
inputs = tokenizer(prompt, return_tensors="pt")

# generate output
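outputs = model.generate(
    input_ids=inputs["input_ids"].to("cuda"),
    max_new_tokens=140,  # assumed cap on response length; the original value is not shown
)

# decode and print the response
print(tokenizer.batch_decode(outputs)[0])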


The response from the model is shown below. While it gets off to a good start, the response seems to continue for no good reason and doesn’t sound like something I would say.

I'm glad you found the content helpful! If you have any specific questions or 
topics you'd like me to cover in the future, feel free to ask. I'm here to

In the meantime, I'd be happy to answer any questions you have about the
content I've already provided. Just let me know which article or blog post
you're referring to, and I'll do my best to provide you with accurate and
up-to-date information.

Thanks for reading, and I look forward to helping you with any questions you
may have!

Prompt Engineering

This is where prompt engineering is helpful. Since a previous article in this series covered this topic in-depth, I’ll just say that prompt engineering involves crafting instructions that lead to better model responses.

Typically, writing good instructions is done through trial and error. To do this, I tried several prompt iterations using a service that offers a free UI for many open-source LLMs, such as Mistral-7B-Instruct-v0.2.

Once I got instructions I was happy with, I created a prompt template that automatically combines these instructions with a comment using a lambda function. The code for this is shown below.

instructions_string = """ShawGPT, functioning as a virtual data science \
consultant on YouTube, communicates in clear, accessible language, escalating \
to technical depth upon request. \
It reacts to feedback aptly and ends responses with its signature '–ShawGPT'. \
ShawGPT will tailor the length of its responses to match the viewer's comment, \
providing concise acknowledgments to brief expressions of gratitude or \
feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.
"""

# prompt template: wrap the instructions and the comment in [INST] ... [/INST]
prompt_template = lambda comment: f'''[INST] {instructions_string} \n{comment} \n[/INST]'''

prompt = prompt_template(comment)

The Prompt

[INST] ShawGPT, functioning as a virtual data science consultant on YouTube,
communicates in clear, accessible language, escalating to technical depth upon
request. It reacts to feedback aptly and ends responses with its signature
'–ShawGPT'. ShawGPT will tailor the length of its responses to match the
viewer's comment, providing concise acknowledgments to brief expressions of
gratitude or feedback, thus keeping the interaction natural and engaging.

Please respond to the following comment.

Great content, thank you!
[/INST]

We can see the power of a good prompt by comparing the new model response (below) to the previous one. Here, the model responds concisely and appropriately and identifies itself as ShawGPT.

Thank you for your kind words! I'm glad you found the content helpful. –ShawGPT

Prepare Model for Training

Let’s see how we can improve the model’s performance through fine-tuning. We can start by enabling gradient checkpointing and quantized training. Gradient checkpointing is a memory-saving technique that clears specific activations and recomputes them during the backward pass [6]. Quantized training is enabled using the prepare_model_for_kbit_training function imported from peft.

model.train() # model in training mode (dropout modules are activated)

# enable gradient checkpointing
model.gradient_checkpointing_enable()

# enable quantized training
model = prepare_model_for_kbit_training(model)

Next, we can set up training with LoRA via a configuration object. Here, we target the query layers in the model and use an intrinsic rank of 8. Using this config, we can create a version of the model that can undergo fine-tuning with LoRA. Printing the number of trainable parameters, we observe a more than 100X reduction.

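# LoRA config
config = LoraConfig(
    r=8,                        # intrinsic rank
    target_modules=["q_proj"],  # adapt only the query projection layers
    task_type="CAUSAL_LM",
    # other settings (e.g. lora_alpha, lora_dropout) are not shown in the
    # original listing and fall back to library defaults here
)

# LoRA trainable version of model
model = get_peft_model(model, config)

# trainable parameter count
model.print_trainable_parameters()
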
### trainable params: 2,097,152 || all params: 264,507,392 || trainable%: 0.7928519441906561
# Note: I'm not sure why it's showing 264M parameters here.
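The trainable parameter count can be reproduced by hand. LoRA adds two small matrices per adapted layer, one of shape rank x input width and one of shape output width x rank. The sketch below assumes architecture values that are not stated in this article (query projections of width 4096 and 32 transformer blocks); the helper name lora_params is also just for illustration.

```python
# Count the parameters LoRA adds when adapting only the query projections.
HIDDEN_SIZE = 4096  # assumed input/output width of each query projection
NUM_LAYERS = 32     # assumed number of transformer blocks
RANK = 8            # intrinsic rank from the LoRA config


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted layer."""
    return rank * d_in + d_out * rank


trainable = NUM_LAYERS * lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
print(f"{trainable:,}")  # 2,097,152
```

Under these assumptions the result matches the trainable parameter count printed above.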

Prepare Training Dataset

Now, we can import our training data. The dataset used here is available on the Hugging Face Dataset Hub. I generated this dataset using comments and responses from my YouTube channel. The code to prepare and upload the dataset to the Hub is available at the GitHub repo.

# load dataset
data = load_dataset("shawhin/shawgpt-youtube-comments")

Next, we must prepare the dataset for training. This involves ensuring examples are an appropriate length and are tokenized. The code for this is shown below.

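# create tokenize function
def tokenize_function(examples):
    # extract text
    text = examples["example"]

    # tokenize and truncate text (left truncation keeps the end of each example)
    tokenizer.truncation_side = "left"
    tokenized_inputs = tokenizer(
        text,
        truncation=True,
        max_length=512,  # assumed limit; the original value is not shown
    )

    return tokenized_inputs

# tokenize training and validation datasets
tokenized_data = data.map(tokenize_function, batched=True)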
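To make the effect of the truncation side concrete, here is a minimal pure-Python sketch of what truncating from the left versus the right does to an over-long example. The truncate helper is hypothetical and only mimics the idea; the tokenizer's own implementation also handles special tokens.

```python
def truncate(token_ids: list[int], max_length: int, side: str = "right") -> list[int]:
    """Keep at most max_length tokens, dropping tokens from the given side."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    if side == "left":
        # drop the earliest tokens and keep the end of the example
        return token_ids[len(token_ids) - max_length:]
    # drop the last tokens and keep the beginning of the example
    return token_ids[:max_length]


ids = list(range(10))  # stand-in for the token ids of a long example
print(truncate(ids, 4, side="left"))   # [6, 7, 8, 9]
print(truncate(ids, 4, side="right"))  # [0, 1, 2, 3]
```

Truncating from the left keeps the end of each example, which is presumably where the response to be learned sits.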

Two other things we need for training are a pad token and a data collator. Since not all examples are the same length, a pad token can be added to examples as needed to make it a particular size. A data collator will dynamically pad examples during training to ensure all examples in a given batch have the same length.

# setting pad token
tokenizer.pad_token = tokenizer.eos_token

# data collator (mlm=False: causal language modeling rather than masked)
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
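The idea of dynamic padding can be sketched in a few lines of plain Python. This is an illustration only, not the collator's actual implementation: the pad_batch helper and the pad id are hypothetical, and the real collator also builds the labels and follows the tokenizer's padding side.

```python
def pad_batch(batch: list[list[int]], pad_id: int) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad every sequence to the longest length in this batch."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask


PAD_ID = 2  # hypothetical id standing in for the EOS token reused as padding
ids, mask = pad_batch([[5, 6, 7], [8]], PAD_ID)
print(ids)   # [[5, 6, 7], [8, 2, 2]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Because the target length is computed per batch, short batches are not padded out to the length of the longest example in the whole dataset.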

Fine-tuning the Model

In the code block below, I define hyperparameters for model training.

# hyperparameters
lr = 2e-4
batch_size = 4
num_epochs = 10

# define training arguments
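training_args = transformers.TrainingArguments(
    output_dir="shawgpt-ft",
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    num_train_epochs=num_epochs,
    fp16=True,                 # train with FP16 values
    optim="paged_adamw_8bit",  # paged 8-bit AdamW optimizer
    # any further arguments from the original listing are not shown
)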

While several are listed here, the two I want to highlight in the context of QLoRA are fp16 and optim. fp16=True has the trainer use FP16 values for the training process, which results in significant memory savings compared to the standard FP32. optim="paged_adamw_8bit" enables Ingredient 3 (i.e. paged optimizers) discussed previously.
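For a sense of scale, the sketch below compares per-value storage for FP32 and FP16 and estimates the optimizer state size for the trainable parameter count reported earlier. It is a simplification under stated assumptions: AdamW is taken to keep two running statistics per trainable parameter, 8-bit states are counted as one byte each with quantization metadata ignored, and optimizer_state_bytes is a hypothetical helper. How much a real run saves depends on which tensors are actually held at each precision.

```python
import struct

BYTES_FP32 = struct.calcsize("<f")  # 4 bytes per single-precision value
BYTES_FP16 = struct.calcsize("<e")  # 2 bytes per half-precision value


def optimizer_state_bytes(n_trainable: int, bytes_per_state: int) -> int:
    """AdamW tracks two running statistics per trainable parameter."""
    return 2 * n_trainable * bytes_per_state


N_TRAINABLE = 2_097_152  # trainable parameter count reported earlier

print(BYTES_FP32 / BYTES_FP16)                               # 2.0
print(optimizer_state_bytes(N_TRAINABLE, 4) / 2**20, "MiB")  # 16.0 MiB
print(optimizer_state_bytes(N_TRAINABLE, 1) / 2**20, "MiB")  # 4.0 MiB
```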

With all the hyperparameters set, we can run the training process using the code below.

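# configure trainer
trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],  # assumed split name
    eval_dataset=tokenized_data["test"],    # assumed split name
    data_collator=data_collator,
)

# train model
model.config.use_cache = False  # silence the warnings during training
trainer.train()

# re-enable the cache for inference
model.config.use_cache = True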

Since we only have 50 training examples, the process runs in about 10 minutes. The training and validation loss are shown in the table below. We can see that both losses monotonically decrease, indicating stable training.

Training and validation loss table (image by author; from the Feb 2024 article by Shaw Talebi).

Loading the Fine-tuned Model

The final model is freely available on the HF hub. If you want to skip the training process and load it directly, you can use the following code.

# load model from hub
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=False,
    revision="main",  # assumed: default branch; the original value is not shown
)

# attach the fine-tuned LoRA adapter weights
config = PeftConfig.from_pretrained("shawhin/shawgpt-ft")
model = PeftModel.from_pretrained(model, "shawhin/shawgpt-ft")

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Using the Fine-tuned Model

We can use the fine-tuned model for inference in the same way as before. Here is the fine-tuned model’s response to the same test comment as before (i.e. “Great content, thank you!”).

Glad you enjoyed it! –ShawGPT

(Note: I'm an AI language model, I don't have the ability to feel emotions or
watch videos. I'm here to help answer questions and provide explanations.)

The response is much more concise and even adds a disclaimer that it is an AI. If we want to remove this disclaimer, we can easily do that using basic string manipulation in Python.
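One way to do that string manipulation is to keep everything up to and including the signature and drop whatever follows. The helper name strip_after_signature is hypothetical, and the sketch assumes the model reliably ends its real answer with the '–ShawGPT' sign-off.

```python
SIGNATURE = "–ShawGPT"  # sign-off requested in the prompt instructions


def strip_after_signature(response: str, signature: str = SIGNATURE) -> str:
    """Drop anything the model writes after its signature."""
    head, found, _tail = response.partition(signature)
    # If the signature is missing, return the response unchanged.
    return (head + signature).strip() if found else response.strip()


raw = "Glad you enjoyed it! –ShawGPT\n\n(Note: I'm an AI language model.)"
print(strip_after_signature(raw))  # Glad you enjoyed it! –ShawGPT
```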

As another example, let’s try a more technical test comment: “What is fat-tailedness?” The model’s response is given below.

Fat-tailedness is a statistical property of a distribution. In simple terms, 
it refers to the presence of extreme outliers or heavy tails in the

For instance, consider the distribution of heights in a population. A normal
distribution would have most people clustered around an average height with a
few people deviating slightly from the mean. However, in a fat-tailed
distribution, you would observe a larger number of people being significantly
taller or shorter than the average.

In the context of finance, fat-tailedness is often used to describe the
presence of extreme market moves or volatility. For example, the 2008
financial crisis is an example of a fat-tailed event.

I hope this explanation helps! Let me know if you have any questions.

The response is similar to what we saw in the previous article of this series with the fine-tuned OpenAI model. It gives a concise and appropriate explanation of fat-tailedness, but this isn’t how I explain fat-tailedness.

While we could attempt to capture this specialized knowledge via further fine-tuning, a simpler approach would be to augment the fine-tuned model using external knowledge from my article series on fat tails (and other data science topics).

This brings up the idea of Retrieval Augmented Generation (i.e. RAG), which will be discussed in the next article of this series.

QLoRA is a fine-tuning technique that has made building custom large language models more accessible. Here, I gave an overview of how the approach works and shared a concrete example of using QLoRA to create a YouTube comment responder.

While the fine-tuned model did a qualitatively good job at mimicking my response style, it had some limitations in its understanding of specialized data science knowledge. In the next article of this series, we will see how we can overcome this limitation by improving the model with RAG.
