Automate the evaluation process of your Retrieval-Augmented Generation apps without any manual intervention
Today’s topic is evaluating your RAG without manually labeling test data.
Measuring the performance of your RAG is something you should care about, especially if you’re building such systems and serving them in production.
Besides giving you a rough idea of how your application behaves, evaluating your RAG also provides quantitative feedback that guides experimentation and the selection of the right parameters (LLMs, embedding models, chunk size, top K, etc.)
Evaluating your RAG also matters to your clients and stakeholders: they expect performance metrics that validate your project.
Enough teasing, here’s what this issue covers:
- Automatically generating a synthetic test set from your RAG’s data
- Overview of popular RAG metrics
- Computing RAG metrics on the synthetic dataset using the Ragas package
PS: Some sections of this issue are hands-on. They include the coding material needed to implement dataset generation and evaluate the RAG.
Everything will also be available in this notebook.
Let’s have a look 🔎
Let’s say you’ve just built a RAG and now want to evaluate its performance.
To do that, you need an evaluation dataset that has the following columns:
- question (str): to evaluate the RAG on
- ground_truths (list): the reference (i.e. true) answers to the questions
- answer (str): the answers predicted by the RAG
- contexts (list): the list of relevant contexts the RAG used for each question to generate the answer
→ The first two columns represent the ground-truth data, while the last two represent the RAG’s predictions.
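To make the schema concrete, here is a minimal sketch of such an evaluation dataset as a plain dictionary of columns. The sample question, answer, and contexts are illustrative placeholders, not real data; libraries like Ragas can typically consume this shape via `datasets.Dataset.from_dict`:

```python
# Minimal sketch of the four-column evaluation dataset described above.
# The row content below is a made-up placeholder for illustration.
eval_dataset = {
    # question (str): the question the RAG is evaluated on
    "question": [
        "What is the capital of France?",
    ],
    # ground_truths (list): reference (true) answers per question
    "ground_truths": [
        ["Paris is the capital of France."],
    ],
    # answer (str): the answer predicted by the RAG
    "answer": [
        "The capital of France is Paris.",
    ],
    # contexts (list): retrieved contexts the RAG used per question
    "contexts": [
        ["Paris is the capital and most populous city of France."],
    ],
}

# Sanity check: every column must have one entry per evaluated question.
n_questions = len(eval_dataset["question"])
assert all(len(col) == n_questions for col in eval_dataset.values())
```

Note that `ground_truths` and `contexts` hold lists per row (several references or retrieved chunks may exist for one question), while `question` and `answer` are plain strings.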