These notes are taken from this free course: https://learn.deeplearning.ai/courses/building-evaluating-advanced-rag/lesson/1/introduction
Huge thanks to the folks at DeepLearning.AI
Introduction
To productionize a high quality RAG system you need:
- Advanced retrieval techniques to get highly relevant sources
- Automated evals to measure responses
Goal of the course: Teach you how to build a production-ready, LLM-based system
Course covers:
- Advanced RAG
- Evaluations (the RAG triad)
- Context Relevance
- Groundedness
- Answer Relevance
Advanced RAG Pipeline
Overview of simple (naive) RAG pipeline
Ingestion: Take a doc -> chunk it up -> embed the chunks -> store them in an index
Retrieval: Take a query -> grab the top matches from the vector store index -> pass them to the LLM to respond
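A minimal sketch of this naive pipeline, assuming LlamaIndex (which the course uses), a local ./data folder of documents, and the course-era import path (newer releases import from llama_index.core):
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Ingestion: load docs, chunk + embed them, store them in a vector index
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieval: take a query, grab the top-k chunks, pass them to the LLM to respond
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query("What is this document about?")
print(response)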
Sentence Window Retrieval
Gives the LLM extra context by retrieving, in addition to the most relevant sentence, the window of sentences around it. As a result, sentence window retrieval is more efficient than the basic (direct) query engine: matching happens on small, precise sentence embeddings, but the LLM still receives a coherent chunk of surrounding context to synthesize its answer from.
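A sketch of how this can be set up in LlamaIndex (the window_size and similarity_top_k values are illustrative assumptions; documents comes from the ingestion sketch above):
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

# Split docs into single-sentence nodes; each node stores its surrounding sentences in metadata
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # sentences of context on each side of the matched sentence
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
sentence_context = ServiceContext.from_defaults(node_parser=node_parser)
sentence_index = VectorStoreIndex.from_documents(documents, service_context=sentence_context)

# At query time, replace each retrieved sentence with its stored window before synthesis
query_engine = sentence_index.as_query_engine(
    similarity_top_k=6,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)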
Auto Merging Retrieval
Organizes the document into a tree-like structure where each parent node's text is divided among its child nodes. Retrieval runs over these leaf chunks, which are broken down even smaller than in sentence window retrieval. If enough of a parent's child nodes are identified as relevant, they are merged and the parent node's entire text is provided instead, so the relevant chunks reach the LLM as one larger, coherent piece of context rather than as fragments.
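A sketch of the LlamaIndex version (the three chunk sizes and the top_k are illustrative assumptions; documents again comes from the ingestion sketch):
from llama_index import StorageContext, VectorStoreIndex
from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Build a 3-level hierarchy of chunks: each parent's text is split among its children
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# Embed only the leaf chunks, but keep every node in the docstore so parents can be looked up
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
automerging_index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# Retrieve leaves; when enough children of one parent are hit, they are merged into the parent
base_retriever = automerging_index.as_retriever(similarity_top_k=12)
retriever = AutoMergingRetriever(base_retriever, storage_context, verbose=True)
query_engine = RetrieverQueryEngine.from_args(retriever)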
Overview of Evals
The RAG Triad
- Query
- Response
- Context
We use LLMs to evaluate LLMs. Generally, the steps for running an eval are:
- Write a list of questions to run against your RAG pipeline
- Use your eval engine of choice to run each of the questions in your list against your RAG pipeline
- The eval engine, such as trulens_eval, will evaluate each query, the context it sourced through the RAG process, and the final response to generate scores for the RAG triad metrics
Evals are crucial in detecting hallucinations in our RAG process and especially for preventing them from slipping in after future changes.
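A sketch of that loop with trulens_eval, assuming a LlamaIndex query engine and the feedback functions defined further down (the app_id is a hypothetical name, and the feedback selectors must match however the app is wrapped):
from trulens_eval import Tru, TruLlama

tru = Tru()

eval_questions = [
    "First question about your documents",
    "Second question about your documents",
]

# Wrap the RAG pipeline with a recorder that logs inputs, outputs, and intermediate calls
tru_recorder = TruLlama(
    query_engine,
    app_id="Baseline RAG",
    feedbacks=[f_context_relevance, f_answer_relevance, f_groundedness],
)

# Every query run inside the recorder gets scored by the feedback functions
with tru_recorder as recording:
    for question in eval_questions:
        query_engine.query(question)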
How it works
- Eval system will run a specific query against the RAG pipeline
- Eval system will use its own LLM to:
- Read the user’s query and read the output
- Then generate an Answer Relevance score
- Potentially using something like chain-of-thought reasoning to do this
- May output a “supporting evidence” justification as part of its internal workings to decide on the score (a sketch of pulling these scores and reasons out of TruLens follows this list)
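A sketch of retrieving those scores and the chain-of-thought “reasons” from trulens_eval (assumes the tru and recorder setup from the earlier sketch):
# Logged records plus one column per feedback function (reasons live in the record metadata)
records, feedback_names = tru.get_records_and_feedback(app_ids=[])  # [] means all apps
print(records[["input", "output"] + feedback_names].head())

# Or browse scores and the supporting-evidence justifications in the TruLens dashboard
tru.run_dashboard()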
Feedback Function
Provides a score after reviewing an LLM app’s:
- inputs
- outputs
- intermediate results.
Example code for a trulens eval feedback function:
import numpy
from trulens_eval import Feedback, Select
from trulens_eval.feedback.provider.openai import OpenAI

# LLM provider that scores the feedback prompts (assumes an OpenAI API key is configured)
provider = OpenAI()

# Context relevance between question and each context chunk.
f_context_relevance = (
    Feedback(
        provider.context_relevance_with_cot_reasons,
        name="Context Relevance"
    )
    .on(Select.RecordCalls.retrieve.args.query)
    .on(Select.RecordCalls.retrieve.rets)
    .aggregate(numpy.mean)
)
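The two .on(...) selectors point the feedback at the app’s intermediate results: the query argument passed into its retrieve method and the chunks that method returned (so this assumes the wrapped app exposes a retrieve call; the selectors change if you wrap a plain query engine). .aggregate(numpy.mean) then averages the per-chunk scores into a single Context Relevance score for each record.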
Can be implemented by using an LLM or a BERT model to evaluate the inputs, outputs, and intermediate results
Source: https://www.trulens.org/trulens_eval/evaluation/feedback_functions/anatomy/
Answer Relevance
Checking that the answer is relevant to the query asked by the user.
Source: https://www.trulens.org/trulens_eval/evaluation_benchmarks/
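A minimal feedback-function sketch for this metric, reusing the Feedback import and provider from the Context Relevance example above:
# Answer Relevance: score the final response against the user's question, with CoT reasons
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input_output()
)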
Groundedness
Also known as faithfulness, measures the degree to which an answer generated by a RAG pipeline is supported by the retrieved information.
Measures how well the response tracks to the source material.
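A sketch of a Groundedness feedback function in the version of trulens_eval the course uses (newer releases expose a groundedness measure directly on the provider; the retrieve selector is the same assumption as above):
from trulens_eval.feedback import Groundedness

grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(Select.RecordCalls.retrieve.rets.collect())  # all retrieved chunks, gathered into one list
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)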
How to Use Evals to Improve Your RAG Pipeline
- Start with a basic RAG pipeline
- Setup evals with the RAG Triad metrics
- Once you have a baseline of metrics, start tweaking the RAG pipeline and see how it affects the outputs (a quick way to compare runs is sketched after this list)
- Try strategies like sentence window retrieval, auto-merging (hierarchical / nested tree) retrieval, changing the retrieval top-k, etc.
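A quick way to compare configurations in trulens_eval, assuming each variant was recorded under its own (hypothetical) app_id:
# Average feedback scores plus latency and cost per app, side by side
tru.get_leaderboard(app_ids=["Baseline RAG", "Sentence Window RAG", "Auto-merging RAG"])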