These notes are taken from this free course: https://learn.deeplearning.ai/courses/building-evaluating-advanced-rag/lesson/1/introduction

Huge thanks to the folks at DeepLearning.ai

Introduction

To productionize a high-quality RAG system you need:

  • Advanced retrieval techniques to get highly relevant sources
  • Automated evals to measure responses

Goal of the course: teach you how to build production-ready, LLM-based systems

Course covers:

Advanced RAG Pipeline

Overview of simple (naive) RAG pipeline

simple RAG pipeline

  1. Ingestion: Take a doc -> chunk it up -> embed the chunks -> store them in an index

  2. Retrieval: Takes a query -> grabs the top matches from the vector store index -> passes them to the LLM to generate a response (see the sketch below)
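
A minimal sketch of this naive pipeline using LlamaIndex. The imports assume the llama_index version used in the course (newer releases move these under llama_index.core), and the file path is a placeholder:

from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI

# Ingestion: load a doc, chunk + embed it, and store the chunks in an index.
documents = SimpleDirectoryReader(input_files=["./example.pdf"]).load_data()  # placeholder path
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1)
)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Retrieval: grab the top matches for a query and let the LLM respond.
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query("What is this document about?")
print(str(response))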

Sentence Window Retrieval

Gives the LLM extra context by retrieving, in addition to the most relevant sentence, a window of surrounding sentences. As a result, sentence window retrieval typically supplies more relevant, complete context to the LLM than the basic (direct) query engine (see the sketch below).

Sentence Window Retrieval
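
A sketch of how this can be wired up in LlamaIndex with a SentenceWindowNodeParser plus a MetadataReplacementPostProcessor (same version caveat as above; documents is the list loaded in the earlier sketch):

from llama_index import ServiceContext, VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

# Split into single-sentence nodes; each node also stores a window of
# surrounding sentences in its metadata.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
service_context = ServiceContext.from_defaults(node_parser=node_parser)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# At query time, replace each retrieved sentence with its stored window
# before handing the context to the LLM.
query_engine = index.as_query_engine(
    similarity_top_k=6,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)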

Auto Merging Retrieval

Organizes the document in a tree-like structure where each parent node's text is divided among its child nodes, breaking the text into even smaller chunks than sentence window retrieval, arranged hierarchically. If enough of a parent's child nodes are identified as relevant, they are merged and the entire text of the parent node is provided instead, so the retrieved chunks are concatenated into one comprehensive block of context (see the sketch below).

Auto-merging retrieval
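
A sketch using LlamaIndex's HierarchicalNodeParser and AutoMergingRetriever (same version caveat as above; the chunk sizes are illustrative):

from llama_index import StorageContext, VectorStoreIndex
from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Build the chunk hierarchy: each parent's text is split among its children.
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# Index only the leaf chunks, but keep every node in the docstore so
# parents can be looked up and merged in at query time.
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# If enough of a parent's leaf chunks are retrieved, they are swapped out
# for the parent node's full text.
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=12),
    storage_context=storage_context,
    verbose=True,
)
query_engine = RetrieverQueryEngine.from_args(retriever)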

Overview of Evals

RAG triad

The RAG Triad

  • Query
  • Response
  • Context

The triad is evaluated with three pairwise checks: Context Relevance (query vs. retrieved context), Groundedness (retrieved context vs. response), and Answer Relevance (query vs. response).

We use LLMs to evaluate LLMs. Generally, the steps to running an eval are:

  • Write a list of questions to run against your RAG pipeline
  • Use your eval engine of choice to run each of the questions in your list against your RAG pipeline
    • The eval engine, such as trulens_eval, will evaluate each query and the context it sourced through the RAG process to generate scores for the RAG triad metrics (see the sketch below)
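
A sketch of this loop with trulens_eval. It assumes query_engine is the LlamaIndex engine from the earlier sketches and that the RAG triad feedback functions (f_qa_relevance, f_context_relevance, f_groundedness, shown elsewhere in these notes) are already defined with selectors that match how the app is instrumented; the app id and questions are placeholders:

from trulens_eval import Tru, TruLlama

tru = Tru()

eval_questions = [
    "Placeholder question 1 about the indexed document?",
    "Placeholder question 2 about the indexed document?",
]

# Wrap the query engine in a recorder that logs the query, the retrieved
# context, and the response, and scores them with the feedback functions.
tru_recorder = TruLlama(
    query_engine,
    app_id="baseline RAG",
    feedbacks=[f_qa_relevance, f_context_relevance, f_groundedness],
)

with tru_recorder as recording:
    for question in eval_questions:
        query_engine.query(question)

tru.get_leaderboard(app_ids=[])  # aggregated RAG triad scores per app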

Evals are crucial for detecting hallucinations in our RAG process, and especially for preventing them from slipping in after future changes.


How it works

  • The eval system will run a specific query against the RAG pipeline
  • The eval system will use its own LLM to:
    • Read the user’s query and the app’s output
    • Then generate an Answer Relevance score
      • Potentially using something like Chain-of-Thought reasoning to do this
      • May output a “supporting evidence” justification as part of its internal workings to decide on the score (see the sketch below)
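
As a rough sketch in trulens_eval, the Answer Relevance check can be written as a feedback function whose *_with_cot_reasons variant also returns the provider LLM's chain-of-thought justification (import paths assume the course's trulens_eval version):

from trulens_eval import Feedback, OpenAI as fOpenAI

provider = fOpenAI()  # the eval system's own LLM

# Score how relevant the final answer is to the user's query; the CoT
# variant also returns the "supporting evidence" behind the score.
f_qa_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input()   # the user's query
    .on_output()  # the app's final response
)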

Feedback Function

Provides a score after reviewing an LLM app’s:

  • inputs
  • outputs
  • intermediate results.

Example code for a trulens_eval feedback function (context relevance):

import numpy

# Imports and provider assume the trulens_eval version used in the course.
from trulens_eval import Feedback, Select, OpenAI as fOpenAI

provider = fOpenAI()  # the LLM used to run the evals

# Context relevance between the question and each retrieved context chunk,
# with chain-of-thought reasons; the selectors assume the RAG app exposes
# an instrumented `retrieve` method.
f_context_relevance = (
    Feedback(
        provider.context_relevance_with_cot_reasons,
        name="Context Relevance"
    )
    .on(Select.RecordCalls.retrieve.args.query)  # the query passed to retrieve()
    .on(Select.RecordCalls.retrieve.rets)        # the chunks retrieve() returned
    .aggregate(numpy.mean)                       # average the per-chunk scores
)

Feedback functions can be implemented using an LLM or a BERT-style model to evaluate the inputs, outputs, and intermediate results.

Source: https://www.trulens.org/trulens_eval/evaluation/feedback_functions/anatomy/

Answer Relevance

Checking that the answer is relevant to the query asked by the user.

RAG Eval Answer Relevance flowchart

Source: https://www.trulens.org/trulens_eval/evaluation_benchmarks/

Groundedness

Also known as faithfulness, groundedness measures the degree to which an answer generated by a RAG pipeline is supported by the retrieved information.

In other words, it measures how well the response tracks the source material.
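
A sketch of a groundedness feedback function in trulens_eval, reusing the provider from the snippets above. The Select.RecordCalls.retrieve selector assumes the same instrumented retrieve method as the context relevance example, and the exact imports may differ across trulens_eval versions:

from trulens_eval import Feedback, Select
from trulens_eval.feedback import Groundedness

grounded = Groundedness(groundedness_provider=provider)

# Check that each statement in the response is supported by the retrieved
# context, with chain-of-thought reasons for the score.
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(Select.RecordCalls.retrieve.rets.collect())  # retrieved context
    .on_output()                                     # generated response
    .aggregate(grounded.grounded_statements_aggregator)
)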


How to Use Evals to Improve Your RAG Pipeline

  • Start with a basic RAG pipeline
  • Set up evals with the RAG Triad metrics
  • Once you have a baseline of metrics, start tweaking the RAG pipeline and see how it affects the outputs
    • Try strategies like sentence window retrieval, auto-merging (nested tree) retrieval, changing the top k, etc. (see the sketch below)
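
A sketch of the comparison step with trulens_eval: record each pipeline variant under its own app id, re-run the same eval questions, then compare the RAG triad scores (the app ids mentioned are hypothetical):

from trulens_eval import Tru

tru = Tru()

# After recording runs for e.g. "baseline RAG", "sentence window RAG", and
# "auto-merging RAG", compare their scores side by side.
tru.get_leaderboard(app_ids=[])  # empty list = all recorded apps
tru.run_dashboard()              # optional: inspect individual records in a browser UI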