These notes are taken from this free course: https://learn.deeplearning.ai/courses/building-evaluating-advanced-rag/lesson/1/introduction

Huge thanks to the folks at DeepLearning.ai

Introduction

To productionize a high-quality RAG system you need:

  • Advanced retrieval techniques to get highly relevant sources
  • Automated evals to measure responses

Goal of the course: teach you how to build production-ready, LLM-based systems

Course covers:

Advanced RAG Pipeline

Overview of simple (naive) RAG pipeline

simple RAG pipeline

  1. Ingestion: Take a doc -> chunk it up -> embed the chunks -> store them in an index

  2. Retrieval: Takes a query -> grabs the top matches from the vector store index -> passes them to the LLM to generate a response (see the sketch below)
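
A minimal sketch of this naive pipeline using LlamaIndex. The imports assume the llama_index version used in the course (newer releases move these under llama_index.core), and the file path is a placeholder:

from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI

# Ingestion: load a doc, chunk + embed it, and store the chunks in an index.
documents = SimpleDirectoryReader(input_files=["./example.pdf"]).load_data()  # placeholder path
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1)
)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Retrieval: grab the top matches for a query and let the LLM respond.
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query("What is this document about?")
print(str(response))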

Sentence Window Retrieval

Gives the LLM extra context by retrieving, in addition to the most relevant sentence, a window of surrounding sentences. As a result, sentence window retrieval typically supplies more relevant, complete context to the LLM than the basic (direct) query engine (see the sketch below).

Sentence Window Retrieval
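
A sketch of how this can be wired up in LlamaIndex with a SentenceWindowNodeParser plus a MetadataReplacementPostProcessor (same version caveat as above; documents is the list loaded in the earlier sketch):

from llama_index import ServiceContext, VectorStoreIndex
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import MetadataReplacementPostProcessor

# Split into single-sentence nodes; each node also stores a window of
# surrounding sentences in its metadata.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
service_context = ServiceContext.from_defaults(node_parser=node_parser)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# At query time, replace each retrieved sentence with its stored window
# before handing the context to the LLM.
query_engine = index.as_query_engine(
    similarity_top_k=6,
    node_postprocessors=[MetadataReplacementPostProcessor(target_metadata_key="window")],
)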

Auto Merging Retrieval

Organizes the document in a tree-like structure where each parent node's text is divided among its child nodes, breaking the text into even smaller chunks than sentence window retrieval, arranged hierarchically. If enough of a parent's child nodes are identified as relevant, they are merged and the entire text of the parent node is provided instead, so the retrieved chunks are concatenated into one comprehensive block of context (see the sketch below).

Auto-merging retrieval
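
A sketch using LlamaIndex's HierarchicalNodeParser and AutoMergingRetriever (same version caveat as above; the chunk sizes are illustrative):

from llama_index import StorageContext, VectorStoreIndex
from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Build the chunk hierarchy: each parent's text is split among its children.
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# Index only the leaf chunks, but keep every node in the docstore so
# parents can be looked up and merged in at query time.
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# If enough of a parent's leaf chunks are retrieved, they are swapped out
# for the parent node's full text.
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=12),
    storage_context=storage_context,
    verbose=True,
)
query_engine = RetrieverQueryEngine.from_args(retriever)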

Overview of Evals

RAG triad

The RAG Triad

  • Query
  • Response
  • Context

The triad is evaluated with three pairwise checks: Context Relevance (query vs. retrieved context), Groundedness (retrieved context vs. response), and Answer Relevance (query vs. response).

We use LLMs to evaluate LLMs. Generally, the steps to running an eval are:

  • Write a list of questions to run against your RAG pipeline
  • Use your eval engine of choice to run each of the questions in your list against your RAG pipeline
    • The eval engine, such as trulens_eval, will evaluate each query and the context it sourced through the RAG process to generate scores for the RAG triad metrics (see the sketch below)
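
A sketch of this loop with trulens_eval. It assumes query_engine is the LlamaIndex engine from the earlier sketches and that the RAG triad feedback functions (f_qa_relevance, f_context_relevance, f_groundedness, shown elsewhere in these notes) are already defined with selectors that match how the app is instrumented; the app id and questions are placeholders:

from trulens_eval import Tru, TruLlama

tru = Tru()

eval_questions = [
    "Placeholder question 1 about the indexed document?",
    "Placeholder question 2 about the indexed document?",
]

# Wrap the query engine in a recorder that logs the query, the retrieved
# context, and the response, and scores them with the feedback functions.
tru_recorder = TruLlama(
    query_engine,
    app_id="baseline RAG",
    feedbacks=[f_qa_relevance, f_context_relevance, f_groundedness],
)

with tru_recorder as recording:
    for question in eval_questions:
        query_engine.query(question)

tru.get_leaderboard(app_ids=[])  # aggregated RAG triad scores per app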

Evals are crucial for detecting hallucinations in our RAG process, and especially for preventing them from slipping in after future changes.


How it works

  • The eval system will run a specific query against the RAG pipeline
  • The eval system will use its own LLM to:
    • Read the user’s query and the app’s output
    • Then generate an Answer Relevance score
      • Potentially using something like Chain-of-Thought reasoning to do this
      • May output a “supporting evidence” justification as part of its internal workings to decide on the score (see the sketch below)
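
As a rough sketch in trulens_eval, the Answer Relevance check can be written as a feedback function whose *_with_cot_reasons variant also returns the provider LLM's chain-of-thought justification (import paths assume the course's trulens_eval version):

from trulens_eval import Feedback, OpenAI as fOpenAI

provider = fOpenAI()  # the eval system's own LLM

# Score how relevant the final answer is to the user's query; the CoT
# variant also returns the "supporting evidence" behind the score.
f_qa_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input()   # the user's query
    .on_output()  # the app's final response
)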

Feedback Function

Provides a score after reviewing an LLM app’s:

  • inputs
  • outputs
  • intermediate results.

Example code for a trulens_eval feedback function (context relevance):

import numpy

# Imports and provider assume the trulens_eval version used in the course.
from trulens_eval import Feedback, Select, OpenAI as fOpenAI

provider = fOpenAI()  # the LLM used to run the evals

# Context relevance between the question and each retrieved context chunk,
# with chain-of-thought reasons; the selectors assume the RAG app exposes
# an instrumented `retrieve` method.
f_context_relevance = (
    Feedback(
        provider.context_relevance_with_cot_reasons,
        name="Context Relevance"
    )
    .on(Select.RecordCalls.retrieve.args.query)  # the query passed to retrieve()
    .on(Select.RecordCalls.retrieve.rets)        # the chunks retrieve() returned
    .aggregate(numpy.mean)                       # average the per-chunk scores
)

Feedback functions can be implemented using an LLM or a BERT-style model to evaluate the inputs, outputs, and intermediate results.

Source: https://www.trulens.org/trulens_eval/evaluation/feedback_functions/anatomy/

Answer Relevance

Checking that the answer is relevant to the query asked by the user.

RAG Eval Answer Relevance flowchart

Source: https://www.trulens.org/trulens_eval/evaluation_benchmarks/

Groundedness

Also known as faithfulness, groundedness measures the degree to which an answer generated by a RAG pipeline is supported by the retrieved information.

In other words, it measures how well the response tracks the source material.
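
A sketch of a groundedness feedback function in trulens_eval, reusing the provider from the snippets above. The Select.RecordCalls.retrieve selector assumes the same instrumented retrieve method as the context relevance example, and the exact imports may differ across trulens_eval versions:

from trulens_eval import Feedback, Select
from trulens_eval.feedback import Groundedness

grounded = Groundedness(groundedness_provider=provider)

# Check that each statement in the response is supported by the retrieved
# context, with chain-of-thought reasons for the score.
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(Select.RecordCalls.retrieve.rets.collect())  # retrieved context
    .on_output()                                     # generated response
    .aggregate(grounded.grounded_statements_aggregator)
)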


How to Use Evals to Improve Your RAG Pipeline

  • Start with a basic RAG pipeline
  • Set up evals with the RAG Triad metrics
  • Once you have a baseline of metrics, start tweaking the RAG pipeline and see how it affects the outputs
    • Try strategies like sentence window retrieval, auto-merging (nested tree) retrieval, changing the top k, etc. (see the sketch below)
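
A sketch of the comparison step with trulens_eval: record each pipeline variant under its own app id, re-run the same eval questions, then compare the RAG triad scores (the app ids mentioned are hypothetical):

from trulens_eval import Tru

tru = Tru()

# After recording runs for e.g. "baseline RAG", "sentence window RAG", and
# "auto-merging RAG", compare their scores side by side.
tru.get_leaderboard(app_ids=[])  # empty list = all recorded apps
tru.run_dashboard()              # optional: inspect individual records in a browser UI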