Table of Contents

  • Introduction
  • RAG
  • Motivation
  • Contributions
  • FlashRAG
  • Component Module
  • Pipeline Module
  • Datasets and Corpus
  • Evaluation
  • Experimental Result and Discussion
  • Experimental Setup
  • Methods and Main results
  • Impact of Retrieval on RAG
  • Takeaway
  • References



FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research


Introduction

RAG

  • RAG is a robust solution for mitigating hallucination in LLMs: it grounds generation in external knowledge retrieved for each query.



Motivation

  • Many works are not open-source or have fixed settings in their open-source code, making it difficult to adapt to new data or innovative components.
  • The datasets and retrieval corpus used often vary, with resources being scattered.
  • Because RAG systems are complex, involving multiple steps such as indexing, retrieval, and generation, researchers often need to implement large parts of the system themselves. Existing RAG toolkits such as LangChain and LlamaIndex are typically large and cumbersome, which hinders customized pipelines and leaves the issues above unaddressed.



Contributions

FlashRAG is an open-source library designed to enable researchers to easily reproduce existing RAG methods and develop their own RAG algorithms.



FlashRAG

Component Module

The Component Module encompasses five main components: Judger, Retriever, Reranker, Refiner, and Generator (a sketch of how they chain together follows the list below).

  • Judger functions as a preliminary component that assesses whether a query necessitates retrieval.
  • Retriever:
      • Sparse retrieval: BM25.
      • Dense retrieval: BERT-based embedding models such as DPR, E5, and BGE.
      • Vector database: FAISS.
  • Reranker refines the order of results returned by the retriever to enhance retrieval accuracy.
      • Cross-encoder models, such as bge-reranker and jina-reranker.
  • Refiner refines the input text for the generator to reduce token usage and noise from retrieved documents, improving the final RAG responses.
      • The Extractive Refiner employs an embedding model to extract semantic units, such as sentences or phrases, from the retrieved text that have higher semantic similarity with the query.
      • The Abstractive Refiner uses a seq2seq model to directly summarize the retrieved text.
      • Perplexity-based refiners: LLMLingua Refiner and Selective-Context Refiner.
  • Generator
      • vLLM & FastChat
      • Transformers
      • OpenAI
      • Encoder-decoder models
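
To make the division of labor concrete, below is a minimal, hypothetical sketch of how the five components could be chained for a single query. None of the class or method names are FlashRAG's actual API; they are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import List, Protocol

# All names below are illustrative assumptions, not FlashRAG's real API.

@dataclass
class Document:
    text: str
    score: float = 0.0

class Judger(Protocol):
    def judge(self, query: str) -> bool: ...                        # does the query need retrieval?

class Retriever(Protocol):
    def search(self, query: str, topk: int) -> List[Document]: ...  # sparse or dense search

class Reranker(Protocol):
    def rerank(self, query: str, docs: List[Document]) -> List[Document]: ...

class Refiner(Protocol):
    def refine(self, query: str, docs: List[Document]) -> str: ...  # compressed context

class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...

def answer(query: str, judger: Judger, retriever: Retriever,
           reranker: Reranker, refiner: Refiner, generator: Generator,
           topk: int = 5) -> str:
    """Judger -> Retriever -> Reranker -> Refiner -> Generator for a single query."""
    if not judger.judge(query):
        # No retrieval needed: answer directly from the query.
        return generator.generate(query)
    docs = retriever.search(query, topk=topk)
    docs = reranker.rerank(query, docs)
    context = refiner.refine(query, docs)
    prompt = (
        "Answer the question based on the given document. "
        "Only give me the answer and do not output any other words. "
        f"The following are given documents: {context}\nQuestion: {query}"
    )
    return generator.generate(prompt)
```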

Pipeline Module

  • Sequential Pipeline implements a linear execution path for the query, formally represented as query -> retriever -> post-retrieval (reranker, refiner) -> generator (see the usage sketch after this list).
  • Branching Pipeline executes multiple paths in parallel for a single query (often one path per retrieved document) and merges the results from all paths to form the final output.
      • REPLUG: processes each retrieved document in parallel and combines the generation probabilities from all documents to produce the final answer.
      • SuRe: generates a candidate answer from each retrieved document and then ranks all candidate answers.
  • Conditional Pipeline utilizes a judger to direct the query into different execution paths based on the judgement outcome.
  • Loop Pipeline involves complex interactions between retrieval and generation processes, often encompassing multiple cycles of retrieval and generation.
      • Iterative
      • Self-Ask
      • Self-RAG
      • FLARE
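
As a usage sketch of the sequential path, the snippet below follows the style of FlashRAG's quick start. The module paths, class names, and config keys mirror the toolkit's documentation but may differ across versions, so treat them as assumptions rather than a definitive API.

```python
# Quick-start-style sketch; names are assumptions and may change across versions.
from flashrag.config import Config
from flashrag.utils import get_dataset
from flashrag.pipeline import SequentialPipeline

config = Config("my_config.yaml", config_dict={"retrieval_topk": 5})
test_data = get_dataset(config)["test"]         # one of the pre-processed benchmark splits

pipeline = SequentialPipeline(config)           # query -> retriever -> (reranker/refiner) -> generator
result = pipeline.run(test_data, do_eval=True)  # runs the chain and reports the configured metrics
```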

Datasets and Corpus

  • Datasets: 32 pre-processed benchmark datasets (a loading sketch follows below).
  • Corpus: Wikipedia passages and MS MARCO passages.
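
A minimal loading sketch, assuming the unified JSON Lines format that the pre-processed datasets ship in (roughly one record per line with id, question, and golden_answers fields). The exact field names and file locations are assumptions.

```python
import json

def load_split(path: str):
    """Read one dataset split stored as JSON Lines.

    Field names (id, question, golden_answers) and the file path below are
    assumptions about the unified schema, not a guaranteed layout.
    """
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            samples.append({
                "id": item["id"],
                "question": item["question"],
                "golden_answers": item["golden_answers"],
            })
    return samples

nq_test = load_split("datasets/nq/test.jsonl")   # hypothetical location
print(nq_test[0]["question"], nq_test[0]["golden_answers"])
```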



Evaluation

  • Retrieval-aspect metrics: recall@k, precision@k, F1@k, and mean average precision (MAP)
  • Generation-aspect metrics: token-level F1 score, exact match, accuracy, BLEU, and ROUGE-L (see the sketch after this list).
  • Custom evaluation metrics are also supported.
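
For reference, here is a sketch of the two generation-aspect metrics used most in the experiments, exact match and token-level F1, following the common SQuAD-style answer normalization; FlashRAG's own implementation may differ in details.

```python
import re
import string
from collections import Counter
from typing import List

def normalize(text: str) -> str:
    """Common answer normalization: lowercase, drop punctuation and articles, squeeze spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, golden_answers: List[str]) -> float:
    """1.0 if the normalized prediction equals any normalized golden answer, else 0.0."""
    return float(any(normalize(prediction) == normalize(g) for g in golden_answers))

def token_f1(prediction: str, golden_answers: List[str]) -> float:
    """Token-level F1: best token overlap between the prediction and any golden answer."""
    best = 0.0
    pred_tokens = normalize(prediction).split()
    for gold in golden_answers:
        gold_tokens = normalize(gold).split()
        common = Counter(pred_tokens) & Counter(gold_tokens)
        overlap = sum(common.values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

print(exact_match("The Eiffel Tower", ["eiffel tower"]))  # 1.0
print(round(token_f1("Paris, France", ["paris"]), 2))     # 0.67
```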

Experimental Result and Discussion

Experimental Setup

  • Generator Model: LLAMA3-8B-instruct
  • Retriever Model: E5-base-v2
  • Retrieval Corpus: Wikipedia data from December 2018
  • Max Input Length for Generator Model: 4096
  • Documents Retrieved per Query: 5
  • Default Prompt: "Answer the question based on the given document. Only give me the answer and do not output any other words. The following are given documents: {retrieval documents}"
  • Experimental Environment: 8 NVIDIA A100 GPUs
  • Datasets: Natural Questions (NQ), TriviaQA, HotpotQA, 2WikiMultihopQA, PopQA, WebQuestions
  • Evaluation Metrics: exact match for NQ, TriviaQA, and WebQuestions; token-level F1 for HotpotQA, 2WikiMultihopQA, and PopQA (a configuration sketch mirroring this setup follows)
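
The setup above maps naturally onto a dictionary-style configuration. The sketch below is an approximation in FlashRAG's config style; every key name and path here is an assumption and may not match the toolkit's actual schema.

```python
# Approximate configuration mirroring the experimental setup above.
# Key names and paths are assumptions, not the toolkit's guaranteed schema.
experiment_config = {
    "generator_model": "llama3-8B-instruct",
    "retrieval_method": "e5",                    # E5-base-v2 dense retriever
    "corpus_path": "indexes/wiki18_100w.jsonl",  # hypothetical path to the Dec. 2018 Wikipedia dump
    "retrieval_topk": 5,                         # documents retrieved per query
    "generator_max_input_len": 4096,
    "metrics": ["em", "f1"],
    "gpu_id": "0,1,2,3,4,5,6,7",                 # 8 x NVIDIA A100
}
```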

Methods and Main results

Methods are categorized based on the RAG component they primarily focus on optimizing.

  • retriever
  • refiner
  • generator and its related decoding methods
  • judger
  • entire RAG flow, including multiple retrievals and generation processes.



Impact of Retrieval on RAG

  • Existing research often employs a fixed retriever and a fixed number of retrieved documents.

[Figure: impact of the number of retrieved documents and retriever quality on final RAG performance (left: averaged results; right: per-dataset results)]

  • Main Findings:
      • Left panel:
          • Overall performance is best when the number of retrieved documents is 3 or 5.
          • Both an excessive and an insufficient number of retrieved documents lead to a significant drop in performance, by up to 40%, for both dense and sparse retrieval methods.
          • When the number of retrieved documents is large, the results of the three retrievers of different quality converge. In contrast, for top-1 results there is a substantial gap between the dense methods (E5, BGE) and BM25, indicating that the fewer documents retrieved, the greater the impact of the retriever's quality on the final result.
      • Right panel:
          • On most datasets, using the top-3 or top-5 retrieved results yields the best performance, suggesting that this represents a good balance between the quality of retrieved documents and the noise they introduce (a sketch for running such a top-k sweep follows this list).
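
To reproduce this kind of comparison, one would vary the number of retrieved documents and re-run the same pipeline. A hypothetical sweep, reusing the quick-start pattern shown earlier (module paths and config keys remain assumptions):

```python
# Hypothetical top-k sweep; names follow the earlier quick-start sketch and are assumptions.
from flashrag.config import Config
from flashrag.utils import get_dataset
from flashrag.pipeline import SequentialPipeline

for topk in (1, 3, 5, 10, 15):
    config = Config("my_config.yaml", config_dict={"retrieval_topk": topk})
    test_data = get_dataset(config)["test"]
    result = SequentialPipeline(config).run(test_data, do_eval=True)
    print(f"top-{topk}:", result)  # compare EM/F1 across retrieval depths
```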

Takeaway

FlashRAG is a Python toolkit for the reproduction and development of Retrieval-Augmented Generation (RAG) research. The toolkit includes 32 pre-processed benchmark RAG datasets and 14 state-of-the-art RAG algorithms.


References