文章目录
- Introduction
- RAG
- Motivation
- Contributions
- FlashRAG
- Component Module
- Pipeline Module
- Datasets and Corpus
- Evaluation
- Experimental Result and Discussion
- Experimental Setup
- Methods and Main results
- Impact of Retrieval on RAG
- Take away
- References
Introduction
RAG
- A robust solution to mitigate hallucination issues in LLMs
Motivation
- Many works are not open-source or have fixed settings in their open-source code, making it difficult to adapt to new data or innovative components.
- The datasets and retrieval corpus used often vary, with resources being scattered.
- Due to the complexity of RAG systems, involving multiple steps such as indexing, retrieval, and generation, researchers often need to implement many parts of the system themselves. Although there are some existing RAG toolkits like LangChain and LlamaIndex, they are typically large and cumbersome, hindering researchers from implementing customized processes and failing to address the aforementioned issues.
Contributions
FlashRAG, an open-source library designed to enable researchers to easily reproduce existing RAG methods and develop their own RAG algorithms.
FlashRAG
Component Module
The Component Module encompasses five main components: Judger, Retriever,
Reranker, Refiner, and Generator.
- Judger functions as a preliminary component that assesses whether a query necessitates retrieval.
- Retriever:
- sparse retrieval, BM25
- dense retrieval, BERT-based embedding models such as DPR, E5 and BGE.
- vector database, FAISS
- Reranker aims at refining the order of results returned by the retriever to enhance retrieval accuracy.
- Cross-Encoder models, such as the bge-reranker and jina-reranker.
- Refiner refines the input text for generators to reduce token usage and reduce noise from retrieved documents, improving the final RAG responses.
- The Extractive Refiner employs an embedding model to extract semantic units, like sentences or phrases, from the retrieved text that hold higher semantic similarity with the query.
- The Abstractive Refiner utilizes a seq2seq model to directly summarize the retrieved text.
- Perplexity-based refiners,LLMLingua Refiner and Selective-Context Refiner.
- Generator
- vllm & FastChat
- Transformers
- OpenAI
- Encoder-Decoder models
Pipeline Module
-
Sequential Pipeline
implements a linear execution path for the query, formally represented as query -> retriever -> post-retrieval(reranker, refiner) -> generator. Branching Pipeline
executes multiple paths in parallel for a single query (often one path per retrieved document) and merges the results from all paths to form the ultimate output.
- REPLUG: The REPLUG pipeline processes each retrieved document in parallel and combines the generation probabilities from all documents to produce the final answer.
- SuRe: The SuRe pipeline generates a candidate answer from each retrieved document and then ranks all candidate answers.
-
Conditional Pipeline
utilizes a judger to direct the query into different execution paths based on the judgement outcome. Loop Pipeline
involves complex interactions between retrieval and generation processes, often encompassing multiple cycles of retrieval and generation.
- Iterative
- Self-Ask
- Self-RAG
- FlARE
Datasets and Corpus
- Datasets: pre-processes 32 benchmark datasets
- Corpus: Wikipedia passages、MS MARCO passages
Evaluation
-
Retrieval-aspect metrics
: recall@k, precision@k, F1@k, and mean average precision (MAP) -
Generation-aspect metrics
: token-level F1 score, exact match, accuracy, BLEU, and ROUGE-L - accommodate custom evaluation metrics
Experimental Result and Discussion
Experimental Setup
Experiment Component | Description |
Generator Model | LLAMA3-8B-instruct |
Retriever Model | E5-base-v2 |
Retrieval Corpus | Wikipedia data from December 2018 |
Max Input Length for Generator Model | 4096 |
Documents Retrieved per Query | 5 |
Default Prompt | Answer the question based on the given document. Only give me the answer and do not output any other words. The following are given documents:{retrieval documents} |
Experimental Environment | 8 NVIDIA A100 GPUs |
Datasets | Natural Questions (NQ), TriviaQA, HotpotQA, 2WikiMultihopQA, PopQA, WebQuestions |
Evaluation Metrics | Exact match for NQ, TriviaQA, WebQuestions; Token-level F1 for HotpotQA, 2WikiMultihopQA, PopQA |
Methods and Main results
Methods are categorized based on the RAG component they primarily focused on optimizing.
- retriever
- refiner
- generator and its related decoding methods
- judger
- entire RAG flow, including multiple retrievals and generation processes.
Impact of Retrieval on RAG
- Existing research works often employs a fixed retriever and a fixed number of retrieved documents.
- Main Findings:
- Figure left:
- the overall performance is optimal when the number of retrieved documents is 3 or 5.
- Both an excessive and insufficient number of retrieved documents lead to a significant decrease in performance, with a drop of up to 40% (both dense and sparse retrieval methods).
- Additionally, we observe that when the number of retrieved documents is large, the results of the three different quality retrievers converge. In contrast, for the top1 results, there is a substantial gap between dense methods (E5, Bge) and BM25, indicating that the fewer documents retrieved, the greater the impact of the retriever’s quality on the final result.
- Figure right:
- It can be seen that on most datasets, using top3 or top5 retrieved results yields the best performance, suggesting that this may represent a good balance between the quality of retrieved documents and noise.
Take away
FlashRAG is a Python toolkit for the reproduction and development of Retrieval Augmented Generation (RAG) research. Our toolkit includes 32 pre-processed benchmark RAG datasets and 14 state-of-the-art RAG algorithms.
References
- Retrieval-Augmented Generation for Large Language Models: A Survey (https://arxiv.org/pdf/2312.10997)
- FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research (https://arxiv.org/abs/2405.13576)