Advanced Retrieval-Augmented Generation: Multilingual Semantic Retrieval across Document Types by Fine-Tuning Transformer-Based Language Models and OCR Integration
This study presents an advanced system for multilingual semantic retrieval across diverse document types, integrating Retrieval-Augmented Generation (RAG) with transformer-based language models and Optical Character Recognition (OCR). To build a robust multilingual Question-Answering (QA) system, we constructed a custom dataset derived from XQuAD, FQuAD, and MLQA, augmented with synthetic data generated by OpenAI's GPT-3.5 Turbo to ensure comprehensive, context-rich answers. PaddleOCR provided high-quality text extraction in French, English, and Spanish, though Arabic proved more difficult. The Multilingual E5 embedding model was fine-tuned with the Multiple Negatives Ranking Loss objective to optimize retrieval of context-question pairs. For text generation we used two models: mT5, fine-tuned for improved contextual understanding and longer answer generation and suitable for CPU-constrained deployments, and Llama 3 8B-Instruct, optimized for advanced language generation in professional and industry applications with ample GPU resources. Individual components were evaluated with F1, Exact Match (EM), and BLEU scores, and the full pipeline with the RAGAS framework. mT5 showed promising results and excelled in context precision and relevancy, while the quantized Llama 3 led in answer correctness and answer similarity. These results demonstrate the effectiveness of our RAG system for multilingual semantic retrieval, providing a robust solution for real-world QA applications and laying the groundwork for future advances in multilingual document processing.
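As an illustrative sketch (not the paper's training code), Multiple Negatives Ranking Loss treats each (question, context) pair in a batch as a positive and every other context in the batch as a negative: each query's similarity scores against all in-batch passages are fed through a cross-entropy loss with the matching passage as the label. The toy embeddings and the temperature value below are assumptions for demonstration only.

```python
import math

def mnrl_loss(q_embs, p_embs, scale=20.0):
    """Multiple Negatives Ranking Loss over a batch of (query, passage) pairs.
    Query i's positive is passage i; all other passages act as in-batch negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def norm(a):
        return math.sqrt(dot(a, a)) or 1.0

    per_query = []
    for i, q in enumerate(q_embs):
        # Temperature-scaled cosine similarities of query i against every passage.
        sims = [scale * dot(q, p) / (norm(q) * norm(p)) for p in p_embs]
        # Cross-entropy with the matching passage (index i) as the correct class.
        log_z = math.log(sum(math.exp(s) for s in sims))
        per_query.append(log_z - sims[i])
    return sum(per_query) / len(per_query)

# Toy 2-D embeddings: query 0 aligns with passage 0, query 1 with passage 1,
# so the loss should be close to zero.
queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[0.9, 0.1], [0.1, 0.9]]
loss = mnrl_loss(queries, passages)
```

Because negatives come for free from the rest of the batch, larger batch sizes generally make this objective harder and more informative.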
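The component-level metrics can also be made concrete. Below is a simplified, SQuAD-style sketch of the Exact Match and token-level F1 scores used for extractive QA evaluation (this is an assumed implementation for illustration, not the paper's evaluation script; full SQuAD normalization additionally strips articles, which is omitted here).

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def exact_match(pred, gold):
    """1.0 if the normalized prediction equals the normalized gold answer, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall between prediction and gold."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("Miguel de Cervantes", "Cervantes")` scores 0.5 (recall 1.0, precision 1/3), while EM would score it 0.0; this is why F1 is the more forgiving of the two metrics for longer generated answers.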
Lewis, Patrick, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, et al. 2021. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” arXiv. https://doi.org/10.48550/arXiv.2005.11401.
Borgeaud, Sebastian, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, et al. 2022. “Improving Language Models by Retrieving from Trillions of Tokens.” arXiv. https://doi.org/10.48550/arXiv.2112.04426.
Jiang, Zhengbao, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023. “Active Retrieval Augmented Generation.” arXiv. https://doi.org/10.48550/arXiv.2305.06983.
Weston, Jason, and Sainbayar Sukhbaatar. 2023. “System 2 Attention (Is Something You Might Need Too).” arXiv. https://doi.org/10.48550/arXiv.2311.11829.
Ramos, Rita, Bruno Martins, and Desmond Elliott. 2023. “LMCap: Few-Shot Multilingual Image Captioning by Retrieval Augmented Language Model Prompting.” arXiv. https://doi.org/10.48550/arXiv.2305.19821.
Zhuang, Shengyao, Linjun Shou, and Guido Zuccon. 2023. “Augmenting Passage Representations with Query Generation for Enhanced Cross-Lingual Dense Retrieval.” In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1827–32. SIGIR ’23. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3539618.3591952.
Shao, Zhihong, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. “Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy.” arXiv. https://doi.org/10.48550/arXiv.2305.15294.
Du, Yuning, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, et al. 2020. “PP-OCR: A Practical Ultra Lightweight OCR System.” arXiv. https://doi.org/10.48550/arXiv.2009.09941.
Es, Shahul, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. 2023. “RAGAS: Automated Evaluation of Retrieval Augmented Generation.” arXiv. https://doi.org/10.48550/arXiv.2309.15217.
“Introducing Meta Llama 3: The Most Capable Openly Available LLM to Date.” n.d. Meta AI. Accessed May 11, 2024. https://ai.meta.com/blog/meta-llama-3/.
Wang, Liang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. “Multilingual E5 Text Embeddings: A Technical Report.” arXiv. https://doi.org/10.48550/arXiv.2402.05672.
Xue, Linting, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. “mT5: A Massively Multilingual Pre-Trained Text-to-Text Transformer.” arXiv. https://doi.org/10.48550/arXiv.2010.11934.
Douze, Matthijs, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. “The Faiss Library.” arXiv. https://doi.org/10.48550/arXiv.2401.08281.
Whitehouse, Chenxi, Monojit Choudhury, and Alham Fikri Aji. 2023. “LLM-Powered Data Augmentation for Enhanced Cross-Lingual Performance.” arXiv. https://doi.org/10.48550/arXiv.2305.14288.
Merity, Stephen, Caiming Xiong, James Bradbury, and Richard Socher. 2016. “Pointer Sentinel Mixture Models.” arXiv. https://doi.org/10.48550/arXiv.1609.07843.