In an era where digital privacy is as precious as the information itself, we stand at the cusp of a revolution with the introduction of the Big Rag Application’s Retrieval segment. This isn’t just an advancement in technology; it’s a bold statement in favor of data sovereignty and privacy. Our application redefines the paradigm of PDF information retrieval, enabling users to extract and analyze data without the need to rely on internet connectivity or external data processors. It’s a fortress of knowledge that stands resilient in the face of privacy concerns, ensuring that your data remains yours alone.
Imagine a world where the endless reservoirs of information contained within PDF documents are not only accessible but also protected within the sanctity of your own system. This vision is the bedrock of our application, designed for those who value privacy as much as the quest for knowledge. Whether you are a researcher safeguarding sensitive data, a business protecting its intellectual property, or an individual cautious about your digital footprint, our application speaks directly to your concerns.
By leveraging state-of-the-art open-source AI models in a self-contained ecosystem, our application ensures that the power of information retrieval is at your fingertips, no internet required. This standalone capability marks the beginning of the Big Rag Application journey, focusing on the Retrieval aspect as a testament to our commitment to privacy, security, and independence in the digital age.
Harnessing the Power of Open-Source AI, Locally
As we delve into the heart of our technology, we celebrate the liberation of data from the clutches of external dependencies... That's enough talking. Let's jump in and write some code. We use LangChain to implement this RAG application, specifically the Retrieval phase:
LangChain, a comprehensive framework for building applications powered by large language models, makes Retrieval-Augmented Generation (RAG) practical by providing an efficient and accessible workflow. It plays a critical role in simplifying the RAG pipeline, which is pivotal for tasks that require deep understanding and generation of human-like text. Here's a step-by-step breakdown of how LangChain supports the RAG workflow:
Before even starting on the code, we need to list the following libraries in a requirements.txt file:
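Inferring from the imports used in the steps below, a minimal requirements.txt looks something like this (versions left unpinned; pypdf backs PyPDFLoader, sentence-transformers backs the BGE embeddings, and qdrant-client backs the Qdrant store):
langchain
pypdf
sentence-transformers
qdrant-client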
It is always a best practice to create and activate a virtual environment, then install the libraries with the command below (make sure the requirements.txt file is in the same directory).
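Creating and activating the environment, for example on macOS/Linux (on Windows, run .venv\Scripts\activate instead of the source line):
python -m venv .venv
source .venv/bin/activate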
pip install -r requirements.txt
Once the requirements are installed, you are good to go with your first RAG application.
1. Document Loaders and Transformers
Functionality: LangChain's document loaders efficiently fetch documents from a variety of sources, including both private and public repositories. These documents, whether in HTML, PDF, or other formats, are then processed by Document Transformers.
from langchain.document_loaders import PyPDFLoader
# Load your PDF document
loader = PyPDFLoader("path/to/your/document.pdf")
documents = loader.load()
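PyPDFLoader produces one Document per page, with the extracted text in page_content; a quick sanity check confirms the load worked:
# One Document per PDF page; the extracted text lives in .page_content
print(len(documents))
print(documents[0].page_content[:200])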
Purpose: The transformers prepare the documents for the retrieval phase by breaking down large documents into smaller, more manageable chunks, ensuring that the subsequent steps of the RAG workflow are as efficient and effective as possible.
2. Efficient Text Splitting
Post-loading, documents undergo text splitting for manageable processing, courtesy of RecursiveCharacterTextSplitter.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Split the loaded pages into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)
3. Text Embedding Models
Functionality: Utilizing a range of text embedding models, LangChain interfaces with leading providers like OpenAI, Cohere, and Hugging Face to create detailed vector representations of text. This vectorization captures the semantic essence of the text, facilitating its retrieval based on meaning rather than mere keyword matching.
To understand and process the text deeply, we use HuggingFaceBgeEmbeddings for generating text embeddings.
from langchain.embeddings import HuggingFaceBgeEmbeddings
# Generate embeddings
embedder = HuggingFaceBgeEmbeddings(model_name="YourModelNameHere")  # e.g., a BGE model such as "BAAI/bge-small-en-v1.5"
embeddings = embedder.embed_documents([t.page_content for t in texts])
Purpose: Embeddings enable the system to efficiently find text segments that are semantically related to a given query, a foundational step for generating accurate and contextually relevant responses.
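To make retrieval-by-meaning concrete, here is a small sketch (the query string is only an illustration) that ranks the chunks from step 2 against a question using cosine similarity; this is essentially what a vector store does for us at scale:
import numpy as np

# Embed the question into the same vector space as the chunks
query_vec = np.array(embedder.embed_query("How is user data processed?"))

# Score every chunk against the question with cosine similarity
chunk_vecs = np.array(embeddings)
scores = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)

# The highest-scoring chunk is the most semantically relevant one
print(texts[int(scores.argmax())].page_content)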
4. Vector Stores
Functionality: LangChain supports integrations with over 50 different vector stores, offering a robust infrastructure for storing and searching text embeddings. This flexibility allows users to choose the most suitable storage solution for their specific needs.
from langchain.vectorstores import Qdrant
# Build a Qdrant collection from the chunks; embedding and storage happen in one call
vector_store = Qdrant.from_documents(
    texts,
    embedder,
    host="localhost",
    port=6333,
    collection_name="big_rag_documents",  # any collection name you like
)
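With the vector store populated, the Retrieval step itself reduces to a similarity search; the question below is just a placeholder:
# Ask a question; Qdrant returns the most semantically similar chunks
query = "What does the document say about data privacy?"
results = vector_store.similarity_search(query, k=3)
for doc in results:
    print(doc.page_content)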