Renewal·마흔의 생활코딩

LLM | Ollama Part 4: Applying Retrieval-Augmented Generation (RAG)

February 25, 2024·7 min read

- Ollama Part 1. Running it from a local terminal: Linux (wsl 2), MacOS
- Ollama Part 2. Running it in a local browser : open-webui
- Ollama Part 3. Running it in an online browser (on my own domain)
- Ollama Part 4. Applying Retrieval-Augmented Generation (RAG) (this post)
- Ollama Part 5. Applying image recognition
- (in preparation) Ollama Part 6. Applying the MOE (Mixture of Experts) approach

Before we get into Ollama RAG

As a side note: LangChain is a framework that covers the general concept of RAG and provides the related APIs, so I recommend going through the basics and a quick hands-on with LangChain first. I have posted on this topic before, so I'll link those posts up front before we begin.

https://normalstory.tistory.com/entry/LM-to-RAGfeat-Langchain-01-%EA%B0%9C%EC%9A%94

LM to RAG(feat. Langchain) - 01 Overview

LM and LLM overview, architecture - Transformer (Decoder, Encoder), learning algorithms - language model LM, LM workflow: foundation model* > RLHF technique**, pre-training on massive compute resources and data

https://normalstory.tistory.com/entry/LM-to-RAGfeat-Langchain-02-%EC%8B%A4%EC%8A%B5-%EC%9E%91%EC%84%B1-%EC%A4%91

LM to RAG(feat. Langchain) - 02 Colab hands-on

The hands-on link is Colab. Basic chat setup, API key signup and issuance, hands-on with the key, talking to GPT, OpenAI - Documents, LangChain - LangChain (LLM) practice, comparing GPT 3 vs 3.5, tuning parameters such as temperature (0: consistent answers, 2: a different answer every time).

*A heads-up: the example linked above is from the second half of last year, before LangChain was updated, so the hands-on code may differ slightly from the current API. The big picture is the same, though; you just need to bring the packages up to the latest versions. Use that link only as a rough reference for the structure, and rely on this post for the actual code.

Now, let's go with Ollama RAG

There are basically two ways to apply RAG with Ollama. [1] using the ChatOllama package provided by LangChain, or [2] using the libraries Ollama itself ships via pip, npm, and so on.

[1] First up, an example of using LangChain's ChatOllama package.

1. Setup is half of any coding project.

As a baseline, Ollama needs to be installed on your local PC. (No separate server setup is required, by the way.)
You can download the Ollama program from the official site, then pull whichever models you are interested in. For the related setup, refer to the first post from the table of contents above.

2. The LLM models we'll be using

For this RAG example, we will use two recently released models from Ollama's library: nomic-embed-text and mistral. Side note 2: nomic-embed-text is an embedding-only model. You can install both by entering the commands below.

ollama pull nomic-embed-text
ollama pull mistral:v0.2

3. The result

For this hands-on, to make things easier to grasp, let's look at the result of running the code first:

The result of running the code. Comparison between the response with RAG applied and without RAG applied

What this example shows: with RAG applied, the model searches the document you provide (a URL or a file), and when you ask something the document does not cover, it answers along the lines of "this cannot be found within the provided document." That said, if you mix both cases (with and without RAG) in a single piece of code, hallucination and paraphrasing tend to get worse no matter how strictly you define the template. (Of course, a heavily trained model can answer accurately without hallucinating even with no document at all. But prompt wording and the differing characteristics of each model are both variables, so in practice you are leaving the outcome to luck. From a service provider's perspective, where those variables have to be controlled, I wouldn't recommend that approach.)

4. The coding

All right, let's start coding. For convenience of the post, the code is divided into roughly four steps: 1) package setup, 2) document loading and splitting, 3) storing (embedding), and 4) RAG setup and running.

1) Package setup and model loading

## ollama RAG(URLs) by langchain-ChatOllama

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_community import embeddings
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain.output_parsers import PydanticOutputParser
from langchain.text_splitter import CharacterTextSplitter

model_local = ChatOllama(model="mistral:v0.2")

2) Document loading and splitting
Document loading can again be configured in roughly three ways depending on the situation:
(1) Uploading a .txt file

# Attach a .txt file
print(".txt attachment)\n")

## Type. txt
# Read the whole file into a single string
with open('청킹.txt', 'r', encoding='utf-8') as file:
    text = file.read()

from langchain.docstore.document import Document

# Cut the text into fixed-size pieces of 35 characters (no overlap)
chunks = []
chunk_size = 35 # Characters
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)
# Wrap each piece in a Document so the vector store can ingest it
doc_splits = [Document(page_content=chunk, metadata={"source": "local"}) for chunk in chunks]
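
The loop above cuts the text every 35 characters, so a sentence that straddles a boundary ends up split across two chunks. One common mitigation is to let neighbouring chunks overlap. The following is only a minimal sketch of that idea, not part of the original code: the function name `chunk_text` and the `overlap` parameter are assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int = 35, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character chunks that share `overlap` characters.

    Hypothetical helper: with overlap=0 it behaves like the plain loop above.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


# Small demonstration on a short string
plain = chunk_text("abcdefghij", chunk_size=4, overlap=0)
overlapped = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(plain)
print(overlapped)
```

The resulting strings can be wrapped in `Document` objects exactly as in the listing above. Overlap increases the number of chunks to embed, so it trades storage and embedding time for better continuity at chunk boundaries.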

(2) Uploading a .pdf file

# Attach a .pdf file
print(".pdf attachment)\n")

# Load the PDF and split it into chunks in one call
loader = PyPDFLoader("청킹.pdf")
doc_splits = loader.load_and_split()

(3) Attaching a URL

# Attach URLs
print("URL attachment)\n")

urls = [
    'https://ollama.com/',
    'https://python.langchain.com/docs/integrations/llms/ollama/'
]
# Load each page, then flatten the per-URL lists into one list of documents
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]

# Split the documents, measuring chunk length in tiktoken tokens
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=7500, chunk_overlap=100)
doc_splits = text_splitter.split_documents(docs_list)
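
To give a feel for what a separator-based splitter does, here is a rough pure-Python sketch: split on a separator, then greedily merge the pieces back together until a size limit is reached. It is an illustration under simplifying assumptions, not LangChain's implementation. It counts characters rather than tiktoken tokens, it ignores overlap, and the name `split_by_separator` is made up for this example.

```python
def split_by_separator(text: str, separator: str = "\n\n", max_chars: int = 200) -> list[str]:
    """Split on `separator`, then merge neighbouring pieces up to `max_chars`.

    Hypothetical helper. A single piece longer than max_chars is kept whole,
    so chunks can exceed the limit when the text has no separator to cut at.
    """
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= max_chars:
            current = candidate      # still fits (or it is the first piece)
        else:
            chunks.append(current)   # close the current chunk
            current = piece          # start a new one
    if current:
        chunks.append(current)
    return chunks


sample = "aaa\n\nbbb\n\nccccccccccc"
print(split_by_separator(sample, max_chars=8))
```

With a limit as large as 7500, short web pages will often come back as a single chunk each, which is worth keeping in mind when judging retrieval quality.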

3) Storing (embedding)
In step 1) we loaded mistral (the LLM that talks to the user), but here, in the embedding step, we use nomic-embed-text, a model dedicated to embeddings rather than a chat LLM.

# Convert the documents to embeddings and store them

vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",
    embedding=embeddings.ollama.OllamaEmbeddings(model='nomic-embed-text'),
)
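
Conceptually, the vector store keeps one vector per chunk and, at query time, returns the chunks whose vectors are closest to the query vector. The toy sketch below illustrates that with hand-written three-dimensional vectors and cosine similarity. Everything in it is an assumption for illustration: real embeddings come from the embedding model and have far more dimensions, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k entries most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hypothetical store: (chunk text, made-up embedding)
toy_store = [
    ("Ollama runs models locally", [0.9, 0.1, 0.0]),
    ("Chroma stores embeddings", [0.1, 0.9, 0.0]),
    ("Bananas are yellow", [0.0, 0.1, 0.9]),
]

# A made-up query vector that points mostly in the first direction
print(top_k([1.0, 0.2, 0.0], toy_store, k=2))
```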

*By the way, as of April 12, Ollama provides three embedding models in total (mxbai, nomic, all-minilm). For more details, see the link below.

Embedding models · Ollama Blog

Embedding models are available in Ollama, making it easy to generate vector embeddings for use in search and retrieval augmented generation (RAG) applications.

4) Time to run the RAG!
For RAG (Retrieval-Augmented Generation), we create a retriever (the search tool) from the vector store, and to keep answers scoped to the document, we place the context (the reference range) inside the prompt. The prompt itself is nothing fancy. It is plain natural language that says, in effect, "when answering, do not stray outside the context I gave you." Note that depending on the model, this instruction may be ignored, or the model may blend in what it learned during training. In my experience, the newer and more heavily trained the model, the stronger this tendency.

# RAG 
 
retriever = vectorstore.as_retriever()
after_rag_template = """
Answer only with reference to what is given in context:
{context}
Question: {question}
"""
after_rag_prompt = ChatPromptTemplate.from_template(after_rag_template)
after_rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | after_rag_prompt
    | model_local
    | StrOutputParser()
)

# The question below is Korean for "Tell me about cognitive psychology"
print(after_rag_chain.invoke("인지심리학에 대해 알려줘?"))
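
To make the prompt step concrete, here is a small stand-alone sketch of how retrieved chunks and the question end up in one prompt string. It does not call LangChain or Ollama. The `Doc` class is a hypothetical stand-in for LangChain's `Document`, and `build_prompt` is a name invented for this example. Only the template wording is taken from the listing above.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (only page_content is used)."""
    page_content: str


TEMPLATE = """Answer only with reference to what is given in context:
{context}
Question: {question}
"""


def format_docs(docs: list[Doc]) -> str:
    """Join the text of the retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(question: str, docs: list[Doc]) -> str:
    """Fill the template with the formatted context and the user's question."""
    return TEMPLATE.format(context=format_docs(docs), question=question)


retrieved = [
    Doc("Chunking splits text into pieces."),
    Doc("Embeddings map text to vectors."),
]
prompt = build_prompt("What is chunking?", retrieved)
print(prompt)
```

Joining only the `page_content` of each chunk keeps metadata and object representations out of the prompt, so the model sees nothing but the text it is supposed to rely on.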

{Appendix}

For comparison, here is the code that throws a question directly at the LLM without applying RAG.

print("Example without RAG:\n")

before_rag_template = "What is {topic}"
before_rag_prompt = ChatPromptTemplate.from_template(before_rag_template)
before_rag_chain = before_rag_prompt | model_local | StrOutputParser()

print(before_rag_chain.invoke({"topic": "청킹(Chunking)"}))

[2] An example using Ollama's own Libraries

1. For details, please refer to Ollama's official documentation

Ollama official libraries

Python & JavaScript Libraries · Ollama Blog

The initial versions of the Ollama Python and JavaScript libraries are now available, making it easy to integrate your Python or JavaScript, or Typescript app with Ollama in a few lines of code. Both libraries include all the features of the Ollama REST AP

Ollama's official libraries come in two flavors: Python (PIP) & JavaScript (NPM).

2. Straight to the coding

For this post's example, like above, we'll go with a Python base.

1) Setting up the virtual environment
Create a venv and activate it

2) Install the libraries

pip install ollama
pip install langchain beautifulsoup4 chromadb gradio

3) Writing the code
The only real difference is that the structure is centered around functions; otherwise it is almost identical.

import ollama
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain.text_splitter import CharacterTextSplitter

urls = [
    'https://ollama.com/',
    'https://python.langchain.com/docs/integrations/llms/ollama/'
]
docs = [WebBaseLoader(url).load() for url in urls]
docs_list = [item for sublist in docs for item in sublist]

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=7500, chunk_overlap=100)
doc_splits = text_splitter.split_documents(docs_list)

# Store the embeddings
vectorstore = Chroma.from_documents(
    documents=doc_splits,
    collection_name="rag-chroma",
    embedding=OllamaEmbeddings(model='nomic-embed-text'),
)

# Apply RAG
retriever = vectorstore.as_retriever()

def format_docs(docs):
    # Join the text of the retrieved chunks, separated by blank lines
    return "\n\n".join(doc.page_content for doc in docs)

# Define the Ollama LLM function
def ollama_llm(question, context):
    formatted_prompt = f"Answer the question based only on the following context: {context}\n\nQuestion: {question}"
    response = ollama.chat(model='mistral:v0.2', messages=[{'role': 'user', 'content': formatted_prompt}])
    return response['message']['content']

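# Use the RAG chain: retrieve the relevant chunks, format them as text,
# and only then pass them to the model as context
# (passing the retriever object itself would put no document text in the prompt)
def rag_chain(question):
    retrieved_docs = retriever.invoke(question)
    formatted_context = format_docs(retrieved_docs)
    return ollama_llm(question, formatted_context)

# The question is Korean for "Tell me about cognitive psychology"
result = rag_chain("인지심리학에 대해 알려줘")
print(result)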

The result of running with Ollama Libraries (pip). Even when run multiple times, it answers that there is no related content in the provided document.

(Additional content planned to be updated)

PDF parsing-related content
Unlike .txt or .md files, PDF documents are not laid out as a single stream of running text: images and tables are interspersed, the layout shifts frequently, and so on. Until now, you often had to tune LangChain by force, stack multiple packages on top of each other, or simply brute-force convert everything to .txt. Recently, though, an open-source project was released that first segments the document by layout using vision, then parses the details of each region. I plan to publish the previously announced posts as additional updates, and once my hands-on with PDF parsing is wrapped up, I'll add the related content here.

This English version was translated by Claude.