There are various LLM-related frameworks out there, and most of them depend on the OpenAI API. I have been trying to lean toward examples that avoid that, and TeddyNote uploaded a hands-on video on the very topics I was interested in, so I'm posting on what I worked through.
Hosting your own local LLM for free with a Korean fine-tuned model (LangServe) + even RAG!! (learning content source: TeddyNote)
(For reference, the well-known tools that help you download an LLM to your local computer and use it easily include roughly three: ollama, AnythingLLM, LM Studio — and TeddyNote's example here uses ollama.)
Hands-on walkthrough
1. Get a Korean fine-tuned model from HuggingFace-Hub
1) Download the practice model (EEVE-Korean-Instruct-10.8B-v1.0)
(1) Method A. CLI
a-1) Create a working folder and move into it.
a-2) From a terminal in that path, install the package : pip install huggingface-hub
a-3) Then, the 'download' command to type into the terminal
- How to write it
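huggingface-cli download \
<repo ID shown as the main title on the Hugging Face model page> \
<name of the .gguf file to download from the file list> \
--local-dir <folder on your computer where the model will be saved> \
--local-dir-use-symlinks <whether to use symbolic (shortcut) links: True or False>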
- References
- Main title: heegyu/EEVE-Korean-Instruct-10.8B-v1.0-GGUF · Hugging Face
- File name: heegyu/EEVE-Korean-Instruct-10.8B-v1.0-GGUF at main
- HuggingFace model download detailed guide: Downloading files from the Hub
- Example usage
huggingface-cli download \
heegyu/EEVE-Korean-Instruct-10.8B-v1.0-GGUF \
ggml-model-Q5_K_M.gguf \
--local-dir /Users/Charles/code/langserve_ollama/ollama-modelfile \
--local-dir-use-symlinks False
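If you would rather script the download than type it by hand, the same command can be assembled in Python. This is only a minimal sketch, assuming huggingface-cli is already on your PATH. The helper name build_download_command and the relative folder are my own placeholders, not part of any library.

```python
import shlex
import subprocess


def build_download_command(repo_id, filename, local_dir):
    """Build the same huggingface-cli call shown above as an argument list."""
    return [
        "huggingface-cli", "download",
        repo_id,
        filename,
        "--local-dir", local_dir,
        "--local-dir-use-symlinks", "False",
    ]


cmd = build_download_command(
    "heegyu/EEVE-Korean-Instruct-10.8B-v1.0-GGUF",
    "ggml-model-Q5_K_M.gguf",
    "ollama-modelfile",  # hypothetical relative folder; use your own path
)
print(shlex.join(cmd))  # shows the command without running it

# Uncomment to actually start the multi-gigabyte download:
# subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell-quoting problems when your folder path contains spaces.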
(2) Method B. Direct download from the site
b-1) Find the model and download.
b-2) Move it into the folder you want.
2. Register the downloaded model with ollama
1) Quick summary first:
(1) In the folder containing the downloaded .gguf file* (location does not matter)
(2) create a file describing the model (Modelfile**, name does not matter)
(3) and run ollama create to add it to the ollama list.
*.gguf is a fairly new file format, barely a year old at the time of writing (see the explainer linked below). Because of this, some models on HuggingFace-Hub are still only available in the older .ggml format, so it helps your sanity to know this in advance.
What is GGUF and GGML? (medium.com)
**The Modelfile is a file with no extension. You could write it in an editor like Vim, but honestly, just writing it in VS Code is healthier for your sanity.
***ollama create is one of the ollama CLI commands. The official docs describe it in detail:
ollama/docs/modelfile.md at main · ollama/ollama (github.com)
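Regarding the .gguf note above: if you are unsure whether a downloaded file really is GGUF, one quick check is to look at its first four bytes, which spell GGUF in a GGUF file. The sketch below uses tiny stand-in files rather than real models, and the function name looks_like_gguf is just an illustration.

```python
import os
import tempfile

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path):
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


# Demo with tiny stand-in files (hypothetical, not real models)
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "fake-model.gguf")
    bad = os.path.join(tmp, "not-a-model.bin")
    with open(good, "wb") as f:
        f.write(GGUF_MAGIC + b"\x03\x00\x00\x00")
    with open(bad, "wb") as f:
        f.write(b"hello world")
    good_result = looks_like_gguf(good)
    bad_result = looks_like_gguf(bad)

print(good_result, bad_result)  # True False
```

This only inspects the header, so it says nothing about whether the rest of the file downloaded completely.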
2) In the same path as the downloaded .gguf file, create a Modelfile and write the following :
FROM ggml-model-Q5_K_M.gguf
TEMPLATE """
{{- if .System }}
<s>{{ .System }}</s>
{{- end }}
<s>Human:
{{ .Prompt }}</s>
<s>Assistant:
"""
SYSTEM """
A chat between a curious user and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the user's questions.
"""
PARAMETER temperature 0
PARAMETER num_predict 3000
PARAMETER num_ctx 4096
PARAMETER stop <s>
PARAMETER stop </s>
Modelfile (model file) writing rules, in summary
FROM - specifies the base model to use (this is the only required instruction)
TEMPLATE - the full prompt template that will be sent to the model
( *not required, but without it the LLM may feel a little 'tipsy')
SYSTEM - specifies the system message that is inserted into the template
( *not required, but without it the LLM might feel like it's ignoring you)
PARAMETER - sets parameters for how Ollama runs the model
( *for example, temperature sets how creative/random it should be: 0 ~ 2)
ADAPTER - specifies a (Q)LoRA adapter to apply on top of the base model, given as an absolute or relative path
( *not needed in this walkthrough, because the .gguf file is used as-is without an adapter)
LICENSE - specifies the legal license.
MESSAGE - specifies the message history.
( *you can use it to show the model the response pattern you want for specific situations; see the official docs. The official-doc example
asks a few questions about whether a city is in a given country and has the model answer each in short yes/no form.)
*The order of instructions doesn't matter, and they are not case-sensitive. Uppercase is just used for readability.
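If you end up making several variants of the same Modelfile, it can be handy to generate the text instead of editing it by hand. The following is a rough sketch under my own assumptions: render_modelfile is a hypothetical helper, and it leaves out the TEMPLATE block for brevity, so you would still paste that in yourself.

```python
def render_modelfile(gguf_name, system, parameters, stops):
    """Assemble Modelfile text from a few pieces (TEMPLATE omitted for brevity)."""
    lines = [f"FROM {gguf_name}", f'SYSTEM """\n{system}\n"""']
    for key, value in parameters.items():
        lines.append(f"PARAMETER {key} {value}")
    for stop in stops:
        # each stop sequence gets its own PARAMETER line, as in the file above
        lines.append(f"PARAMETER stop {stop}")
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "ggml-model-Q5_K_M.gguf",
    "A chat between a curious user and an artificial intelligence assistant.",
    {"temperature": 0, "num_predict": 3000, "num_ctx": 4096},
    ["<s>", "</s>"],
)
print(modelfile_text)
```

Writing modelfile_text to a file named Modelfile next to the .gguf file gives you something you can pass to ollama create with -f.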
3) From the terminal, run ollama create:
(1) Example usage:
ollama create EEVE-Korean-10.8B -f ollama-modelfile/Modelfile-V02
(2) Usage explanation:
Case 1) If the terminal is in the same path as the model file:
ollama create your-desired-model-name -f model-file-name
Case 2) If the terminal is in a different folder than the model file:
ollama create your-desired-model-name -f folder-where-model-lives (varies by user)/model-file-name (varies by user, but typically just 'Modelfile')
4) Verify that it was registered correctly
(1) From a terminal, list the models registered with ollama
*By the way, once ollama is installed on your computer you can run ollama commands from any working directory. (That's because installation registers it in your environment variables.)
ollama list
(2) If it appears in the list, you're good
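If you want to do this check from a script, one option is to parse the text that ollama list prints. A minimal sketch, assuming the output is a header row followed by one row per model with the name in the first column. The sample ID, size, and date below are made up for illustration.

```python
def registered_models(list_output):
    """Return the model names from the first column of `ollama list` output."""
    rows = list_output.strip().splitlines()[1:]  # skip the header row
    return [row.split()[0] for row in rows if row.strip()]


# Illustrative sample only; the ID, size, and date are invented
sample_output = """NAME                       ID            SIZE    MODIFIED
EEVE-Korean-10.8B:latest   0123456789ab  7.7 GB  2 minutes ago
"""

names = registered_models(sample_output)
print("EEVE-Korean-10.8B:latest" in names)  # True

# To check your real installation instead of the sample:
# import subprocess
# live = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True).stdout
# print(registered_models(live))
```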
3. The actual coding
0) Setting up a virtual environment
This is not strictly required. But personally, I find that if I install dependencies straight onto my local machine without setting up a venv, things tend to get unmanageable later, so I'm trying to make it a habit. A representative example of getting stuck is Stable Diffusion: it's optimized not for the latest Python but for version 3.10.6. More broadly, a lot of hands-on examples depend on specific versions of Python or specific package versions, so for your own sanity it's probably best to set up a venv on a per-project basis right from the start before you study them.
1) Create an empty folder, then
2) Create a virtual environment: python -m venv langServe_localRag
3) Activate the venv: source langServe_localRag/bin/activate
*Set langServe_localRag to whatever name you want.
*Side example: while doing this LangServe hands-on, I actually ran into an issue caused by a bug in a specific package version.
(LangServe bug: issue surfaced in 0.1.0 -> patched in 0.1.1 -> works correctly)
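To confirm that the venv from step 3) is really the one in use before you install anything, you can ask the interpreter itself. This is a small sketch that relies on sys.prefix differing from sys.base_prefix inside an environment created with python -m venv. The function name is my own.

```python
import sys


def in_virtualenv():
    """True when the interpreter is running inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


print("venv active:", in_virtualenv())
print("interpreter:", sys.executable)
```

If the first line prints False, pip install would go into your global Python rather than into the project environment.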
1) Setting up dependency packages
pip install -r requirements.txt
*Note that the requirements.txt above is just my own personal list. It's a document that lets me, and others, run the same code under the same conditions. To generate this file, run the command below from a terminal in the project directory.
pip freeze > requirements.txt
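Since a single version bump can matter (as with the LangServe 0.1.0 and 0.1.1 case above), it can help to read the pins back out of requirements.txt. The sketch below handles only the simple name==version lines that pip freeze usually writes. parse_pins is a hypothetical helper and the fastapi version in the sample is invented.

```python
def parse_pins(requirements_text):
    """Map package name -> pinned version for lines of the form name==version."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


sample = """\
fastapi==0.110.0   # hypothetical version, for illustration only
langserve==0.1.1
"""
pins = parse_pins(sample)
print(pins["langserve"])  # 0.1.1
```

To use it on a real project, read the file with open("requirements.txt").read() and pass the text in.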
2) Packages / modules / classes
import os  # module that provides many functions for interacting with the operating system
from fastapi import FastAPI  # fast and easy ASGI web framework used to create APIs
from fastapi.responses import RedirectResponse  # one of the response classes, used to redirect to a specific URL
from fastapi.middleware.cors import CORSMiddleware  # CORS stands for Cross-Origin Resource Sharing and is used when sharing resources across origins
from typing import List, Union  # Python typing module used to add type hints to code
from langserve.pydantic_v1 import BaseModel, Field  # classes used for data validation
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage  # import classes that handle messages (human, AI, system)
from langserve import add_routes  # function that adds routes to a FastAPI application
from langchain_community.chat_models import ChatOllama  # class that handles the Ollama chat model
from langchain_core.output_parsers import StrOutputParser  # class that parses output into string form
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder  # import classes that manage prompts from langchain_core.prompts
[ Notes ]
- ChatPromptTemplate:
- This is the class used to generate prompts that instruct the chat model on a specific task.
- For example, you take a topic from the user and generate a prompt like "Please explain the topic."
- MessagesPlaceholder:
- This is the class for representing the parts that change dynamically inside a prompt template.
- For instance, it's used to slot user questions or other messages into the prompt template. So this is what you use to inject the parts that vary based on the user's message.
3) Loading env
from dotenv import load_dotenv  # loads variables inside the .env file.
load_dotenv()
[ Notes ]
Create a file named just .env (nothing before the dot) in the root (top-level) folder, then write it as below. Out of the lines below, only the 'ls__1111aa 22bb3 3444cc 5666dd 77e' part needs to be replaced with the API key you obtained yourself.
LangChain issues API keys on the LangSmith page: smith.langchain.com
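To get a feel for what load_dotenv() does with that file, here is a deliberately simplified stand-in. It is not python-dotenv's real parser, and DEMO_API_KEY is a made-up variable name. The actual variable names come from the LangSmith setup page and are not reproduced here.

```python
import os
import tempfile


def load_env_file(path):
    """Very simplified stand-in for load_dotenv(): copy KEY=VALUE lines into os.environ."""
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, value = line.split("=", 1)
            # Variables that already exist are kept; otherwise the file value is stored
            os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


# Demo with a throwaway .env file; DEMO_API_KEY is hypothetical
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as f:
        f.write('# demo file\nDEMO_API_KEY="ls__replace-with-your-own-key"\n')
    load_env_file(env_path)

print(os.environ["DEMO_API_KEY"])
```

Keeping the key in .env rather than in app.py also makes it easy to leave the secret out of version control.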
4) Code 1. Performing a specific task and returning the result based on the client's request
# Load Ollama: LangChain chat model
llm = ChatOllama(model="EEVE-Korean-10.8B:latest")

# Create a prompt: instruct the model to explain a given topic briefly
prompt_prompt = ChatPromptTemplate.from_template("Briefly explain {topic}.")

# Build the chain: take user input, pass it through the chat model, and parse the output
chain = prompt_prompt | llm | StrOutputParser()

# Build the translation chain: produce parsed output from the input sentence (prompt) through the chat model
# Note: translator_prompt is defined in step 5), so in app.py this line must come after that definition
EN_TO_KO_chain = translator_prompt | llm | StrOutputParser()
5) Composing the prompts
# Create the chat prompt: use ChatPromptTemplate's from_messages method
chat_prompt = ChatPromptTemplate.from_messages(
    [
        # The first element of this list is the system message, which gives the AI its instructions.
        # Here it states that the AI's name is '테디' (Teddy), that it is a helpful AI Assistant, and that it must answer in Korean.
        (
            "system",
            "You are a helpful AI Assistant. Your name is '테디'. You must answer in Korean.",
        ),
        # This prompt includes a placeholder that holds the user's messages.
        # A placeholder represents the part of a template that changes.
        # Setting 'variable_name' to 'messages' makes the prompt include the messages of the chat conversation.
        MessagesPlaceholder(variable_name="messages"),
    ]
)

# Connect 'chat_prompt', 'llm', and 'StrOutputParser' into a chain with the '|' operator.
chat_chain = chat_prompt | llm | StrOutputParser()

# Create the translation prompt: a prompt that instructs the model to translate the given sentences into Korean.
translator_prompt = ChatPromptTemplate.from_template(
    "Translate following sentences into Korean:\n{input}"
)
[Notes]
- 'chat_prompt' takes the user's input and generates the prompt.
- 'llm' uses the generated prompt to produce a response from the model.
- 'StrOutputParser' parses the generated response and turns it into text form.
By chaining these together this way, the entire flow of taking user input, generating a response, and returning it as text is implemented in a single line.
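To make the '|' chaining less magical, here is a toy re-creation of the idea in plain Python. It is not LangChain's Runnable implementation. The Step class, the fake model, and all toy_ names are my own stand-ins, meant only to show how each step's output feeds the next and how a message placeholder slots into a fixed prompt.

```python
class Step:
    """Toy stand-in for a chain link; this is not LangChain's Runnable class."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b -> a new Step that runs a, then feeds its result to b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Fixed system message plus a slot for the caller's messages (the placeholder idea)
toy_prompt = Step(
    lambda inputs: [("system", "You must answer in Korean.")] + inputs["messages"]
)
# Fake model: reports how many messages it received instead of calling an LLM
toy_llm = Step(lambda messages: {"content": f"received {len(messages)} messages"})
# Parser: pull the plain string out of the model's response object
toy_parser = Step(lambda response: response["content"])

toy_chain = toy_prompt | toy_llm | toy_parser
result = toy_chain.invoke({"messages": [("human", "Hello")]})
print(result)  # received 2 messages
```

The real classes do much more (streaming, batching, tracing), but the left-to-right data flow is the same.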
6) Initialize the FastAPI app and configure CORS (Cross-Origin Resource Sharing)
- Create a FastAPI instance and add CORS middleware so the API can be accessed and used (resource sharing) from other domains too.
# FastAPI is a Python web framework used to create APIs
app = FastAPI()

# CORS (Cross-Origin Resource Sharing); "*" means all origins are allowed.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
    expose_headers=["*"],
)
7) Set up and run a RESTful API server
# If a GET request comes to the root URL ("/"), redirect it to "/prompt/playground"
@app.get("/")
async def redirect_root_to_docs():
    return RedirectResponse("/prompt/playground")
8) Configure each individual service
- Set up various paths ("/prompt", "/chat", "/translate", "/llm") so that when a request comes in on each path, it's handled and responded to in the appropriate way.
# 1. Set up the "/prompt" URL so the basic Q&A chain can be accessed
add_routes(app, chain, path="/prompt")


# 2-2. Define the input type for the chat endpoint.
# This is a list of messages that make up a conversation and may include human, AI, and system messages.
class InputChat(BaseModel):
    """Input for the chat endpoint."""

    messages: List[Union[HumanMessage, AIMessage, SystemMessage]] = Field(
        ...,
        description="The chat messages representing the current conversation.",
    )


# 2-1. Route the chat chain (chat_chain) to the "/chat" path (URL) so it can be accessed
# Also enable the feedback endpoint and public trace link endpoint, and set this endpoint type to "chat"
add_routes(
    app,
    chat_chain.with_types(input_type=InputChat),
    path="/chat",
    enable_feedback_endpoint=True,
    enable_public_trace_link_endpoint=True,
    playground_type="chat",
)

# 3. Route the translation chain (EN_TO_KO_chain) to the "/translate" URL path
add_routes(app, EN_TO_KO_chain, path="/translate")

# 4. Route "/llm" so the ChatOllama model can be accessed directly
add_routes(app, llm, path="/llm")
9) When you run app.py, it auto-launches via uvicorn so that this application (the FastAPI server) starts up.
# Run the code below when executed as the main program.
if __name__ == "__main__":
    # uvicorn is an ASGI server. It is used to host the FastAPI application.
    # Here it runs the application on address 0.0.0.0 and port 8000.
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
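Besides the playground pages, the routes registered in step 8) can be called as plain HTTP endpoints. The sketch below only prepares a request for the "/prompt" route using the standard library. It assumes LangServe's usual convention of a POST to the route's /invoke endpoint with the chain input under an "input" key, and build_invoke_request is a name I made up.

```python
import json
import urllib.request


def build_invoke_request(base_url, path, chain_input):
    """Prepare a POST to a LangServe route's /invoke endpoint (nothing is sent yet)."""
    body = json.dumps({"input": chain_input}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}{path}/invoke",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The "/prompt" chain expects a {topic} variable, so the input is a dict with that key
request = build_invoke_request("http://localhost:8000", "/prompt", {"topic": "ollama"})
print(request.full_url)  # http://localhost:8000/prompt/invoke

# With app.py running, send it and read the "output" field of the JSON reply:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["output"])
```

For "/translate" the input would instead carry the {input} variable of the translation prompt, and the base URL can be swapped for your public address once the server is exposed.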
10) Deploying with NGROK
(1) Sign up for ngrok (ngrok.com)
(2) Register a domain: https://dashboard.ngrok.com/cloud-edge/domains
(3) Run the base model and the code
(3-1) Start ollama and app.py
(3-2) Open another terminal
(4) Tunnel through ngrok (port forward)
(4-1) Run the command using the same port that the code is running on
(4-2) Port-forwarding result
(5) Access
- Access the basic Q&A chat chain
- Access the chat chain
- Monitoring on LangSmith
- Access the translation chain
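If the tunnel in step (4) comes up but the pages do not load, a frequent cause is that nothing is listening on the port you forwarded. A small standard-library check, assuming app.py is on port 8000 as in the code above. The function name is my own.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


print("app.py reachable on 8000:", is_port_open("127.0.0.1", 8000))
```

Run it in the second terminal from step (3-2) before starting the tunnel. It should print True once app.py is up.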
Code
https://github.com/normalstory/local_ollama_langserve/tree/main

