AI · LLM · RAG · Chatbot · Python

How to Build an LLM-Powered Customer Support Chatbot with RAG

A step-by-step guide to building a production-ready AI customer support chatbot using LangChain, OpenAI, Pinecone vector database, and FastAPI — with human escalation built in.

Softotic Engineering · 25 February 2025 · 3 min read


Generic LLM chatbots hallucinate. They answer confidently with wrong information because they lack context about your business. The solution is RAG (Retrieval-Augmented Generation) — grounding the LLM in your own knowledge base. Here's how Softotic builds this for clients.

What Is RAG?

RAG = Retrieval-Augmented Generation.

Instead of asking the LLM to answer from memory (which leads to hallucinations), you:

  • Retrieve relevant documents from your knowledge base.
  • Include them in the LLM's prompt as context.
  • Let the LLM generate its answer based only on the retrieved context.

Result: accurate, specific, verifiable answers grounded in your business data.
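The retrieve-then-generate loop can be sketched in a few lines. This is a toy illustration using plain keyword overlap in place of real vector search, with a made-up three-entry knowledge base; the `retrieve` and `build_prompt` helpers are illustrative names, not part of any library:

```python
# Minimal sketch of the RAG loop. Retrieval is mocked with keyword-overlap
# scoring; a production system would use embedding-based vector search.

def retrieve(query: str, kb: list[str], k: int = 2) -> list[str]:
    """Return the k knowledge-base chunks sharing the most words with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        kb,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Assemble system instructions, retrieved context, and the user message."""
    context = "\n\n".join(chunks)
    return (
        "Answer ONLY from the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

kb = [
    "Refunds are issued within 14 days of purchase.",
    "Our support hours are 9am-6pm GMT, Monday to Friday.",
    "Enterprise plans include a dedicated account manager.",
]
query = "How long do refunds take?"
prompt = build_prompt(query, retrieve(query, kb))
```

The prompt that comes out is what gets sent to the LLM: instructions, grounding context, then the question. Swapping the mock `retrieve` for a vector-store query is the only structural change needed.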

Architecture Overview

```
User message
    ↓
[Query Embedding]            (OpenAI / sentence-transformers)
    ↓
[Vector Search in Pinecone]  →  Top-K relevant chunks
    ↓
[Prompt Construction]        =  System prompt + Context chunks + User message
    ↓
[OpenAI GPT-4o] generates response
    ↓
[Confidence check]           →  if low confidence: escalate to human
    ↓
Response to user
```

Step 1: Build Your Knowledge Base

Index your knowledge base into a vector database.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Load docs (could be PDF, markdown, website scrape)
docs = load_documents("./knowledge_base/")

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = PineconeVectorStore.from_documents(chunks, embeddings, index_name="support-kb")
```
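The `load_documents` helper above is left abstract on purpose. In practice you'd likely reach for LangChain's `DirectoryLoader`, but a minimal standard-library version makes its contract concrete; the `Document` dataclass here is a stand-in for LangChain's own `Document` type:

```python
# One possible load_documents, sketched with the standard library only.
# In a real pipeline, LangChain's DirectoryLoader would do this work.
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class Document:
    page_content: str
    metadata: dict = field(default_factory=dict)

def load_documents(root: str, patterns=("*.md", "*.txt")) -> list[Document]:
    """Read every markdown/text file under root, tagging each with its source path."""
    docs = []
    for pattern in patterns:
        for path in sorted(Path(root).rglob(pattern)):
            docs.append(Document(path.read_text(encoding="utf-8"), {"source": str(path)}))
    return docs
```

Keeping the source path in `metadata` matters: it's what lets the API return citations alongside each answer later on.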

Step 2: Build the RAG Chain

```python
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    verbose=False,
)
```

Step 3: FastAPI Endpoint

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    session_id: str
    message: str

@app.post("/chat")
async def chat(req: ChatRequest):
    response = chain.invoke({"question": req.message})

    # Escalation trigger: model returns low-confidence signal
    should_escalate = needs_human(response["answer"])

    return {
        "answer": response["answer"],
        "escalate": should_escalate,
        "sources": [doc.metadata for doc in response.get("source_documents", [])],
    }
```

Step 4: Human Escalation

Escalation is a critical feature that is often overlooked. When the bot says "I'm not sure" or the user asks for a human, escalate:

  • Flag the session as escalated in the database.
  • Alert live agents via WebSocket or notification.
  • Show the full conversation history to the agent.
  • Agent takes over; user sees "You're now connected to a support agent."
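The `needs_human` check used in the `/chat` endpoint can start as a simple phrase heuristic; the phrase lists below are illustrative and should be tuned against real transcripts (a classifier or LLM-as-judge can replace this later):

```python
# A first-pass escalation heuristic: escalate when the answer signals
# uncertainty, or when the user explicitly asks for a person.
UNCERTAIN_PHRASES = ("i'm not sure", "i am not sure", "i don't know", "i don't have")
HUMAN_REQUESTS = ("speak to a human", "talk to an agent", "real person")

def needs_human(answer: str, user_message: str = "") -> bool:
    answer_l, msg_l = answer.lower(), user_message.lower()
    return (
        any(p in answer_l for p in UNCERTAIN_PHRASES)
        or any(p in msg_l for p in HUMAN_REQUESTS)
    )
```

Crude as it looks, a phrase list catches the two highest-volume escalation triggers on day one, and every false negative it misses shows up in the conversation logs for the next iteration.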

Step 5: Multi-Channel Integration

  • Web widget: a React component connects to the `/chat` API via WebSocket.
  • WhatsApp: WhatsApp Business API webhook → your chat API → response via WhatsApp.

Production Considerations

  • Session management: Store chat history in Redis with TTL.
  • Rate limiting: Per-IP and per-session to prevent abuse.
  • Content filtering: Validate inputs to prevent prompt injection.
  • Logging: Log all conversations for quality review and model fine-tuning.
  • Monitoring: Track average response latency, escalation rate, user satisfaction.
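The per-session rate limiting mentioned above can be sketched as a sliding window. This in-memory version is for illustration; in production the same counters would live in Redis (e.g. `INCR` + `EXPIRE`) so they survive restarts and work across workers:

```python
# Sliding-window rate limiter keyed by session ID, in-memory for illustration.
import time
from collections import defaultdict, deque

class RateLimiter:
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits: dict[str, deque] = defaultdict(deque)

    def allow(self, session_id: str) -> bool:
        now = time.monotonic()
        q = self.hits[session_id]
        while q and now - q[0] > self.window:  # evict hits outside the window
            q.popleft()
        if len(q) >= self.max_requests:
            return False
        q.append(now)
        return True
```

In the FastAPI endpoint, this becomes a one-line guard before invoking the chain: reject with HTTP 429 when `allow(req.session_id)` returns False.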

Keeping the Knowledge Base Fresh

Set up a pipeline to re-index when content changes:

  • Webhook from your CMS triggers re-ingestion
  • Weekly full re-index as a scheduled job
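Between webhook-triggered and full re-indexes, incremental runs can skip unchanged files by comparing content hashes. A minimal sketch, assuming the previous run's hashes are persisted as a `{path: hash}` mapping (the helper names are illustrative):

```python
# Change detection for incremental re-indexing: re-ingest only files whose
# content hash differs from the one recorded at the last index run.
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(root: str, previous: dict[str, str]) -> list[Path]:
    """Return files under root that are new or modified since the last run."""
    return [
        p for p in sorted(Path(root).rglob("*.md"))
        if previous.get(str(p)) != content_hash(p)
    ]
```

Only the paths returned here need to be re-chunked, re-embedded, and upserted into Pinecone, which keeps the weekly job cheap even for large knowledge bases.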

Conclusion

A RAG-based customer support bot, built properly, can reduce human-handled support volume by 60–80% while maintaining accuracy. The critical success factor is a well-structured, comprehensive knowledge base.

Ready to add AI support to your product? Softotic's AI team can build it.
