Overview
An AI-powered chatbot that lets non-technical users query complex databases using plain English. The system uses LangChain with a RAG pipeline to understand database schema context and generate accurate, safe SQL queries in real time.
The Problem
Business analysts and operations teams frequently needed data insights but had to wait for developers to write SQL queries — a bottleneck that slowed decision-making. The goal was to give non-technical users direct, self-serve access to data through a conversational interface.
How It Works
At setup, the database schema (table names, column names, relationships, sample values) is embedded into a vector database. When a user asks a question, the query is embedded and the most relevant schema chunks are retrieved. These are injected into the LLM prompt alongside strict safety constraints.
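A minimal sketch of that retrieval step, assuming a Chroma collection populated at setup with one chunk per table and the OpenAI chat API (both are illustrative choices, not a prescription of the project's exact stack):

```python
import chromadb
from openai import OpenAI

chroma = chromadb.Client()
schema_index = chroma.get_or_create_collection("schema_chunks")
llm = OpenAI()

SYSTEM_PROMPT = (
    "You translate questions into a single read-only SQL SELECT statement. "
    "Use only the tables and columns listed in the schema context. "
    "Never generate INSERT, UPDATE, DELETE, DROP, or ALTER statements."
)

def question_to_sql(question: str, k: int = 5) -> str:
    # Embed the question and pull the k most relevant schema chunks.
    hits = schema_index.query(query_texts=[question], n_results=k)
    schema_context = "\n\n".join(hits["documents"][0])

    # Inject the retrieved schema plus safety constraints into the prompt.
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Schema context:\n{schema_context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```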
Key Features
- Natural language to SQL translation via LangChain
- RAG pipeline for schema-aware query generation
- Read-only query enforcement, blocking destructive operations (see the sketch after this list)
- Query explanation in plain English alongside the SQL
- Support for complex JOINs, aggregations, and subqueries
- Vector DB stores schema embeddings for fast semantic retrieval
- Query history and bookmarking for frequent lookups
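A minimal sketch of the read-only guard, assuming the generated SQL arrives as a single statement string; the function name and checks are illustrative, and in practice this would be paired with a read-only database role:

```python
import re

# Conservative deny-list of keywords that should never appear in a generated query.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|grant|revoke)\b",
    re.IGNORECASE,
)

def assert_read_only(sql: str) -> str:
    statement = sql.strip().rstrip(";")
    if ";" in statement:
        raise ValueError("Multiple statements are not allowed")
    if not re.match(r"^\s*(select|with)\b", statement, re.IGNORECASE):
        raise ValueError("Only SELECT queries are allowed")
    if FORBIDDEN.search(statement):
        raise ValueError("Destructive keyword detected")
    return statement
```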
Challenges
The hardest problem was handling ambiguous queries. When a user asks about "last month's sales", the model needs to work out the correct date column, format, and timezone. This was solved by enriching the schema embeddings with sample data and column descriptions.
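For illustration, an enriched schema chunk might be built like the following before embedding; the helper and field layout are hypothetical, not the project's exact format:

```python
def build_column_chunk(table: str, column: str, dtype: str,
                       description: str, samples: list[str]) -> str:
    # One retrievable chunk per column: name, type, human description, sample values.
    return (
        f"Table: {table}\n"
        f"Column: {column} ({dtype})\n"
        f"Description: {description}\n"
        f"Sample values: {', '.join(samples)}"
    )

# Example: giving the retriever enough signal to map "last month's sales"
# to the right date column and timezone.
chunk = build_column_chunk(
    "orders", "ordered_at", "timestamptz",
    "UTC timestamp when the order was placed",
    ["2024-03-01 14:22:05+00", "2024-03-02 09:10:44+00"],
)
```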
Outcome
Reduced ad-hoc data requests to the engineering team by a significant margin. Non-technical team members can now get instant answers without writing a single line of SQL.
Case Study — Challenges Faced & How They Were Solved
Building a production-grade RAG chatbot that businesses actually trust is harder than the demo videos suggest. Here's what went wrong and what got fixed.
Challenge 1 — Chunking Strategy Lost Semantic Context
The first naive pass split documents on a fixed character count. The retrieval results looked plausible in benchmarks but produced wrong answers in practice — questions that needed two adjacent paragraphs would only get one, and the LLM happily filled the gap with hallucinated context.
Solution: moved to recursive paragraph-aware splitting with sliding-window overlap (15% of chunk size). Each chunk carries metadata tags — source document, section heading, parent chunk ID — so the LLM can stitch context back together when needed. Retrieval recall on the eval set jumped from 62% to 89%.
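A sketch of that splitting step, assuming the langchain-text-splitters package; the 1000-character chunk size is illustrative, with the overlap set to 15% of it:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

CHUNK_SIZE = 1000
splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=int(CHUNK_SIZE * 0.15),   # sliding-window overlap
    separators=["\n\n", "\n", ". ", " "],   # prefer paragraph boundaries
)

def chunk_document(doc_id: str, heading: str, text: str):
    chunks = splitter.create_documents(
        [text],
        metadatas=[{"source": doc_id, "section": heading}],
    )
    for i, chunk in enumerate(chunks):
        chunk.metadata["chunk_id"] = f"{doc_id}:{i}"
        # Pointer to the preceding chunk so adjacent context can be stitched back together.
        chunk.metadata["parent_chunk_id"] = f"{doc_id}:{max(i - 1, 0)}"
    return chunks
```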
Challenge 2 — Multilingual Embeddings Were Mediocre
Customers needed English and Hindi support. The first version used a single English-trained embedding model and translated Hindi queries to English before embedding. Latency went up, translation errors stacked with retrieval errors, and Hindi-specific phrasing got mangled.
Solution: switched to multilingual embedding models (e5-multilingual / bge-m3) that share an embedding space across languages. A Hindi query now retrieves the most relevant English chunks directly — no translation step. The LLM's final response respects the query language via a system-prompt directive ("respond in the same language as the user's question").
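A minimal sketch of the shared embedding space, assuming sentence-transformers and the intfloat/multilingual-e5-base checkpoint (the E5 family expects "query: " / "passage: " prefixes); the passages and question are made up for illustration:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")

passages = [
    "passage: Refunds are processed within 5 business days of approval.",
    "passage: Enterprise plans include a dedicated account manager.",
]
passage_vecs = model.encode(passages, normalize_embeddings=True)

# A Hindi question retrieves English chunks directly -- no translation step.
query_vec = model.encode("query: रिफंड कितने दिनों में मिलता है?",
                         normalize_embeddings=True)
scores = passage_vecs @ query_vec   # cosine similarity (vectors are normalized)
print(scores.argmax())              # -> 0, the refund passage
```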
Challenge 3 — Streaming Without Choking the Frontend
Synchronous generation made the UI feel dead for 5–15 seconds per response. Users assumed the bot was broken and refreshed.
Solution: token-by-token streaming via Server-Sent Events from the backend, rendered with a typewriter-style component on the frontend. First-token latency dropped to under 800ms. Users see the response building in real time, which both improves perceived speed and signals "this thing is alive."
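A sketch of the streaming endpoint, assuming FastAPI on the backend and the OpenAI Python client; retrieval and prompt assembly are elided to keep the example short:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
llm = OpenAI()

@app.get("/chat/stream")
def chat_stream(question: str):
    def event_stream():
        # stream=True yields deltas as the model produces them.
        stream = llm.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model choice
            messages=[{"role": "user", "content": question}],
            stream=True,
        )
        for chunk in stream:
            token = chunk.choices[0].delta.content
            if token:
                yield f"data: {token}\n\n"   # one SSE frame per token
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```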
Challenge 4 — Hallucinations on Out-of-Scope Questions
Even with RAG, the LLM would confidently answer questions the knowledge base didn't cover. "What's your refund policy?" → invented refund terms. Trust killer.
Solution: added a confidence scoring layer. After retrieval, compute the average similarity score of the top-k chunks. If below a threshold (tuned per deployment), the LLM is prompted with explicit instructions to say "I don't have information on that — let me connect you with a human" and route the conversation to a support handoff queue. False-positive answers dropped to near zero.
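A sketch of the gate, assuming the retriever returns a similarity score with each chunk; the threshold value and names are illustrative:

```python
FALLBACK = "I don't have information on that — let me connect you with a human."

def gate_retrieval(hits: list[tuple[str, float]], threshold: float = 0.75):
    """hits: list of (chunk_text, similarity_score), highest first."""
    if not hits:
        return None, FALLBACK
    avg_score = sum(score for _, score in hits) / len(hits)
    if avg_score < threshold:
        return None, FALLBACK            # route to the support handoff queue
    context = "\n\n".join(text for text, _ in hits)
    return context, None                 # safe to answer from the retrieved context
```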
Challenge 5 — Knowledge Base Drift
Customers added, edited, and removed docs constantly. The vector store became stale within days — old chunks for deleted docs were still being retrieved.
Solution: built an admin dashboard with an indexed change log. Every upload triggers a re-chunk + re-embed of the affected document, marks superseded chunks as deleted (soft delete with version pointers), and re-runs eval queries to verify the change didn't break existing flows. Nightly cron prunes orphaned chunks.
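A sketch of the re-index step triggered by an upload, assuming a Chroma collection and reusing the chunk_document() helper from the splitting sketch above; the metadata fields are illustrative:

```python
import time

def reindex_document(collection, doc_id: str, heading: str, new_text: str):
    version = int(time.time())

    # Soft-delete the chunks that belonged to the previous version of this document.
    old = collection.get(where={"source": doc_id})
    if old["ids"]:
        collection.update(
            ids=old["ids"],
            metadatas=[{**meta, "deleted": True, "superseded_by": version}
                       for meta in old["metadatas"]],
        )

    # Re-chunk and re-embed the new version; retrieval filters on deleted=False.
    chunks = chunk_document(doc_id, heading, new_text)
    collection.add(
        ids=[f"{c.metadata['chunk_id']}:{version}" for c in chunks],
        documents=[c.page_content for c in chunks],
        metadatas=[{**c.metadata, "version": version, "deleted": False}
                   for c in chunks],
    )
```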
Challenge 6 — Cost Spirals From Long Conversations
Conversation history kept getting stuffed into the LLM context window. By turn 10, prompts were 8000 tokens and costs were 5x what they should have been.
Solution: sliding-window memory keeping the last 10 turns at full fidelity, with older turns summarized into a single rolling summary. Token usage per request stays bounded; the bot still remembers what was discussed earlier in the session because it lives in the summary.
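A sketch of the bounded-memory scheme; summarize() stands in for a call to a cheap LLM and is hypothetical, while the 10-turn window matches the figure above:

```python
WINDOW = 10

def build_history(turns: list[dict], rolling_summary: str, summarize):
    """turns: full conversation as [{"role": ..., "content": ...}, ...]."""
    recent, older = turns[-WINDOW:], turns[:-WINDOW]
    if older:
        # Fold turns older than the window into one rolling summary
        # (in practice only newly evicted turns are summarized each call).
        rolling_summary = summarize(previous_summary=rolling_summary, turns=older)
    messages = []
    if rolling_summary:
        messages.append({"role": "system",
                         "content": f"Summary of earlier conversation: {rolling_summary}"})
    messages.extend(recent)            # last 10 turns at full fidelity
    return messages, rolling_summary
```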
Challenge 7 — Source Attribution for Trust
Business stakeholders wanted to see *where* an answer came from. "Trust me, it's in the docs" wasn't enough.
Solution: every response carries inline citation markers ([1], [2], [3]) that link to the source chunks displayed alongside the chat. Hovering surfaces the exact snippet that informed each part of the answer. Customers stopped asking "is this right?" because they could verify it themselves in two clicks.
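A sketch of how the numbered sources might be assembled into the prompt and returned for rendering next to the chat; the field names are assumptions:

```python
def build_cited_prompt(question: str, chunks: list[dict]):
    """chunks: [{"text": ..., "source": ..., "section": ...}, ...], ranked."""
    sources, context_lines = [], []
    for i, chunk in enumerate(chunks, start=1):
        context_lines.append(f"[{i}] {chunk['text']}")
        sources.append({"marker": f"[{i}]",
                        "source": chunk["source"],
                        "section": chunk["section"]})
    prompt = (
        "Answer using only the numbered sources below and cite them inline "
        "with their markers, e.g. [1].\n\n"
        + "\n\n".join(context_lines)
        + f"\n\nQuestion: {question}"
    )
    # sources are rendered alongside the chat so each marker can be hovered and verified
    return prompt, sources
```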
Challenge 8 — Embedding-Model Migration Without Re-indexing the World
When a better embedding model dropped (text-embedding-3-large), upgrading meant re-embedding millions of chunks. Customers couldn't tolerate hours of downtime, and re-running the whole pipeline at once would have melted the rate limits.
Solution: dual-write migration. The new model embeds new chunks immediately; a background worker slowly re-embeds old chunks in priority order (most-queried first). Both vector stores serve reads during migration via a query-fanout layer that takes the higher-confidence result. Migration completed over two weeks with zero downtime.
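A sketch of the fanout layer, assuming both stores expose a search(query, k) call that returns chunks plus a top similarity score; the names and the score comparison are illustrative:

```python
def fanout_search(query: str, old_store, new_store, k: int = 5):
    old_chunks, old_score = old_store.search(query, k)
    new_chunks, new_score = new_store.search(query, k)
    # Serve whichever index is more confident about this query while the
    # background worker re-embeds old chunks in most-queried-first order.
    return new_chunks if new_score >= old_score else old_chunks
```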
What I'd Do Differently Next Time
- Build the eval harness first. Half the chunking-strategy iteration was guesswork before I had a proper retrieval-recall benchmark. Start with 100 hand-labeled question-chunk pairs on day one.
- Treat the knowledge base as a versioned artifact — Git-like history, not a bag of documents.
- Pin model versions explicitly in the prompt — silent OpenAI model swaps shifted answer style mid-quarter.
Outcome
Live RAG chatbot serving production traffic in English and Hindi, with source attribution, confidence-gated handoff, and a customer-facing admin dashboard for knowledge-base updates. Answer accuracy on the internal eval suite holds above 90%, hallucination rate sits near zero on in-scope questions, and the support team uses the same dashboard to keep the bot aligned with shifting product copy.