Under the Hood of a RAG System: a plain-language tutorial, built around a real poetry blog
I wrote this article to help decode the building of an AI RAG system. When I first came across RAG I found it mesmerising, almost stupefying. Slowly, bit by bit, I came to understand what goes on under the hood. This is an attempt to unravel the system, using the poems on this blog as a knowledge base. It is my earnest desire that it will demystify RAG a little.
Start of Tutorial
1. What is a RAG system, and why does it matter?
Large language models like GPT or Phi-mini are trained on vast amounts of text. They learn to produce fluent, knowledgeable responses. But there is a problem: they have no knowledge of your specific content. Ask Phi-mini about a poem on your blog and it will improvise something generic, entirely disconnected from what you actually wrote.
RAG — Retrieval-Augmented Generation — solves this by splitting the task in two. First, a retrieval system finds the passages from your own content that are most relevant to the question. Then, a language model reads those passages and generates an answer grounded in them. The model is no longer guessing; it is responding to real material.
The system described in this tutorial was built for hunterfiftyfour.blogspot.com, a poetry blog with 286 posts. Every example and result shown here is drawn from that real deployment.
2. The four-stage pipeline
The complete system moves through four stages, each handled by a dedicated Python script.
• Stage 1 — Scrape: fetch all blog posts and save them as structured JSON.
• Stage 2 — Chunk: break each poem into smaller overlapping pieces.
• Stage 3 — Retrieve: when a question arrives, find the most relevant chunks using semantic similarity.
• Stage 4 — Generate: pass those chunks to a local language model and let it compose the answer.
3. Stage 1 — Scraping the blog
Blogspot exposes an Atom feed at a predictable URL. The scraper requests this feed in batches of 500 posts, strips all HTML tags, decodes special characters, and saves the result as blog_data.json. Each entry becomes a clean record with three fields: the title, the content, and the URL of the original post.
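The heart of the scraper can be sketched in a few lines. The function names and batching details here are illustrative rather than the system's exact code, and this sketch requests the JSON rendering of the feed (`alt=json`) because it is easier to parse without an XML library — the tutorial's Atom feed lives at the same URL.

```python
import json
import re
from html import unescape
from urllib.request import urlopen

FEED = ("https://hunterfiftyfour.blogspot.com/feeds/posts/default"
        "?alt=json&max-results=500&start-index={}")

def strip_html(raw: str) -> str:
    """Drop HTML tags, decode entities like &amp;, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = unescape(text)
    return re.sub(r"\s+", " ", text).strip()

def scrape(outfile: str = "blog_data.json") -> None:
    """Fetch all posts in batches of 500; save title/content/url records."""
    posts, start = [], 1
    while True:
        feed = json.load(urlopen(FEED.format(start)))
        entries = feed["feed"].get("entry", [])
        if not entries:
            break
        for e in entries:
            url = next(l["href"] for l in e["link"] if l["rel"] == "alternate")
            posts.append({
                "title": strip_html(e["title"]["$t"]),
                "content": strip_html(e["content"]["$t"]),
                "url": url,
            })
        start += len(entries)
    with open(outfile, "w", encoding="utf-8") as f:
        json.dump(posts, f, ensure_ascii=False, indent=2)
```

The while loop stops when a batch comes back empty, which is how the scraper knows it has walked past the last of the 286 posts.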
An early version of the scraper tried to filter posts by detecting poem structure — counting short lines versus long lines. This turned out to be too aggressive: the filter rejected the majority of genuine poems because they did not conform to a narrow structural definition. Removing the filter and trusting the blog's own content resulted in 286 posts being scraped, compared to just 5 with the filter in place.
4. Stage 2 — Chunking the poems
A poem stored as one long string is not ideal for retrieval. A question about a single image or a single metaphor should surface the relevant lines — not the entire poem. Chunking addresses this by breaking each poem into overlapping pieces.
The chunking strategy creates two kinds of chunks from each poem: a whole-poem chunk that preserves the complete text, and line-group chunks of three lines each, sliding through the poem. Very short fragments — fewer than two lines or fewer than eight words — are discarded. For the 286 scraped poems, this produced around 900 chunks in total.
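A minimal sketch of this strategy follows. The window step of one line is an assumption — the tutorial only says the groups overlap — and the dictionary shape is illustrative.

```python
def chunk_poem(title: str, text: str, group: int = 3, step: int = 1):
    """One whole-poem chunk plus overlapping line-group chunks.

    Fragments with fewer than two lines or fewer than eight words
    are discarded, as in the tutorial's filtering rule.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    chunks = [{"title": title, "text": "\n".join(lines)}]  # whole poem
    for i in range(0, max(len(lines) - group + 1, 1), step):
        piece = lines[i:i + group]
        words = sum(len(ln.split()) for ln in piece)
        if len(piece) < 2 or words < 8:  # drop tiny fragments
            continue
        chunks.append({"title": title, "text": "\n".join(piece)})
    return chunks
```

A five-line poem yields one whole-poem chunk plus three overlapping three-line chunks — four in total, which is roughly how 286 poems become some 900 chunks.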
5. Stage 3 — Semantic retrieval
5a. From words to vectors
Keyword search matches exact words. Semantic search matches meaning. The difference matters enormously for poetry: a question about "longing" should surface a poem about "yearning", even if the word "longing" never appears.
To enable semantic search, every chunk is converted into a vector — a list of 384 numbers — using a locally-running model called all-MiniLM-L6-v2. This model was pre-trained on hundreds of millions of sentences and learned to place similar meanings close together in this 384-dimensional space.
What are the 384 dimensions?
No human decided that dimension 47 means "celestial" or dimension 112 means "longing". The model learned these dimensions automatically during training. Each number captures some abstract aspect of meaning — the combination of all 384 numbers together is what encodes the semantic content of a phrase.
5b. Inside the embedding model
When a chunk of text enters the embedding model, it passes through six transformer layers. Each layer runs an attention mechanism: every word looks at every other word in the chunk and adjusts its own meaning based on its neighbours. After six rounds of this, each word's vector is richly contextualised. The vectors for all words in the chunk are then averaged into a single 384-vector that represents the chunk as a whole.
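The final averaging step — mean pooling — is simple enough to show directly. The two-dimensional toy vectors below stand in for the real 384-dimensional ones:

```python
def mean_pool(word_vectors):
    """Average per dimension: one vector per word in, one chunk vector out."""
    n, dim = len(word_vectors), len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / n for i in range(dim)]

# Three contextualised "word" vectors collapse into one chunk vector.
chunk_vector = mean_pool([[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]])
# chunk_vector == [2.0, 2.0]
```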
The visualisation below shows the three steps of semantic retrieval for the query "star" — tokenisation, encoding (with a 32-wide slice of the 384-vector shown as a colour heatmap), and cosine similarity ranking across all chunks.
5c. Cosine similarity
When the user types "star", the same embedding model converts that query into its own 384-vector. The retriever then computes the cosine similarity between the query vector and every chunk vector — essentially measuring the angle between them in 384-dimensional space.
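Cosine similarity itself is a few lines of arithmetic. In the real system the 384-vectors come from the embedding model and the comparison is a single NumPy matrix multiply; this pure-Python sketch with short hand-made vectors just illustrates the ranking:

```python
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top_k(query_vec, chunk_vecs, k=3):
    """Score every chunk against the query; return the best k (score, index)."""
    scores = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    return sorted(scores, reverse=True)[:k]

# Toy corpus: chunk 0 points the same way as the query, so it ranks first.
query = [1.0, 0.0]
chunks = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```

The bracketed numbers in the transcript below — 0.49, 0.47, 0.38 — are exactly these cosine scores for the top three chunks.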
Here is the actual output from the system for the query "star" across the full 286-poem corpus:
-----------------------------------------------------------------------------------------
=== Blog RAG Ready. Type 'quit' to exit. ===
Your question: star
📚 Retrieved 3 chunks:
[0.49] Let the sun hide himself — All the stars are your Diyas,Let the moon hide herself,Let t
[0.47] Like her binary Star — I know that she is far,But when You patrol,The Universe,Stop
[0.38] Stars, show the way — On this Special Day,Dear Stars so high,Let love becomeThe Gi
🤖 Generating answer...
Answer: In "Let the sun hide himself," there is a vivid portrayal of camaraderie during Diwali among soldiers in different service branches—earth-bound diyas, moonlit memories to remind us that our companions are never truly absent. The poem's imagery evokes warmth and togetherness amidst the coldness of warfare with references such as "Let the sun hide himself" representing a longing for comfort or normalcy during times apart due to service, which also touches upon deeper emotional aspects like companionship that transcend beyond just Diwali festivities.
In contrast, in “Like her binary Star,” there is an intimate connection between two individuals who wish they could bring their distant partner home for a moment of joy during the festive season and even fantasize about them riding a shooting star back to Earth with them on this auspicious night, highlighting both longing and hope.
Meanwhile, "Stars, show the way," uses metaphorical language wherein stars are seen as guides for lovers separated by distance or circumstances that prevent their meeting during Diwali—the poem invites these celestial bodies to display love's brilliance and be a source of hope. Here we see imagery suggesting guidance, the giving nature associated with festive celebrations like Diwali being extended beyond humans towards stars themselves as they are metaphorically asked by human hearts not just for light but also direction during tough times.
Each poem offers unique perspectives on camaraderie and love in connection to this significant festival, expressing emotions of togetherness, longing, hope, companionship—and the imagery used amplifies these sentiments effectively through references that resonate with cultural or universal aspects associated with Diwali.
----------------------------------------------------------------------------------------
6. Stage 4 — Generation with Ollama
Once the top three chunks are identified, they are assembled into a prompt and sent to Phi-mini — a small, efficient language model running entirely on the local machine via Ollama. No data leaves the computer. No API key is required.
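The hand-off to the model can be sketched like this. The prompt wording and the model tag (`phi3:mini`) are assumptions; only the local Ollama REST endpoint on port 11434 is standard.

```python
import json
from urllib.request import Request, urlopen

def build_prompt(question, chunks):
    """Assemble the retrieved chunks and the question into one prompt."""
    context = "\n\n".join(f"[{c['title']}]\n{c['text']}" for c in chunks)
    return ("Answer the question using only the poem excerpts below.\n\n"
            f"{context}\n\nQuestion: {question}\nAnswer:")

def ask_ollama(prompt, model="phi3:mini"):
    """POST to the local Ollama server (it must already be running)."""
    req = Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as r:
        return json.loads(r.read())["response"]
```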
6a. Next-word prediction
Language models do not compose sentences the way a human writer does. They predict the single most probable next word, given everything that has come before — the context, the retrieved chunks, and all previously generated words. This prediction is drawn from a probability distribution over the entire vocabulary of approximately 30,000 words.
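That probability distribution is produced by a softmax over a score for every word in the vocabulary. A three-word toy vocabulary shows the mechanics (the scores are made up):

```python
import math

scores = {"light": 2.1, "dark": 1.3, "banana": -3.0}  # toy "logits"

def softmax(scores):
    """Exponentiate each score and normalise so probabilities sum to 1."""
    exps = {w: math.exp(s) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

probs = softmax(scores)
# "light" is now the most probable next word; "banana" is nearly impossible.
```

The real model does this over its whole vocabulary at every single word of the answer.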
The model has no knowledge of what the correct answer should be. Its sense of what sounds right was absorbed during training, which exposed it to billions of sentences. By the time training finished, the model had internalised the statistical patterns of language deeply enough that its word-by-word predictions produce coherent, natural-sounding text.
6b. Why the answer sounds literary
When the retrieved chunks signal poetry and the prompt establishes a literary register, the probability distribution at each step shifts toward words and constructions that belong to literary analysis. Word order, grammatical structure, topic coherence, and tone all simultaneously constrain the probability at each step. The model has seen millions of essays and analyses during training — it knows what kind of words tend to follow what other words in that context.
A key distinction
The retriever finds what to talk about — the relevant passages from your specific poems. The language model decides how to talk about it — drawing on patterns learned during training. Neither can do the other's job. This division of labour is the core idea behind every RAG system.
7. Design choices and their rationale
Every component was chosen with a specific constraint in mind: 8 GB of RAM, no paid API keys, and a minimal auditable implementation.
• all-MiniLM-L6-v2 — an 80 MB embedding model that runs in under a second per query and fits comfortably in memory.
• Phi-mini via Ollama — a small but capable language model that runs fully locally, requiring no internet connection after the initial download.
• NumPy for similarity — cosine similarity over a few hundred chunks requires no vector database. A NumPy matrix multiply is sufficient and fast.
• Embedding cache — chunk vectors are saved to disk on first run and reloaded on subsequent runs, automatically invalidated if the chunk count changes.
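The cache in the last bullet can be sketched as follows; the filename and the pickle format are assumptions, while the invalidate-on-count-change rule comes from the design above.

```python
import os
import pickle

def load_or_embed(chunks, embed_fn, cache="embeddings.pkl"):
    """Reuse cached vectors unless the chunk count has changed."""
    if os.path.exists(cache):
        with open(cache, "rb") as f:
            saved = pickle.load(f)
        if len(saved) == len(chunks):  # cache still matches the corpus
            return saved
    vectors = [embed_fn(c) for c in chunks]  # expensive step, done once
    with open(cache, "wb") as f:
        pickle.dump(vectors, f)
    return vectors
```

On the first run the embedding step takes the time it takes; every run after that starts in a fraction of a second.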
8. Closing thoughts
A RAG system is not magic. Every step is traceable: the scraper fetches text, the chunker slices it, the embedding model encodes meaning as numbers, cosine similarity ranks by proximity, and the language model predicts words one at a time using patterns learned during training. Understanding each step makes it possible to diagnose failures, tune performance, and extend the system sensibly.
The poems of this blog offered an unusually interesting test case. Poetry is compressed, allusive, and context-sensitive — exactly the kind of content where semantic search outperforms keyword matching, and where a model with literary sensibility produces richer answers than a generic one. The system surfaced connections across 286 poems that would have taken hours to find by hand.
End of Tutorial
I hope the reader finds this tutorial useful and interesting.