There is a lot of excitement after the announcement of an India-developed Sanskrit LLM. I wanted to find out what the making of a Sanskrit LLM entails.
This is a longish post because what follows is the result of my research on this. I can assure you that it will be worth your while to go through it. So, here it is:
xxxxx
Me:
In Sanskrit the word order does not matter. Aham jalam
pibami, pibami jalam aham, and jalam pibami aham are all grammatically correct,
and all mean the same. Of course, the case of 'jalam' is intrinsically embedded
in the word. For example, with 'jalam' we know that it is the object. But
certain neuter-gender words like 'phalam' can be both subject as well as
object (kartā as well as karma). Given this, is it more difficult to construct
LLMs in Sanskrit? I ask because it has been recently reported that India has
made a Sanskrit LLM.
The second question that I want to ask is this: where does a
vector database kick in? In the present example, I am told it is required.
So, when exactly is it required?
Claude:
These are two beautifully interconnected questions. Let me
take them one at a time.
Question 1: Is Sanskrit harder for LLMs?
Yes and no — it is harder in specific ways, but also
surprisingly tractable in others.
The free word order challenge
You're right that Sanskrit's free word order creates a challenge
(linguists classify Sanskrit as a highly inflected, pro-drop language with
essentially free, non-configurational word order). English LLMs
benefit from the fact that word position strongly signals grammatical role —
"Dog bites man" vs. "Man bites dog" are completely
different. In Sanskrit, the case suffix does all that work, and position is
stylistic. So the model cannot rely on positional patterns the way a
Transformer's attention mechanism naturally does for English.
The kāraka ambiguity problem — your phalam example
This is genuinely hard. Neuter nouns in Sanskrit have
identical nominative (kartā/subject) and accusative (karma/object) forms. So phalam
pacati could mean "the fruit cooks" or "he cooks the
fruit" depending on context. A human uses world knowledge ("fruits
don't typically cook things") to disambiguate. LLMs must learn this too —
but the training signal is weaker because the form gives no clue. This
is a real obstacle.
Where Sanskrit is actually easier
Here's the counterintuitive part: Sanskrit's extreme
regularity — codified by Pāṇini's Ashtādhyāyī with ~4,000 sūtras — means the
morphology is almost perfectly rule-governed. There are no "irregular
verbs" in the chaotic way English or French have them. In principle, a
model that truly learns the grammar can generalize better. Researchers
have in fact proposed a Pāṇini Sūtra-based AI model that draws inspiration from
the Ashtādhyāyī precisely because current LLMs lack deep understanding of
Sanskrit's structure, leading to grammatically incorrect and logically flawed
outputs.
The data scarcity problem
The bigger practical obstacle is corpus size. India has
millions of manuscripts spanning billions of pages, but most exist only in
handwritten form or as scanned PDFs that are difficult to process — even copying text from
these PDFs often results in corrupted characters, a problem that significantly
hinders development. English LLMs train on trillions of tokens; Sanskrit
digital text is orders of magnitude smaller.
What's happening in India
A 118-year-old Sanskrit institution has partnered with IIT
Madras to build India's first native Sanskrit LLM. Separately, under the
IndiaAI Mission, the Government of India has selected Sarvam AI to build
India's sovereign LLM, collaborating with AI4Bharat at IIT Madras — designed to
be fluent in Indian languages and capable of population-scale deployment.
Question 2: When does a Vector Database become necessary?
This is a crisp architectural question. Let me explain the
spectrum:
What a plain LLM does
A standard LLM (like this conversation) holds all its
"knowledge" baked into its weights during training. When you ask it
something, it retrieves that knowledge from parametric memory — no external
lookup. This is exactly what's happening right now in our conversation. No
vector database is needed.
When a Vector Database enters
A vector DB becomes necessary when you have one or more of
these conditions:
- Your
knowledge is too large to fit in the context window. Suppose you have
10,000 pages of Sanskrit manuscripts and a user asks a question. You can't
paste all 10,000 pages into the prompt. Instead, you convert each passage
into a vector (a numerical embedding), store them in the vector DB, and at
query time you retrieve only the most semantically similar passages
— then feed those into the LLM's context. This is called RAG
(Retrieval-Augmented Generation). Research on Sanskrit scripture-based
AI has shown that RAG systems are necessary to provide specific and
faithful answers from ancient Sanskrit texts, significantly outperforming
standard LLMs in producing factual, comprehensive responses with fewer hallucinations.
- Your
knowledge changes frequently. You cannot retrain an LLM every time new
data arrives. A vector DB lets you add, update, or delete knowledge
dynamically without retraining.
- You
need precise, attributable retrieval. If a user asks "What does
verse 3.16 of the Gita say?", an LLM may hallucinate. A vector DB
retrieves the exact passage, which the LLM then reasons over.
- You
need to search a private corpus the LLM was never trained on — your
company's documents, a personal journal, a proprietary legal database,
etc.
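The retrieve-then-feed loop described above can be sketched in a few lines. This is a toy illustration only: it uses trivially small bag-of-words vectors in place of a real embedding model, and the corpus, query, and function names (`embed`, `retrieve`) are all invented for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words frequency vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A tiny stand-in corpus of passages (a real system would hold thousands).
corpus = [
    "the fruit ripens on the tree",
    "he cooks the fruit in the pot",
    "the river flows to the sea",
]
index = [(embed(p), p) for p in corpus]   # store each vector alongside its text

def retrieve(query, k=1):
    """Return the k passages most similar to the query - the vector DB's job."""
    q = embed(query)
    return [p for _, p in sorted(index, key=lambda vp: -cosine(vp[0], q))[:k]]

# Only the retrieved passage, not the whole corpus, goes into the LLM prompt.
print(retrieve("who cooks the fruit"))
```

A production system swaps `embed` for a real embedding model and `index` for a proper vector database, but the control flow stays exactly this: embed the query, rank stored vectors by similarity, feed the winners into the prompt.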
The simple summary:
| Scenario | Vector DB needed? |
|---|---|
| General Q&A from training knowledge | No |
| Chatting, reasoning, coding help | No |
| Q&A over a large private document corpus | Yes |
| Keeping knowledge up-to-date dynamically | Yes |
| Precise citation from specific texts | Yes |
In the Sanskrit LLM context specifically, a vector database
becomes almost essential for scripture Q&A — because the corpus of Vedic,
Purāṇic and philosophical texts is vast, and you want the model to cite
specific passages rather than confabulate them from parametric memory.
Me:
So, sentence-transformers does the work of the vector DB,
for example, in my blog_RAG that I created with your help earlier? Another
question: can the knowledge base from which the vector DB was created be
reconstructed from the vector DB, or is it a forward-only system? Third
question is this: can general maths questions be answered by a Sanskrit LLM? For
example, 'pi kim asti?' Will the answer be 3.1415...?
Claude:
Three great questions. Let me take them in order.
Question 1: Does sentence-transformers do the work of the
vector DB?
Not quite — they do different jobs in the pipeline.
It's worth separating the two roles clearly:
- sentence-transformers
is the embedding model — it converts your text chunks into vectors
(numerical representations). It answers: "What does this passage
mean, mathematically?"
- A
vector database (like FAISS, Chroma, Pinecone) is the storage and
search engine — it stores those vectors and, at query time, finds the
nearest ones. It answers: "Which stored vectors are closest to
this query vector?"
In your blog_RAG, you likely used something like FAISS
or an in-memory structure alongside sentence-transformers. The
sentence-transformer creates the vectors; FAISS does the similarity search.
Together they replace what a dedicated vector DB service like Pinecone would do
in a production system. So sentence-transformers alone is only half the picture
— it's the encoder, not the store.
Think of it this way: sentence-transformers is the librarian
who reads every book and writes a summary card. The vector DB is the filing
cabinet that lets you pull the most relevant cards quickly.
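The division of labor can be made concrete in code: the encoder and the store are separate, swappable components. Below is a minimal in-memory store with a deliberately crude stand-in encoder (vector = word count and vowel count); in a setup like blog_RAG, the encoder slot would be filled by sentence-transformers and the store slot by FAISS. Every name here is illustrative, not a real library API.

```python
import math

def toy_encoder(text):
    """Stand-in for sentence-transformers: maps text to a fixed-size vector.
    Here, crudely, just (word count, vowel count) - enough to show the interface."""
    return [float(len(text.split())), float(sum(c in "aeiou" for c in text.lower()))]

class InMemoryVectorStore:
    """Stand-in for FAISS/Chroma/Pinecone: stores vectors, finds nearest ones."""
    def __init__(self, encoder):
        self.encoder = encoder             # the store does not embed anything itself;
        self.vectors, self.texts = [], []  # it only stores and searches
    def add(self, text):
        self.vectors.append(self.encoder(text))
        self.texts.append(text)            # keep the original text as metadata
    def search(self, query, k=1):
        q = self.encoder(query)
        ranked = sorted(range(len(self.vectors)),
                        key=lambda i: math.dist(q, self.vectors[i]))
        return [self.texts[i] for i in ranked[:k]]

store = InMemoryVectorStore(toy_encoder)   # encoder is injected, not built in
store.add("a short note")
store.add("a considerably longer passage with many more words in it")
print(store.search("tiny query here"))
```

The design point: `toy_encoder` (the librarian) and `InMemoryVectorStore` (the filing cabinet) know nothing about each other's internals, which is exactly why you can mix sentence-transformers with FAISS, Chroma, or Pinecone interchangeably.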
Question 2: Can you reconstruct the knowledge base from
the vector DB? Is it forward-only?
It is essentially forward-only, and the reconstruction is
impossible in any meaningful sense.
Here's why. When sentence-transformers encodes a passage —
say, a paragraph from your blog — it compresses potentially hundreds of words
into a vector of, say, 384 floating point numbers. This is a lossy,
non-invertible transformation. The semantic "meaning" is
approximately preserved, but the original words are destroyed. It's analogous
to a hash function — you can go from text → hash easily, but not hash → text.
There are a couple of nuances though:
- The
original text chunks are usually stored separately alongside the
vectors (as metadata). So in practice, production systems keep the raw
text and the vectors. The vector DB (like Chroma or Pinecone)
stores both. You retrieve the vector, and it hands you back the original
passage. So the knowledge base is recoverable — but only because you
stored it explicitly, not because you derived it from the vector.
- Approximate
inversion research exists — there are adversarial ML papers showing
that with enough effort you can partially reconstruct text from
embeddings, which is actually a privacy concern in sensitive
applications. But this is far from reliable or practical.
So the design principle is: always preserve your original
corpus. The vector DB is an index, not a replacement for the source.
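The lossiness is easy to demonstrate. The two-dimensional "embedding" below is a deliberate caricature (real embeddings have hundreds of dimensions), but the principle is identical: a fixed-size vector cannot encode arbitrary text reversibly, so distinct texts can collide, and the only way back to the source is to have stored it.

```python
def crude_embed(text):
    """A deliberately crude 2-dim 'embedding': (character count, word count).
    Real embeddings (e.g. 384 dims) are far richer, but equally fixed-size."""
    return (len(text), len(text.split()))

a = crude_embed("the fruit cooks")
b = crude_embed("fruit the cooks")   # different text, identical vector
assert a == b                        # collision: exact inversion is impossible

# The fix used in practice: store the source text NEXT TO its vector.
record = {"vector": crude_embed("aham jalam pibami"), "text": "aham jalam pibami"}
print(record["text"])                # recoverable only because we kept it
```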
Question 3: Can a Sanskrit LLM answer "pi kim
asti?"
This is a delightful question, and the answer is: it
depends entirely on what the model was trained on, not on the language itself.
"Pi kim asti?" is perfectly valid Sanskrit
— kim (what is), pi (the Greek letter or mathematical constant,
borrowed). If the training corpus included mathematical texts or modern
Sanskrit writing that discusses π, the model could absolutely answer "tri-dasha-saptati..."
or give the decimal expansion.
But here's the deeper point: mathematical reasoning is
language-agnostic at the level of the underlying model weights. The same
transformer architecture that does arithmetic in English can do it in Sanskrit
— if trained on sufficient Sanskrit mathematical text. Sanskrit actually
has a strong historical tradition here: the Āryabhaṭīya, Līlāvatī,
and Brāhmasphutasiddhānta are full of mathematics expressed in Sanskrit
verse. Āryabhata computed π ≈ 3.1416 in the 5th century CE, expressed in
Sanskrit.
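Āryabhaṭa's rule (Āryabhaṭīya, Gaṇitapāda 10) states the value as a ratio: add four to one hundred, multiply by eight, and add sixty-two thousand; the result approximates the circumference of a circle of diameter twenty thousand. The arithmetic checks out:

```python
# Āryabhaṭa's rule: ((4 + 100) * 8 + 62000) approximates the circumference
# of a circle whose diameter is 20000.
circumference = (4 + 100) * 8 + 62000   # = 62832
diameter = 20000
pi_approx = circumference / diameter
print(pi_approx)                         # 3.1416
```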
The practical reality for a current Sanskrit LLM
though: the training corpus is small, and mathematical content within it is
tiny. So today's Sanskrit LLM would likely fail at this — not because Sanskrit
can't express it, but because the model hasn't seen enough Sanskrit math text
to learn it. A workaround being explored is cross-lingual transfer —
leveraging a multilingual base model that already knows mathematics in
English/Hindi, then fine-tuning it on Sanskrit, so the mathematical reasoning
ability transfers across languages.
In short: π knows no language. The constraint is data, not
grammar.
Me:
Can you please give examples of these two opposing
constructs? And what do you mean when you say: In principle, a model that truly
learns the grammar can generalize better (with the italics on 'generalize')?
Claude:
Great question — you're asking me to unpack two things: the opposing
constructs (regular Sanskrit vs. irregular English), and what "generalize
better" really means in the context of ML. Let me do both carefully.
The Two Opposing Constructs
English: Irregular, Unpredictable Morphology
In English, a model must essentially memorize each
irregular form, because no rule predicts it:
| Base | Past Tense | Why? |
|---|---|---|
| walk | walked | regular |
| run | ran | no rule — must memorize |
| go | went | completely suppletive — different root! |
| fly | flew | vowel change, unpredictable |
| cut | cut | unchanged — no signal at all |
A child (or LLM) learning English cannot say: "I
know the verb 'go', therefore I know its past tense." It must have seen went
explicitly in training data. If it hasn't, it will say "goed" — which
is actually a famous error children make, showing they over-apply the
regular rule.
For an LLM, this means: the model needs massive data
exposure to encounter all these irregular forms enough times to learn them
reliably.
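The memorizing learner can be caricatured as a lookup table with the regular "-ed" rule as its only fallback. The table contents and function name below are invented for the illustration:

```python
# A 'memorizing' learner: past tenses are a lookup table, with the regular
# "-ed" rule as the only fallback - exactly how a child produces "goed".
memorized = {"run": "ran", "go": "went", "fly": "flew", "cut": "cut"}

def english_past(verb):
    """Return the memorized form if 'seen in training', else over-apply -ed."""
    return memorized.get(verb, verb + "ed")

print(english_past("go"))     # "went" - memorized
print(english_past("walk"))   # "walked" - the regular rule happens to work
print(english_past("sing"))   # "singed" - unseen irregular: the learner fails
```

The failure on "sing" is the whole point: nothing in the table predicts "sang", so no amount of rule-following rescues an unseen irregular form.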
Sanskrit: Rule-Governed, Predictable Morphology
Sanskrit verb conjugation follows Pāṇini's system with very
high regularity. Take the root √pac (to cook):
| Person | Singular | Dual | Plural |
|---|---|---|---|
| 3rd | pacati | pacataḥ | pacanti |
| 2nd | pacasi | pacathaḥ | pacatha |
| 1st | pacāmi | pacāvaḥ | pacāmaḥ |
Now take a completely different root, √gam (to
go), same class:
| Person | Singular |
|---|---|
| 3rd | gacchati |
| 2nd | gacchasi |
| 1st | gacchāmi |
The endings are identical. The pattern is the same. A
model that has truly internalized the suffix pattern -ti / -si / -āmi can
correctly conjugate a verb root it has never seen before — simply by
applying the rule.
This is the opposing construct: Sanskrit rewards
rule-learning; English rewards memorization.
What "Generalize Better" Means
In machine learning, generalization means performing
correctly on inputs you have never seen during training.
A model that has memorized 10,000 English irregular verbs
will fail on the 10,001st if it wasn't in training data. It has learned a
lookup table, not a rule.
A model that has truly learned Sanskrit's suffix system can
encounter the rare root √vraśc (to cut) and still correctly produce vraścati
in the third person singular — because it applies the internalized rule rather
than recalling a memorized form.
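The rule-learner, by contrast, is a short function. This sketch is heavily simplified: it ignores sandhi and the non-thematic verb classes, and it assumes the present stem (paca, gaccha, vraśca) is already given, which in real Sanskrit is itself derived from the root.

```python
# A 'rule' learner for thematic present-tense endings (simplified: ignores
# sandhi and non-thematic verb classes; stems are given in present-stem form).
ENDINGS = {"3sg": "ti", "2sg": "si", "1sg": "mi"}

def conjugate(stem, person):
    """Apply the -ti / -si / -āmi pattern to any thematic stem ending in 'a'."""
    if person == "1sg":
        stem = stem[:-1] + "ā"   # the theme vowel lengthens before -mi
    return stem + ENDINGS[person]

print(conjugate("paca", "3sg"))    # pacati   - seen in training
print(conjugate("gaccha", "1sg"))  # gacchāmi
print(conjugate("vraśca", "3sg"))  # vraścati - never-seen stem, rule still works
```

The same three lines of rule handle every stem, seen or unseen; that is generalization, as opposed to the English lookup table which can only replay its training data.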
The italics on generalize were deliberate: it signals
that this is the core virtue we want in any intelligent system — the ability to
handle novelty correctly. Sanskrit's structure, in principle, makes this
easier to achieve because:
- The
signal is consistent — the same rule applies thousands of times
across the corpus
- The
model doesn't need to see every word form — it can derive
unseen forms from seen rules
- Pāṇini
essentially wrote the grammar as an explicit finite rule system —
which maps naturally onto what neural networks are trying to learn
implicitly
The caveat — and why I said "in principle"
— is that current LLMs learn statistically from data, not by being explicitly
handed Pāṇini's rules. Whether they actually learn the deep rule
structure, versus just approximating it from pattern frequency, is still an
open research question. The researchers proposing the Pāṇini Sūtra-based model
are essentially arguing: why learn the rules implicitly from data when Pāṇini
already wrote them down? Feed the rules in explicitly as an inductive
bias.
That is a genuinely interesting architectural idea.
Me:
What about coining new words — neologisms, as we call them in
English? In Sanskrit, we are told, coining new words also follows a predictable
pattern. For example, the word 'computer' was not there in Sanskrit earlier (to be
fair, it wasn't there in English either), but has now been coined as
'sanganakah'. Is there a pattern which LLMs can follow?
Claude:
What a wonderful extension of the previous thought. Yes —
and this is actually where Sanskrit becomes philosophically superior to
English for LLM-based neologism, not just comparable.
How English Coins New Words
English neologism is essentially borrowing + accident:
- Computer
— from Latin computare (to reckon). Borrowed, not derived.
- Telephone
— Greek tele (far) + phone (sound). Borrowed roots stitched
together.
- Selfie,
googling, tweet — purely cultural/arbitrary coinages. No
rule predicts them.
- Bluetooth
— named after a Viking king. Completely opaque semantically.
An LLM cannot generate a new English word for a new
concept by rule. It can only imitate the style of existing coinages,
which is a very weak generative capacity. There is no English equivalent of Pāṇini
that tells you: "given concept X, here is the word."
How Sanskrit Coins New Words — The Kṛt and Taddhita
System
Sanskrit has a systematic derivational morphology —
meaning the process of building new words from roots is itself rule-governed.
Pāṇini dedicates enormous sections of the Ashtādhyāyī to this. The key
mechanisms are:
1. Verbal roots (dhātu) + suffixes → nouns, adjectives,
agent words
The root encodes the action, the suffix encodes the role
or quality:
| Root | Meaning | Suffix | Derived Word | Meaning |
|---|---|---|---|---|
| √gam | to go | -aka | gāmaka | one who goes |
| √pac | to cook | -aka | pācaka | cook, one who cooks |
| √saṅgaṇ | to compute/reckon together | -aka | saṅgaṇaka | one who computes → computer |
Notice saṅgaṇaka (your 'sanganakah'): it comes
from sam (together) + √gaṇ (to count/reckon) + -aka (agent
suffix). The word is not arbitrary — it is semantically transparent.
Anyone who knows Sanskrit morphology can decode it instantly without
being told what it means.
2. Compound formation (Samāsa) — near-unlimited
productivity
Sanskrit's compounding rules (Tatpuruṣa, Bahuvrīhi, Dvandva,
etc.) allow arbitrarily complex concepts to be expressed as single words:
- Doordarshan
(television): dūra (far) + darśan (seeing) — "that
which enables seeing far"
- Vimāna
(aircraft): vi (through/across) + māna (that which
measures/moves) — "that which traverses"
- Antarjāla
(internet): antar (within/between) + jāla (net/web) —
"the inner web"
Each of these is a rule-following derivation, not an
arbitrary label.
What This Means for an LLM
Here is where it gets really interesting architecturally. If
an LLM truly learns Sanskrit's derivational system, it gains something
English-trained LLMs cannot have: a generative word-formation grammar.
Given a new concept — say, large language model — a
Sanskrit-trained LLM could, in principle:
- Identify
the semantic primitives: large + language + model/pattern
- Find
the appropriate roots: bṛhat (large) + bhāṣā (language) + nidarśana
(model/exemplar)
- Apply
compounding rules: bṛhadbhāṣānidarśanam
- Verify
the derivation is Pāṇini-compliant
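The pipeline above can be caricatured in a few lines. This is a gross simplification: real compounding applies sandhi (sound-joining) rules rather than bare concatenation, and the small root glossary and `coin` function are invented for the illustration.

```python
# A toy Sanskrit coinage composer. Grossly simplified: real compounding
# applies sandhi rules; here we just concatenate stems from a glossary.
ROOTS = {                     # illustrative stem glossary, not exhaustive
    "far": "dūra",
    "seeing": "darśana",
    "within": "antar",
    "net": "jāla",
}

def coin(*concepts):
    """Compose a compound from semantic primitives by stem concatenation."""
    return "".join(ROOTS[c] for c in concepts)

print(coin("far", "seeing"))   # dūradarśana - 'television'
print(coin("within", "net"))   # antarjāla   - 'internet'
```

Even this toy version shows the contrast with English: the output word is decodable from its parts, because the generation process itself is compositional.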
This is not speculation — the Samskrita Bharati
organisation and various Sanskrit computational linguistics groups have been
doing exactly this manually for decades, coining words like:
- Saṅgaṇaka
— computer
- Antarjāla
— internet
- Dūradarśinī
— mobile phone (far-seeing device)
- Cālakavartiḥ
— battery (moving wick/power)
The LLM advantage: if the model has internalized Pāṇini's
kṛt and taddhita suffix rules, it can automate this coinage process. You
describe a concept in Sanskrit, the model identifies the appropriate roots and
suffixes, and produces a new word that is:
- Semantically
transparent (decodable from its parts)
- Grammatically
valid (Pāṇini-compliant)
- Culturally
rooted (uses native Sanskrit roots, not borrowed ones)
The Deeper Point
English neologism is opaque and arbitrary — laser
is an acronym, quark was taken from James Joyce, meme was coined
by Dawkins from Greek. Future speakers cannot decode these from first
principles.
Sanskrit neologism is transparent and rule-governed —
a speaker a thousand years from now could encounter saṅgaṇaka and derive
its meaning from morphology alone, even without a dictionary.
This is precisely why some computational linguists argue
Sanskrit is not merely a historical curiosity but potentially a superior
substrate for knowledge representation in AI systems — the structure of the
language itself encodes meaning compositionally, which is exactly what vector
embeddings are trying to approximate statistically. Sanskrit does it symbolically
and explicitly, at the level of the word itself.