Abjad Script Transliteration¶

translit provides two transliteration modes for abjad scripts — Arabic, Persian (Farsi), and Hebrew — where standard writing omits most vowels.

The problem with abjad scripts¶

Arabic, Persian, and Hebrew are written in abjad scripts: the alphabet primarily represents consonants. Short vowels are either omitted entirely or indicated by optional diacritical marks (Arabic tashkeel, Hebrew niqqud) that most published text does not include.

This means a single written word can represent multiple spoken words:

Arabic	Consonant skeleton	Possible readings
كتب	k-t-b	kataba (he wrote), kutub (books), kutiba (was written), kattaba (he made someone write)
درس	d-r-s	dars (lesson), darrasa (he taught), darasa (he studied)
علم	ʿ-l-m	ʿilm (knowledge), ʿalam (flag), ʿallama (he taught)

Standard character-by-character transliteration — the approach used by Unidecode, anyascii, and translit's default mode — can only produce the consonant skeleton: ktb, drs, 'lm. This is unreadable to anyone who doesn't already know the word.
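
The collapse from several spoken words to one written skeleton can be reproduced with the standard library alone. The sketch below is illustrative and not part of translit: the `skeleton` helper is a hypothetical name, and it simply drops Unicode combining marks (category `Mn`), which is how tashkeel is encoded.

```python
import unicodedata

def skeleton(word: str) -> str:
    """Drop combining marks (tashkeel, niqqud), keeping only base letters."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")

# Vowel marks and letters written as escapes so each mark is explicit.
FATHA, DAMMA, KASRA = "\u064E", "\u064F", "\u0650"
K, T, B = "\u0643", "\u062A", "\u0628"  # kaf, ta, ba

kataba = K + FATHA + T + FATHA + B + FATHA  # "he wrote"
kutub = K + DAMMA + T + DAMMA + B           # "books"
kutiba = K + DAMMA + T + KASRA + B + FATHA  # "was written"

# All three vocalized spellings reduce to the same three-letter skeleton.
skeletons = {skeleton(w) for w in (kataba, kutub, kutiba)}
print(len(skeletons))
```

Because unpointed text carries only the skeleton, a per-character mapping has nothing to distinguish these three words.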

Two modes¶

Context-free (default)¶

transliterate("كتب العربية")              # → "ktb al'rbyh"
transliterate("שלום", lang="he")           # → "shlvm"
transliterate("کتاب فارسی", lang="fa")     # → "ktab farsy"

This is the same approach used by other character-by-character libraries. Each character maps to a fixed ASCII equivalent via a lookup table, with no context, no dictionary, and no ambiguity resolution. It is fast (O(1) per character) and deterministic, and it yields the same kind of consonant-skeleton output as Unidecode for these scripts.

When to use: Machine processing where human readability is not required (search indexing, deduplication, database keys).
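
As a rough illustration of the lookup-table idea, here is a minimal sketch. The table is a hypothetical excerpt, `context_free` is an invented name, and passing unmapped characters through unchanged is an assumption of this sketch rather than documented translit behavior.

```python
# Hypothetical excerpt of a per-character table; a real table is far larger.
ARABIC_TABLE = {
    "ك": "k",
    "ت": "t",
    "ب": "b",
    "د": "d",
    "ر": "r",
    "س": "s",
    "ث": "th",  # digraph
    "ش": "sh",  # digraph
}

def context_free(text: str, table: dict[str, str]) -> str:
    """Map each character independently; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)

print(context_free("كتب", ARABIC_TABLE))  # ktb
print(context_free("درس", ARABIC_TABLE))  # drs
```

Each character is looked up once and no neighboring character is consulted, which is why the result is deterministic but contains no vowels.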

Context-aware (`context=True`)¶

transliterate("كتب العربية", context=True)              # → "kataba al'arabiyahi"
transliterate("שלום", lang="he", context=True)           # → "shalvom"
transliterate("کتاب فارسی", lang="fa", context=True)     # → "ketab farsy"

This mode uses a dictionary-based vowel restoration system to recover the missing vowels before transliterating. The result is readable romanized text rather than a consonant skeleton.

When to use: Any application where a human will read the output — display, NLP preprocessing, content moderation, transliteration for non-native readers.

Requires: Context dictionaries installed separately:

pip install "translit-rs[arabic]"   # Arabic + Persian
pip install "translit-rs[hebrew]"   # Hebrew
pip install "translit-rs[context]"  # All context dictionaries

How context-aware transliteration works¶

Architecture¶

The system uses a three-tier fallback for each word:

1. Bigram lookup: check whether the combination of the previous word and the current word (both as consonant skeletons) has a known best reading. This resolves ambiguity using context. For example, after the Arabic article ال, the word كتب is more likely to be kutub (books) than kataba (he wrote).
2. Unigram lookup: if there is no bigram match, look up the current word's skeleton in a frequency-ranked dictionary and select the most common reading.
3. Context-free fallback: if the word is not in the dictionary at all, use the existing character-by-character transliteration, so the output is never worse than the default mode.
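
The three tiers above can be sketched as a short loop. This is a simplified model, not translit's implementation: the names `BIGRAMS`, `UNIGRAMS`, and `restore` are hypothetical, the entries are toy data, and skeletons are shown already romanized to keep the example readable.

```python
# Toy data; not taken from the shipped dictionaries.
# (previous skeleton, current skeleton) -> best reading of the current word
BIGRAMS = {("fy", "ktb"): "kutub"}
# skeleton -> most frequent reading
UNIGRAMS = {"ktb": "kataba", "drs": "dars"}

def restore(skeletons: list[str]) -> list[str]:
    """Resolve each skeleton via bigram, then unigram, then fallback."""
    out = []
    prev = None
    for cur in skeletons:
        if prev is not None and (prev, cur) in BIGRAMS:
            reading = BIGRAMS[(prev, cur)]  # tier 1: context decides
        elif cur in UNIGRAMS:
            reading = UNIGRAMS[cur]         # tier 2: most common reading
        else:
            reading = cur                   # tier 3: keep the skeleton
        out.append(reading)
        prev = cur
    return out

print(restore(["ktb"]))        # ['kataba']
print(restore(["fy", "ktb"]))  # ['fy', 'kutub']
print(restore(["xyz"]))        # ['xyz']
```

Since the last tier returns the skeleton unchanged, a word missing from both tables comes out exactly as the default mode would produce it.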

Dictionary sources¶

Language	Source corpus	Size	License
Arabic	Tashkeela — 65.7M diacritized words from 97 books	182K unigrams, 200K bigrams	CC-BY
Hebrew	Project Ben Yehuda — 11.4M niqqud-pointed words from 26K literary texts	227K unigrams, 200K bigrams	Public domain
Persian	Curated vocabulary — 266 common words with diacritics applied per BGN/PCGN 1958	257 unigrams	Hand-curated

Dictionaries are built reproducibly from source corpora via scripts/bootstrap_dicts.sh. All parameters and expected checksums are pinned. See Building dictionaries below.
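
To make the idea of a frequency-ranked unigram table concrete, the sketch below derives one from diacritized tokens. It is an assumption-laden illustration: `skeleton`, `build_unigrams`, and the `min_freq` parameter are hypothetical names, and the actual build script may tokenize, normalize, and threshold differently.

```python
import unicodedata
from collections import Counter, defaultdict

def skeleton(word: str) -> str:
    """Strip combining marks to get the unpointed lookup key."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")

def build_unigrams(tokens: list[str], min_freq: int = 1) -> dict[str, str]:
    """Map each skeleton to its most frequent diacritized form."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for tok in tokens:
        counts[skeleton(tok)][tok] += 1
    table = {}
    for skel, forms in counts.items():
        form, freq = forms.most_common(1)[0]
        if freq >= min_freq:  # drop readings seen too rarely
            table[skel] = form
    return table

FATHA, DAMMA = "\u064E", "\u064F"
K, T, B = "\u0643", "\u062A", "\u0628"
kataba = K + FATHA + T + FATHA + B + FATHA
kutub = K + DAMMA + T + DAMMA + B

# kataba occurs three times and kutub once, so kataba wins for this skeleton.
table = build_unigrams([kataba, kutub, kataba, kataba])
print(table[K + T + B] == kataba)  # True
```

A bigram table can be built the same way by counting pairs of adjacent skeletons instead of single ones.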

Arabic¶

Standard used¶

BGN/PCGN Arabic romanization (1956) for consonant mappings. This is the system used by the US Board on Geographic Names and the UK Permanent Committee on Geographical Names. It uses digraphs for consonants that have no single-letter Latin equivalent: ث→th, خ→kh, ذ→dh, ش→sh, غ→gh.

How it differs from other systems¶

Feature	translit (context-free)	translit (context-aware)	Buckwalter	ALA-LC / Library of Congress
Vowels	Omitted (consonant skeleton)	Restored from dictionary	Omitted	Required in source
Emphatics	Merged with plain (ص→s, ط→t)	Same	Distinct single chars (S, T)	Underdots (ṣ, ṭ)
Shadda (gemination)	Dropped	Preserved via diacritized form	`~`	Doubled consonant
Output charset	ASCII	ASCII	ASCII	Requires diacritics
Context needed	No	Yes (dictionary)	No	Yes (human judgment)
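
The shadda row is easier to follow with a small example of how a diacritized form can carry gemination into ASCII. This is a sketch with a toy letter subset; `romanize_pointed` is a hypothetical helper, not translit's code, and it accepts the shadda before or after the vowel mark because Unicode normalization can reorder them.

```python
SHADDA = "\u0651"
VOWELS = {"\u064E": "a", "\u064F": "u", "\u0650": "i"}  # fatha, damma, kasra
CONSONANTS = {"\u062F": "d", "\u0631": "r", "\u0633": "s"}  # dal, ra, sin

def romanize_pointed(word: str) -> str:
    """Romanize a diacritized word, doubling consonants that carry a shadda."""
    out = []
    i = 0
    while i < len(word):
        base = CONSONANTS.get(word[i], word[i])
        i += 1
        # Collect every mark attached to this base letter, in any order.
        marks = []
        while i < len(word) and (word[i] == SHADDA or word[i] in VOWELS):
            marks.append(word[i])
            i += 1
        if SHADDA in marks:
            base += base  # gemination becomes a doubled consonant
        vowel = "".join(VOWELS[m] for m in marks if m in VOWELS)
        out.append(base + vowel)
    return "".join(out)

# dal+fatha, ra+shadda+fatha, sin+fatha
darrasa = "\u062F\u064E\u0631\u0651\u064E\u0633\u064E"
print(romanize_pointed(darrasa))  # darrasa
```

Without the shadda the same letters romanize as "darasa", so the doubled consonant is the only ASCII trace of the gemination.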

Context-aware accuracy¶

The Arabic dictionary covers 99%+ of newspaper vocabulary. The bigram table resolves the most common ambiguities:

# Without context
transliterate("السلام عليكم")        # → "alslam 'lykm"

# With context — vowels restored, readable
transliterate("السلام عليكم", context=True)  # → "alsalaamu 'alaykum"

What it cannot do¶

Recover vowels not in the dictionary: Rare proper nouns, neologisms, and code-mixed text will fall back to consonant skeletons.
Sentence-level disambiguation: The bigram model captures adjacent-word context but not full sentence meaning. For كتب after a subject pronoun (he wrote) vs after an article (the books), bigrams usually resolve correctly, but complex sentences may not.
Dialect variation: The dictionary is built from the Tashkeela corpus, which consists of Classical and Modern Standard Arabic (MSA) texts. Dialectal Arabic (Egyptian, Gulf, Levantine) uses different vowel patterns that are not covered.

Persian (Farsi)¶

Standard used¶

BGN/PCGN Persian romanization (1958, updated 2019). Persian shares the Arabic script but differs in four key ways:

Four extra letters: پ (p), چ (ch), ژ (zh), گ (g) — sounds that don't exist in Arabic.
Different vowel system: Persian has 6 vowels — three short (/æ, e, o/) and three long (/ɒː, iː, uː/). The critical difference from Arabic: Persian kasra = e (not i), Persian damma = o (not u).
Waw is v, not w: و is pronounced /v/ in Persian (consonant position), not /w/ as in Arabic.
The ezafe: A connecting vowel (-e after consonants, -ye after vowels) links nouns to their modifiers. Written as a kasra or with ه‌ی but often unmarked.

How translit handles Persian¶

The lang="fa" profile overrides 51 character mappings from the Arabic default. Representative overrides:

Character	Arabic default	Persian override	Reason
ث (thā)	th	s	Persian pronunciation
ذ (dhāl)	dh	z	Persian pronunciation
ض (ḍād)	d	z	Persian pronunciation
و (wāw)	w	v	Persian pronunciation
kasra (ِ)	i	e	Persian 6-vowel system
damma (ُ)	u	o	Persian 6-vowel system
tāʾ marbūṭa	h	e	Persian feminine ending
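
One way to picture a language profile is as a small override table layered on top of the Arabic defaults. The sketch below uses only mappings from the table above plus ب→b; the dictionary names are hypothetical and this is not necessarily how translit stores its profiles.

```python
# Subset of the Arabic defaults listed in the table above.
ARABIC_DEFAULT = {
    "ب": "b",
    "ث": "th",
    "ذ": "dh",
    "ض": "d",
    "و": "w",
    "\u0650": "i",  # kasra
    "\u064F": "u",  # damma
}

# Persian entries replace the defaults only where pronunciation differs.
PERSIAN_OVERRIDES = {
    "ث": "s",
    "ذ": "z",
    "ض": "z",
    "و": "v",
    "\u0650": "e",  # kasra
    "\u064F": "o",  # damma
}

# Later keys win, so overrides shadow defaults and everything else is inherited.
PERSIAN_TABLE = {**ARABIC_DEFAULT, **PERSIAN_OVERRIDES}

print(PERSIAN_TABLE["و"])  # v
print(PERSIAN_TABLE["ب"])  # b
```

Keeping the overrides separate from the base table means a profile only has to list the characters whose pronunciation differs.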

Context-aware Persian¶

Unlike Arabic and Hebrew, Persian has no large diacritized corpus, because Persian rarely uses diacritics even in formal text. translit addresses this with a curated vocabulary of 266 common words with diacritics applied following BGN/PCGN pronunciation rules:

# Without context
transliterate("کتاب فارسی", lang="fa")              # → "ktab farsy"

# With context — vowels from curated dictionary
transliterate("کتاب فارسی", lang="fa", context=True) # → "ketab farsy"

For words not in the curated vocabulary, the system falls back to the Arabic context dictionary. Since approximately 40% of Persian vocabulary is Arabic-origin, many loanwords benefit from the Arabic dictionary automatically.
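
The fallback order described above (curated Persian entries first, then the Arabic dictionary, then the bare skeleton) can be modeled with `collections.ChainMap`. The entries are toy data shown in romanized form, and `restore_fa` is a hypothetical name; this is a sketch of the lookup order only, not of translit's internals.

```python
from collections import ChainMap

# Toy entries; not taken from the shipped dictionaries.
PERSIAN_CURATED = {"ktab": "ketab"}
ARABIC_UNIGRAMS = {"ktab": "kitab", "slam": "salam"}

# ChainMap searches its maps left to right, so Persian entries win.
lookup = ChainMap(PERSIAN_CURATED, ARABIC_UNIGRAMS)

def restore_fa(skel: str) -> str:
    """Return the first reading found, or the skeleton itself if none exists."""
    return lookup.get(skel, skel)

print(restore_fa("ktab"))  # ketab  (curated Persian entry)
print(restore_fa("slam"))  # salam  (found only in the Arabic table)
print(restore_fa("xyz"))   # xyz    (context-free fallback)
```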

Limitations specific to Persian¶

Smaller dictionary: 266 curated entries vs Arabic's 182K corpus-derived entries. Common words are covered; rare words fall back to context-free.
No ezafe prediction: The ezafe construction (-e/-ye connecting nouns to adjectives/possessors) is not predicted. It would require syntactic analysis beyond dictionary lookup.
Waw ambiguity: و serves as both consonant (/v/) and vowel (/o, u/). The lang="fa" override maps it to v; the context dictionary provides the correct vowel form for known words.

Hebrew¶

Standard used¶

The default Hebrew mappings follow common Israeli romanization conventions. Hebrew has the same fundamental abjad challenge as Arabic: a consonantal alphabet with optional niqqud (vowel points), which most text omits.

How context-aware Hebrew works¶

The Hebrew dictionary is built from Project Ben Yehuda, a public domain collection of 26,000+ Hebrew literary works with niqqud. The dictionary maps unpointed consonant skeletons to their most common niqqud-pointed forms:

# Without context
transliterate("שלום", lang="he")              # → "shlvm"

# With context — niqqud restored from dictionary
transliterate("שלום", lang="he", context=True) # → "shalvom"

Differences from Arabic¶

Feature	Arabic	Hebrew
Vowel marks	Tashkeel (fatha, kasra, damma, etc.)	Niqqud (patach, segol, hiriq, etc.)
Gemination	Shadda (ّ)	Dagesh (ּ)
Dictionary size	182K unigrams (65.7M-word corpus)	227K unigrams (11.4M-word corpus)
Ambiguity level	High (many homographs)	Moderate (fewer morphological patterns)

Limitations specific to Hebrew¶

Literary bias: The Ben Yehuda corpus is predominantly literary (19th-20th century). Modern Hebrew slang, technical terms, and recent loanwords may not be covered.
No morphological analysis: Hebrew verbs follow predictable root+pattern templates (binyanim) that could theoretically be used to predict vowels for unknown words. The current system does not exploit this — it relies purely on dictionary lookup.

Building dictionaries¶

All dictionaries are built reproducibly from source corpora:

# Build all dictionaries from scratch (downloads corpora, builds, verifies checksums)
bash scripts/bootstrap_dicts.sh all

# Build individually
bash scripts/bootstrap_dicts.sh arabic    # Tashkeela corpus → arabic_dict.bin
bash scripts/bootstrap_dicts.sh persian   # Curated vocab → persian_dict.bin
bash scripts/bootstrap_dicts.sh hebrew    # Ben Yehuda → hebrew_dict.bin

# Verify existing dictionaries match expected checksums
bash scripts/bootstrap_dicts.sh verify

The bootstrap script pins all parameters (corpus source, min-frequency threshold, max bigram count) and expected output checksums. Changing any parameter requires updating the checksum — making all dictionary changes visible and auditable.
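
Checksum pinning of this kind can be illustrated in a few lines. The sketch assumes SHA-256 and uses hypothetical helper names (`sha256_of`, `verify`); the source does not state which hash or file layout scripts/bootstrap_dicts.sh actually uses.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large dictionaries need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, expected: str) -> bool:
    """Return True when the built artifact matches its pinned checksum."""
    return sha256_of(path) == expected
```

If a build parameter changes, the output bytes change, `verify` returns False, and the pinned value has to be updated in the same commit, which is what makes the change visible in review.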

How translit differs from other approaches¶

Approach	Used by	Strengths	Weaknesses
Character-by-character	Unidecode, anyascii, translit (default)	Fast, deterministic, no data dependency	Consonant skeletons for abjad scripts
Dictionary + bigram	translit (context=True)	Readable output, no ML dependency, fast	Dictionary size, no sentence-level context
Neural diacritization	libtashkeel, Rababa	Handles unknown words, sentence context	Requires ONNX runtime (~15MB+), slower, non-deterministic
Rule-based morphology	Buckwalter Analyzer, MADAMIRA, Mishkal	Linguistically precise	Complex, language-specific, slow
Human transcription	ALA-LC, scholarly publications	Perfect accuracy	Not automatable

translit's dictionary+bigram approach occupies the middle ground: substantially better than character-by-character for human-readable output, without the weight and complexity of neural or morphological systems. The three-tier fallback ensures graceful degradation — the output is never worse than the default mode.
