Abjad Script Transliteration¶
translit provides two transliteration modes for abjad scripts — Arabic, Persian (Farsi), and Hebrew — where standard writing omits most vowels.
The problem with abjad scripts¶
Arabic, Persian, and Hebrew are written in abjad scripts: the alphabet primarily represents consonants. Short vowels are either omitted entirely or indicated by optional diacritical marks (Arabic tashkeel, Hebrew niqqud) that most published text does not include.
This means a single written word can represent multiple spoken words:
| Arabic | Consonant skeleton | Possible readings |
|---|---|---|
| كتب | k-t-b | kataba (he wrote), kutub (books), kutiba (was written), kattaba (he made someone write) |
| درس | d-r-s | dars (lesson), darasa (he studied), darrasa (he taught) |
| علم | ʿ-l-m | ʿilm (knowledge), ʿalam (flag), ʿallama (he taught) |
Standard character-by-character transliteration — the approach used by Unidecode, anyascii, and translit's default mode — can only produce the consonant skeleton: ktb, drs, 'lm. This is unreadable to anyone who doesn't already know the word.
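To make the one-to-many relationship concrete, the sketch below groups a few romanized readings from the table above by their consonant skeleton. The `skeleton` helper and the ASCII spellings (an apostrophe standing in for ʿ) are illustrative assumptions, not part of translit.

```python
from collections import defaultdict

# Romanized readings taken from the table above (ASCII apostrophe stands in for ʿ).
READINGS = ["kataba", "kutub", "kutiba", "dars", "darrasa", "'ilm", "'alam", "'allama"]

SHORT_VOWELS = set("aiu")


def skeleton(reading: str) -> str:
    """Drop short vowels and collapse doubled consonants (hypothetical helper)."""
    out = []
    prev = ""
    for ch in reading:
        # Skip vowels, and skip the second half of a geminated consonant.
        if ch not in SHORT_VOWELS and ch != prev:
            out.append(ch)
        prev = ch
    return "".join(out)


# Group every reading under the skeleton that unvocalized text would show.
by_skeleton = defaultdict(list)
for reading in READINGS:
    by_skeleton[skeleton(reading)].append(reading)

for skel, readings in by_skeleton.items():
    print(skel, "->", readings)
```

Each skeleton ends up with several candidate readings, which is exactly the information a character-by-character mapping cannot recover.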
Two modes¶
Context-free (default)¶
transliterate("كتب العربية") # → "ktb al'rbyh"
transliterate("שלום", lang="he") # → "shlvm"
transliterate("کتاب فارسی", lang="fa") # → "ktab farsy"
This is the same approach that most general-purpose transliteration libraries take. Each character maps to a fixed ASCII equivalent via a lookup table, with no context, no dictionary, and no ambiguity resolution. It is fast (O(1) per character), deterministic, and produces the same output as Unidecode for these scripts.
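A table-driven mapping of this kind can be sketched in a few lines. The table below is a hypothetical excerpt chosen to cover the Arabic example above; translit's actual table and internals are not shown in this document.

```python
# Hypothetical excerpt of a character table; the real table is much larger.
ARABIC_TO_ASCII = {
    "ك": "k", "ت": "t", "ب": "b", "ا": "a", "ل": "l",
    "ع": "'", "ر": "r", "ي": "y", "ة": "h",
}

# str.maketrans turns the dict into a per-character translation table.
_TABLE = str.maketrans(ARABIC_TO_ASCII)


def context_free(text: str) -> str:
    """Map each character independently; unmapped characters pass through."""
    return text.translate(_TABLE)


print(context_free("كتب العربية"))
```

Because each character is looked up independently, the result depends only on the input string, which is what makes this mode deterministic and cheap.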
When to use: Machine processing where human readability is not required (search indexing, deduplication, database keys).
Context-aware (context=True)¶
transliterate("كتب العربية", context=True) # → "kataba al'arabiyahi"
transliterate("שלום", lang="he", context=True) # → "shalvom"
transliterate("کتاب فارسی", lang="fa", context=True) # → "ketab farsy"
This mode uses a dictionary-based vowel restoration system to recover the missing vowels before transliterating. The result is readable romanized text rather than a consonant skeleton.
When to use: Any application where a human will read the output — display, NLP preprocessing, content moderation, transliteration for non-native readers.
Requires: Context dictionaries installed separately:
pip install "translit-rs[arabic]" # Arabic + Persian
pip install "translit-rs[hebrew]" # Hebrew
pip install "translit-rs[context]" # All context dictionaries (quotes keep zsh from expanding the brackets)
How context-aware transliteration works¶
Architecture¶
The system uses a three-tier fallback for each word:
- Bigram lookup: check whether the combination of the previous word and the current word (both as consonant skeletons) has a known best reading. This resolves ambiguity using context: for example, after the Arabic article ال, the word كتب is more likely to be kutub (books) than kataba (he wrote).
- Unigram lookup: if no bigram matches, look up the current word's skeleton in a frequency-ranked dictionary. The most common reading is selected.
- Context-free fallback: if the word is not in the dictionary at all, the existing character-by-character transliteration is used, so an unknown word gets exactly the default-mode output.
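The three tiers can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy ASCII skeletons, and the dictionary contents are all hypothetical, and the real implementation may store and tokenize its data differently.

```python
from typing import Callable, Dict, List, Optional, Tuple

Bigrams = Dict[Tuple[str, str], str]
Unigrams = Dict[str, str]


def restore_word(
    prev: Optional[str],
    word: str,
    bigrams: Bigrams,
    unigrams: Unigrams,
    fallback: Callable[[str], str],
) -> str:
    """Pick a reading for one skeleton using the three-tier order."""
    # Tier 1: previous skeleton + current skeleton.
    if prev is not None and (prev, word) in bigrams:
        return bigrams[(prev, word)]
    # Tier 2: most frequent reading of the current skeleton alone.
    if word in unigrams:
        return unigrams[word]
    # Tier 3: character-by-character output.
    return fallback(word)


def restore_text(
    words: List[str],
    bigrams: Bigrams,
    unigrams: Unigrams,
    fallback: Callable[[str], str],
) -> List[str]:
    out = []
    prev = None
    for word in words:
        out.append(restore_word(prev, word, bigrams, unigrams, fallback))
        prev = word  # context is the previous skeleton, not its restored form
    return out


# Toy data with ASCII skeletons standing in for unvocalized Arabic.
UNIGRAMS = {"ktb": "kataba", "drs": "dars"}
BIGRAMS = {("hdhh", "ktb"): "kutub"}


def identity(skeleton: str) -> str:
    """Stand-in for the context-free transliterator."""
    return skeleton


print(restore_text(["ktb"], BIGRAMS, UNIGRAMS, identity))
print(restore_text(["hdhh", "ktb", "xyz"], BIGRAMS, UNIGRAMS, identity))
```

In the second call, the bigram entry overrides the unigram default for `ktb`, while the unknown `xyz` passes through unchanged.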
Dictionary sources¶
| Language | Source corpus | Size | License |
|---|---|---|---|
| Arabic | Tashkeela — 65.7M diacritized words from 97 books | 182K unigrams, 200K bigrams | CC-BY |
| Hebrew | Project Ben Yehuda — 11.4M niqqud-pointed words from 26K literary texts | 227K unigrams, 200K bigrams | Public domain |
| Persian | Curated vocabulary — 266 common words with diacritics applied per BGN/PCGN 1958 | 257 unigrams | Hand-curated |
Dictionaries are built reproducibly from source corpora via scripts/bootstrap_dicts.sh. All parameters and expected checksums are pinned. See Building dictionaries below.
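As a rough idea of what deriving a unigram table from a diacritized corpus involves, the sketch below strips the combining marks from each word to get its skeleton, counts the diacritized forms seen for each skeleton, and keeps the most frequent one. The helper names and the `min_freq` parameter are assumptions for illustration; the actual build script may work differently.

```python
import unicodedata
from collections import Counter, defaultdict
from typing import Dict, Iterable


def strip_marks(word: str) -> str:
    """Remove nonspacing marks (Unicode category Mn), which include Arabic tashkeel."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


def build_unigrams(diacritized_words: Iterable[str], min_freq: int = 1) -> Dict[str, str]:
    """Map each skeleton to its most frequent diacritized form."""
    counts = defaultdict(Counter)
    for word in diacritized_words:
        counts[strip_marks(word)][word] += 1
    best = {}
    for skel, forms in counts.items():
        form, freq = forms.most_common(1)[0]
        if freq >= min_freq:  # drop skeletons whose best form is too rare
            best[skel] = form
    return best


# Toy corpus written with escapes so the marks are unambiguous.
KATABA = "\u0643\u064e\u062a\u064e\u0628\u064e"  # kaf+fatha, ta+fatha, ba+fatha
KUTUB = "\u0643\u064f\u062a\u064f\u0628"  # kaf+damma, ta+damma, ba

unigrams = build_unigrams([KATABA, KUTUB, KATABA])
print(unigrams[strip_marks(KUTUB)] == KATABA)
```

With this toy corpus, the skeleton shared by both words maps to the form that occurs twice.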
Arabic¶
Standard used¶
BGN/PCGN Arabic romanization (1956) for consonant mappings. This is the system used by the US Board on Geographic Names and the UK Permanent Committee on Geographical Names. It uses digraphs for several fricatives: ث→th, خ→kh, ذ→dh, ش→sh, غ→gh.
How it differs from other systems¶
| Feature | translit (context-free) | translit (context-aware) | Buckwalter | ALA-LC / Library of Congress |
|---|---|---|---|---|
| Vowels | Omitted (consonant skeleton) | Restored from dictionary | Kept only if marked in source | Required in source |
| Emphatics | Merged with plain (ص→s, ط→t) | Same | Distinct single chars (S, T) | Underdots (ṣ, ṭ) |
| Shadda (gemination) | Dropped | Preserved via diacritized form | ~ | Doubled consonant |
| Output charset | ASCII | ASCII | ASCII | Requires diacritics |
| Context needed | No | Yes (dictionary) | No | Yes (human judgment) |
Context-aware accuracy¶
The Arabic dictionary covers 99%+ of newspaper vocabulary. The bigram table resolves the most common ambiguities:
# Without context
transliterate("السلام عليكم") # → "alslam 'lykm"
# With context — vowels restored, readable
transliterate("السلام عليكم", context=True) # → "alsalaamu 'alaykum"
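The coverage figure above is not defined further in this document. One common way to measure dictionary coverage on your own text is to count how many tokens (or distinct word types) have an entry. The sketch below uses toy ASCII skeletons and hypothetical function names; it does not call translit.

```python
from typing import Collection, Sequence


def token_coverage(tokens: Sequence[str], dictionary: Collection[str]) -> float:
    """Share of running words that have a dictionary entry."""
    if not tokens:
        return 0.0
    return sum(token in dictionary for token in tokens) / len(tokens)


def type_coverage(tokens: Sequence[str], dictionary: Collection[str]) -> float:
    """Share of distinct words that have a dictionary entry."""
    types = set(tokens)
    if not types:
        return 0.0
    return len(types & set(dictionary)) / len(types)


sample = ["ktb", "drs", "xyz", "ktb"]
known = {"ktb", "drs"}
print(token_coverage(sample, known))
print(type_coverage(sample, known))
```

Token coverage is usually the higher of the two, because frequent words are the ones most likely to be in the dictionary.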
What it cannot do¶
- Recover vowels not in the dictionary: Rare proper nouns, neologisms, and code-mixed text will fall back to consonant skeletons.
- Sentence-level disambiguation: The bigram model captures adjacent-word context but not full sentence meaning. For كتب after a subject pronoun (he wrote) vs after an article (the books), bigrams usually pick the correct reading, but in complex sentences the deciding word may be more than one position away.
- Dialect variation: The dictionary is built from Modern Standard Arabic (MSA) sources. Dialectal Arabic (Egyptian, Gulf, Levantine) uses different vowel patterns that are not covered.
Persian (Farsi)¶
Standard used¶
BGN/PCGN Persian romanization (1958, updated 2019). Persian shares the Arabic script but differs in four key ways:
- Four extra letters: پ (p), چ (ch), ژ (zh), گ (g) — sounds that don't exist in Arabic.
- Different vowel system: Persian has 6 vowels — three short (/æ, e, o/) and three long (/ɒː, iː, uː/). The critical difference from Arabic: Persian kasra = e (not i), Persian damma = o (not u).
- Waw is v, not w: و is pronounced /v/ in Persian (consonant position), not /w/ as in Arabic.
- The ezafe: A connecting vowel (-e after consonants, -ye after vowels) links nouns to their modifiers. It can be written with a kasra after a consonant, or with ی (or ۀ after a silent final ه) after a vowel, but it is usually left unmarked.
How translit handles Persian¶
The lang="fa" profile overrides 51 character mappings from the Arabic default. The most visible ones:
| Character | Arabic default | Persian override | Reason |
|---|---|---|---|
| ث (thā) | th | s | Persian pronunciation |
| ذ (dhāl) | dh | z | Persian pronunciation |
| ض (ḍād) | d | z | Persian pronunciation |
| و (wāw) | w | v | Persian pronunciation |
| kasra (ِ) | i | e | Persian 6-vowel system |
| damma (ُ) | u | o | Persian 6-vowel system |
| tāʾ marbūṭa | h | e | Persian feminine ending |
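One simple way to model a language profile is as a small override dictionary layered on top of the default table. The sketch below uses values from the table above plus one unchanged letter (ب→b, an assumption for illustration); how translit actually stores its profiles is not described here.

```python
# Arabic defaults: a hypothetical excerpt, not the full table.
ARABIC_DEFAULT = {
    "ب": "b",
    "ث": "th",
    "ذ": "dh",
    "ض": "d",
    "و": "w",
    "\u0650": "i",  # kasra
    "\u064f": "u",  # damma
}

# Persian overrides taken from the table above.
PERSIAN_OVERRIDES = {
    "ث": "s",
    "ذ": "z",
    "ض": "z",
    "و": "v",
    "\u0650": "e",  # kasra
    "\u064f": "o",  # damma
}

# Later entries win, so overrides replace defaults and everything else is inherited.
PERSIAN_TABLE = {**ARABIC_DEFAULT, **PERSIAN_OVERRIDES}

for char, arabic in ARABIC_DEFAULT.items():
    print(repr(char), arabic, "->", PERSIAN_TABLE[char])
```

Keeping the overrides separate makes the differences between the two profiles easy to audit.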
Context-aware Persian¶
Unlike Arabic and Hebrew, Persian has no large diacritized corpus, because diacritics are rare even in formal Persian text. translit addresses this with a curated vocabulary of 266 common words with diacritics applied following BGN/PCGN pronunciation rules:
# Without context
transliterate("کتاب فارسی", lang="fa") # → "ktab farsy"
# With context — vowels from curated dictionary
transliterate("کتاب فارسی", lang="fa", context=True) # → "ketab farsy"
For words not in the curated vocabulary, the system falls back to the Arabic context dictionary. Since approximately 40% of Persian vocabulary is Arabic-origin, many loanwords benefit from the Arabic dictionary automatically.
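The lookup order described above (curated Persian entries first, then the Arabic dictionary, then the context-free output) can be sketched with `collections.ChainMap`. The dictionary contents and ASCII skeletons below are toy assumptions, not real translit data.

```python
from collections import ChainMap

# Toy entries keyed by ASCII skeletons (illustrative only).
PERSIAN_CURATED = {"ktab": "ketab"}
ARABIC_DICT = {"ktab": "kitab", "slam": "salam"}

# ChainMap searches its maps left to right, so Persian entries take priority.
LOOKUP = ChainMap(PERSIAN_CURATED, ARABIC_DICT)


def restore(skeleton: str) -> str:
    """Return the first reading found, or the skeleton itself if none exists."""
    return LOOKUP.get(skeleton, skeleton)


for word in ["ktab", "slam", "xyz"]:
    print(word, "->", restore(word))
```

The first word resolves from the Persian entries, the second from the Arabic ones, and the third falls through unchanged.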
Limitations specific to Persian¶
- Smaller dictionary: 266 curated entries vs Arabic's 182K corpus-derived entries. Common words are covered; rare words fall back to context-free.
- No ezafe prediction: The ezafe construction (-e/-ye connecting nouns to adjectives/possessors) is not predicted. It would require syntactic analysis beyond dictionary lookup.
- Waw ambiguity: و serves as both consonant (/v/) and vowel (/o, u/). The lang="fa" override maps it to v; the context dictionary provides the correct vowel form for known words.
Hebrew¶
Standard used¶
The default Hebrew mappings follow common Israeli romanization conventions. Hebrew poses the same fundamental abjad challenge as Arabic: a consonantal alphabet whose optional niqqud (vowel points) most text omits.
How context-aware Hebrew works¶
The Hebrew dictionary is built from Project Ben Yehuda, a public domain collection of 26,000+ Hebrew literary works with niqqud. The dictionary maps unpointed consonant skeletons to their most common niqqud-pointed forms:
# Without context
transliterate("שלום", lang="he") # → "shlvm"
# With context — niqqud restored from dictionary
transliterate("שלום", lang="he", context=True) # → "shalvom"
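The mapping from unpointed to pointed forms can be pictured as a dictionary keyed by the word with its points removed. The sketch below removes Hebrew points with a regular expression over the relevant Unicode combining marks; the names and the single-entry dictionary are assumptions for illustration.

```python
import re

# Hebrew cantillation marks and niqqud points (combining marks only).
_POINTS = re.compile("[\u0591-\u05bd\u05bf\u05c1\u05c2\u05c4\u05c5\u05c7]")


def unpoint(word: str) -> str:
    """Strip niqqud and cantillation, leaving the consonant skeleton."""
    return _POINTS.sub("", word)


# shin + shin dot + qamats, lamed, vav + holam, final mem
SHALOM_POINTED = "\u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"

# Toy dictionary: unpointed skeleton -> pointed form.
POINTED_BY_SKELETON = {unpoint(SHALOM_POINTED): SHALOM_POINTED}


def restore_niqqud(word: str) -> str:
    """Return the pointed form if the skeleton is known, else the input."""
    return POINTED_BY_SKELETON.get(word, word)


print(len(SHALOM_POINTED), len(unpoint(SHALOM_POINTED)))
```

The pointed form is three code points longer than its skeleton, one for each mark that unpointed text leaves out.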
Differences from Arabic¶
| Feature | Arabic | Hebrew |
|---|---|---|
| Vowel marks | Tashkeel (fatha, kasra, damma, etc.) | Niqqud (patach, segol, hiriq, etc.) |
| Gemination | Shadda (ّ) | Dagesh (ּ) |
| Dictionary size | 182K unigrams (65.7M-word corpus) | 227K unigrams (11.4M-word corpus) |
| Ambiguity level | High (many homographs) | Moderate (fewer morphological patterns) |
Limitations specific to Hebrew¶
- Literary bias: The Ben Yehuda corpus is predominantly literary (19th-20th century). Modern Hebrew slang, technical terms, and recent loanwords may not be covered.
- No morphological analysis: Hebrew verbs follow predictable root+pattern templates (binyanim) that could theoretically be used to predict vowels for unknown words. The current system does not exploit this — it relies purely on dictionary lookup.
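To illustrate what root+pattern prediction would mean, the sketch below fills a vowel template with the three consonants of a root. This is purely illustrative: it is not something translit does, the romanized root and patterns are assumptions, and real Hebrew morphology involves many complications this ignores.

```python
def apply_pattern(root: str, pattern: str) -> str:
    """Fill slots 1, 2, 3 in `pattern` with the three root consonants."""
    if len(root) != 3:
        raise ValueError("this sketch only handles three-consonant roots")
    return "".join(root[int(ch) - 1] if ch in "123" else ch for ch in pattern)


# Romanized root k-t-v ("write") with two common vowel templates.
print(apply_pattern("ktv", "1a2a3"))
print(apply_pattern("ktv", "1o2e3"))
```

A dictionary-only system has to store each resulting form separately, whereas a template approach could generate candidates for roots it has never seen.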
Building dictionaries¶
All dictionaries are built reproducibly from source corpora:
# Build all dictionaries from scratch (downloads corpora, builds, verifies checksums)
bash scripts/bootstrap_dicts.sh all
# Build individually
bash scripts/bootstrap_dicts.sh arabic # Tashkeela corpus → arabic_dict.bin
bash scripts/bootstrap_dicts.sh persian # Curated vocab → persian_dict.bin
bash scripts/bootstrap_dicts.sh hebrew # Ben Yehuda → hebrew_dict.bin
# Verify existing dictionaries match expected checksums
bash scripts/bootstrap_dicts.sh verify
The bootstrap script pins all parameters (corpus source, min-frequency threshold, max bigram count) and expected output checksums. Changing any parameter requires updating the checksum — making all dictionary changes visible and auditable.
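Checking a built file against a pinned checksum can be sketched as below. The hash algorithm (SHA-256), the file name, and the function names are assumptions for illustration; the document does not say which checksum the bootstrap script uses.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large dictionaries need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex: str) -> bool:
    """True when the file's digest matches the pinned value."""
    return sha256_of(path) == expected_hex


# Demo with a throwaway file standing in for a built dictionary.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_dict.bin"  # hypothetical file name
    payload = b"toy dictionary payload"
    demo.write_bytes(payload)
    pinned = hashlib.sha256(payload).hexdigest()
    print(verify(demo, pinned))
    print(verify(demo, "0" * 64))
```

Any change to the build inputs changes the digest, so a mismatch signals that the pinned value needs a deliberate, reviewable update.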
How translit differs from other approaches¶
| Approach | Used by | Strengths | Weaknesses |
|---|---|---|---|
| Character-by-character | Unidecode, anyascii, translit (default) | Fast, deterministic, no data dependency | Consonant skeletons for abjad scripts |
| Dictionary + bigram | translit (context=True) | Readable output, no ML dependency, fast | Dictionary size, no sentence-level context |
| Neural diacritization | libtashkeel, Rababa | Handles unknown words, sentence context | Requires ONNX runtime (~15MB+), slower, non-deterministic |
| Rule-based morphology | Buckwalter Analyzer, MADAMIRA, Mishkal | Linguistically precise | Complex, language-specific, slow |
| Human transcription | ALA-LC, scholarly publications | Perfect accuracy | Not automatable |
translit's dictionary+bigram approach occupies the middle ground: its output is far more readable than a character-by-character consonant skeleton, without the weight and complexity of neural or morphological systems. The three-tier fallback ensures graceful degradation: a word missing from the dictionary simply gets the default-mode output.