Language Support¶
translit ships with 83 built-in language profiles that provide language-specific transliteration rules. You can also register custom profiles at runtime.
Built-in languages¶
European languages¶
| Code | Language | Key overrides | Example |
|---|---|---|---|
bg |
Bulgarian | Ъ→A, Щ→Sht | Ъгъл → Agal |
ca |
Catalan | Ç→C, ŀ→l·l | Ça → Ca |
cs |
Czech | Č→C, Ř→R, Ž→Z | Říční → Ricni |
cy |
Welsh | Ŵ→W, Ŷ→Y | Ŵyr → Wyr |
da |
Danish | Æ→Ae, Ø→Oe, Å→Aa | Ærø → Aeroe |
de |
German | Ä→Ae, Ö→Oe, Ü→Ue, ß→ss | München → Muenchen |
el |
Greek | Full alphabet transliteration | Αθήνα → Athina |
es |
Spanish | Ñ→N | España → Espana |
et |
Estonian | Õ→O, Š→S, Ž→Z | Õlu → Olu |
fi |
Finnish | Ä→A, Ö→O | Ääkkönen → Aakkonen |
fr |
French | Ç→C, Œ→OE | Ça → Ca |
ga |
Irish | Ḃ→Bh, Ċ→Ch, Ḋ→Dh | Ṁáire → Mhaire |
hr |
Croatian | Č→C, Ć→C, Đ→D, Š→S, Ž→Z | Đurđevac → Durdevac |
hu |
Hungarian | Ő→O, Ű→U | Győr → Gyor |
is |
Icelandic | Ð→Dh, Þ→Th | Ísland → Island |
it |
Italian | À→A, È→E | Città → Citta |
lt |
Lithuanian | Ą→A, Ę→E, Ė→E, Į→I, Ų→U | Šiauliai → Siauliai |
lv |
Latvian | Ā→A, Č→C, Ģ→G, Ķ→K, Ļ→L, Ņ→N | Rīga → Riga |
mt |
Maltese | Ċ→C, Ġ→G, Ħ→H, Ż→Z | Għawdex → Ghawdex |
nl |
Dutch | IJ→IJ | IJmuiden → IJmuiden |
no |
Norwegian | Æ→Ae, Ø→Oe, Å→Aa | Ål → Aal |
pl |
Polish | Ą→A, Ć→C, Ę→E, Ł→L, Ń→N, Ó→O, Ś→S, Ź→Z, Ż→Z | Łódź → Lodz |
pt |
Portuguese | Ã→A, Õ→O, Ç→C | São Paulo → Sao Paulo |
ro |
Romanian | Ă→A, Â→A, Î→I, Ș→S, Ț→T | București → Bucuresti |
sk |
Slovak | Ä→A, Č→C, Ď→D, Ľ→L, Ň→N, Ô→O, Ŕ→R, Š→S, Ť→T, Ž→Z | Bratislava |
sl |
Slovenian | Č→C, Š→S, Ž→Z | Ljubljana |
sq |
Albanian | Ç→C, Ë→E | Shqipëria → Shqiperia |
sr |
Serbian | Full Cyrillic→Latin | Београд → Beograd |
sv |
Swedish | Ä→Ae, Ö→Oe, Å→Aa | Malmö → Malmoe |
tr |
Turkish | Ç→C, Ğ→G, İ→I, Ö→O, Ş→S, Ü→U | İstanbul → Istanbul |
uk |
Ukrainian | Г→H, Ґ→G, Є→Ye, Ї→Yi, І→I | Київ → Kyiv |
Southeast Asian languages¶
| Code | Language | Key overrides | Example |
|---|---|---|---|
vi |
Vietnamese | Full diacritical vowel set | Hà Nội → Ha Noi |
Semitic languages¶
| Code | Language | Notes |
|---|---|---|
ar |
Arabic | Basic transliteration (Buckwalter-derived) |
he |
Hebrew | Common Israeli romanization; Qof → q (SBL); presentation forms with dagesh |
Iranian languages¶
| Code | Language | Notes |
|---|---|---|
fa |
Persian (Farsi) | UNGEGN-based romanization; ث→s, ذ→z, ض→z, ظ→z (Persian pronunciation) |
Other Middle Eastern languages¶
| Code | Language | Script | Notes |
|---|---|---|---|
cop |
Coptic | Coptic | Coptic scholarly romanization |
syr |
Syriac | Syriac | Syriac script transliteration |
Ethiopic languages¶
| Code | Language | Script | Notes |
|---|---|---|---|
am |
Amharic | Ethiopic | Syllable-based transliteration |
African languages¶
| Code | Language | Script | Notes |
|---|---|---|---|
bax |
Bamum | Bamum | Bamum syllabary transliteration |
nqo |
N'Ko | N'Ko | Manding languages (N'Ko script) |
tzm |
Tamazight (Berber) | Tifinagh | Neo-Tifinagh script transliteration |
vai |
Vai | Vai | Vai syllabary transliteration |
Caucasian languages¶
| Code | Language | Notes |
|---|---|---|
hy |
Armenian | BGN/PCGN romanization |
ka |
Georgian | National romanization |
Indic languages¶
| Code | Language | Script | Example |
|---|---|---|---|
as |
Assamese | Bengali | — |
bn |
Bengali | Bengali | কলকাতা → kalakata |
gu |
Gujarati | Gujarati | ગુજરાતી → gujarati |
hi |
Hindi | Devanagari | नमस्ते → namaste |
kn |
Kannada | Kannada | ಕನ್ನಡ → kannada |
ml |
Malayalam | Malayalam | മലയാളം → malayalam |
mni |
Meitei | Meetei Mayek | Meetei Mayek script transliteration |
mr |
Marathi | Devanagari | — |
ne |
Nepali | Devanagari | — |
or |
Odia | Odia | ଓଡ଼ିଆ → odia |
pa |
Punjabi | Gurmukhi | ਗੁਰਮੁਖੀ → gurmukhi |
sa |
Sanskrit | Devanagari | — |
sat |
Santali | Ol Chiki | Ol Chiki script transliteration |
si |
Sinhala | Sinhala | සිංහල → simhala |
ta |
Tamil | Tamil | தமிழ் → tamizh |
te |
Telugu | Telugu | తెలుగు → telugu |
All 10 Brahmic scripts use virama/mātrā-aware transliteration: consonants carry an inherent "a" that is suppressed by virama (halant) or replaced by dependent vowel marks.
Tibetan languages¶
| Code | Language | Script | Notes |
|---|---|---|---|
bo |
Tibetan | Tibetan | Indic-phonetic romanization (Hunterian-style aspiration markers; not Wylie) |
Southeast Asian languages¶
| Code | Language | Script | Example |
|---|---|---|---|
ban |
Balinese | Balinese | Balinese script transliteration |
bug |
Buginese | Lontara | Lontara syllabary transliteration |
cjm |
Cham | Cham | Cham script transliteration |
khb |
Tai Lue | New Tai Lue | New Tai Lue script transliteration |
km |
Khmer | Khmer | ភាសាខ្មែរ → phasakhmaer |
lo |
Lao | Lao | ລາວ → lao |
my |
Myanmar (Burmese) | Myanmar | မြန်မာ → mrannma |
nod |
Northern Thai | Tai Tham (Lanna) | Tai Tham script transliteration |
su |
Sundanese | Sundanese | Sundanese script transliteration |
tdd |
Tai Le | Tai Le | Tai Le script transliteration |
th |
Thai | Thai | สวัสดี → sawatdi |
Philippine languages¶
| Code | Language | Script | Notes |
|---|---|---|---|
tl |
Tagalog | Baybayin (Tagalog) | Baybayin script transliteration |
Americas¶
| Code | Language | Script | Notes |
|---|---|---|---|
chr |
Cherokee | Cherokee | Cherokee syllabary transliteration |
Lisu¶
| Code | Language | Script | Notes |
|---|---|---|---|
lis |
Lisu | Fraser script | Fraser/Lisu script transliteration |
East Asian & other non-European languages¶
| Code | Language | Notes |
|---|---|---|
ja |
Japanese | Hiragana/Katakana → Hepburn; Kanji → Chinese pinyin fallback |
ja-kunrei |
Japanese (Kunrei-shiki) | し→si, ち→ti, つ→tu, ふ→hu; use for ISO/TR 11941 |
ko |
Korean | Hangul → Revised Romanization (algorithmic jamo decomposition) |
ru |
Russian | Full Cyrillic → Latin |
zh |
Chinese | Hanzi → toneless pinyin (20,924 characters from Unihan kMandarin) |
Toned pinyin: Pass
tones=Truetotransliterate()for diacritical pinyin output (e.g.,"běi jīng"instead of"bei jing"). Coverage includes the ~2,000 most common characters.
CJK examples¶
from translit import transliterate, slugify
# Chinese
transliterate("北京市") # => "bei jing shi"
slugify("北京烤鸭") # => "bei-jing-kao-ya"
# Korean
transliterate("서울") # => "seo ul"
slugify("대한민국") # => "dae-han-min-gug"
# Japanese (hiragana/katakana use Hepburn; kanji use Chinese pinyin)
transliterate("ひらがな") # => "hiragana"
transliterate("東京タワー") # => "dong jing tawa-"
transliterate("東京タワー", lang="ja") # => "dong jing tawa" (ー dropped)
Reverse transliteration¶
translit can convert romanized Latin text back to native script for selected languages using the target parameter:
from translit import transliterate, reverse_langs
transliterate("Moskva", target="ru") # → "Москва"
transliterate("Kyiv", target="uk") # → "Київ" (approximate)
transliterate("Athina", target="el") # → "Αθηνα"
# List supported languages
reverse_langs() # → ["el", "ru", "uk"]
Reverse transliteration uses greedy longest-match scanning to handle digraphs and trigraphs (e.g., "shch" → щ). See Limitations for round-trip degradation details.
Auto-detecting language from script¶
When you don't know the language of the input text, pass lang="auto" to automatically detect the dominant non-Latin script and select the appropriate language profile:
from translit import transliterate, slugify, LANG_AUTO
# Detects Cyrillic → uses Russian ("ru") profile
transliterate("Москва", lang="auto") # => "Moskva"
# Detects Thai → uses Thai ("th") profile
transliterate("ภาษาไทย", lang="auto") # => Thai transliteration
# Detects Devanagari → uses Hindi ("hi") profile
transliterate("नमस्ते", lang="auto") # => "namaste"
# Detects Hangul → uses Korean ("ko") profile
slugify("한국어", lang="auto") # => Korean romanization slug
# Works with all call sites
from translit import TextPipeline, Slugifier
pipe = TextPipeline(transliterate=True, lang="auto")
pipe("こんにちは") # => Japanese transliteration
s = Slugifier(lang="auto")
s("東京タワー") # => CJK slug
How auto-detection works¶
- Scans the input for the first non-Latin, non-Common character
- For ambiguous scripts, scans for exclusive discriminator characters
- Maps the detected script (and discriminated language) to a language code
- Falls back to default (no language override) if the text is Latin-only or the script has no mapping
For a detailed walkthrough of the three-stage detection pipeline, discriminator tables, and fail-safe guarantees, see Language Detection.
Script-to-language mapping¶
For unambiguous scripts (one script = one language), detection is immediate:
| Script | Default language |
|---|---|
| Georgian | ka |
| Armenian | hy |
| Thai | th |
| Hangul | ko |
| Hiragana / Katakana | ja |
| Greek | el |
| Thaana | dv (Dhivehi) |
| Bengali, Tamil, Telugu, Kannada, Malayalam, Gujarati, Gurmukhi, Odia, Sinhala | respective language |
| Ethiopic, Tibetan, Lao, Myanmar, Khmer, Mongolian, Javanese, Hebrew | respective language |
| Balinese | ban |
| Bamum | bax |
| Buginese (Lontara) | bug |
| Cham | cjm |
| Cherokee | chr |
| Coptic | cop |
| Lisu (Fraser) | lis |
| Meetei Mayek | mni |
| N'Ko | nqo |
| New Tai Lue | khb |
| Ol Chiki | sat |
| Sundanese | su |
| Syriac | syr |
| Tagalog (Baybayin) | tl |
| Tai Le | tdd |
| Tai Tham (Lanna) | nod |
| Tifinagh | tzm |
| Vai | vai |
Character-level discrimination for ambiguous scripts¶
For scripts shared by multiple languages, translit scans for exclusive characters — codepoints that appear in exactly one language's alphabet among the profiles we support:
| Script | Exclusive characters | Detected language |
|---|---|---|
| Cyrillic | ґ Ґ ї Ї є Є і І | uk (Ukrainian) |
| Cyrillic | ђ Ђ ћ Ћ љ Љ њ Њ џ Џ ј Ј | sr (Serbian) |
| Cyrillic | ө Ө ү Ү | mn (Mongolian) |
| Arabic | پ چ ژ گ | fa (Persian) |
| Latin | ơ Ơ ư Ư | vi (Vietnamese) |
| Latin | İ ı | tr (Turkish) |
| Latin | ß ẞ | de (German) |
If no exclusive characters are found, the script default is used (Cyrillic → ru, Arabic → ar, Latin → no override). If exclusive characters from two different languages appear in the same text (e.g., Ukrainian ї and Serbian ћ), detection falls back to the script default — this is the fail-safe guarantee.
# Ukrainian detected by exclusive ї
transliterate("Київ", lang="auto") # → uses uk profile
# Persian detected by exclusive پ
transliterate("پارسی", lang="auto") # → uses fa profile
# German detected by ß
transliterate("Straße", lang="auto") # → uses de profile
# No exclusive chars → safe default
transliterate("Москва", lang="auto") # → uses ru (unchanged)
For scripts that remain ambiguous after discrimination (Devanagari, Han), pass an explicit language code when accuracy matters.
Tip
Use the LANG_AUTO constant for type safety:
from translit import LANG_AUTO, transliterate
transliterate("Москва", lang=LANG_AUTO)
Using language profiles¶
With functions¶
from translit import transliterate, slugify, sanitize_filename
transliterate("Ürümqi", lang="de") # => "Ueruemqi"
slugify("Ärger im Büro", lang="de") # => "aerger-im-buero"
sanitize_filename("Ärger.txt", lang="de") # => "Aerger.txt"
With classes¶
from translit import Slugifier, TextPipeline
slug = Slugifier(lang="de", separator="_")
pipe = TextPipeline(transliterate=True, lang="fr")
Language constants¶
Pre-defined constants for type safety:
from translit import LANG_DE, LANG_FR, transliterate
transliterate("Ä", lang=LANG_DE) # => "Ae"
transliterate("Ç", lang=LANG_FR) # => "C"
Listing available languages¶
from translit import list_langs
print(list_langs())
# => ['am', 'ar', 'as', 'bg', 'bn', 'bo', 'ca', 'cs', 'cy', 'da', 'de', 'dv', 'el',
# 'es', 'et', 'fa', 'fi', 'fr', 'ga', 'gu', 'he', 'hi', 'hr', 'hu', 'hy',
# 'is', 'it', 'ja', 'jv', 'ka', 'km', 'kn', 'ko', 'lo', 'lt', 'lv', 'ml', 'mn',
# 'mr', 'mt', 'my', 'ne', 'nl', 'no', 'or', 'pa', 'pl', 'pt', 'ro', 'ru', 'sa',
# 'si', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te', 'th', 'tr', 'uk', 'vi', 'zh']
Custom language profiles¶
register_lang¶
Register a new language profile or override an existing one:
from translit import register_lang, transliterate
# Register Esperanto
register_lang("eo", {
"ĉ": "cx",
"ĝ": "gx",
"ĥ": "hx",
"ĵ": "jx",
"ŝ": "sx",
"ŭ": "ux",
})
transliterate("ĉapelo", lang="eo") # => "cxapelo"
Warning
register_lang() is a global operation. Registered profiles persist for the lifetime of the Python process. They are not thread-local.
register_replacements¶
Register global pre-transliteration string replacements:
from translit import register_replacements, transliterate
register_replacements({
"©": "(c)",
"®": "(R)",
"™": "(TM)",
})
transliterate("Hello™ World©") # => "Hello(TM) World(c)"
Norwegian variants¶
Both "no" and "nb" (Bokmål) map to the same Norwegian profile. "nn" (Nynorsk) also uses the same mappings. Use any of these codes interchangeably.
Historical and ancient scripts¶
translit includes transliteration mappings for several historical and ancient writing systems:
| Script | Unicode Block | Example |
|---|---|---|
| Runic (Elder/Younger Futhark) | U+16A0–U+16FF | ᚠᚢᚦᚨᚱᚲ → futhark |
| Ogham | U+1680–U+169F | ᚑᚌᚐᚋ → ogam |
| Gothic | U+10330–U+1034F | 𐌲𐌿𐍄 → gut |
| Old Persian Cuneiform | U+103A0–U+103D5 | 𐎠𐎭𐎶 → adama |
| Linear B Syllabary | U+10000–U+1007F | 𐀀𐀁𐀂 → aei |
| Cherokee | U+13A0–U+13FF | ᏣᎳᎩ → tsalagi |
| Canadian Aboriginal Syllabics | U+1400–U+167F | ᐃᓄᒃᑎᑐᑦ → inoktwetwiit |
| Mongolian | U+1800–U+18AF | ᠮᠣᠩᠭᠣᠯ → monggol |
These mappings provide approximate romanizations suitable for search indexing and display purposes. They are not intended as scholarly transliteration standards.