Language Reference

Complete reference for all 83 built-in language profiles, their transliteration rules, and their test reference texts.

Language Table

Code Language Script Region Has Overrides
am Amharic Ethiopic African Yes
ar Arabic Arabic Middle Eastern
as Assamese Bengali Indic
ban Balinese Balinese Southeast Asian
bax Bamum Bamum African
bg Bulgarian Cyrillic European Yes
bn Bengali Bengali Indic
bo Tibetan Tibetan Central Asian
bug Buginese Lontara Southeast Asian
ca Catalan Latin European Yes
chr Cherokee Cherokee Americas
cjm Cham Cham Southeast Asian
cop Coptic Coptic Middle Eastern
cs Czech Latin European
cy Welsh Latin European
da Danish Latin European
de German Latin European Yes
dv Dhivehi (Maldivian) Thaana South Asian
el Greek Greek European Yes
es Spanish Latin European Yes
et Estonian Latin European Yes
fa Persian (Farsi) Arabic Middle Eastern Yes
fi Finnish Latin European
fr French Latin European Yes
ga Irish Latin European
gu Gujarati Gujarati Indic
he Hebrew Hebrew Middle Eastern
hi Hindi Devanagari Indic
hr Croatian Latin European
hu Hungarian Latin European
hy Armenian Armenian Caucasian
is Icelandic Latin European Yes
it Italian Latin European Yes
ja Japanese Han/Kana East Asian Yes
jv Javanese Javanese Southeast Asian
ka Georgian Georgian Caucasian
khb Tai Lue New Tai Lue Southeast Asian
km Khmer Khmer Southeast Asian
kn Kannada Kannada Indic
ko Korean Hangul East Asian
lis Lisu Fraser/Lisu East Asian
lo Lao Lao Southeast Asian
lt Lithuanian Latin European
lv Latvian Latin European
ml Malayalam Malayalam Indic
mn Mongolian Mongolian Central Asian
mni Meitei Meetei Mayek Indic
mr Marathi Devanagari Indic
mt Maltese Latin European
my Myanmar (Burmese) Myanmar Southeast Asian
ne Nepali Devanagari Indic
nl Dutch Latin European Yes
no Norwegian Latin European Yes
nod Northern Thai Tai Tham Southeast Asian
nqo N'Ko N'Ko African
or Odia Odia Indic
pa Punjabi Gurmukhi Indic
pl Polish Latin European
pt Portuguese Latin European Yes
ro Romanian Latin European
ru Russian Cyrillic European Yes
sa Sanskrit Devanagari Indic
sat Santali Ol Chiki Indic
si Sinhala Sinhala South Asian
sk Slovak Latin European
sl Slovenian Latin European
sq Albanian Latin European
sr Serbian Cyrillic European Yes
su Sundanese Sundanese Southeast Asian
sv Swedish Latin European Yes
syr Syriac Syriac Middle Eastern
ta Tamil Tamil Indic
tdd Tai Le Tai Le Southeast Asian
te Telugu Telugu Indic
th Thai Thai Southeast Asian
tl Tagalog Baybayin Southeast Asian
tr Turkish Latin European Yes
tzm Tamazight (Berber) Tifinagh African
uk Ukrainian Cyrillic European Yes
vai Vai Vai African
vi Vietnamese Latin Southeast Asian Yes
zh Chinese Han East Asian

Languages marked Yes in "Has Overrides" have a dedicated TSV file that overrides the default transliteration table with language-specific rules. Languages with no entry in that column rely entirely on the default Unicode transliteration tables.
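As a rough illustration of how an override file might layer on top of the defaults, the sketch below parses a two-column TSV and merges it over a default table. The two-column layout (character, replacement), the `#` comment convention, and the names `load_overrides` and `build_table` are assumptions made for this example, not the library's actual file format or API.

```python
def load_overrides(tsv_text):
    """Parse assumed two-column TSV text (character, replacement) into a dict."""
    overrides = {}
    for line in tsv_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        char, _, replacement = line.partition("\t")
        overrides[char] = replacement  # an empty replacement removes the character
    return overrides


def build_table(defaults, overrides):
    """Return a new table in which language-specific rules win over defaults."""
    table = dict(defaults)
    table.update(overrides)
    return table


# Hypothetical defaults (plain accent stripping) and a German-style override file.
DEFAULTS = {"ä": "a", "ö": "o", "é": "e"}
DE_TSV = "# German overrides\nä\tae\nö\toe\n"

german_table = build_table(DEFAULTS, load_overrides(DE_TSV))
print(german_table["ä"], german_table["é"])  # ae e
```

Copying the defaults before updating keeps the shared default table untouched, so a profile without overrides can reuse it as-is.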

Reference Texts

Each language has a reference text used for integration testing. These texts are representative samples containing script-specific characters that exercise the transliteration rules.

Code Language Reference Text
am Amharic የኢትዮጵያ ፌዴራላዊ ዲሞክራሲያዊ ሪፐብሊክ...
ar Arabic المملكة العربية السعودية دولة عربية تقع في شبه الجزيرة العربية
as Assamese অসম ভাৰতৰ উত্তৰ পূৰ্বাঞ্চলৰ এখন ৰাজ্য
ban Balinese ᬅᬓᬗᬕᬘ
bax Bamum ꚠꚡꚢꚣ
bg Bulgarian Република България е държава в Югоизточна Европа
bn Bengali গণপ্রজাতন্ত্রী বাংলাদেশ দক্ষিণ এশিয়ার একটি রাষ্ট্র
bo Tibetan བོད་རང་སྐྱོང་ལྗོངས་ནི་རྒྱ་ནག་གི་ཁོངས་གཏོགས
bug Buginese ᨀᨁᨂᨃᨄ
ca Catalan Catalunya és una comunitat autònoma d'Espanya
chr Cherokee ᏣᏔᏂᏃ
cjm Cham ꨀꨁꨂꨃ
cop Coptic Ⲁⲁ Ⲃⲃ Ⲅⲅ
cs Czech Česká republika je stát ve střední Evropě
cy Welsh Cymru yw gwlad sy'n rhan o'r Deyrnas Unedig
da Danish København er Danmarks hovedstad og største by
de German Die Bundesrepublik Deutschland ist ein Bundesstaat in Mitteleuropa
dv Dhivehi ދިވެހިރާއްޖެ
el Greek Η Ελληνική Δημοκρατία είναι χώρα της νοτιοανατολικής Ευρώπης
es Spanish España es un país soberano transcontinental
et Estonian Eesti Vabariik on riik Põhja-Euroopas Läänemere ääres
fa Persian جمهوری اسلامی ایران کشوری در خاورمیانه است
fi Finnish Suomen tasavalta on valtio Pohjois-Euroopassa
fr French La République française est un État transcontinental
ga Irish Éire nó Poblacht na hÉireann is tír í
gu Gujarati ગુજરાત ભારતનું એક રાજ્ય છે જે ભારતના પશ્ચિમ ભાગમાં
he Hebrew מדינת ישראל היא מדינה במזרח התיכון
hi Hindi भारत गणराज्य दक्षिण एशिया में स्थित एक देश है
hr Croatian Republika Hrvatska je država u srednjoj Europi
hu Hungarian Magyarország közép-európai ország
hy Armenian Հայաստան Հանրապետություն Հայ երկիր
is Icelandic Ísland er eyríki á norðanverðum Atlantshafi
it Italian La Repubblica Italiana è uno Stato membro dell'Unione europea
ja Japanese 日本国は東アジアに位置する島国である
jv Javanese ꦐꦟꦪꦣꦨ
ka Georgian საქართველო სახელმწიფოა აღმოსავლეთ ევროპაში
khb Tai Lue ᦀᦁᦂᦃ
km Khmer ព្រះរាជាណាចក្រកម្ពុជា ជាប្រទេសមួយ
kn Kannada ಕರ್ನಾಟಕ ದಕ್ಷಿಣ ಭಾರತದ ಒಂದು ರಾಜ್ಯ
ko Korean 대한민국은 동아시아에 있는 공화국이다
lis Lisu ꓐꓑꓒꓓ
lo Lao ສາທາລະນະລັດ ປະຊາທິປະໄຕ ປະຊາຊົນລາວ
lt Lithuanian Lietuvos Respublika yra valstybė šiaurės Europoje
lv Latvian Latvijas Republika ir valsts Ziemeļeiropā
ml Malayalam കേരളം ഇന്ത്യയിലെ ഒരു സംസ്ഥാനമാണ്
mn Mongolian ᠮᠣᠩᠭᠣᠯ
mni Meitei ꯀꯁꯂꯃ
mr Marathi महाराष्ट्र हे भारतातील एक राज्य आहे
mt Maltese Ir-Repubblika ta' Malta hija stat gżejjer fil-Mediterran
my Myanmar မြန်မာနိုင်ငံသည် အရှေ့တောင်အာရှတွင်
ne Nepali नेपाल एशियाको एक स्वतन्त्र देश हो
nl Dutch Het Koninkrijk der Nederlanden is een staat in West-Europa
no Norwegian Kongeriket Norge er et nordisk land i Skandinavia
nod Northern Thai ᨠᨡᨢᨣ
nqo N'Ko ߁߂߃߄
or Odia ଓଡ଼ିଶା ଭାରତର ପୂର୍ବ ଉପକୂଳରେ ଅବସ୍ଥିତ
pa Punjabi ਪੰਜਾਬ ਭਾਰਤ ਦਾ ਇੱਕ ਰਾਜ ਹੈ
pl Polish Rzeczpospolita Polska jest państwem w Europie Środkowej
pt Portuguese A República Portuguesa é um país situado no sudoeste da Europa
ro Romanian România este un stat situat în sud-estul Europei
ru Russian Российская Федерация является демократическим федеративным государством
sa Sanskrit संस्कृतम् जगतः एका प्राचीनतमा भाषा
sat Santali ᱚᱛᱜᱝ
si Sinhala ශ්‍රී ලංකා ප්‍රජාතාන්ත්‍රික සමාජවාදී ජනරජය
sk Slovak Slovenská republika je štát v strednej Európe
sl Slovenian Republika Slovenija je država v srednji Evropi
sq Albanian Republika e Shqipërisë është një shtet në Europën Juglindore
sr Serbian Београд → Beograd
su Sundanese ᮃᮄᮅᮆ
sv Swedish Konungariket Sverige är ett nordiskt land på Skandinaviska halvön
syr Syriac ܐܒܓܕ
ta Tamil தமிழ்நாடு இந்தியாவின் தெற்கே அமைந்துள்ள மாநிலம்
tdd Tai Le ᥐᥑᥒᥓ
te Telugu తెలుగు భాష ద్రావిడ భాషా కుటుంబానికి చెందిన భాష
th Thai ประเทศไทยเป็นรัฐชาติอันตั้งอยู่ในเอเชียตะวันออกเฉียงใต้
tl Tagalog ᜀᜁᜂᜃ
tr Turkish Türkiye Cumhuriyeti Avrupa ile Asya arasında yer alan bir ülkedir
tzm Tamazight ⴰⴱⴳⴷ
uk Ukrainian Україна є державою у Східній та Центральній Європі
vai Vai ꔀꔁꔂꔃ
vi Vietnamese Cộng hòa xã hội chủ nghĩa Việt Nam là một quốc gia
zh Chinese 中华人民共和国是位于东亚的社会主义国家
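A minimal sketch of how these reference texts could drive an integration check. It assumes the goal is non-empty, pure-ASCII output, and it uses plain accent stripping as a stand-in for the real transliterator, so only three Latin-script entries from the table are exercised. The names `strip_accents` and `check_reference_texts` are hypothetical.

```python
import unicodedata

# Three Latin-script entries copied from the reference text table.
REFERENCE_TEXTS = {
    "cs": "Česká republika je stát ve střední Evropě",
    "hu": "Magyarország közép-európai ország",
    "ro": "România este un stat situat în sud-estul Europei",
}


def strip_accents(text):
    """Stand-in transliterator: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def check_reference_texts(transliterate, texts):
    """Return the codes whose output is empty or still contains non-ASCII."""
    failures = []
    for code, text in texts.items():
        result = transliterate(text)
        if not result or not result.isascii():
            failures.append(code)
    return failures


print(check_reference_texts(strip_accents, REFERENCE_TEXTS))  # []
```

Passing the transliterator in as a parameter lets the same check run against any profile implementation.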

Language-Specific Transliteration Rules

The following sections document the exact character-level overrides applied by each language profile. Languages without a dedicated section rely entirely on the default Unicode transliteration tables (accent stripping, script-specific tables, etc.).
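The interaction between overrides and defaults can be sketched as a per-character lookup with a fallback. This is an illustration only: it assumes overrides are consulted first and that the default for Latin text is accent stripping through Unicode decomposition. The function names are hypothetical, and the override values are taken from the German table in this reference.

```python
import unicodedata

# German overrides as listed in this reference (U+1E9E is the capital sharp s).
DE_OVERRIDES = {
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ä": "ae", "ö": "oe", "ü": "ue",
    "\u1e9e": "SS",
}


def default_transliterate(ch):
    """Assumed default for Latin letters: decompose and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def transliterate(text, overrides=None):
    """Apply language overrides first, then fall back to the default rule."""
    overrides = overrides or {}
    return "".join(
        overrides[ch] if ch in overrides else default_transliterate(ch)
        for ch in text
    )


print(transliterate("Köln Düsseldorf", DE_OVERRIDES))  # Koeln Duesseldorf
print(transliterate("Köln Düsseldorf"))                # Koln Dusseldorf
```

The second call shows what a profile without overrides would produce for the same input.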

Amharic (am)

Based on BGN/PCGN romanization for Amharic. Three categories of overrides:

ጸ series — tsade merger (U+1338–U+133F):

Character Unicode Default Override Notes
ጸ U+1338 tse se Ejective /sʼ/ in Amharic, not /ts/
ጹ U+1339 tsu su
ጺ U+133A tsi si
ጻ U+133B tsa sa
ጼ U+133C tse se
ጽ U+133D ts s
ጾ U+133E tso so
ጿ U+133F tswa swa

ፀ series — tsade merger (U+1340–U+1347):

Character Unicode Default Override Notes
ፀ U+1340 tse se ጸ/ፀ merger in modern Amharic
ፁ U+1341 tsu su
ፂ U+1342 tsi si
ፃ U+1343 tsa sa
ፄ U+1344 tse se
ፅ U+1345 ts s
ፆ U+1346 tso so
ፇ U+1347 tswa swa

ዐ series — pharyngeal marking (U+12D0–U+12D6):

Character Unicode Default Override Notes
ዐ U+12D0 e 'e Pharyngeal distinct from glottal stop (አ)
ዑ U+12D1 u 'u
ዒ U+12D2 i 'i
ዓ U+12D3 a 'a
ዔ U+12D4 e 'e
ዕ U+12D5 e 'e
ዖ U+12D6 o 'o

Bulgarian (bg)

Character Unicode Replacement Notes
Ъ U+042A A Hard sign
ъ U+044A a Hard sign (lowercase)
Щ U+0429 Sht Shta
щ U+0449 sht Shta (lowercase)

Catalan (ca)

Character Unicode Replacement Notes
· U+00B7 (empty) Interpunct (ela geminada separator removed)

German (de)

Character Unicode Replacement Notes
Ä U+00C4 Ae Umlaut
Ö U+00D6 Oe Umlaut
Ü U+00DC Ue Umlaut
ä U+00E4 ae Umlaut (lowercase)
ö U+00F6 oe Umlaut (lowercase)
ü U+00FC ue Umlaut (lowercase)
ẞ U+1E9E SS Capital sharp s

Greek (el)

Character Unicode Replacement Notes
Η U+0397 I Eta
η U+03B7 i Eta (lowercase)
Υ U+03A5 Y Upsilon
υ U+03C5 y Upsilon (lowercase)
Χ U+03A7 Ch Chi
χ U+03C7 ch Chi (lowercase)

Spanish (es)

Character Unicode Replacement Notes
¡ U+00A1 ! Inverted exclamation mark
¿ U+00BF ? Inverted question mark

Estonian (et)

Character Unicode Replacement Notes
Ä U+00C4 Ae
ä U+00E4 ae
Ö U+00D6 Oe
ö U+00F6 oe
Ü U+00DC Ue
ü U+00FC ue

Persian (fa)

Based on BGN/PCGN 1958 romanization system with ASCII output.

Consonants:

Character Unicode Replacement Notes
ب U+0628 b
پ U+067E p Persian-specific
ت U+062A t
ث U+062B s Persian pronunciation (Arabic: th)
ج U+062C j
چ U+0686 ch Persian-specific
ح U+062D h
خ U+062E kh
د U+062F d
ذ U+0630 z Persian pronunciation (Arabic: dh)
ر U+0631 r
ز U+0632 z
ژ U+0698 zh Persian-specific
س U+0633 s
ش U+0634 sh
ص U+0635 s
ض U+0636 z Persian pronunciation (Arabic: d)
ط U+0637 t
ظ U+0638 z
ع U+0639 ' Ain
غ U+063A gh
ف U+0641 f
ق U+0642 q
ک U+06A9 k Persian kaf
گ U+06AF g Persian-specific
ل U+0644 l
م U+0645 m
ن U+0646 n
و U+0648 v Consonantal default
ه U+0647 h
ی U+06CC y Farsi yeh
ك U+0643 k Arabic kaf fallback

Vowels and special characters:

Character Unicode Replacement Notes
ا U+0627 a Alef
آ U+0622 a Alef-madda
ء U+0621 ' Hamza
أ U+0623 a Alef with hamza above
إ U+0625 e Alef with hamza below
ؤ U+0624 ' Waw with hamza (glottal stop)
ئ U+0626 ' Yeh with hamza (glottal stop)
ة U+0629 e Taa marbuta
ى U+0649 a Alef maqsura
ي U+064A y Arabic yaa
ۀ U+06C0 -e Izafe
ہ U+06C1 h Heh goal

Diacritics:

Character Unicode Replacement Notes
فتحه U+064E a Fathah
کسره U+0650 e Kasra
ضمه U+064F o Damma
سکون U+0652 (empty) Sukun — suppress vowel
شدّه U+0651 (empty) Shadda — gemination

Digits: ۰–۹ (U+06F0–U+06F9) → 0–9

Punctuation: ، → , ؛ → ; ؟ → ? ۔ → .
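The digit and punctuation rules are one-to-one character mappings, so they can be expressed as a translation table. The sketch below uses Python's built-in `str.maketrans` and `str.translate`. The function name `normalize_digits_and_punctuation` is hypothetical, and the code points are those of the characters listed above.

```python
# Extended Arabic-Indic digits U+06F0..U+06F9 map to ASCII 0..9.
PERSIAN_DIGITS = {chr(0x06F0 + i): str(i) for i in range(10)}

# Punctuation from the list above, written as escapes for clarity.
PERSIAN_PUNCTUATION = {
    "\u060C": ",",  # Arabic comma
    "\u061B": ";",  # Arabic semicolon
    "\u061F": "?",  # Arabic question mark
    "\u06D4": ".",  # Arabic full stop
}

TABLE = str.maketrans({**PERSIAN_DIGITS, **PERSIAN_PUNCTUATION})


def normalize_digits_and_punctuation(text):
    """Replace Persian digits and punctuation with their ASCII counterparts."""
    return text.translate(TABLE)


print(normalize_digits_and_punctuation("\u06F1\u06F2\u06F3\u060C \u06F4\u061F"))  # 123, 4?
```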

French (fr)

Character Unicode Replacement Notes
Œ U+0152 OE Ligature
œ U+0153 oe Ligature (lowercase)
Æ U+00C6 AE Ligature
æ U+00E6 ae Ligature (lowercase)

Icelandic (is)

Character Unicode Replacement Notes
Ð U+00D0 Dh Eth
ð U+00F0 dh Eth (lowercase)
Þ U+00DE Th Thorn
þ U+00FE th Thorn (lowercase)
Æ U+00C6 Ae
æ U+00E6 ae

Italian (it)

Character Unicode Replacement Notes
ª U+00AA a Feminine ordinal indicator
º U+00BA o Masculine ordinal indicator

Japanese (ja)

Character Unicode Replacement Notes
ー U+30FC (empty) Chōonpu (prolonged sound mark) removed

Japanese uses the default Hiragana/Katakana → Hepburn tables and falls back to Chinese pinyin readings for Han characters. Only the prolonged sound mark is overridden.

Dutch (nl)

Character Unicode Replacement Notes
Ĳ U+0132 IJ Ligature
ĳ U+0133 ij Ligature (lowercase)

Norwegian (no)

Character Unicode Replacement Notes
Å U+00C5 Aa
å U+00E5 aa
Ø U+00D8 Oe
ø U+00F8 oe
Æ U+00C6 Ae
æ U+00E6 ae

Both "no" and "nb" (Bokmål) map to the same profile. "nn" (Nynorsk) also uses the same mappings.
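One simple way to model this aliasing is a lookup that resolves a language code to its profile before any tables are loaded. This is a sketch under assumptions: the `ALIASES` mapping and `resolve_profile` are hypothetical names, and the case-insensitive handling is an illustrative choice rather than documented behavior.

```python
# Codes that share the Norwegian profile, per the note above.
ALIASES = {"nb": "no", "nn": "no"}


def resolve_profile(code):
    """Map a language code to the profile code whose tables it uses."""
    code = code.lower()  # assumption: codes are matched case-insensitively
    return ALIASES.get(code, code)


print(resolve_profile("nb"), resolve_profile("nn"), resolve_profile("sv"))  # no no sv
```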

Portuguese (pt)

Character Unicode Replacement Notes
ª U+00AA a Feminine ordinal indicator
º U+00BA o Masculine ordinal indicator

Russian (ru)

Character Unicode Replacement Notes
Ё U+0401 Yo
ё U+0451 yo
Й U+0419 Y Short I
й U+0439 y Short I (lowercase)
Ъ U+042A " Hard sign
ъ U+044A " Hard sign (lowercase)
Ь U+042C ' Soft sign
ь U+044C ' Soft sign (lowercase)
Э U+042D E Reversed E
э U+044D e Reversed E (lowercase)
Ю U+042E Yu
ю U+044E yu
Я U+042F Ya
я U+044F ya

Serbian (sr)

Character Unicode Replacement Notes
Ђ U+0402 Dj Dje
ђ U+0452 dj Dje (lowercase)
Ћ U+040B C Tshe
ћ U+045B c Tshe (lowercase)
Џ U+040F Dz Dzhe
џ U+045F dz Dzhe (lowercase)
Љ U+0409 Lj Lje
љ U+0459 lj Lje (lowercase)
Њ U+040A Nj Nje
њ U+045A nj Nje (lowercase)
Ј U+0408 J Je
ј U+0458 j Je (lowercase)
Й U+0419 Y Short I
й U+0439 y Short I (lowercase)

Swedish (sv)

Character Unicode Replacement Notes
Ä U+00C4 Ae
ä U+00E4 ae
Ö U+00D6 Oe
ö U+00F6 oe

Turkish (tr)

Character Unicode Replacement Notes
İ U+0130 I Dotted capital I
ı U+0131 i Dotless lowercase i
Ğ U+011E G Breve
ğ U+011F g Breve (lowercase)
Ş U+015E S Cedilla
ş U+015F s Cedilla (lowercase)

Ukrainian (uk)

Character Unicode Replacement Notes
Г U+0413 H Ukrainian /h/ sound (Russian: G)
г U+0433 h
Ґ U+0490 G Hard /g/ sound
ґ U+0491 g
Є U+0404 Ye Ukrainian Ye
є U+0454 ye
Ї U+0407 I Yi
ї U+0457 i
І U+0406 I Ukrainian I
і U+0456 i
И U+0418 Y Ukrainian Y (Russian: I)
и U+0438 y
Ь U+042C ' Soft sign
ь U+044C ' Soft sign (lowercase)

Vietnamese (vi)

Character Unicode Replacement Notes
Đ U+0110 D D with stroke
đ U+0111 d D with stroke (lowercase)
Ơ U+01A0 O O with horn
ơ U+01A1 o O with horn (lowercase)
Ư U+01AF U U with horn
ư U+01B0 u U with horn (lowercase)