Transliteration Provenance
This document records the formal standard or source behind every Unicode block
in translit's transliteration tables. Its purpose is traceability: for any
character→ASCII mapping, a reader should be able to identify which published
romanization system it follows and where to verify it.
Methodology
Provenance was determined by comparing translit's actual per-character mappings
against published romanization tables. Diagnostic characters — those where
competing standards diverge — were used to identify the source unambiguously.
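
The comparison described above can be sketched as a small filter over candidate standards. This is a minimal illustration, not the audit tooling itself: the function name `matching_standards`, the helper `example_candidates`, and the data layout are all hypothetical. The sample data uses the Devanagari palatal stops (U+091A, U+091B), which this document cites as the diagnostic that separates Hunterian from IAST.

```rust
use std::collections::HashMap;

/// Return the names of the candidate standards whose expected output agrees
/// with the observed mapping for every diagnostic character they list.
/// (Hypothetical helper; the real audit procedure is not shown in this document.)
fn matching_standards(
    observed: &HashMap<char, String>,
    candidates: &[(&str, Vec<(char, &str)>)],
) -> Vec<String> {
    candidates
        .iter()
        .filter(|(_, expected)| {
            expected
                .iter()
                .all(|(ch, want)| observed.get(ch).map(|s| s.as_str()) == Some(*want))
        })
        .map(|(name, _)| name.to_string())
        .collect()
}

/// Illustrative diagnostic data: palatal stops in Hunterian versus IAST.
fn example_candidates() -> Vec<(&'static str, Vec<(char, &'static str)>)> {
    vec![
        ("Hunterian", vec![('\u{091A}', "cha"), ('\u{091B}', "chha")]),
        ("IAST", vec![('\u{091A}', "ca"), ('\u{091B}', "cha")]),
    ]
}
```

If exactly one candidate survives the filter, the diagnostic set identifies the source unambiguously; if several survive, more diagnostic characters are needed.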
Default Table (translit_default.tsv)
The default table covers the BMP (U+0080–U+FFFF). Every mapping applies unless
overridden by a language-specific table or the ISO 9 / GOST table.
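
The precedence rule can be pictured as a two-step lookup. The sketch below assumes both tables have already been loaded into hash maps; the function name `lookup` and the in-memory representation are assumptions made for illustration, not translit's actual API.

```rust
use std::collections::HashMap;

/// Look up a character, preferring the override table (a language-specific
/// table or an alternate Cyrillic table) when one is supplied, and falling
/// back to the default table otherwise.
/// (Hypothetical signature; translit's real lookup code is not shown here.)
fn lookup<'a>(
    ch: char,
    override_table: Option<&'a HashMap<char, String>>,
    default_table: &'a HashMap<char, String>,
) -> Option<&'a str> {
    override_table
        .and_then(|table| table.get(&ch))
        .or_else(|| default_table.get(&ch))
        .map(|s| s.as_str())
}
```

With a default entry of ü→u (assumed here from NFKD) and the German override ü→ue, the same character yields `u` without a `lang` parameter and `ue` with `de`. Characters absent from the override table fall through to the default.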
Latin Blocks
| Block | Range | Source | Notes |
|---|---|---|---|
| Latin-1 Supplement | U+0080–U+00FF | NFKD decomposition + convention | ~69% match Unicode NFKD; remainder uses conventional ASCII (AE, Th, ss, GBP, JPY) |
| Latin Extended-A | U+0100–U+017F | NFKD decomposition + convention | ~62% NFKD; remainder follows Unidecode-like conventions for stroked/hooked letters |
| Latin Extended-B | U+0180–U+024F | NFKD + Unidecode-like fallback | Letters without NFKD decomposition use phonetic approximation (Ŋ→N, Ə→A, Ʃ→Sh) |
| IPA Extensions | U+0250–U+02AF | Phonetic approximation | 0% NFKD match; maps each IPA symbol to its nearest readable ASCII. Digraphs preferred over Unidecode's uppercase convention (ʃ→sh not S, ʒ→zh not Z) |
| Latin Extended Additional | U+1E00–U+1EFF | NFKD decomposition | 99.6% NFKD match. Single exception: U+1E9E LATIN CAPITAL LETTER SHARP S → SS (no NFKD decomposition exists) |
| Spacing Modifier Letters | U+02B0–U+02FF | Phonetic approximation | Modifier letters mapped to their base letter equivalents |
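
The match percentages in the Notes column can be read as a simple ratio over a block. The sketch below assumes the NFKD-derived ASCII string for each character has been computed elsewhere (the Rust standard library has no normalizer, so an external crate or precomputed data would be needed). The function name and input layout are hypothetical.

```rust
/// Percentage of entries whose table output equals the ASCII string obtained
/// from NFKD decomposition. `None` marks a character with no usable NFKD form.
/// (Hypothetical helper for illustration; not part of translit.)
fn nfkd_match_percent(entries: &[(&str, Option<&str>)]) -> f64 {
    if entries.is_empty() {
        return 0.0;
    }
    let matches = entries
        .iter()
        .filter(|(table, nfkd)| *nfkd == Some(*table))
        .count();
    100.0 * matches as f64 / entries.len() as f64
}
```

For a 256-code-point block with one exception, as reported for Latin Extended Additional, this gives 255/256, which rounds to 99.6%.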
Cyrillic
| Block | Range | Source | Notes |
|---|---|---|---|
| Cyrillic | U+0400–U+04FF | BGN/PCGN Russian (1947, revised 1994) | Confirmed by Ж→Zh, Х→Kh, Щ→Shch, Ц→Ts, Ю→Yu, Я→Ya. Hard/soft signs map to the empty string; this is an ASCII simplification, since BGN/PCGN itself renders them as ” and ’ (translit_lang_ru.tsv restores them as " and '). Extended Cyrillic (non-Russian letters) uses simplified phonetic approximations consistent with BGN/PCGN conventions |
| Cyrillic Supplement | U+0500–U+052F | BGN/PCGN conventions (extended) | Follows the same digraph/phonetic pattern as base Cyrillic |
Greek
| Block | Range | Source | Notes |
|---|---|---|---|
| Greek and Coptic | U+0370–U+03FF | BGN/PCGN Greek (1962, amended 1996), modern pronunciation | Confirmed by θ→Th, φ→F, ψ→Ps, η→I (itacist/modern). Deviation: χ→Ch (BGN/PCGN uses Kh; Ch matches ISO 843). Coptic range (U+03E2–U+03EF) follows Coptic scholarly convention |
| Greek Extended | U+1F00–U+1FFF | NFKD decomposition to base Greek + default Greek mappings | Polytonic characters decompose then follow the base Greek table |
Arabic
| Block | Range | Source | Notes |
|---|---|---|---|
| Arabic | U+0600–U+06FF | BGN/PCGN Arabic (1956) | Confirmed by ث→th, خ→kh, ذ→dh, ش→sh, غ→gh. Emphatic consonants (ص,ض,ط,ظ) lose underdot diacritics (expected for ASCII output). Definitively not Buckwalter (which uses single ASCII characters: x, v, $, etc.) |
| Arabic Presentation Forms-A | U+FB50–U+FDFF | Derived from base Arabic | Presentation forms map to the same values as their base characters |
| Arabic Presentation Forms-B | U+FE70–U+FEFF | Derived from base Arabic | Same as Presentation Forms-A |
South Asian (Indic)
All Indic scripts follow the UNGEGN/Hunterian romanization pattern with ASCII
simplification (no underdots or macrons). The diagnostic is the use of "cha"/"chha"
for palatal stops (Hunterian) rather than "ca"/"cha" (IAST).
| Block | Range | Source | Notes |
|---|---|---|---|
| Devanagari | U+0900–U+097F | UNGEGN/Hunterian | Confirmed: ka, kha, ga, gha, cha, chha. Retroflex/dental merge (both → ta/tha/da/dha/na). Both श and ष → sha |
| Bengali | U+0980–U+09FF | UNGEGN/Hunterian | Mirrors Devanagari pattern. Same aspiration markers |
| Gurmukhi | U+0A00–U+0A7F | UNGEGN/Hunterian | Same pattern as Devanagari |
| Gujarati | U+0A80–U+0AFF | UNGEGN/Hunterian | Same pattern as Devanagari |
| Oriya | U+0B00–U+0B7F | UNGEGN/Hunterian | Same pattern as Devanagari |
| Tamil | U+0B80–U+0BFF | UNGEGN Tamil | Fewer consonants (no aspirated series). ழ→zha is diagnostic of UNGEGN Tamil |
| Telugu | U+0C00–U+0C7F | UNGEGN/Hunterian | Same Indic pattern |
| Kannada | U+0C80–U+0CFF | UNGEGN/Hunterian | Same Indic pattern |
| Malayalam | U+0D00–U+0D7F | UNGEGN/Hunterian | Same Indic pattern |
| Sinhala | U+0D80–U+0DFF | UNGEGN/Indic pattern | Standard Indic framework extended with Sinhala-specific prenasalized stops (nnga, nndda, mba) and unique vowels (ae, aae) |
Southeast Asian
| Block | Range | Source | Notes |
|---|---|---|---|
| Thai | U+0E00–U+0E7F | RTGS (Royal Thai General System) | Exact match on all consonants and vowels tested. Aspiration distinction (k/kh, t/th, p/ph) matches RTGS precisely |
| Lao | U+0E80–U+0EFF | BGN/PCGN Lao (1966) | Confirmed by digraph pattern (kh, ch, th, ph, ng). Vowels ASCII-simplified (ue instead of diacritics) |
| Khmer | U+1780–U+17FF | UNGEGN Khmer (simplified) | Two-series consonants collapse to same romanization (expected for ASCII). Vowels heavily simplified. KHR for Riel currency symbol |
| Myanmar | U+1000–U+109F | MLC (Myanmar Language Commission) | Confirmed by hsa at U+1006 (diagnostic). Follows Indic aspiration pattern. Medial consonants: y, r, w, h |
Tibetan
| Block | Range | Source | Notes |
|---|---|---|---|
| Tibetan | U+0F00–U+0FFF | Indic-phonetic romanization (NOT Wylie) | U+0F45 ཅ→cha definitively rules out Wylie (which uses ca). Also chha for U+0F46. Follows UNGEGN/Hunterian-style aspiration markers applied to Tibetan consonants. Likely THL Simplified Phonetic or similar. Note: docs/user-guide/language-support.md incorrectly claims "Wylie-based" |
Caucasian
| Block | Range | Source | Notes |
|---|---|---|---|
| Georgian | U+10A0–U+10FF | BGN/PCGN Georgian (2009) | Confirmed by base consonant choices (gh, zh, kh, dz). Deviation: ejective apostrophes stripped. t'/k'/p'/ts'/ch' all lose the apostrophe, causing ejective/non-ejective pairs to merge. Expected for ASCII |
| Armenian | U+0530–U+058F | BGN/PCGN Armenian (1981) | Confirmed by digraphs (Zh, Kh, Gh, Sh, Ch, Ts) and "yev" for ew ligature (U+0587). Deviation: aspirate apostrophes stripped. Ch'/Ts'/P'/K' lose apostrophes |
Semitic
| Block | Range | Source | Notes |
|---|---|---|---|
| Hebrew | U+0590–U+05FF | BGN/PCGN Hebrew (1962/2018) | Confirmed by: ב→v (spirant default), ש→sh, צ→ts, ק→q. Deviation: ח(het)→ch instead of BGN/PCGN kh. The "ch" reflects Ashkenazi/popular convention |
| Syriac | U+0700–U+074F | Phonetic approximation | Follows Arabic-like conventions adapted for Syriac |
| Thaana | U+0780–U+07BF | Phonetic approximation | Maldivian Thaana mapped to phonetic ASCII equivalents |
African
| Block | Range | Source | Notes |
|---|---|---|---|
| Ethiopic | U+1200–U+137F | BGN/PCGN Amharic (1967) | Confirmed by syllabic vowel order (e, u, i, a, e, ∅, o, wa) and bare-consonant 6th order. Digraphs: sh, ch match BGN/PCGN |
Historic and Specialized
| Block | Range | Source | Notes |
|---|---|---|---|
| Ogham | U+1680–U+169F | Standard scholarly values | Matches Book of Ballymote / modern Celtic studies consensus. Beith-Luis-Nion order |
| Runic | U+16A0–U+16FF | Phonetic values per scholarly consensus | Mixed Elder/Younger Futhark and Anglo-Saxon values. No single published standard; uses commonly accepted sound values per Unicode character names |
| Cherokee | U+13A0–U+13FF | Syllabary phonetic values | Each syllable mapped to its phonetic romanization |
| Canadian Aboriginal Syllabics | U+1400–U+167F | Phonetic decomposition | No single published standard. Each syllabic mapped to its consonant+vowel phonetic value, reflecting the inherent structure of the unified syllabary |
CJK and East Asian
| Block | Range | Source | Notes |
|---|---|---|---|
| CJK Compatibility Ideographs | U+F900–U+FAFF | Unicode Unihan kMandarin | Same source as hanzi_pinyin.tsv. Toneless pinyin |
| Hangul Jamo | U+1100–U+11FF | Revised Romanization of Korean (RR, 2000) | Jamo components; full syllable romanization is algorithmic in hangul.rs |
| Hiragana | U+3040–U+309F | Modified Hepburn | Standard Hepburn romanization for Japanese kana |
| Katakana | U+30A0–U+30FF | Modified Hepburn | Same as Hiragana |
| Halfwidth and Fullwidth Forms | U+FF00–U+FFEF | NFKD to base character | Fullwidth Latin letters decompose to ASCII; halfwidth katakana follows Hepburn |
| Kangxi Radicals | U+2F00–U+2FDF | Unicode Unihan kMandarin | Mapped via radical-to-ideograph correspondence |
| Enclosed Alphanumerics | U+2460–U+24FF | Numeric/letter extraction | ① → 1, Ⓐ → A, etc. |
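
The "numeric/letter extraction" for Enclosed Alphanumerics follows directly from code point offsets. The sketch below derives the value arithmetically for three sub-ranges only; translit stores these mappings in the TSV table, so the function `enclosed_to_ascii` is a hypothetical illustration of the pattern rather than the project's code.

```rust
/// Derive the ASCII value of a few Enclosed Alphanumerics sub-ranges from the
/// code point offset. Only three sub-ranges are handled in this sketch; all
/// other characters return `None`.
fn enclosed_to_ascii(ch: char) -> Option<String> {
    let cp = ch as u32;
    match cp {
        // U+2460 (circled 1) through U+2473 (circled 20)
        0x2460..=0x2473 => Some((cp - 0x2460 + 1).to_string()),
        // U+24B6 (circled A) through U+24CF (circled Z)
        0x24B6..=0x24CF => char::from_u32(cp - 0x24B6 + 'A' as u32).map(String::from),
        // U+24D0 (circled a) through U+24E9 (circled z)
        0x24D0..=0x24E9 => char::from_u32(cp - 0x24D0 + 'a' as u32).map(String::from),
        _ => None,
    }
}
```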
Symbols and Punctuation
| Block | Range | Source | Notes |
|---|---|---|---|
| General Punctuation | U+2000–U+206F | Functional ASCII equivalents | —→-, …→..., etc. |
| Currency Symbols | U+20A0–U+20CF | ISO 4217 codes or conventional abbreviations | ₤→GBP, ₹→Rs, ₩→KRW, etc. |
| Number Forms | U+2150–U+218F | Numeric expansion | ⅓→1/3, Ⅳ→IV, etc. |
| Superscripts and Subscripts | U+2070–U+209F | Base digit/letter | ² → 2, ₂ → 2, etc. |
| Letterlike Symbols | U+2100–U+214F | Expansion or abbreviation | ℃→C, №→No, etc. |
Language Override Tables (translit_lang_*.tsv)
These tables override specific characters from the default table when a lang
parameter is provided.
| File | Standard | Has header comment? |
|---|---|---|
| translit_lang_am.tsv | BGN/PCGN Amharic overrides | Yes |
| translit_lang_bg.tsv | BGN/PCGN Bulgarian | No |
| translit_lang_ca.tsv | Catalan convention (punt volat removal) | No |
| translit_lang_de.tsv | German convention (ä→ae, ö→oe, ü→ue, ß→ss) | No |
| translit_lang_el.tsv | BGN/PCGN Greek overrides | No |
| translit_lang_es.tsv | Spanish convention (¡→!, ¿→?) | No |
| translit_lang_et.tsv | Estonian convention (ä→ae, ö→oe, ü→ue, š→sh, ž→zh) | No |
| translit_lang_fa.tsv | BGN/PCGN Persian (1958) | Yes |
| translit_lang_fr.tsv | French convention (Œ→OE, œ→oe) | No |
| translit_lang_is.tsv | Icelandic convention (Æ→Ae, ð→d, þ→th) | No |
| translit_lang_it.tsv | Italian convention | No |
| translit_lang_ja.tsv | Modified Hepburn overrides | No |
| translit_lang_ja_kunrei.tsv | Kunrei-shiki romanization | Yes |
| translit_lang_nl.tsv | Dutch convention (IJ digraph) | No |
| translit_lang_no.tsv | Norwegian convention (Å→Aa, Ø→Oe, Æ→Ae) | No |
| translit_lang_pt.tsv | Portuguese convention | No |
| translit_lang_ru.tsv | BGN/PCGN Russian overrides (Ё→Yo, Й→Y, Ъ→", Ь→') | No |
| translit_lang_sr.tsv | BGN/PCGN Serbian overrides | No |
| translit_lang_sv.tsv | Swedish convention (Ä→Ae, Ö→Oe, Å→Aa) | No |
| translit_lang_tr.tsv | Turkish convention (İ→I, ı→i) | No |
| translit_lang_uk.tsv | Ukrainian national romanization (2010) | No |
| translit_lang_vi.tsv | Vietnamese NFKD + convention | No |
| translit_iso9.tsv | ISO 9:1995 (scholarly Cyrillic) | No |
| translit_gost7034.tsv | GOST R 7.0.34-2014 (simplified Russian) | Yes |
Alternate Cyrillic Tables
| File | Standard |
|---|---|
| translit_iso9.tsv | ISO 9:1995, the international standard for Cyrillic-to-Latin transliteration. Preserves diacritics (not ASCII-only). One-to-one reversible mapping |
| translit_gost7034.tsv | GOST R 7.0.34-2014, the Russian national standard for simplified transliteration. ASCII-compatible |
SMP Table (translit_default_smp.tsv)
This table is already annotated with block-level comments in the file itself. It covers:
- Gothic (U+10330–U+1034A) — Wulfila's alphabet, one-to-one Latin correspondence
- Old Persian Cuneiform (U+103A0–U+103D5) — Syllabic values
- Linear B Syllabary (U+10000–U+1005D) — Conventional syllabic values
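
Taken together, the default table, the SMP table, and the algorithmic scripts described in the next section partition the code space by range. The sketch below shows one way such a dispatch could look. The enum `Source`, the function `source_for`, and the assumption that ASCII passes through unchanged are illustrative guesses, not a description of translit's internals.

```rust
/// Which data source would handle a given character (hypothetical names).
#[derive(Debug, PartialEq)]
enum Source {
    Ascii,           // assumed to pass through unchanged
    HangulAlgorithm, // computed in hangul.rs
    HanziPinyin,     // Unihan kMandarin data via hanzi_pinyin.rs
    DefaultTable,    // translit_default.tsv (rest of the BMP)
    SmpTable,        // translit_default_smp.tsv
}

/// Pick a source by code point range. The narrower algorithmic ranges are
/// listed before the general BMP arm so that they take precedence.
fn source_for(ch: char) -> Source {
    match ch as u32 {
        0x0000..=0x007F => Source::Ascii,
        0xAC00..=0xD7A3 => Source::HangulAlgorithm,
        0x4E00..=0x9FFF => Source::HanziPinyin,
        0x0080..=0xFFFF => Source::DefaultTable,
        _ => Source::SmpTable,
    }
}
```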
Algorithmic Transliteration (not in TSV)
These are computed at runtime, not stored in the default table:
| Script | Source file | Standard |
|---|---|---|
| Hangul syllables (U+AC00–U+D7A3) | hangul.rs | Revised Romanization of Korean (RR, 2000), the official South Korean standard. Algorithmic jamo decomposition: 19 initials × 21 vowels × 28 finals = 11,172 syllables |
| CJK Unified Ideographs (U+4E00–U+9FFF) | hanzi_pinyin.rs | Unicode Unihan kMandarin field (toneless pinyin). 20,924 characters |
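
The jamo decomposition behind the 19 × 21 × 28 figure is plain integer arithmetic on the code point. The following is a minimal per-syllable sketch, not the contents of hangul.rs. The function name and table layout are assumptions, the final-consonant table uses simplified RR syllable-final values, and the sound changes that RR applies across syllable boundaries are ignored.

```rust
// RR values for the 19 initial consonants, in Unicode order.
const INITIALS: [&str; 19] = [
    "g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s", "ss", "", "j", "jj", "ch", "k", "t",
    "p", "h",
];
// RR values for the 21 vowels, in Unicode order.
const VOWELS: [&str; 21] = [
    "a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae", "oe", "yo", "u", "wo",
    "we", "wi", "yu", "eu", "ui", "i",
];
// Simplified syllable-final values for the 28 finals (index 0 = no final).
const FINALS: [&str; 28] = [
    "", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l", "l", "p", "l", "m", "p",
    "p", "t", "t", "ng", "t", "t", "k", "t", "p", "t",
];

/// Romanize one precomposed Hangul syllable in isolation.
/// Returns `None` for characters outside U+AC00..=U+D7A3.
fn romanize_syllable(ch: char) -> Option<String> {
    let cp = ch as u32;
    if !(0xAC00..=0xD7A3).contains(&cp) {
        return None;
    }
    let index = (cp - 0xAC00) as usize;
    let initial = index / (21 * 28); // 0..19
    let vowel = (index % (21 * 28)) / 28; // 0..21
    let fin = index % 28; // 0..28
    Some(format!("{}{}{}", INITIALS[initial], VOWELS[vowel], FINALS[fin]))
}
```

For example, 한 (U+D55C) has index 10588, which splits into initial 18 (h), vowel 0 (a), and final 4 (n), giving `han`.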
Known Documentation Errors
- Tibetan is described as "Wylie-based" in docs/user-guide/language-support.md:100.
  The actual mapping uses cha for U+0F45 ཅ, which rules out Wylie (Wylie uses ca).
  The table follows an Indic-phonetic romanization with Hunterian-style aspiration
  markers. The user guide should be corrected.
Design Principles
The audit reveals a consistent set of design decisions across all blocks:
- BGN/PCGN is the primary standard family for non-Latin scripts (Cyrillic,
  Greek, Arabic, Armenian, Georgian, Hebrew, Lao, Ethiopic). This is the system
  used by the US Board on Geographic Names and the UK Permanent Committee on
  Geographical Names.
- UNGEGN/Hunterian for South Asian scripts. BGN/PCGN defers to UNGEGN for
  Indic romanization, and UNGEGN's system is based on the Hunterian scheme.
- National/official standards where they exist: RTGS for Thai, RR for Korean,
  MLC for Myanmar.
- Unicode data for CJK: Unihan kMandarin for Chinese, algorithmic RR for
  Korean, Hepburn for Japanese kana.
- ASCII simplification is applied uniformly: diacritics are dropped, underdots
  removed, apostrophe modifiers stripped. This is documented as an explicit design
  constraint, not an oversight.
- NFKD decomposition for Latin extensions: where Unicode provides a
  decomposition to ASCII-range characters, it is used. Where NFKD fails (IPA,
  stroked letters, ligatures), phonetic approximation fills the gap.
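
The ASCII-simplification principle can be illustrated as a filter over an already-romanized string. The sketch below assumes the input is in decomposed form, so diacritics appear as separate combining marks in U+0300–U+036F. The function name `ascii_simplify` and the exact set of apostrophe characters are assumptions for illustration; translit applies the simplification in its tables, not necessarily through a function like this.

```rust
/// Drop combining diacritics (U+0300..=U+036F) and apostrophe-like modifiers
/// from a romanized string. Precomposed letters are not handled here; the
/// input is assumed to be decomposed already.
fn ascii_simplify(romanized: &str) -> String {
    romanized
        .chars()
        .filter(|c| {
            let cp = *c as u32;
            let combining = (0x0300..=0x036F).contains(&cp);
            // ASCII apostrophe, right single quotation mark, modifier letter apostrophe
            let apostrophe = matches!(*c, '\'' | '\u{2019}' | '\u{02BC}');
            !combining && !apostrophe
        })
        .collect()
}
```

This makes the merger noted for Georgian and Armenian visible: an ejective `k’` and a plain `k` both come out as `k`, and an underdotted `s` comes out as `s`.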