Transliteration Provenance

This document records the formal standard or source behind every Unicode block in translit's transliteration tables. Its purpose is traceability: for any character→ASCII mapping, a reader should be able to identify which published romanization system it follows and where to verify it.

Methodology

Provenance was determined by comparing translit's actual per-character mappings against published romanization tables. Diagnostic characters — those where competing standards diverge — were used to identify the source unambiguously.
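
The diagnostic-character comparison can be sketched as a small scoring script. This is an illustration of the idea, not the tooling used for the audit: `CANDIDATES`, `score_candidates`, and the sample `observed` mappings are hypothetical names, and the two candidate tables are short lowercase excerpts of the published systems.

```python
# Hypothetical excerpts of two published Cyrillic romanizations.
# Only characters where the two systems diverge are listed.
CANDIDATES = {
    "BGN/PCGN Russian": {
        "ж": "zh", "х": "kh", "щ": "shch", "ц": "ts", "ю": "yu", "я": "ya",
    },
    "ISO 9:1995": {
        "ж": "ž", "х": "h", "щ": "ŝ", "ц": "c", "ю": "û", "я": "â",
    },
}


def score_candidates(observed, candidates):
    """Return, per candidate, the share of diagnostic characters that match."""
    scores = {}
    for name, table in candidates.items():
        shared = [ch for ch in observed if ch in table]
        hits = sum(1 for ch in shared if observed[ch] == table[ch])
        scores[name] = hits / len(shared) if shared else 0.0
    return scores


# Assumed sample of mappings read from a transliteration table.
observed = {"ж": "zh", "х": "kh", "щ": "shch", "ц": "ts"}
scores = score_candidates(observed, CANDIDATES)
best = max(scores, key=scores.get)
print(best, scores)
```

Because only divergent characters are scored, a single clear winner is strong evidence for one source, while a tie would mean the chosen characters are not diagnostic enough.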

Default Table (`translit_default.tsv`)

The default table covers the BMP (U+0080–U+FFFF). Every mapping applies unless overridden by a language-specific table or the ISO 9 / GOST table.

Latin Blocks

Block	Range	Source	Notes
Latin-1 Supplement	U+0080–U+00FF	NFKD decomposition + convention	~69% match Unicode NFKD; remainder uses conventional ASCII (AE, Th, ss, GBP, JPY)
Latin Extended-A	U+0100–U+017F	NFKD decomposition + convention	~62% NFKD; remainder follows Unidecode-like conventions for stroked/hooked letters
Latin Extended-B	U+0180–U+024F	NFKD + Unidecode-like fallback	Letters without NFKD decomposition use phonetic approximation (Ŋ→N, Ə→A, Ʃ→Sh)
IPA Extensions	U+0250–U+02AF	Phonetic approximation	0% NFKD match; maps each IPA symbol to its nearest readable ASCII. Digraphs preferred over Unidecode's uppercase convention (ʃ→sh not S, ʒ→zh not Z)
Latin Extended Additional	U+1E00–U+1EFF	NFKD decomposition	99.6% NFKD match. Single exception: U+1E9E LATIN CAPITAL LETTER SHARP S → SS (no NFKD decomposition exists)
Spacing Modifier Letters	U+02B0–U+02FF	Phonetic approximation	Modifier letters mapped to their base letter equivalents
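
The NFKD match percentages in this table can be reproduced with a check along the following lines. It is a minimal sketch using Python's standard `unicodedata` module; `nfkd_ascii`, `match_rate`, and the four-entry `SAMPLE` table are hypothetical and stand in for the real table contents.

```python
import unicodedata


def nfkd_ascii(ch):
    """Return the ASCII string NFKD yields for ch, or None if there is none."""
    decomposed = unicodedata.normalize("NFKD", ch)
    # Drop combining marks (accents) left behind by the decomposition.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    if stripped and stripped != ch and stripped.isascii():
        return stripped
    return None


def match_rate(table):
    """Share of table entries whose mapping equals the NFKD-derived ASCII."""
    if not table:
        return 0.0
    hits = sum(1 for ch, mapped in table.items() if nfkd_ascii(ch) == mapped)
    return hits / len(table)


# Hypothetical excerpt: é and ñ decompose; Æ and ß have no NFKD decomposition.
SAMPLE = {"\u00e9": "e", "\u00f1": "n", "\u00c6": "AE", "\u00df": "ss"}
print(match_rate(SAMPLE))
```

Entries that fail the check are the ones that need a documented convention, which is why the table lists them separately as the "remainder".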

Cyrillic

Block	Range	Source	Notes
Cyrillic	U+0400–U+04FF	BGN/PCGN Russian (1947, revised 1994)	Confirmed by Ж→Zh, Х→Kh, Щ→Shch, Ц→Ts, Ю→Yu, Я→Ya. Hard/soft signs map to the empty string, which is a simplification: BGN/PCGN itself romanizes Ъ as ” and Ь as ’ (compare the Ъ→", Ь→' overrides in `translit_lang_ru.tsv`). Extended Cyrillic (non-Russian letters) uses simplified phonetic approximations consistent with BGN/PCGN conventions
Cyrillic Supplement	U+0500–U+052F	BGN/PCGN conventions (extended)	Follows the same digraph/phonetic pattern as base Cyrillic

Greek

Block	Range	Source	Notes
Greek and Coptic	U+0370–U+03FF	BGN/PCGN Greek (1962, amended 1996), modern pronunciation	Confirmed by θ→Th, φ→F, ψ→Ps, η→I (itacist/modern). Deviation: χ→Ch (BGN/PCGN uses Kh; Ch matches ISO 843). Coptic range (U+03E2–U+03EF) follows Coptic scholarly convention
Greek Extended	U+1F00–U+1FFF	NFKD decomposition to base Greek + default Greek mappings	Polytonic characters decompose then follow the base Greek table
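
The decompose-then-look-up behaviour described for Greek Extended can be sketched as follows. This is an assumption-laden illustration: `BASE_GREEK` is a hypothetical lowercase excerpt of the base Greek mappings, and `romanize_polytonic` is not a function from translit.

```python
import unicodedata

# Hypothetical excerpt of the base Greek table (lowercase only).
BASE_GREEK = {
    "α": "a", "ν": "n", "θ": "th", "ρ": "r", "ω": "o",
    "π": "p", "ο": "o", "σ": "s", "ς": "s",
}


def romanize_polytonic(text):
    """Decompose with NFKD, drop combining marks, then map base letters."""
    out = []
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            continue  # breathing marks, accents, iota subscript
        out.append(BASE_GREEK.get(ch, ch))  # unmapped characters pass through
    return "".join(out)


# U+1F04 is GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA.
print(romanize_polytonic("\u1f04νθρωπος"))
```

The design consequence is that the polytonic block needs no standard of its own: its provenance is inherited from whatever the base Greek table follows.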

Arabic

Block	Range	Source	Notes
Arabic	U+0600–U+06FF	BGN/PCGN Arabic (1956)	Confirmed by ث→th, خ→kh, ذ→dh, ش→sh, غ→gh. Emphatic consonants (ص,ض,ط,ظ) lose underdot diacritics (expected for ASCII output). Definitively not Buckwalter (which uses single ASCII characters: x, v, $, etc.)
Arabic Presentation Forms-A	U+FB50–U+FDFF	Derived from base Arabic	Presentation forms map to the same values as their base characters
Arabic Presentation Forms-B	U+FE70–U+FEFF	Derived from base Arabic	Same as above

South Asian (Indic)

All Indic scripts follow the UNGEGN/Hunterian romanization pattern with ASCII simplification (no underdots or macrons). The diagnostic is the use of "cha"/"chha" for palatal stops (Hunterian) rather than "ca"/"cha" (IAST).

Block	Range	Source	Notes
Devanagari	U+0900–U+097F	UNGEGN/Hunterian	Confirmed: ka, kha, ga, gha, cha, chha. Retroflex/dental merge (both → ta/tha/da/dha/na). Both श and ष → sha
Bengali	U+0980–U+09FF	UNGEGN/Hunterian	Mirrors Devanagari pattern. Same aspiration markers
Gurmukhi	U+0A00–U+0A7F	UNGEGN/Hunterian	Same pattern as Devanagari
Gujarati	U+0A80–U+0AFF	UNGEGN/Hunterian	Same pattern as Devanagari
Oriya	U+0B00–U+0B7F	UNGEGN/Hunterian	Same pattern as Devanagari
Tamil	U+0B80–U+0BFF	UNGEGN Tamil	Fewer consonants (no aspirated series). ழ→zha is diagnostic of UNGEGN Tamil
Telugu	U+0C00–U+0C7F	UNGEGN/Hunterian	Same Indic pattern
Kannada	U+0C80–U+0CFF	UNGEGN/Hunterian	Same Indic pattern
Malayalam	U+0D00–U+0D7F	UNGEGN/Hunterian	Same Indic pattern
Sinhala	U+0D80–U+0DFF	UNGEGN/Indic pattern	Standard Indic framework extended with Sinhala-specific prenasalized stops (nnga, nndda, mba) and unique vowels (ae, aae)

Southeast Asian

Block	Range	Source	Notes
Thai	U+0E00–U+0E7F	RTGS (Royal Thai General System)	Exact match on all consonants and vowels tested. Aspiration distinction (k/kh, t/th, p/ph) matches RTGS precisely
Lao	U+0E80–U+0EFF	BGN/PCGN Lao (1966)	Confirmed by digraph pattern (kh, ch, th, ph, ng). Vowels ASCII-simplified (ue instead of diacritics)
Khmer	U+1780–U+17FF	UNGEGN Khmer (simplified)	Two-series consonants collapse to same romanization (expected for ASCII). Vowels heavily simplified. KHR for Riel currency symbol
Myanmar	U+1000–U+109F	MLC (Myanmar Language Commission)	Confirmed by hsa at U+1006 (diagnostic). Follows Indic aspiration pattern. Medial consonants: y, r, w, h

Tibetan

Block	Range	Source	Notes
Tibetan	U+0F00–U+0FFF	Indic-phonetic romanization (NOT Wylie)	U+0F45 ཅ→cha definitively rules out Wylie (which uses ca). Also chha for U+0F46. Follows UNGEGN/Hunterian-style aspiration markers applied to Tibetan consonants. Likely THL Simplified Phonetic or similar. Note: docs/user-guide/language-support.md incorrectly claims "Wylie-based"

Caucasian

Block	Range	Source	Notes
Georgian	U+10A0–U+10FF	BGN/PCGN Georgian (2009)	Confirmed by base consonant choices (gh, zh, kh, dz). Deviation: Ejective apostrophes stripped — t'/k'/p'/ts'/ch' all lose the apostrophe, causing ejective/non-ejective pairs to merge. Expected for ASCII
Armenian	U+0530–U+058F	BGN/PCGN Armenian (1981)	Confirmed by digraphs (Zh, Kh, Gh, Sh, Ch, Ts) and "yev" for ew ligature (U+0587). Deviation: Aspirate apostrophes stripped — Ch'/Ts'/P'/K' lose apostrophes
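
The merging effect of stripping apostrophes can be checked mechanically. The sketch below is illustrative only: `GEORGIAN_SAMPLE` is a hypothetical five-letter excerpt of the 2009 system in which an ASCII apostrophe stands in for the ejective mark, and `collisions_after_stripping` is not part of translit.

```python
from collections import defaultdict

# Hypothetical excerpt: ejectives carry an apostrophe, aspirates do not.
GEORGIAN_SAMPLE = {
    "ტ": "t'", "თ": "t",
    "კ": "k'", "ქ": "k",
    "გ": "g",
}


def collisions_after_stripping(table):
    """Group source letters by their apostrophe-free romanization."""
    groups = defaultdict(list)
    for ch, roman in table.items():
        groups[roman.replace("'", "")].append(ch)
    # Keep only outputs that are now shared by more than one source letter.
    return {roman: chars for roman, chars in groups.items() if len(chars) > 1}


print(collisions_after_stripping(GEORGIAN_SAMPLE))
```

Each reported group is a pair of letters that can no longer be told apart in the ASCII output, which is the loss of information the Notes column flags as a deviation.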

Semitic

Block	Range	Source	Notes
Hebrew	U+0590–U+05FF	BGN/PCGN Hebrew (1962/2018)	Confirmed by: ב→v (spirant default), ש→sh, צ→ts, ק→q. Deviation: ח(het)→ch instead of BGN/PCGN kh. The "ch" reflects Ashkenazi/popular convention
Syriac	U+0700–U+074F	Phonetic approximation	Follows Arabic-like conventions adapted for Syriac
Thaana	U+0780–U+07BF	Phonetic approximation	Maldivian Thaana mapped to phonetic ASCII equivalents

African

Block	Range	Source	Notes
Ethiopic	U+1200–U+137F	BGN/PCGN Amharic (1967)	Confirmed by syllabic vowel order (e, u, i, a, e, ∅, o, wa) and bare-consonant 6th order. Digraphs: sh, ch match BGN/PCGN
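
The syllabic vowel order used as the diagnostic can be read off the code point itself, because the basic Ethiopic consonant rows are laid out in groups of eight. The following sketch assumes that layout and takes the vowel list from the Notes column above; `VOWEL_BY_ORDER`, `ethiopic_order`, and `vowel_for` are hypothetical names, and irregular rows (for example labialized series) are not handled.

```python
# Vowel per column, as listed in the table above; "" is the bare consonant.
VOWEL_BY_ORDER = ["e", "u", "i", "a", "e", "", "o", "wa"]


def ethiopic_order(ch):
    """Return the 0-based column of a syllable within its row of eight."""
    cp = ord(ch)
    # Assumption for this sketch: only the rows from U+1200 to U+1357.
    if not 0x1200 <= cp <= 0x1357:
        raise ValueError("outside the basic Ethiopic syllable rows")
    return (cp - 0x1200) % 8


def vowel_for(ch):
    """Look up the romanized vowel implied by the syllable's column."""
    return VOWEL_BY_ORDER[ethiopic_order(ch)]


# U+1208..U+120F is the L row: LA, LU, LI, LAA, LEE, LE, LO, LWA.
for cp in range(0x1208, 0x1210):
    print(hex(cp), repr(vowel_for(chr(cp))))
```

A table that follows a different vowel order would disagree with this lookup in the first, fifth, or sixth column, which is where the competing systems differ most visibly.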

Historic and Specialized

Block	Range	Source	Notes
Ogham	U+1680–U+169F	Standard scholarly values	Matches Book of Ballymote / modern Celtic studies consensus. Beith-Luis-Nion order
Runic	U+16A0–U+16FF	Phonetic values per scholarly consensus	Mixed Elder/Younger Futhark and Anglo-Saxon values. No single published standard; uses commonly accepted sound values per Unicode character names
Cherokee	U+13A0–U+13FF	Syllabary phonetic values	Each syllable mapped to its phonetic romanization
Canadian Aboriginal Syllabics	U+1400–U+167F	Phonetic decomposition	No single published standard. Each syllabic mapped to its consonant+vowel phonetic value, reflecting the inherent structure of the unified syllabary

CJK and East Asian

Block	Range	Source	Notes
CJK Compatibility Ideographs	U+F900–U+FAFF	Unicode Unihan kMandarin	Same source as hanzi_pinyin.tsv. Toneless pinyin
Hangul Jamo	U+1100–U+11FF	Revised Romanization of Korean (RR, 2000)	Jamo components; full syllable romanization is algorithmic in hangul.rs
Hiragana	U+3040–U+309F	Modified Hepburn	Standard Hepburn romanization for Japanese kana
Katakana	U+30A0–U+30FF	Modified Hepburn	Same as Hiragana
Halfwidth and Fullwidth Forms	U+FF00–U+FFEF	NFKD to base character	Fullwidth Latin letters decompose to ASCII; halfwidth katakana follows Hepburn
Kangxi Radicals	U+2F00–U+2FDF	Unicode Unihan kMandarin	Mapped via radical-to-ideograph correspondence
Enclosed Alphanumerics	U+2460–U+24FF	Numeric/letter extraction	① → 1, Ⓐ → A, etc.

Symbols and Punctuation

Block	Range	Source	Notes
General Punctuation	U+2000–U+206F	Functional ASCII equivalents	—→-, …→..., etc.
Currency Symbols	U+20A0–U+20CF	ISO 4217 codes or conventional abbreviations	₤→GBP, ₹→Rs, ₩→KRW, etc.
Number Forms	U+2150–U+218F	Numeric expansion	⅓→1/3, Ⅳ→IV, etc.
Superscripts and Subscripts	U+2070–U+209F	Base digit/letter	² → 2, ₂ → 2, etc.
Letterlike Symbols	U+2100–U+214F	Expansion or abbreviation	℃→C, №→No, etc.

Language Override Tables (`translit_lang_*.tsv`)

These tables override specific characters from the default table when a lang parameter is provided.

File	Standard	Has header comment?
`translit_lang_am.tsv`	BGN/PCGN Amharic overrides	Yes
`translit_lang_bg.tsv`	BGN/PCGN Bulgarian	No
`translit_lang_ca.tsv`	Catalan convention (punt volat removal)	No
`translit_lang_de.tsv`	German convention (ä→ae, ö→oe, ü→ue, ß→ss)	No
`translit_lang_el.tsv`	BGN/PCGN Greek overrides	No
`translit_lang_es.tsv`	Spanish convention (¡→!, ¿→?)	No
`translit_lang_et.tsv`	Estonian convention (ä→ae, ö→oe, ü→ue, š→sh, ž→zh)	No
`translit_lang_fa.tsv`	BGN/PCGN Persian (1958)	Yes
`translit_lang_fr.tsv`	French convention (Œ→OE, œ→oe)	No
`translit_lang_is.tsv`	Icelandic convention (Æ→Ae, ð→d, þ→th)	No
`translit_lang_it.tsv`	Italian convention	No
`translit_lang_ja.tsv`	Modified Hepburn overrides	No
`translit_lang_ja_kunrei.tsv`	Kunrei-shiki romanization	Yes
`translit_lang_nl.tsv`	Dutch convention (IJ digraph)	No
`translit_lang_no.tsv`	Norwegian convention (Å→Aa, Ø→Oe, Æ→Ae)	No
`translit_lang_pt.tsv`	Portuguese convention	No
`translit_lang_ru.tsv`	BGN/PCGN Russian overrides (Ё→Yo, Й→Y, Ъ→", Ь→')	No
`translit_lang_sr.tsv`	BGN/PCGN Serbian overrides	No
`translit_lang_sv.tsv`	Swedish convention (Ä→Ae, Ö→Oe, Å→Aa)	No
`translit_lang_tr.tsv`	Turkish convention (İ→I, ı→i)	No
`translit_lang_uk.tsv`	Ukrainian national romanization (2010)	No
`translit_lang_vi.tsv`	Vietnamese NFKD + convention	No
`translit_iso9.tsv`	ISO 9:1995 (scholarly Cyrillic)	No
`translit_gost7034.tsv`	GOST R 7.0.34-2014 (simplified Russian)	Yes
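
The override behaviour can be pictured as a two-level lookup: the language table is consulted first, then the default table. The sketch below is a simplified model, not translit's implementation; `DEFAULT`, `LANG_TABLES`, and `transliterate` are hypothetical, the default values shown are assumed, and unmapped characters are simply passed through.

```python
# Hypothetical excerpts; the real tables live in the TSV files listed above.
DEFAULT = {"\u00e4": "a", "\u00f6": "o", "\u00fc": "u", "\u00df": "ss"}
LANG_TABLES = {
    "de": {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"},
}


def transliterate(text, lang=None):
    """Map each character via the language table, then the default table."""
    overrides = LANG_TABLES.get(lang, {})
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)  # ASCII input is left untouched
        elif ch in overrides:
            out.append(overrides[ch])  # language-specific mapping wins
        else:
            out.append(DEFAULT.get(ch, ch))  # fall back to the default table
    return "".join(out)


print(transliterate("Gr\u00fc\u00dfe"))        # default table only
print(transliterate("Gr\u00fc\u00dfe", "de"))  # German overrides applied
```

Under this model a language table only needs to list the characters whose mapping differs from the default, which matches how small most of the override files are described to be.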

Alternate Cyrillic Tables

File	Standard
`translit_iso9.tsv`	ISO 9:1995 — International standard for Cyrillic-to-Latin transliteration. Preserves diacritics (not ASCII-only). One-to-one reversible mapping
`translit_gost7034.tsv`	GOST R 7.0.34-2014 — Russian national standard for simplified transliteration. ASCII-compatible

SMP Table (`translit_default_smp.tsv`)

Already annotated with block-level comments in the file itself. Covers:

- Gothic (U+10330–U+1034A): Wulfila's alphabet, one-to-one Latin correspondence
- Old Persian Cuneiform (U+103A0–U+103D5): syllabic values
- Linear B Syllabary (U+10000–U+1005D): conventional syllabic values

Algorithmic Transliteration (not in TSV)

These are computed at runtime, not stored in the default table:

Script	Source file	Standard
Hangul syllables (U+AC00–U+D7A3)	`hangul.rs`	Revised Romanization of Korean (RR, 2000) — official South Korean standard. Algorithmic jamo decomposition: 19 initials × 21 vowels × 28 finals = 11,172 syllables
CJK Unified Ideographs (U+4E00–U+9FFF)	`hanzi_pinyin.rs`	Unicode Unihan kMandarin field — toneless pinyin. 20,924 characters
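
The jamo decomposition arithmetic behind the 11,172 figure can be sketched as below. This is written in Python for illustration and is not the code in `hangul.rs`; `decompose` and `romanize_open` are hypothetical names, the RR values for initials and vowels are listed in standard jamo order, and final consonants are deliberately left out of the romanization.

```python
S_BASE = 0xAC00          # first precomposed syllable
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per initial consonant
S_COUNT = 19 * N_COUNT        # 11,172 syllables in total

# Revised Romanization values in jamo order; index 11 is the silent initial.
INITIALS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
            "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]
VOWELS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
          "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]


def decompose(syllable):
    """Return (initial, vowel, final) indices for a precomposed syllable."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    return index // N_COUNT, (index % N_COUNT) // T_COUNT, index % T_COUNT


def romanize_open(syllable):
    """Romanize a syllable that has no final consonant."""
    initial, vowel, final = decompose(syllable)
    if final != 0:
        raise ValueError("final consonants are outside this sketch")
    return INITIALS[initial] + VOWELS[vowel]


print(decompose("한"), romanize_open("서"))
```

Because the indices are pure arithmetic on the code point, no per-syllable table is needed, which is why these mappings do not appear in the TSV files.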

Known Documentation Errors

docs/user-guide/language-support.md:100 describes the Tibetan mapping as "Wylie-based". The actual mapping uses cha for U+0F45 ཅ, which rules out Wylie (Wylie uses ca). The table instead follows an Indic-phonetic romanization with Hunterian-style aspiration markers, so the user guide should be corrected.

Design Principles

The audit reveals a consistent set of design decisions across all blocks:

BGN/PCGN is the primary standard family for non-Latin scripts (Cyrillic, Greek, Arabic, Armenian, Georgian, Hebrew, Lao, Ethiopic). This is the system used by the US Board on Geographic Names and the UK Permanent Committee on Geographical Names.
UNGEGN/Hunterian is used for South Asian scripts. BGN/PCGN defers to UNGEGN for Indic romanization, and UNGEGN's system is based on the Hunterian scheme.
National/official standards where they exist: RTGS for Thai, RR for Korean, MLC for Myanmar.
CJK combines Unicode data with standard romanizations: Unihan kMandarin for Chinese, algorithmic RR for Korean, Hepburn for Japanese kana.
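ASCII simplification is applied uniformly in the default table: diacritics are dropped, underdots removed, apostrophe modifiers stripped. This is documented as an explicit design constraint, not an oversight. The alternate ISO 9:1995 table is the exception, since it preserves diacritics.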
NFKD decomposition for Latin extensions: Where Unicode provides a decomposition to ASCII-range characters, it is used. Where NFKD fails (IPA, stroked letters, ligatures), phonetic approximation fills the gap.