Architecture: Transliteration Engine¶
How translit converts Unicode text to ASCII, character by character.
Design goals¶
The transliteration engine converts arbitrary Unicode strings to ASCII equivalents suitable for URLs, filenames, search indices, and cross-lingual display. The core tradeoff is speed vs. linguistic accuracy: translit chooses O(1) per-character lookup with no sentence context, sacrificing disambiguation of polyphonic characters (see Limitations) for predictable, high-throughput output.
Cow return type¶
transliterate_impl() returns Cow<'a, str>. Pure-ASCII input (the common case in English-dominated workloads) is returned as Borrowed with zero allocation. Non-ASCII input builds an Owned string. The ASCII check uses str::is_ascii(), which compiles to a SIMD-friendly byte scan — sub-nanosecond for short strings.
Capacity pre-sizing¶
Before processing, the engine samples the first non-ASCII codepoint to pick a buffer multiplier:
- CJK ideographs, Hangul, kana (U+3000–U+9FFF, U+AC00–U+D7AF, U+F900–U+FAFF): 4× the input byte length — each character typically expands to a multi-letter pinyin/romaji syllable plus a space.
- Latin, Cyrillic, Arabic, and everything else: 1× — most characters map to a single ASCII character.
This heuristic prevents reallocations for CJK-heavy input without over-allocating for Latin text.
Auto-language resolution¶
When lang="auto" is passed, the engine resolves the language code before entering the main transliteration loop. resolve_auto_lang() scans the input for the first non-Latin, non-Common character, maps its script to a default ISO 639-1 code (e.g. Cyrillic → "ru", Han → "zh", Hiragana/Katakana → "ja"), and substitutes it for the "auto" sentinel. If no non-Latin script is found (Latin-only or empty input), lang becomes None and the default table is used.
This resolution happens once per call, after the ASCII fast path but before the per-character loop. It adds zero overhead when lang is None or an explicit code.
Lookup priority¶
Each non-ASCII character goes through a fixed lookup chain:
- Strict ISO 9 mode (
strict_iso9=True): ISO 9 table → default table. Language overrides are bypassed entirely. - Normal mode: language-specific override (if
langis set, including auto-resolved) → default table.
This is a flat two-level dispatch, not a fallback chain. ISO 9 and language modes are mutually exclusive.
Script transition spacing¶
Raw character-by-character concatenation produces unreadable output for CJK text: 北京市 → beijingshi. The engine tracks a prev_class byte to detect script transitions and insert spaces:
| Transition | Example | Spacing |
|---|---|---|
| Ideograph → ideograph | 北京 | space (each character is a "word") |
| Hangul → Hangul | 서울 | space (each syllable is distinct) |
| Kana → kana | ひらがな | no space (kana concatenate into words) |
| Ideograph ↔ kana | 東京タワー | space at boundary |
| CJK ↔ Latin | café北京 | space at boundary |
The prev_class variable (u8, 6 values) is updated per character. Combined with last_appended: Option<char> — which tracks the last character written to the output buffer — spacing decisions are O(1) with no backward scanning of the output string.
Error modes¶
When a character has no mapping in any table, one of three modes applies:
| Mode | Behavior | Use case |
|---|---|---|
"replace" |
Substitute replace_with string (default "[?]") |
Debugging, visibility |
"ignore" |
Silently drop the character | URL slugs, filenames |
"preserve" |
Keep the original Unicode character | Mixed-script display |
Range-based dispatch¶
lookup_default() dispatches by codepoint range before consulting the main table, routing characters to dedicated, higher-quality handlers:
- CJK Unified Ideographs (U+3400–U+9FFF, U+F900–U+FAFF) → Hanzi pinyin PHF table
- Hangul Syllables (U+AC00–U+D7AF) and compatibility jamo (U+3131–U+3163) → algorithmic romanization
- Everything else → flat BMP array (see Data Tables)
This avoids probing the 65K-entry flat array for scripts that have dedicated tables with better mappings.
Abugida / virama handling¶
Brahmic scripts use inherent-vowel systems where a consonant carries an implicit "a" unless suppressed by a virama (halant). The engine handles this via the IndicRole enum:
Consonant— carries inherent "a" in TSV (e.g. ka, ga, ta)DependentVowel— mātrā that replaces the inherent vowelVirama— suppresses the trailing "a" of the preceding consonantNone— independent vowels, digits, punctuation
The is_indic() function identifies whether a codepoint belongs to a supported Brahmic range. Each script has a dedicated *_char_role() function (e.g. tagalog_char_role, sundanese_char_role) that classifies its codepoints into IndicRole variants. The indic_char_role() dispatcher routes to the correct function based on the codepoint range.
Supported Brahmic scripts (25 total): Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Khmer, Balinese, Sundanese, Tai Tham, Cham, Batak, Buginese, Tagalog, Hanunoo, Buhid, Tagbanwa, Meetei Mayek.
Unicode form mappings¶
The flat BMP array includes mechanical mappings for Unicode presentation forms and compatibility characters that appear in real-world data:
| Block | Range | Count | Mapping method |
|---|---|---|---|
| Fullwidth ASCII | FF01–FF5E | 94 | codepoint - 0xFEE0 offset to ASCII equivalent |
| Halfwidth Hangul | FFA0–FFDC | 66 | Via compatibility jamo romanization |
| Enclosed/Circled | 2460–24FF | 160 | Extract value from Unicode name (①→1, Ⓐ→A) |
| Superscript/Subscript | 2070–209F | 29 | Map to base digit or letter |
| Roman Numerals | 2160–2188 | 41 | Multi-char values (Ⅱ→II, Ⅻ→XII) |
| Modifier Letters | 02B0–02FF | 80 | Superscript letter variants (ʰ→h, ʷ→w) |
| IPA Extensions | 0250–02AF | 96 | Closest ASCII approximation (ɑ→a, ʃ→sh, ŋ→ng) |
| Greek Extended | 1F00–1FFF | 233 | Strip breathing/accent → base Greek → Latin |
| Hangul Jamo | 1100–11FF | 256 | Matches CHOSEONG/JUNGSEONG/JONGSEONG arrays |
| Kangxi Radicals | 2F00–2FD5 | 214 | Decompose to CJK Unified → pinyin |
| CJK Compat Ideographs | F900–FAFF | 472 | Canonical decomposition target → pinyin |