Architecture: Transliteration Engine

How translit converts Unicode text to ASCII, character by character.

Design goals

The transliteration engine converts arbitrary Unicode strings to ASCII equivalents suitable for URLs, filenames, search indices, and cross-lingual display. The core tradeoff is speed vs. linguistic accuracy: translit chooses O(1) per-character lookup with no sentence context, sacrificing disambiguation of polyphonic characters (see Limitations) for predictable, high-throughput output.

Cow return type

transliterate_impl() returns Cow<'a, str>. Pure-ASCII input (the common case in English-dominated workloads) is returned as Borrowed with zero allocation. Non-ASCII input builds an Owned string. The ASCII check uses str::is_ascii(), which compiles to a SIMD-friendly byte scan — sub-nanosecond for short strings.
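
The borrow-or-allocate pattern described above can be sketched as follows. This is a minimal illustration, not the engine's code: transliterate_sketch and map_char are hypothetical names, and the three-entry mapping stands in for the real tables.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the engine's per-character lookup.
fn map_char(c: char) -> Option<&'static str> {
    match c {
        'é' => Some("e"),
        'ü' => Some("u"),
        'ß' => Some("ss"),
        _ => None,
    }
}

/// Borrow pure-ASCII input unchanged; allocate only when something must change.
fn transliterate_sketch(input: &str) -> Cow<'_, str> {
    if input.is_ascii() {
        // Fast path: no allocation, the caller gets its own slice back.
        return Cow::Borrowed(input);
    }
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        if c.is_ascii() {
            out.push(c);
        } else if let Some(s) = map_char(c) {
            out.push_str(s);
        }
        // Unmapped characters are dropped here to keep the sketch short.
    }
    Cow::Owned(out)
}
```

Callers that only read the result can treat both variants as &str through deref, so the fast path costs them nothing in API complexity.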

Capacity pre-sizing

Before processing, the engine samples the first non-ASCII codepoint to pick a buffer multiplier:

  • CJK ideographs, Hangul, kana (U+3000–U+9FFF, U+AC00–U+D7AF, U+F900–U+FAFF): the input byte length — each character typically expands to a multi-letter pinyin/romaji syllable plus a space.
  • Latin, Cyrillic, Arabic, and everything else: — most characters map to a single ASCII character.

This heuristic prevents reallocations for CJK-heavy input without over-allocating for Latin text.
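
A rough sketch of the sampling step, using the CJK ranges listed above. The function names are hypothetical, and the two multipliers are deliberately left as caller-supplied parameters because this sketch does not assume the engine's actual constants.

```rust
/// True for the CJK, Hangul, and kana ranges used by the pre-sizing heuristic.
fn is_cjk_like(c: char) -> bool {
    matches!(c as u32, 0x3000..=0x9FFF | 0xAC00..=0xD7AF | 0xF900..=0xFAFF)
}

/// Pick an initial buffer capacity from the first non-ASCII character.
/// `cjk_mult` and `other_mult` are placeholders, not the engine's real values.
fn estimate_capacity(input: &str, cjk_mult: usize, other_mult: usize) -> usize {
    let mult = match input.chars().find(|c| !c.is_ascii()) {
        Some(c) if is_cjk_like(c) => cjk_mult,
        _ => other_mult,
    };
    input.len().saturating_mul(mult)
}
```

Sampling only the first non-ASCII character keeps the estimate cheap; mixed-script input may still trigger a reallocation, which is a performance cost rather than a correctness problem.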

Auto-language resolution

When lang="auto" is passed, the engine resolves the language code before entering the main transliteration loop. resolve_auto_lang() scans the input for the first non-Latin, non-Common character, maps its script to a default ISO 639-1 code (e.g. Cyrillic → "ru", Han → "zh", Hiragana/Katakana → "ja"), and substitutes it for the "auto" sentinel. If no non-Latin script is found (Latin-only or empty input), lang becomes None and the default table is used.

This resolution happens once per call, after the ASCII fast path but before the per-character loop. It adds zero overhead when lang is None or an explicit code.
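
The resolution step might look roughly like the sketch below. It covers only the three example scripts named above, identifies them by their main Unicode blocks rather than by a full script property lookup, and skips characters from any script it does not recognize. Both function names are hypothetical.

```rust
/// Map a character to a default language code by script.
/// Only the three examples from the text are covered.
fn default_lang_for(c: char) -> Option<&'static str> {
    match c as u32 {
        0x0400..=0x04FF => Some("ru"), // Cyrillic block
        0x3040..=0x30FF => Some("ja"), // Hiragana and Katakana blocks
        0x4E00..=0x9FFF => Some("zh"), // CJK Unified Ideographs (Han)
        _ => None,                     // Latin, Common, or not covered here
    }
}

/// Replace the "auto" sentinel with a detected code; pass anything else through.
fn resolve_auto_lang_sketch<'a>(lang: Option<&'a str>, input: &str) -> Option<&'a str> {
    match lang {
        Some("auto") => input.chars().find_map(default_lang_for),
        other => other, // None or an explicit code: no scan at all
    }
}
```

Because the scan stops at the first recognized character, the cost is proportional to the length of the leading Latin/Common run, not the whole input.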

Lookup priority

Each non-ASCII character is resolved through one of two fixed lookup paths, depending on mode:

  1. Strict ISO 9 mode (strict_iso9=True): ISO 9 table → default table. Language overrides are bypassed entirely.
  2. Normal mode: language-specific override (if lang is set, including auto-resolved) → default table.

Each mode consults at most two tables: this is a flat two-level dispatch, not a three-level fallback chain (ISO 9 → language override → default). ISO 9 and language modes are mutually exclusive.
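
The two paths can be sketched as below. The three table functions are toy stand-ins holding a handful of Cyrillic entries, chosen only so that the modes produce visibly different results; the function names and the table contents are assumptions, not the engine's data.

```rust
/// Toy ISO 9 table (two entries).
fn iso9(c: char) -> Option<&'static str> {
    match c {
        'х' => Some("h"),
        'ц' => Some("c"),
        _ => None,
    }
}

/// Toy language-override table (Ukrainian only).
fn lang_override(lang: &str, c: char) -> Option<&'static str> {
    match (lang, c) {
        ("uk", 'г') => Some("h"),
        ("uk", 'и') => Some("y"),
        _ => None,
    }
}

/// Toy default table.
fn default_table(c: char) -> Option<&'static str> {
    match c {
        'х' => Some("kh"),
        'ц' => Some("ts"),
        'г' => Some("g"),
        'и' => Some("i"),
        'д' => Some("d"),
        _ => None,
    }
}

/// The mode picks the primary table; the default table is the only fallback.
fn lookup(c: char, lang: Option<&str>, strict_iso9: bool) -> Option<&'static str> {
    let primary = if strict_iso9 {
        iso9(c) // language overrides are never consulted in this mode
    } else {
        lang.and_then(|l| lang_override(l, c))
    };
    primary.or_else(|| default_table(c))
}
```

With these toy tables, lookup('г', Some("uk"), false) takes the override while lookup('г', Some("uk"), true) skips it and falls through to the default entry.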

Script transition spacing

Raw character-by-character concatenation produces unreadable output for CJK text: 北京市 → beijingshi. The engine tracks a prev_class byte to detect script transitions and insert spaces:

Transition              Example       Spacing
Ideograph → ideograph   北京          space (each character is a "word")
Hangul → Hangul         서울          space (each syllable is distinct)
Kana → kana             ひらがな      no space (kana concatenate into words)
Ideograph ↔ kana        東京タワー    space at boundary
CJK ↔ Latin             café北京      space at boundary

The prev_class variable (u8, 6 values) is updated per character. Combined with last_appended: Option<char> — which tracks the last character written to the output buffer — spacing decisions are O(1) with no backward scanning of the output string.
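
A sketch of the spacing decision under the rules in the table above. The engine stores the class as a u8; an enum is used here for readability, and its six variant names are assumptions. The render function takes pre-romanized (source character, romanization) pairs so the example stays independent of any lookup table, and it lumps all non-CJK alphanumerics into the Latin class.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
enum CharClass { Start, Latin, Ideograph, Hangul, Kana, Other }

fn classify(c: char) -> CharClass {
    match c as u32 {
        0x4E00..=0x9FFF => CharClass::Ideograph,
        0xAC00..=0xD7AF => CharClass::Hangul,
        0x3040..=0x30FF => CharClass::Kana,
        _ if c.is_alphanumeric() => CharClass::Latin,
        _ => CharClass::Other, // whitespace, punctuation
    }
}

/// O(1) decision from the previous and current class only.
fn needs_space(prev: CharClass, cur: CharClass) -> bool {
    use CharClass::*;
    match (prev, cur) {
        // Nothing written yet, or either side is punctuation/whitespace.
        (Start, _) | (_, Start) | (Other, _) | (_, Other) => false,
        // Kana run together into words; so do ordinary letters.
        (Kana, Kana) | (Latin, Latin) => false,
        // Ideograph to ideograph, Hangul to Hangul, and every script boundary.
        _ => true,
    }
}

fn render(pairs: &[(char, &str)]) -> String {
    let mut out = String::new();
    let mut prev_class = CharClass::Start;
    let mut last_appended: Option<char> = None;
    for &(src, roman) in pairs {
        let cur = classify(src);
        // last_appended avoids a double space without re-reading `out`.
        if needs_space(prev_class, cur) && last_appended.map_or(false, |c| c != ' ') {
            out.push(' ');
        }
        out.push_str(roman);
        if let Some(c) = roman.chars().last() {
            last_appended = Some(c);
        }
        prev_class = cur;
    }
    out
}
```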

Error modes

When a character has no mapping in any table, one of three modes applies:

Mode         Behavior                                          Use case
"replace"    Substitute replace_with string (default "[?]")    Debugging, visibility
"ignore"     Silently drop the character                       URL slugs, filenames
"preserve"   Keep the original Unicode character               Mixed-script display

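The three error modes map naturally onto a small enum. The sketch below is illustrative only: ErrorMode and handle_unmapped are hypothetical names, and the replacement string is carried inside the Replace variant rather than passed separately.

```rust
/// How to treat a character that has no mapping in any table.
enum ErrorMode<'a> {
    Replace(&'a str), // substitute a marker such as "[?]"
    Ignore,           // drop the character
    Preserve,         // keep the original Unicode character
}

/// Append the fallback for an unmapped character to the output buffer.
fn handle_unmapped(c: char, mode: &ErrorMode<'_>, out: &mut String) {
    match mode {
        ErrorMode::Replace(marker) => out.push_str(marker),
        ErrorMode::Ignore => {}
        ErrorMode::Preserve => out.push(c),
    }
}
```

Note that Preserve is the only mode whose output can contain non-ASCII characters, so callers that require strict ASCII would pick one of the other two.
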
Range-based dispatch

lookup_default() dispatches by codepoint range before consulting the main table, routing characters to dedicated, higher-quality handlers:

  • CJK Unified Ideographs (U+3400–U+9FFF, U+F900–U+FAFF) → Hanzi pinyin PHF table
  • Hangul Syllables (U+AC00–U+D7AF) and compatibility jamo (U+3131–U+3163) → algorithmic romanization
  • Everything else → flat BMP array (see Data Tables)

This avoids probing the 65K-entry flat array for scripts that have dedicated tables with better mappings.
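
The routing step can be sketched with a single range match using the ranges listed above. The Handler enum and route function are hypothetical names. The second function shows the standard Unicode arithmetic for splitting a precomposed Hangul syllable into leading-consonant, vowel, and trailing-consonant indices, which is the usual starting point for algorithmic romanization; the engine's own romanization arrays are not reproduced here.

```rust
#[derive(Debug, PartialEq)]
enum Handler { HanziPinyin, HangulAlgorithmic, FlatBmp }

/// Decide which handler owns a codepoint before any table is probed.
fn route(c: char) -> Handler {
    match c as u32 {
        0x3400..=0x9FFF | 0xF900..=0xFAFF => Handler::HanziPinyin,
        0xAC00..=0xD7AF | 0x3131..=0x3163 => Handler::HangulAlgorithmic,
        _ => Handler::FlatBmp,
    }
}

/// Split a precomposed Hangul syllable into (lead, vowel, tail) indices.
/// A tail index of 0 means the syllable has no final consonant.
fn decompose_hangul(c: char) -> Option<(u32, u32, u32)> {
    let cp = c as u32;
    if !(0xAC00..=0xD7A3).contains(&cp) {
        return None;
    }
    let index = cp - 0xAC00;
    // 21 vowels x 28 tails = 588 syllables per leading consonant.
    Some((index / 588, (index % 588) / 28, index % 28))
}
```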

Abugida / virama handling

Brahmic scripts use inherent-vowel systems where a consonant carries an implicit "a" unless suppressed by a virama (halant). The engine handles this via the IndicRole enum:

  • Consonant — carries inherent "a" in TSV (e.g. ka, ga, ta)
  • DependentVowel — mātrā that replaces the inherent vowel
  • Virama — suppresses the trailing "a" of the preceding consonant
  • None — independent vowels, digits, punctuation

The is_indic() function identifies whether a codepoint belongs to a supported Brahmic range. Each script has a dedicated *_char_role() function (e.g. tagalog_char_role, sundanese_char_role) that classifies its codepoints into IndicRole variants. The indic_char_role() dispatcher routes to the correct function based on the codepoint range.

Supported Brahmic scripts (26 total): Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan, Myanmar, Khmer, Balinese, Sundanese, Tai Tham, Cham, Batak, Buginese, Tagalog, Hanunoo, Buhid, Tagbanwa, Meetei Mayek.
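
The inherent-vowel rule can be illustrated with a small Devanagari-only sketch. The IndicRole variants follow the list above; devanagari_role, toy_table, and romanize_sketch are hypothetical names, the role ranges cover only the core consonant and vowel-sign runs, and the table holds just enough entries for the example.

```rust
#[derive(Clone, Copy, PartialEq)]
enum IndicRole { Consonant, DependentVowel, Virama, None }

/// Classify a Devanagari codepoint (core ranges only).
fn devanagari_role(c: char) -> IndicRole {
    match c as u32 {
        0x0915..=0x0939 => IndicRole::Consonant,
        0x093E..=0x094C => IndicRole::DependentVowel,
        0x094D => IndicRole::Virama,
        _ => IndicRole::None,
    }
}

/// Toy mapping; consonant entries carry the inherent "a".
fn toy_table(c: char) -> &'static str {
    match c {
        '\u{0915}' => "ka", // letter ka
        '\u{0924}' => "ta", // letter ta
        '\u{0928}' => "na", // letter na
        '\u{092E}' => "ma", // letter ma
        '\u{0938}' => "sa", // letter sa
        '\u{093F}' => "i",  // vowel sign i
        '\u{0947}' => "e",  // vowel sign e
        _ => "",
    }
}

fn romanize_sketch(input: &str) -> String {
    let mut out = String::new();
    let mut prev = IndicRole::None;
    for c in input.chars() {
        let role = devanagari_role(c);
        // A vowel sign or virama cancels the inherent "a" just written.
        if prev == IndicRole::Consonant
            && matches!(role, IndicRole::DependentVowel | IndicRole::Virama)
            && out.ends_with('a')
        {
            out.pop();
        }
        if role != IndicRole::Virama {
            out.push_str(toy_table(c));
        }
        prev = role;
    }
    out
}
```

Only the previous character's role is needed, so the rule fits the same single-pass, no-lookahead loop as the rest of the engine.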

Unicode form mappings

The flat BMP array includes mechanical mappings for Unicode presentation forms and compatibility characters that appear in real-world data:

Block                    Range       Count   Mapping method
Fullwidth ASCII          FF01–FF5E   94      codepoint - 0xFEE0 offset to ASCII equivalent
Halfwidth Hangul         FFA0–FFDC   66      Via compatibility jamo romanization
Enclosed/Circled         2460–24FF   160     Extract value from Unicode name (①→1, Ⓐ→A)
Superscript/Subscript    2070–209F   29      Map to base digit or letter
Roman Numerals           2160–2188   41      Multi-char values (Ⅱ→II, Ⅻ→XII)
Modifier Letters         02B0–02FF   80      Superscript letter variants (ʰ→h, ʷ→w)
IPA Extensions           0250–02AF   96      Closest ASCII approximation (ɑ→a, ʃ→sh, ŋ→ng)
Greek Extended           1F00–1FFF   233     Strip breathing/accent → base Greek → Latin
Hangul Jamo              1100–11FF   256     Matches CHOSEONG/JUNGSEONG/JONGSEONG arrays
Kangxi Radicals          2F00–2FD5   214     Decompose to CJK Unified → pinyin
CJK Compat Ideographs    F900–FAFF   472     Canonical decomposition target → pinyin
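
As an example of a purely mechanical mapping, the Fullwidth ASCII row reduces to one subtraction. The function name below is hypothetical; in the engine these values are described as living in the flat BMP array, so a sketch like this would more plausibly be used to generate table entries than to run per character.

```rust
/// Map a fullwidth form (U+FF01..=U+FF5E) to its ASCII counterpart.
fn fullwidth_to_ascii(c: char) -> Option<char> {
    match c as u32 {
        // U+FF01 - 0xFEE0 = U+0021 ('!'), and so on up to '~'.
        cp @ 0xFF01..=0xFF5E => char::from_u32(cp - 0xFEE0),
        _ => None,
    }
}
```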