Architecture: Performance

The optimization strategies that make translit 10–53× faster than pure-Python alternatives.

The PyO3 boundary problem

Every call from Python into Rust crosses the PyO3 boundary: argument conversion, GIL management, result conversion back to a Python object. This costs ~300–500 ns per call. For a function like transliterate() that processes a short string in ~60 ns of actual Rust work, the boundary overhead is the dominant cost. Every optimization strategy below either reduces the time spent in the boundary or reduces the number of crossings.

Optimization 1: Python-side ASCII fast-path

The most effective optimization is never crossing the boundary at all. transliterate(), strip_accents(), and normalize() check text.isascii() on the Python side before calling into Rust. This is a ~30–50 ns CPython C call that scans the string's internal buffer. Pure-ASCII strings (the common case in English workloads) return immediately:

Function With fast-path Without
transliterate("hello") 62 ns 407 ns
strip_accents("hello") 36 ns 805 ns

This turns the common case from a ~400 ns function call into a ~60 ns no-op.

Optimization 2: Flat BMP array

The default transliteration table covers U+0080–U+FFFF as a flat [Option<&'static str>; 65408] array indexed by (codepoint - 0x80). Lookup is a bounds check and an array dereference — no hashing, no collision chain, no cache-unfriendly pointer chasing.

The array occupies ~512 KB in the .rodata section. The OS pages it in on demand; only the pages containing accessed codepoint ranges are resident in memory.

This replaced a PHF map for BMP lookups and delivered the single largest speedup: Latin transliteration went from 34× faster than Unidecode to 38× faster.

Optimization 3: Range-based dispatch

Before consulting the flat BMP array, lookup_default() dispatches by codepoint range to dedicated handlers:

  • CJK Unified Ideographs → Hanzi pinyin PHF table
  • Hangul syllables and compatibility jamo → algorithmic romanization

This avoids probing the 65K-entry array for scripts that have their own higher-quality tables. It also means the flat array doesn't need entries for CJK/Hangul ranges, keeping it focused on Latin/Cyrillic/Arabic/Greek and other BMP scripts.

Optimization 4: Cow<'a, str> return

transliterate_impl() returns Cow<'a, str>. When the input is pure ASCII (detected via str::is_ascii()), it returns Borrowed — a pointer to the input with no allocation. Non-ASCII input returns Owned with a pre-sized buffer. The Cow type is transparent to callers and avoids the cost of cloning strings that don't need modification.

Optimization 5: Capacity pre-sizing

The output buffer is pre-sized based on the first non-ASCII character's script:

  • CJK/Hangul/kana: input byte length (each character expands to a multi-letter syllable plus space)
  • Everything else: input byte length (most characters map 1:1)

This heuristic eliminates reallocations for the two most common workload shapes. The cost of over-sizing (a few KB of unused capacity) is negligible compared to the cost of repeated reallocations that memcpy the entire buffer.

Optimization 6: List input (batch processing)

transliterate(), slugify(), normalize(), and strip_accents() accept a list[str] and process all strings in a single PyO3 boundary crossing. For 100 strings, this saves ~24 µs of boundary overhead (240 ns × 100). The saving scales linearly with list size.

Operation (100 strings) List Loop Speedup
transliterate 28.3 µs 82.9 µs 2.9×

Optimization 7: Consistent Rust-native normalization

normalize() uses the Rust unicode-normalization crate (Unicode 16.0) for all calls — single strings, lists, and pipelines. This ensures consistent results across all code paths and eliminates Unicode version mismatches between CPython's unicodedata (Unicode 15.1) and the Rust crate's tables.

While CPython's unicodedata.normalize() is faster for single-string calls (it operates directly on PEP 393 compact strings with zero-copy semantics), the correctness tradeoff is worth the performance cost: using a single Unicode version prevents subtle bugs where different code paths produce different results for codepoints assigned between Unicode versions.

Optimization 8: PHF for specialized data

All secondary lookup tables (Hanzi pinyin, confusables, case folding, emoji) use compile-time perfect hash functions via phf_codegen. PHF provides O(1) lookup with no runtime allocation, no collision handling, and deterministic performance. The tables are generated at build time by build.rs and embedded as static data.

What translit does NOT optimize

Two operations are inherently slower than their CPython C-builtin counterparts:

  • Normalization: unicodedata.normalize() operates on CPython's internal string buffer without copying. translit uses Rust for all normalization (consistency over speed — see Optimization 7).
  • Case folding: str.casefold() is a CPython C builtin with zero allocation overhead. translit's PHF-based fold_case() is within ~4× at the Python level, with the gap dominated by PyO3 boundary-crossing cost rather than algorithmic differences.

These gaps are acceptable because normalization and case folding are rarely the bottleneck in real workloads — transliteration and slugification dominate processing time, and translit is 7–38× faster for those.