Testing and Guarantees
Testing methodology
Most Unicode text libraries rely on example-based testing: a developer writes a handful of input/output pairs, runs them in CI, and calls it done. Example-based tests verify the specific cases the developer thought of. They say nothing about the rest.
translit combines three techniques that are uncommon in this space: compile-time data integrity assertions, exhaustive domain coverage, and stated invariant specifications. We are not aware of another transliteration or slugification library that publishes all three, though we haven't audited every library in every language.
What "exhaustively tested" means
Testing rigor is a spectrum between conventional tests and full formal verification (mathematical proofs of correctness). translit operates at the strongest level achievable without nightly-only tools:
| Level | What it proves | Who does this |
|---|---|---|
| Example-based tests | Specific inputs produce expected outputs | Everyone |
| Property-based tests | Random inputs satisfy stated properties (statistical confidence) | ~5% of open-source projects |
| Exhaustive domain tests | Every element in a bounded domain satisfies stated properties (certainty) | translit |
| Compile-time assertions | Data integrity invariants that fail the build if violated (zero runtime cost) | translit |
| Stated invariant specs | Properties stated as specifications with verification method documented | translit |
| Bounded model checking | Machine-checked proofs of absence of panics, overflow, UB | Future (requires nightly Rust) |
The gap between property-based testing and exhaustive testing is the difference between "we checked 1,000 random Hangul syllables" and "we checked all 11,172 Hangul syllables." The former gives statistical confidence. The latter gives certainty.
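The difference can be made concrete with a small sketch. It assumes a hypothetical defect that affects exactly one Hangul syllable (`BAD_CODEPOINT` and `property_holds` are illustrative stand-ins, not part of translit) and compares a single 1,000-element random sample against a full sweep of the domain.

```python
import random

HANGUL_FIRST = 0xAC00  # first precomposed Hangul syllable
HANGUL_LAST = 0xD7A3   # last precomposed Hangul syllable
DOMAIN = range(HANGUL_FIRST, HANGUL_LAST + 1)

# Hypothetical defect: the code under test mishandles exactly one syllable.
BAD_CODEPOINT = 0xB098


def property_holds(codepoint: int) -> bool:
    """Stand-in for a real property check; fails only at BAD_CODEPOINT."""
    return codepoint != BAD_CODEPOINT


# Property-based style: check 1,000 distinct random syllables.
rng = random.Random(0)
sample = rng.sample(DOMAIN, 1000)
sampled_failures = [cp for cp in sample if not property_holds(cp)]

# Exhaustive style: check every syllable in the domain.
exhaustive_failures = [cp for cp in DOMAIN if not property_holds(cp)]

# Chance that one 1,000-element sample misses a one-codepoint defect.
miss_probability = 1 - len(sample) / len(DOMAIN)

print(f"domain size: {len(DOMAIN)}")
print(f"sampling found {len(sampled_failures)} failure(s)")
print(f"exhaustive found {len(exhaustive_failures)} failure(s)")
print(f"probability a sample misses the defect: {miss_probability:.3f}")
```

The exhaustive sweep always reports the defect. Whether the sample does depends on the seed, and for a defect this narrow a single sample misses it roughly nine times out of ten.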
How the alternatives compare
| Library | Language | Tests | Exhaustive testing |
|---|---|---|---|
| Unidecode | Python | ~200 example tests | None |
| text-unidecode | Python | ~50 example tests | None |
| anyascii | Multi | Basic round-trip + snapshot | None |
| python-slugify | Python | ~80 example tests | None |
| awesome-slugify | Python | ~30 example tests | None |
| confusable_homoglyphs | Python | ~20 example tests | None |
| pathvalidate | Python | Example + parametrize | None |
| unidecode (Rust) | Rust | ~10 example tests | None |
| translit | Rust + Python | 2,900+ tests | Compile-time assertions, exhaustive domain, stated invariants |
These libraries are mature and widely used. The test counts above are approximate (based on public repos at time of writing) and may not reflect internal or downstream test suites. The point is not that they are poorly tested — example-based testing is the norm — but that translit's approach is different in kind.
The three layers of assurance
Layer 1: Compile-time data integrity assertions (build.rs)
Every time cargo build runs, the build script reads all transliteration TSV data files and asserts:
| Assertion | Scope | Consequence if violated |
|---|---|---|
| All default BMP table values are pure ASCII | 5,000+ mappings | Build fails |
| All SMP table values are pure ASCII | All supplementary mappings | Build fails |
| All 22 language override tables contain only ASCII values | de, ru, ja, fa, ... | Build fails |
| All 20,924 Hanzi pinyin values are pure ASCII | Full CJK block | Build fails |
| Default BMP table has ≥ 5,000 entries | Truncation detection | Build fails |
| Hanzi pinyin table has ≥ 20,000 entries | Truncation detection | Build fails |
| Confusables table has ≥ 1,000 entries | Truncation detection | Build fails |
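The real checks run in the Rust build script. The Python sketch below only illustrates their shape: reject any non-ASCII value and reject a table that is shorter than expected. The two-column TSV layout, the `load_table` name, and the threshold are assumptions made for this example, not translit's actual file format or API.

```python
import io

# Hypothetical two-column TSV: codepoint in hex, then the ASCII replacement.
SAMPLE_TSV = "00E9\te\n00DF\tss\n0416\tZh\n4E2D\tzhong\n"


def load_table(stream, min_entries):
    """Parse a mapping table and fail loudly if it is malformed or truncated."""
    table = {}
    for line_no, line in enumerate(stream, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        key, value = line.split("\t")
        # Integrity check: every replacement must be pure ASCII.
        if not value.isascii():
            raise ValueError(f"line {line_no}: non-ASCII value {value!r}")
        table[int(key, 16)] = value
    # Truncation check: a short table usually means a damaged data file.
    if len(table) < min_entries:
        raise ValueError(f"only {len(table)} entries, expected >= {min_entries}")
    return table


table = load_table(io.StringIO(SAMPLE_TSV), min_entries=4)
print(len(table), "entries loaded")
```

Running the check wherever the table is generated means a malformed data file stops the build instead of surfacing later as a wrong transliteration.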
Additionally, hangul.rs contains const assertions verifying that the Hangul decomposition algorithm constants match the Unicode specification:
- JUNGSEONG_COUNT == 21, JONGSEONG_COUNT == 28
- Total syllable count = 19 × 21 × 28 = 11,172
- Compatibility jamo range = 51 entries exactly
These assertions execute at compile time, not in CI. A release artifact cannot exist if any assertion fails.
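The arithmetic behind those constants can be checked directly against the range endpoints quoted in this document. This is a sketch: `CHOSEONG_COUNT` and the range constant names are assumed for illustration, and Python evaluates these assertions when the module is loaded, whereas Rust const assertions are evaluated by the compiler.

```python
# Constants from the Unicode Hangul syllable composition algorithm.
CHOSEONG_COUNT = 19   # leading consonants (name assumed for this sketch)
JUNGSEONG_COUNT = 21  # vowels
JONGSEONG_COUNT = 28  # trailing consonants, including "no trailing consonant"

SYLLABLE_FIRST, SYLLABLE_LAST = 0xAC00, 0xD7A3
COMPAT_JAMO_FIRST, COMPAT_JAMO_LAST = 0x3131, 0x3163

# Inclusive range sizes.
syllable_count = SYLLABLE_LAST - SYLLABLE_FIRST + 1
compat_jamo_count = COMPAT_JAMO_LAST - COMPAT_JAMO_FIRST + 1

# The syllable block must hold exactly one codepoint per (L, V, T) combination.
assert syllable_count == CHOSEONG_COUNT * JUNGSEONG_COUNT * JONGSEONG_COUNT
assert compat_jamo_count == 51
```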
Layer 2: Exhaustive domain tests
These tests iterate over every element in a bounded Unicode domain. Unlike property-based tests (which sample randomly), exhaustive tests leave zero untested inputs within their domain.
| Domain | Size | What is verified |
|---|---|---|
| All Hangul syllables (U+AC00–U+D7A3) | 11,172 | romanize_hangul() returns Some, output is ASCII, non-empty, decomposition indices in bounds, round-trip formula correct |
| All compatibility jamo (U+3131–U+3163) | 51 | lookup_compat_jamo() returns Some, output is ASCII |
| Full BMP, ErrorMode::Ignore (U+0080–U+FFFF) | 63,488 | transliterate_impl() produces ASCII-only output for every codepoint |
| Full BMP idempotence | 63,488 | f(f(ch)) == f(ch) for every codepoint |
| All CJK Unified Ideographs (U+4E00–U+9FFF) | 20,992 | Output is ASCII, unmapped count < 200 |
| 15 Indic script blocks | ~2,000 codepoints | Every consonant/vowel/virama in the block is correctly classified |
| Determinism | 10 inputs × 100 runs | Each of 10 mixed-script inputs produces identical output across 100 runs |
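Total exhaustive coverage: more than 159,000 individual codepoint checks. The domains overlap (Hangul and CJK lie inside the BMP, and the BMP is traversed twice), so this figure counts checks, not distinct codepoints.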
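As an illustration of what an exhaustive domain test looks like, the sketch below walks every Hangul syllable and checks index bounds and the round-trip formula. It uses the standard Unicode decomposition arithmetic and cross-checks it against Python's `unicodedata`. It does not call translit's `romanize_hangul()`, and the helper names are invented for this example.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # syllables per leading consonant


def decompose(codepoint):
    """Return (leading, vowel, trailing) indices for a precomposed syllable."""
    s_index = codepoint - S_BASE
    return s_index // N_COUNT, (s_index % N_COUNT) // T_COUNT, s_index % T_COUNT


def check_all_syllables():
    """Visit every syllable in U+AC00..U+D7A3 and return how many were checked."""
    checked = 0
    for cp in range(S_BASE, S_BASE + L_COUNT * N_COUNT):
        lead, vowel, tail = decompose(cp)
        # Indices stay inside their tables.
        assert 0 <= lead < L_COUNT
        assert 0 <= vowel < V_COUNT
        assert 0 <= tail < T_COUNT
        # The round-trip formula reproduces the original codepoint.
        assert S_BASE + (lead * V_COUNT + vowel) * T_COUNT + tail == cp
        # Cross-check against the standard library's canonical decomposition.
        jamo = chr(L_BASE + lead) + chr(V_BASE + vowel)
        if tail:
            jamo += chr(T_BASE + tail)
        assert unicodedata.normalize("NFD", chr(cp)) == jamo
        checked += 1
    return checked


print(check_all_syllables(), "syllables checked")
```

Because the loop covers the whole block, a passing run leaves no syllable unexamined, which is the property that distinguishes this layer from sampling.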
Layer 3: Stated invariant specifications
Seven properties are stated as specifications, each with a documented verification method:
| ID | Invariant | Statement | Verification |
|---|---|---|---|
| I1 | ASCII Passthrough | ∀s: s.is_ascii() → f(s) = s | Exhaustive (all 128 ASCII) + Hypothesis 500 |
| I2 | ASCII Output | ∀s: f(s, errors='ignore').is_ascii() | Exhaustive BMP (Rust) + Hypothesis 1,000 incl. SMP |
| I3 | Idempotence | ∀s: f(f(s)) = f(s) | Exhaustive BMP (Rust) + Hypothesis 500 |
| I4 | No Exceptions | ∀s ∈ UTF-8, \|s\| ≤ 10 MiB: f(s) does not throw | Hypothesis 1,000 + explicit edge cases |
| I5 | Deterministic | ∀s, n>0: f(s) called n times → same result | 100× repeat on 10 mixed-script inputs |
| I6 | Input Size Bounded | ∀s: \|s\| > 10 MiB → TranslitError | Boundary test at limit |
| I7 | Output Length Bounded | ∀s: \|f(s)\| ≤ \|s\|_bytes × 4 + \|s\|_chars | Hypothesis 1,000 |
Each invariant is a test class with a docstring stating the property. The verification method combines exhaustive enumeration (where the domain is bounded) with Hypothesis property-based testing (where it is not).
See formal-verification.md for the full specification document.
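To show how invariants such as I1 to I3 translate into executable checks, here is a minimal sketch against a stand-in function. `toy_fold` is not translit's transliterator: it is a deliberately simple NFKD-and-filter function chosen so the example runs with only the standard library. The real suite combines exhaustive Rust tests with Hypothesis, while this sketch uses plain loops.

```python
import unicodedata


def toy_fold(text):
    """Stand-in transliterator: NFKD-decompose, then drop non-ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ch.isascii())


def bmp_scalars(start=0x0080):
    """Yield BMP characters from `start`, skipping the surrogate range."""
    for cp in range(start, 0x10000):
        if not 0xD800 <= cp <= 0xDFFF:
            yield chr(cp)


def check_invariants():
    """Run I1 to I3 against toy_fold; raises AssertionError on any violation."""
    # I1, ASCII passthrough: every ASCII character maps to itself.
    for cp in range(128):
        assert toy_fold(chr(cp)) == chr(cp)
    for ch in bmp_scalars():
        once = toy_fold(ch)
        # I2, ASCII output: nothing outside ASCII survives.
        assert once.isascii()
        # I3, idempotence: a second application changes nothing.
        assert toy_fold(once) == once
    return True


print(check_invariants())
```

Each assertion corresponds to one row of the table, so a failure names the invariant that was broken and the codepoint that broke it.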
What exhaustive testing does NOT cover
Exhaustive testing is not formal verification. We are precise about the boundary:
| Area | Why not verified | Mitigation |
|---|---|---|
| PHF hash correctness | Trusted from phf_codegen crate | Functional tests exercise every lookup path |
| Linguistic accuracy | Transliteration correctness is empirical, not provable by testing alone | Extensive corpus from native speakers; 83 language reference tests |
| Unicode version drift | New Unicode versions add codepoints | CI tracks Unicode version; unknown chars handled by ErrorMode |
| Memory safety / UB | Requires Miri (nightly-only) | unsafe_code = "forbid" in Cargo.toml — zero unsafe anywhere |
| Absence of panics | Requires Kani bounded model checking (nightly-only) | Property tests with 1,000+ random inputs; no panics in 2,900+ tests |
Future: When nightly Rust is available in CI, we plan to add two tools. Kani bounded model checking, a form of formal verification, would prove absence of panics and overflow in romanize_hangul, indic_char_role, and the decomposition arithmetic. Miri would detect undefined behavior.
Conventional testing (still comprehensive)
The exhaustive testing layers sit on top of a conventional test suite that is itself unusually thorough:
Test suite overview
| Category | Tests | Coverage |
|---|---|---|
| Python (pytest) | 2,268 | All public API functions |
| Rust (#[test]) | 635 | Core algorithms, tables, edge cases |
| Exhaustive domain (Rust) | 16 | Full BMP, Hangul, CJK, Indic |
| Stated invariants (Python) | 12 | I1–I7 specifications |
| Property-based (Hypothesis) | 500+ examples/property | Full Unicode input space |
| Property-based (proptest) | Rust-side invariants | Normalization, roundtrips |
| Total | 2,900+ | |
Per-language reference tests
Each of the 83 built-in language profiles has dedicated tests verifying:
- Known transliteration pairs: reference texts with expected output (e.g., "Москва" → "Moskva" for Russian, "Київ" → "Kyiv" for Ukrainian)
- Language override behavior: lang="xx" produces different output from the default table where expected
- ISO 9 and GOST interaction: scholarly modes override language-specific mappings correctly
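A reference test of this kind can be sketched with toy tables. Everything below is illustrative: the tables hold only the letters needed for the two reference pairs quoted above, and `toy_transliterate` is a hypothetical stand-in, not translit's API or data.

```python
# Toy tables for illustration only; these are not translit's data files.
DEFAULT_TABLE = {
    "М": "M", "о": "o", "с": "s", "к": "k", "в": "v", "а": "a",
    "К": "K", "и": "i", "ї": "i",
}
LANGUAGE_OVERRIDES = {
    "uk": {"и": "y"},  # Ukrainian romanizes и as "y"
}


def toy_transliterate(text, lang=None):
    """Look up each character in the language override first, then the default."""
    override = LANGUAGE_OVERRIDES.get(lang, {})
    return "".join(override.get(ch, DEFAULT_TABLE.get(ch, ch)) for ch in text)


# Reference pairs: (language, source text, expected output).
REFERENCE_PAIRS = [
    ("ru", "Москва", "Moskva"),
    ("uk", "Київ", "Kyiv"),
]

for lang, source, expected in REFERENCE_PAIRS:
    assert toy_transliterate(source, lang=lang) == expected

# The override changes the result relative to the default table.
assert toy_transliterate("Київ") != toy_transliterate("Київ", lang="uk")
```

Keeping the pairs in a data list makes it cheap to add a language: one new row per reference text, with no new test logic.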
Security invariant tests
tests/test_security_invariants.py uses Hypothesis to verify that security_clean() enforces its security contracts on any input:
| Invariant | Guarantee |
|---|---|
| Bidi stripping | All 13 bidi override/isolate characters removed |
| Zero-width stripping | All 9 zero-width characters removed |
| Confusable neutralization | No cross-script confusables in output |
| NFKC normalization | Output always in NFKC form |
| Whitespace collapse | No consecutive whitespace in output |
| Idempotency | security_clean(security_clean(x)) == security_clean(x) |
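The shape of these contract checks can be sketched with a stand-in cleaner. `toy_clean` is not `security_clean()`: it covers only stripping, NFKC, and whitespace collapse, omits confusable handling, and uses assumed subsets of the bidi and zero-width character lists. The real tests drive the checks with Hypothesis, while this sketch uses a few fixed samples.

```python
import re
import unicodedata

# Assumed subsets for illustration; the library's own lists are longer.
BIDI_CONTROLS = "\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"
ZERO_WIDTH = "\u200B\u200C\u200D\u2060\uFEFF"
STRIP_TABLE = {ord(ch): None for ch in BIDI_CONTROLS + ZERO_WIDTH}


def toy_clean(text):
    """Stand-in cleaner: strip controls, apply NFKC, collapse whitespace."""
    stripped = text.translate(STRIP_TABLE)
    normalized = unicodedata.normalize("NFKC", stripped)
    return re.sub(r"\s+", " ", normalized)


def violations(text):
    """Return the names of the contracts that toy_clean breaks for `text`."""
    out = toy_clean(text)
    broken = []
    if any(ord(ch) in STRIP_TABLE for ch in out):
        broken.append("stripping")
    if unicodedata.normalize("NFKC", out) != out:
        broken.append("nfkc")
    if re.search(r"\s\s", out):
        broken.append("whitespace")
    if toy_clean(out) != out:
        broken.append("idempotency")
    return broken


SAMPLES = ["pay\u202Epal", "a\u200Bb", "\uFB01le  name", "x\t\ty"]
for sample in SAMPLES:
    assert violations(sample) == []
```

In this sketch the stripping step runs before normalization so that removing a control character cannot leave behind a sequence that NFKC would still change. Whether `security_clean()` orders its steps the same way is not stated here.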
CI matrix
Every push and pull request runs the full test suite across:
| Axis | Values |
|---|---|
| OS | Ubuntu, macOS, Windows |
| Python | 3.9, 3.10, 3.11, 3.12, 3.13, 3.14 |
| Rust checks | cargo fmt --check, cargo clippy -D warnings, cargo test |
| Python checks | pytest, ruff lint, mypy strict mode, doctest |
Unicode table update process
When Unicode versions are updated:
- Dependency update: bump unicode-segmentation, unicode-normalization, and confusable table crates
- Rebuild tables: build.rs regenerates PHF lookup tables from TSV source data at compile time. Compile-time assertions verify the new data is well-formed.
- Exhaustive tests: the full BMP and CJK domain tests verify invariants hold across any new characters
- Property tests: Hypothesis tests verify invariants still hold across the new character space
- Reference text tests: existing per-language tests confirm no behavioral changes for known inputs