Exhaustive Testing & Compile-Time Assurance¶
translit goes beyond conventional unit and property-based testing with three layers of machine-verifiable assurance: compile-time assertions, exhaustive domain coverage, and stated invariant specifications.
Overview¶
"Exhaustively tested" for translit means:
- Compile-time guarantees — Data integrity assertions that fail the build if violated
- Exhaustive domain testing — Every element in bounded Unicode domains is tested (not sampled)
- Stated invariants — Seven properties stated as specifications and verified by exhaustive enumeration or property-based testing
This is stronger than property-based testing alone because exhaustive tests leave zero untested inputs within their domain.
Compile-Time Guarantees (build.rs)¶
The build script verifies data integrity before compilation succeeds:
| Assertion | Scope | What it proves |
|---|---|---|
| All default BMP table values are ASCII | 5,000+ mappings | No transliteration introduces non-ASCII output |
| All SMP table values are ASCII | All SMP mappings | Same guarantee for characters above U+FFFF |
| All language override values are ASCII | 22 language tables | Language-specific overrides are pure ASCII |
| All Hanzi pinyin values are ASCII | 20,924 entries | Chinese romanization is pure ASCII |
| Confusables table count ≥ 1,000 | TR39 table | Confusables data not truncated |
| Default BMP table count ≥ 5,000 | BMP translations | Default table not truncated |
| Hanzi pinyin count ≥ 20,000 | CJK mappings | Pinyin table not truncated |
Additionally, src/tables/hangul.rs contains const assertions:
- JUNGSEONG_COUNT == 21, JONGSEONG_COUNT == 28 (Unicode spec constants)
- Total Hangul syllable count = 19 × 21 × 28 = 11,172
- Compatibility jamo range = 51 entries
If any assertion fails, cargo build fails. No runtime overhead.
Exhaustive Domain Coverage¶
Hangul Syllables (11,172 characters)¶
Every precomposed Hangul syllable (U+AC00–U+D7A3) is tested:
- romanize_hangul() returns Some (no unmapped syllables)
- Output is pure ASCII and non-empty
- Decomposition indices are in bounds: cho < 19, jung < 21, jong < 28
- Round-trip: cho * 21 * 28 + jung * 28 + jong == syllable_index
Compatibility Jamo (51 characters)¶
Every standalone jamo (U+3131–U+3163):
- lookup_compat_jamo() returns Some
- Output is pure ASCII
Full BMP — ASCII Output (63,488 characters)¶
Every non-surrogate codepoint U+0080–U+FFFF with ErrorMode::Ignore:
- Output is pure ASCII (proves invariant I2 exhaustively for the BMP)
Full BMP — Idempotence (63,488 characters)¶
Every non-surrogate codepoint U+0080–U+FFFF:
- transliterate(transliterate(ch)) == transliterate(ch) (proves I3 exhaustively)
CJK Unified Ideographs (20,992 characters)¶
Every character U+4E00–U+9FFF: - Output is ASCII and non-empty (every ideograph has a pinyin mapping)
Indic Block Structure (9 core + 4 extended scripts)¶
For each Brahmic script block, structural properties are verified exhaustively:
- Virama at expected offset classified as IndicRole::Virama
- Full consonant range returns IndicRole::Consonant
- Full dependent vowel range returns IndicRole::DependentVowel
Scripts covered: Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Tibetan, Myanmar, Khmer, Balinese, Javanese.
Stated Invariants (I1–I7)¶
| ID | Invariant | Statement | Verification |
|---|---|---|---|
| I1 | ASCII Passthrough | ∀s: s.is_ascii() → transliterate(s) = s |
Exhaustive (all 128 ASCII) + Hypothesis |
| I2 | ASCII Output | ∀s: transliterate(s, errors='ignore').is_ascii() |
Exhaustive BMP (Rust) + Hypothesis 1000 (Python, incl. SMP) |
| I3 | Idempotence | ∀s: f(f(s)) = f(s) where f = transliterate(·, errors='ignore') |
Exhaustive BMP (Rust) + Hypothesis 500 (Python) |
| I4 | No Exceptions | ∀s ∈ UTF-8, |s| ≤ 10MiB: transliterate(s) does not throw |
Hypothesis 1000 + edge cases |
| I5 | Deterministic | ∀s, n>0: transliterate(s) called n times → same result |
100× repeat on 10 mixed-script inputs |
| I6 | Input Size Bounded | ∀s: |s| > 10MiB → TranslitError |
Boundary test at 10 MiB / 10 MiB + 1 |
| I7 | Output Length Bounded | ∀s: |f(s)| ≤ |s|_bytes × 4 + |s|_chars |
Hypothesis 1000 |
Property-Based Testing Coverage¶
In addition to exhaustive tests, translit uses:
- proptest (Rust): Property tests in
tests/integration_transliterate.rs - Hypothesis (Python): 79KB of property tests in
tests/test_hypothesis.pycovering transliteration, slugification, normalization, confusables, and more - Fuzz testing:
tests/test_fuzz.pywith random Unicode generation
Total test count: 2,256+ tests across Rust and Python.
What Is NOT Verified¶
| Area | Why not verified | Mitigation |
|---|---|---|
| PHF hash correctness | Trusted from phf_codegen crate (widely used, well-tested) |
Functional tests exercise every lookup path |
| Linguistic accuracy | Transliteration correctness is empirical, not provable by testing alone | Extensive test corpus from native speakers; regression tests |
| Unicode version drift | New Unicode versions may add codepoints | CI tracks Unicode version; new chars fall through to ErrorMode |
| Memory safety (UB) | Requires Miri (nightly only) | unsafe_code = "forbid" in Cargo.toml; no unsafe anywhere |
Future: Nightly CI Extensions¶
When nightly Rust is available in CI:
- Kani bounded model checking: Would add a form of formal verification — proving absence of panics, overflow, and out-of-bounds for
indic_char_role,romanize_hangul, and decomposition arithmetic - Miri UB detection: Run the full test suite under Miri to detect undefined behavior, use-after-free, and data races