Architecture: Security & Hostname Validation¶
How translit defends against Unicode-based attacks in hostnames, filenames, and user-supplied text.
Threat model¶
Unicode introduces attack surfaces that don't exist in ASCII-only systems:
- Homoglyph spoofing: Cyrillic
а(U+0430) and Latina(U+0061) are visually identical. An attacker registerspаypal.com(Cyrillic а) to impersonatepaypal.com. - Mixed-script confusion: combining characters from multiple scripts in a single label to create plausible-looking but fraudulent domains.
- Confusable sequences: characters from one script that look like characters from another, even when not mixed within a single label.
- Path traversal:
../../etc/passwdor dot-sequence tricks in filenames. - Reserved name injection: Windows reserved names like
NUL,CON,AUXembedded in filenames.
Hostname safety: multi-stage canonicalization¶
is_safe_hostname() implements a conservative check that flags anything suspicious rather than trying to determine benign intent. The pipeline:
- NFKC normalization: compatibility decomposition + canonical composition. This collapses fullwidth Latin, circled letters, and other visual variants to their base forms before analysis.
- Per-label script detection: each dot-separated label is analyzed independently via
detect_scripts(). A label with characters from multiple scripts is flagged as mixed-script. - High-risk pair detection: mixed-script labels are checked against a hardcoded set of high-risk script combinations — Cyrillic+Latin, Greek+Latin, Armenian+Latin. These are the combinations most commonly exploited in homoglyph attacks.
- Confusable detection: each label is checked via
is_confusable(), which probes the TR39 confusable table for characters that map to different Latin equivalents. - Canonical form generation:
normalize_confusables()produces a Latin-canonical form that reveals what the hostname "looks like" to a Latin-script reader.
The function returns both a boolean safety verdict and a SafeHostnameDetails object with full diagnostic information (detected scripts, mixed-script flag, confusable flag, canonical form). This allows callers to implement their own policy on top of translit's detection.
Design tradeoff: the check is deliberately conservative. A fully-Cyrillic domain like яндекс.ру is not flagged as unsafe (it's not mixed-script), but a domain with even one Cyrillic character mixed into an otherwise-Latin label is flagged. False positives are preferred over false negatives in this security context.
Confusable normalization¶
The confusable table is a PHF map of ~6,000 Unicode characters to their Latin visual equivalents, derived from Unicode TR39. normalize_confusables() replaces each confusable character with its Latin equivalent; is_confusable() performs an early-return scan that stops at the first confusable character found.
Both functions are O(n) in the length of the input with O(1) per-character PHF lookup and zero allocation beyond the output string.
Filename sanitization security¶
sanitize_filename() applies multiple defense layers:
- Path traversal removal: strips
/,\, and collapses..sequences. - Platform-specific illegal character removal: Windows-banned characters (
<,>,:,",|,?,*) on universal and Windows platforms. - Dot-sequence collapsing: multiple consecutive dots are collapsed to a single dot, preventing
...tricks. - Reserved name prefixing: Windows reserved names (NUL, CON, PRN, AUX, COM1–COM9, LPT1–LPT9) are detected and prefixed with
_. - Post-truncation reserved name check: after
max_lengthtruncation, the result is re-checked against reserved names. If truncation created a reserved name (e.g.,NULtra.txttruncated to 3 bytes becomesNUL), the_prefix is re-applied. - Empty result fallback: if sanitization removes all characters, a default name is substituted.
The reserved-name check examines the stem (before the first dot) and compares case-insensitively. The post-truncation check was added to fix a bug where truncation could create a valid-looking but dangerous filename.
Script detection¶
detect_scripts() classifies each character in a string by its Unicode script property, returning a deduplicated list of script names. This is the foundation for mixed-script detection in both is_safe_hostname() and the standalone is_mixed_script() predicate.
Script deduplication uses a HashSet<&str> to maintain O(1) per-character insertion while preserving first-seen ordering in the output vector. This replaced an earlier O(n²) implementation that scanned the output vector for duplicates.
Precompiled security pipelines¶
The security_clean preset composes the security-relevant transforms into a single pipeline: NFKC normalization → confusable normalization → bidi stripping → whitespace collapse. This is the recommended entry point for security-sensitive text processing where the goal is to canonicalize Unicode content by neutralizing homoglyph spoofing, removing dangerous bidi overrides, and collapsing invisible characters.
The sanitize_user_input preset extends this approach for web application input: NFKC normalization → zalgo stripping → confusable normalization → bidi stripping → whitespace collapse. It adds zalgo text protection (capping combining marks at 2 per base character) while preserving the original script (no transliteration). This is the recommended entry point for sanitizing user-submitted form data, comments, and API payloads.
Zalgo detection¶
The is_zalgo() predicate detects excessive combining mark stacking by walking the NFD decomposition and counting consecutive combining marks per base character. The default threshold of 3 marks is safe for all legitimate scripts — Vietnamese ệ (the most combining-mark-heavy legitimate character in common use) has exactly 2 marks in NFD. The strip_zalgo() function caps marks at a configurable limit (default: 2), preserving legitimate diacritics while removing abuse.