Normalize-First Canonicalization¶
Put Unicode normalization at the front of every text pipeline, run the remaining steps in a fixed, grapheme-correct order, and decide up front whether your output needs to be reversible or script-pure. translit turns these from tribal knowledge into guarantees: pipeline step order is single-source and invariant-checked, and normalization provably never splits a grapheme cluster.
This page is a set of recipes built from existing functions — it introduces no new API.
Why normalize first¶
The same visible text can be encoded many ways (see Normalization). Preprocessing that runs before normalization — stripping, folding, transliterating, matching — sees those inconsistent encodings and produces inconsistent results. Worse, naive preprocessing can split an Indic conjunct or a combining-mark sequence, or mix scripts, corrupting both security checks and downstream models.
Normalizing first collapses the representations to one canonical form, so every later step operates on stable input.
Guarantee 1 — the step order can't drift¶
TextPipeline always runs its steps in a fixed, optimal order
regardless of the order you pass the arguments — normalization first, the
final whitespace cleanup last:
from translit import TextPipeline
pipe = TextPipeline(
fold_case=True, # passed first…
normalize="NFKC", # …but normalize always runs first
confusables=True,
collapse_whitespace=True,
)
assert [name for name, _param in pipe.steps] == ['normalize', 'confusables', 'fold_case', 'strip_control', 'strip_zero_width', 'collapse_whitespace']
The order a pipeline reports (pipe.steps) is, by construction, the order it
executes — both read from one shared list inside the engine. A step cannot be
reported at one position and run at another (the class of bug that #141 was). If
you are introspecting a pipeline to audit it, what you see is what runs.
Guarantee 2 — normalization is grapheme-correct¶
Normalization respects grapheme-cluster boundaries. For every form (NFC/NFD/NFKC/NFKD):
import translit
normalize_whole = lambda s, f: translit.normalize(s, form=f)
normalize_parts = lambda s, f: "".join(
translit.normalize(g, form=f) for g in translit.grapheme_split(s)
)
s = "क्ष" # Devanagari conjunct: KA + virama + SSA
assert normalize_whole(s, "NFC") == normalize_parts(s, "NFC")
In plain terms: normalization never orphans a combining mark, never splits an Indic conjunct, and never merges across cluster boundaries. This is verified exhaustively over every Hangul syllable, every Devanagari conjunct, the full combining-diacriticals block, and the whole BMP.
One intended exception to watch for: NFKC/NFKD change the grapheme count by
expanding compatibility characters (the ligature fi becomes fi, two
clusters). That is normalization working as designed, not a boundary violation —
but it is one more reason to choose your form deliberately (below).
If you need to shorten text without cutting a cluster in half, use
grapheme_truncate, which only cuts on boundaries.
Recipe — script purity (one script in, one script out)¶
Mixed-script text is a classic spoofing vector (pаypаl with Cyrillic а).
Detect it with is_mixed_script, and fold it to a single script with
normalize_confusables:
import translit
raw = "pаypаl" # contains Cyrillic а (U+0430)
# Normalize first — NFKC folds compatibility variants (fullwidth, ligatures)
# so the script check sees canonical input, never a disguised bypass.
s = translit.normalize(raw, form="NFKC")
assert translit.is_mixed_script(s) == True
pure = translit.normalize_confusables(s, target_script="latin")
assert pure == 'paypal'
assert translit.is_mixed_script(pure) == False
- Flag with
is_mixed_scriptwhen you only need to reject suspicious input (e.g. before storing a username). For hostnames,is_safe_hostnamereturns per-label mixed-script and confusable details. - Fold with
normalize_confusables(target_script=...)when you want to coerce input to a canonical script for comparison.
Normalize first, then check or fold — confusable detection is most reliable on canonical input.
Recipe — reversibility-preserving canonicalization (use NFC, not NFKC)¶
If you may need to convert text back to its native script later — translit
supports reverse transliteration for Greek, Russian, and Ukrainian via
transliterate(text, target=lang) — canonicalize with NFC, never NFKC.
NFKC's compatibility folding is lossy and destroys the information a reversal would need:
import translit
assert translit.normalize("⁵", form="NFC") == '⁵' # superscript five — preserved
assert translit.normalize("⁵", form="NFKC") == '5' # folded to ASCII — unrecoverable
An NFC-first canonicalization keeps the door open to a clean round-trip:
native = "Москва"
canonical = translit.normalize(native, form="NFC") # canonical, lossless
romanized = translit.transliterate(canonical, lang="ru")
assert romanized == 'Moskva'
back = translit.transliterate(romanized, target="ru")
assert back == 'Москва' # round-trips
For the reversible direction, also avoid the steps that erase recoverable
information — strip_accents, fold_case, and transliteration to ASCII — unless
you keep the original alongside the canonical key.
This is the deliberate counterpart to the security/search canonicalization
recipes (security_clean, catalog_key, search_key), which use NFKC on
purpose: they want the lossy folding so that ⁵, fi, and fullwidth variants
all collapse to one comparison key. Reversibility and aggressive folding are
opposite goals — choose per use case.
Choosing a normalization form¶
| Goal | Form | Why |
|---|---|---|
| Storage, comparison, reversible canonicalization | NFC | Canonical and lossless; preserves the round-trip to native script. |
| Security keys, search keys, dedup | NFKC | Folds compatibility variants (⁵→5, fi→fi, fullwidth→ASCII) into one key — lossy by design. |
| Accent stripping (as an intermediate) | NFD / NFKD | Decomposes so combining marks can be removed; see strip_accents. |
When unsure, normalize with NFC first; reach for NFKC only when you explicitly want compatibility folding and do not need the original back.
See also¶
- Normalization — the forms in depth
- Text Pipeline — composing the steps
- Precompiled Pipelines —
security_clean,catalog_key,search_key, and the policy profiles - Confusable Detection — script and homoglyph analysis