Normalization
Unicode normalization ensures that equivalent sequences of characters are represented identically. translit provides fast normalization using the Rust unicode-normalization crate.
Why normalize?
The same visible text can have multiple Unicode representations:
# These look identical but are different byte sequences:
a = "\u00e9" # U+00E9 (precomposed)
b = "\u0065\u0301" # U+0065 U+0301 (decomposed: e + combining acute)
a == b # => False (without normalization!)
Normalization resolves this by converting to a canonical form.
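As a minimal sketch of what converting to a canonical form buys you, the snippet below uses Python's standard-library `unicodedata` module as a stand-in for translit, so it runs without any extra install. The helper name `nfc_equal` is hypothetical and exists only for this illustration.

```python
import unicodedata

def nfc_equal(left: str, right: str) -> bool:
    """Compare two strings after converting both to NFC."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(precomposed == decomposed)           # raw comparison of code points
print(nfc_equal(precomposed, decomposed))  # comparison after normalization
```

The raw comparison fails because the two strings contain different code points; once both sides are in NFC they compare equal.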
Normalization forms
| Form | Name | Description |
|---|---|---|
| NFC | Canonical Decomposition + Composition | Precomposed characters. Most common for storage and comparison. |
| NFD | Canonical Decomposition | Decomposed characters. Useful for accent stripping. |
| NFKC | Compatibility Decomposition + Composition | Like NFC but also normalizes compatibility characters (ﬁ→fi, ²→2). |
| NFKD | Compatibility Decomposition | Like NFD with compatibility decomposition. |
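To make the differences between the four forms concrete, here is a small sketch that runs one sample string through each form and prints the resulting code points. It uses the standard-library `unicodedata` module rather than translit, and the helper `code_points` is a hypothetical name introduced for display purposes.

```python
import unicodedata

def code_points(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# ﬁ ligature (U+FB01) followed by precomposed é (U+00E9)
sample = "\ufb01\u00e9"

results = {
    form: code_points(unicodedata.normalize(form, sample))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, points in results.items():
    print(f"{form:5} {points}")
```

The canonical forms (NFC, NFD) leave the ligature untouched and only compose or decompose the accent, while the compatibility forms (NFKC, NFKD) also expand the ligature into the two letters f and i.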
Basic usage
from translit import normalize
# NFC: compose into single codepoints
normalize("e\u0301") # => "é" (U+00E9)
# NFD: decompose into base + combining marks
normalize("é", form="NFD") # => "e\u0301"
# NFKC: compatibility + compose
normalize("\ufb01nance", form="NFKC") # => "finance" (ﬁ ligature U+FB01 becomes "fi")
normalize("2²", form="NFKC") # => "22"
# NFKD: compatibility + decompose
normalize("\ufb01", form="NFKD") # => "fi"
Checking normalization
Test whether a string is already in a given form without performing the full normalization:
from translit import is_normalized
is_normalized("hello") # => True (ASCII is always NFC)
is_normalized("é", form="NFC") # => True (precomposed)
is_normalized("é", form="NFD") # => False
is_normalized("e\u0301", form="NFD") # => True (decomposed)
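A string can satisfy several forms at once. The following sketch lists every form a string already satisfies, using the standard library's `unicodedata.is_normalized` (Python 3.8+) as a stand-in for translit's function. The helper `normalized_forms` is a hypothetical name for this example only.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalized_forms(text: str) -> set[str]:
    """Return the set of normalization forms that `text` already satisfies."""
    return {form for form in FORMS if unicodedata.is_normalized(form, text)}

print(normalized_forms("hello"))    # plain ASCII
print(normalized_forms("\u00e9"))   # precomposed é
print(normalized_forms("e\u0301"))  # decomposed é
```

Plain ASCII satisfies all four forms, the precomposed é satisfies only the composed forms (NFC and NFKC), and the decomposed sequence satisfies only the decomposed forms (NFD and NFKD).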
The NF enum
For programmatic use, the NF enum provides the four forms:
from translit import NF, normalize
normalize("\ufb01", form=NF.KC.value) # => "fi" (ﬁ ligature U+FB01 expanded)
| Member | Value |
|---|---|
| NF.C | "NFC" |
| NF.D | "NFD" |
| NF.KC | "NFKC" |
| NF.KD | "NFKD" |
When to use which form
- NFC: Default for most applications. Store and compare text in NFC.
- NFD: Use when you need to manipulate combining marks (e.g., strip_accents() uses NFD internally).
- NFKC: Use for search indexes and text matching where ﬁ should match fi.
- NFKD: Use for deep decomposition before further processing.
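Two of the use cases in the list above can be sketched with the standard-library `unicodedata` module. This is an illustration of the general techniques, not translit's implementation: `strip_marks` and `search_key` are hypothetical helpers, and `strip_marks` is not claimed to match what `strip_accents()` does internally beyond the NFD step the list mentions.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Decompose to NFD, drop combining marks, then recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks such as U+0301
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)

def search_key(text: str) -> str:
    """Build a matching key: NFKC folds compatibility characters, casefold ignores case."""
    return unicodedata.normalize("NFKC", text).casefold()

print(strip_marks("caf\u00e9 na\u00efve"))
print(search_key("\ufb01nance") == search_key("Finance"))
```

NFD is the natural form for the first helper because the accents become separate code points that can be filtered out. NFKC suits the second because the ligature and the plain letters collapse to the same key.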
Performance
Normalization is implemented in Rust via the unicode-normalization crate. Strings that are already in the target form are detected quickly via is_normalized() without allocation.
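The check-before-normalize idea can be sketched in pure Python as follows. This is an assumption-laden illustration built on the standard-library `unicodedata` module and a hypothetical helper `normalize_if_needed`; it shows the general pattern, not how translit or the Rust crate implements it.

```python
import unicodedata

def normalize_if_needed(text: str, form: str = "NFC") -> str:
    """Return `text` unchanged when it is already in `form`, else a normalized copy."""
    if unicodedata.is_normalized(form, text):
        return text  # fast path: the input object is returned as-is
    return unicodedata.normalize(form, text)

already_nfc = "caf\u00e9"   # precomposed é, already NFC
needs_work = "cafe\u0301"   # e + combining acute, not NFC

print(normalize_if_needed(already_nfc) is already_nfc)  # same object returned
print(normalize_if_needed(needs_work) == already_nfc)   # normalized copy matches
```

When most input is already normalized, as is typical for text stored in NFC, the fast path avoids building a new string for the common case.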