Normalization

Unicode normalization ensures that equivalent sequences of characters are represented identically. translit provides fast normalization using the Rust unicode-normalization crate.

Why normalize?

The same visible text can have multiple Unicode representations:

# These look identical but are different code point sequences:
a = "\u00e9"       # U+00E9 (precomposed)
b = "\u0065\u0301" # U+0065 U+0301 (decomposed: e + combining acute)

a == b        # => False (without normalization!)

Normalization resolves this by converting to a canonical form.
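
As a minimal sketch of that resolution, the snippet below uses Python's standard-library unicodedata.normalize as a stand-in. This is an assumption made so the example runs without translit installed; translit.normalize is expected to agree for these inputs.

```python
import unicodedata

a = "\u00e9"        # precomposed é (U+00E9)
b = "\u0065\u0301"  # e followed by a combining acute accent

# A raw comparison sees two different code point sequences.
raw_equal = a == b

# After both sides are converted to NFC, the strings compare equal.
nfc_equal = unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(raw_equal, nfc_equal)  # False True
```

The same holds for NFD: what matters is that both sides of the comparison use the same form.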

Normalization forms

| Form | Name | Description |
|------|------|-------------|
| NFC | Canonical Decomposition + Canonical Composition | Precomposed characters. Most common for storage and comparison. |
| NFD | Canonical Decomposition | Decomposed characters. Useful for accent stripping. |
| NFKC | Compatibility Decomposition + Canonical Composition | Like NFC, but also normalizes compatibility characters (ﬁ→fi, ²→2). |
| NFKD | Compatibility Decomposition | Like NFD, but with compatibility decomposition. |
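
To make the differences between the four forms concrete, the following sketch lists the code points each form produces for one sample string. It uses the standard-library unicodedata module as a stand-in for translit, and the codepoints helper is a hypothetical name introduced only for this illustration.

```python
import unicodedata

def codepoints(text: str) -> list[str]:
    """Return the code points of text as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]

# The ﬁ ligature (U+FB01) followed by precomposed é (U+00E9).
sample = "\ufb01\u00e9"

results = {
    form: codepoints(unicodedata.normalize(form, sample))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, cps in results.items():
    print(form, cps)
# NFC  ['U+FB01', 'U+00E9']
# NFD  ['U+FB01', 'U+0065', 'U+0301']
# NFKC ['U+0066', 'U+0069', 'U+00E9']
# NFKD ['U+0066', 'U+0069', 'U+0065', 'U+0301']
```

The canonical forms (NFC, NFD) leave the ligature alone. Only the compatibility forms (NFKC, NFKD) expand it to f + i.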

Basic usage

from translit import normalize

# NFC: compose into single codepoints
normalize("e\u0301")                  # => "é" (U+00E9)

# NFD: decompose into base + combining marks
normalize("é", form="NFD")           # => "e\u0301"

# NFKC: compatibility + compose
normalize("\ufb01nance", form="NFKC") # => "finance" (U+FB01 ligature expanded to f + i)
normalize("2²", form="NFKC")         # => "22"

# NFKD: compatibility + decompose
normalize("\ufb01", form="NFKD")      # => "fi" (U+FB01 ligature expanded to f + i)

Checking normalization

Test whether a string is already in a given form without performing the full normalization:

from translit import is_normalized

is_normalized("hello")                # => True (ASCII is always NFC)
is_normalized("é", form="NFC")       # => True (precomposed)
is_normalized("é", form="NFD")       # => False
is_normalized("e\u0301", form="NFD") # => True (decomposed)
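
A string can satisfy several forms at once. The sketch below reports which of the four forms a string is already in. It relies on the standard-library unicodedata.is_normalized (Python 3.8+) as a stand-in for translit's is_normalized; note that the stdlib function takes the form first. forms_satisfied is a hypothetical helper, not part of translit.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def forms_satisfied(text: str) -> list[str]:
    """Return every normalization form that text already conforms to."""
    return [form for form in FORMS if unicodedata.is_normalized(form, text)]

print(forms_satisfied("hello"))    # ['NFC', 'NFD', 'NFKC', 'NFKD']
print(forms_satisfied("\u00e9"))   # ['NFC', 'NFKC']
print(forms_satisfied("e\u0301"))  # ['NFD', 'NFKD']
print(forms_satisfied("\ufb01"))   # ['NFC', 'NFD']
```

Plain ASCII passes every check, while the ﬁ ligature is canonically normalized but fails both compatibility forms.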

The NF enum

For programmatic use, the NF enum provides the four forms:

from translit import NF, normalize

normalize("\ufb01", form=NF.KC.value) # => "fi" (U+FB01 ligature expanded to f + i)

| Member | Value |
|--------|-------|
| NF.C | "NFC" |
| NF.D | "NFD" |
| NF.KC | "NFKC" |
| NF.KD | "NFKD" |
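
The sketch below shows how an enum with these members can drive normalization programmatically. The NF class here is a hypothetical reconstruction from the member/value table, not translit's actual definition, and the normalize function is a stand-in built on the standard-library unicodedata module.

```python
from enum import Enum
import unicodedata

class NF(Enum):
    """Hypothetical mirror of the member/value table above."""
    C = "NFC"
    D = "NFD"
    KC = "NFKC"
    KD = "NFKD"

def normalize(text: str, form: str = "NFC") -> str:
    """Stand-in for translit.normalize, backed by unicodedata."""
    return unicodedata.normalize(form, text)

# Iterate over every form and record the resulting length of the ﬁ ligature.
lengths = {member.name: len(normalize("\ufb01", form=member.value)) for member in NF}
print(lengths)  # {'C': 1, 'D': 1, 'KC': 2, 'KD': 2}

# Looking a member up by value validates user-supplied form names.
print(NF("NFKC") is NF.KC)  # True
```

One reason to go through an enum rather than raw strings: with a standard Enum, a misspelled form such as NF("NFCK") raises ValueError before any text is processed.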

When to use which form

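  • NFC: Default for most applications. Store and compare text in NFC.
  • NFD: Use when you need to manipulate combining marks (e.g., strip_accents() uses NFD internally).
  • NFKC: Use for search indexes and text matching where the ligature ﬁ (U+FB01) should match fi. The mapping is lossy (² becomes 2), so keep the original text if you need to display it.
  • NFKD: Use when you need the fullest decomposition before further processing, such as stripping marks from compatibility characters.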
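
Two of these use cases can be sketched with the standard library. Both functions below are illustrative assumptions: strip_accents_sketch shows the general NFD-based technique and is not translit's strip_accents() implementation, and search_key is a hypothetical helper for building match keys.

```python
import unicodedata

def strip_accents_sketch(text: str) -> str:
    """Remove combining marks by decomposing to NFD first."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def search_key(text: str) -> str:
    """Build a case-insensitive match key using NFKC."""
    return unicodedata.normalize("NFKC", text).casefold()

print(strip_accents_sketch("caf\u00e9"))                     # cafe
print(search_key("\ufb01nance") == search_key("FINANCE"))    # True
print(search_key("x\u00b2"))                                 # x2
```

The last line shows why NFKC suits indexes but not storage: once x² becomes x2, the superscript cannot be recovered.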

Performance

Normalization is implemented in Rust via the unicode-normalization crate. is_normalized() quickly detects strings that are already in the target form, without allocating.
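
The check-before-normalize pattern can be sketched in pure Python. This uses the standard-library unicodedata module as a stand-in, normalize_if_needed is a hypothetical helper, and whether translit.normalize applies the same fast path internally is not stated here.

```python
import unicodedata

def normalize_if_needed(text: str, form: str = "NFC") -> str:
    """Return text unchanged when it is already in the target form."""
    if unicodedata.is_normalized(form, text):
        # Fast path: no new string is built.
        return text
    return unicodedata.normalize(form, text)

already_nfc = "hello world"
print(normalize_if_needed(already_nfc) is already_nfc)      # True
print(normalize_if_needed("e\u0301") == "\u00e9")           # True
```

Because most real-world text is already in NFC, the fast path tends to cover the common case.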