Precompiled Pipelines¶

Ready-to-use multi-step text processing pipelines. Each is a single compiled Rust function with no pipeline construction overhead at call time.

security_clean¶

security_clean(text: str) -> str

Security-focused text canonicalization.

Pipeline: NFKC → confusables → strip bidi/format → collapse_whitespace

Folds fullwidth and other compatibility forms (a common filter bypass) to their canonical characters, maps homoglyphs to Latin, strips bidi overrides and soft hyphens, then normalizes whitespace (collapsing runs and stripping control characters and zero-width injections).

Parameters:	`text` (`str`) – Input string (user-submitted, network-received, etc.).

Returns:	`str` – Canonicalized string safe for security-sensitive comparisons.

Examples:

>>> security_clean("Ηello Ꮤorld")  # Greek Η + Cherokee Ꮤ → Latin
'Hello World'

Pipeline steps¶

NFKC → confusables → strip bidi/format → collapse_whitespace

from translit import security_clean

security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥")   # => "Real text"
security_clean("Ηello Ꮤorld")    # => "Hello World"  (Greek Η + Cherokee Ꮤ → Latin)
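
The four steps above can be approximated with the Python standard library to see what each one contributes. This is only a rough sketch: `approx_security_clean` and the three-entry `CONFUSABLES` table are hypothetical names invented for illustration, and the library's compiled pipeline and its real confusables data are not reproduced here.

```python
import re
import unicodedata

# Hypothetical three-entry confusables table, for illustration only.
# The library's real table is much larger and is not reproduced here.
CONFUSABLES = {
    "Η": "H",       # Greek capital Eta
    "Ꮤ": "W",       # Cherokee letter
    "\u0430": "a",  # Cyrillic small a
}


def approx_security_clean(text: str) -> str:
    """Rough stdlib approximation of the four documented steps."""
    # 1. NFKC folds fullwidth and other compatibility forms.
    text = unicodedata.normalize("NFKC", text)
    # 2. Map known look-alike characters to Latin.
    text = "".join(CONFUSABLES.get(ch, ch) for ch in text)
    # 3. Drop format characters (category Cf: bidi controls, soft hyphen,
    #    zero-width characters) and control characters (Cc), but keep
    #    tab/newline/carriage return so step 4 can collapse them.
    text = "".join(
        ch for ch in text
        if ch in "\t\n\r" or unicodedata.category(ch) not in ("Cf", "Cc")
    )
    # 4. Collapse whitespace runs and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(approx_security_clean("ad\u202emin\u00ad  \u200bok"))
```

Running the steps in this order matters: NFKC first means the confusables lookup only has to handle canonical characters, and whitespace is collapsed last so that gaps left by stripped characters do not survive.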

ml_normalize¶

ml_normalize(text: str, *, lang: str | None = None, emoji: str = 'cldr') -> str

ML/NLP text normalization pipeline.

Pipeline: NFKC → emoji→text → [transliterate] → strip_accents → fold_case → collapse_whitespace

Produces clean, accent-free, lowercased text suitable for tokenizers, embeddings, and feature extraction. Emoji are expanded to their CLDR short-name descriptions.

Parameters:
`text` (`str`) – Input Unicode string.
`lang` (`str | None`, default: `None`) – Optional language code for transliteration (e.g. "de", "ja").
`emoji` (`str`, default: `'cldr'`) – Emoji handling mode: `"cldr"` expands emoji to CLDR short names (default); `"none"` leaves emoji characters unchanged.

Returns:	`str` – Clean, accent-free, lowercased text.

Raises:	`TranslitError` – If emoji is not `"cldr"` or `"none"`, or if an internal Rust error occurs.

Examples:

>>> ml_normalize("Café RÉSUMÉ")
'cafe resume'
>>> ml_normalize("München", lang="de")
'muenchen'

Pipeline steps¶

NFKC → emoji→text → [transliterate] → strip_accents → fold_case → collapse_whitespace

from translit import ml_normalize

ml_normalize("Café RÉSUMÉ")         # => "cafe resume"
ml_normalize("München", lang="de")  # => "muenchen"
ml_normalize("I ❤️ Python 🐍")      # => "i red heart python snake"
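
The strip_accents → fold_case → collapse_whitespace tail of this pipeline can be sketched with the standard library alone. The function name `approx_strip_and_fold` is hypothetical, and the sketch deliberately omits the emoji and transliteration steps, which need data the standard library does not ship.

```python
import unicodedata


def approx_strip_and_fold(text: str) -> str:
    """Stdlib approximation of strip_accents, fold_case, collapse_whitespace."""
    # Decompose so each accent becomes a separate combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping the base letters.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold (more aggressive than lower()) and collapse whitespace.
    return " ".join(base.casefold().split())


print(approx_strip_and_fold("Café RÉSUMÉ"))
print(approx_strip_and_fold("  München  "))
```

Note that plain accent stripping turns "München" into "munchen". The documented `lang="de"` result, "muenchen", comes from the language-aware transliteration step, which this sketch does not attempt.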

catalog_key¶

catalog_key(text: str, *, lang: str | None = None, strict_iso9: bool = False) -> str

Library catalog key generation pipeline.

Pipeline: NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace

Produces a canonical deduplication key for bibliographic titles.

Parameters:
`text` (`str`) – Input title or heading.
`lang` (`str | None`, default: `None`) – Language code for transliteration (e.g. "ru", "ja").
`strict_iso9` (`bool`, default: `False`) – Use ISO 9:1995 scholarly transliteration for Cyrillic.

Returns:	`str` – Canonical deduplication key string.

Raises:	`TranslitError` – If an internal Rust error occurs.

Examples:

>>> catalog_key("  Café  RÉSUMÉ  ")
'cafe resume'
>>> catalog_key("ΩMEGA  café")
'omega cafe'

Pipeline steps¶

NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace

from translit import catalog_key

catalog_key("  Café  RÉSUMÉ  ")       # => "cafe resume"
catalog_key("Москва", lang="ru")      # => "moskva"
catalog_key("Москва", lang="auto")    # => "moskva" (auto-detects Russian)
catalog_key("Müller", lang="de")      # => "mueller"
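
A deduplication key is typically used to bucket records that should be treated as the same title. The sketch below shows that pattern with a pluggable key function. `group_duplicates` and `stand_in_key` are hypothetical helpers, not part of the library; with the library installed, `catalog_key` would presumably be passed as `key_fn` instead of the stand-in.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def stand_in_key(text: str) -> str:
    # Hypothetical stand-in: case-fold and collapse whitespace only.
    return " ".join(text.casefold().split())


def group_duplicates(
    titles: Iterable[str], key_fn: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Bucket titles by key and keep only buckets with more than one entry."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for title in titles:
        buckets[key_fn(title)].append(title)
    return {key: group for key, group in buckets.items() if len(group) > 1}


records = ["War and Peace", "  war AND  peace ", "Emma"]
print(group_duplicates(records, stand_in_key))
```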

display_clean¶

display_clean(text: str) -> str

Display-safe text cleaning pipeline.

Pipeline: strip bidi/format → collapse_whitespace (strip control + strip zero-width)

Lightweight cleanup for user-submitted content destined for rendering. Strips bidirectional overrides (which can visually reorder text to hide malicious content), soft hyphens, control characters, and zero-width injections, then collapses runs of whitespace to single spaces.

Parameters:	`text` (`str`) – Input string (user-submitted content).

Returns:	`str` – Cleaned string safe for display rendering.

Examples:

>>> display_clean("hello\x00world\u200b!")
'helloworld!'
>>> display_clean("  spaced   out  ")
'spaced out'

Pipeline steps¶

strip_bidi → strip_control → strip_zero_width → collapse_whitespace

from translit import display_clean

display_clean("hello\x00world\u200b!")  # => "helloworld!"
display_clean("  spaced   out  ")       # => "spaced out"
display_clean("admin\u202Euser")        # => "adminuser" (bidi override stripped)
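
To see why the bidi step matters, it can help to locate the offending characters before they are stripped. The following sketch lists the bidirectional control characters in a string using only the standard library. `find_bidi_controls` and the `BIDI_CONTROLS` set are illustrative assumptions; the exact set of characters the library strips is not stated in this reference.

```python
import unicodedata

# Explicit bidi formatting characters defined by Unicode. Whether the
# library strips exactly this set is an assumption.
BIDI_CONTROLS = frozenset(
    "\u061c\u200e\u200f"               # ALM, LRM, RLM
    "\u202a\u202b\u202c\u202d\u202e"   # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"         # LRI, RLI, FSI, PDI
)


def find_bidi_controls(text: str) -> list:
    """Return (index, Unicode name) for each bidi control in text."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


print(find_bidi_controls("admin\u202euser"))
```

A report like this is useful for logging or moderation: the cleaned string alone does not reveal that the input tried to reorder text visually.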

search_key¶

search_key(text: str, *, lang: str | None = None) -> str

Search index key generation pipeline.

Pipeline: NFKC → transliterate → strip_accents → fold_case → collapse_whitespace

Produces a case-insensitive, accent-insensitive, script-insensitive lookup key. Like `catalog_key` but without confusable normalization, which makes it lighter and faster for search indexes.

Parameters:
`text` (`str`) – Input text to generate a search key from.
`lang` (`str | None`, default: `None`) – Language code for transliteration (e.g. "ru", "de").

Returns:	`str` – Normalized search key string.

Examples:

>>> search_key("  Café  RÉSUMÉ  ")
'cafe resume'
>>> search_key("Москва")
'moskva'
>>> search_key("Über allen Gipfeln")
'uber allen gipfeln'

Pipeline steps¶

NFKC → transliterate → strip_accents → fold_case → collapse_whitespace

from translit import search_key

search_key("Café RÉSUMÉ")              # => "cafe resume"
search_key("Москва", lang="ru")        # => "moskva"
search_key("ΩMEGA", lang="auto")       # => "omega"

sort_key¶

sort_key(text: str, *, lang: str | None = None) -> str

Sort key generation pipeline.

Pipeline: NFKC → transliterate → fold_case → collapse_whitespace

Like `search_key` but without accent stripping, preserving base accented characters for correct alphabetical ordering.

Parameters:
`text` (`str`) – Input text to generate a sort key from.
`lang` (`str | None`, default: `None`) – Language code for transliteration (e.g. "ru", "de").

Returns:	`str` – Normalized sort key string.

Examples:

>>> sort_key("Война и мир")
'voyna i mir'
>>> sort_key("Über allen Gipfeln")
'uber allen gipfeln'
>>> sort_key("  Café  ")
'cafe'

Pipeline steps¶

NFKC → transliterate → fold_case → collapse_whitespace

from translit import sort_key

sort_key("Über", lang="de")            # => "ueber"
sort_key("Война и мир", lang="ru")     # => "voyna i mir"
sort_key("Café")                       # => "cafe"
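
A sort key is meant to be passed as the `key` argument when ordering a list. The sketch below shows that usage with a hypothetical stand-in key function; `sort_titles` and `stand_in_sort_key` are not library names, and the real `sort_key` would presumably take the stand-in's place. The original string is used as a tie-breaker so that titles with equal keys still sort deterministically.

```python
from typing import Callable, Iterable, List


def stand_in_sort_key(text: str) -> str:
    # Hypothetical stand-in: case-fold and collapse whitespace only.
    return " ".join(text.casefold().split())


def sort_titles(titles: Iterable[str], key_fn: Callable[[str], str]) -> List[str]:
    """Sort by normalized key, breaking ties on the original string."""
    return sorted(titles, key=lambda title: (key_fn(title), title))


print(sort_titles(["banana", "  Cherry", "apple", "Apple"], stand_in_sort_key))
```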

sanitize_user_input¶

sanitize_user_input(text: str) -> str

Sanitize user-submitted input for web applications.

Preserves the original script (no transliteration) while neutralizing common attack vectors: zalgo stacking, homoglyph spoofing, bidi overrides, zero-width injections, and control characters.

Pipeline: NFKC → strip_zalgo → confusables → strip_bidi → collapse_whitespace

Parameters:	`text` (`str`) – User-submitted input string.

Returns:	`str` – Sanitized string safe for storage and display.

Examples:

>>> sanitize_user_input("Hello, world!")
'Hello, world!'
>>> sanitize_user_input("p\u0430ypal")  # Cyrillic а → Latin a
'paypal'
>>> sanitize_user_input("admin\u202euser")  # RLO stripped
'adminuser'

Pipeline steps¶

NFKC → strip_zalgo → confusables → strip_bidi → collapse_whitespace

from translit import sanitize_user_input

sanitize_user_input("Hello, world!")        # => "Hello, world!"
sanitize_user_input("p\u0430ypal")          # => "paypal" (Cyrillic а → Latin a)
sanitize_user_input("admin\u202Euser")      # => "adminuser" (bidi override stripped)

Unlike security_clean, this pipeline also strips zalgo text (excessive combining mark stacking). Unlike catalog_key/search_key, it does not transliterate — the original script is preserved.
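
One plausible way to neutralize zalgo stacking is to cap the number of consecutive combining marks that may follow a base character. The sketch below does that with the standard library. The function name `cap_combining_marks` and the default cap of 2 are assumptions made for illustration; this reference does not state the threshold or the method the library actually uses.

```python
import unicodedata


def cap_combining_marks(text: str, max_marks: int = 2) -> str:
    """Keep at most max_marks consecutive combining marks per base character."""
    out = []
    run = 0  # combining marks seen since the last base character
    for ch in text:
        # Mn = nonspacing mark, Me = enclosing mark.
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # drop marks beyond the cap
        else:
            run = 0
        out.append(ch)
    return "".join(out)


zalgo = "a" + "\u0301" * 10 + "b"
print(len(zalgo), len(cap_combining_marks(zalgo)))
```

Capping rather than removing all marks keeps legitimate text intact: a decomposed "é" (base letter plus one combining acute) passes through unchanged.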

PRESETS¶

from translit import PRESETS

Dict mapping preset function names to their ordered pipeline steps. Each value is a list of (step_name, parameter) tuples in execution order.

>>> from translit import PRESETS
>>> PRESETS["security_clean"]
[('normalize', 'NFKC'), ('confusables', 'latin'), ('strip_bidi', None), ('collapse_whitespace', None)]
>>> PRESETS["sanitize_user_input"]
[('normalize', 'NFKC'), ('strip_zalgo', None), ('confusables', 'latin'), ('strip_bidi', None), ('collapse_whitespace', None)]

Use PRESETS to audit exactly which transforms a preset applies, or to build equivalent TextPipeline configurations.
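
Because each preset is an ordered list of (step_name, parameter) tuples, a list in that shape can be interpreted with a small dispatch table. The sketch below is a standard-library illustration of the idea, not the library's `TextPipeline`. `STEP_IMPLS`, `run_steps`, and `EXAMPLE_STEPS` are hypothetical names, and only three step names are implemented.

```python
import re
import unicodedata
from typing import Callable, Dict, List, Optional, Tuple

BIDI_CONTROLS = (
    "\u061c\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)

# Hypothetical stdlib implementations keyed by step name.
STEP_IMPLS: Dict[str, Callable[[str, Optional[str]], str]] = {
    "normalize": lambda text, form: unicodedata.normalize(form, text),
    "strip_bidi": lambda text, _: "".join(
        ch for ch in text if ch not in BIDI_CONTROLS
    ),
    "collapse_whitespace": lambda text, _: re.sub(r"\s+", " ", text).strip(),
}

# Same tuple shape as a PRESETS value, restricted to the steps above.
EXAMPLE_STEPS: List[Tuple[str, Optional[str]]] = [
    ("normalize", "NFKC"),
    ("strip_bidi", None),
    ("collapse_whitespace", None),
]


def run_steps(text: str, steps: List[Tuple[str, Optional[str]]]) -> str:
    """Apply each (step_name, parameter) tuple in order."""
    for name, param in steps:
        text = STEP_IMPLS[name](text, param)  # KeyError on unknown steps
    return text


print(run_steps("\uff28\uff49\u202e  there", EXAMPLE_STEPS))
```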

Policy Profiles¶

Named policy profiles provide pre-configured TextPipeline instances for common institutional and application workflows.

get_pipeline¶

from translit import get_pipeline

pipe = get_pipeline("scholarly_cyrillic_iso9")
pipe("Москва")   # => "moskva"

Returns a fresh TextPipeline configured for the named profile. Raises TranslitError for unknown profiles.

list_profiles¶

from translit import list_profiles

print(list_profiles())
# ['library_catalog_key_eu', 'ml_corpus_normalize', 'scholarly_cyrillic_iso9',
#  'search_index', 'web_input_sanitize']

Returns a sorted list of available profile names.

Available profiles¶

| Profile | Steps | Output |
|---|---|---|
| `scholarly_cyrillic_iso9` | NFKC → transliterate (ISO 9) → fold_case → collapse_whitespace | UTF-8 |
| `library_catalog_key_eu` | NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace | ASCII |
| `web_input_sanitize` | NFKC → confusables → collapse_whitespace | UTF-8 |
| `ml_corpus_normalize` | NFKC → demojize → strip_accents → fold_case → collapse_whitespace | ASCII |
| `search_index` | NFKC → transliterate → strip_accents → fold_case → collapse_whitespace | ASCII |
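
The lookup behaviour described for `get_pipeline` and `list_profiles` (a sorted name list, a fresh object per call, an error for unknown names) can be modelled as a small registry. The sketch below is an assumption-laden illustration: `PROFILE_STEPS`, `profile_names`, `steps_for`, and `UnknownProfileError` are hypothetical names, the error class only stands in for the library's `TranslitError`, and the step lists are copied from three rows of the table above.

```python
from typing import Dict, List


class UnknownProfileError(KeyError):
    """Hypothetical stand-in for the library's TranslitError."""


PROFILE_STEPS: Dict[str, List[str]] = {
    "web_input_sanitize": ["NFKC", "confusables", "collapse_whitespace"],
    "scholarly_cyrillic_iso9": [
        "NFKC", "transliterate (ISO 9)", "fold_case", "collapse_whitespace",
    ],
    "search_index": [
        "NFKC", "transliterate", "strip_accents", "fold_case",
        "collapse_whitespace",
    ],
}


def profile_names() -> List[str]:
    """Return the registered profile names in sorted order."""
    return sorted(PROFILE_STEPS)


def steps_for(name: str) -> List[str]:
    """Return a fresh copy of a profile's steps, or raise for unknown names."""
    try:
        return list(PROFILE_STEPS[name])
    except KeyError:
        raise UnknownProfileError(name) from None


print(profile_names())
```

Returning a copy on every call mirrors the "fresh" wording above: a caller can modify what it receives without affecting the registry or other callers.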

See Policy Templates for detailed usage guidance and institutional recipes.