Precompiled Pipelines

Ready-to-use multi-step text processing pipelines. Each is a single compiled Rust function with no pipeline construction overhead at call time.

security_clean

security_clean(text: str) -> str

Security-focused text canonicalization.

Pipeline: NFKC → confusables → strip bidi/format → collapse_whitespace

Collapses fullwidth bypasses, neutralizes homoglyph spoofing, strips dangerous bidi overrides and soft hyphens, then normalizes whitespace (collapsing runs, stripping control chars and zero-width injections).

Parameters:
  • text (str) –

    Input string (user-submitted, network-received, etc.).

Returns:
  • str

    Canonicalized string safe for security-sensitive comparisons.

Examples:

>>> security_clean("Ηello Ꮤorld")  # Greek Η + Cherokee Ꮤ → Latin
'Hello World'

Pipeline steps

NFKC → confusables → strip bidi/format → collapse_whitespace

from translit import security_clean

security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥")   # => "Real text"
security_clean("Ηello Ꮤorld")    # => "Hello World"  (Greek Η + Cherokee Ꮤ → Latin)
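
The following standard-library sketch may help show why the pipeline has a confusables step in addition to NFKC. It does not use translit's implementation: `TINY_CONFUSABLES` and `fold_confusables` are hypothetical stand-ins that cover only the two look-alike characters from the example above.

```python
import unicodedata

# Hypothetical two-entry table for illustration only; a real confusables
# table is far larger than this.
TINY_CONFUSABLES = {
    "Η": "H",  # Greek capital eta, as in the example above
    "Ꮤ": "W",  # Cherokee letter, as in the example above
}


def fold_confusables(text: str) -> str:
    """Replace each character found in the tiny table with its Latin look-alike."""
    return "".join(TINY_CONFUSABLES.get(ch, ch) for ch in text)


# NFKC alone folds compatibility variants such as double-struck and fullwidth letters.
styled = unicodedata.normalize("NFKC", "ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥")
fullwidth = unicodedata.normalize("NFKC", "ａｄｍｉｎ")

# It leaves cross-script homoglyphs untouched, because they are distinct letters.
spoofed = unicodedata.normalize("NFKC", "Ηello Ꮤorld")
folded = fold_confusables(spoofed)

print(styled, fullwidth, spoofed == "Hello World", folded)
```

In this sketch NFKC handles the styled and fullwidth inputs, while the Greek and Cherokee letters only become Latin after the separate mapping step.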

ml_normalize

ml_normalize(text: str, *, lang: str | None = None, emoji: str = 'cldr') -> str

ML/NLP text normalization pipeline.

Pipeline: NFKC → emoji→text → [transliterate] → strip_accents → fold_case → collapse_whitespace

Produces clean, accent-free, lowercased text suitable for tokenizers, embeddings, and feature extraction. Emoji are expanded to their CLDR short-name descriptions.

Parameters:
  • text (str) –

    Input Unicode string.

  • lang (str | None, default: None) –

    Optional language code for transliteration (e.g. "de", "ja").

  • emoji (str, default: 'cldr') –

    Emoji handling mode: "cldr" expands emoji to CLDR short names (default); "none" leaves emoji characters unchanged.

Returns:
  • str

    Clean, accent-free, lowercased text.

Raises:
  • TranslitError

    If emoji is not "cldr" or "none", or if an internal Rust error occurs.

Examples:

>>> ml_normalize("Café RÉSUMÉ")
'cafe resume'
>>> ml_normalize("München", lang="de")
'muenchen'

Pipeline steps

NFKC → emoji→text → [transliterate] → strip_accents → fold_case → collapse_whitespace

from translit import ml_normalize

ml_normalize("Café RÉSUMÉ")         # => "cafe resume"
ml_normalize("München", lang="de")  # => "muenchen"
ml_normalize("I ❤️ Python 🐍")      # => "i red heart python snake"
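
As a rough illustration of the strip_accents, fold_case, and collapse_whitespace steps, here is a standard-library approximation. It is a sketch under the assumption that accent stripping means decomposing and dropping combining marks; `strip_accents` and `approx_normalize` are hypothetical names, and the emoji and transliteration steps are deliberately left out.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def approx_normalize(text: str) -> str:
    """NFKC, strip accents, case-fold, collapse whitespace (no emoji or transliteration step)."""
    text = unicodedata.normalize("NFKC", text)
    text = strip_accents(text)
    text = text.casefold()
    # str.split() with no argument splits on any whitespace run and trims the ends.
    return " ".join(text.split())


print(approx_normalize("Café RÉSUMÉ"))
print(approx_normalize("München"))
```

Note that plain accent stripping turns "München" into "munchen", not the "muenchen" shown for lang="de" above. That difference is why the language-aware transliterate step sits before strip_accents in the pipeline.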

catalog_key

catalog_key(text: str, *, lang: str | None = None, strict_iso9: bool = False) -> str

Library catalog key generation pipeline.

Pipeline: NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace

Produces a canonical deduplication key for bibliographic titles.

Parameters:
  • text (str) –

    Input title or heading.

  • lang (str | None, default: None) –

    Language code for transliteration (e.g. "ru", "ja"), or "auto" to detect the language from the text.

  • strict_iso9 (bool, default: False) –

    Use ISO 9:1995 scholarly transliteration for Cyrillic.

Returns:
  • str

    Canonical deduplication key string.

Raises:
  • TranslitError

    If an internal Rust error occurs.

Examples:

>>> catalog_key("  Café  RÉSUMÉ  ")
'cafe resume'
>>> catalog_key("ΩMEGA  café")
'omega cafe'

Pipeline steps

NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace

from translit import catalog_key

catalog_key("  Café  RÉSUMÉ  ")       # => "cafe resume"
catalog_key("Москва", lang="ru")      # => "moskva"
catalog_key("Москва", lang="auto")    # => "moskva" (auto-detects Russian)
catalog_key("Müller", lang="de")      # => "mueller"
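
One way a deduplication key might be used is to group titles whose keys collide. The sketch below takes the key function as a parameter so it runs without the library; `group_duplicates` and `stand_in_key` are hypothetical helpers, and `stand_in_key` only case-folds and collapses whitespace, which is far less than catalog_key does.

```python
from collections import defaultdict
from typing import Callable, Iterable


def group_duplicates(
    titles: Iterable[str], key_fn: Callable[[str], str]
) -> dict[str, list[str]]:
    """Group titles by key and keep only groups with more than one member."""
    groups: dict[str, list[str]] = defaultdict(list)
    for title in titles:
        groups[key_fn(title)].append(title)
    return {key: members for key, members in groups.items() if len(members) > 1}


def stand_in_key(text: str) -> str:
    """Hypothetical placeholder for catalog_key: case-fold and collapse whitespace only."""
    return " ".join(text.casefold().split())


titles = ["War and Peace", "  war  and PEACE ", "Anna Karenina"]
duplicates = group_duplicates(titles, stand_in_key)
print(duplicates)
```

With the library installed, passing catalog_key (or a lambda that fixes lang) as `key_fn` should group titles that differ in script, accents, or confusable characters as well.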

display_clean

display_clean(text: str) -> str

Display-safe text cleaning pipeline.

Pipeline: strip bidi/format → collapse_whitespace (strip control + strip zero-width)

Lightweight cleanup for user-submitted content destined for rendering. Strips bidirectional overrides (which can visually reorder text to hide malicious content), soft hyphens, control characters, and zero-width injections, then collapses runs of whitespace to single spaces.

Parameters:
  • text (str) –

    Input string (user-submitted content).

Returns:
  • str

    Cleaned string safe for display rendering.

Examples:

>>> display_clean("hello\x00world\u200b!")
'helloworld!'
>>> display_clean("  spaced   out  ")
'spaced out'

Pipeline steps

strip_bidi → strip_control → strip_zero_width → collapse_whitespace

from translit import display_clean

display_clean("hello\x00world\u200b!")  # => "helloworld!"
display_clean("  spaced   out  ")       # => "spaced out"
display_clean("admin\u202Euser")        # => "adminuser" (bidi override stripped)
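
For intuition, the same kind of cleanup can be approximated with Unicode general categories from the standard library. This is a sketch, not the library's code: `approx_display_clean` is a hypothetical name, and it assumes that dropping every control (Cc) and format (Cf) character is acceptable.

```python
import unicodedata


def approx_display_clean(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, then collapse whitespace."""
    kept = []
    for ch in text:
        if ch.isspace():
            kept.append(ch)  # whitespace, including tab and newline, is collapsed below
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            continue  # NUL, bidi overrides, soft hyphen, zero-width characters
        else:
            kept.append(ch)
    return " ".join("".join(kept).split())


print(approx_display_clean("hello\x00world\u200b!"))
print(approx_display_clean("admin\u202Euser"))
```

Removing all of Cf is coarser than a production filter may want: the category also contains the zero-width joiner and non-joiner, which emoji sequences and some scripts rely on.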

search_key

search_key(text: str, *, lang: str | None = None) -> str

Search index key generation pipeline.

Pipeline: NFKC → transliterate → strip_accents → fold_case → collapse_whitespace

Produces a case-insensitive, accent-insensitive, script-insensitive lookup key. Like catalog_key but without confusable normalization, which makes it lighter and faster for search indexes.

Parameters:
  • text (str) –

    Input text to generate a search key from.

  • lang (str | None, default: None) –

    Language code for transliteration (e.g. "ru", "de").

Returns:
  • str

    Normalized search key string.

Examples:

>>> search_key("  Café  RÉSUMÉ  ")
'cafe resume'
>>> search_key("Москва")
'moskva'
>>> search_key("Über allen Gipfeln")
'uber allen gipfeln'

Pipeline steps

NFKC → transliterate → strip_accents → fold_case → collapse_whitespace

from translit import search_key

search_key("Café RÉSUMÉ")              # => "cafe resume"
search_key("Москва", lang="ru")        # => "moskva"
search_key("ΩMEGA", lang="auto")       # => "omega"

sort_key

sort_key(text: str, *, lang: str | None = None) -> str

Sort key generation pipeline.

Pipeline: NFKC → transliterate → fold_case → collapse_whitespace

Like search_key but without the accent-stripping step, so accented characters are preserved for correct alphabetical ordering.

Parameters:
  • text (str) –

    Input text to generate a sort key from.

  • lang (str | None, default: None) –

    Language code for transliteration (e.g. "ru", "de").

Returns:
  • str

    Normalized sort key string.

Examples:

>>> sort_key("Война и мир")
'voyna i mir'
>>> sort_key("Über allen Gipfeln")
'uber allen gipfeln'
>>> sort_key("  Café  ")
'cafe'

Pipeline steps

NFKC → transliterate → fold_case → collapse_whitespace

from translit import sort_key

sort_key("Über", lang="de")            # => "ueber"
sort_key("Война и мир", lang="ru")     # => "voyna i mir"
sort_key("Café")                       # => "cafe"

sanitize_user_input

sanitize_user_input(text: str) -> str

Sanitize user-submitted input for web applications.

Preserves the original script (no transliteration) while neutralizing common attack vectors: zalgo stacking, homoglyph spoofing, bidi overrides, zero-width injections, and control characters.

Pipeline: NFKC → strip_zalgo → confusables → strip_bidi → collapse_whitespace

Parameters:
  • text (str) –

    User-submitted input string.

Returns:
  • str

    Sanitized string safe for storage and display.

Examples:

>>> sanitize_user_input("Hello, world!")
'Hello, world!'
>>> sanitize_user_input("p\u0430ypal")  # Cyrillic а → Latin a
'paypal'
>>> sanitize_user_input("admin\u202euser")  # RLO stripped
'adminuser'

Pipeline steps

NFKC → strip_zalgo → confusables → strip_bidi → collapse_whitespace

from translit import sanitize_user_input

sanitize_user_input("Hello, world!")        # => "Hello, world!"
sanitize_user_input("p\u0430ypal")          # => "paypal" (Cyrillic а → Latin a)
sanitize_user_input("admin\u202Euser")      # => "adminuser" (bidi override stripped)

Unlike security_clean, this pipeline also strips zalgo text (excessive combining mark stacking). Unlike catalog_key/search_key, it does not transliterate, so the original script is preserved.
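
To make "excessive combining mark stacking" concrete, the sketch below caps the number of consecutive combining marks after each base character. The helper name `cap_combining_marks` and the threshold of 2 are assumptions chosen for illustration; the rule the library actually applies is not documented here.

```python
import unicodedata


def cap_combining_marks(text: str, max_marks: int = 2) -> str:
    """Keep at most `max_marks` consecutive combining marks.

    A character counts as a mark here when its canonical combining class
    is non-zero, which is what unicodedata.combining() reports.
    """
    out = []
    run = 0  # length of the current run of combining marks
    for ch in text:
        if unicodedata.combining(ch):
            run += 1
            if run > max_marks:
                continue  # drop the excess mark
        else:
            run = 0
        out.append(ch)
    return "".join(out)


zalgo = "e" + "\u0301" * 10 + "x"
print(len(zalgo), len(cap_combining_marks(zalgo)))
```

Ordinary accented text passes through unchanged, since legitimate characters rarely carry more than one or two marks.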


PRESETS

from translit import PRESETS

Dict mapping preset function names to their ordered pipeline steps. Each value is a list of (step_name, parameter) tuples in execution order.

>>> from translit import PRESETS
>>> PRESETS["security_clean"]
[('normalize', 'NFKC'), ('confusables', 'latin'), ('strip_bidi', None), ('collapse_whitespace', None)]
>>> PRESETS["sanitize_user_input"]
[('normalize', 'NFKC'), ('strip_zalgo', None), ('confusables', 'latin'), ('strip_bidi', None), ('collapse_whitespace', None)]

Use PRESETS to audit exactly which transforms a preset applies, or to build equivalent TextPipeline configurations.
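
A small audit helper along these lines could render and compare presets. To stay runnable without the library, the sketch hard-codes the two step lists exactly as printed above; `describe` and `extra_steps` are hypothetical helpers, not part of translit.

```python
# Values copied from the PRESETS output shown above, so the sketch runs without the library.
SECURITY_CLEAN = [
    ("normalize", "NFKC"),
    ("confusables", "latin"),
    ("strip_bidi", None),
    ("collapse_whitespace", None),
]
SANITIZE_USER_INPUT = [
    ("normalize", "NFKC"),
    ("strip_zalgo", None),
    ("confusables", "latin"),
    ("strip_bidi", None),
    ("collapse_whitespace", None),
]


def describe(steps):
    """Render (step_name, parameter) tuples as an arrow-separated summary."""
    return " → ".join(
        name if param is None else f"{name}({param})" for name, param in steps
    )


def extra_steps(steps, baseline):
    """Return the steps present in `steps` but absent from `baseline`, in order."""
    return [step for step in steps if step not in baseline]


print(describe(SECURITY_CLEAN))
print(extra_steps(SANITIZE_USER_INPUT, SECURITY_CLEAN))
```

With the library installed, the same helpers should work on `PRESETS["security_clean"]` and `PRESETS["sanitize_user_input"]` directly, given the value shape described above.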


Policy Profiles

Named policy profiles provide pre-configured TextPipeline instances for common institutional and application workflows.

get_pipeline

from translit import get_pipeline

pipe = get_pipeline("scholarly_cyrillic_iso9")
pipe("Москва")   # → "moskva"

Returns a fresh TextPipeline configured for the named profile. Raises TranslitError for unknown profiles.

list_profiles

from translit import list_profiles

print(list_profiles())
# ['library_catalog_key_eu', 'ml_corpus_normalize', 'scholarly_cyrillic_iso9',
#  'search_index', 'web_input_sanitize']

Returns a sorted list of the available profile names.

Available profiles

Profile | Steps | Output
scholarly_cyrillic_iso9 | NFKC → transliterate (ISO 9) → fold_case → collapse_whitespace | UTF-8
library_catalog_key_eu | NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace | ASCII
web_input_sanitize | NFKC → confusables → collapse_whitespace | UTF-8
ml_corpus_normalize | NFKC → demojize → strip_accents → fold_case → collapse_whitespace | ASCII
search_index | NFKC → transliterate → strip_accents → fold_case → collapse_whitespace | ASCII

See Policy Templates for detailed usage guidance and institutional recipes.