Precompiled Pipelines¶
Ready-to-use multi-step text processing pipelines. Each is a single compiled Rust function with no pipeline construction overhead at call time.
security_clean¶
security_clean(text: str) -> str
Security-focused text canonicalization.
Pipeline: NFKC → confusables → strip bidi/format → collapse_whitespace
Collapses fullwidth bypasses, neutralizes homoglyph spoofing, strips dangerous bidi overrides and soft hyphens, then normalizes whitespace (collapsing runs, stripping control chars and zero-width injections).
Examples:
>>> security_clean("Ηello Ꮤorld") # Greek Η + Cherokee Ꮤ → Latin
'Hello World'
Pipeline steps¶
NFKC → confusables → strip bidi/format → collapse_whitespace
from translit import security_clean
security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥") # => "Real text"
security_clean("Ηello Ꮤorld") # => "Hello World" (Greek Η + Cherokee Ꮤ → Latin)
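To see what each stage contributes, the pipeline can be approximated with the standard library. This is only a sketch, not the library's implementation: the name approx_security_clean and the three-entry CONFUSABLES table are hypothetical, and a real confusables step would draw on the full Unicode confusables data.

```python
import unicodedata

# Hypothetical three-entry stand-in for a confusables table (illustration only).
CONFUSABLES = str.maketrans({
    "Η": "H",       # Greek capital eta
    "Ꮤ": "W",       # Cherokee letter that resembles Latin W
    "\u0430": "a",  # Cyrillic small a
})


def approx_security_clean(text: str) -> str:
    # 1. NFKC folds compatibility forms (fullwidth, double-struck, ...) to plain letters.
    text = unicodedata.normalize("NFKC", text)
    # 2. Map look-alike characters to their Latin counterparts.
    text = text.translate(CONFUSABLES)
    # 3. Drop format characters (category Cf): bidi overrides, soft hyphens, zero-width spaces.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # 4. Collapse whitespace runs and trim the ends.
    return " ".join(text.split())


print(approx_security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥"))    # Real text
print(approx_security_clean("Ηello Ꮤorld"))  # Hello World
```

The order matters in this sketch: running NFKC first means the confusables table only has to cover characters that survive compatibility folding.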
ml_normalize¶
ml_normalize(text: str, *, lang: str | None = None, emoji: str = 'cldr') -> str
ML/NLP text normalization pipeline.
Pipeline: NFKC → emoji→text → [transliterate] → strip_accents → fold_case → collapse_whitespace
Produces clean, accent-free, lowercased text suitable for tokenizers, embeddings, and feature extraction. Emoji are expanded to their CLDR short-name descriptions.
Examples:
>>> ml_normalize("Café RÉSUMÉ")
'cafe resume'
>>> ml_normalize("München", lang="de")
'muenchen'
Pipeline steps¶
NFKC → emoji→text → [transliterate] → strip_accents → fold_case → collapse_whitespace
from translit import ml_normalize
ml_normalize("Café RÉSUMÉ") # => "cafe resume"
ml_normalize("München", lang="de") # => "muenchen"
ml_normalize("I ❤️ Python 🐍") # => "i red heart python snake"
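The strip_accents and fold_case stages can be approximated with unicodedata, which may help when reasoning about what the output of this pipeline looks like. The sketch below is an assumption-laden stand-in: approx_strip_and_fold is a hypothetical name, and the emoji and transliteration stages are left out entirely.

```python
import unicodedata


def approx_strip_and_fold(text: str) -> str:
    # NFKC first, mirroring the first stage of the documented pipeline.
    text = unicodedata.normalize("NFKC", text)
    # Decompose so that accents become separate combining marks, then drop them.
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold and collapse whitespace runs.
    return " ".join(no_marks.casefold().split())


print(approx_strip_and_fold("Café RÉSUMÉ"))  # cafe resume
print(approx_strip_and_fold("München"))      # munchen
```

Note that plain accent stripping turns "München" into "munchen"; the "muenchen" result shown above depends on the language-aware transliteration selected with lang="de", which this sketch does not attempt.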
catalog_key¶
catalog_key(text: str, *, lang: str | None = None, strict_iso9: bool = False) -> str
Library catalog key generation pipeline.
Pipeline: NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace
Produces a canonical deduplication key for bibliographic titles.
Examples:
>>> catalog_key(" Café RÉSUMÉ ")
'cafe resume'
>>> catalog_key("ΩMEGA café")
'omega cafe'
Pipeline steps¶
NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace
from translit import catalog_key
catalog_key(" Café RÉSUMÉ ") # => "cafe resume"
catalog_key("Москва", lang="ru") # => "moskva"
catalog_key("Москва", lang="auto") # => "moskva" (auto-detects Russian)
catalog_key("Müller", lang="de") # => "mueller"
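A deduplication key is typically used to group records that should be treated as the same title. The sketch below shows that grouping step in isolation. The helper names group_by_key and simple_key are hypothetical; simple_key only folds case and whitespace so that the example runs without the library, and in practice catalog_key would presumably be passed as key_fn instead.

```python
from collections import defaultdict
from typing import Callable, Iterable


def group_by_key(titles: Iterable[str], key_fn: Callable[[str], str]) -> dict[str, list[str]]:
    """Group titles that share the same key, preserving input order within a group."""
    groups: dict[str, list[str]] = defaultdict(list)
    for title in titles:
        groups[key_fn(title)].append(title)
    return dict(groups)


def simple_key(text: str) -> str:
    # Stand-in key: case-fold and collapse whitespace only (assumption for illustration).
    return " ".join(text.casefold().split())


titles = ["  War and Peace ", "WAR AND   PEACE", "Anna Karenina"]
groups = group_by_key(titles, simple_key)

# Keep only keys that more than one record maps to.
duplicates = {key: items for key, items in groups.items() if len(items) > 1}
print(duplicates)
```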
display_clean¶
display_clean(text: str) -> str
Display-safe text cleaning pipeline.
Pipeline: strip bidi/format → collapse_whitespace (strip control + strip zero-width)
Lightweight cleanup for user-submitted content destined for rendering. Strips bidirectional overrides (which can visually reorder text to hide malicious content), soft hyphens, control characters, and zero-width injections, then collapses runs of whitespace to single spaces.
Examples:
>>> display_clean("hello\x00world\u200b!")
'helloworld!'
>>> display_clean(" spaced out ")
'spaced out'
Pipeline steps¶
strip_bidi → strip_control → strip_zero_width → collapse_whitespace
from translit import display_clean
display_clean("hello\x00world\u200b!") # => "helloworld!"
display_clean(" spaced out ") # => "spaced out"
display_clean("admin\u202Euser") # => "adminuser" (bidi override stripped)
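The stripping steps can be pictured in terms of Unicode general categories: format characters are category Cf and control characters are category Cc. The following standard-library sketch reproduces the three examples above under that assumption; approx_display_clean is a hypothetical name, and the library's actual character sets may differ.

```python
import unicodedata


def approx_display_clean(text: str) -> str:
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # Cf covers bidi overrides (U+202E), soft hyphens (U+00AD) and zero-width spaces (U+200B).
        if category == "Cf":
            continue
        # Drop control characters, but keep whitespace controls (tab, newline)
        # so that the collapse step can turn them into single spaces.
        if category == "Cc" and not ch.isspace():
            continue
        kept.append(ch)
    # Collapse whitespace runs and trim the ends.
    return " ".join("".join(kept).split())


print(approx_display_clean("hello\x00world\u200b!"))  # helloworld!
print(approx_display_clean("admin\u202Euser"))        # adminuser
```

Invisible characters are removed before whitespace is collapsed so that deleting one between two spaces cannot leave a double space behind.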
search_key¶
search_key(text: str, *, lang: str | None = None) -> str
Search index key generation pipeline.
Pipeline: NFKC → transliterate → strip_accents → fold_case → collapse_whitespace
Produces a case-insensitive, accent-insensitive, script-insensitive lookup key. Like catalog_key but without confusable normalization, so it is lighter and faster for search indexes.
Examples:
>>> search_key(" Café RÉSUMÉ ")
'cafe resume'
>>> search_key("Москва")
'moskva'
>>> search_key("Über allen Gipfeln")
'uber allen gipfeln'
Pipeline steps¶
NFKC → transliterate → strip_accents → fold_case → collapse_whitespace
from translit import search_key
search_key("Café RÉSUMÉ") # => "cafe resume"
search_key("Москва", lang="ru") # => "moskva"
search_key("ΩMEGA", lang="auto") # => "omega"
sort_key¶
sort_key(text: str, *, lang: str | None = None) -> str
Sort key generation pipeline.
Pipeline: NFKC → transliterate → fold_case → collapse_whitespace
Like search_key but without accent stripping, preserving base accented characters for correct alphabetical ordering.
Examples:
>>> sort_key("Война и мир")
'voyna i mir'
>>> sort_key("Über allen Gipfeln")
'uber allen gipfeln'
>>> sort_key(" Café ")
'cafe'
Pipeline steps¶
NFKC → transliterate → fold_case → collapse_whitespace
from translit import sort_key
sort_key("Über", lang="de") # => "ueber"
sort_key("Война и мир", lang="ru") # => "voyna i mir"
sort_key("Café") # => "cafe"
sanitize_user_input¶
sanitize_user_input(text: str) -> str
Sanitize user-submitted input for web applications.
Preserves the original script (no transliteration) while neutralizing common attack vectors: zalgo stacking, homoglyph spoofing, bidi overrides, zero-width injections, and control characters.
Pipeline: NFKC → strip_zalgo → confusables → strip_bidi → collapse_whitespace
Examples:
>>> sanitize_user_input("Hello, world!")
'Hello, world!'
>>> sanitize_user_input("p\u0430ypal") # Cyrillic а → Latin a
'paypal'
>>> sanitize_user_input("admin\u202euser") # RLO stripped
'adminuser'
Pipeline steps¶
NFKC → strip_zalgo → confusables → strip_bidi → collapse_whitespace
from translit import sanitize_user_input
sanitize_user_input("Hello, world!") # => "Hello, world!"
sanitize_user_input("p\u0430ypal") # => "paypal" (Cyrillic а → Latin a)
sanitize_user_input("admin\u202Euser") # => "adminuser" (bidi override stripped)
Unlike security_clean, this pipeline also strips zalgo text (excessive combining mark stacking). Unlike catalog_key/search_key, it does not transliterate — the original script is preserved.
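One way to picture zalgo stripping is as a cap on how many combining marks may follow a base character. The sketch below illustrates that idea with the standard library. It is not the library's algorithm: the function name cap_combining_marks and the default threshold of 2 are assumptions chosen for the example.

```python
import unicodedata


def cap_combining_marks(text: str, max_marks: int = 2) -> str:
    """Keep at most max_marks consecutive combining marks after each base character."""
    out = []
    run = 0  # length of the current run of combining marks
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # excess mark: drop it
        else:
            run = 0
        out.append(ch)
    return "".join(out)


zalgo = "e" + "\u0301" * 10 + "x"
cleaned = cap_combining_marks(zalgo)
print(len(zalgo), len(cleaned))  # 12 4
```

A cap above zero leaves ordinary accented text intact, which fits the stated goal of preserving the original script.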
PRESETS¶
from translit import PRESETS
Dict mapping preset function names to their ordered pipeline steps. Each value is a list of (step_name, parameter) tuples in execution order.
>>> from translit import PRESETS
>>> PRESETS["security_clean"]
[('normalize', 'NFKC'), ('confusables', 'latin'), ('strip_bidi', None), ('collapse_whitespace', None)]
>>> PRESETS["sanitize_user_input"]
[('normalize', 'NFKC'), ('strip_zalgo', None), ('confusables', 'latin'), ('strip_bidi', None), ('collapse_whitespace', None)]
Use PRESETS to audit exactly which transforms a preset applies, or to build equivalent TextPipeline configurations.
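An audit of this kind might look like the sketch below. To keep it runnable without the library, the two step lists are copied by hand from the output shown above; real code would read them from PRESETS. The helpers describe and extra_steps are hypothetical names.

```python
# Step lists copied from the PRESETS output shown above (not read from the library here).
SECURITY_CLEAN = [
    ("normalize", "NFKC"),
    ("confusables", "latin"),
    ("strip_bidi", None),
    ("collapse_whitespace", None),
]
SANITIZE_USER_INPUT = [
    ("normalize", "NFKC"),
    ("strip_zalgo", None),
    ("confusables", "latin"),
    ("strip_bidi", None),
    ("collapse_whitespace", None),
]


def describe(steps):
    """Render a step list in arrow notation, showing parameters where present."""
    return " → ".join(
        name if param is None else f"{name}({param})" for name, param in steps
    )


def extra_steps(steps, baseline):
    """Return the steps that appear in steps but not in baseline, in order."""
    return [step for step in steps if step not in baseline]


print(describe(SECURITY_CLEAN))
# normalize(NFKC) → confusables(latin) → strip_bidi → collapse_whitespace
print(extra_steps(SANITIZE_USER_INPUT, SECURITY_CLEAN))
# [('strip_zalgo', None)]
```

The difference between the two lists is the single strip_zalgo step, which matches the comparison made in the sanitize_user_input section.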
Policy Profiles¶
Named policy profiles provide pre-configured TextPipeline instances for common institutional and application workflows.
get_pipeline¶
from translit import get_pipeline
pipe = get_pipeline("scholarly_cyrillic_iso9")
pipe("Москва") # → "moskva"
Returns a fresh TextPipeline configured for the named profile. Raises TranslitError for unknown profiles.
list_profiles¶
from translit import list_profiles
print(list_profiles())
# ['library_catalog_key_eu', 'ml_corpus_normalize', 'scholarly_cyrillic_iso9',
# 'search_index', 'web_input_sanitize']
Returns a sorted list of the available profile names.
Available profiles¶
| Profile | Steps | Output |
|---|---|---|
| scholarly_cyrillic_iso9 | NFKC → transliterate (ISO 9) → fold_case → collapse_whitespace | UTF-8 |
| library_catalog_key_eu | NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace | ASCII |
| web_input_sanitize | NFKC → confusables → collapse_whitespace | UTF-8 |
| ml_corpus_normalize | NFKC → demojize → strip_accents → fold_case → collapse_whitespace | ASCII |
| search_index | NFKC → transliterate → strip_accents → fold_case → collapse_whitespace | ASCII |
See Policy Templates for detailed usage guidance and institutional recipes.