Policy Templates

Pre-built configurations for common institutional and application workflows. Each template is a named policy profile available via get_pipeline(), or a recommended TextPipeline configuration.


Using Policy Profiles

from translit import get_pipeline, list_profiles

# See all available profiles
print(list_profiles())

# Get a configured pipeline
pipe = get_pipeline("scholarly_cyrillic_iso9")
result = pipe("Москва")

Each call to get_pipeline() returns a fresh TextPipeline instance.


Available Profiles

scholarly_cyrillic_iso9

Use case: Academic publishing, linguistic research, library cataloging of Cyrillic texts.

pipe = get_pipeline("scholarly_cyrillic_iso9")
pipe("Юность")   # → "junost"
pipe("Москва")   # → "moskva"
Property Value
Steps NFKC → transliterate (ISO 9) → fold_case → collapse_whitespace
Output charset UTF-8 (ISO 9 diacritics preserved before case folding)
Reversibility Partially (case folding is lossy)
Script coverage All Cyrillic scripts

library_catalog_key_eu

Use case: European public library catalog deduplication, bibliographic key generation.

pipe = get_pipeline("library_catalog_key_eu")
pipe("München — Bayern")   # → "munchen  bayern" or similar
pipe("Città di Firenze")   # → "citta di firenze"
Property Value
Steps NFKC → transliterate → confusables → strip_accents → fold_case → collapse_whitespace
Output charset ASCII
Reversibility No (lossy)
Script coverage All 83 language profiles

web_input_sanitize

Use case: Web form input cleaning, comment sanitization, display-safe text.

pipe = get_pipeline("web_input_sanitize")
pipe("  Hello   World  ")   # → "Hello World"
Property Value
Steps NFKC → confusables → collapse_whitespace
Output charset UTF-8 (original script preserved)
Reversibility No (NFKC is lossy for some characters)
Security Neutralizes confusable homoglyphs

Note

For full protection against zalgo text and bidi injection, use the sanitize_user_input() precompiled pipeline instead — it includes strip_zalgo and strip_bidi steps that TextPipeline does not support.

ml_corpus_normalize

Use case: NLP/ML text preprocessing, corpus normalization, embedding preparation.

pipe = get_pipeline("ml_corpus_normalize")
pipe("Héllo WÖRLD 🎉")   # → "hello world :party_popper:"
Property Value
Steps NFKC → demojize → strip_accents → fold_case → collapse_whitespace
Output charset ASCII + emoji names
Reversibility No (lossy)
Script coverage All scripts

search_index

Use case: Full-text search index generation, cross-language search keys.

pipe = get_pipeline("search_index")
pipe("München")   # → "munchen"
pipe("Москва")    # → "moskva"
Property Value
Steps NFKC → transliterate → strip_accents → fold_case → collapse_whitespace
Output charset ASCII
Reversibility No (lossy)
Script coverage All 83 language profiles

Precompiled Pipelines vs Policy Profiles

Policy profiles use TextPipeline (Python-configurable steps). For maximum performance and security coverage, use the precompiled pipelines instead — they run entirely in Rust:

Need Use
Security-critical input sanitization sanitize_user_input()
Catalog/bibliography keys catalog_key()
Search index keys search_key()
Sort-friendly keys sort_key()
Security canonicalization security_clean()
ML preprocessing ml_normalize()

Policy profiles are best for custom workflows where you need the flexibility of TextPipeline parameters, or when you want symbolic profile names in configuration files.


Custom Institutional Profiles

Organizations can define their own profiles by constructing TextPipeline directly:

from translit import TextPipeline

# Government/legal: strict ASCII, no transliteration (preserve originals)
legal_clean = TextPipeline(
    normalize="NFKC",
    confusables=True,
    fold_case=True,
    collapse_whitespace=True,
)

# Archive/museum: preserve script, minimal normalization
archive_clean = TextPipeline(
    normalize="NFC",
    collapse_whitespace=True,
)