Core Transforms
Functions that transform text. All are pure functions — they never mutate the input.
transliterate
transliterate(text: str, *, lang: str | None = ..., target: str | None = ..., errors: ErrorMode = ..., replace_with: str = ..., strict_iso9: bool = ..., gost7034: bool = ..., tones: bool = ..., context: bool = ...) -> str
transliterate(text: list[str], *, lang: str | None = ..., target: str | None = ..., errors: ErrorMode = ..., replace_with: str = ..., strict_iso9: bool = ..., gost7034: bool = ..., tones: bool = ..., context: bool = ...) -> list[str]
transliterate(text: str | list[str], *, lang: str | None = None, target: str | None = None, errors: ErrorMode = 'replace', replace_with: str = '[?]', strict_iso9: bool = False, gost7034: bool = False, tones: bool = False, context: bool = False) -> str | list[str]
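Transliterate Unicode text to ASCII. As the last example below shows, passing target (for example target="ru") instead converts Latin input into the target script.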
Accepts a single string or a list of strings. When a list is passed, all strings are processed in a single Rust call for better throughput.
Examples:
>>> transliterate("café résumé")
'cafe resume'
>>> transliterate(["café", "naïve"])
['cafe', 'naive']
>>> transliterate("München", lang="de")
'Muenchen'
>>> transliterate("Moskva", target="ru")
'Москва'
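The effect of lang in the examples above can be pictured as a language-specific override table that is consulted before a generic fallback. The following is a minimal pure-Python sketch of that idea, not the library's implementation: LANG_OVERRIDES and sketch_transliterate are hypothetical names, the table covers only a few German characters, and the fallback simply strips accents with the standard library.

```python
import unicodedata

# Hypothetical language-specific overrides, consulted before the generic fallback.
LANG_OVERRIDES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"},
}


def sketch_transliterate(text, lang=None, replace_with="[?]"):
    """Map each character to ASCII: override table, then accent stripping, then a marker."""
    overrides = LANG_OVERRIDES.get(lang, {})
    out = []
    for ch in text:
        if ch in overrides:
            out.append(overrides[ch])
            continue
        if ch.isascii():
            out.append(ch)
            continue
        # Generic fallback: decompose and keep only the ASCII base characters.
        base = "".join(c for c in unicodedata.normalize("NFKD", ch) if c.isascii())
        # Characters with no ASCII base get the replacement marker.
        out.append(base if base else replace_with)
    return "".join(out)
```

With this sketch, "München" becomes "Munchen" by default and "Muenchen" with lang="de". A real transliterator would carry full per-script tables instead of falling back to a marker for scripts such as Cyrillic.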
slugify
slugify(text: str, *, separator: str = ..., lowercase: bool = ..., max_length: int = ..., word_boundary: bool = ..., save_order: bool = ..., stopwords: Iterable[str] = ..., regex_pattern: str | None = ..., replacements: Iterable[tuple[str, str]] = ..., allow_unicode: bool = ..., lang: str | None = ..., entities: bool = ..., decimal: bool = ..., hexadecimal: bool = ...) -> str
slugify(text: list[str], *, separator: str = ..., lowercase: bool = ..., max_length: int = ..., word_boundary: bool = ..., save_order: bool = ..., stopwords: Iterable[str] = ..., regex_pattern: str | None = ..., replacements: Iterable[tuple[str, str]] = ..., allow_unicode: bool = ..., lang: str | None = ..., entities: bool = ..., decimal: bool = ..., hexadecimal: bool = ...) -> list[str]
slugify(text: str | list[str], *, separator: str = '-', lowercase: bool = True, max_length: int = 0, word_boundary: bool = False, save_order: bool = False, stopwords: Iterable[str] = (), regex_pattern: str | None = None, replacements: Iterable[tuple[str, str]] = (), allow_unicode: bool = False, lang: str | None = None, entities: bool = True, decimal: bool = True, hexadecimal: bool = True) -> str | list[str]
Generate a URL-safe slug from Unicode text.
Full pipeline: decode entities → transliterate → lowercase → strip non-alphanumeric → collapse separators → apply stopwords/max_length.
Parameter-compatible with python-slugify.
Examples:
>>> slugify("Hello World!")
'hello-world'
>>> slugify("Straße nach München", lang="de")
'strasse-nach-muenchen'
>>> slugify("My Title", separator="_")
'my_title'
>>> slugify("The Big Fox", stopwords=["the"])
'big-fox'
>>> slugify("Very Long Title Here", max_length=10, word_boundary=True)
'very-long'
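The pipeline stages listed above can be sketched with the standard library alone. This is an illustrative approximation under stated assumptions, not the library's code: sketch_slugify is a hypothetical name, the transliteration stage is reduced to dropping non-ASCII characters after NFKD decomposition (so "ß" is lost rather than expanded), and only a few of the documented parameters are modelled.

```python
import html
import re
import unicodedata


def sketch_slugify(text, separator="-", stopwords=(), max_length=0, word_boundary=False):
    """Follow the documented stages in order; each stage is a simplified stand-in."""
    text = html.unescape(text)                           # decode entities
    text = unicodedata.normalize("NFKD", text)           # crude transliteration:
    text = text.encode("ascii", "ignore").decode()       # drop whatever is not ASCII
    text = text.lower()                                  # lowercase
    words = re.findall(r"[a-z0-9]+", text)               # strip non-alphanumeric
    skip = {w.lower() for w in stopwords}
    words = [w for w in words if w not in skip]          # apply stopwords
    slug = separator.join(words)                         # one separator between words
    if max_length and len(slug) > max_length:
        if word_boundary:
            # Keep whole words only, as long as the joined result still fits.
            kept = []
            for word in words:
                if len(separator.join(kept + [word])) > max_length:
                    break
                kept.append(word)
            slug = separator.join(kept)
        else:
            slug = slug[:max_length]
    return slug
```

In this sketch max_length=0 means "no limit", mirroring the default in the signature above.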
normalize
normalize(text: str, *, form: NormalizationForm = ...) -> str
normalize(text: list[str], *, form: NormalizationForm = ...) -> list[str]
normalize(text: str | list[str], *, form: NormalizationForm = 'NFC') -> str | list[str]
Unicode normalization.
Accepts a single string or a list of strings.
Examples:
>>> normalize("é", form="NFC")
'é'
>>> normalize(["é", "ño"], form="NFC")
['é', 'ño']
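The examples above look like identity operations because composed and decomposed spellings render the same. The snippet below makes the difference visible using the standard library's unicodedata.normalize, which implements the standard Unicode normalization forms. It is shown only to illustrate the concept; which values NormalizationForm accepts beyond 'NFC' is not stated in this section.

```python
import unicodedata

composed = "\u00e9"       # é as one precomposed code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# The two strings render identically but compare unequal until normalized.
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# Compatibility forms additionally fold presentation variants such as ligatures.
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"
```

Normalizing both sides to the same form before comparing or hashing avoids treating these spellings as different strings.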
normalize_confusables
normalize_confusables(text: str, *, target_script: str = 'latin') -> str
Replace Unicode confusable homoglyphs with target-script equivalents.
Uses the Unicode TR39 confusables table. Characters without a confusable equivalent in the target script pass through unchanged; this is a visual mapping only, not transliteration.
Examples:
>>> normalize_confusables("Ηello") # Greek Η looks like Latin H
'Hello'
>>> normalize_confusables("раypal") # Cyrillic р/а look like Latin p/a
'paypal'
>>> normalize_confusables("paypal", target_script="cyrillic")
'раураӏ'
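The pass-through behaviour described above amounts to a per-character table lookup with the character itself as the default. A minimal sketch, assuming a hypothetical three-entry excerpt of the confusables data (TO_LATIN and sketch_to_latin are illustrative names, not part of the library):

```python
# Hypothetical excerpt of a confusables table: look-alike character -> Latin letter.
TO_LATIN = {
    "\u0397": "H",  # GREEK CAPITAL LETTER ETA
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
}


def sketch_to_latin(text):
    """Replace known look-alikes; anything not in the table passes through unchanged."""
    return "".join(TO_LATIN.get(ch, ch) for ch in text)
```

Because the lookup is purely visual, text in scripts with no Latin look-alikes (for example CJK) comes back untouched rather than romanized.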
sanitize_filename
sanitize_filename(text: str, *, separator: str = '_', max_length: int = 255, platform: Platform = 'universal', lang: str | None = None, preserve_extension: bool = True, replacement_text: str | None = None, max_len: int | None = None) -> str
Sanitize a string into a safe filename.
Transliterate → strip OS-illegal chars → collapse separators → handle reserved names (CON, NUL, etc.) → truncate respecting extension.
Examples:
>>> sanitize_filename("My Report (final).pdf")
'My_Report_(final).pdf'
>>> sanitize_filename("CON.txt") # reserved on Windows
'_CON.txt'
>>> sanitize_filename("résumé.docx", lang="fr")
'resume.docx'
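The later stages of the pipeline above (illegal characters, separator collapsing, reserved names, extension-aware truncation) can be sketched as follows. This is an approximation under stated assumptions rather than the library's logic: sketch_sanitize, ILLEGAL, and RESERVED are hypothetical names, the reserved-name set is only a subset, the transliteration stage is omitted, and length is counted in characters rather than bytes.

```python
import os
import re

# Characters rejected by Windows; "/" and NUL are also illegal on POSIX systems.
ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# A subset of the Windows reserved device names.
RESERVED = {"CON", "PRN", "AUX", "NUL", "COM1", "LPT1"}


def sketch_sanitize(name, separator="_", max_length=255):
    """Simplified stand-in for the documented stages (no transliteration step)."""
    name = ILLEGAL.sub(separator, name)
    name = re.sub(r"\s+", separator, name.strip())
    # Collapse runs of the separator into a single occurrence.
    name = re.sub("(?:" + re.escape(separator) + "){2,}", separator, name)
    stem, ext = os.path.splitext(name)
    if stem.upper() in RESERVED:
        stem = separator + stem  # prefixing makes the name no longer reserved
    # Truncate the stem only, so the extension survives.
    keep = max(max_length - len(ext), 1)
    return stem[:keep] + ext
```

Truncating the stem instead of the whole name is what keeps a long "report....pdf" recognisable as a PDF after shortening.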
strip_accents
strip_accents(text: str) -> str
strip_accents(text: list[str]) -> list[str]
strip_accents(text: str | list[str]) -> str | list[str]
Remove diacritical marks while preserving base characters.
NFD decompose → strip combining marks → NFC recompose. Accepts a single string or a list of strings.
Examples:
>>> strip_accents("café résumé naïve")
'cafe resume naive'
>>> strip_accents(["café", "naïve"])
['cafe', 'naive']
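The three steps named above (NFD decompose, strip combining marks, NFC recompose) map directly onto the standard library. The sketch below is a pure-Python equivalent for illustration, not the library's Rust code; sketch_strip_accents is a hypothetical name, and it treats any character with a nonzero canonical combining class as a combining mark, which is an approximation.

```python
import unicodedata


def sketch_strip_accents(text):
    """NFD decompose, drop combining marks, NFC recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)
```

Letters that are not a base character plus a mark, such as "ß" or "ø", have no decomposition and therefore pass through this sketch unchanged.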
fold_case
fold_case(text: str) -> str
Full Unicode case folding per CaseFolding.txt (Unicode 16.0).
Unlike str.lower(), this implements the complete Unicode Case Folding
algorithm with all 1,557 status-C and status-F mappings. Covers Latin
(ß→ss, ſ→s, İ→i̇), Greek (ς→σ, variant forms ϐ→β, ϑ→θ, ϕ→φ, ϖ→π,
ϰ→κ, ϱ→ρ), Cyrillic, Armenian (ligature և→եւ), Georgian Mtavruli,
Cherokee, Adlam, Deseret, Osage, Warang Citi, fullwidth Latin,
and all Latin ligature expansions (ﬁ→fi, ﬂ→fl, ﬀ→ff, ﬃ→ffi,
ﬄ→ffl, ﬅ→st, ﬆ→st).
Equivalent to str.casefold() but executed in Rust via a
compile-time PHF (perfect hash function) table. Pure-ASCII strings
take a branchless fast path with no table lookup.
Examples:
>>> fold_case("Straße")
'strasse'
>>> fold_case("ΣΟΦΙΑ")
'σοφια'
>>> fold_case("ﬁnd") # ligature ﬁ (U+FB01) expands to fi
'find'
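Since the text above describes fold_case as equivalent to str.casefold(), the built-in can be used to show why folding differs from lowercasing. The snippet relies only on standard Python behaviour; caseless_equal is a hypothetical helper introduced for illustration.

```python
# str.lower() leaves some characters untouched that full case folding expands.
assert "Straße".lower() == "straße"
assert "Straße".casefold() == "strasse"

# Final sigma survives lower() but folds to the regular sigma.
assert "\u03c2".lower() == "\u03c2"      # ς stays ς
assert "\u03c2".casefold() == "\u03c3"   # ς folds to σ


def caseless_equal(a, b):
    """Compare two strings after folding both sides."""
    return a.casefold() == b.casefold()


assert caseless_equal("STRASSE", "Straße")
assert caseless_equal("\ufb01nd", "FIND")  # U+FB01 LATIN SMALL LIGATURE FI
```

Folding both operands is the important part: comparing a folded string against an unfolded one reintroduces the mismatches that folding is meant to remove.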
collapse_whitespace
collapse_whitespace(text: str, *, strip_control: bool = True, strip_zero_width: bool = True) -> str
Normalize all Unicode whitespace variants to single ASCII spaces.
Optionally strip control characters and zero-width characters.
Examples:
>>> collapse_whitespace(" hello world ")
'hello world'
>>> collapse_whitespace("tabs\there\ttoo")
'tabs here too'
>>> collapse_whitespace("a\u200Bb\u200Bc") # zero-width spaces
'abc'
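Zero-width characters need separate handling because they are format characters, not whitespace, so a plain whitespace pattern does not match them. The following regex-based sketch illustrates the behaviour shown in the examples above. It is an assumption-laden approximation: sketch_collapse, ZERO_WIDTH, and CONTROL are hypothetical names, and the exact character sets the library strips are not listed in this section.

```python
import re

# Zero-width characters are not matched by \s, so they get their own class.
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")
# C0 controls other than tab, LF, VT, FF, CR, plus DEL.
CONTROL = re.compile(r"[\x00-\x08\x0e-\x1f\x7f]")


def sketch_collapse(text, strip_control=True, strip_zero_width=True):
    """Remove invisible characters, then squeeze whitespace runs to one space."""
    if strip_zero_width:
        text = ZERO_WIDTH.sub("", text)
    if strip_control:
        text = CONTROL.sub("", text)
    # For str patterns, \s also matches Unicode whitespace such as NBSP (U+00A0)
    # and the ideographic space (U+3000).
    return re.sub(r"\s+", " ", text).strip()
```

Removing zero-width characters before collapsing matters: otherwise "a \u200b b" would keep two spaces separated by an invisible character.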
demojize
demojize(text: str, *, strip_modifiers: bool = False, errors: ErrorMode = 'replace', replace_with: str = '[?]', provider: EmojiProvider | None = None, delimiters: tuple[str, str] | None = None) -> str
Expand emoji sequences to their CLDR short-name text descriptions.
Output is always the bare CLDR short name as plain text.
Examples:
>>> demojize("I ❤️ Python 🐍")
'I red heart Python snake'
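In the example above the heart is actually two code points: U+2764 followed by the variation selector U+FE0F. A toy lookup that reproduces that one example is sketched below. SHORT_NAMES is a hypothetical two-entry excerpt and sketch_demojize is an illustrative name; a real implementation would need longest-match handling for multi-code-point sequences (ZWJ sequences, skin-tone modifiers), which this sketch does not attempt.

```python
# Hypothetical two-entry excerpt of an emoji -> CLDR short name table.
SHORT_NAMES = {"\u2764": "red heart", "\U0001F40D": "snake"}
VARIATION_SELECTOR_16 = "\ufe0f"  # requests emoji presentation; carries no name


def sketch_demojize(text):
    """Replace known single-code-point emoji with their short names."""
    out = []
    for ch in text:
        if ch == VARIATION_SELECTOR_16:
            continue  # drop the selector so it does not linger after the name
        out.append(SHORT_NAMES.get(ch, ch))
    return "".join(out)
```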
set_emoji_provider
set_emoji_provider(provider: EmojiProvider | None = None) -> None
Set a global emoji provider for all demojize calls.
The provider must implement the EmojiProvider protocol.
Pass None to reset to the built-in default (latest English CLDR).
Examples:
>>> set_emoji_provider(None) # reset to default provider
strip_bidi
strip_bidi(text: str) -> str
Strip bidirectional override and formatting characters (UAX #9).
Removes: soft hyphen (U+00AD), Arabic Letter Mark (U+061C), LRM/RLM (U+200E, U+200F), bidi embeddings/overrides (U+202A–U+202E), bidi isolates (U+2066–U+2069).
Examples:
>>> strip_bidi("hello\u200eworld") # remove LRM
'helloworld'
>>> strip_bidi("hello\u061cworld") # remove Arabic Letter Mark
'helloworld'
>>> strip_bidi("safe text") # no bidi chars → unchanged
'safe text'
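The removal list above translates naturally into a str.translate deletion table. This is a pure-Python sketch of the same character set, not the library's implementation; BIDI_CODEPOINTS, DELETE_BIDI, and sketch_strip_bidi are hypothetical names.

```python
# The code points from the removal list, as integers.
BIDI_CODEPOINTS = (
    [0x00AD, 0x061C, 0x200E, 0x200F]
    + list(range(0x202A, 0x202F))  # U+202A..U+202E embeddings and overrides
    + list(range(0x2066, 0x206A))  # U+2066..U+2069 isolates
)
# Mapping a code point to None tells str.translate to delete it.
DELETE_BIDI = dict.fromkeys(BIDI_CODEPOINTS)


def sketch_strip_bidi(text):
    """Delete every listed formatting character and leave all other text as is."""
    return text.translate(DELETE_BIDI)
```

A typical use is sanitizing identifiers or file names before display, where a right-to-left override (U+202E) could otherwise make "abc\u202eexe.txt" render misleadingly.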
strip_zalgo
strip_zalgo(text: str, *, max_marks: int = 2) -> str
Strip excessive combining marks, preserving legitimate diacritics.
Caps the number of combining marks per base character at max_marks. Operates in NFD space and recomposes to NFC.
Examples:
>>> strip_zalgo("café") # 1 combining mark — preserved
'café'
>>> strip_zalgo("Việt Nam") # 2 marks — preserved
'Việt Nam'
Caps the number of combining marks per base character, preserving legitimate diacritics (é, ñ, ệ) while removing zalgo stacking abuse.
from translit import strip_zalgo
strip_zalgo("café") # => "café" (1 mark — preserved)
strip_zalgo("Việt Nam") # => "Việt Nam" (2 marks — preserved)
# Strip all combining marks (like strip_accents)
strip_zalgo("café", max_marks=0) # => "cafe"
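The capping rule described above can be sketched in a few lines with unicodedata: decompose, count combining marks since the last base character, drop any beyond the limit, and recompose. This is an illustrative pure-Python version, not the library's code; sketch_strip_zalgo is a hypothetical name, and a nonzero canonical combining class is used as the test for a combining mark.

```python
import unicodedata


def sketch_strip_zalgo(text, max_marks=2):
    """Keep at most max_marks combining marks after each base character."""
    out = []
    marks_seen = 0
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            marks_seen += 1
            if marks_seen > max_marks:
                continue  # drop the excess mark
        else:
            marks_seen = 0  # a new base character resets the count
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))
```

Working in NFD is what makes the count meaningful: a precomposed "ệ" only reveals its two marks (dot below and circumflex) after decomposition.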
List input (batch processing)
transliterate, slugify, normalize, and strip_accents accept either a single str or a list[str]. When a list is passed, all strings are processed in a single Rust call, amortizing the Python → Rust boundary overhead. The return type matches the input type.
from translit import transliterate, slugify
titles = ["café résumé", "Straße nach München", "Москва"]
transliterate(titles)
# => ["cafe resume", "Strasse nach Munchen", "Moskva"]
slugify(titles, lang="de")
# => ["cafe-resume", "strasse-nach-muenchen", "moskva"]
For large datasets, passing a list is significantly faster than calling the function in a Python loop. See Performance for benchmarks.
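The "return type matches the input type" contract can be expressed as a small dispatch on the argument type. The sketch below shows only that API shape. batched is a hypothetical helper, and because it loops in Python it does not reproduce the single-call throughput benefit described above.

```python
from typing import Callable, List, Union


def batched(fn: Callable[[str], str]) -> Callable:
    """Wrap a str -> str function so it also accepts a list of strings."""

    def wrapper(text: Union[str, List[str]]) -> Union[str, List[str]]:
        if isinstance(text, str):
            return fn(text)                 # str in, str out
        return [fn(item) for item in text]  # list in, list out

    return wrapper


# Example with a built-in transform standing in for a real one.
upper = batched(str.upper)
```

Checking for str first matters, because a str is itself iterable and would otherwise be split into characters.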
Compatibility aliases
The following aliases are provided for migration convenience:
| Alias | Target | Matches |
|---|---|---|
| unidecode | transliterate | Unidecode / text-unidecode |
| ascii_fold | transliterate | Elasticsearch ICU folding |
| casefold | fold_case | str.casefold() |
| remove_accents | strip_accents | sklearn / ML ecosystems |
from translit import unidecode, casefold, remove_accents
unidecode("café") # => "cafe"
casefold("Straße") # => "strasse"
remove_accents("café") # => "cafe"