Core Transforms¶

Functions that transform text. All are pure functions — they never mutate the input.

transliterate¶

transliterate(text: str, *, lang: str | None = ..., target: str | None = ..., errors: ErrorMode = ..., replace_with: str = ..., strict_iso9: bool = ..., gost7034: bool = ..., tones: bool = ..., context: bool = ...) -> str

transliterate(text: list[str], *, lang: str | None = ..., target: str | None = ..., errors: ErrorMode = ..., replace_with: str = ..., strict_iso9: bool = ..., gost7034: bool = ..., tones: bool = ..., context: bool = ...) -> list[str]

transliterate(text: str | list[str], *, lang: str | None = None, target: str | None = None, errors: ErrorMode = 'replace', replace_with: str = '[?]', strict_iso9: bool = False, gost7034: bool = False, tones: bool = False, context: bool = False) -> str | list[str]

Unicode → ASCII transliteration.

Accepts a single string or a list of strings. When a list is passed, all strings are processed in a single Rust call for better throughput.

Parameters:

text (str | list[str]) –

Input Unicode string, or list of strings for batch processing.
lang (str | None, default: None ) –

Language code for language-specific mappings, e.g. "de" (ü→ue), "ja" (kanji→romaji), "zh" (hanzi→pinyin). Use "auto" to detect the dominant non-Latin script and select the appropriate language automatically. Use "ja-kunrei" for Kunrei-shiki romanization of Japanese kana. None uses best-effort default tables.
target (str | None, default: None ) –

Target language code for reverse transliteration (romanized Latin → native script). Mutually exclusive with lang. Use reverse_langs() to list supported languages.
errors (ErrorMode, default: 'replace' ) –

How to handle untransliterable characters. "replace" — substitute with replace_with. "ignore" — silently drop. "preserve" — keep the original character.
replace_with (str, default: '[?]' ) –

Replacement string when errors="replace". An empty string ("") is equivalent to errors="ignore" — the character is silently dropped. This matches the behaviour of the Unidecode library.
strict_iso9 (bool, default: False ) –

Use ISO 9:1995 scholarly transliteration for Cyrillic. When True, overrides both default and lang-specific mappings with the international standard used in linguistics and library science (e.g. й→j, ю→ju, я→ja).
gost7034 (bool, default: False ) –

Use GOST R 7.0.34-2014 simplified transliteration for Russian Cyrillic. Mutually exclusive with strict_iso9. Key differences from default: х→x, ц→c, щ→shh, й→j.
tones (bool, default: False ) –

Output toned pinyin (with diacritics) for CJK characters, e.g. "běi jīng" instead of "bei jing". Coverage includes the ~2000 most common characters; others fall through to toneless pinyin.

Returns:	`str | list[str]` – ASCII transliteration of the input. Returns `str` when given `str`, `list[str]` when given `list[str]`.

Raises:

TranslitError –

If an internal Rust error occurs (e.g. invalid errors value passed at runtime).
ValueError –

If both strict_iso9 and gost7034 are True.
ValueError –

If both lang and target are set.
ValueError –

If target is set with forward-only parameters.

Examples:

>>> transliterate("café résumé")
'cafe resume'
>>> transliterate(["café", "naïve"])
['cafe', 'naive']
>>> transliterate("München", lang="de")
'Muenchen'
>>> transliterate("Moskva", target="ru")
'Москва'
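
The three errors modes can be illustrated with a small pure-Python sketch. The three-entry table and the name transliterate_sketch are hypothetical stand-ins, not translit's actual tables or implementation; only the error-handling branches follow the parameter descriptions above.

```python
# Hypothetical, tiny mapping table used only for this illustration.
TABLE = {"é": "e", "ü": "u", "ß": "ss"}


def transliterate_sketch(text: str, errors: str = "replace", replace_with: str = "[?]") -> str:
    """Map characters via TABLE and apply the chosen error mode to the rest."""
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)            # ASCII passes through unchanged
        elif ch in TABLE:
            out.append(TABLE[ch])     # known mapping
        elif errors == "replace":
            out.append(replace_with)  # substitute a marker
        elif errors == "ignore":
            continue                  # silently drop
        elif errors == "preserve":
            out.append(ch)            # keep the original character
        else:
            raise ValueError(f"unknown errors mode: {errors!r}")
    return "".join(out)


print(transliterate_sketch("café ☃"))                     # cafe [?]
print(transliterate_sketch("café ☃", errors="preserve"))  # cafe ☃
```

In this sketch, replace_with="" produces the same output as errors="ignore", which is the equivalence the replace_with description states.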

slugify¶

slugify(text: str, *, separator: str = ..., lowercase: bool = ..., max_length: int = ..., word_boundary: bool = ..., save_order: bool = ..., stopwords: Iterable[str] = ..., regex_pattern: str | None = ..., replacements: Iterable[tuple[str, str]] = ..., allow_unicode: bool = ..., lang: str | None = ..., entities: bool = ..., decimal: bool = ..., hexadecimal: bool = ...) -> str

slugify(text: list[str], *, separator: str = ..., lowercase: bool = ..., max_length: int = ..., word_boundary: bool = ..., save_order: bool = ..., stopwords: Iterable[str] = ..., regex_pattern: str | None = ..., replacements: Iterable[tuple[str, str]] = ..., allow_unicode: bool = ..., lang: str | None = ..., entities: bool = ..., decimal: bool = ..., hexadecimal: bool = ...) -> list[str]

slugify(text: str | list[str], *, separator: str = '-', lowercase: bool = True, max_length: int = 0, word_boundary: bool = False, save_order: bool = False, stopwords: Iterable[str] = (), regex_pattern: str | None = None, replacements: Iterable[tuple[str, str]] = (), allow_unicode: bool = False, lang: str | None = None, entities: bool = True, decimal: bool = True, hexadecimal: bool = True) -> str | list[str]

Generate a URL-safe slug from Unicode text.

Full pipeline: decode entities → transliterate → lowercase → strip non-alphanumeric → collapse separators → apply stopwords/max_length.

Parameter-compatible with python-slugify.

Parameters:

text (str | list[str]) –

Input Unicode string, or list of strings for batch processing.
separator (str, default: '-' ) –

Character(s) between slug words.
lowercase (bool, default: True ) –

Convert to lowercase.
max_length (int, default: 0 ) –

Maximum slug length in bytes (0 = unlimited). With allow_unicode=True, multi-byte characters count as 2–4 bytes each; use grapheme_truncate() for character-aware limiting.
word_boundary (bool, default: False ) –

When truncating via max_length, cut at word boundaries.
save_order (bool, default: False ) –

Accepted for python-slugify compatibility but has no effect — word order is always preserved.
stopwords (Iterable[str], default: () ) –

Words to remove from the slug.
regex_pattern (str | None, default: None ) –

Custom regex for stripping characters.
replacements (Iterable[tuple[str, str]], default: () ) –

Pre-transliteration (old, new) substitution pairs.
allow_unicode (bool, default: False ) –

Keep non-ASCII letters instead of transliterating.
lang (str | None, default: None ) –

Language code for transliteration (e.g. "de", "ru", "auto").
entities (bool, default: True ) –

Decode HTML entities before processing.
decimal (bool, default: True ) –

Decode HTML decimal entities (e.g. &#123;).
hexadecimal (bool, default: True ) –

Decode HTML hex entities (e.g. &#x7B;).

Returns:	`str | list[str]` – URL-safe slug string, or a list of slugs when given `list[str]`.

Raises:	`TranslitError` – If an internal Rust error occurs. `NotImplementedError` – If `pretranslate` is passed as a callable (only dict is supported in the compatibility shim).

Examples:

>>> slugify("Hello World!")
'hello-world'
>>> slugify("Straße nach München", lang="de")
'strasse-nach-muenchen'
>>> slugify("My Title", separator="_")
'my_title'
>>> slugify("The Big Fox", stopwords=["the"])
'big-fox'
>>> slugify("Very Long Title Here", max_length=10, word_boundary=True)
'very-long'
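
The pipeline order described above can be approximated with the standard library. This is a rough sketch under stated assumptions, not translit's implementation: slugify_sketch is a hypothetical name, NFKD plus ASCII encoding stands in for real transliteration, and truncation here is a plain cut followed by trimming a trailing separator rather than true word_boundary handling.

```python
import html
import re
import unicodedata


def slugify_sketch(text: str, separator: str = "-", stopwords=(), max_length: int = 0) -> str:
    text = html.unescape(text)  # 1. decode HTML entities
    # 2. crude ASCII folding (stand-in for transliteration)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.lower()         # 3. lowercase
    # 4./5. keep alphanumeric runs; joining them collapses separators
    words = re.findall(r"[a-z0-9]+", text)
    stop = {w.lower() for w in stopwords}
    words = [w for w in words if w not in stop]  # 6. drop stopwords
    slug = separator.join(words)
    if max_length > 0:          # 7. naive length limit
        slug = slug[:max_length].rstrip(separator)
    return slug


print(slugify_sketch("Hello World!"))                       # hello-world
print(slugify_sketch("Caf&eacute; &amp; Bar"))              # cafe-bar
print(slugify_sketch("The Big Fox", stopwords=["the"]))     # big-fox
```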

normalize¶

normalize(text: str, *, form: NormalizationForm = ...) -> str

normalize(text: list[str], *, form: NormalizationForm = ...) -> list[str]

normalize(text: str | list[str], *, form: NormalizationForm = 'NFC') -> str | list[str]

Unicode normalization.

Accepts a single string or a list of strings.

Parameters:

text (str | list[str]) –

Input string, or list of strings for batch processing.
form (NormalizationForm, default: 'NFC' ) –

Normalization form: "NFC", "NFD", "NFKC", or "NFKD".

Returns:	`str | list[str]` – Normalized string(s). Returns `str` when given `str`, `list[str]` when given `list[str]`.

Examples:

>>> normalize("é", form="NFC")
'é'
>>> normalize(["é", "ño"], form="NFC")
['é', 'ño']
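
Because composed and decomposed strings look identical when printed, the effect of each form is easier to see through code point counts. The sketch below uses Python's standard unicodedata module rather than translit itself; the four forms are defined by the Unicode standard, so the same inputs are expected to behave the same way here.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))                        # 2 1
print(composed == "\u00e9")                                  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# The compatibility forms (NFKC/NFKD) also fold presentation variants,
# such as the "fi" ligature U+FB01.
ligature_folded = unicodedata.normalize("NFKC", "\ufb01")
print(ligature_folded)                                       # fi
```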

normalize_confusables¶

normalize_confusables(text: str, *, target_script: str = 'latin') -> str

Replace Unicode confusable homoglyphs with target-script equivalents.

Uses Unicode TR39 confusables table. Characters without a confusable equivalent in the target script pass through unchanged (visual mapping only, not transliteration).

Parameters:	`text` (`str`) – Input string potentially containing homoglyphs. `target_script` (`str`, default: `'latin'` ) – Script to normalize toward. Supported values: `"latin"` (default, ~2,063 mappings) and `"cyrillic"` (~1,369 mappings).

Returns:	`str` – String with confusable characters replaced by target-script equivalents.

Raises:	`TranslitError` – If target_script is not a supported value.

Examples:

>>> normalize_confusables("Ηello")  # Greek Η looks like Latin H
'Hello'
>>> normalize_confusables("раypal")  # Cyrillic р/а look like Latin p/a
'paypal'
>>> normalize_confusables("paypal", target_script="cyrillic")
'раураӏ'
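
Since homoglyphs are indistinguishable on screen, spelling out the code points makes the mapping visible. The sketch below uses a hand-picked four-entry table and a hypothetical helper name; the real function draws on the full TR39 confusables data, which this example does not reproduce.

```python
# Hand-picked subset for illustration only (not the TR39 table).
CONFUSABLES_TO_LATIN = str.maketrans({
    "\u0397": "H",  # GREEK CAPITAL LETTER ETA
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
})


def normalize_confusables_sketch(text: str) -> str:
    """Replace mapped look-alikes; everything else passes through unchanged."""
    return text.translate(CONFUSABLES_TO_LATIN)


print(normalize_confusables_sketch("\u0397ello"))        # Hello
print(normalize_confusables_sketch("\u0440\u0430ypal"))  # paypal
print(normalize_confusables_sketch("日本"))               # 日本 (no mapping, unchanged)
```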

sanitize_filename¶

sanitize_filename(text: str, *, separator: str = '_', max_length: int = 255, platform: Platform = 'universal', lang: str | None = None, preserve_extension: bool = True, replacement_text: str | None = None, max_len: int | None = None) -> str

Sanitize a string into a safe filename.

Transliterate → strip OS-illegal chars → collapse separators → handle reserved names (CON, NUL, etc.) → truncate respecting extension.

Parameters:

text (str) –

Input string (title, user input, etc.).
separator (str, default: '_' ) –

Replacement for spaces and stripped characters. Also accepted as replacement_text (pathvalidate compatibility).
max_length (int, default: 255 ) –

Maximum filename length measured in bytes (UTF-8 encoded), not characters. Default 255 matches the per-name limit of ext4 and APFS (255 bytes); NTFS also allows 255 units but counts UTF-16 code units rather than bytes. Truncation always lands on a character boundary to avoid splitting multi-byte sequences. Also accepted as max_len (pathvalidate compatibility).
platform (Platform, default: 'universal' ) –

Target platform — "universal", "windows", or "posix".
lang (str | None, default: None ) –

Language code for transliteration (e.g. "de", "ja").
preserve_extension (bool, default: True ) –

When True (default), the file extension is kept intact within max_length. If the extension alone (including the leading .) is ≥ max_length, the extension is dropped and the whole result is truncated to max_length bytes. When False, the entire string is truncated to max_length bytes without special treatment of the extension.

Returns:	`str` – Safe filename string.

Raises:	`TranslitError` – If an internal Rust error occurs.

Examples:

>>> sanitize_filename("My Report (final).pdf")
'My_Report_(final).pdf'
>>> sanitize_filename("CON.txt")  # reserved on Windows
'_CON.txt'
>>> sanitize_filename("résumé.docx", lang="fr")
'resume.docx'
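
Byte-based truncation that respects character boundaries and the extension can be sketched as follows. Both helper names are hypothetical, and the handling of an over-long extension (drop it and truncate the stem) is one reading of the preserve_extension description, not confirmed library behaviour.

```python
import os


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" discards a trailing partial multi-byte sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def truncate_filename(name: str, max_bytes: int = 255) -> str:
    """Truncate the stem so that stem + extension fits in max_bytes."""
    stem, ext = os.path.splitext(name)
    ext_bytes = len(ext.encode("utf-8"))
    if ext_bytes >= max_bytes:
        # Assumption: the extension is dropped when it cannot fit.
        return truncate_utf8(stem, max_bytes)
    return truncate_utf8(stem, max_bytes - ext_bytes) + ext


print(truncate_utf8("ééé", 5))                      # éé  (each é is 2 bytes)
print(truncate_filename("résumé_long.docx", 10))    # résu.docx
```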

strip_accents¶

strip_accents(text: str) -> str

strip_accents(text: list[str]) -> list[str]

strip_accents(text: str | list[str]) -> str | list[str]

Remove diacritical marks while preserving base characters.

NFD decompose → strip combining marks → NFC recompose. Accepts a single string or a list of strings.

Parameters:

text (str | list[str]) –

Input string, or list of strings for batch processing.

Returns:	`str | list[str]` – String(s) with diacritical marks removed.

Examples:

>>> strip_accents("café résumé naïve")
'cafe resume naive'
>>> strip_accents(["café", "naïve"])
['cafe', 'naive']
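
The decompose, strip, recompose steps named above can be reproduced with the standard library. This is a minimal sketch with a hypothetical function name; it assumes "combining marks" means every code point in a Unicode Mark category (Mn, Mc, Me), which may differ in detail from the Rust implementation.

```python
import unicodedata


def strip_accents_sketch(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)   # split base + marks
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")  # drop Mn/Mc/Me
    )
    return unicodedata.normalize("NFC", kept)          # recompose


print(strip_accents_sketch("café résumé naïve"))  # cafe resume naive
```

Letters that are not built from a base plus a combining mark, such as ß or ø, are left alone by this approach; they have no canonical decomposition.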

fold_case¶

fold_case(text: str) -> str

Full Unicode case folding per CaseFolding.txt (Unicode 16.0).

Unlike str.lower(), this implements the complete Unicode Case Folding algorithm with all 1,557 status-C and status-F mappings. Covers Latin (ß→ss, ſ→s, İ→i̇), Greek (ς→σ, variant forms ϐ→β, ϑ→θ, ϕ→φ, ϖ→π, ϰ→κ, ϱ→ρ), Cyrillic, Armenian (ligature և→եւ), Georgian Mtavruli, Cherokee, Adlam, Deseret, Osage, Warang Citi, fullwidth Latin, and all Latin ligature expansions (ﬁ→fi, ﬂ→fl, ﬀ→ff, ﬃ→ffi, ﬄ→ffl, ﬅ→st, ﬆ→st).

Equivalent to str.casefold() but executed in Rust via a compile-time PHF (perfect hash function) table. Pure-ASCII strings take a branchless fast path with no table lookup.

Parameters:	`text` (`str`) – Input string.

Returns:	`str` – Case-folded string. Characters not in CaseFolding.txt map to themselves. Output satisfies `fold_case(fold_case(x)) == fold_case(x)` (idempotent).

Examples:

>>> fold_case("Straße")
'strasse'
>>> fold_case("ΣΟΦΙΑ")
'σοφια'
>>> fold_case("ﬁnd")
'find'
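
Since the text above describes fold_case as equivalent to str.casefold(), the built-in can be used to see how folding differs from lowercasing and why it suits caseless comparison. The sketch uses only the standard library and makes no claim about the Rust internals.

```python
samples = ["Straße", "ΣΟΦΙΑ", "\ufb01nd"]  # last one starts with the "fi" ligature

folded = [s.casefold() for s in samples]
print(folded)  # ['strasse', 'σοφια', 'find']

# Folding is idempotent: folding again changes nothing.
print(all(f.casefold() == f for f in folded))          # True

# lower() keeps ß, so it is not enough for caseless matching.
print("Straße".lower())                                # straße
print("STRASSE".casefold() == "Straße".casefold())     # True
```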

collapse_whitespace¶

collapse_whitespace(text: str, *, strip_control: bool = True, strip_zero_width: bool = True) -> str

Normalize all Unicode whitespace variants to single ASCII spaces.

Optionally strip control characters and zero-width characters.

Parameters:

text (str) –

Input string.
strip_control (bool, default: True ) –

Remove C0/C1 control characters (U+0000–U+001F, U+007F–U+009F) except tab and newline. Carriage return (\r) is stripped, so Windows-style \r\n becomes \n.
strip_zero_width (bool, default: True ) –

Remove zero-width space (U+200B), zero-width non-joiner (U+200C), zero-width joiner (U+200D), and word joiner (U+2060).

Returns:	`str` – String with whitespace collapsed and optionally cleaned.

Examples:

>>> collapse_whitespace("  hello   world  ")
'hello world'
>>> collapse_whitespace("tabs\there\ttoo")
'tabs here too'
>>> collapse_whitespace("a\u200Bb\u200Bc")  # zero-width spaces
'abc'
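
A rough standard-library approximation of the behaviour shown in the examples is given below. The function name is hypothetical, control-character stripping is omitted, and the sketch treats every character matched by the regex class \s (including newlines) as collapsible whitespace, which may not match the library in every case.

```python
import re

# Zero-width characters listed under strip_zero_width; mapping to None deletes them.
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D, 0x2060])


def collapse_whitespace_sketch(text: str, strip_zero_width: bool = True) -> str:
    if strip_zero_width:
        text = text.translate(ZERO_WIDTH)
    # \s on str patterns matches Unicode whitespace such as NBSP and EM SPACE.
    return re.sub(r"\s+", " ", text).strip()


print(collapse_whitespace_sketch("  hello   world  "))   # hello world
print(collapse_whitespace_sketch("a\u00a0\u2003b"))      # a b
print(collapse_whitespace_sketch("a\u200bb\u200bc"))     # abc
```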

demojize¶

demojize(text: str, *, strip_modifiers: bool = False, errors: ErrorMode = 'replace', replace_with: str = '[?]', provider: EmojiProvider | None = None, delimiters: tuple[str, str] | None = None) -> str

Expand emoji sequences to their CLDR short-name text descriptions.

Output is always the bare CLDR short name as plain text.

Parameters:

text (str) –

Input string potentially containing emoji.
strip_modifiers (bool, default: False ) –

If True, collapse skin tone and hair style variants to their base form (e.g. "woman raising hand" instead of "woman raising hand: medium-dark skin tone").
errors (ErrorMode, default: 'replace' ) –

How to handle emoji not in the provider's data. "replace" — substitute with replace_with. "ignore" — silently drop. "preserve" — keep the original emoji.
replace_with (str, default: '[?]' ) –

Replacement string when errors="replace".
provider (EmojiProvider | None, default: None ) –

An object implementing the EmojiProvider protocol. Overrides the global provider for this call. None uses the global provider or the built-in default.
delimiters (tuple[str, str] | None, default: None ) –

Accepted for compatibility with the emoji library but ignored, with a DeprecationWarning. translit always outputs bare CLDR short names without delimiters; wrap the result yourself if you need delimiters (e.g. f":{name}:").

Returns:	`str` – Text with emoji replaced by their descriptions.

Raises:	`TranslitError` – If an internal Rust error occurs.

Warns:	`UserWarning` – If the provider raises an exception or returns a non-string value. The built-in CLDR tables are used as a fallback for that sequence.

Examples:

>>> demojize("I ❤️ Python 🐍")
'I red heart Python snake'

set_emoji_provider¶

set_emoji_provider(provider: EmojiProvider | None = None) -> None

Set a global emoji provider for all demojize calls.

The provider must implement the EmojiProvider protocol.

Pass None to reset to the built-in default (latest English CLDR).

Parameters:

provider (EmojiProvider | None, default: None ) –

An object implementing the EmojiProvider protocol, or None to reset to the built-in default.

Examples:

>>> set_emoji_provider(None)  # reset to default provider

strip_bidi¶

strip_bidi(text: str) -> str

Strip bidirectional override and formatting characters (UAX #9).

Removes: soft hyphen (U+00AD), Arabic Letter Mark (U+061C), LRM/RLM (U+200E/F), bidi embeddings/overrides (U+202A–U+202E), bidi isolates (U+2066–U+2069).

Parameters:	`text` (`str`) – Input string.

Returns:	`str` – String with bidi override and formatting characters removed.

Examples:

>>> strip_bidi("hello\u200eworld")  # remove LRM
'helloworld'
>>> strip_bidi("hello\u061cworld")  # remove Arabic Letter Mark
'helloworld'
>>> strip_bidi("safe text")  # no bidi chars → unchanged
'safe text'
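
The removal list above translates directly into a deletion table for str.translate. The sketch below is a pure-Python illustration with hypothetical names, covering exactly the code points listed in this section.

```python
# Code points named in the description above.
BIDI_CODE_POINTS = {
    0x00AD,                   # soft hyphen
    0x061C,                   # Arabic Letter Mark
    0x200E, 0x200F,           # LRM, RLM
    *range(0x202A, 0x202F),   # embeddings/overrides U+202A..U+202E
    *range(0x2066, 0x206A),   # isolates U+2066..U+2069
}
# Keys mapped to None are deleted by str.translate.
BIDI_TABLE = dict.fromkeys(BIDI_CODE_POINTS)


def strip_bidi_sketch(text: str) -> str:
    return text.translate(BIDI_TABLE)


print(strip_bidi_sketch("hello\u200eworld"))  # helloworld
print(strip_bidi_sketch("abc\u202edef"))      # abcdef (RLO override removed)
```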

strip_zalgo¶

strip_zalgo(text: str, *, max_marks: int = 2) -> str

Strip excessive combining marks, preserving legitimate diacritics.

Caps the number of combining marks per base character at max_marks. Operates in NFD space and recomposes to NFC.

Parameters:

text (str) –

Input string (may contain zalgo abuse).
max_marks (int, default: 2 ) –

Maximum combining marks to keep per base character. Set to 0 to strip all combining marks (equivalent to strip_accents()).

Returns:	`str` – String with excess combining marks removed.

Examples:

>>> strip_zalgo("café")  # 1 combining mark — preserved
'café'
>>> strip_zalgo("Việt Nam")  # 2 marks — preserved
'Việt Nam'

Caps the number of combining marks per base character, preserving legitimate diacritics (é, ñ, ệ) while removing zalgo stacking abuse.

from translit import strip_zalgo

strip_zalgo("café")           # => "café"  (1 mark — preserved)
strip_zalgo("Việt Nam")       # => "Việt Nam"  (2 marks — preserved)

# Strip all combining marks (like strip_accents)
strip_zalgo("café", max_marks=0)  # => "cafe"
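
The capping logic can be sketched in pure Python: decompose to NFD, count consecutive marks after each base character, drop those beyond the limit, then recompose. The function name is hypothetical and "combining mark" is assumed to mean any code point in a Unicode Mark category; the Rust implementation may classify marks differently.

```python
import unicodedata


def strip_zalgo_sketch(text: str, max_marks: int = 2) -> str:
    out = []
    run = 0  # combining marks seen since the last base character
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > max_marks:
                continue  # excess mark: drop it
        else:
            run = 0       # new base character resets the count
        out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


stacked = "e" + "\u0301" * 10          # 'e' with ten stacked acute accents
cleaned = strip_zalgo_sketch(stacked)
print(len(stacked), len(cleaned))      # 11 2
print(strip_zalgo_sketch("caf\u00e9", max_marks=0))  # cafe
```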

List input (batch processing)¶

transliterate, slugify, normalize, and strip_accents accept either a single str or a list[str]. When a list is passed, all strings are processed in a single Rust call, amortizing the Python → Rust boundary overhead. The return type matches the input type.

from translit import transliterate, slugify

titles = ["café résumé", "Straße nach München", "Москва"]

transliterate(titles)
# => ["cafe resume", "Strasse nach Munchen", "Moskva"]

slugify(titles, lang="de")
# => ["cafe-resume", "strasse-nach-muenchen", "moskva"]

For large datasets, passing a list is significantly faster than calling the function in a Python loop. See Performance for benchmarks.
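
The "return type matches the input type" contract can be mimicked in pure Python with typing.overload. This is only a sketch of the calling convention with a hypothetical function name; translit itself hands the whole list to Rust in one call, which this example does not do.

```python
from __future__ import annotations

from typing import overload


@overload
def casefold_any(text: str) -> str: ...
@overload
def casefold_any(text: list[str]) -> list[str]: ...


def casefold_any(text: str | list[str]) -> str | list[str]:
    """Accept one string or a list; return the same shape."""
    if isinstance(text, str):
        return text.casefold()
    return [item.casefold() for item in text]


print(casefold_any("Straße"))            # strasse
print(casefold_any(["Straße", "ABC"]))   # ['strasse', 'abc']
```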

Compatibility aliases¶

The following aliases are provided for migration convenience:

Alias	Target	Matches
`unidecode`	`transliterate`	Unidecode / text-unidecode
`ascii_fold`	`transliterate`	Elasticsearch ICU folding
`casefold`	`fold_case`	`str.casefold()`
`remove_accents`	`strip_accents`	sklearn / ML ecosystems

from translit import unidecode, casefold, remove_accents

unidecode("café")        # => "cafe"
casefold("Straße")       # => "strasse"
remove_accents("café")   # => "cafe"