Core Transforms

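Functions that transform text. The transforms return new values and never mutate their input; set_emoji_provider is the exception, since it changes the global provider used by demojize.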

transliterate

transliterate(text: str, *, lang: str | None = ..., target: str | None = ..., errors: ErrorMode = ..., replace_with: str = ..., strict_iso9: bool = ..., gost7034: bool = ..., tones: bool = ..., context: bool = ...) -> str
transliterate(text: list[str], *, lang: str | None = ..., target: str | None = ..., errors: ErrorMode = ..., replace_with: str = ..., strict_iso9: bool = ..., gost7034: bool = ..., tones: bool = ..., context: bool = ...) -> list[str]
transliterate(text: str | list[str], *, lang: str | None = None, target: str | None = None, errors: ErrorMode = 'replace', replace_with: str = '[?]', strict_iso9: bool = False, gost7034: bool = False, tones: bool = False, context: bool = False) -> str | list[str]

Unicode → ASCII transliteration.

Accepts a single string or a list of strings. When a list is passed, all strings are processed in a single Rust call for better throughput.

Parameters:
  • text (str | list[str]) –

    Input Unicode string, or list of strings for batch processing.

  • lang (str | None, default: None ) –

    Language code for language-specific mappings, e.g. "de" (ü→ue), "ja" (kanji→romaji), "zh" (hanzi→pinyin). Use "auto" to detect the dominant non-Latin script and select the appropriate language automatically. Use "ja-kunrei" for Kunrei-shiki romanization of Japanese kana. None uses best-effort default tables.

  • target (str | None, default: None ) –

    Target language code for reverse transliteration (romanized Latin → native script). Mutually exclusive with lang. Use reverse_langs() to list supported languages.

  • errors (ErrorMode, default: 'replace' ) –

    How to handle untransliterable characters. "replace" — substitute with replace_with. "ignore" — silently drop. "preserve" — keep the original character.

  • replace_with (str, default: '[?]' ) –

    Replacement string when errors="replace". An empty string ("") is equivalent to errors="ignore" — the character is silently dropped. This matches the behaviour of the Unidecode library.

  • strict_iso9 (bool, default: False ) –

    Use ISO 9:1995 scholarly transliteration for Cyrillic. When True, overrides both default and lang-specific mappings with the international standard used in linguistics and library science (e.g. й→j, ю→ju, я→ja).

  • gost7034 (bool, default: False ) –

    Use GOST R 7.0.34-2014 simplified transliteration for Russian Cyrillic. Mutually exclusive with strict_iso9. Key differences from default: х→x, ц→c, щ→shh, й→j.

  • tones (bool, default: False ) –

    Output toned pinyin (with diacritics) for CJK characters. e.g. "běi jīng" instead of "bei jing". Coverage includes the ~2000 most common characters; others fall through to toneless pinyin.

Returns:
  • str | list[str]

    ASCII transliteration of the input. Returns str when given str, list[str] when given list[str].

Raises:
  • TranslitError

    If an internal Rust error occurs (e.g. invalid errors value passed at runtime).

  • ValueError

    If both strict_iso9 and gost7034 are True.

  • ValueError

    If both lang and target are set.

  • ValueError

    If target is set with forward-only parameters.

Examples:

>>> transliterate("café résumé")
'cafe resume'
>>> transliterate(["café", "naïve"])
['cafe', 'naive']
>>> transliterate("München", lang="de")
'Muenchen'
>>> transliterate("Moskva", target="ru")
'Москва'
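
The three errors modes can be illustrated with a toy, table-driven transliterator. This is a minimal sketch, not the library's implementation: the function name toy_transliterate and its three-entry table are hypothetical and exist only for this example.

```python
# Toy mapping for illustration only; the real tables are far larger.
_TOY_TABLE = {"\u00e9": "e", "\u00fc": "u", "\u00df": "ss"}


def toy_transliterate(text, errors="replace", replace_with="[?]"):
    """Map characters through _TOY_TABLE, applying an error mode to the rest."""
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)               # ASCII passes through untouched
        elif ch in _TOY_TABLE:
            out.append(_TOY_TABLE[ch])   # known mapping
        elif errors == "replace":
            out.append(replace_with)     # substitute a placeholder
        elif errors == "ignore":
            continue                     # silently drop the character
        elif errors == "preserve":
            out.append(ch)               # keep the original character
        else:
            raise ValueError(f"unknown errors mode: {errors!r}")
    return "".join(out)


sample = "caf\u00e9 \u2603"  # "café" followed by a snowman, which the toy table lacks
print(toy_transliterate(sample))                     # cafe [?]
print(toy_transliterate(sample, errors="ignore"))    # cafe, then a trailing space
print(toy_transliterate(sample, errors="preserve"))  # cafe, then the snowman
```

In this sketch, passing replace_with="" gives the same result as errors="ignore", mirroring the equivalence described for the replace_with parameter.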

slugify

slugify(text: str, *, separator: str = ..., lowercase: bool = ..., max_length: int = ..., word_boundary: bool = ..., save_order: bool = ..., stopwords: Iterable[str] = ..., regex_pattern: str | None = ..., replacements: Iterable[tuple[str, str]] = ..., allow_unicode: bool = ..., lang: str | None = ..., entities: bool = ..., decimal: bool = ..., hexadecimal: bool = ...) -> str
slugify(text: list[str], *, separator: str = ..., lowercase: bool = ..., max_length: int = ..., word_boundary: bool = ..., save_order: bool = ..., stopwords: Iterable[str] = ..., regex_pattern: str | None = ..., replacements: Iterable[tuple[str, str]] = ..., allow_unicode: bool = ..., lang: str | None = ..., entities: bool = ..., decimal: bool = ..., hexadecimal: bool = ...) -> list[str]
slugify(text: str | list[str], *, separator: str = '-', lowercase: bool = True, max_length: int = 0, word_boundary: bool = False, save_order: bool = False, stopwords: Iterable[str] = (), regex_pattern: str | None = None, replacements: Iterable[tuple[str, str]] = (), allow_unicode: bool = False, lang: str | None = None, entities: bool = True, decimal: bool = True, hexadecimal: bool = True) -> str | list[str]

Generate a URL-safe slug from Unicode text.

Full pipeline: decode entities → transliterate → lowercase → strip non-alphanumeric → collapse separators → apply stopwords/max_length.

Parameter-compatible with python-slugify.

Parameters:
  • text (str | list[str]) –

    Input Unicode string, or list of strings for batch processing.

  • separator (str, default: '-' ) –

    Character(s) between slug words.

  • lowercase (bool, default: True ) –

    Convert to lowercase.

  • max_length (int, default: 0 ) –

    Maximum slug length in bytes (0 = unlimited). With allow_unicode=True, multi-byte characters count as 2–4 bytes each; use grapheme_truncate() for character-aware limiting.

  • word_boundary (bool, default: False ) –

    When truncating via max_length, cut at word boundaries.

  • save_order (bool, default: False ) –

    Accepted for python-slugify compatibility but has no effect — word order is always preserved.

  • stopwords (Iterable[str], default: () ) –

    Words to remove from the slug.

  • regex_pattern (str | None, default: None ) –

    Custom regex for stripping characters.

  • replacements (Iterable[tuple[str, str]], default: () ) –

    Pre-transliteration (old, new) substitution pairs.

  • allow_unicode (bool, default: False ) –

    Keep non-ASCII letters instead of transliterating.

  • lang (str | None, default: None ) –

    Language code for transliteration (e.g. "de", "ru", "auto").

  • entities (bool, default: True ) –

    Decode HTML entities before processing.

  • decimal (bool, default: True ) –

    Decode HTML decimal entities (e.g. &#123;).

  • hexadecimal (bool, default: True ) –

    Decode HTML hex entities (e.g. &#x7b;).

Returns:
  • str | list[str]

    URL-safe slug string(s). Returns str when given str, list[str] when given list[str].

Raises:
  • TranslitError

    If an internal Rust error occurs.

  • NotImplementedError

    If pretranslate is passed as a callable (only dict is supported in the compatibility shim).

Examples:

>>> slugify("Hello World!")
'hello-world'
>>> slugify("Straße nach München", lang="de")
'strasse-nach-muenchen'
>>> slugify("My Title", separator="_")
'my_title'
>>> slugify("The Big Fox", stopwords=["the"])
'big-fox'
>>> slugify("Very Long Title Here", max_length=10, word_boundary=True)
'very-long'
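
The pipeline order described above can be approximated with the standard library. This is a rough sketch under stated assumptions, not the library's code: toy_slugify is a hypothetical name, and NFKD decomposition followed by dropping non-ASCII characters stands in for real transliteration (so characters such as ß are simply lost here).

```python
import html
import re
import unicodedata


def toy_slugify(text, separator="-", lowercase=True, max_length=0, stopwords=()):
    """Approximate the slug pipeline with stdlib building blocks."""
    text = html.unescape(text)                       # 1. decode entities
    text = unicodedata.normalize("NFKD", text)       # 2. crude transliteration:
    text = text.encode("ascii", "ignore").decode("ascii")  # drop non-ASCII
    if lowercase:
        text = text.lower()                          # 3. lowercase
    words = re.findall(r"[A-Za-z0-9]+", text)        # 4. strip non-alphanumerics
    stop = {word.lower() for word in stopwords}
    words = [word for word in words if word.lower() not in stop]  # 5. stopwords
    slug = separator.join(words)                     # 6. single separators
    if max_length > 0:
        slug = slug[:max_length]                     # 7. truncate (ASCII: 1 byte per char)
        while separator and slug.endswith(separator):
            slug = slug[: -len(separator)]           # no dangling separator
    return slug


print(toy_slugify("Hello World!"))                          # hello-world
print(toy_slugify("caf&eacute; r&#233;sum&#xe9;"))          # cafe-resume
print(toy_slugify("The Big Fox", stopwords=["the"]))        # big-fox
```

Because the sketch always produces ASCII, character count and byte count coincide; with allow_unicode=True in the real function they do not, which is why max_length is documented in bytes.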

normalize

normalize(text: str, *, form: NormalizationForm = ...) -> str
normalize(text: list[str], *, form: NormalizationForm = ...) -> list[str]
normalize(text: str | list[str], *, form: NormalizationForm = 'NFC') -> str | list[str]

Unicode normalization.

Accepts a single string or a list of strings.

Parameters:
  • text (str | list[str]) –

    Input string, or list of strings for batch processing.

  • form (NormalizationForm, default: 'NFC' ) –

    Normalization form — "NFC", "NFD", "NFKC", or "NFKD".

Returns:
  • str | list[str]

    Normalized string(s). Returns str when given str, list[str] when given list[str].

Examples:

>>> normalize("é", form="NFC")
'é'
>>> normalize(["é", "ño"], form="NFC")
['é', 'ño']
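
The difference between the four forms is easiest to see on strings that look identical but differ in code points. The sketch below uses Python's standard unicodedata module rather than this library, on the assumption that both follow the same Unicode normalization forms.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings render the same but compare unequal until normalized.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(len(unicodedata.normalize("NFD", composed)))           # 2

# The compatibility forms (NFKC/NFKD) additionally fold characters such as
# the U+FB01 ligature into their plain equivalents.
print(unicodedata.normalize("NFKC", "\ufb01"))               # fi
print(unicodedata.normalize("NFC", "\ufb01") == "\ufb01")    # True (unchanged)
```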

normalize_confusables

normalize_confusables(text: str, *, target_script: str = 'latin') -> str

Replace Unicode confusable homoglyphs with target-script equivalents.

Uses the Unicode TR39 confusables table. Characters without a confusable equivalent in the target script pass through unchanged (visual mapping only, not transliteration).

Parameters:
  • text (str) –

    Input string potentially containing homoglyphs.

  • target_script (str, default: 'latin' ) –

    Script to normalize toward. Supported values: "latin" (default, ~2,063 mappings) and "cyrillic" (~1,369 mappings).

Returns:
  • str

    String with confusable characters replaced by target-script equivalents.

Raises:
  • TranslitError

    If target_script is not a supported value.

Examples:

>>> normalize_confusables("Ηello")  # Greek Η looks like Latin H
'Hello'
>>> normalize_confusables("раypal")  # Cyrillic р/а look like Latin p/a
'paypal'
>>> normalize_confusables("paypal", target_script="cyrillic")
'раураӏ'
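
The "visual mapping only" behaviour can be sketched with a small translation table. The four entries below are a hand-picked subset chosen for this example, and toy_normalize_confusables is a hypothetical name; the real function draws on the full TR39 data.

```python
# Hand-picked homoglyphs for illustration; not the full TR39 table.
_TOY_TO_LATIN = {
    "\u0397": "H",  # GREEK CAPITAL LETTER ETA
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
}
_TABLE = str.maketrans(_TOY_TO_LATIN)


def toy_normalize_confusables(text):
    """Replace known look-alikes; everything else passes through unchanged."""
    return text.translate(_TABLE)


print(toy_normalize_confusables("\u0397ello"))        # Hello
print(toy_normalize_confusables("\u0440\u0430ypal"))  # paypal
# Cyrillic zhe (U+0436) has no entry in the toy table, so it is left
# alone instead of being transliterated to "zh".
print(toy_normalize_confusables("\u0436") == "\u0436")  # True
```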

sanitize_filename

sanitize_filename(text: str, *, separator: str = '_', max_length: int = 255, platform: Platform = 'universal', lang: str | None = None, preserve_extension: bool = True, replacement_text: str | None = None, max_len: int | None = None) -> str

Sanitize a string into a safe filename.

Transliterate → strip OS-illegal chars → collapse separators → handle reserved names (CON, NUL, etc.) → truncate respecting extension.

Parameters:
  • text (str) –

    Input string (title, user input, etc.).

  • separator (str, default: '_' ) –

    Replacement for spaces and stripped characters. Also accepted as replacement_text (pathvalidate compatibility).

  • max_length (int, default: 255 ) –

    Maximum filename length measured in bytes (UTF-8 encoded), not characters. Default 255 matches the ext4/APFS/NTFS filesystem limit. Truncation always lands on a character boundary to avoid splitting multi-byte sequences. Also accepted as max_len (pathvalidate compatibility).

  • platform (Platform, default: 'universal' ) –

    Target platform — "universal", "windows", or "posix".

  • lang (str | None, default: None ) –

    Language code for transliteration (e.g. "de", "ja").

  • preserve_extension (bool, default: True ) –

    When True (default), the file extension is kept intact within max_length. If the extension alone (including the leading .) is ≥ max_length, the extension is dropped and the whole result is truncated to max_length bytes. When False, the entire string is truncated to max_length bytes without special treatment of the extension.

Returns:
  • str

    Safe filename string.

Raises:
  • TranslitError

    If an internal Rust error occurs.

Examples:

>>> sanitize_filename("My Report (final).pdf")
'My_Report_(final).pdf'
>>> sanitize_filename("CON.txt")  # reserved on Windows
'_CON.txt'
>>> sanitize_filename("résumé.docx", lang="fr")
'resume.docx'
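
Byte-based truncation that never splits a multi-byte character, while keeping the extension, can be sketched as follows. Both helper names are hypothetical, os.path.splitext is an assumed stand-in for however the library detects extensions, and the oversized-extension branch is one reading of the rule described under preserve_extension.

```python
import os


def truncate_utf8(text, max_bytes):
    """Cut text to at most max_bytes UTF-8 bytes on a character boundary."""
    encoded = text.encode("utf-8")[:max_bytes]
    # A partially cut trailing sequence is invalid UTF-8 and gets discarded.
    return encoded.decode("utf-8", errors="ignore")


def toy_truncate_filename(name, max_length=255, preserve_extension=True):
    """Truncate name to max_length bytes, keeping the extension when it fits."""
    stem, ext = os.path.splitext(name)
    ext_bytes = len(ext.encode("utf-8"))
    if not preserve_extension or ext_bytes >= max_length:
        # No special treatment: truncate the whole name.
        return truncate_utf8(name, max_length)
    return truncate_utf8(stem, max_length - ext_bytes) + ext


print(truncate_utf8("\u00e9\u00e9\u00e9", 5) == "\u00e9\u00e9")  # True: 5th byte is half a character
result = toy_truncate_filename("r\u00e9sum\u00e9_long.docx", max_length=10)
print(result == "r\u00e9su.docx", len(result.encode("utf-8")))   # True 10
```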

strip_accents

strip_accents(text: str) -> str
strip_accents(text: list[str]) -> list[str]
strip_accents(text: str | list[str]) -> str | list[str]

Remove diacritical marks while preserving base characters.

NFD decompose → strip combining marks → NFC recompose. Accepts a single string or a list of strings.

Parameters:
  • text (str | list[str]) –

    Input string, or list of strings for batch processing.

Returns:
  • str | list[str]

    String(s) with diacritical marks removed.

Examples:

>>> strip_accents("café résumé naïve")
'cafe resume naive'
>>> strip_accents(["café", "naïve"])
['cafe', 'naive']
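
The decompose, strip, recompose sequence can be reproduced with the standard unicodedata module. This is a sketch only: toy_strip_accents is a hypothetical name, and treating "combining mark" as Unicode category Mn (nonspacing mark) is an assumption about what the library strips.

```python
import unicodedata


def toy_strip_accents(text):
    """NFD decompose, drop nonspacing marks (category Mn), NFC recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


print(toy_strip_accents("caf\u00e9 r\u00e9sum\u00e9 na\u00efve"))  # cafe resume naive
# Characters without a decomposition, such as ß, are base characters and
# survive unchanged; this is accent stripping, not transliteration.
print(toy_strip_accents("Stra\u00dfe") == "Stra\u00dfe")            # True
```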

fold_case

fold_case(text: str) -> str

Full Unicode case folding per CaseFolding.txt (Unicode 16.0).

Unlike str.lower(), this implements the complete Unicode Case Folding algorithm with all 1,557 status-C and status-F mappings. Covers Latin (ß→ss, ſ→s, İ→i̇), Greek (ς→σ, variant forms ϐ→β, ϑ→θ, ϕ→φ, ϖ→π, ϰ→κ, ϱ→ρ), Cyrillic, Armenian (ligature և→եւ), Georgian Mtavruli, Cherokee, Adlam, Deseret, Osage, Warang Citi, fullwidth Latin, and all Latin ligature expansions (ﬁ→fi, ﬂ→fl, ﬀ→ff, ﬃ→ffi, ﬄ→ffl, ﬅ→st, ﬆ→st).

Equivalent to str.casefold() but executed in Rust via a compile-time PHF (perfect hash function) table. Pure-ASCII strings take a branchless fast path with no table lookup.

Parameters:
  • text (str) –

    Input string.

Returns:
  • str

    Case-folded string. Characters not in CaseFolding.txt map to themselves. Output satisfies fold_case(fold_case(x)) == fold_case(x) (idempotent).

Examples:

>>> fold_case("Straße")
'strasse'
>>> fold_case("ΣΟΦΙΑ")
'σοφια'
>>> fold_case("ﬁnd")  # starts with the U+FB01 ligature ﬁ
'find'
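
Since the function is described as equivalent to str.casefold(), the built-in can be used to see how full case folding differs from lowercasing. The caseless_equal helper is a hypothetical name for this example, and Python's bundled Unicode version may differ from 16.0, so edge cases could diverge.

```python
def caseless_equal(a, b):
    """Compare two strings after full Unicode case folding."""
    return a.casefold() == b.casefold()


eszett = "Stra\u00dfe"   # "Straße"
ligature = "\ufb01nd"    # starts with the U+FB01 "fi" ligature

print(eszett.lower() == "stra\u00dfe")   # True: lower() keeps the ß
print(eszett.casefold())                 # strasse
print(ligature.casefold())               # find
print(caseless_equal("STRASSE", eszett))  # True

# Folding an already folded string changes nothing (idempotence).
for sample in (eszett, ligature, "\u03a3\u039f\u03a6\u0399\u0391"):
    folded = sample.casefold()
    assert folded.casefold() == folded
```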

collapse_whitespace

collapse_whitespace(text: str, *, strip_control: bool = True, strip_zero_width: bool = True) -> str

Normalize all Unicode whitespace variants to single ASCII spaces.

Optionally strip control characters and zero-width characters.

Parameters:
  • text (str) –

    Input string.

  • strip_control (bool, default: True ) –

    Remove C0/C1 control characters (U+0000–U+001F, U+007F–U+009F) except tab and newline. Carriage return (\r) is stripped, so Windows-style \r\n becomes \n.

  • strip_zero_width (bool, default: True ) –

    Remove zero-width space (U+200B), zero-width non-joiner (U+200C), zero-width joiner (U+200D), and word joiner (U+2060).

Returns:
  • str

    String with whitespace collapsed and optionally cleaned.

Examples:

>>> collapse_whitespace("  hello   world  ")
'hello world'
>>> collapse_whitespace("tabs\there\ttoo")
'tabs here too'
>>> collapse_whitespace("a\u200Bb\u200Bc")  # zero-width spaces
'abc'
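
A standard-library approximation of the collapsing step is shown below. It is a sketch under assumptions: toy_collapse_whitespace is a hypothetical name, control-character stripping is omitted, and every whitespace character (newlines included) is collapsed, which may differ from how the library treats line breaks.

```python
# Code points listed under strip_zero_width, mapped to None so that
# str.translate deletes them.
_ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D, 0x2060])


def toy_collapse_whitespace(text, strip_zero_width=True):
    """Collapse runs of Unicode whitespace to one space and trim the ends."""
    if strip_zero_width:
        text = text.translate(_ZERO_WIDTH)
    # str.split() with no arguments splits on any Unicode whitespace run
    # and discards leading and trailing whitespace.
    return " ".join(text.split())


print(toy_collapse_whitespace("  hello   world  "))   # hello world
print(toy_collapse_whitespace("a\u00a0\u2003b"))      # a b  (NBSP + EM SPACE)
print(toy_collapse_whitespace("a\u200bb\u200bc"))     # abc
```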

demojize

demojize(text: str, *, strip_modifiers: bool = False, errors: ErrorMode = 'replace', replace_with: str = '[?]', provider: EmojiProvider | None = None, delimiters: tuple[str, str] | None = None) -> str

Expand emoji sequences to their CLDR short-name text descriptions.

Output is always the bare CLDR short name as plain text.

Parameters:
  • text (str) –

    Input string potentially containing emoji.

  • strip_modifiers (bool, default: False ) –

    If True, collapse skin tone and hair style variants to their base form (e.g. "woman raising hand" instead of "woman raising hand: medium-dark skin tone").

  • errors (ErrorMode, default: 'replace' ) –

    How to handle emoji not in the provider's data. "replace" — substitute with replace_with. "ignore" — silently drop. "preserve" — keep the original emoji.

  • replace_with (str, default: '[?]' ) –

    Replacement string when errors="replace".

  • provider (EmojiProvider | None, default: None ) –

    An object implementing the EmojiProvider protocol. Overrides the global provider for this call. None uses the global provider or the built-in default.

  • delimiters (tuple[str, str] | None, default: None ) –

    Accepted for compatibility with the emoji library; ignored with a DeprecationWarning. translit always outputs bare CLDR short names without delimiters; wrap the result yourself if you need delimiters (e.g. f":{name}:").

Returns:
  • str

    Text with emoji replaced by their descriptions.

Raises:
  • TranslitError

    If an internal Rust error occurs.

Warns:
  • UserWarning

    If the provider raises an exception or returns a non-string value. The built-in CLDR tables are used as a fallback for that sequence.

Examples:

>>> demojize("I ❤️ Python 🐍")
'I red heart Python snake'

set_emoji_provider

set_emoji_provider(provider: EmojiProvider | None = None) -> None

Set a global emoji provider for all demojize calls.

The provider must implement the EmojiProvider protocol.

Pass None to reset to the built-in default (latest English CLDR).

Parameters:
  • provider (EmojiProvider | None, default: None ) –

    An object implementing the EmojiProvider protocol, or None to reset to the built-in default.

Examples:

>>> set_emoji_provider(None)  # reset to default provider

strip_bidi

strip_bidi(text: str) -> str

Strip bidirectional override and formatting characters (UAX #9).

Removes: soft hyphen (U+00AD), Arabic Letter Mark (U+061C), LRM/RLM (U+200E/F), bidi embeddings/overrides (U+202A–U+202E), bidi isolates (U+2066–U+2069).

Parameters:
  • text (str) –

    Input string.

Returns:
  • str

    String with bidi override and formatting characters removed.

Examples:

>>> strip_bidi("hello\u200eworld")  # remove LRM
'helloworld'
>>> strip_bidi("hello\u061cworld")  # remove Arabic Letter Mark
'helloworld'
>>> strip_bidi("safe text")  # no bidi chars → unchanged
'safe text'

strip_zalgo

strip_zalgo(text: str, *, max_marks: int = 2) -> str

Strip excessive combining marks, preserving legitimate diacritics.

Caps the number of combining marks per base character at max_marks. Operates in NFD space and recomposes to NFC.

Parameters:
  • text (str) –

    Input string (may contain zalgo abuse).

  • max_marks (int, default: 2 ) –

    Maximum combining marks to keep per base character (default: 2). Set to 0 to strip all combining marks (equivalent to strip_accents()).

Returns:
  • str

    String with excess combining marks removed.

Examples:

>>> strip_zalgo("café")  # 1 combining mark — preserved
'café'
>>> strip_zalgo("Việt Nam")  # 2 marks — preserved
'Việt Nam'

Caps the number of combining marks per base character, preserving legitimate diacritics (é, ñ, ệ) while removing zalgo stacking abuse.

from translit import strip_zalgo

strip_zalgo("café")           # => "café"  (1 mark — preserved)
strip_zalgo("Việt Nam")       # => "Việt Nam"  (2 marks — preserved)

# Strip all combining marks (like strip_accents)
strip_zalgo("café", max_marks=0)  # => "cafe"
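
The capping logic can be sketched with the standard unicodedata module. As before, this is an illustration rather than the library's code: toy_strip_zalgo is a hypothetical name, and counting only category Mn characters as combining marks is an assumption.

```python
import unicodedata


def toy_strip_zalgo(text, max_marks=2):
    """Keep at most max_marks nonspacing marks after each base character."""
    kept = []
    run = 0  # marks seen since the last base character
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn":
            run += 1
            if run > max_marks:
                continue          # excess mark: drop it
        else:
            run = 0               # new base character resets the count
        kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept))


zalgo = "a" + "\u0301" * 10       # "a" buried under ten acute accents
cleaned = toy_strip_zalgo(zalgo)
marks_left = sum(
    unicodedata.category(ch) == "Mn"
    for ch in unicodedata.normalize("NFD", cleaned)
)
print(marks_left)                                               # 2
print(toy_strip_zalgo("Vi\u1ec7t Nam") == "Vi\u1ec7t Nam")      # True
print(toy_strip_zalgo("caf\u00e9", max_marks=0))                # cafe
```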

List input (batch processing)

transliterate, slugify, normalize, and strip_accents accept either a single str or a list[str]. When a list is passed, all strings are processed in a single Rust call, amortizing the Python → Rust boundary overhead. The return type matches the input type.

from translit import transliterate, slugify

titles = ["café résumé", "Straße nach München", "Москва"]

transliterate(titles)
# => ["cafe resume", "Strasse nach Munchen", "Moskva"]

slugify(titles, lang="de")
# => ["cafe-resume", "strasse-nach-muenchen", "moskva"]

For large datasets, passing a list is significantly faster than calling the function in a Python loop. See Performance for benchmarks.
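
The "return type matches the input type" contract corresponds to the paired overload signatures shown for each batch-capable function. A minimal sketch of that pattern follows, using a hypothetical toy_upper function in place of a real transform (requires Python 3.9+ for the list[str] annotation).

```python
from typing import overload


@overload
def toy_upper(text: str) -> str: ...
@overload
def toy_upper(text: list[str]) -> list[str]: ...


def toy_upper(text):
    """Uppercase a string, or every string in a list, preserving the input type."""
    if isinstance(text, str):
        return text.upper()
    # One call handles the whole batch instead of one call per item.
    return [item.upper() for item in text]


print(toy_upper("abc"))         # ABC
print(toy_upper(["a", "b"]))    # ['A', 'B']
```

The overloads let a type checker infer str for a str argument and list[str] for a list argument, so callers do not need to narrow the result themselves.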

Compatibility aliases

The following aliases are provided for migration convenience:

Alias            Target          Matches
unidecode        transliterate   Unidecode / text-unidecode
ascii_fold       transliterate   Elasticsearch ICU folding
casefold         fold_case       str.casefold()
remove_accents   strip_accents   sklearn / ML ecosystems

from translit import unidecode, casefold, remove_accents

unidecode("café")        # => "cafe"
casefold("Straße")       # => "strasse"
remove_accents("café")   # => "cafe"