Classes

Stateful objects and builders for repeated or specialized text processing.

Text

Immutable wrapper for fluent Unicode text processing.

Wrap a string, chain transforms in any order, extract with .value or str().

Examples:

>>> from translit import Text
>>> Text("Straße").fold_case().value
'strasse'
>>> Text("  hello   world  ").collapse_whitespace().value
'hello world'
>>> str(Text("café").strip_accents())
'cafe'

value property

value: str

Return the underlying string.

normalize

normalize(*, form: NormalizationForm = 'NFC') -> Text

Unicode normalization (NFC, NFD, NFKC, NFKD).
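As a rough illustration of what the four forms do, the sketch below uses Python's standard-library unicodedata module rather than translit itself. Whether Text.normalize delegates to unicodedata is not stated here, so treat this as background on the forms, not as the library's implementation.

```python
import unicodedata

# "é" as one precomposed code point, and as "e" + combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

# NFC composes and NFD decomposes, so either form makes the two spellings equal.
to_nfc = unicodedata.normalize("NFC", decomposed)
to_nfd = unicodedata.normalize("NFD", composed)

# The compatibility forms (NFKC/NFKD) additionally fold presentation variants,
# for example the "fi" ligature (U+FB01) into the two letters "fi".
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)    # left unchanged
nfkc_ligature = unicodedata.normalize("NFKC", ligature)  # becomes "fi"
```

NFC is the usual choice for storage and display; the compatibility forms are lossy and are better suited to matching and search.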

normalize_confusables

normalize_confusables(*, target_script: str = 'latin') -> Text

Replace confusable homoglyphs with target-script equivalents.

strip_accents

strip_accents() -> Text

Remove diacritical marks, preserving base characters.
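One common way to get this behavior with only the standard library is to decompose the text and drop the combining marks. The helper name strip_accents_sketch is hypothetical, and the approach is an assumption about the general technique, not a description of how translit implements strip_accents.

```python
import unicodedata


def strip_accents_sketch(text: str) -> str:
    """Drop combining marks after canonical decomposition (illustrative only)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the accents that NFD splits off.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left so the result is back in NFC.
    return unicodedata.normalize("NFC", kept)
```

Characters without a canonical decomposition, such as "ß" or "ø", pass through this sketch unchanged; mapping those to ASCII is a transliteration concern rather than accent stripping.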

transliterate

transliterate(*, lang: str | None = None, target: str | None = None, errors: ErrorMode = 'replace', replace_with: str = '[?]', strict_iso9: bool = False, gost7034: bool = False) -> Text

Unicode → ASCII transliteration.

fold_case

fold_case() -> Text

Full Unicode case folding per CaseFolding.txt (1,557 mappings).

Covers Latin, Greek, Cyrillic, Armenian, Georgian, Cherokee, Adlam, Deseret, Osage, Warang Citi, fullwidth Latin, and all ligature expansions. Equivalent to str.casefold().
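Since the description above equates fold_case() with str.casefold(), the built-in can be used to see how folding differs from plain lowercasing. The helper same_ignoring_case is a hypothetical name used only for this sketch.

```python
# str.lower() leaves some characters alone that case folding expands or unifies.
lowered = "Straße".lower()     # sharp s is kept
folded = "Straße".casefold()   # sharp s expands to "ss"


def same_ignoring_case(a: str, b: str) -> bool:
    """Caseless comparison via full case folding."""
    return a.casefold() == b.casefold()
```

Folding is intended for comparison keys, not for display: the folded form of a word is not necessarily its correct lowercase spelling.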

collapse_whitespace

collapse_whitespace(*, strip_control: bool = True, strip_zero_width: bool = True) -> Text

Normalize whitespace to single ASCII spaces; optionally strip control characters and zero-width characters.
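The following is a minimal standard-library sketch of the behavior described above. The function name, the exact set of zero-width characters, and the decision to keep whitespace-like control characters as separators are all assumptions of this sketch; translit's actual character lists may differ.

```python
import unicodedata

# Assumed zero-width set for this sketch only.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def collapse_whitespace_sketch(
    text: str, *, strip_control: bool = True, strip_zero_width: bool = True
) -> str:
    kept = []
    for ch in text:
        if strip_zero_width and ch in ZERO_WIDTH:
            continue
        # Category "Cc" covers C0/C1 controls. Controls that count as whitespace
        # (tab, newline, ...) are kept so they still separate words.
        if strip_control and unicodedata.category(ch) == "Cc" and not ch.isspace():
            continue
        kept.append(ch)
    # str.split() with no arguments splits on runs of Unicode whitespace and
    # discards leading and trailing runs.
    return " ".join("".join(kept).split())
```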

slugify

slugify(*, separator: str = '-', lowercase: bool = True, max_length: int = 0, word_boundary: bool = False, save_order: bool = False, stopwords: Iterable[str] = (), regex_pattern: str | None = None, replacements: Iterable[tuple[str, str]] = (), allow_unicode: bool = False, lang: str | None = None, entities: bool = True, decimal: bool = True, hexadecimal: bool = True) -> Text

Generate a URL-safe slug.

sanitize_filename

sanitize_filename(*, separator: str = '_', max_length: int = 255, platform: Platform = 'universal', lang: str | None = None, preserve_extension: bool = True) -> Text

Sanitize into a safe filename.

demojize

demojize(*, strip_modifiers: bool = False, errors: ErrorMode = 'replace', replace_with: str = '[?]', provider: EmojiProvider | None = None) -> Text

Expand emoji to CLDR short-name text descriptions.

strip_bidi

strip_bidi() -> Text

Strip bidirectional override and formatting characters.
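To make the idea concrete, here is a small regex-based sketch. The character list (marks, embeddings, overrides, isolates) is an assumption of this example and may not match the exact set that strip_bidi() removes; strip_bidi_sketch is a hypothetical name.

```python
import re

# Explicit bidi controls: LRM/RLM (U+200E/F), ALM (U+061C),
# embeddings and overrides (U+202A-U+202E), isolates (U+2066-U+2069).
_BIDI_CONTROLS = re.compile("[\u200e\u200f\u061c\u202a-\u202e\u2066-\u2069]")


def strip_bidi_sketch(text: str) -> str:
    """Remove explicit bidirectional control characters."""
    return _BIDI_CONTROLS.sub("", text)
```

A typical reason to do this is filename or identifier spoofing, where a right-to-left override makes "invoice\u202egpj.exe" render as if it ended in "exe.jpg".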

security_clean

security_clean() -> Text

Apply the security_clean precompiled pipeline.

NFKC → confusables → strip bidi/format → collapse_whitespace.

ml_normalize

ml_normalize(*, lang: str | None = None, emoji: str = 'cldr') -> Text

Apply the ml_normalize precompiled pipeline.

NFKC → emoji→text → [transliterate] → strip_accents → fold_case → collapse_whitespace.

display_clean

display_clean() -> Text

Apply the display_clean precompiled pipeline.

Collapse whitespace, strip control and zero-width characters.

is_ascii

is_ascii() -> bool

True if all characters are U+0000–U+007F.

is_normalized

is_normalized(*, form: NormalizationForm = 'NFC') -> bool

True if already in the specified normalization form.

is_confusable

is_confusable(*, target_script: str = 'latin') -> bool

True if text contains confusable homoglyphs.

is_mixed_script

is_mixed_script() -> bool

True if text contains characters from multiple Unicode scripts.

detect_scripts

detect_scripts() -> list[Script]

Return Unicode scripts present, in order of first appearance.
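Python's standard library does not expose the Unicode Script property, so the sketch below approximates it with the first word of each character's Unicode name. That heuristic, the function names, and the plain-string return type (the real method returns list[Script]) are all assumptions made for illustration.

```python
import unicodedata


def detect_scripts_sketch(text: str) -> list[str]:
    """Approximate scripts from character names, in order of first appearance."""
    seen: list[str] = []
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation, and spaces are treated as neutral
        # Heuristic: first word of the Unicode name ("LATIN", "CYRILLIC", ...).
        script = unicodedata.name(ch, "UNKNOWN").split()[0]
        if script not in seen:
            seen.append(script)
    return seen


def is_mixed_script_sketch(text: str) -> bool:
    return len(detect_scripts_sketch(text)) > 1
```

For example, "p\u0430ypal" contains a Cyrillic "а" between Latin letters, which is the kind of input a mixed-script check is meant to flag.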

grapheme_len

grapheme_len() -> int

Count user-perceived characters (extended grapheme clusters).

grapheme_split

grapheme_split() -> list[str]

Split into extended grapheme clusters.

grapheme_truncate

grapheme_truncate(max_graphemes: int) -> Text

Truncate to at most max_graphemes grapheme clusters.
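The difference between code points and grapheme clusters can be shown with a deliberately simplified splitter. It only attaches combining marks and zero-width-joiner sequences to the preceding character; it is not a full UAX #29 implementation (no regional-indicator flags, Hangul jamo, or emoji modifiers), and the function names are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"


def grapheme_split_sketch(text: str) -> list[str]:
    """Simplified clustering: base character + marks, plus ZWJ sequences."""
    clusters: list[str] = []
    for ch in text:
        extends = unicodedata.category(ch) in ("Mn", "Me") or ch == ZWJ
        joined = bool(clusters) and clusters[-1].endswith(ZWJ)
        if clusters and (extends or joined):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def grapheme_truncate_sketch(text: str, max_graphemes: int) -> str:
    """Keep at most max_graphemes clusters, never cutting inside one."""
    return "".join(grapheme_split_sketch(text)[: max(max_graphemes, 0)])
```

With the decomposed string "cafe\u0301", a plain slice to four code points drops the accent, while truncating to four clusters keeps the final "é" intact.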

catalog_key

catalog_key(*, lang: str | None = None, strict_iso9: bool = False) -> Text

Generate a library catalog key for bibliographic deduplication.

Usage

from translit import Text

result = (
    Text("Ünïcödé Café ☕")
    .normalize(form="NFKC")
    .transliterate()
    .strip_accents()
    .fold_case()
    .value
)
# => "unicode cafe hot beverage"

Each transform method returns a new Text instance (immutable semantics, matching Python str). Predicates and other query methods return their native type (bool, int, list) and do not chain.

Chainable transforms

All core transforms are available as methods:

Method                      Returns  Description
.normalize(form=)           Text     Unicode normalization
.normalize_confusables()    Text     Replace confusable homoglyphs
.strip_accents()            Text     Remove diacritical marks
.transliterate(lang=, ...)  Text     Unicode → ASCII
.fold_case()                Text     Full Unicode case folding
.collapse_whitespace()      Text     Normalize whitespace
.slugify(...)               Text     Generate URL-safe slug
.sanitize_filename(...)     Text     Safe filename
.demojize(...)              Text     Emoji → text descriptions
.strip_bidi()               Text     Strip bidi overrides
.security_clean()           Text     Security pipeline
.ml_normalize(...)          Text     ML/NLP pipeline
.display_clean()            Text     Display cleanup pipeline
.catalog_key(...)           Text     Catalog key pipeline
.grapheme_truncate(n)       Text     Truncate to n graphemes

Non-chaining predicates

Method                 Returns       Description
.is_ascii()            bool          All characters are ASCII
.is_normalized(form=)  bool          Already in normalization form
.is_confusable()       bool          Contains confusable homoglyphs
.is_mixed_script()     bool          Multiple Unicode scripts
.detect_scripts()      list[Script]  Scripts present
.grapheme_len()        int           User-perceived character count
.grapheme_split()      list[str]     Split into grapheme clusters

Result extraction

Use .value or str() to extract the underlying string:

text = Text("café").strip_accents()
text.value   # => "cafe"
str(text)    # => "cafe"

Text supports ==, hash(), len(), and bool(); each of them operates on the underlying string value.
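The immutable, chainable shape described in this section can be sketched with a small stand-in class. MiniText is hypothetical and implements only two transforms; in particular, treating comparison with a plain str as equal is an assumption drawn from the sentence above, not confirmed behavior of Text.

```python
class MiniText:
    """Hypothetical stand-in for Text showing the immutable, chainable shape."""

    __slots__ = ("_value",)

    def __init__(self, value: str) -> None:
        self._value = value

    @property
    def value(self) -> str:
        return self._value

    # Transforms never mutate self; each one returns a fresh wrapper.
    def fold_case(self) -> "MiniText":
        return MiniText(self._value.casefold())

    def collapse_whitespace(self) -> "MiniText":
        return MiniText(" ".join(self._value.split()))

    # Predicates return plain values, which ends the chain.
    def is_ascii(self) -> bool:
        return self._value.isascii()

    def __str__(self) -> str:
        return self._value

    def __eq__(self, other: object) -> bool:
        if isinstance(other, MiniText):
            return self._value == other._value
        if isinstance(other, str):
            return self._value == other
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self._value)

    def __len__(self) -> int:
        # bool() falls back to __len__, so an empty wrapper is falsy.
        return len(self._value)
```

Because every transform returns a new object, an intermediate result can be kept and reused without worrying that a later call in the chain changed it.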


Slugifier

Reusable, preconfigured slugifier. Call the instance as slugifier(text) -> str.

Examples:

>>> s = Slugifier(separator="_", lang="de")
>>> s("Ärger im Büro")
'aerger_im_buero'

Usage

from translit import Slugifier

slug = Slugifier(separator="_", lang="de", max_length=50)
slug("Ärger im Büro")     # => "aerger_im_buero"
slug("Über den Wolken")   # => "ueber_den_wolken"

# Auto-detect language from script
auto_slug = Slugifier(lang="auto")
auto_slug("Москва")       # => "moskva" (detects Cyrillic → Russian)

Accepts all the same parameters as slugify(). Construct once, call many times.
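The "construct once, call many times" pattern amounts to a callable object that does its setup in the constructor. The sketch below is a hypothetical, standard-library-only stand-in: it uses plain accent stripping instead of language-aware transliteration (so "Ü" becomes "u", not "ue"), and it assumes that max_length=0 means no limit.

```python
import re
import unicodedata


class MiniSlugifier:
    """Hypothetical callable slugifier; not translit's implementation."""

    def __init__(self, *, separator: str = "-", lowercase: bool = True,
                 max_length: int = 0) -> None:
        self.separator = separator
        self.lowercase = lowercase
        self.max_length = max_length
        # Compiled once here, reused on every call.
        self._non_alnum = re.compile(r"[^A-Za-z0-9]+")

    def __call__(self, text: str) -> str:
        # Decompose, then drop everything that is not ASCII (accents included).
        ascii_text = (
            unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore")
            .decode("ascii")
        )
        if self.lowercase:
            ascii_text = ascii_text.lower()
        slug = self._non_alnum.sub(self.separator, ascii_text).strip(self.separator)
        if self.max_length > 0:
            slug = slug[: self.max_length].rstrip(self.separator)
        return slug
```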


UniqueSlugifier

Stateful slugifier that tracks previously generated slugs.

Appends incrementing suffixes for uniqueness. Optional check callback for external uniqueness (e.g. database lookup).

Examples:

>>> u = UniqueSlugifier()
>>> u("My Post")
'my-post'
>>> u("My Post")
'my-post-1'

reset

reset() -> None

Clear the internal set of seen slugs.

Usage

from translit import UniqueSlugifier

unique = UniqueSlugifier()
unique("My Post")   # => "my-post"
unique("My Post")   # => "my-post-1"
unique("My Post")   # => "my-post-2"

unique.reset()      # clear seen slugs
unique("My Post")   # => "my-post"

External uniqueness check

def exists_in_db(slug: str) -> bool:
    return db.slugs.filter(slug=slug).exists()

unique = UniqueSlugifier(check=exists_in_db)

The check callback is called for each candidate slug. If it returns True, the slugifier increments the suffix and tries again.
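The retry loop described above can be sketched as follows. UniqueSuffixer is a hypothetical class that works on already-slugified strings; the "-N" suffix format follows the examples in this section, but the internals are an assumption, not translit's code.

```python
from typing import Callable, Optional


class UniqueSuffixer:
    """Hypothetical sketch of suffix-based uniqueness with an external check."""

    def __init__(self, check: Optional[Callable[[str], bool]] = None) -> None:
        self._seen: set[str] = set()
        self._check = check

    def _taken(self, candidate: str) -> bool:
        if candidate in self._seen:
            return True
        # The external check (e.g. a database lookup) is consulted per candidate.
        return self._check is not None and self._check(candidate)

    def __call__(self, slug: str) -> str:
        candidate, n = slug, 0
        while self._taken(candidate):
            n += 1
            candidate = f"{slug}-{n}"
        self._seen.add(candidate)
        return candidate

    def reset(self) -> None:
        self._seen.clear()
```

Note that reset() in this sketch clears only the in-memory set; slugs reported as taken by the callback remain taken.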


TextPipeline

Composable, pre-compiled text cleaning pipeline.

Operations execute in fixed optimal order regardless of construction order.

Examples:

>>> pipe = TextPipeline(normalize="NFC", fold_case=True, collapse_whitespace=True)
>>> pipe("  Héllo  WÖRLD  ")
'héllo wörld'

steps property

steps: list[tuple[str, str | None]]

Return the ordered list of active pipeline steps.

Each entry is a (step_name, parameter) tuple. Steps are listed in execution order. parameter is None for parameterless steps (e.g. fold_case), or a string value for steps that accept one (e.g. ("normalize", "NFC")).

Examples:

>>> pipe = TextPipeline(normalize="NFC", fold_case=True)
>>> pipe.steps
[('normalize', 'NFC'), ('fold_case', None)]

explain

explain() -> str

Return a human-readable description of the pipeline.

Examples:

>>> pipe = TextPipeline(normalize="NFC", fold_case=True)
>>> print(pipe.explain())
TextPipeline with 2 steps:
  1. normalize (NFC)
  2. fold_case

Usage

from translit import TextPipeline

pipe = TextPipeline(
    normalize="NFC",
    confusables=True,
    strip_accents=True,
    fold_case=True,
    collapse_whitespace=True,
)

pipe("  Héllo Wörld  ")  # => "hello world"

Execution order

Operations execute in this fixed order regardless of construction order:

  1. Normalize → 2. Confusables → 3. Demojize → 4. Strip accents → 5. Transliterate → 6. Fold case → 7. Collapse whitespace

Performance

The pipeline is pre-compiled at construction. Enabled steps are stored as a bitflag set, so only the enabled steps execute at call time.
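One way to picture a fixed-order, flag-driven pipeline is with enum.IntFlag and a step table resolved in the constructor. MiniPipeline and Step are hypothetical names, only four of the seven documented steps are modeled, and the step bodies are standard-library approximations rather than translit's own transforms.

```python
import enum
import unicodedata


class Step(enum.IntFlag):
    NORMALIZE = enum.auto()
    STRIP_ACCENTS = enum.auto()
    FOLD_CASE = enum.auto()
    COLLAPSE_WHITESPACE = enum.auto()


def _strip_accents(s: str) -> str:
    decomposed = unicodedata.normalize("NFD", s)
    kept = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", kept)


# Table order is the execution order, independent of how the caller
# spells the constructor arguments.
_STEP_FUNCS = [
    (Step.NORMALIZE, lambda s: unicodedata.normalize("NFC", s)),
    (Step.STRIP_ACCENTS, _strip_accents),
    (Step.FOLD_CASE, str.casefold),
    (Step.COLLAPSE_WHITESPACE, lambda s: " ".join(s.split())),
]


class MiniPipeline:
    def __init__(self, *, normalize: bool = False, strip_accents: bool = False,
                 fold_case: bool = False, collapse_whitespace: bool = False) -> None:
        flags = Step(0)
        if normalize:
            flags |= Step.NORMALIZE
        if strip_accents:
            flags |= Step.STRIP_ACCENTS
        if fold_case:
            flags |= Step.FOLD_CASE
        if collapse_whitespace:
            flags |= Step.COLLAPSE_WHITESPACE
        self.flags = flags
        # Resolve the active steps once; calls only loop over what is enabled.
        self._funcs = [func for flag, func in _STEP_FUNCS if flag & flags]

    def __call__(self, text: str) -> str:
        for func in self._funcs:
            text = func(text)
        return text
```

Passing the keyword arguments in a different order produces the same pipeline, because the order comes from the step table, not from the call site.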


Compatibility aliases (awesome-slugify)

These classes provide drop-in replacements for awesome-slugify's Slugify and UniqueSlugify. They accept awesome-slugify's parameter names and map them to native translit parameters.

See the migration guide for full details.

Slugify

awesome-slugify-compatible Slugify class.

Accepts both awesome-slugify parameter names (to_lower, stop_words, safe_chars, capitalize, pretranslate) and native translit names.

Usage:

from translit import Slugify
custom = Slugify(to_lower=True)
custom("Hello World")  # => "hello-world"

This is a drop-in replacement for from slugify import Slugify.

from translit import Slugify

# Same API as awesome-slugify
custom = Slugify(to_lower=True)
custom("Hello World")  # => "hello-world"

# Attribute-style configuration (awesome-slugify pattern)
s = Slugify()
s.to_lower = True
s.stop_words = ("the", "a")
s.max_length = 200
s("The Big Fox")  # => "big-fox"

Accepts both awesome-slugify parameter names (to_lower, stop_words, safe_chars, capitalize, pretranslate) and native translit names (lowercase, stopwords, replacements).

Defaults to to_lower=False (matching awesome-slugify). For python-slugify compatibility (which defaults to lowercase=True), use the native Slugifier class or the slugify() function.


UniqueSlugify

Bases: Slugify

awesome-slugify-compatible UniqueSlugify class.

Tracks previously generated slugs and appends numeric suffixes to guarantee uniqueness.

Usage:

from translit import UniqueSlugify
unique = UniqueSlugify()
unique("My Post")   # => "My-Post"
unique("My Post")   # => "My-Post-1"

This is a drop-in replacement for from slugify import UniqueSlugify.

reset

reset() -> None

Clear the internal set of seen slugs.

from translit import UniqueSlugify

unique = UniqueSlugify(to_lower=True)
unique("My Post")   # => "my-post"
unique("My Post")   # => "my-post-1"

unique.reset()
unique("My Post")   # => "my-post"

Extends Slugify with uniqueness tracking. Accepts uids and unique_check parameters from awesome-slugify.


Preconfigured instances

Drop-in replacements for awesome-slugify's preconfigured slugifiers:

from translit import (
    slugify_url,       # lowercase, strips articles, max 200 chars
    slugify_filename,  # underscore separator, preserves -., max 255 chars
    slugify_unicode,   # keeps non-ASCII letters
    slugify_ru,        # Russian transliteration
    slugify_de,        # German transliteration (ä→ae, ö→oe, ü→ue)
    slugify_el,        # Greek transliteration
)

slugify_url("The Big Fox")        # => "big-fox"
slugify_de("Ärger im Büro")       # => "Aerger-im-Buero"
slugify_filename("My Report.pdf") # => "My_Report.pdf"