Classes¶

Stateful objects and builders for repeated or specialized text processing.

Text¶

Immutable wrapper for fluent Unicode text processing.

Wrap a string, chain transforms in any order, extract with .value or str().

Examples:

>>> from translit import Text
>>> Text("Straße").fold_case().value
'strasse'
>>> Text("  hello   world  ").collapse_whitespace().value
'hello world'
>>> str(Text("café").strip_accents())
'cafe'

value `property` ¶

value: str

Return the underlying string.

normalize ¶

normalize(*, form: NormalizationForm = 'NFC') -> Text

Unicode normalization (NFC, NFD, NFKC, NFKD).
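To illustrate what the four forms do, here is a minimal sketch using only the standard library's unicodedata module. It does not call translit; it only shows the effect the form argument selects between.

```python
import unicodedata

# "é" written as a base letter plus U+0301 COMBINING ACUTE ACCENT.
decomposed = "cafe\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# U+FB01 LATIN SMALL LIGATURE FI is a compatibility character.
ligature = "\ufb01le"

# Canonical composition merges the base letter and the accent.
print(len(decomposed), len(composed))                      # 5 4
# Canonical forms (NFC/NFD) leave compatibility characters alone.
print(unicodedata.normalize("NFC", ligature) == ligature)  # True
# Compatibility forms (NFKC/NFKD) expand them.
print(unicodedata.normalize("NFKC", ligature))             # file
```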

normalize_confusables ¶

normalize_confusables(*, target_script: str = 'latin') -> Text

Replace confusable homoglyphs with target-script equivalents.

strip_accents ¶

strip_accents() -> Text

Remove diacritical marks, preserving base characters.
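A common way to get this behavior with the standard library alone is to decompose, drop nonspacing marks, and recompose. The helper name strip_accents_sketch is hypothetical, and translit's own implementation may differ.

```python
import unicodedata


def strip_accents_sketch(text: str) -> str:
    """Drop nonspacing marks (category Mn) while keeping base characters."""
    # NFD splits "é" into "e" + U+0301 so the mark can be filtered out.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever is left into NFC.
    return unicodedata.normalize("NFC", kept)


print(strip_accents_sketch("caf\u00e9"))     # cafe
print(strip_accents_sketch("Stra\u00dfe"))   # Straße (ß has no decomposition)
```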

transliterate ¶

transliterate(*, lang: str | None = None, target: str | None = None, errors: ErrorMode = 'replace', replace_with: str = '[?]', strict_iso9: bool = False, gost7034: bool = False) -> Text

Unicode → ASCII transliteration.

fold_case ¶

fold_case() -> Text

Full Unicode case folding per CaseFolding.txt (1,557 mappings).

Covers Latin, Greek, Cyrillic, Armenian, Georgian, Cherokee, Adlam, Deseret, Osage, Warang Citi, fullwidth Latin, and all ligature expansions. Equivalent to str.casefold().
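Since the method is described as equivalent to str.casefold(), the difference from plain lowercasing can be shown with the built-in alone. The helper caseless_equal below is a hypothetical name used only for illustration.

```python
def caseless_equal(left: str, right: str) -> bool:
    """Compare two strings after full case folding."""
    return left.casefold() == right.casefold()


# lower() keeps the sharp s; casefold() expands it to "ss".
print("Stra\u00dfe".lower())                      # straße
print("Stra\u00dfe".casefold())                   # strasse
print(caseless_equal("STRASSE", "Stra\u00dfe"))   # True
# Ligatures expand as well: U+FB01 folds to "fi".
print("\ufb01".casefold())                        # fi
```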

collapse_whitespace ¶

collapse_whitespace(*, strip_control: bool = True, strip_zero_width: bool = True) -> Text

Normalize whitespace to single ASCII spaces; optionally strip control characters and zero-width characters.
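The following standard-library sketch approximates the described behavior. The function name and the exact set of zero-width characters are assumptions made for illustration; the set translit strips may be different.

```python
import unicodedata

# Assumed zero-width set: ZWSP, ZWNJ, ZWJ, WORD JOINER, and BOM/ZWNBSP.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def collapse_whitespace_sketch(
    text: str, *, strip_control: bool = True, strip_zero_width: bool = True
) -> str:
    if strip_zero_width:
        text = text.translate(_ZERO_WIDTH)
    if strip_control:
        # Drop control characters (category Cc) that are not whitespace.
        text = "".join(
            ch for ch in text
            if ch.isspace() or unicodedata.category(ch) != "Cc"
        )
    # split() with no argument splits on any Unicode whitespace run.
    return " ".join(text.split())


print(collapse_whitespace_sketch("  hello \t\n world\u00a0 "))  # hello world
```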

slugify ¶

slugify(*, separator: str = '-', lowercase: bool = True, max_length: int = 0, word_boundary: bool = False, save_order: bool = False, stopwords: Iterable[str] = (), regex_pattern: str | None = None, replacements: Iterable[tuple[str, str]] = (), allow_unicode: bool = False, lang: str | None = None, entities: bool = True, decimal: bool = True, hexadecimal: bool = True) -> Text

Generate a URL-safe slug.

sanitize_filename ¶

sanitize_filename(*, separator: str = '_', max_length: int = 255, platform: Platform = 'universal', lang: str | None = None, preserve_extension: bool = True) -> Text

Sanitize into a safe filename.

demojize ¶

demojize(*, strip_modifiers: bool = False, errors: ErrorMode = 'replace', replace_with: str = '[?]', provider: EmojiProvider | None = None) -> Text

Expand emoji to CLDR short-name text descriptions.

strip_bidi ¶

strip_bidi() -> Text

Strip bidirectional override and formatting characters.

security_clean ¶

security_clean() -> Text

Apply the security_clean precompiled pipeline.

NFKC → confusables → strip bidi/format → collapse_whitespace.

ml_normalize ¶

ml_normalize(*, lang: str | None = None, emoji: str = 'cldr') -> Text

Apply the ml_normalize precompiled pipeline.

NFKC → emoji→text → [transliterate] → strip_accents → fold_case → collapse_whitespace.

display_clean ¶

display_clean() -> Text

Apply the display_clean precompiled pipeline.

Collapse whitespace, strip control and zero-width characters.

is_ascii ¶

is_ascii() -> bool

True if all characters are U+0000–U+007F.

is_normalized ¶

is_normalized(*, form: NormalizationForm = 'NFC') -> bool

True if already in the specified normalization form.
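For comparison, the standard library exposes similar checks: str.isascii() (Python 3.7+) and unicodedata.is_normalized() (Python 3.8+). This sketch shows those built-ins only and makes no claim about how translit implements its predicates.

```python
import unicodedata

composed = "caf\u00e9"        # precomposed é
decomposed = "cafe\u0301"     # e + combining acute accent

print(unicodedata.is_normalized("NFC", composed))     # True
print(unicodedata.is_normalized("NFC", decomposed))   # False
print(unicodedata.is_normalized("NFD", decomposed))   # True
print(composed.isascii(), "cafe".isascii())           # False True
```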

is_confusable ¶

is_confusable(*, target_script: str = 'latin') -> bool

True if text contains confusable homoglyphs.

is_mixed_script ¶

is_mixed_script() -> bool

True if text contains characters from multiple Unicode scripts.

detect_scripts ¶

detect_scripts() -> list[Script]

Return Unicode scripts present, in order of first appearance.
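The standard library has no Script property lookup, so the sketch below uses a rough heuristic: the first word of each letter's Unicode character name. It is only meant to show the "first appearance" ordering and the mixed-script idea; the function names are hypothetical, and the result is a list of plain strings rather than translit's Script values.

```python
import unicodedata


def detect_scripts_sketch(text: str) -> list[str]:
    """Approximate scripts of the letters in text, in first-seen order."""
    seen: list[str] = []
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation, and spaces carry no script here
        name = unicodedata.name(ch, "")
        # Heuristic: "CYRILLIC SMALL LETTER A" -> "CYRILLIC".
        script = name.split(" ", 1)[0] if name else "UNKNOWN"
        if script not in seen:
            seen.append(script)
    return seen


def is_mixed_script_sketch(text: str) -> bool:
    return len(detect_scripts_sketch(text)) > 1


# The second letter is U+0430 CYRILLIC SMALL LETTER A, not Latin "a".
print(detect_scripts_sketch("p\u0430ypal"))   # ['LATIN', 'CYRILLIC']
print(is_mixed_script_sketch("paypal"))       # False
```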

grapheme_len ¶

grapheme_len() -> int

Count user-perceived characters (extended grapheme clusters).

grapheme_split ¶

grapheme_split() -> list[str]

Split into extended grapheme clusters.

grapheme_truncate ¶

grapheme_truncate(max_graphemes: int) -> Text

Truncate to at most max_graphemes grapheme clusters.

catalog_key ¶

catalog_key(*, lang: str | None = None, strict_iso9: bool = False) -> Text

Generate a library catalog key for bibliographic deduplication.

Usage¶

from translit import Text

result = (
    Text("Ünïcödé Café ☕")
    .normalize(form="NFKC")
    .transliterate()
    .strip_accents()
    .fold_case()
    .value
)
# => "unicode cafe hot beverage"

Each transform method returns a new Text instance (immutable semantics, matching Python str). Predicates return their native type (bool, list) and do not chain.

Chainable transforms¶

All core transforms are available as methods:

Method	Returns	Description
`.normalize(form=)`	`Text`	Unicode normalization
`.normalize_confusables()`	`Text`	Replace confusable homoglyphs
`.strip_accents()`	`Text`	Remove diacritical marks
`.transliterate(lang=, ...)`	`Text`	Unicode → ASCII
`.fold_case()`	`Text`	Full Unicode case folding
`.collapse_whitespace()`	`Text`	Normalize whitespace
`.slugify(...)`	`Text`	Generate URL-safe slug
`.sanitize_filename(...)`	`Text`	Safe filename
`.demojize(...)`	`Text`	Emoji → text descriptions
`.strip_bidi()`	`Text`	Strip bidi overrides
`.security_clean()`	`Text`	Security pipeline
`.ml_normalize(...)`	`Text`	ML/NLP pipeline
`.display_clean()`	`Text`	Display cleanup pipeline
`.catalog_key(...)`	`Text`	Catalog key pipeline
`.grapheme_truncate(n)`	`Text`	Truncate to n graphemes

Non-chaining predicates¶

Method	Returns	Description
`.is_ascii()`	`bool`	All characters are ASCII
`.is_normalized(form=)`	`bool`	Already in normalization form
`.is_confusable()`	`bool`	Contains confusable homoglyphs
`.is_mixed_script()`	`bool`	Multiple Unicode scripts
`.detect_scripts()`	`list[Script]`	Scripts present
`.grapheme_len()`	`int`	User-perceived character count
`.grapheme_split()`	`list[str]`	Split into grapheme clusters

Result extraction¶

Use .value or str() to extract the underlying string:

text = Text("café").strip_accents()
text.value   # => "cafe"
str(text)    # => "cafe"

Text supports ==, hash(), len(), and bool(); each operates on the underlying string value.
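The pattern behind this kind of wrapper can be sketched in a few lines. TextSketch below is a hypothetical stand-in, not translit's Text: it assumes that equality also accepts a plain str, and it implements only two transforms using built-in string methods.

```python
class TextSketch:
    """Minimal wrapper: transforms return new objects, dunders delegate."""

    __slots__ = ("_value",)

    def __init__(self, value: str) -> None:
        self._value = value

    @property
    def value(self) -> str:
        return self._value

    # Transforms never mutate self; they return a fresh instance.
    def fold_case(self) -> "TextSketch":
        return TextSketch(self._value.casefold())

    def collapse_whitespace(self) -> "TextSketch":
        return TextSketch(" ".join(self._value.split()))

    def __str__(self) -> str:
        return self._value

    def __eq__(self, other: object) -> bool:
        if isinstance(other, TextSketch):
            return self._value == other._value
        if isinstance(other, str):  # assumption: str comparison is allowed
            return self._value == other
        return NotImplemented

    def __hash__(self) -> int:
        return hash(self._value)

    def __len__(self) -> int:
        return len(self._value)

    def __bool__(self) -> bool:
        return bool(self._value)


original = TextSketch("  Hello   World ")
cleaned = original.collapse_whitespace().fold_case()
print(cleaned.value)    # hello world
print(original.value == "  Hello   World ")  # True: the original is untouched
```

Hashing by the wrapped value keeps the wrapper consistent with its equality, so an instance and its string can be used interchangeably as dictionary keys in this sketch.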

Slugifier¶

Reusable configured slugifier. Call the instance as slugifier(text) -> str.

Examples:

>>> s = Slugifier(separator="_", lang="de")
>>> s("Ärger im Büro")
'aerger_im_buero'

Usage¶

from translit import Slugifier

slug = Slugifier(separator="_", lang="de", max_length=50)
slug("Ärger im Büro")     # => "aerger_im_buero"
slug("Über den Wolken")   # => "ueber_den_wolken"

# Auto-detect language from script
auto_slug = Slugifier(lang="auto")
auto_slug("Москва")       # => "moskva" (detects Cyrillic → Russian)

Accepts all the same parameters as slugify(). Construct once, call many times.
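The "construct once, call many times" pattern amounts to a callable object that stores its configuration and any derived tables up front. SlugifierSketch below is a hypothetical, standard-library approximation: it supports only four of the parameters, its German table is a small assumed mapping, and it treats max_length=0 as "no limit" (an assumption based on the default).

```python
import re
import unicodedata

# Assumed German mapping: ä, ö, ü, Ä, Ö, Ü, ß.
_GERMAN = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
    "\u00df": "ss",
})


class SlugifierSketch:
    def __init__(self, *, separator: str = "-", lowercase: bool = True,
                 max_length: int = 0, lang: "str | None" = None) -> None:
        self.separator = separator
        self.lowercase = lowercase
        self.max_length = max_length
        # Work done once at construction, reused on every call.
        self._table = _GERMAN if lang == "de" else None
        self._non_alnum = re.compile(r"[^A-Za-z0-9]+")

    def __call__(self, text: str) -> str:
        text = unicodedata.normalize("NFC", text)
        if self._table is not None:
            text = text.translate(self._table)
        # Fall back to dropping accents and any remaining non-ASCII.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii")
        if self.lowercase:
            text = text.lower()
        slug = self._non_alnum.sub(self.separator, text).strip(self.separator)
        if self.max_length > 0:
            slug = slug[: self.max_length].rstrip(self.separator)
        return slug


slug = SlugifierSketch(separator="_", lang="de")
print(slug("\u00c4rger im B\u00fcro"))    # aerger_im_buero
print(slug("\u00dcber den Wolken"))       # ueber_den_wolken
```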

UniqueSlugifier¶

Stateful slugifier that tracks previously generated slugs.

Appends incrementing suffixes for uniqueness. Optional check callback for external uniqueness (e.g. database lookup).

Examples:

>>> u = UniqueSlugifier()
>>> u("My Post")
'my-post'
>>> u("My Post")
'my-post-1'

reset ¶

reset() -> None

Clear the internal set of seen slugs.

Usage¶

from translit import UniqueSlugifier

unique = UniqueSlugifier()
unique("My Post")   # => "my-post"
unique("My Post")   # => "my-post-1"
unique("My Post")   # => "my-post-2"

unique.reset()      # clear seen slugs
unique("My Post")   # => "my-post"

External uniqueness check¶

def exists_in_db(slug: str) -> bool:
    # `db` stands in for your application's data layer (e.g. an ORM handle).
    return db.slugs.filter(slug=slug).exists()

unique = UniqueSlugifier(check=exists_in_db)

The check callback is called for each candidate slug. If it returns True, the slugifier increments the suffix and tries again.
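The retry loop described above can be sketched as follows. UniqueSlugifierSketch and _basic_slug are hypothetical names, the slug rule is a deliberately simple ASCII one, and a set's membership test stands in for the database lookup.

```python
import re
from typing import Callable, Optional


def _basic_slug(text: str) -> str:
    """Very small ASCII-only slug rule, used only for this illustration."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


class UniqueSlugifierSketch:
    def __init__(self, check: Optional[Callable[[str], bool]] = None) -> None:
        self._seen: set[str] = set()
        self._check = check

    def _taken(self, slug: str) -> bool:
        # A slug is taken if we produced it before or the callback says so.
        if slug in self._seen:
            return True
        return self._check is not None and self._check(slug)

    def __call__(self, text: str) -> str:
        base = _basic_slug(text)
        candidate, suffix = base, 0
        while self._taken(candidate):
            suffix += 1
            candidate = f"{base}-{suffix}"
        self._seen.add(candidate)
        return candidate

    def reset(self) -> None:
        self._seen.clear()


# Slugs that already exist "externally", e.g. rows in a database table.
existing = {"my-post", "my-post-1"}
unique = UniqueSlugifierSketch(check=existing.__contains__)
print(unique("My Post"))   # my-post-2
print(unique("My Post"))   # my-post-3
```

Note that reset() in this sketch clears only the in-memory set; slugs reported by the callback remain taken.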

TextPipeline¶

Composable, pre-compiled text cleaning pipeline.

Operations execute in fixed optimal order regardless of construction order.

Examples:

>>> pipe = TextPipeline(normalize="NFC", fold_case=True, collapse_whitespace=True)
>>> pipe("  Héllo  WÖRLD  ")
'héllo wörld'

steps `property` ¶

steps: list[tuple[str, str | None]]

Return the ordered list of active pipeline steps.

Each entry is a (step_name, parameter) tuple. Steps are listed in execution order. parameter is None for parameterless steps (e.g. fold_case), or a string value for steps that accept one (e.g. ("normalize", "NFC")).

Examples:

>>> pipe = TextPipeline(normalize="NFC", fold_case=True)
>>> pipe.steps
[('normalize', 'NFC'), ('fold_case', None)]

explain ¶

explain() -> str

Return a human-readable description of the pipeline.

Examples:

>>> pipe = TextPipeline(normalize="NFC", fold_case=True)
>>> print(pipe.explain())
TextPipeline with 2 steps:
  1. normalize (NFC)
  2. fold_case

Usage¶

from translit import TextPipeline

pipe = TextPipeline(
    normalize="NFC",
    confusables=True,
    strip_accents=True,
    fold_case=True,
    collapse_whitespace=True,
)

pipe("  Héllo Wörld  ")  # => "hello world"

Execution order¶

Operations execute in this fixed order regardless of construction order:

1. Normalize → 2. Confusables → 3. Demojize → 4. Strip accents → 5. Transliterate → 6. Fold case → 7. Collapse whitespace

Performance¶

The pipeline is pre-compiled at construction. Enabled steps are stored as a set of bit flags, so only enabled steps execute at call time.
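One way to picture this design is an enum.IntFlag for the enabled steps plus a fixed step table that is filtered once at construction. PipelineSketch below is a hypothetical reduction to four steps built from standard-library calls; it is not translit's implementation.

```python
import enum
import unicodedata
from typing import Callable, Optional


class Step(enum.IntFlag):
    NORMALIZE = enum.auto()
    STRIP_ACCENTS = enum.auto()
    FOLD_CASE = enum.auto()
    COLLAPSE_WHITESPACE = enum.auto()


def _strip_accents(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unicodedata.normalize("NFC", kept)


class PipelineSketch:
    def __init__(self, *, normalize: Optional[str] = None,
                 strip_accents: bool = False, fold_case: bool = False,
                 collapse_whitespace: bool = False) -> None:
        flags = Step(0)
        if normalize is not None:
            flags |= Step.NORMALIZE
        if strip_accents:
            flags |= Step.STRIP_ACCENTS
        if fold_case:
            flags |= Step.FOLD_CASE
        if collapse_whitespace:
            flags |= Step.COLLAPSE_WHITESPACE
        self.flags = flags
        # Fixed order lives in this table, not in the keyword order.
        table: list[tuple[Step, Callable[[str], str]]] = [
            (Step.NORMALIZE, lambda s: unicodedata.normalize(normalize, s)),
            (Step.STRIP_ACCENTS, _strip_accents),
            (Step.FOLD_CASE, str.casefold),
            (Step.COLLAPSE_WHITESPACE, lambda s: " ".join(s.split())),
        ]
        # "Compile" once: keep only the callables whose flag is set.
        self._active = [func for flag, func in table if flags & flag]

    def __call__(self, text: str) -> str:
        for func in self._active:
            text = func(text)
        return text


# Keyword order differs from execution order on purpose.
pipe = PipelineSketch(collapse_whitespace=True, fold_case=True, normalize="NFC")
print(pipe("  H\u00e9llo  W\u00d6RLD  "))   # héllo wörld
```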

Compatibility aliases (awesome-slugify)¶

These classes provide drop-in replacements for awesome-slugify's Slugify and UniqueSlugify. They accept awesome-slugify's parameter names and map them to native translit parameters.

See the migration guide for full details.

Slugify¶

awesome-slugify-compatible Slugify class.

This is a drop-in replacement for from slugify import Slugify.

from translit import Slugify

# Same API as awesome-slugify
custom = Slugify(to_lower=True)
custom("Hello World")  # => "hello-world"

# Attribute-style configuration (awesome-slugify pattern)
s = Slugify()
s.to_lower = True
s.stop_words = ("the", "a")
s.max_length = 200
s("The Big Fox")  # => "big-fox"

Accepts both awesome-slugify parameter names (to_lower, stop_words, safe_chars, capitalize, pretranslate) and native translit names (lowercase, stopwords, replacements).

Defaults to to_lower=False (matching awesome-slugify). For python-slugify compatibility (which defaults to lowercase=True), use the native Slugifier class or the slugify() function.

UniqueSlugify¶

Bases: Slugify

awesome-slugify-compatible UniqueSlugify class.

Tracks previously generated slugs and appends numeric suffixes to guarantee uniqueness.

Usage:

from translit import UniqueSlugify
unique = UniqueSlugify()
unique("My Post")   # => "My-Post"
unique("My Post")   # => "My-Post-1"

This is a drop-in replacement for from slugify import UniqueSlugify.

reset ¶

reset() -> None

Clear the internal set of seen slugs.

from translit import UniqueSlugify

unique = UniqueSlugify(to_lower=True)
unique("My Post")   # => "my-post"
unique("My Post")   # => "my-post-1"

unique.reset()
unique("My Post")   # => "my-post"

Extends Slugify with uniqueness tracking. Accepts uids and unique_check parameters from awesome-slugify.

Preconfigured instances¶

Drop-in replacements for awesome-slugify's preconfigured slugifiers:

from translit import (
    slugify_url,       # lowercase, strips articles, max 200 chars
    slugify_filename,  # underscore separator, preserves "-" and ".", max 255 chars
    slugify_unicode,   # keeps non-ASCII letters
    slugify_ru,        # Russian transliteration
    slugify_de,        # German transliteration (ä→ae, ö→oe, ü→ue)
    slugify_el,        # Greek transliteration
)

slugify_url("The Big Fox")        # => "big-fox"
slugify_de("Ärger im Büro")       # => "Aerger-im-Buero"
slugify_filename("My Report.pdf") # => "My_Report.pdf"
