Classes¶
Stateful objects and builders for repeated or specialized text processing.
Text¶
Immutable wrapper for fluent Unicode text processing.
Wrap a string, chain transforms in any order, extract with .value
or str().
Examples:
>>> from translit import Text
>>> Text("Straße").fold_case().value
'strasse'
>>> Text(" hello world ").collapse_whitespace().value
'hello world'
>>> str(Text("café").strip_accents())
'cafe'
normalize ¶
normalize(*, form: NormalizationForm = 'NFC') -> Text
Unicode normalization (NFC, NFD, NFKC, NFKD).
normalize_confusables ¶
normalize_confusables(*, target_script: str = 'latin') -> Text
Replace confusable homoglyphs with target-script equivalents.
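To illustrate what confusable replacement means in practice, the sketch below folds a handful of Cyrillic and Greek look-alikes to Latin letters using only the standard library. The `LOOKALIKES` table and both helper functions are hypothetical stand-ins written for this example; they are not translit's data or implementation, and a real confusables table is far larger.

```python
# Hypothetical, hand-picked look-alike table (illustration only).
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}
_TABLE = str.maketrans(LOOKALIKES)


def fold_lookalikes(text: str) -> str:
    """Replace every character found in LOOKALIKES with its Latin twin."""
    return text.translate(_TABLE)


def has_lookalikes(text: str) -> bool:
    """True if any character of text appears in LOOKALIKES."""
    return any(ch in LOOKALIKES for ch in text)


spoofed = "p\u0430ypal"  # the second letter is Cyrillic, not Latin
print(fold_lookalikes(spoofed))
print(has_lookalikes(spoofed), has_lookalikes("paypal"))
```

The two strings render identically in most fonts, which is why a check of this kind is usually run before comparing identifiers.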
transliterate ¶
transliterate(*, lang: str | None = None, target: str | None = None, errors: ErrorMode = 'replace', replace_with: str = '[?]', strict_iso9: bool = False, gost7034: bool = False) -> Text
Unicode → ASCII transliteration.
fold_case ¶
fold_case() -> Text
Full Unicode case folding per CaseFolding.txt (1,557 mappings).
Covers Latin, Greek, Cyrillic, Armenian, Georgian, Cherokee,
Adlam, Deseret, Osage, Warang Citi, fullwidth Latin, and all
ligature expansions. Equivalent to str.casefold().
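Because the description above equates this method with `str.casefold()`, the folding behaviour can be previewed with the standard library alone. The snippet below is a small comparison of `casefold()` and `lower()`; it does not call translit, and exact coverage depends on the Unicode version bundled with the Python interpreter.

```python
# Inputs: sharp s, the "fi" ligature (U+FB01), and a fullwidth capital A (U+FF21).
words = ["Stra\u00dfe", "\ufb01sh", "\uff21BC"]

folded = [w.casefold() for w in words]   # full case folding
lowered = [w.lower() for w in words]     # simple lowercasing, for contrast

print(folded)
print(lowered)
```

`lower()` leaves the sharp s and the ligature untouched, while `casefold()` expands them, which is what makes folded strings suitable for caseless comparison.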
collapse_whitespace ¶
collapse_whitespace(*, strip_control: bool = True, strip_zero_width: bool = True) -> Text
Normalize whitespace to single ASCII spaces; optionally strip control characters and zero-width characters.
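A rough standard-library sketch of this behaviour follows. The function name `collapse_ws` and the `ZERO_WIDTH` set are assumptions made for the example; the library may treat a different set of characters as zero-width, and this is not its implementation.

```python
import re
import unicodedata

# Assumed set of zero-width characters (illustration only).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def collapse_ws(text, *, strip_control=True, strip_zero_width=True):
    out = []
    for ch in text:
        if strip_zero_width and ch in ZERO_WIDTH:
            continue                      # drop zero-width characters
        if ch.isspace():
            out.append(" ")               # any whitespace becomes an ASCII space
            continue
        if strip_control and unicodedata.category(ch) == "Cc":
            continue                      # drop remaining control characters
        out.append(ch)
    # Squeeze runs of spaces and trim both ends.
    return re.sub(r" +", " ", "".join(out)).strip()


print(collapse_ws("  hello \t\n world\u200b "))
```

Whitespace is tested before the control-character check so that tabs and newlines, which are also category `Cc`, become spaces instead of vanishing.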
slugify ¶
slugify(*, separator: str = '-', lowercase: bool = True, max_length: int = 0, word_boundary: bool = False, save_order: bool = False, stopwords: Iterable[str] = (), regex_pattern: str | None = None, replacements: Iterable[tuple[str, str]] = (), allow_unicode: bool = False, lang: str | None = None, entities: bool = True, decimal: bool = True, hexadecimal: bool = True) -> Text
Generate a URL-safe slug.
sanitize_filename ¶
sanitize_filename(*, separator: str = '_', max_length: int = 255, platform: Platform = 'universal', lang: str | None = None, preserve_extension: bool = True) -> Text
Sanitize into a safe filename.
demojize ¶
demojize(*, strip_modifiers: bool = False, errors: ErrorMode = 'replace', replace_with: str = '[?]', provider: EmojiProvider | None = None) -> Text
Expand emoji to CLDR short-name text descriptions.
security_clean ¶
security_clean() -> Text
Apply the security_clean precompiled pipeline.
NFKC → confusables → strip bidi/format → collapse_whitespace.
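The stage order above can be approximated with the standard library, as a sketch only. The function names are hypothetical, the confusables stage is left out because the standard library has no confusables table, and "strip bidi/format" is assumed here to mean removing every character in Unicode category `Cf`.

```python
import unicodedata


def strip_format_chars(text):
    """Drop format characters (category Cf), which include the bidi controls."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def security_clean_sketch(text):
    text = unicodedata.normalize("NFKC", text)  # 1. compatibility normalization
    # 2. confusable replacement would run here (omitted in this sketch)
    text = strip_format_chars(text)             # 3. strip bidi/format characters
    return " ".join(text.split())               # 4. collapse whitespace


# Fullwidth "a", then a RIGHT-TO-LEFT OVERRIDE hidden inside the string.
print(security_clean_sketch("\uff41dmin\u202e  txt"))
```

Removing all of category `Cf` also removes the zero-width joiner used in emoji sequences, so a production pipeline may want a narrower list.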
ml_normalize ¶
ml_normalize(*, lang: str | None = None, emoji: str = 'cldr') -> Text
Apply the ml_normalize precompiled pipeline.
NFKC → emoji→text → [transliterate] → strip_accents → fold_case → collapse_whitespace.
display_clean ¶
display_clean() -> Text
Apply the display_clean precompiled pipeline.
Collapse whitespace, strip control and zero-width characters.
is_normalized ¶
is_normalized(*, form: NormalizationForm = 'NFC') -> bool
True if already in the specified normalization form.
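For comparison, the standard library exposes the same two operations through `unicodedata.normalize()` and `unicodedata.is_normalized()` (Python 3.8+). The snippet below shows the difference between a composed and a decomposed spelling of the same word; it does not call translit.

```python
import unicodedata

composed = "caf\u00e9"      # e-acute as a single code point
decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT

composed_is_nfc = unicodedata.is_normalized("NFC", composed)
decomposed_is_nfc = unicodedata.is_normalized("NFC", decomposed)
recomposed = unicodedata.normalize("NFC", decomposed)

print(composed_is_nfc, decomposed_is_nfc, recomposed == composed)
```

Checking first is worthwhile when most input is already normalized, since the check can return without building a new string.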
is_confusable ¶
is_confusable(*, target_script: str = 'latin') -> bool
True if text contains confusable homoglyphs.
is_mixed_script ¶
is_mixed_script() -> bool
True if text contains characters from multiple Unicode scripts.
detect_scripts ¶
detect_scripts() -> list[Script]
Return Unicode scripts present, in order of first appearance.
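Python's standard library does not expose the Unicode Script property, so the sketch below approximates it with the first word of each letter's character name. The helpers `rough_scripts` and `rough_is_mixed` are hypothetical, return plain strings instead of `Script` values, and will disagree with real script data in edge cases.

```python
import unicodedata


def rough_scripts(text):
    """Approximate scripts of the letters in text, in order of first appearance."""
    seen = []
    for ch in text:
        if not ch.isalpha():
            continue                          # ignore digits, spaces, punctuation
        name = unicodedata.name(ch, "")
        script = name.split(" ", 1)[0] if name else "UNKNOWN"
        if script not in seen:
            seen.append(script)
    return seen


def rough_is_mixed(text):
    return len(rough_scripts(text)) > 1


print(rough_scripts("\u041c\u043e\u0441\u043a\u0432\u0430 Moscow"))
print(rough_is_mixed("p\u0430ypal"))  # Latin text with one Cyrillic letter
```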
grapheme_truncate ¶
grapheme_truncate(max_graphemes: int) -> Text
Truncate to at most max_graphemes grapheme clusters.
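The reason to truncate by grapheme cluster instead of by code point is shown in the simplified sketch below. It only attaches combining marks to the preceding base character; full segmentation (emoji ZWJ sequences, regional indicators, Hangul jamo) follows UAX #29 and is not attempted here. Both function names are hypothetical.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch        # attach the mark to the previous cluster
        else:
            clusters.append(ch)       # start a new cluster
    return clusters


def truncate_clusters(text, max_graphemes):
    return "".join(simple_clusters(text)[:max_graphemes])


word = "e\u0301a\u0308z"              # e + acute, a + diaeresis, z
by_cluster = truncate_clusters(word, 2)
by_code_point = word[:3]              # slicing cuts the second accent off

print(len(by_cluster), len(by_code_point))
```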
catalog_key ¶
catalog_key(*, lang: str | None = None, strict_iso9: bool = False) -> Text
Library catalog key generation for bibliographic deduplication.
Usage¶
from translit import Text
result = (
    Text("Ünïcödé Café ☕")
    .normalize(form="NFKC")  # form is keyword-only
    .transliterate()
    .strip_accents()
    .fold_case()
    .value
)
# => "unicode cafe hot beverage"
Each transform method returns a new Text instance (immutable semantics, matching Python str). Predicates return their native type (bool, int, list) and do not chain.
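The chaining pattern can be pictured with a small frozen dataclass. `MiniText` below is a hypothetical stand-in built on the standard library, not the translit class: transforms return a new instance, while the predicate returns a plain `bool` and therefore ends the chain.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniText:
    value: str

    def fold_case(self) -> "MiniText":
        return MiniText(self.value.casefold())

    def strip_accents(self) -> "MiniText":
        decomposed = unicodedata.normalize("NFD", self.value)
        kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return MiniText(unicodedata.normalize("NFC", kept))

    def is_ascii(self) -> bool:       # predicate: native type, not chainable
        return self.value.isascii()

    def __str__(self) -> str:
        return self.value


original = MiniText("Caf\u00e9")
cleaned = original.strip_accents().fold_case()

print(cleaned.value, original.value)  # the original instance is unchanged
```

Because the dataclass is frozen, assigning to `original.value` raises an error, and intermediate results can be shared freely.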
Chainable transforms¶
All core transforms are available as methods:
| Method | Returns | Description |
|---|---|---|
| .normalize(form=) | Text | Unicode normalization |
| .normalize_confusables() | Text | Replace confusable homoglyphs |
| .strip_accents() | Text | Remove diacritical marks |
| .transliterate(lang=, ...) | Text | Unicode → ASCII |
| .fold_case() | Text | Full Unicode case folding |
| .collapse_whitespace() | Text | Normalize whitespace |
| .slugify(...) | Text | Generate URL-safe slug |
| .sanitize_filename(...) | Text | Safe filename |
| .demojize(...) | Text | Emoji → text descriptions |
| .strip_bidi() | Text | Strip bidi overrides |
| .security_clean() | Text | Security pipeline |
| .ml_normalize(...) | Text | ML/NLP pipeline |
| .display_clean() | Text | Display cleanup pipeline |
| .catalog_key(...) | Text | Catalog key pipeline |
| .grapheme_truncate(n) | Text | Truncate to n graphemes |
Non-chaining predicates¶
| Method | Returns | Description |
|---|---|---|
| .is_ascii() | bool | All characters are ASCII |
| .is_normalized(form=) | bool | Already in normalization form |
| .is_confusable() | bool | Contains confusable homoglyphs |
| .is_mixed_script() | bool | Multiple Unicode scripts |
| .detect_scripts() | list[Script] | Scripts present |
| .grapheme_len() | int | User-perceived character count |
| .grapheme_split() | list[str] | Split into grapheme clusters |
Result extraction¶
Use .value or str() to extract the underlying string:
text = Text("café").strip_accents()
text.value # => "cafe"
str(text) # => "cafe"
Text supports ==, hash(), len(), and bool(), all of which operate on the underlying string value.
Slugifier¶
Reusable configured slugifier. Call instance as slugifier(text) -> str.
Examples:
>>> s = Slugifier(separator="_", lang="de")
>>> s("Ärger im Büro")
'aerger_im_buero'
Usage¶
from translit import Slugifier
slug = Slugifier(separator="_", lang="de", max_length=50)
slug("Ärger im Büro") # => "aerger_im_buero"
slug("Über den Wolken") # => "ueber_den_wolken"
# Auto-detect language from script
auto_slug = Slugifier(lang="auto")
auto_slug("Москва") # => "moskva" (detects Cyrillic → Russian)
Accepts all the same parameters as slugify(). Construct once, call many times.
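The "construct once, call many times" pattern can be sketched as a factory that does its setup up front and returns a cheap callable. Everything below is hypothetical and standard-library only: `make_slugifier`, its `table` parameter, and the `GERMAN` mapping are assumptions for illustration, not translit's API or data.

```python
import re
import unicodedata

# Assumed German mapping: a-umlaut, o-umlaut, u-umlaut (both cases) and sharp s.
GERMAN = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
    "\u00df": "ss",
}


def make_slugifier(separator="-", table=None):
    trans = str.maketrans(table or {})      # built once, reused on every call
    non_alnum = re.compile(r"[^a-z0-9]+")   # compiled once

    def slugify(text):
        text = text.translate(trans)                 # language-specific pairs
        text = unicodedata.normalize("NFKD", text)   # split off remaining accents
        text = text.encode("ascii", "ignore").decode("ascii").lower()
        return non_alnum.sub(lambda _m: separator, text).strip(separator)

    return slugify


slug_de = make_slugifier(separator="_", table=GERMAN)
print(slug_de("\u00c4rger im B\u00fcro"))
print(slug_de("\u00dcber den Wolken"))
```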
UniqueSlugifier¶
Stateful slugifier that tracks previously generated slugs.
Appends incrementing suffixes for uniqueness. Optional check callback for external uniqueness (e.g. database lookup).
Examples:
>>> u = UniqueSlugifier()
>>> u("My Post")
'my-post'
>>> u("My Post")
'my-post-1'
Usage¶
from translit import UniqueSlugifier
unique = UniqueSlugifier()
unique("My Post") # => "my-post"
unique("My Post") # => "my-post-1"
unique("My Post") # => "my-post-2"
unique.reset() # clear seen slugs
unique("My Post") # => "my-post"
External uniqueness check¶
def exists_in_db(slug: str) -> bool:
    # db stands for your application's database handle; it is not part of translit
    return db.slugs.filter(slug=slug).exists()
unique = UniqueSlugifier(check=exists_in_db)
The check callback is called for each candidate slug. If it returns True, the slugifier increments the suffix and tries again.
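The retry loop described above can be sketched in a few lines. `MiniUniqueSlugs` is a hypothetical stand-in that takes an already-formed slug; the suffix format and the order in which the in-memory set and the callback are consulted are assumptions, not confirmed library behaviour.

```python
from typing import Callable, Optional


class MiniUniqueSlugs:
    def __init__(self, check: Optional[Callable[[str], bool]] = None):
        self._seen = set()
        self._check = check

    def _taken(self, candidate: str) -> bool:
        if candidate in self._seen:
            return True
        return bool(self._check and self._check(candidate))

    def __call__(self, slug: str) -> str:
        candidate, n = slug, 0
        while self._taken(candidate):      # try slug, slug-1, slug-2, ...
            n += 1
            candidate = f"{slug}-{n}"
        self._seen.add(candidate)
        return candidate

    def reset(self) -> None:
        self._seen.clear()


existing = {"my-post", "my-post-1"}        # stand-in for rows already in a database
unique = MiniUniqueSlugs(check=existing.__contains__)
first = unique("my-post")
second = unique("my-post")
print(first, second)
```

Each candidate costs one callback invocation, so with a database-backed check a heavily duplicated title means one query per attempted suffix.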
TextPipeline¶
Composable, pre-compiled text cleaning pipeline.
Operations execute in fixed optimal order regardless of construction order.
Examples:
>>> pipe = TextPipeline(normalize="NFC", fold_case=True, collapse_whitespace=True)
>>> pipe(" Héllo WÖRLD ")
'héllo wörld'
steps (property) ¶
steps: list[tuple[str, str | None]]
Return the ordered list of active pipeline steps.
Each entry is a (step_name, parameter) tuple. Steps are listed
in execution order. parameter is None for parameterless
steps (e.g. fold_case), or a string value for steps that accept
one (e.g. ("normalize", "NFC")).
Examples:
>>> pipe = TextPipeline(normalize="NFC", fold_case=True)
>>> pipe.steps
[('normalize', 'NFC'), ('fold_case', None)]
explain ¶
explain() -> str
Return a human-readable description of the pipeline.
Examples:
>>> pipe = TextPipeline(normalize="NFC", fold_case=True)
>>> print(pipe.explain())
TextPipeline with 2 steps:
1. normalize (NFC)
2. fold_case
Usage¶
from translit import TextPipeline
pipe = TextPipeline(
    normalize="NFC",
    confusables=True,
    strip_accents=True,
    fold_case=True,
    collapse_whitespace=True,
)
pipe(" Héllo Wörld ") # => "hello world"
Execution order¶
Operations execute in this fixed order regardless of construction order:
1. Normalize
2. Confusables
3. Demojize
4. Strip accents
5. Transliterate
6. Fold case
7. Collapse whitespace
Performance¶
The pipeline is pre-compiled at construction. Enabled steps are stored as a bitflag set — only enabled steps execute at call time.
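One way to picture a flag-based, fixed-order pipeline is the sketch below. `Step`, `ORDERED`, and `MiniPipeline` are hypothetical names, only four of the seven steps are modelled, and every step is a plain boolean switch; the library's internals may look quite different.

```python
import unicodedata
from enum import IntFlag, auto


class Step(IntFlag):
    NORMALIZE = auto()
    STRIP_ACCENTS = auto()
    FOLD_CASE = auto()
    COLLAPSE_WHITESPACE = auto()


def _strip_accents(text):
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


# The canonical order lives in one table, so keyword order never matters.
ORDERED = (
    (Step.NORMALIZE, lambda t: unicodedata.normalize("NFC", t)),
    (Step.STRIP_ACCENTS, _strip_accents),
    (Step.FOLD_CASE, str.casefold),
    (Step.COLLAPSE_WHITESPACE, lambda t: " ".join(t.split())),
)


class MiniPipeline:
    def __init__(self, **enabled):
        self.flags = Step(0)
        for name, on in enabled.items():
            if on:
                self.flags |= Step[name.upper()]
        # "Compile" once: keep only the functions whose flag is set.
        self._funcs = [fn for flag, fn in ORDERED if flag & self.flags]

    def __call__(self, text):
        for fn in self._funcs:
            text = fn(text)
        return text


a = MiniPipeline(fold_case=True, strip_accents=True, collapse_whitespace=True)
b = MiniPipeline(collapse_whitespace=True, strip_accents=True, fold_case=True)
print(a(" H\u00e9llo   W\u00f6rld "), a.flags == b.flags)
```

Filtering the step table at construction means a call only loops over the enabled functions, with no per-call option checks.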
Compatibility aliases (awesome-slugify)¶
These classes provide drop-in replacements for awesome-slugify's Slugify and UniqueSlugify. They accept awesome-slugify's parameter names and map them to native translit parameters.
See the migration guide for full details.
Slugify¶
awesome-slugify-compatible Slugify class.
Accepts both awesome-slugify parameter names (to_lower, stop_words,
safe_chars, capitalize, pretranslate) and native translit names.
This is a drop-in replacement for from slugify import Slugify.
from translit import Slugify
# Same API as awesome-slugify
custom = Slugify(to_lower=True)
custom("Hello World") # => "hello-world"
# Attribute-style configuration (awesome-slugify pattern)
s = Slugify()
s.to_lower = True
s.stop_words = ("the", "a")
s.max_length = 200
s("The Big Fox") # => "big-fox"
Accepts both awesome-slugify parameter names (to_lower, stop_words, safe_chars, capitalize, pretranslate) and native translit names (lowercase, stopwords, replacements).
Defaults to to_lower=False (matching awesome-slugify). For python-slugify compatibility (which defaults to lowercase=True), use the native Slugifier class or the slugify() function.
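Accepting two spellings of the same option usually comes down to a small alias table. The sketch below is hypothetical: `ALIASES` and `to_native_kwargs` are invented for illustration, and only the two pairings that are evident from the names are included, since the text above does not spell out how the remaining parameters map.

```python
# Assumed legacy-to-native name pairs (illustration only).
ALIASES = {"to_lower": "lowercase", "stop_words": "stopwords"}


def to_native_kwargs(kwargs):
    """Rename legacy keys to native ones, rejecting conflicting duplicates."""
    native = {}
    for name, value in kwargs.items():
        target = ALIASES.get(name, name)
        if target in native:
            raise TypeError(f"{name!r} conflicts with an earlier {target!r}")
        native[target] = value
    return native


legacy = {"to_lower": True, "stop_words": ("the", "a"), "max_length": 200}
print(to_native_kwargs(legacy))
```

Raising on a conflict avoids silently preferring one spelling when a caller passes both `to_lower` and `lowercase`.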
UniqueSlugify¶
Bases: Slugify
awesome-slugify-compatible UniqueSlugify class.
Tracks previously generated slugs and appends numeric suffixes to guarantee uniqueness.
Usage:
from translit import UniqueSlugify
unique = UniqueSlugify()
unique("My Post") # => "My-Post"
unique("My Post") # => "My-Post-1"
This is a drop-in replacement for from slugify import UniqueSlugify.
from translit import UniqueSlugify
unique = UniqueSlugify(to_lower=True)
unique("My Post") # => "my-post"
unique("My Post") # => "my-post-1"
unique.reset()
unique("My Post") # => "my-post"
Extends Slugify with uniqueness tracking. Accepts uids and unique_check parameters from awesome-slugify.
Preconfigured instances¶
Drop-in replacements for awesome-slugify's preconfigured slugifiers:
from translit import (
    slugify_url,       # lowercase, strips articles, max 200 chars
    slugify_filename,  # underscore separator, preserves -., max 255 chars
    slugify_unicode,   # keeps non-ASCII letters
    slugify_ru,        # Russian transliteration
    slugify_de,        # German transliteration (ä→ae, ö→oe, ü→ue)
    slugify_el,        # Greek transliteration
)
slugify_url("The Big Fox") # => "big-fox"
slugify_de("Ärger im Büro") # => "Aerger-im-Buero"
slugify_filename("My Report.pdf") # => "My_Report.pdf"