translit

License: MIT

Unicode text infrastructure for Python: transliteration, normalization, and safety analysis, powered by Rust.

Documentation | API Reference | PyPI

Demo

Try translit in your browser

Features

All text processing is implemented in Rust with O(1) PHF lookups and exposed to Python via PyO3.

Installation

pip install translit-rs

The package installs as translit-rs on PyPI but imports as translit:

import translit  # not translit_rs

Requires Python 3.9+. Wheels are available for Linux, macOS, and Windows.

Quick start

from translit import transliterate, slugify, sanitize_filename

# Latin/Cyrillic/Greek
transliterate("café")          # → "cafe"
transliterate("Москва")        # → "Moskva"
transliterate("Ünïcödé")       # → "Unicode"

# Chinese (Hanzi → Pinyin)
transliterate("北京市")         # → "bei jing shi"
slugify("北京烤鸭")            # → "bei-jing-kao-ya"

# Korean (Hangul → Revised Romanization)
transliterate("서울")           # → "seo ul"
slugify("대한민국")            # → "dae-han-min-gug"

# Japanese (Hiragana/Katakana → Hepburn)
transliterate("ひらがな")       # → "hiragana"
transliterate("カタカナ")       # → "katakana"

# Language-specific transliteration
transliterate("Ärger", lang="de")  # → "Aerger"
transliterate("Київ", lang="uk")   # → "Kyiv"

# Auto-detect language from script
transliterate("Москва", lang="auto")  # → "Moskva" (detects Cyrillic → Russian)
transliterate("ภาษาไทย", lang="auto")  # → Thai transliteration (detects Thai)

# Reverse transliteration (Latin → native script)
transliterate("Moskva", target="ru")   # → "Москва"
transliterate("Athina", target="el")   # → "Αθηνα"

# Slugification
slugify("Hello World!")            # → "hello-world"
slugify("café au lait")           # → "cafe-au-lait"

# Filename sanitization
sanitize_filename("my file<>.txt")         # → "my_file.txt"
sanitize_filename("CON.txt")               # → "_CON.txt"
sanitize_filename("../../etc/passwd")      # → ".etc_passwd"

CJK transliteration

Chinese characters are mapped to toneless pinyin from the Unicode Unihan kMandarin field, covering the full CJK Unified Ideographs block (U+4E00–U+9FFF, 20,924 characters). Korean Hangul syllables are algorithmically decomposed into jamo and romanized using the Revised Romanization standard (all 11,172 precomposed syllables). Japanese hiragana and katakana use Modified Hepburn; kanji fall back to Chinese pinyin readings.

This is context-free, character-by-character transliteration, the same approach as Unidecode. See limitations.md for details on polyphony, phonological rules, and other trade-offs.
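
The algorithmic Hangul decomposition mentioned above can be sketched with plain arithmetic on the code point. This is an illustrative stand-in, not translit's Rust implementation: the function names are hypothetical, and the letter tables (in particular the final-consonant spellings) are assumptions chosen to match the context-free outputs shown in the quick start.

```python
# Hypothetical sketch of arithmetic Hangul decomposition; not translit's code.
HANGUL_BASE = 0xAC00          # first precomposed syllable
SYLLABLE_COUNT = 11172        # 19 leads * 21 vowels * 28 tails

# Assumed letter tables, indexed in Unicode jamo order.
LEADS = ["g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
         "ss", "", "j", "jj", "ch", "k", "t", "p", "h"]
VOWELS = ["a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
          "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"]
TAILS = ["", "g", "kk", "gs", "n", "nj", "nh", "d", "l", "lg", "lm", "lb",
         "ls", "lt", "lp", "lh", "m", "b", "bs", "s", "ss", "ng", "j", "ch",
         "k", "t", "p", "h"]


def decompose(syllable: str) -> tuple[int, int, int]:
    """Return (lead, vowel, tail) indices for one precomposed syllable."""
    index = ord(syllable) - HANGUL_BASE
    if not 0 <= index < SYLLABLE_COUNT:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    lead, rest = divmod(index, len(VOWELS) * len(TAILS))
    vowel, tail = divmod(rest, len(TAILS))
    return lead, vowel, tail


def romanize_syllable(syllable: str) -> str:
    lead, vowel, tail = decompose(syllable)
    return LEADS[lead] + VOWELS[vowel] + TAILS[tail]


def romanize(text: str) -> str:
    """Romanize a string that contains only precomposed Hangul syllables."""
    return " ".join(romanize_syllable(ch) for ch in text)


print(romanize("서울"))      # seo ul
print(romanize("대한민국"))  # dae han min gug
```

Because each syllable is handled in isolation, sound changes across syllable boundaries are not modeled, which is the trade-off the paragraph above describes.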

Precompiled pipelines

from translit import security_clean, ml_normalize, catalog_key, sanitize_user_input, strip_obfuscation

# Security: NFKC → confusables → strip bidi → collapse whitespace
security_clean("ℝ𝕖𝕒𝕝 𝕥𝕖𝕩𝕥")  # → "Real text"

# ML/NLP: NFKC → emoji→text → transliterate → strip accents → fold case
ml_normalize("Café ☕ Ünïcödé")  # → "cafe hot beverage unicode"

# Library catalog: NFKC → transliterate → confusables → strip accents → fold case
catalog_key("Москва", lang="ru")  # → "moskva"
catalog_key("ΩMEGA  café")        # → "omega cafe"

# Web input: NFKC → strip zalgo → confusables → strip bidi → collapse whitespace
sanitize_user_input("p\u0430ypal")  # → "paypal" (homoglyph neutralized)

# Maximum deobfuscation: homoglyphs, zalgo, invisible chars → clean text
strip_obfuscation("p\u0440odu\u0441t")       # → "product" (Cyrillic р→p, с→c via TR39)
strip_obfuscation("p\u0430yp\u0430l 🔥🔥")  # → "paypal fire fire"
# Note: does NOT transliterate — chain with transliterate() if needed
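
To make the step order in these pipelines concrete, here is a rough standard-library approximation of three of the stages (NFKC, bidi stripping, whitespace collapsing). The function name and the exact set of bidi control characters are assumptions made for illustration; the confusables step is omitted because it needs the TR39 data tables.

```python
import unicodedata

# Assumed set of bidirectional control characters to strip:
# ALM, LRM, RLM, the embedding/override controls, and the isolate controls.
BIDI_CONTROLS = {
    0x061C, 0x200E, 0x200F,
    *range(0x202A, 0x202F),   # U+202A..U+202E
    *range(0x2066, 0x206A),   # U+2066..U+2069
}


def clean_sketch(text: str) -> str:
    """Hypothetical stand-in: NFKC, then strip bidi, then collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if ord(ch) not in BIDI_CONTROLS)
    # str.split() with no arguments splits on any run of Unicode whitespace.
    return " ".join(text.split())


print(clean_sketch("ℝ𝕖𝕒𝕝 \u202e 𝕥𝕖𝕩𝕥  "))  # Real text
```

Order matters in this sketch: stripping the bidi control first can leave two adjacent spaces, so whitespace collapsing runs last.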

Text builder

from translit import Text

result = (
    Text("Ünïcödé Café ☕")
    .normalize("NFKC")
    .transliterate()
    .strip_accents()
    .fold_case()
    .value
)
# → "unicode cafe hot beverage"
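
The chaining style above can be imitated with a small immutable wrapper in which every method returns a new object. The class below is a hypothetical sketch built on the standard library only; it covers normalization, accent stripping, and case folding, and says nothing about how translit's Text class is actually implemented.

```python
import unicodedata


class TextSketch:
    """Hypothetical fluent wrapper; each step returns a new TextSketch."""

    def __init__(self, value: str) -> None:
        self.value = value

    def normalize(self, form: str) -> "TextSketch":
        return TextSketch(unicodedata.normalize(form, self.value))

    def strip_accents(self) -> "TextSketch":
        # Decompose, drop combining marks, then recompose what is left.
        decomposed = unicodedata.normalize("NFD", self.value)
        kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        return TextSketch(unicodedata.normalize("NFC", kept))

    def fold_case(self) -> "TextSketch":
        return TextSketch(self.value.casefold())


result = (
    TextSketch("Ünïcödé Café")
    .normalize("NFKC")
    .strip_accents()
    .fold_case()
    .value
)
print(result)  # unicode cafe
```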

Package structure

The API is organized into domain-specific namespaces. All functions are also available at the top level for convenience.

| Namespace | Purpose | Key functions |
|---|---|---|
| translit | Core transforms | transliterate, slugify, Text, TextPipeline |
| translit.normalization | Unicode normalization | normalize, strip_accents, fold_case, collapse_whitespace |
| translit.security | Safety analysis | is_confusable, is_mixed_script, is_safe_hostname, security_clean |
| translit.files | Filename handling | sanitize_filename |
| translit.codec | Byte decoding | decode_to_utf8, detect_encoding |

# Namespace imports
from translit.security import is_confusable, security_clean
from translit.codec import decode_to_utf8
from translit.normalization import fold_case

# Top-level imports also work
from translit import is_confusable, security_clean, decode_to_utf8, fold_case

Script policies

Transliteration applies different policies depending on the script. This table documents what each script does and which standard it follows.

| Script | Policy | Standard / Source | Example |
|---|---|---|---|
| Latin (accented) | Accent stripping | Unicode NFKD decomposition | é → e |
| Cyrillic | Phonetic romanization | BGN/PCGN (default), ISO 9:1995 (strict_iso9=True), GOST R 7.0.34 (gost7034=True) | Москва → Moskva |
| Greek | Transliteration | BGN/PCGN romanization | Αθήνα → Athena |
| Chinese (Hanzi) | Romanization | Unihan kMandarin (toneless pinyin) | 北京 → bei jing |
| Korean (Hangul) | Romanization | Revised Romanization of Korean | 서울 → seo ul |
| Japanese (Kana) | Romanization | Modified Hepburn | ひらがな → hiragana |
| Japanese (Kanji) | Romanization | Falls back to Chinese pinyin readings | 東京 → dong jing |
| Arabic | Transliteration | Buckwalter-derived | مرحبا → mrhba |
| Hebrew | Transliteration | Common Israeli | שלום → shlvm |
| Devanagari | Transliteration | UNGEGN/IAST-derived | नमस्ते → namaste |
| Bengali | Transliteration | UNGEGN-derived | কলকাতা → kalakata |
| Tamil | Transliteration | UNGEGN-derived | தமிழ் → tamizh |
| Telugu | Transliteration | UNGEGN-derived | తెలుగు → telugu |
| Gujarati | Transliteration | UNGEGN-derived | ગુજરાતી → gujarati |
| Kannada | Transliteration | UNGEGN-derived | ಕನ್ನಡ → kannada |
| Malayalam | Transliteration | UNGEGN-derived | മലയാളം → malayalam |
| Odia | Transliteration | UNGEGN-derived | ଓଡିଆ → odia |
| Sinhala | Transliteration | UNGEGN-derived | සිංහල → simhala |
| Gurmukhi | Transliteration | UNGEGN-derived | ਪੰਜਾਬੀ → panjabi |
| Thai | Transliteration | RTGS-derived | สวัสดี → sawatdi |
| Lao | Transliteration | BGN/PCGN-derived | ລາວ → lao |
| Georgian | Transliteration | National romanization | თბილისი → tbilisi |
| Armenian | Transliteration | BGN/PCGN | Երևան → Eryevan |

All transliteration is context-free and character-by-character, the same approach as AnyAscii/Unidecode. No linguistic analysis, polyphony handling, or phonological rules. See limitations.md for trade-offs.

Language-specific profiles (e.g., lang="de") apply sparse overrides on top of the default table. For example, German maps ü → ue instead of the default u.

Language profiles

translit ships with 83 built-in language profiles, including ISO 9:1995 scholarly Cyrillic support and coverage of 10 Indic scripts:

from translit import list_langs, transliterate

print(list_langs())
# ['am', 'ar', 'as', 'bg', 'bn', 'bo', 'ca', 'cs', 'cy', 'da', 'de', 'dv', 'el',
#  'es', 'et', 'fa', 'fi', 'fr', 'ga', 'gu', 'he', 'hi', 'hr', 'hu', 'hy',
#  'is', 'it', 'ja', 'jv', 'ka', 'km', 'kn', 'ko', 'lo', 'lt', 'lv', 'ml', 'mn',
#  'mr', 'mt', 'my', 'ne', 'nl', 'no', 'or', 'pa', 'pl', 'pt', 'ro', 'ru', 'sa',
#  'si', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te', 'th', 'tr', 'uk', 'vi', 'zh']

# ISO 9:1995 scholarly transliteration
transliterate("Юрий", strict_iso9=True)  # → "Jurij"

Performance

translit is compiled Rust with O(1) compile-time perfect hash tables — no regex, no per-character Python iteration, no runtime data loading.

| Operation | Throughput | vs. legacy |
|---|---|---|
| Transliterate (Latin) | 450M chars/sec | 38× faster than Unidecode |
| Transliterate (Cyrillic) | 130M chars/sec | 18× faster than Unidecode |
| Slugify | 849K slugs/sec | 10–24× faster than python-slugify |
| Batch transliterate (100 strings) | | 2.8× faster than loop |

See performance.md for full benchmark methodology and results.
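
For readers who want to reproduce a chars/sec figure on their own machine, a throughput measurement could be sketched roughly as follows. This is not the methodology from performance.md: the helper name is hypothetical, str.casefold is only a stand-in workload, and a serious benchmark would add warm-up runs and repeated trials.

```python
import time


def chars_per_second(fn, text: str, repeats: int = 200) -> float:
    """Call fn(text) `repeats` times and return processed characters per second."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(text)
    elapsed = time.perf_counter() - start
    total_chars = len(text) * repeats
    # Guard against a zero reading on very coarse timers.
    return total_chars / elapsed if elapsed > 0 else float("inf")


sample = "Café Ünïcödé " * 1000
rate = chars_per_second(str.casefold, sample)  # stand-in for the function under test
print(f"{rate:,.0f} chars/sec")
```

To compare libraries, the same helper would be called with each candidate function on identical input.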

Drop-in replacement

translit provides compatibility aliases for painless migration from existing libraries:

from translit import unidecode, casefold, remove_accents

unidecode("café")        # → "cafe"       (alias for transliterate)
casefold("Straße")       # → "strasse"    (alias for fold_case)
remove_accents("café")   # → "cafe"       (alias for strip_accents)

sanitize_filename() also accepts replacement_text and max_len kwargs for pathvalidate compatibility, and is_confusable() accepts greedy for confusable_homoglyphs compatibility. See migration guides for details.

Exhaustive testing

translit is exhaustively tested with three layers of machine-verifiable assurance beyond conventional unit and property-based tests:

  • Compile-time assertions: build.rs asserts all transliteration table values are ASCII and entry counts match expectations — if any check fails, cargo build fails
  • Exhaustive domain coverage: Every Hangul syllable (11,172), every BMP codepoint (63,488), every CJK ideograph (20,992), and every Indic script block are tested individually — zero sampling gaps
  • Stated invariants: Seven stated properties (ASCII passthrough, idempotence, determinism, output bounds, etc.) verified by exhaustive enumeration and Hypothesis

See formal-verification.md for details.
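
The idea of checking invariants by exhaustive enumeration rather than sampling can be illustrated with a toy function. In the sketch below, to_ascii_sketch is a hypothetical stand-in (NFKD followed by dropping every non-ASCII character), not translit's transliterator, and the checks cover only four of the properties named above.

```python
import unicodedata


def to_ascii_sketch(text: str) -> str:
    """Toy stand-in: decompose with NFKD and keep only ASCII characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ord(ch) < 128)


def check_bmp_invariants(fn) -> int:
    """Check invariants on every BMP code point except surrogates."""
    checked = 0
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not valid scalar values
        ch = chr(cp)
        out = fn(ch)
        assert out.isascii(), f"non-ASCII output for U+{cp:04X}"
        if cp < 0x80:
            assert out == ch, f"ASCII passthrough failed for U+{cp:04X}"
        assert fn(out) == out, f"not idempotent for U+{cp:04X}"
        assert fn(ch) == out, f"not deterministic for U+{cp:04X}"
        checked += 1
    return checked


print(check_bmp_invariants(to_ascii_sketch))  # 63488
```

The count matches the figure quoted above: 65,536 BMP code points minus 2,048 surrogates.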


User Guide

Core concepts and usage for each feature area.

  • Getting Started — Installation, first steps, and basic usage
  • Transliteration — Unicode → ASCII with language profiles, plus reverse (Latin → native script)
  • Slugification — URL-safe slug generation, drop-in python-slugify replacement
  • Normalization — NFC / NFD / NFKC / NFKD Unicode normalization
  • Confusable Detection — TR39 homoglyph detection and normalization
  • Filename Sanitization — Cross-platform safe filenames
  • Text Cleaning — Accent stripping, case folding, whitespace collapse
  • Grapheme Clusters — User-perceived character counting, splitting, and truncation
  • Text Pipeline — Composable, pre-compiled multi-step processing
  • Language Support — Built-in profiles, auto-detection, custom profiles
  • Abjad Scripts — Context-aware Arabic, Persian, and Hebrew with dictionary-based vowel restoration
  • Language Detection — How lang="auto" works: script identification, character-level discrimination, fail-safe fallbacks

  • Policy Templates — Named institutional presets for libraries, web apps, ML, and more
  • CLI — Command-line usage, piping, and shell integration
  • Docker — Run translit via Docker without installing Python

API Reference

Complete function signatures, parameters, and return types.

  • Overview: API reference index
  • Core Transforms: transliterate, slugify, normalize, sanitize_filename, strip_accents, strip_zalgo, fold_case, collapse_whitespace, demojize, strip_bidi (all accept str or list[str])
  • Precompiled Pipelines: security_clean, ml_normalize, catalog_key, display_clean, search_key, sort_key, sanitize_user_input, PRESETS, get_pipeline, list_profiles
  • Classes: Text, Slugifier, UniqueSlugifier, TextPipeline, compatibility aliases
  • Predicates: detect_scripts, inspect_auto_lang, is_mixed_script, is_confusable, is_ascii, is_normalized, is_zalgo, is_safe_hostname
  • Grapheme Clusters: grapheme_len, grapheme_split, grapheme_truncate
  • Encoding Detection: detect_encoding, decode_to_utf8
  • Language Profiles: list_langs, register_lang, register_replacements
  • Enums & Types: Script, NF, EmojiProvider, type aliases, language constants
  • Exceptions: TranslitError

Reference

  • Language Reference — All languages: codes, names, reference texts, and per-language transliteration rule tables
  • Provenance — Standards and sources behind every transliteration mapping

Architecture

Internal design documentation for contributors and advanced users.

  • Transliteration Engine — PHF lookup, language table chain, Indic virama handling
  • Data Tables — TSV format, build.rs code generation, compile-time PHF
  • Pipeline — TextPipeline internals, execution order, step bitflags
  • Emoji Engine — Emoji detection, provider system, pure-Rust path
  • Emoji Plugins — EmojiProvider protocol, custom providers
  • Security — Confusable detection, hostname validation, bidi stripping
  • Performance — Optimization strategies, PHF tables, batch amortization
  • Testing & Guarantees — Test philosophy, property-based testing, security invariants, CI matrix
  • Exhaustive Testing — Compile-time assertions, exhaustive domain coverage, stated invariants (I1–I7)
  • Transliteration Comparison — Character-level diff vs Unidecode and anyascii

Benchmarks


Migration Guides

Parameter-compatible replacements for existing libraries.


Other

  • Limitations — Known constraints, edge cases, and design trade-offs