Changelog
All notable changes to this project will be documented in this file.
The format follows Keep a Changelog. Versions follow Semantic Versioning.
[0.5.0] - 2026-03-30
Added
- Context-aware transliteration for abjad scripts (Arabic, Persian, Hebrew). transliterate(text, context=True) uses dictionary-based vowel restoration with bigram context disambiguation to produce readable romanized text instead of consonant skeletons.
  - Arabic: Tashkeela corpus (65.7M words), 182K unigrams + 200K bigrams. Covers 99%+ of newspaper vocabulary.
  - Hebrew: Project Ben Yehuda corpus (11.4M words), 227K unigrams + 200K bigrams. Covers literary Hebrew.
  - Persian: 266 curated common words + optional Wiktionary expansion (14.9K entries available via harvester script).
- list_context_langs(): returns language codes that support context=True (currently ["ar", "fa", "he"]).
- LangMeta.context field: "full", "partial", or "none". Enables web/WASM clients to show/hide a context toggle per language.
- ScriptMeta.context_aware field: bool. Enables toggle per detected script.
- Dictionary build tooling:
  - scripts/build_arabic_dict.py: corpus-based Arabic dictionary builder
  - scripts/build_hebrew_dict.py: corpus-based Hebrew dictionary builder
  - scripts/build_persian_dict.py: curated vocabulary Persian builder
  - scripts/harvest_wiktionary_persian.py: Wiktionary Persian harvester
  - scripts/bootstrap_dicts.sh: reproducible bootstrap from zero with pinned checksums. All parameters auditable, no manual steps.
- Abjad transliteration documentation (docs/user-guide/abjad-transliteration.md) covering all three languages, standards used, comparison with other systems.
- pip extras: pip install translit-rs[arabic], [hebrew], [context] for optional context dictionary installation.
- Rust context engine (src/context.rs): binary dictionary reader, Arabic/Hebrew tokenizer, three-tier resolve (bigram → unigram → context-free fallback), lazy-loaded global singletons via OnceLock.
- 28 context-aware tests (8 Arabic, 14 Persian, 6 Hebrew).
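The three-tier resolve mentioned above (bigram → unigram → context-free fallback) can be illustrated with a small Python sketch. This is not the Rust engine: the dictionaries, the Latin placeholder tokens, and the names resolve, resolve_all, and context_free are all hypothetical, and the real engine reads binary dictionaries rather than Python dicts.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy dictionaries keyed on consonant skeletons.
# Bigram keys are (previous token, current token).
BIGRAMS: Dict[Tuple[str, str], str] = {("hw", "ktb"): "kataba"}
UNIGRAMS: Dict[str, str] = {"ktb": "kutub", "hw": "huwa"}


def context_free(token: str) -> str:
    """Tier 3 stand-in: return the skeleton unchanged."""
    return token


def resolve(prev: Optional[str], token: str) -> str:
    """Try the bigram tier, then the unigram tier, then the fallback."""
    if prev is not None and (prev, token) in BIGRAMS:
        return BIGRAMS[(prev, token)]
    if token in UNIGRAMS:
        return UNIGRAMS[token]
    return context_free(token)


def resolve_all(tokens: List[str]) -> List[str]:
    """Resolve each token using the preceding token as context."""
    out: List[str] = []
    prev: Optional[str] = None
    for tok in tokens:
        out.append(resolve(prev, tok))
        prev = tok
    return out


print(resolve_all(["hw", "ktb", "xyz"]))  # ['huwa', 'kataba', 'xyz']
```

In this sketch the same skeleton resolves differently depending on its neighbor: "ktb" alone falls to the unigram tier, while "ktb" after "hw" is picked up by the bigram tier.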
[0.4.0] - 2026-03-29
Added
- strip_obfuscation() preset pipeline: maximum-strength text deobfuscation using TR39 confusable mapping (visual similarity). Neutralizes homoglyph spoofing, zalgo abuse, invisible character injection, and bidi attacks. Does NOT transliterate; chain with transliterate() explicitly if romanization is also needed. Pipeline: NFKC → strip_zalgo(max_marks=0) → confusables → strip_bidi → strip_zero_width → demojize → strip_accents → fold_case → collapse_whitespace.
- lang_info() and script_info() APIs: return structured metadata (display name, script, region) for any language code or script. Backed by LANG_META (83 entries) and SCRIPT_META (55 entries) with import-time drift assertions.
- 18 new language codes: ban (Balinese), bax (Bamum), bug (Buginese), chr (Cherokee), cjm (Cham), cop (Coptic), khb (Tai Lue), lis (Lisu), mni (Meitei), nod (Northern Thai), nqo (N'Ko), sat (Santali), su (Sundanese), syr (Syriac), tdd (Tai Le), tl (Tagalog), tzm (Tamazight), vai (Vai). Total: 83 languages.
- 10 new Script enum members: Bamum, Buginese, Cham, Lisu, MeeteiMayek, OlChiki, Sundanese, Tagalog, TaiTham, Tifinagh. Total: 57 scripts.
- Transliteration provenance documentation (docs/provenance.md): per-block audit of which formal romanization standard each Unicode block follows.
- API surface stability tests (tests/test_api_stability.py): 133 tests locking down function signatures, class methods, enum members, TypedDicts, protocol interfaces, and __all__ exports.
- Mutation testing survivor killers (tests/test_mutant_killers.py): 92 tests targeting forward-only parameter validation, default parameter sensitivity, pipeline step tuples, and boundary checks.
- Language consistency audit (scripts/audit_language_consistency.py): checks 11 registration points for Rust/Python/docs/test alignment. Wired into pre-push gate.
- 283 empty-string mappings for combining marks and zero-width characters in translit_default.tsv. These are now silently stripped instead of producing [?].
- docs/index.md is now generated from README.md via scripts/generate_docs_index.sh: single source of truth, no more drift.
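To show how an ordered preset like the strip_obfuscation() pipeline composes, here is a standard-library sketch of a subset of its steps. It is only an approximation under stated assumptions: the confusables, strip_bidi, and demojize steps are omitted, zalgo and accent stripping are merged into one mark-stripping step, and the function names and the ZERO_WIDTH set are hypothetical rather than the library's own.

```python
import re
import unicodedata
from typing import Callable, List

# Hypothetical subset of zero-width characters (ZWS, ZWNJ, ZWJ, word joiner, BOM).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def nfkc(text: str) -> str:
    """Compatibility-normalize (folds fullwidth forms, ligatures, etc.)."""
    return unicodedata.normalize("NFKC", text)


def strip_zero_width(text: str) -> str:
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def strip_marks(text: str) -> str:
    """Decompose, then drop every nonspacing mark (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = (ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", "".join(kept))


def fold_case(text: str) -> str:
    return text.casefold()


def collapse_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


# Order matters: each step sees the output of the previous one.
STEPS: List[Callable[[str], str]] = [
    nfkc, strip_zero_width, strip_marks, fold_case, collapse_whitespace,
]


def run_pipeline(text: str) -> str:
    for step in STEPS:
        text = step(text)
    return text


print(run_pipeline("Ｃａｆé\u200b  Ｔe\u0301\u0302st"))  # cafe test
```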
Fixed
- strip_obfuscation() homoglyph resolution: used phonetic transliteration (Cyrillic р→r, с→s) instead of TR39 visual confusable mapping (р→p, с→c). Removed transliterate from the pipeline; confusables now handles homoglyphs.
- Combining marks produce [?]: transliterate("n\u0303") returned "n[?]" instead of "n". Added empty-string TSV mappings for all Combining Diacritical Marks (U+0300–U+036F), Extended (U+1AB0–U+1AFF), Supplement (U+1DC0–U+1DFF), Symbols (U+20D0–U+20F0), and Half Marks (U+FE20–U+FE2F).
- Zero-width characters produce [?]: transliterate("a\u200Bb") returned "a[?]b". Added empty-string mappings for ZWS, ZWNJ, ZWJ, word joiner, BOM, soft hyphen, bidi marks, and line/paragraph separators.
- TextPipeline confusable ordering: confusables ran before transliterate, creating mixed-script gibberish on Cyrillic/Greek input. Swapped execution order so transliterate runs first (matching catalog_key preset).
- demojize() adjacent emoji concatenation: demojize("🔥🔥") returned "firefire" instead of "fire fire". Added space padding between adjacent emoji-to-text replacements.
- SCRIPT_RANGES sort order: MeeteiMayek Extensions was misplaced, breaking binary search for Ethiopic Extended-A. Added test_script_ranges_sorted invariant.
- Tibetan incorrectly documented as Wylie: actual mappings use Indic-phonetic romanization (ཅ→cha, not Wylie's ca).
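The SCRIPT_RANGES fix depends on a general property: binary search over a range table is only correct while the table stays sorted. The Python sketch below reproduces that failure mode on a small hypothetical subset of ranges. The table layout and the names lookup and is_sorted are assumptions for illustration, not the Rust implementation.

```python
import bisect
from typing import List, Optional, Tuple

Range = Tuple[int, int, str]  # (start, end_inclusive, script name)

# Hypothetical subset of a script range table, sorted by start codepoint.
SCRIPT_RANGES: List[Range] = [
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0E00, 0x0E7F, "Thai"),
    (0xAAE0, 0xAAFF, "MeeteiMayekExt"),
    (0xAB00, 0xAB2F, "EthiopicExtA"),
    (0xABC0, 0xABFF, "MeeteiMayek"),
]


def is_sorted(ranges: List[Range]) -> bool:
    """Invariant: each range ends before the next one starts."""
    return all(a[1] < b[0] for a, b in zip(ranges, ranges[1:]))


def lookup(cp: int, ranges: List[Range] = SCRIPT_RANGES) -> Optional[str]:
    """Binary-search the range containing cp (assumes a sorted table)."""
    starts = [r[0] for r in ranges]  # rebuilt per call for brevity
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return None


# Move one entry out of order, as in the bug described above.
broken = [r for r in SCRIPT_RANGES if r[2] != "MeeteiMayekExt"]
broken.append((0xAAE0, 0xAAFF, "MeeteiMayekExt"))

print(lookup(0xAB05))          # EthiopicExtA
print(is_sorted(broken))       # False
print(lookup(0xAB05, broken))  # None: the search lands on the wrong range
```

A sortedness check like is_sorted is cheap to run as a test, which is why an invariant of this kind catches the problem before any lookup goes wrong.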
Changed
- BREAKING: transliterate_batch(), slugify_batch(), normalize_batch(), and strip_accents_batch() removed. The base functions now accept both str and list[str] via @typing.overload. Pass a list to get batch processing: transliterate(["café", "naïve"]) → ["cafe", "naive"].
- BREAKING: strip_obfuscation() no longer transliterates. Uses TR39 confusables (visual mapping) instead. lang= parameter removed. Chain with transliterate() explicitly if romanization is also needed.
- CI restructured: lint/test on PRs only (not push-to-main), hypothesis tests excluded (~4s vs ~46s), CodeQL moved to workflow file with path filtering, benchmarks split to own workflow.
- Pinned ruff==0.15.4 in CI and pyproject.toml to prevent format drift.
- Python 3.9 dropped from release CI matrix (PEP 604 syntax incompatible).
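The str / list[str] signature described in the first breaking change follows a common typing.overload pattern. The sketch below shows that pattern with a hypothetical stand-in function, strip_marks_demo, which only removes combining marks; it is not the library's transliterate().

```python
import unicodedata
from typing import List, Union, overload


def _strip_one(text: str) -> str:
    """Decompose and drop nonspacing marks (a stand-in for real work)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


@overload
def strip_marks_demo(text: str) -> str: ...
@overload
def strip_marks_demo(text: List[str]) -> List[str]: ...


def strip_marks_demo(text: Union[str, List[str]]) -> Union[str, List[str]]:
    """Accept one string or a list; the return type mirrors the input."""
    if isinstance(text, str):
        return _strip_one(text)
    return [_strip_one(item) for item in text]


print(strip_marks_demo("café"))             # cafe
print(strip_marks_demo(["café", "naïve"]))  # ['cafe', 'naive']
```

The overloads exist only for type checkers; at runtime the single implementation dispatches on isinstance, so a caller migrating from a *_batch() function would simply pass the list to the base function.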
[0.3.0] - 2026-03-28
Added
- Unicode coverage expansion: 2,553 new codepoints across 33 Unicode blocks, bringing total translit_default.tsv entries from 6,633 to 9,186.
  Tier 1, forms and extensions (~1,741 codepoints):
  - Fullwidth ASCII (FF01–FF5E): 94 characters, mechanical offset mapping
  - Halfwidth Hangul (FFA0–FFDC): 66 characters via compatibility jamo
  - Enclosed/Circled Alphanumerics (2460–24FF): 160 characters (①→1, Ⓐ→A)
  - Superscript/Subscript (2070–209F): 29 characters mapped to base forms
  - Roman Numerals (2160–2188): 41 characters (Ⅰ→I, Ⅱ→II, ... Ⅻ→XII)
  - Modifier Letters (02B0–02FF): 80 characters (ʰ→h, ʷ→w)
  - IPA/Phonetic Extensions (0250–02AF): 96 characters (ɑ→a, ʃ→sh, ŋ→ng)
  - Greek Extended (1F00–1FFF): 233 characters (polytonic → base Greek → Latin)
  - Hangul Jamo (1100–11FF): 256 individual jamo components
  - Kangxi Radicals (2F00–2FD5): 214 radical forms → pinyin via CJK decomposition
  - CJK Compatibility Ideographs (F900–FAFF): 472 characters → pinyin via canonical decomposition targets
  Tier 2, living scripts (~812 codepoints):
  - Gap-filling for 7 partially-covered scripts: Balinese, Canadian Syllabics, Cherokee, Coptic, N'Ko, Syriac, Vai
  - 10 new abugida scripts with virama/inherent-vowel handling: Sundanese, Tai Tham, Cham, Batak, Buginese, Tagalog, Hanunoo, Buhid, Tagbanwa, Meetei Mayek
  - 4 new alphabetic/syllabic scripts: Tifinagh, Lisu, Ol Chiki, Bamum
- Unicode range constants for 12 new scripts in src/unicode_ranges.rs: SUNDANESE, TAI_THAM, CHAM, BATAK, BUGINESE, TAGALOG, HANUNOO, BUHID, TAGBANWA, MEETEI_MAYEK, MEETEI_MAYEK_EXT.
- 10 new *_char_role() functions in src/transliterate.rs for abugida virama handling (Sundanese, Tai Tham, Cham, Batak, Buginese, Tagalog, Hanunoo, Buhid, Tagbanwa, Meetei Mayek).
- scripts/generate_unicode_expansion.py: reproducible generator script for all Tier 1 and Tier 2 TSV entries (1,310 lines).
- cargo-clippy pre-commit hook mirroring CI -D warnings to catch lints before push.
- Callable module: import translit; translit("Москва", lang="auto") now works as a shorthand for translit.transliterate(...). Uses in-place __class__ mutation to preserve unittest.mock.patch compatibility.
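The callable-module entry relies on a standard Python technique: a module object's __class__ can be reassigned to a types.ModuleType subclass that defines __call__. The sketch below demonstrates it on a throwaway module. The module name, the CallableModule class, and the placeholder transliterate function are hypothetical; inside a real package the assignment would more likely target sys.modules[__name__].

```python
import types


class CallableModule(types.ModuleType):
    """Module subclass that forwards calls to its transliterate attribute."""

    def __call__(self, *args, **kwargs):
        return self.transliterate(*args, **kwargs)


# Build a throwaway module with a placeholder function (not the real API).
demo = types.ModuleType("demo_translit")
demo.transliterate = lambda text, lang=None: text.upper()

original_id = id(demo)
# Mutate the class in place instead of replacing the module object.
demo.__class__ = CallableModule

print(demo("abc"))              # ABC
print(id(demo) == original_id)  # True: same object, now callable
```

Because the module object keeps its identity, anything that already holds a reference to it, including a patcher that sets attributes on it, continues to see the same object. That is presumably what the changelog's note about unittest.mock.patch compatibility refers to.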
Fixed
- Finnish transliteration: removed incorrect alias fi→sv. Finnish ä/ö are independent phonemes (→a/o via default table), not ae/oe variants as in Swedish/German. Hämäläinen now correctly produces Hamalainen.
- Icelandic transliteration: removed incorrect ð→dh and Ð→Dh overrides. Default table already maps ð→d (ICAO/passport standard). Retained Æ→Ae override (differs from default AE). Icelandic override count reduced from 6 to 2.
- clippy manual_range_patterns lint in buginese_char_role: collapsed 0x1A17 | 0x1A18 | 0x1A19..=0x1A1B to 0x1A17..=0x1A1B.
- errors="preserve" dropping visible characters: characters with explicit empty-string TSV mappings (e.g. U+060E Arabic Poetic Verse Sign, U+30FC Katakana Prolonged Sound Mark) are now preserved instead of silently dropped when errors="preserve" is set.
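The Finnish fix comes down to how per-language override tables layer on top of a default table. The sketch below is a toy model using only the ä/ö mappings stated in the entry above. The table contents, the romanize function, and the dict-based layout are hypothetical simplifications of the TSV-backed tables.

```python
from typing import Dict, Optional

# Toy default table: ä/ö map to plain a/o, as the entry above describes.
DEFAULT_TABLE: Dict[str, str] = {"ä": "a", "ö": "o", "Ä": "A", "Ö": "O"}

# Toy per-language overrides. "fi" has no entry, so it uses the default.
LANG_OVERRIDES: Dict[str, Dict[str, str]] = {
    "sv": {"ä": "ae", "ö": "oe", "Ä": "Ae", "Ö": "Oe"},
}


def romanize(text: str, lang: Optional[str] = None) -> str:
    """Look up each character in the override table first, then the default."""
    overrides = LANG_OVERRIDES.get(lang, {})
    return "".join(
        overrides.get(ch, DEFAULT_TABLE.get(ch, ch)) for ch in text
    )


print(romanize("Hämäläinen", lang="fi"))  # Hamalainen
print(romanize("Hämäläinen", lang="sv"))  # Haemaelaeinen
```

With an fi→sv alias in place, the first call would have taken the second path, which is the incorrect output the fix removes.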
Changed
- is_indic() and indic_char_role() expanded to cover all 11 new Brahmic/abugida script ranges.
- lookup_lang(): Finnish no longer dispatches to Swedish override table; falls through to default.
- Icelandic language TSV (translit_lang_is.tsv) reduced from 6 to 2 entries.
- ml_normalize preset: switched transliteration from Preserve to Ignore error mode. ML pipelines need clean ASCII output, not preserved non-ASCII.
[0.2.0] - 2026-03-27
Added
- Exhaustive testing framework with three layers of machine-verifiable assurance:
  - Compile-time assertions (build.rs): all transliteration table values asserted ASCII-only, entry count sanity checks (Hanzi ≥20k, BMP ≥5k, confusables ≥1k). Build fails if any assertion is violated.
  - Exhaustive domain tests (Rust): 16 tests covering all 11,172 Hangul syllables, full BMP (63,488 codepoints) for ASCII output and idempotence, all 20,992 CJK ideographs, all 51 compatibility jamo, and structural verification of 15 Indic script blocks. Zero sampling gaps.
  - Stated invariant specifications (Python): 7 stated invariants (I1–I7) verified via exhaustive enumeration and Hypothesis: ASCII passthrough, ASCII output, idempotence, no exceptions, determinism, input size bound, output length bound.
- Two-tier test architecture: formal tests gated behind #[ignore] (Rust) and @pytest.mark.formal (Python) so they don't slow everyday development. Run before release with cargo test -- --ignored and pytest -m formal.
- CLAUDE.md: project-level development guide for automated agents. Documents build commands, test tiers, and code conventions.
- list_scripts() function for programmatic script discovery.
- docs/formal-verification.md: specification document for exhaustive testing methodology.
- Comprehensive overhaul of docs/architecture/testing-guarantees.md with exhaustive testing differentiator analysis and alternative library comparison.
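Exhaustive enumeration, as opposed to sampling, is practical when the domain is small and closed. The 11,172 Hangul syllables are a good example. The sketch below is not the project's Rust test. It checks the standard Unicode arithmetic decomposition against unicodedata for every syllable, and the function names are hypothetical.

```python
import unicodedata

# Constants from the Unicode Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_COUNT = L_COUNT * V_COUNT * T_COUNT  # 11,172 syllables


def decompose(syllable: str) -> str:
    """Arithmetically split one precomposed syllable into its jamo."""
    index = ord(syllable) - S_BASE
    lead, rest = divmod(index, V_COUNT * T_COUNT)
    vowel, trail = divmod(rest, T_COUNT)
    jamo = chr(L_BASE + lead) + chr(V_BASE + vowel)
    if trail:  # trail index 0 means "no final consonant"
        jamo += chr(T_BASE + trail)
    return jamo


def check_all_syllables() -> int:
    """Verify every syllable against NFD; return how many were checked."""
    checked = 0
    for cp in range(S_BASE, S_BASE + S_COUNT):
        syllable = chr(cp)
        assert decompose(syllable) == unicodedata.normalize("NFD", syllable)
        checked += 1
    return checked


print(check_all_syllables())  # 11172
```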
Changed
- IndicRole enum and indic_char_role() / script-specific char_role functions changed from private to pub for integration test access (parent modules remain #[doc(hidden)]).
- tables::hangul module changed from mod to pub mod for integration test access.
- Hangul const assertions added: JUNGSEONG_COUNT, JONGSEONG_COUNT, total syllable count, and compatibility jamo range verified at compile time.
- Total test count: 2,900+ (up from 1,678 in 0.1.5).
[0.1.5] - 2026-03-27
Added
- Reverse transliteration: transliterate(text, target="ru") converts Latin → native script for Russian, Ukrainian, and Greek. PHF tables generated at build time from inverted language TSV data.
- Toned pinyin: transliterate("北京", tones=True) returns "běi jīng" with tone marks. Toned readings sourced from Unihan kMandarin field for all 20,924 CJK Unified Ideographs.
- ISO 9:1995 scholarly Cyrillic: transliterate(text, strict_iso9=True) for scholarly romanization. GOST R 7.0.34 variant via gost7034=True.
- Japanese Kunrei-shiki (lang="ja-kunrei"): alternative romanization profile, bringing total language count to 65.
- Ancient scripts: Coptic, Gothic, Old Italic, Runic, Ogham transliteration tables.
- CLI short aliases: t (transliterate), s (slugify), n (normalize), p (pipeline), d (demojize), e.g. translit t "café".
- CLI --target flag: translit t --target ru "Moskva" for reverse transliteration.
- CLI --tones, --strict-iso9, --gost7034 flags for transliterate subcommand.
- CLI --lang flag for slugify subcommand.
- console_scripts entry point: translit command available after pip install translit-rs.
- docs/cli.md: comprehensive CLI documentation with piping, exit codes, examples.
- Links section in README.md and docs/index.md for RTD ↔ GitHub cross-references.
Changed
- transliterate() API unified: reverse_transliterate() merged into transliterate() via target parameter. Old function removed.
- transliterate_impl Rust signature now takes 7 arguments (added tones: bool).
- Updated benchmark numbers after tones parameter addition (15–46% regression in transliteration hot path due to additional branch; throughput now 450M chars/sec Latin, 130M chars/sec Cyrillic).
- Performance documentation updated across 4 files to reflect current benchmark results.
Fixed
- clippy format_push_string lint in build.rs: replaced push_str(&format!()) with write!().
- clippy unreadable_literal in PHF-generated reverse_translit_phf.rs: suppressed via inner attribute in src/reverse.rs.
- All 219 integration test call sites updated for 7-argument transliterate_impl.
[0.1.4] - 2026-03-25
Added
- lang="auto" script-based language detection: when lang="auto" is passed to transliterate(), slugify(), TextPipeline, Slugifier, or any other call site, the library detects the dominant non-Latin script in the input and maps it to a default language code automatically. Maps 28 scripts to language codes (e.g. Cyrillic→ru, Han→zh, Hiragana/Katakana→ja, Thai→th). Zero overhead for lang=None or explicit lang codes.
- LANG_AUTO constant ("auto") in translit._enums.
- Georgian transliteration (lang="ka"): 114 TSV entries covering Mkhedruli, Mtavruli, and supplement ranges. BGN/PCGN national romanization.
- Armenian transliteration (lang="hy"): 86 TSV entries covering uppercase, lowercase, and 5 ligatures (U+FB13–FB17). BGN/PCGN romanization.
- Sinhala transliteration (lang="si"): 90 TSV entries. Extended Indic Brahmic engine range from 0x0900..=0x0D7F to 0x0900..=0x0DFF with dedicated sinhala_char_role() function for Sinhala-specific offsets.
- Thai transliteration (lang="th"): 87 TSV entries using RTGS romanization. New ScriptClass::Tai with tone-mark stripping and cancellation handling.
- Lao transliteration (lang="lo"): 67 TSV entries using BGN/PCGN romanization. Shares Tai engine with Thai via offset masking.
- Ethiopic transliteration (lang="am"): 307 TSV entries for Ge'ez alphasyllabary (34 consonant bases × 7 vowel orders + labialized forms + digits). Pure data addition; no engine changes needed.
- Myanmar transliteration (lang="my"): 89 TSV entries. New myanmar_char_role() for Brahmic engine with virama (U+1039) and asat (U+103A) support. Medials (U+103B–103E) classified as dependent vowels.
- Khmer transliteration (lang="km"): 110 TSV entries. New khmer_char_role() for Brahmic engine with coeng (U+17D2) as virama. All consonants normalized to inherent 'a' regardless of series.
- Tibetan transliteration (lang="bo"): 147 TSV entries. New tibetan_char_role() for Brahmic engine with halanta (U+0F84) and subjoined consonants (U+0F90–0FBC).
- Unicode range constants: TIBETAN (0x0F00–0x0FFF), MYANMAR (0x1000–0x109F), KHMER (0x1780–0x17FF) in src/unicode_ranges.rs.
- Comprehensive test coverage: example-based tests for all 9 new scripts, property-based tests (hypothesis + proptest), multi-script mixture tests.
- Built-in language count: 51 → 60.
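The lang="auto" entry at the top of this list describes picking the dominant non-Latin script and mapping it to a default language code. A minimal sketch of that idea follows, assuming a hypothetical four-entry range table. The actual library maps 28 scripts and its detection logic is not shown in this changelog, so detect_lang and SCRIPT_TO_LANG are illustrative names only.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical subset: (start, end_inclusive, default language code).
SCRIPT_TO_LANG: List[Tuple[int, int, str]] = [
    (0x0400, 0x04FF, "ru"),  # Cyrillic
    (0x0E00, 0x0E7F, "th"),  # Thai
    (0x3040, 0x30FF, "ja"),  # Hiragana + Katakana
    (0x4E00, 0x9FFF, "zh"),  # CJK Unified Ideographs (Han)
]


def detect_lang(text: str) -> Optional[str]:
    """Return the code for the most frequent listed script, else None."""
    counts: Counter = Counter()
    for ch in text:
        cp = ord(ch)
        for start, end, lang in SCRIPT_TO_LANG:
            if start <= cp <= end:
                counts[lang] += 1
                break
    if not counts:
        return None  # pure Latin/ASCII input: nothing to detect
    return counts.most_common(1)[0][0]


print(detect_lang("Hello Москва"))  # ru
print(detect_lang("北京"))           # zh
print(detect_lang("hello"))         # None
```

Latin characters are never counted in this sketch, so a mostly-English sentence with one Cyrillic word still resolves to "ru". A real detector would also need a policy for mixed Han and kana text, which this toy table does not attempt.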
Changed
- is_indic() extended to include Tibetan, Myanmar, and Khmer ranges for Brahmic abugida processing.
- indic_char_role() dispatches to script-specific functions for Sinhala, Tibetan, Myanmar, and Khmer codepoint ranges.
[0.1.3] - 2026-03-25
Added
- strip_control and strip_zero_width now work as independent pipeline steps without requiring collapse_whitespace=True. Previously they were silently ignored when collapse_whitespace was disabled.
- strip_control_chars() and strip_zero_width_chars() standalone Rust functions for filtering without whitespace collapsing.
- decimal and hexadecimal flags in SlugConfig are now functional. Setting decimal=False preserves &#NNN; entities; hexadecimal=False preserves &#xHHH; entities. Previously these flags were accepted but silently ignored.
- Rust integration tests: tests/integration_emoji.rs (10 tests), tests/integration_slugify.rs (20 tests), tests/integration_transliterate.rs (21 tests), tests/integration_whitespace.rs (12 tests).
Changed
- TextPipeline parameters strip_control and strip_zero_width changed from bool (default True) to bool | None (default None). When None, they inherit from collapse_whitespace: True if collapse_whitespace=True, False otherwise. Set explicitly to True for standalone use without collapse_whitespace. This is backward compatible: existing code that passes collapse_whitespace=True gets the same behavior as before.
- steps() now reports strip_control and strip_zero_width as separate entries when active, giving full visibility into pipeline behavior.
- Pipeline step order updated: normalize → confusables → demojize → strip_accents → transliterate → fold_case → strip_control → strip_zero_width → collapse_whitespace.
- Migrated from once_cell to std::sync::LazyLock/OnceLock; MSRV bumped to 1.80. Removed once_cell dependency.
- needs_cjk_space() match arm tightened from wildcard _ to explicit Ideograph | Hangul | Kana to match the call-site is_cjk guard.
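The bool | None inheritance rule for strip_control and strip_zero_width can be expressed in a few lines. The sketch below models only the resolution logic and the resulting step list. The function active_steps is hypothetical and is not the TextPipeline class or its steps() method.

```python
from typing import List, Optional


def active_steps(
    collapse_whitespace: bool = False,
    strip_control: Optional[bool] = None,
    strip_zero_width: Optional[bool] = None,
) -> List[str]:
    """Resolve None to the collapse_whitespace setting, then list steps."""
    if strip_control is None:
        strip_control = collapse_whitespace
    if strip_zero_width is None:
        strip_zero_width = collapse_whitespace

    steps: List[str] = []
    if strip_control:
        steps.append("strip_control")
    if strip_zero_width:
        steps.append("strip_zero_width")
    if collapse_whitespace:
        steps.append("collapse_whitespace")
    return steps


print(active_steps(collapse_whitespace=True))
# ['strip_control', 'strip_zero_width', 'collapse_whitespace']
print(active_steps(strip_control=True))
# ['strip_control']
print(active_steps())
# []
```

Using None as the default lets the function tell "not specified" apart from an explicit False, which is what keeps existing collapse_whitespace=True callers unchanged while allowing standalone use.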
Fixed
- decode_entities() corrupting multi-byte UTF-8 characters (BUG-1). The function used bytes[i] as char, which treated each byte of a multi-byte sequence as a separate Latin-1 codepoint (e.g. café → cafÃ©). Now advances by full UTF-8 characters.
- decode_numeric_entity_skip() panicking on malformed &# followed by multi-byte UTF-8 (BUG-2). The skip function walked through continuation bytes looking for ;, landing inside a multi-byte character. Now stops at the first non-ASCII byte.
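BUG-1 is a classic byte-versus-character mistake, and it can be reproduced outside Rust. The Python sketch below contrasts a byte-wise decode, which mimics the effect of bytes[i] as char, with a loop that advances by whole UTF-8 sequences. Both function names are hypothetical, and the library's fix is only described above, not shown.

```python
def decode_bytewise(data: bytes) -> str:
    """Buggy approach: treat every byte as its own codepoint."""
    return "".join(chr(b) for b in data)


def utf8_sequence_length(lead: int) -> int:
    """Length of a UTF-8 sequence, read from its leading byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError(f"invalid UTF-8 leading byte: {lead:#x}")


def decode_charwise(data: bytes) -> str:
    """Correct approach: advance by one full character at a time."""
    out = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        out.append(data[i:i + n].decode("utf-8"))
        i += n
    return "".join(out)


raw = "café".encode("utf-8")   # é is two bytes: 0xC3 0xA9
print(decode_bytewise(raw))    # cafÃ©
print(decode_charwise(raw))    # café
```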
Performance
- ASCII fast-path in demojize_impl and demojize_rust: pure-ASCII text returns immediately without Vec<char> allocation or emoji scanning.
- filter_stopwords replaced intermediate Vec<_> + .join() with a pre-allocated String fold, removing one allocation per slugify call.
[0.1.2] - 2026-03-25
Added
- Python 3.14 support (classifier and CI test matrix).
- ruff check --fix pre-commit hook for automatic lint fixing.
- CI publish workflow using pypa/gh-action-pypi-publish with OIDC trusted publishers.
- Multi-platform wheel builds: Linux (x86_64, aarch64), macOS (Intel, ARM64), Windows.
- steps() method on _TextPipeline type stub.
Changed
- Resolved all clippy pedantic warnings instead of suppressing them. Reduced lint suppressions from 48 to 22 (remaining are genuine PyO3 constraints). Fixes include: combined identical match arms, replaced manual counters with .enumerate(), moved item declarations before statements, used clone_into(), merged identical branches, fixed doc comment formatting.
- Widened stopwords and replacements type stubs from strict tuple/list to Sequence for better mypy compatibility.
- Applied ruff format to all Python source and test files.
- Switched docs publish from deprecated maturin upload to pypa/gh-action-pypi-publish.
- macOS Intel wheels now cross-compiled on ARM64 runner (macos-14) instead of deprecated macos-13.
- CI doctests now run against installed package (not source tree) with explicit shell: bash for Windows compatibility.
Fixed
- TextPipeline.explain() doctest: output format is normalize (NFC) not normalize (form=NFC).
- from __future__ import annotations placement in test files (must follow module docstring, not precede it).
- Malformed HTML entity test expectation: decode_entities("&#xyz;") correctly returns "", not "yz;".
- Rust benchmark CI: target bench_core binary explicitly to avoid passing Criterion flags to the test harness.
- Ruff lint fixes: unsorted imports in test_encoding.py, unused import is_mixed_script in test_security_invariants.py.
- Read the Docs trigger workflow: simplified curl status handling, graceful warning when RTD_TOKEN is missing.
- Removed incorrect PyPy classifier (abi3 is CPython-only).
[0.1.1] - 2026-03-25
Added
- src/unicode_ranges.rs: named constants for all Unicode codepoint ranges used by the library, eliminating magic numbers scattered across modules.
- tests/test_concurrency.py: concurrent access tests for LANG_TABLES and HANGUL_CACHE, plus malformed Unicode input tests.
- Code coverage reporting in CI (pytest-cov, XML report uploaded as artifact).
- CLOCK$, KEYBD$, SCREEN$, COM0, LPT0 added to Windows reserved filename list.
- casefold() alias for fold_case(); matches str.casefold() naming.
- remove_accents() alias for strip_accents(); matches sklearn/ML ecosystem naming.
- Compatibility parameter aliases: replacement_text/max_len on sanitize_filename() (pathvalidate), greedy/preferred_aliases on is_confusable() (confusable_homoglyphs), delimiters on demojize() (emoji library).
- Complete API documentation for 19 previously undocumented exported functions: precompiled pipelines, grapheme clusters, encoding detection, Text builder, is_safe_hostname, demojize, strip_bidi, EmojiProvider protocol.
- Three new API reference pages: Precompiled Pipelines, Grapheme Clusters, Encoding.
- "Guides by role" section in docs/index.md and README.md.
- Performance section in README.md with benchmark numbers.
- Script enum documentation expanded from 28 to all 41 members.
Changed
- transliterate_impl refactored: capacity estimation extracted to estimate_capacity(), character classification to classify_char(), and CJK spacing logic to needs_cjk_space().
- All RwLock accesses now recover from lock poisoning using .unwrap_or_else(|e| e.into_inner()) instead of silently falling through.
- Lambda closures in _compat.py replaced with named inner functions for clarity.
- emoji.rs write!() call no longer uses .unwrap() (infallible, documented with a // SAFETY comment).
- MkDocs theme switched from material to readthedocs.
- All documentation references updated from "unirust" to "translit".
- Development status promoted from Alpha to Beta.
- Package renamed from translit to translit-rs on PyPI (interim until PEP 541 grants the translit name). Python import remains import translit.
Fixed
- Type stub _text.pyi imported from wrong module name (unirust → translit).
- Type stub _translit.pyi missing min_confidence parameter on _decode_to_utf8.
- Type stub _text.pyi missing grapheme_split, grapheme_truncate, catalog_key methods.
- security_clean() pipeline step order corrected in 5+ locations: strip_bidi runs before collapse_whitespace (matching Rust implementation).
- catalog_key() step order corrected: transliterate before strip_accents.
- Stale PyO3 boundary overhead corrected from ~4µs to ~240ns in docs and code comments.
Deprecated
- translit._compat awesome-slugify compatibility layer (Slugify, UniqueSlugify, slugify_* instances): planned removal in v1.0.
[0.1.0] - 2026-01-01
Added
- Initial release.
- Unicode transliteration for 60 language profiles.
- Slugification, normalization, confusable detection, filename sanitization.
- Emoji demojization with ZWJ sequence support.
- Backward-compatible layers for Unidecode and awesome-slugify.