Unidecode → disarm recipes
A survey of 58 real projects
(beets, saleor, investpy, python-slugify, django-autoslug, …) found that
unidecode is almost never the last step. It sits inside a hand-rolled
normalisation pipeline — lowercase, strip, collapse, re-encode — and every one
of those pipelines already has a single-call equivalent in disarm. The gap
is discoverability, not capability.
This page maps each observed pattern to its one-liner. The "after" snippets are executed and asserted in CI, so the equivalences below cannot silently rot. The "before" snippets show the legacy hand-rolled code and are not executed.
Why one call beats a pipeline
A hand-rolled pipeline bakes in an ordering decision — does .lower()
run before or after transliteration? Get it wrong and №5 slugs to No5
instead of no5 (see the ordering footgun). The
disarm helper encapsulates the correct order so you never have to make
that call.
At a glance

| Hand-rolled with unidecode | disarm one-liner |
|---|---|
| re.sub(r"\W+", "-", unidecode(t).lower().strip()) | slugify(t) |
| unidecode(name).replace(os.sep, "_") | sanitize_filename(name, platform=...) |
| unidecode(t).encode("ascii") | transliterate(t).encode("ascii") |
| unidecode(t.lower().strip()) for dict / search keys | search_key(t) / catalog_key(t) / sort_key(t) |
| quote(unidecode(t)) | quote(transliterate(t)) (or slugify for URLs) |
URL slugs
# Before — hand-rolled slug
import re
from unidecode import unidecode
def slug(t):
    return re.sub(r"\W+", "-", unidecode(t).lower().strip()).strip("-")
# After
from disarm import slugify
assert slugify("Café del Mar!!") == "cafe-del-mar"
assert slugify("Хлеб с маслом") == "khleb-s-maslom"
slugify() lowercases, transliterates, replaces runs of non-word characters
with a single -, and trims — in the right order — so you do not re-implement
(and mis-order) the pipeline. See the Slugification guide.
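To make the ordering concrete, here is a minimal stdlib-only sketch of the same steps. It is not disarm's implementation: `sketch_slug` is a hypothetical name, and NFKD folding stands in for real transliteration, so it copes with accented Latin and compatibility characters but simply drops scripts such as Cyrillic.

```python
import re
import unicodedata


def sketch_slug(text: str) -> str:
    """Illustrative slug pipeline; a stand-in, not disarm.slugify()."""
    # 1. Fold to ASCII first, so capitals introduced by folding are still visible.
    folded = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # 2. Lowercase only after folding.
    lowered = folded.lower()
    # 3. Collapse every run of non-alphanumerics into one hyphen, then trim.
    return re.sub(r"[^a-z0-9]+", "-", lowered).strip("-")


assert sketch_slug("Café del Mar!!") == "cafe-del-mar"
assert sketch_slug("№5") == "no5"
```

Even this short version has three places where the order can go wrong, which is the argument for calling a single helper instead.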
Safe filenames
# Before — transliterate then patch path separators by hand
import os
from unidecode import unidecode
safe = unidecode("Naïve/Résumé.txt").replace(os.sep, "_")
# After
from disarm import sanitize_filename
assert sanitize_filename("Naïve/Résumé.txt") == "Naive_Resume.txt"
sanitize_filename() does more than swap separators: it removes OS-illegal
characters, handles Windows reserved names, and truncates to a byte budget.
Pick the rule set with platform="universal" | "posix" | "windows". See
Filename Sanitization.
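The hand-rolled version has a portability gap worth spelling out: `os.sep` is `"\\"` on Windows, so `replace(os.sep, "_")` leaves the `/` in place there. The sketch below illustrates the kinds of checks listed above using only the standard library. `sketch_sanitize`, its `max_bytes` parameter, and the exact rule set are assumptions made for illustration; they do not describe how `sanitize_filename()` is implemented.

```python
import re
import unicodedata

# Characters Windows rejects in file names, plus ASCII control characters.
_ILLEGAL = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Windows reserved device names (compared case-insensitively, before any extension).
_RESERVED = {
    "CON", "PRN", "AUX", "NUL",
    *(f"COM{i}" for i in range(1, 10)),
    *(f"LPT{i}" for i in range(1, 10)),
}


def sketch_sanitize(name: str, max_bytes: int = 255) -> str:
    """Illustrative only; a rough stand-in for a real filename sanitizer."""
    # Fold accents to ASCII (NFKD stands in for real transliteration).
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace both path separators and other illegal characters.
    cleaned = _ILLEGAL.sub("_", ascii_name)
    # Defuse reserved device names such as "con.txt".
    if cleaned.split(".", 1)[0].upper() in _RESERVED:
        cleaned = "_" + cleaned
    # The text is pure ASCII here, so characters and bytes coincide.
    cleaned = cleaned[:max_bytes].rstrip(" .")
    return cleaned or "_"


assert sketch_sanitize("Naïve/Résumé.txt") == "Naive_Resume.txt"
assert sketch_sanitize("con.txt") == "_con.txt"
assert len(sketch_sanitize("a" * 300)) == 255
```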
ASCII bytes
# Before
from unidecode import unidecode
raw = unidecode("Café").encode("ascii")
# After
from disarm import transliterate
assert transliterate("Café").encode("ascii") == b"Cafe"
transliterate() already guarantees ASCII-only output, so the .encode("ascii")
never raises UnicodeEncodeError. Use errors="ignore" | "preserve" | "replace"
to choose what happens to characters with no mapping.
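For contrast, this is what the standard library's ASCII codec does when the text has not been transliterated first. The handlers shown here belong to `str.encode` and should not be confused with disarm's `errors=` parameter, which shares some names but is a separate setting.

```python
text = "Café"

# Strict encoding (the default) rejects the accented character outright.
try:
    text.encode("ascii")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# The codec's own error handlers avoid the exception but lose the letter.
dropped = text.encode("ascii", errors="ignore")
masked = text.encode("ascii", errors="replace")
assert dropped == b"Caf"
assert masked == b"Caf?"

# Once text is ASCII-only, strict encoding is safe; str.isascii() checks that.
assert "Cafe".isascii()
assert "Cafe".encode("ascii") == b"Cafe"
```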
Dictionary / search keys
# Before — a casefolded, accent-folded lookup key
from unidecode import unidecode
key = unidecode("Café del Mar".lower().strip())
# After — pick the helper that documents intent
from disarm import search_key, catalog_key, sort_key
# For typical text the three coincide — use the name that fits the call site:
assert search_key(" Café RÉSUMÉ ") == "cafe resume"
assert catalog_key("Café del Mar") == "cafe del mar"
assert sort_key("Café del Mar") == "cafe del mar"
All three apply NFKC → transliterate → case-fold → collapse-whitespace, so for plain text they produce the same key. They differ by the extra pass each adds:
- search_key: lightest; nothing extra. For search / lookup indexes.
- catalog_key: adds a TR39 confusable fold, which canonicalises look-alikes that survive transliteration (curly quotes, primes, and similar punctuation). Built for bibliographic de-duplication where typographic variants of the same title should collapse to one key.
- sort_key: strips bidi controls for stable ordering. (It currently coincides with search_key for typical input; use the name that documents the call site.)
The confusable fold is what sets catalog_key apart — typographic variants of a
title collapse to the same key:
assert catalog_key('naïve “quote”') == "naive ''quote''" # quotes folded
assert search_key('naïve “quote”') == 'naive "quote"' # quotes preserved
That fold runs after transliteration, so it does not reverse cross-script
homoglyph spoofs (Cyrillic р is already romanised to r by then). For that,
see confusable defense.
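The shared NFKC → transliterate → case-fold → collapse-whitespace sequence can be approximated with the standard library to show why such keys de-duplicate well. This is a sketch under stated assumptions: `sketch_key` is a hypothetical name, and stripping combining marks after NFKD stands in for transliteration, so it does not reproduce the confusable fold or the bidi handling of the real helpers.

```python
import unicodedata


def sketch_key(text: str) -> str:
    """Illustrative lookup key; a stand-in for the shared pipeline only."""
    # 1. NFKC: unify compatibility forms (full-width letters, ligatures, ...).
    composed = unicodedata.normalize("NFKC", text)
    # 2. Accent fold: decompose, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", composed)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # 3. Case-fold, then 4. collapse runs of whitespace and trim.
    return " ".join(folded.casefold().split())


titles = ["Café del Mar", "  CAFE  DEL MAR ", "cafe del mar"]
unique_keys = {sketch_key(t) for t in titles}

assert unique_keys == {"cafe del mar"}
assert sketch_key(" Café RÉSUMÉ ") == "cafe resume"
```

Three differently typed titles collapse to one key, which is the behaviour a dictionary or search index wants from its lookup key.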
URL-encoded query parameters
# Before
from urllib.parse import quote
from unidecode import unidecode
q = quote(unidecode("Café del Mar"))
# After
from urllib.parse import quote
from disarm import transliterate, slugify
assert quote(transliterate("Café del Mar")) == "Cafe%20del%20Mar"
# For a clean URL path segment, prefer a slug:
assert slugify("Café del Mar") == "cafe-del-mar"
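The effect of transliterating before quoting is easy to see with `urllib.parse` alone. The ASCII input below is typed by hand to stand in for transliterated text, so the snippet runs without disarm installed.

```python
from urllib.parse import quote, quote_plus, urlencode

# Without transliteration, each non-ASCII character becomes percent-encoded UTF-8.
raw = quote("Café del Mar")
assert raw == "Caf%C3%A9%20del%20Mar"

# With ASCII input, only the spaces need escaping.
plain = quote("Cafe del Mar")
assert plain == "Cafe%20del%20Mar"

# For query strings, quote_plus / urlencode encode spaces as "+".
plus = quote_plus("Cafe del Mar")
query = urlencode({"q": "Cafe del Mar"})
assert plus == "Cafe+del+Mar"
assert query == "q=Cafe+del+Mar"
```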
The ordering footgun
#88 found pipelines that lowercase before transliterating. That order is
subtly wrong whenever transliteration introduces an uppercase letter. The
numero sign № expands to No — a capital N that a prior .lower() cannot
see:
from disarm import transliterate, search_key
# Lowercasing BEFORE transliteration misses the introduced capital:
assert transliterate("№5".lower()) == "No5" # wrong — stray capital N
# Lowercasing AFTER transliteration is correct:
assert transliterate("№5").lower() == "no5"
# The disarm helpers fold case after transliterating, so they get it right:
assert search_key("№5") == "no5"
This is exactly the class of bug the one-call helpers remove: the correct order
is baked in. When in doubt, reach for slugify / search_key /
sanitize_filename rather than re-assembling the steps.
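The same effect can be reproduced with the standard library alone, which shows that the problem is a property of the ordering rather than of any particular transliterator. Here NFKC normalisation plays the role of transliteration; `fold` is a hypothetical helper name.

```python
import unicodedata


def fold(text: str) -> str:
    # NFKC expands the numero sign (U+2116) to the two letters "No".
    return unicodedata.normalize("NFKC", text)


# "№" has no lowercase form, so lowercasing first changes nothing,
# and the capital N appears only afterwards.
lower_first = fold("№5".lower())
# Folding first exposes the capital N to the later .lower().
fold_first = fold("№5").lower()

assert lower_first == "No5"
assert fold_first == "no5"
```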
See also
- Migrating from Unidecode: the drop-in alias and the security caveat (unidecode is not a confusable-defense tool).
- Slugification, Filename Sanitization, Normalization.