Using translit in LLM pipelines¶
The LLM ecosystem rebuilds translit's functionality ad hoc — LiteLLM ships a
hand-written normaliser, Haystack has ascii_only, every tokenizer wrapper has
its own accent-stripping. The
survey behind this page found the
functionality gap is small; the documentation gap is the whole problem.
normalize_confusables() is usually presented as a transliteration utility, when
for this audience it is a guardrail primitive.
This page frames translit's existing transforms for two LLM jobs — guardrail matching (filtering untrusted input) and ingestion (normalising content for ASCII indexes) — and is explicit about which path each recipe belongs to. Every snippet is executed and asserted in CI.
Two paths, do not cross them
The guardrail path folds confusables to defeat homoglyph spoofing. Run it on legitimate non-Latin text and it corrupts that text (see Which path?). The ingestion path romanises and must not be used to rewrite a prompt the model already handles. Pick the path on purpose.
The NFKC-first convention¶
Matching frameworks normalise before they compare, because an attacker controls the encoding of a string, not just its letters. translit's defense functions follow the same convention — NFKC is their first step — so they are safe to call on raw, untrusted input:
from translit import strip_obfuscation
# Fullwidth letters (NFKC-folded) and zero-width joiners (stripped):
assert strip_obfuscation("Hello") == "Hello"
assert strip_obfuscation("hi") == "hi"
You do not need to pre-normalise before handing text to strip_obfuscation() or
normalize_confusables(); they start from NFKC themselves.
Guardrail primitives¶
For filtering untrusted input — denylist checks, prompt-injection screening,
policy matching — the primitives are strip_obfuscation() (full deobfuscation)
and normalize_confusables() (TR39 visual fold only).
The key difference from a LiteLLM-style hand-rolled normaliser is what translit
refuses to do: it does not apply leet/digit remapping. Digit remapping
corrupts the numeric text that pervades an LLM stack — model names, versions,
quantities — so 4, 0, 1 are left alone:
from translit import normalize_confusables
# Identifiers and version numbers survive untouched:
assert normalize_confusables("gpt-4o") == "gpt-4o"
assert normalize_confusables("Llama-3.1-70B") == "Llama-3.1-70B"
# But cross-script homoglyph spoofs are folded to their Latin skeleton:
assert normalize_confusables("pаypаl") == "paypal" # Cyrillic а → a
What TR39 covers instead of leet tables is visual confusability: a Cyrillic
а that renders identically to Latin a folds to a. See
Confusable Detection for the table and its limits.
Recipe — guardrail matching key:
from translit import get_pipeline
guardrail = get_pipeline("llm_guardrail")
# NFKC → strip zalgo/bidi → demojize → strip accents → confusables →
# fold case → strip control/zero-width → collapse whitespace
assert guardrail("Ѕ𝗲𝗰𝗿𝗲𝘁 data") == "secret data"
Compare the matched key, not the raw string, against your policy.
Convert, don't delete¶
The other common pattern is NFKD + encode("ascii", "ignore") (Haystack's
ascii_only, Whisper's text cleaner). On non-Latin content this deletes the
text — an ASCII index built that way simply loses the document:
from translit import transliterate
passage = "Привет мир"
# ascii-ignore throws the whole passage away:
assert passage.encode("ascii", "ignore") == b" "
# transliterate keeps it, searchable, as readable romanisation:
assert transliterate(passage) == "Privet mir"
This is the wedge for index/retrieval: non-Latin content stays findable in an ASCII-normalised index without losing its semantics.
Recipe — ingestion / RAG index normalisation:
from translit import get_pipeline
ingest = get_pipeline("rag_ingest")
# NFKC → strip bidi → strip accents → transliterate →
# strip control/zero-width → collapse whitespace
assert ingest("Café déjà vu") == "Cafe deja vu"
assert ingest("Привет, мир!") == "Privet, mir!"
A composed entry point¶
If you want one function, compose the two paths and keep transliteration optional — it belongs to the index/matching paths, never to a prompt you are about to send to a model that reads the original script fine:
from translit import get_pipeline
def prepare_for_llm(text, *, romanize=False):
"""Normalize untrusted text for an LLM index / matching path.
romanize=False → guardrail fold (homoglyph + obfuscation defense).
romanize=True → ingestion romanisation for an ASCII index.
Either way this is for indexing/matching, not for rewriting prompts.
"""
profile = "rag_ingest" if romanize else "llm_guardrail"
return get_pipeline(profile)(text)
assert prepare_for_llm("pаypаl") == "paypal" # guardrail
assert prepare_for_llm("Москва", romanize=True) == "Moskva" # ingestion
Which path, and when NOT to use translit¶
Being explicit about the path is what earns credibility with this audience — the wrong path actively destroys signal:
from translit import get_pipeline
# WRONG: the guardrail fold mangles legitimate Cyrillic — its confusable step
# partially rewrites real letters, producing garbage:
assert get_pipeline("llm_guardrail")("Москва") == "mocĸвa"
# RIGHT: the ingestion path romanises the same input cleanly:
assert get_pipeline("rag_ingest")("Москва") == "Moskva"
Do not reach for translit when:
- The text goes straight to a multilingual model. Modern tokenizers already NFC/NFKC-normalise, and the model handles native script better than any romanisation. Transliterating the prompt throws away signal.
- You only need encoding repair. That is
ftfy's job, not translit's. - You need lossless round-tripping. Compatibility-tier romanisation (CJK, Indic) is lossy; see Limitations.
Use translit on the guardrail path (match untrusted input against policy) and the ingestion path (build an ASCII-normalised index) — not on the generation path.
See also¶
- Research: Transliteration for LLM Pre-Processing — the underlying survey.
- Precompiled Pipelines — every named profile and its steps.
- Confusable Detection, Adversarial-Text Defense.