Language Detection¶
How translit's lang="auto" detection works, from script identification through character-level discrimination to fail-safe fallbacks.
Overview¶
When lang="auto" is passed to transliterate(), slugify(), catalog_key(), or any other function that accepts a lang parameter, translit runs a three-stage detection pipeline:
- Script identification: find the first non-Latin, non-Common script in the text
- Character-level discrimination: scan for exclusive characters that uniquely identify a language within ambiguous scripts
- Fallback: use the script's default language mapping
The entire pipeline is deterministic, O(n), and fail-safe: if detection is uncertain, it falls back to the safe default rather than guessing.
from translit import transliterate
# Stage 1: Cyrillic detected → ambiguous script
# Stage 2: ї found → Ukrainian discriminator hit
# Stage 3: fallback not needed, resolved language is "uk"
transliterate("Київ", lang="auto") # → "Kyiv" (Ukrainian profile)
Stage 1: Script Identification¶
translit classifies each character by its Unicode script property using a static table of 42 scripts with binary search lookup. The first non-Latin, non-Common, non-Inherited character determines the primary script.
from translit import detect_scripts
detect_scripts("Москва") # → [Script.CYRILLIC]
detect_scripts("東京タワー") # → [Script.HAN, Script.KATAKANA]
detect_scripts("Hello World") # → [Script.LATIN]
For Latin-only text, stage 1 applies no language override. Stage 2 may still detect Latin discriminators such as Vietnamese or Turkish characters.
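The range-table lookup described above can be sketched in plain Python. This is an illustration under stated assumptions, not translit's actual table: `_RANGES` covers only a handful of block-level approximations, and `script_of` and `first_foreign_script` are hypothetical names.

```python
from bisect import bisect_right
from typing import Optional

# Sorted, non-overlapping (start, end, script) ranges. Illustrative subset
# only; block boundaries are approximations of the Unicode script property.
_RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x00C0, 0x024F, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
]
_STARTS = [start for start, _, _ in _RANGES]


def script_of(ch: str) -> str:
    """Binary-search the range table; anything unlisted counts as Common."""
    cp = ord(ch)
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= _RANGES[i][1]:
        return _RANGES[i][2]
    return "Common"


def first_foreign_script(text: str) -> Optional[str]:
    """Return the script of the first non-Latin, non-Common character."""
    for ch in text:
        script = script_of(ch)
        if script not in ("Latin", "Common"):
            return script
    return None  # Latin-only (or empty) text
```

Because the table is sorted by range start, each lookup costs O(log k) for k ranges, which keeps the overall scan linear in the length of the text.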
Unambiguous scripts¶
Many scripts map to exactly one language in translit's profile set. Detection is immediate with no further analysis needed:
| Script | Language | Code |
|---|---|---|
| Georgian | Georgian | ka |
| Armenian | Armenian | hy |
| Thai | Thai | th |
| Hangul | Korean | ko |
| Hiragana / Katakana | Japanese | ja |
| Greek | Greek | el |
| Thaana | Dhivehi | dv |
| Bengali | Bengali | bn |
| Tamil | Tamil | ta |
| Telugu | Telugu | te |
| Kannada | Kannada | kn |
| Malayalam | Malayalam | ml |
| Gujarati | Gujarati | gu |
| Gurmukhi | Punjabi | pa |
| Odia | Odia | or |
| Sinhala | Sinhala | si |
| Ethiopic | Amharic | am |
| Tibetan | Tibetan | bo |
| Lao | Lao | lo |
| Myanmar | Burmese | my |
| Khmer | Khmer | km |
| Hebrew | Hebrew | he |
| Javanese | Javanese | jv |
For these scripts, lang="auto" is equivalent to passing the explicit code.
Ambiguous scripts¶
Three scripts are shared by multiple languages in translit's profile set and can be narrowed down by discriminator characters:
| Script | Languages | Default |
|---|---|---|
| Cyrillic | Russian, Ukrainian, Serbian, Bulgarian, Mongolian | ru |
| Arabic | Arabic, Persian | ar |
| Latin | German, Turkish, Vietnamese, + 20 others | (no override) |
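These proceed to stage 2. Han and Devanagari are also shared by several languages but have no discriminator characters, so they resolve directly to their stage 3 defaults (zh and hi).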
Stage 2: Character-Level Discrimination¶
For ambiguous scripts, translit scans up to the first 2,000 characters looking for exclusive characters — codepoints that appear in exactly one language's alphabet among all supported profiles for that script.
Discriminator table¶
| Script | Exclusive characters | Detected language |
|---|---|---|
| Cyrillic | ґ Ґ ї Ї є Є і І | uk (Ukrainian) |
| Cyrillic | ђ Ђ ћ Ћ љ Љ њ Њ џ Џ ј Ј | sr (Serbian) |
| Cyrillic | ө Ө ү Ү | mn (Mongolian) |
| Arabic | پ چ ژ گ | fa (Persian) |
| Latin | ơ Ơ ư Ư | vi (Vietnamese) |
| Latin | İ ı | tr (Turkish) |
| Latin | ß ẞ | de (German) |
Algorithm¶
for each character in text[0..2000]:
    if character is an exclusive discriminator for this script:
        return the associated language    # first hit wins, bail early
return None                               # no discriminator found, use default
Key properties:
- First-hit-wins: scanning stops at the first discriminator character found. This is a performance optimization — for well-formed text, a single exclusive character is sufficient for identification.
- 2,000-character cap: avoids scanning entire documents. The first 2K characters are sufficient for discrimination in practice.
- No conflict resolution: if a text contains exclusive characters from two different languages (e.g., Ukrainian ї and Serbian ћ), whichever appears first wins. This is intentional — such mixed text is rare and artificial.
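A minimal Python sketch of the scan, assuming the discriminator table shown earlier is stored as one character-to-language mapping per script. The names `DISCRIMINATORS`, `SCAN_LIMIT`, and `discriminate` are hypothetical and do not come from translit's API.

```python
from typing import Optional

# Exclusive characters per script, copied from the discriminator table.
DISCRIMINATORS = {
    "Cyrillic": {
        **dict.fromkeys("ґҐїЇєЄіІ", "uk"),
        **dict.fromkeys("ђЂћЋљЉњЊџЏјЈ", "sr"),
        **dict.fromkeys("өӨүҮ", "mn"),
    },
    "Arabic": dict.fromkeys("پچژگ", "fa"),
    "Latin": {
        **dict.fromkeys("ơƠưƯ", "vi"),
        **dict.fromkeys("İı", "tr"),
        **dict.fromkeys("ßẞ", "de"),
    },
}

SCAN_LIMIT = 2000  # only the first 2,000 characters are inspected


def discriminate(text: str, script: str) -> Optional[str]:
    """Return the language of the first exclusive character, if any."""
    table = DISCRIMINATORS.get(script, {})
    for ch in text[:SCAN_LIMIT]:
        lang = table.get(ch)
        if lang is not None:
            return lang  # first hit wins, stop scanning
    return None  # caller falls back to the script default
```

With a dictionary lookup per character, the scan does constant work per character and never looks past the cap, so a discriminator that first appears after character 2,000 is not seen.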
Examples¶
from translit import transliterate
# Ukrainian: ї is exclusive to Ukrainian Cyrillic
transliterate("Київ", lang="auto") # → "Kyiv"
# Serbian: ћ is exclusive to Serbian Cyrillic
transliterate("кућа", lang="auto") # → uses sr profile
# Persian: پ is exclusive to Persian Arabic
transliterate("پارسی", lang="auto") # → "parsy"
# Vietnamese: ơ is exclusive to Vietnamese Latin
transliterate("Cần Thơ", lang="auto") # → uses vi profile
# German: ß is exclusive to German Latin
transliterate("Straße", lang="auto") # → "Strasse" (de: ß→ss)
# No discriminator: Москва has no exclusive chars
transliterate("Москва", lang="auto") # → "Moskva" (default ru)
Stage 3: Fallback¶
If no discriminator is found:
- Ambiguous scripts fall back to their default language: Cyrillic → ru, Arabic → ar, Han → zh, Devanagari → hi
- Latin-only text receives no language override (default transliteration table)
- Latin-only text that does contain a discriminator (ơ, İ, ß) does not reach this fallback: the Latin discriminator table has already resolved it in stage 2
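The fallback rule can be summarized in a few lines of Python. This is a sketch under assumptions: `SCRIPT_DEFAULTS` and `resolve_lang` are hypothetical names, and the function takes the outputs of the two earlier stages as plain arguments rather than calling into translit.

```python
from typing import Optional

# Script defaults as listed in this section.
SCRIPT_DEFAULTS = {
    "Cyrillic": "ru",
    "Arabic": "ar",
    "Han": "zh",
    "Devanagari": "hi",
}


def resolve_lang(script: Optional[str], discriminator_lang: Optional[str]) -> Optional[str]:
    """Combine stage 1 (script) and stage 2 (discriminator) into a language.

    script is None for Latin-only text; discriminator_lang is None when
    stage 2 found no exclusive character.
    """
    if discriminator_lang is not None:
        return discriminator_lang  # positive evidence beats any default
    if script is None:
        return None  # Latin-only text: no language override
    return SCRIPT_DEFAULTS.get(script)
```

For example, `resolve_lang("Cyrillic", None)` yields `"ru"`, while `resolve_lang(None, "de")` yields `"de"` because a Latin discriminator was found.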
Fail-Safe Guarantee¶
The discriminator system is designed to never produce a worse result than the previous script-default approach:
- Discriminators can only upgrade detection (from script default to a more specific language)
- If no discriminator is found, the result is identical to not using lang="auto" at all
- Discriminator characters are selected to have zero ambiguity: each appears in exactly one supported language profile for its script
This means lang="auto" is always at least as good as the script default, and often better.
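The zero-ambiguity property is easy to check mechanically. The sketch below is not part of translit: it assumes the Cyrillic discriminator sets from the table above, adds the basic Russian alphabet as a stand-in for the default profile, and uses a hypothetical helper named `overlapping_discriminators`.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Cyrillic discriminator sets, copied from the discriminator table.
CYRILLIC_SETS: Dict[str, Set[str]] = {
    "uk": set("ґҐїЇєЄіІ"),
    "sr": set("ђЂћЋљЉњЊџЏјЈ"),
    "mn": set("өӨүҮ"),
}

# Basic Russian alphabet (А..я plus Ё/ё), standing in for the ru default.
RUSSIAN_ALPHABET: Set[str] = {chr(cp) for cp in range(0x0410, 0x0450)} | {"Ё", "ё"}


def overlapping_discriminators(
    sets: Dict[str, Set[str]],
) -> List[Tuple[str, str, Set[str]]]:
    """Return (lang_a, lang_b, shared_chars) for every pair that overlaps."""
    clashes = []
    for (a, chars_a), (b, chars_b) in combinations(sets.items(), 2):
        shared = chars_a & chars_b
        if shared:
            clashes.append((a, b, shared))
    return clashes


# An empty list means every discriminator is exclusive, including
# against the default language's own alphabet.
clashes = overlapping_discriminators({**CYRILLIC_SETS, "ru": RUSSIAN_ALPHABET})
```

A check like this could run as a unit test whenever a new profile is added, so that a newly supported language sharing a character with an existing discriminator set is caught early.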
Limitations¶
- Bulgarian vs Russian: Bulgarian Cyrillic has no characters outside the Russian alphabet, so there are no exclusive characters to detect. lang="auto" defaults to ru for both. Pass lang="bg" explicitly for Bulgarian text.
- Mongolian Cyrillic vs Russian: Mongolian is detected only when ө or ү appear. Mongolian text that contains neither is indistinguishable from Russian and resolves to ru.
- Devanagari languages: Hindi, Marathi, Nepali, and Sanskrit all use Devanagari. lang="auto" defaults to hi for all. Since the default Devanagari transliteration table is used by all four, this has no practical impact.
- Han characters: Chinese and Japanese kanji share the same Unicode block. lang="auto" defaults to zh (Chinese pinyin). For Japanese readings, pass lang="ja" explicitly.
- Mixed-script text: when text contains multiple scripts (e.g., "Hello Мир"), the first non-Latin, non-Common script determines the language. Latin portions use the default table regardless.
- Short text: very short strings (1-3 characters) may not contain any discriminator characters, falling back to the script default.
Inspecting Detection Results¶
Use inspect_auto_lang() to see exactly how the detection pipeline resolved for a given text. This is useful for logging, auditing, and debugging:
from translit import inspect_auto_lang
result = inspect_auto_lang("Київ")
# {
# 'script': 'Cyrillic',
# 'chosen_lang': 'uk',
# 'reason': 'discriminator',
# 'discriminators_hit': ['ї']
# }
result = inspect_auto_lang("Москва")
# {
# 'script': 'Cyrillic',
# 'chosen_lang': 'ru',
# 'reason': 'script_default',
# 'discriminators_hit': []
# }
result = inspect_auto_lang("Straße")
# {
# 'script': None,
# 'chosen_lang': 'de',
# 'reason': 'latin_discriminator',
# 'discriminators_hit': ['ß']
# }
Return value¶
| Key | Type | Description |
|---|---|---|
| script | str \| None | Primary non-Latin script detected, or None for Latin/ASCII |
| chosen_lang | str \| None | Resolved language code, or None if no language matched |
| reason | str | Detection reason: "unambiguous_script", "discriminator", "script_default", "latin_discriminator", or "no_detection" |
| discriminators_hit | list[str] | Discriminator characters that triggered the match (empty if none) |
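For auditing, the reason field can be used to separate detections backed by positive evidence from those that rest on a default. The sketch below is an assumption-laden illustration: `AutoLangResult`, `LOW_CONFIDENCE_REASONS`, and `needs_review` are hypothetical names, and the dictionaries passed in mirror the documented return value rather than coming from a live call.

```python
from typing import List, Optional, TypedDict


class AutoLangResult(TypedDict):
    """Shape of the documented return value (hypothetical type name)."""
    script: Optional[str]
    chosen_lang: Optional[str]
    reason: str
    discriminators_hit: List[str]


# Reasons where no character or script uniquely identified the language.
LOW_CONFIDENCE_REASONS = {"script_default", "no_detection"}


def needs_review(result: AutoLangResult) -> bool:
    """True when the language was chosen by default rather than by evidence."""
    return result["reason"] in LOW_CONFIDENCE_REASONS


kyiv: AutoLangResult = {
    "script": "Cyrillic",
    "chosen_lang": "uk",
    "reason": "discriminator",
    "discriminators_hit": ["ї"],
}
moscow: AutoLangResult = {
    "script": "Cyrillic",
    "chosen_lang": "ru",
    "reason": "script_default",
    "discriminators_hit": [],
}
```

Here `needs_review(moscow)` is true and `needs_review(kyiv)` is false, so a batch job could log or queue only the records that fell back to a default.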
When NOT to Use lang="auto"¶
lang="auto" is a convenience for bulk or unknown-source text. Do not use it when:
- The language is already known: bibliographic records, catalog entries, and curated datasets should always pass an explicit lang code.
- Legal or official names: personal names, place names in legal documents, and citations require the correct language profile. Auto-detection may default to the wrong language for ambiguous scripts.
- Short text: 1-3 character strings rarely contain discriminators and will fall back to the script default, which may not be correct.
- Reproducibility matters: auto-detection depends on text content; the same function with explicit lang always produces the same mapping.
# Do this:
transliterate("София", lang="bg") # Known to be Bulgarian
# Not this:
transliterate("София", lang="auto") # Defaults to Russian (no Bulgarian discriminators)
Integration with Pipelines¶
lang="auto" works with all translit entry points:
from translit import (
transliterate, slugify, catalog_key, search_key, sort_key,
TextPipeline, Slugifier, Text, LANG_AUTO,
)
# Functions
transliterate("Київ", lang="auto")
catalog_key("Москва", lang="auto")
search_key("Straße", lang="auto")
sort_key("Війна і мир", lang="auto")
# Classes
pipe = TextPipeline(transliterate=True, lang="auto")
slug = Slugifier(lang="auto")
# Text builder
Text("Київ").transliterate(lang="auto").value
# Type-safe constant
transliterate("Москва", lang=LANG_AUTO)
See Also¶
- Language Support — full list of 83 language profiles and their override rules
- Language Reference — per-language transliteration tables and reference texts
- Limitations — known constraints of context-free transliteration