Predicates

Functions that inspect text and return boolean or structured results without modifying the input.

detect_scripts

detect_scripts(text: str) -> list[Script]

Return the Unicode scripts present in text as a list, ordered by first appearance.

Parameters:
  • text (str) –

    Input string.

Returns:
  • list[Script]

    List of Script enum values, ordered by first appearance.

Examples:

>>> detect_scripts("Hello")
[Script.LATIN]
>>> detect_scripts("Hello Мир")
[Script.LATIN, Script.CYRILLIC]
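
For intuition, the first-appearance ordering can be approximated with the standard library alone. The sketch below is not translit's implementation: approx_scripts is a hypothetical helper that guesses a script from the first word of each letter's Unicode character name, because unicodedata does not expose the Script property directly. It returns plain strings rather than Script enum values.

```python
import unicodedata


def approx_scripts(text: str) -> list[str]:
    """Approximate the scripts in text, ordered by first appearance."""
    seen: list[str] = []
    for ch in text:
        # Skip digits, spaces, punctuation and combining marks.
        if not ch.isalpha():
            continue
        # Heuristic: "CYRILLIC CAPITAL LETTER EM" -> "CYRILLIC".
        name = unicodedata.name(ch, "UNKNOWN")
        script = name.split(" ", 1)[0]
        if script not in seen:
            seen.append(script)
    return seen


print(approx_scripts("Hello"))      # ['LATIN']
print(approx_scripts("Hello Мир"))  # ['LATIN', 'CYRILLIC']
```

Under this approximation, a mixed-script check reduces to len(approx_scripts(text)) > 1. The name-prefix heuristic is coarse and will mislabel some characters, so treat it as an illustration only.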

inspect_auto_lang

inspect_auto_lang(text: str) -> dict[str, str | list[str] | None]

Inspect how lang="auto" would resolve for the given text.

Use this to audit or log the detection decision made by the three-stage auto-detection pipeline.

Parameters:
  • text (str) –

    Input string.

Returns:
  • dict[str, str | list[str] | None]

    Dict with keys:

      • script: primary non-Latin script name, or None
      • chosen_lang: resolved language code, or None
      • reason: one of "unambiguous_script", "discriminator", "script_default", "latin_discriminator", "no_detection"
      • discriminators_hit: list of discriminator characters found

Examples:

>>> inspect_auto_lang("Київ")["chosen_lang"]
'uk'
>>> inspect_auto_lang("Москва")["reason"]
'script_default'

from translit import inspect_auto_lang

inspect_auto_lang("Київ")
# {'script': 'Cyrillic', 'chosen_lang': 'uk', 'reason': 'discriminator', 'discriminators_hit': ['ї']}

inspect_auto_lang("Москва")
# {'script': 'Cyrillic', 'chosen_lang': 'ru', 'reason': 'script_default', 'discriminators_hit': []}

inspect_auto_lang("hello")
# {'script': None, 'chosen_lang': None, 'reason': 'no_detection', 'discriminators_hit': []}

See Language Detection for details.
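
The Cyrillic examples above suggest a simple decision order: look for characters that occur in only one candidate language, and fall back to a per-script default otherwise. The sketch below imitates that order for Cyrillic only. The UK_DISCRIMINATORS set, the "ru" default, and the function name sketch_auto_lang are assumptions chosen to reproduce the outputs shown above; translit's actual tables and pipeline stages may differ.

```python
import unicodedata

# Assumed discriminators: letters used in Ukrainian but not in Russian.
UK_DISCRIMINATORS = set("їєґЇЄҐ")


def sketch_auto_lang(text: str) -> dict:
    """Toy, Cyrillic-only imitation of the detection decision."""
    is_cyrillic = any(
        unicodedata.name(ch, "").startswith("CYRILLIC") for ch in text
    )
    if not is_cyrillic:
        return {"script": None, "chosen_lang": None,
                "reason": "no_detection", "discriminators_hit": []}
    # Collect every discriminator character, in order of appearance.
    hits = [ch for ch in text if ch in UK_DISCRIMINATORS]
    if hits:
        return {"script": "Cyrillic", "chosen_lang": "uk",
                "reason": "discriminator", "discriminators_hit": hits}
    # No discriminator found: fall back to the assumed script default.
    return {"script": "Cyrillic", "chosen_lang": "ru",
            "reason": "script_default", "discriminators_hit": []}


print(sketch_auto_lang("Київ")["chosen_lang"])  # uk
print(sketch_auto_lang("Москва")["reason"])     # script_default
```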


is_mixed_script

is_mixed_script(text: str) -> bool

True if text contains characters from more than one Unicode script.

Parameters:
  • text (str) –

    Input string.

Returns:
  • bool

    True if more than one script is detected. Characters in the Common and Inherited scripts are not counted.

Examples:

>>> is_mixed_script("Hello")
False
>>> is_mixed_script("Hello Мир")  # Latin + Cyrillic
True

is_confusable

is_confusable(text: str, *, target_script: str = 'latin', greedy: bool | None = None, preferred_aliases: list[str] | None = None) -> bool

True if text contains characters confusable with target-script characters.

Parameters:
  • text (str) –

    Input string.

  • target_script (str, default: 'latin' ) –

    Script to check confusability against. Currently only "latin" is supported; any other value raises TranslitError.

  • greedy (bool | None, default: None ) –

    Accepted for compatibility with confusable_homoglyphs; ignored with a DeprecationWarning. translit always checks all characters.

  • preferred_aliases (list[str] | None, default: None ) –

    Accepted for compatibility with confusable_homoglyphs; ignored with a DeprecationWarning. translit uses its own script detection engine.

Returns:
  • bool

    True if any confusable homoglyphs are present.

Raises:
  • TranslitError

    If target_script is not "latin".

Examples:

>>> is_confusable("pаypal")  # Cyrillic а looks like Latin a
True
>>> is_confusable("paypal")  # all genuine Latin
False
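
To make the idea concrete, here is a minimal sketch of a lookalike check against Latin. The LOOKALIKES table is a tiny hand-picked sample and has_latin_lookalike is a hypothetical name; neither reflects the data or code translit actually uses.

```python
# Tiny, hand-picked sample of non-Latin letters that resemble Latin ones.
# Real confusable data (for example Unicode's confusables.txt) is far larger.
LOOKALIKES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u03bf": "o",  # Greek small omicron
}


def has_latin_lookalike(text: str) -> bool:
    """True if any character appears in the sample lookalike table."""
    return any(ch in LOOKALIKES for ch in text)


print(has_latin_lookalike("p\u0430ypal"))  # True (Cyrillic а)
print(has_latin_lookalike("paypal"))       # False
```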

is_ascii

is_ascii(text: str) -> bool

True if all characters are in U+0000–U+007F.

Parameters:
  • text (str) –

    Input string.

Returns:
  • bool

    True if the string is pure ASCII.

Examples:

>>> is_ascii("hello 123")
True
>>> is_ascii("café")
False
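
The range test described above can be written directly with the standard library. This is a sketch of the rule, not translit's code, and all_ascii is a hypothetical name; Python's built-in str.isascii() (3.7+) applies the same range check.

```python
def all_ascii(text: str) -> bool:
    """True if every code point is at most U+007F."""
    return all(ord(ch) <= 0x7F for ch in text)


# The built-in str.isascii() agrees with the sketch on these samples.
for sample in ("hello 123", "caf\u00e9", ""):
    assert all_ascii(sample) == sample.isascii()

print(all_ascii("hello 123"))  # True
print(all_ascii("caf\u00e9"))  # False
```

Note that the sketch, like str.isascii(), returns True for the empty string.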

is_normalized

is_normalized(text: str, *, form: NormalizationForm = 'NFC') -> bool

True if text is already in the specified normalization form.

Parameters:
  • text (str) –

    Input string.

  • form (NormalizationForm, default: 'NFC' ) –

    Normalization form — "NFC", "NFD", "NFKC", or "NFKD".

Returns:
  • bool

    True if the string is already normalized.

Examples:

>>> is_normalized("café")  # NFC by default
True
>>> is_normalized("e\u0301", form="NFC")  # NFD decomposed
False
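
A check like this can be expressed as "normalizing changes nothing". The sketch below uses only the standard library; already_normalized is a hypothetical name and this is not necessarily how translit implements the check. Python 3.8+ also provides unicodedata.is_normalized(form, text), which answers the same question.

```python
import unicodedata


def already_normalized(text: str, form: str = "NFC") -> bool:
    """True if normalizing text to the given form leaves it unchanged."""
    return unicodedata.normalize(form, text) == text


print(already_normalized("caf\u00e9"))            # True  (precomposed é)
print(already_normalized("e\u0301"))              # False (e + combining acute)
print(already_normalized("e\u0301", form="NFD"))  # True  (already decomposed)
```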

is_zalgo

is_zalgo(text: str, *, threshold: int = 3) -> bool

Detect whether text contains zalgo-style combining mark abuse.

Returns True if any base character has more than threshold consecutive combining marks in NFD decomposition.

Parameters:
  • text (str) –

    Input string to check.

  • threshold (int, default: 3 ) –

    Maximum allowed combining marks per base character. Vietnamese uses at most 2 marks per base character in NFD, so the default is intended to be safe for all legitimate scripts.

Returns:
  • bool

    True if zalgo-style stacking is detected.

Examples:

>>> is_zalgo("café")
False
>>> is_zalgo("Việt Nam")
False
>>> is_zalgo("ḧ̸̡̢̧̛̗̱̜̼̯̞̙́̑̾̊̿̏̒̓̕ě̵̢̧̛̗̱̜̼̯̞̙̈́̑̾̊̿̏̒̓̕l̸̡̢̧̛̗̱̜̼̯̞̙̈́̑̾̊̿̏̒̓̕l̸̡̢̧̛̗̱̜̼̯̞̙̈́̑̾̊̿̏̒̓̕ơ̵̢̧̗̱̜̼̯̞̙̈́̑̾̊̿̏̒̓̕")
True

from translit import is_zalgo

is_zalgo("café")          # False (1 combining mark — normal)
is_zalgo("Việt Nam")      # False (2 combining marks — normal)
# Zalgo: 'a' with 20 stacked combining graves
is_zalgo("a" + "\u0300" * 20)  # True
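
The threshold rule can be sketched with the standard library: decompose to NFD, then count runs of consecutive combining marks. The function name has_mark_stacking is hypothetical, and treating every character in a Unicode "M" category as a combining mark is an assumption; translit may classify marks differently.

```python
import unicodedata


def has_mark_stacking(text: str, threshold: int = 3) -> bool:
    """True if any run of combining marks is longer than threshold."""
    run = 0
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch).startswith("M"):
            # Mn, Mc or Me: extend the current run of marks.
            run += 1
            if run > threshold:
                return True
        else:
            # A non-mark character starts a new run.
            run = 0
    return False


print(has_mark_stacking("café"))               # False (1 mark)
print(has_mark_stacking("Việt Nam"))           # False (2 marks)
print(has_mark_stacking("a" + "\u0300" * 20))  # True  (20 marks)
```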

is_safe_hostname

is_safe_hostname(hostname: str) -> tuple[bool, SafeHostnameDetails]

Check if a hostname is safe from Unicode homoglyph attacks.

Returns (is_safe, details) where details is a SafeHostnameDetails with attributes:

  • safe: bool — True if no homoglyph spoofing detected.
  • scripts: list[str] — Unicode scripts found across all labels.
  • mixed_script: bool — True if multiple scripts detected.
  • has_confusables: bool — True if confusable homoglyphs found.
  • canonical: str — Latin-normalized form of the hostname.

A hostname is considered unsafe if it contains mixed high-risk scripts (Cyrillic+Latin, Greek+Latin) or confusable homoglyphs.

Parameters:
  • hostname (str) –

    Hostname string to check (e.g. "example.com").

Returns:
  • tuple[bool, SafeHostnameDetails]

    Tuple of (is_safe, details) where details is a SafeHostnameDetails.

Examples:

>>> safe, details = is_safe_hostname("google.com")
>>> safe
True
>>> details.canonical
'google.com'

SafeHostnameDetails

The second element of the tuple returned by is_safe_hostname():

Attribute        Type       Description
safe             bool       True if no homoglyph spoofing detected
scripts          list[str]  Unicode scripts found across all labels
mixed_script     bool       True if multiple scripts detected
has_confusables  bool       True if confusable homoglyphs found
canonical        str        Latin-normalized form of the hostname

from translit import is_safe_hostname

safe, details = is_safe_hostname("google.com")
# safe = True, details.canonical = "google.com"

safe, details = is_safe_hostname("gооgle.com")  # Cyrillic о's
# safe = False, details.mixed_script = True, details.has_confusables = True
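
The mixed high-risk script rule (Cyrillic+Latin, Greek+Latin) can be illustrated with a short standard-library sketch. It covers only that half of the rule, not the confusable check. The names HIGH_RISK_PAIRS, hostname_scripts and mixes_high_risk_scripts are hypothetical, and the script guess is a coarse heuristic based on Unicode character names rather than translit's detection engine.

```python
import unicodedata

# Script pairs treated as high risk, following the rule stated in the text.
HIGH_RISK_PAIRS = [{"LATIN", "CYRILLIC"}, {"LATIN", "GREEK"}]


def hostname_scripts(hostname: str) -> set[str]:
    """Collect approximate scripts across all labels of the hostname."""
    scripts: set[str] = set()
    for label in hostname.split("."):
        for ch in label:
            # Digits and hyphens carry no script of their own here.
            if ch.isalpha():
                name = unicodedata.name(ch, "UNKNOWN")
                scripts.add(name.split(" ", 1)[0])
    return scripts


def mixes_high_risk_scripts(hostname: str) -> bool:
    """True if both scripts of any high-risk pair occur in the hostname."""
    scripts = hostname_scripts(hostname)
    return any(pair <= scripts for pair in HIGH_RISK_PAIRS)


print(mixes_high_risk_scripts("google.com"))            # False
print(mixes_high_risk_scripts("g\u043e\u043egle.com"))  # True (Cyrillic о)
```

A hostname written entirely in one non-Latin script passes this particular check, which is presumably why the documented rule tests for confusable homoglyphs as well.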
