Encoding Detection & Decoding

Functions for detecting the encoding of byte sequences and converting them to UTF-8. Auto-detection uses the chardetng algorithm (Firefox's encoding detector).

detect_encoding

detect_encoding(data: bytes) -> tuple[str, float]

Detect the encoding of a byte sequence.

Returns (encoding_name, confidence) where confidence is 0.0–1.0. Uses the chardetng algorithm (Firefox's encoding detector).

Parameters:
  • data (bytes) –

    Raw byte sequence to analyze.

Returns:
  • tuple[str, float]

    Tuple of (encoding_name, confidence) where confidence is 0.0–1.0.

Raises:
  • TranslitError

    If the byte sequence cannot be analyzed.

Examples:

>>> enc, conf = detect_encoding(b"Hello World")
>>> enc
'UTF-8'

from translit import detect_encoding

enc, confidence = detect_encoding(b"Hello World")
# enc = "UTF-8", confidence ≈ 1.0

# Windows-1252 encoded text
enc, confidence = detect_encoding("café".encode("windows-1252"))
# enc = "windows-1252", confidence ≈ 0.87

Warning

Automatic encoding detection is inherently probabilistic. A high confidence score does not guarantee correctness. For critical pipelines, always prefer explicit encoding metadata (HTTP headers, BOM, schema definitions) over detection.
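One way to act on this advice is to consult explicit metadata first and fall back to detection only when nothing else is available. The sketch below is an illustration, not part of this library: `sniff_bom` and `choose_encoding` are hypothetical helper names, the precedence order (BOM, then declared encoding, then detection) is one reasonable choice among several, and the detector is passed in as a callable so that `detect_encoding` or any function with the same return shape can be supplied.

```python
import codecs
from typing import Callable, Optional, Tuple

# Byte order marks for UTF-8 and UTF-16, mapped to encoding names.
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


def choose_encoding(
    data: bytes,
    declared: Optional[str] = None,
    detect: Optional[Callable[[bytes], Tuple[str, float]]] = None,
) -> Tuple[str, str]:
    """Pick an encoding and report where it came from.

    Hypothetical helper: prefers a BOM, then a declared encoding
    (for example from an HTTP header), and only then calls the detector.
    """
    bom = sniff_bom(data)
    if bom is not None:
        return bom, "bom"
    if declared:
        return declared, "declared"
    if detect is None:
        raise ValueError("no BOM, no declared encoding, and no detector supplied")
    name, _confidence = detect(data)
    return name, "detected"
```

With this library, the call would presumably look like `choose_encoding(raw, declared=header_charset, detect=detect_encoding)`, where `header_charset` is whatever the transport layer reported.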


decode_to_utf8

decode_to_utf8(data: bytes, encoding: str | None = None, *, min_confidence: float = 0.0) -> tuple[str, bool]

Decode a byte sequence to UTF-8.

Returns (decoded_text, had_errors) where had_errors is True if any characters were replaced during decoding (lossy conversion).
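To make the meaning of had_errors concrete, here is a minimal standard-library sketch of the same idea. It is not how decode_to_utf8 is implemented; `decode_lossy` is a hypothetical name, and it uses Python's built-in codecs rather than the WHATWG decoders, so results can differ for some encodings.

```python
from typing import Tuple


def decode_lossy(data: bytes, encoding: str) -> Tuple[str, bool]:
    """Decode data, reporting whether any bytes had to be replaced.

    Illustration only: a strict decode is tried first; if it fails, the
    data is decoded again with U+FFFD substituted for invalid bytes.
    """
    try:
        return data.decode(encoding), False
    except UnicodeDecodeError:
        return data.decode(encoding, errors="replace"), True


# The byte 0xE9 is "é" in windows-1252 but an incomplete sequence in UTF-8.
clean = decode_lossy(b"caf\xe9", "windows-1252")
lossy = decode_lossy(b"caf\xe9", "utf-8")
```

Here `clean` carries a False flag because every byte mapped to a character, while `lossy` carries True because the trailing byte was replaced.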

If encoding is None, auto-detects the encoding using the chardetng algorithm. Use min_confidence to require a minimum detection quality and avoid silently decoding with a low-confidence guess.

Supports all WHATWG encodings (UTF-8, windows-1252, ISO-8859-1, Shift_JIS, EUC-JP, EUC-KR, Big5, GB18030, etc.).

Parameters:
  • data (bytes) –

    Raw byte sequence to decode.

  • encoding (str | None, default: None ) –

    Encoding name (e.g. "windows-1252"). None to auto-detect.

  • min_confidence (float, default: 0.0 ) –

    Minimum acceptable detection confidence (0.0–1.0) when auto-detecting. Raises TranslitError if the detected confidence is below this threshold. Has no effect when encoding is provided explicitly. Defaults to 0.0 (accept any guess).

Returns:
  • tuple[str, bool]

    Tuple of (decoded_text, had_errors) where had_errors is True if any characters were replaced during lossy conversion.

Raises:
  • TranslitError

    If the encoding name is unknown, decoding fails, or auto-detection confidence is below min_confidence.

Examples:

>>> text, had_errors = decode_to_utf8(b"caf\xe9", "windows-1252")
>>> text
'café'
>>> had_errors
False

from translit import decode_to_utf8

# Explicit encoding
text, had_errors = decode_to_utf8(b"caf\xe9", "windows-1252")
# text = "café", had_errors = False

# Auto-detection (raw_bytes is any bytes object, e.g. the contents of a file)
text, had_errors = decode_to_utf8(raw_bytes)

# Require high confidence for auto-detection
text, had_errors = decode_to_utf8(raw_bytes, min_confidence=0.8)
# Raises TranslitError if detected confidence < 0.8
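A caller who sets min_confidence usually needs a plan for the failure case. The sketch below shows one possible pattern: attempt auto-detection with a threshold and, if that raises, decode with a known default instead. `decode_or_fallback` is a hypothetical helper, and the decoder and exception type are passed in as arguments so the sketch makes no assumption about how they are imported.

```python
from typing import Callable, Tuple, Type


def decode_or_fallback(
    data: bytes,
    decode: Callable[..., Tuple[str, bool]],
    error_type: Type[Exception],
    *,
    min_confidence: float = 0.8,
    fallback: str = "windows-1252",
) -> Tuple[str, bool]:
    """Auto-detect with a confidence floor, else use a fixed encoding.

    `decode` is expected to accept (data, encoding=None, *, min_confidence)
    like decode_to_utf8. Note that catching `error_type` also catches
    failures unrelated to confidence, which may or may not be desired.
    """
    try:
        return decode(data, min_confidence=min_confidence)
    except error_type:
        # Detection was rejected (or failed); decode with the known default.
        return decode(data, fallback)
```

Assuming TranslitError is importable alongside decode_to_utf8, usage would be `decode_or_fallback(raw_bytes, decode_to_utf8, TranslitError)`. The fallback encoding should come from knowledge of the data source rather than a guess.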
