Grapheme Clusters

Functions for working with user-perceived characters (extended grapheme clusters) as defined by UAX #29. Python's len() counts codepoints, so it overcounts emoji sequences, combining characters, and complex scripts; these functions count what the user actually sees.
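
As a quick illustration of that overcounting, the sketch below uses only Python's standard library (no translit calls) to show how many codepoints len() reports for strings a reader would see as a single character. The variable names are illustrative and not part of the library.

```python
import unicodedata

# Flag of France: two regional indicator symbols (F + R).
flag = "\U0001F1EB\U0001F1F7"

# Family emoji: four person emoji glued together by three zero-width joiners.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

# "é" in NFD form: the letter "e" followed by U+0301 (combining acute accent).
e_acute_nfd = unicodedata.normalize("NFD", "\u00E9")

print(len(flag))         # 2
print(len(family))       # 7
print(len(e_acute_nfd))  # 2
```

Each of these strings is displayed as one character, which is the count the functions on this page are documented to return.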

grapheme_len

grapheme_len(text: str) -> int

Count the number of user-perceived characters (extended grapheme clusters).

This is the correct answer to "how many characters does the user see?" A single grapheme cluster may span multiple codepoints (e.g., flag emoji, skin-toned emoji, Hangul syllables with combining jamo, Zalgo text).

Parameters:
  • text (str) – Input string.

Returns:
  • int – Number of extended grapheme clusters.

Examples:

>>> grapheme_len("café")
4
>>> grapheme_len("👨‍👩‍👧‍👦")  # family emoji = 1 grapheme cluster
1

from translit import grapheme_len

grapheme_len("café")                 # => 4
grapheme_len("👨‍👩‍👧‍👦")                   # => 1 (family emoji = 1 cluster)
grapheme_len("🇫🇷")                   # => 1 (flag = 1 cluster, but len() = 2)
grapheme_len("é")                    # => 1 (even if NFD: e + combining acute)

grapheme_split

grapheme_split(text: str) -> list[str]

Split text into a list of extended grapheme clusters.

Each element is a user-perceived character.

Parameters:
  • text (str) – Input string.

Returns:
  • list[str] – List of grapheme cluster strings.

Examples:

>>> grapheme_split("café")
['c', 'a', 'f', 'é']
>>> len(grapheme_split("👨‍👩‍👧‍👦!"))  # family emoji + "!"
2

from translit import grapheme_split

grapheme_split("café")               # => ['c', 'a', 'f', 'é']
grapheme_split("👨‍👩‍👧‍👦!")                 # => ['👨‍👩‍👧‍👦', '!']
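
To make the idea of cluster boundaries concrete, here is a deliberately simplified splitter written with the standard library only. It is an approximation under stated assumptions, not translit's implementation and not a complete UAX #29 algorithm: it attaches combining marks, zero-width joiners, emoji skin-tone modifiers, and the second half of a regional-indicator pair to the preceding cluster, and it ignores Hangul jamo, CR LF, and prepend rules. The name approx_grapheme_split and its helpers are hypothetical.

```python
import unicodedata

ZWJ = "\u200D"  # zero-width joiner, glues emoji into one sequence


def _is_regional_indicator(ch: str) -> bool:
    # Flag emoji are pairs of codepoints in U+1F1E6..U+1F1FF.
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def _extends_previous(ch: str) -> bool:
    # Combining marks (including variation selectors), ZWJ, and emoji
    # skin-tone modifiers never start a new cluster in this simplified model.
    return (
        unicodedata.category(ch) in ("Mn", "Mc", "Me")
        or ch == ZWJ
        or "\U0001F3FB" <= ch <= "\U0001F3FF"
    )


def approx_grapheme_split(text: str) -> list[str]:
    """Hypothetical, simplified approximation of grapheme splitting."""
    clusters: list[str] = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        joins = bool(prev) and (
            _extends_previous(ch)
            or prev.endswith(ZWJ)  # whatever follows a ZWJ stays attached
            or (
                len(prev) == 1
                and _is_regional_indicator(prev)
                and _is_regional_indicator(ch)  # second half of a flag
            )
        )
        if joins:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(approx_grapheme_split(family + "!")))            # 2
print(len(approx_grapheme_split("\U0001F1EB\U0001F1F7")))  # 1
print(len(approx_grapheme_split("cafe\u0301")))            # 4
```

The sketch only shows why a splitter has to look at neighbouring codepoints. For real text, grapheme_split is the function to use, since this page documents it as following UAX #29.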

Note

Input is limited to 10 MB to prevent excessive memory allocation; larger inputs raise TranslitError.


grapheme_truncate

grapheme_truncate(text: str, max_graphemes: int) -> str

Truncate text to at most max_graphemes user-perceived characters.

Unlike byte-level or codepoint-level truncation, this never splits a grapheme cluster (which could corrupt emoji, combining sequences, or Hangul syllables).

Parameters:
  • text (str) – Input string.
  • max_graphemes (int) – Maximum number of grapheme clusters to keep.

Returns:
  • str – Truncated string containing at most max_graphemes grapheme clusters.

Examples:

>>> grapheme_truncate("Hello World", 5)
'Hello'
>>> grapheme_truncate("café", 3)
'caf'

from translit import grapheme_truncate

grapheme_truncate("Hello World", 5)  # => "Hello"
grapheme_truncate("café", 3)         # => "caf"
grapheme_truncate("👨‍👩‍👧‍👦🎉", 1)         # => "👨‍👩‍👧‍👦" (never splits a cluster)

Unlike byte-level or codepoint-level truncation, grapheme_truncate never splits a grapheme cluster, which would corrupt emoji, combining sequences, or Hangul syllables.
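
The failure modes mentioned above are easy to reproduce with plain Python slicing. The following sketch uses only the standard library and makes no translit calls; the variable names are illustrative.

```python
import unicodedata

# "café" in NFD form is five codepoints: c, a, f, e, U+0301.
nfd_cafe = unicodedata.normalize("NFD", "caf\u00E9")

# Codepoint slicing keeps the base letter but silently drops its accent.
print(nfd_cafe[:4])  # cafe

# Slicing one codepoint from the family emoji leaves only the first person.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(family[:1] == "\U0001F468")  # True

# Byte-level truncation can cut a UTF-8 sequence in half:
# "é" (U+00E9) is two bytes, so keeping 4 of the 5 bytes splits it.
utf8 = "caf\u00E9".encode("utf-8")
try:
    utf8[:4].decode("utf-8")
except UnicodeDecodeError:
    print("byte truncation produced invalid UTF-8")
```

In each case the result is either visibly different text or undecodable bytes, which is the situation truncating on grapheme cluster boundaries is meant to avoid.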