Grapheme Clusters
Functions for working with user-perceived characters (extended grapheme clusters) as defined by UAX #29. These give correct results for emoji, combining characters, and complex scripts where Python's len() overcounts.
grapheme_len
grapheme_len(text: str) -> int
Count the number of user-perceived characters (extended grapheme clusters).
This answers the question "how many characters does the user see?" A single grapheme cluster may span multiple codepoints (e.g., flag emoji, skin-toned emoji, Hangul syllables built from conjoining jamo, Zalgo text).
| Parameters: | `text` (`str`): The string to measure. |
|---|---|
| Returns: | `int`: The number of extended grapheme clusters in `text`. |
Examples:
>>> grapheme_len("café")
4
>>> grapheme_len("👨‍👩‍👧‍👦") # family emoji = 1 grapheme cluster
1
from translit import grapheme_len
grapheme_len("café") # => 4
grapheme_len("👨‍👩‍👧‍👦") # => 1 (family emoji = 1 cluster)
grapheme_len("🇫🇷") # => 1 (flag = 1 cluster, but len() = 2)
grapheme_len("é") # => 1 (even if NFD: e + combining acute)
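For contrast with the grapheme counts above, the following standard-library sketch (it does not call translit) shows what Python's built-in len() and the UTF-8 byte length report for the same kinds of strings. The strings are written as escape sequences so the codepoints are explicit; the dictionary names are illustrative only.

```python
import unicodedata

# Sample strings, spelled out codepoint by codepoint.
samples = {
    "cafe_nfc": "caf\u00e9",                                # precomposed e-acute
    "cafe_nfd": unicodedata.normalize("NFD", "caf\u00e9"),  # e + U+0301
    "flag_fr": "\U0001F1EB\U0001F1F7",                      # two regional indicators
    # man, ZWJ, woman, ZWJ, girl, ZWJ, boy
    "family": "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466",
}

# len() counts codepoints, not user-perceived characters.
codepoints = {name: len(s) for name, s in samples.items()}

# The encoded size is larger still.
utf8_bytes = {name: len(s.encode("utf-8")) for name, s in samples.items()}

for name in samples:
    print(f"{name}: {codepoints[name]} codepoints, {utf8_bytes[name]} UTF-8 bytes")
# cafe_nfc: 4 codepoints, 5 UTF-8 bytes
# cafe_nfd: 5 codepoints, 6 UTF-8 bytes
# flag_fr: 2 codepoints, 8 UTF-8 bytes
# family: 7 codepoints, 25 UTF-8 bytes
```

Each of the last three strings is displayed as fewer characters than len() reports, which is the gap grapheme_len is meant to close.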
grapheme_split
grapheme_split(text: str) -> list[str]
Split text into a list of extended grapheme clusters.
Each element is a user-perceived character.
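| Parameters: | `text` (`str`): The string to split. |
|---|---|
| Returns: | `list[str]`: The extended grapheme clusters of `text`, in order. |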
Examples:
>>> grapheme_split("café")
['c', 'a', 'f', 'é']
>>> len(grapheme_split("👨‍👩‍👧‍👦!")) # family emoji + "!"
2
from translit import grapheme_split
grapheme_split("café") # => ['c', 'a', 'f', 'é']
grapheme_split("👨‍👩‍👧‍👦!") # => ['👨‍👩‍👧‍👦', '!']
Note
Input is limited to 10 MB to prevent excessive memory allocation. Raises TranslitError for larger inputs.
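As a rough illustration of what splitting into clusters involves, here is a deliberately simplified approximation that uses only the standard library. It is not translit's implementation and not a full UAX #29 segmenter. It attaches combining marks, emoji skin-tone modifiers, and anything following a zero-width joiner to the previous cluster, and pairs regional indicators into flags. It ignores Hangul jamo, CR LF, and the remaining UAX #29 rules. All function names below are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner, glues emoji sequences together


def is_regional_indicator(ch: str) -> bool:
    # Regional indicator symbols; a pair of them forms a flag.
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def is_extender(ch: str) -> bool:
    # Combining marks (including variation selectors), ZWJ,
    # and emoji skin-tone modifiers extend the previous cluster.
    return (
        unicodedata.category(ch) in ("Mn", "Mc", "Me")
        or ch == ZWJ
        or "\U0001F3FB" <= ch <= "\U0001F3FF"
    )


def approx_grapheme_split(text: str) -> list[str]:
    """Simplified approximation of grapheme splitting (NOT full UAX #29)."""
    clusters: list[str] = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        # A second regional indicator completes a flag started by the first.
        completes_flag = (
            is_regional_indicator(ch)
            and len(prev) == 1
            and is_regional_indicator(prev)
        )
        if prev and (is_extender(ch) or prev.endswith(ZWJ) or completes_flag):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(approx_grapheme_split(family + "!")))  # 2
print(approx_grapheme_split("cafe\u0301"))       # ['c', 'a', 'f', 'é'] (NFD é)
```

Even this reduced rule set shows why the split cannot be done with a fixed number of codepoints per character; a production segmenter follows the full rule table from the specification.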
grapheme_truncate
grapheme_truncate(text: str, max_graphemes: int) -> str
Truncate text to at most max_graphemes user-perceived characters.
Unlike byte-level or codepoint-level truncation, this never splits a grapheme cluster (which could corrupt emoji, combining sequences, or Hangul syllables).
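| Parameters: | `text` (`str`): The string to truncate. `max_graphemes` (`int`): The maximum number of user-perceived characters to keep. |
|---|---|
| Returns: | `str`: `text` cut to at most `max_graphemes` grapheme clusters. |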
Examples:
>>> grapheme_truncate("Hello World", 5)
'Hello'
>>> grapheme_truncate("café", 3)
'caf'
from translit import grapheme_truncate
grapheme_truncate("Hello World", 5) # => "Hello"
grapheme_truncate("café", 3) # => "caf"
grapheme_truncate("👨‍👩‍👧‍👦😀", 1) # => "👨‍👩‍👧‍👦" (never splits a cluster)
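To make the failure modes of naive truncation concrete, the sketch below uses only the standard library (it does not call translit) to show what codepoint slicing and byte slicing do to the kinds of strings discussed here. The variable names are illustrative.

```python
import unicodedata

# Codepoint-level truncation of a decomposed string drops the accent.
nfd_cafe = unicodedata.normalize("NFD", "caf\u00e9")  # c, a, f, e, U+0301
codepoint_cut = nfd_cafe[:4]
print(codepoint_cut)  # cafe  (the combining acute was cut off)

# Codepoint-level truncation of a ZWJ sequence leaves a dangling joiner.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
emoji_cut = family[:2]  # man + ZWJ, no longer the family emoji
print(emoji_cut == "\U0001F468\u200D")  # True

# Byte-level truncation can stop in the middle of a codepoint,
# which leaves bytes that are not valid UTF-8.
raw_cut = family.encode("utf-8")[:6]  # 4 bytes of "man" + 2 of the 3 ZWJ bytes
try:
    raw_cut.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)  # False
```

Truncating on grapheme cluster boundaries avoids all three outcomes, because a cut never falls inside a combining sequence, a ZWJ sequence, or a multi-byte codepoint.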