Grapheme Clusters¶
Unicode text is more complex than it appears. A single user-perceived "character" can be composed of multiple Unicode codepoints — combining accents, emoji modifiers, ZWJ sequences, regional indicator pairs, and Hangul jamo all create situations where Python's len() gives a misleading count.
translit provides three functions for working with extended grapheme clusters as defined by UAX #29, giving correct results where len() overcounts.
The Problem¶
text = "café" # 4 characters, right?
len(text) # => 4 ✓ (precomposed é = 1 codepoint)
# But with decomposed é (e + combining acute accent):
import unicodedata
text_nfd = unicodedata.normalize("NFD", "café")
len(text_nfd) # => 5 ✗ (e + ◌́ counted separately)
# Emoji are worse:
len("👨👩👧👦") # => 7 (4 person codepoints + 3 ZWJ joiners)
len("🇬🇧") # => 2 (two regional indicator symbols)
len("👋🏽") # => 2 (wave + skin tone modifier)
Python's len() counts codepoints, not user-perceived characters. For correct character counting, splitting, and truncation, you need grapheme cluster segmentation.
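To make the idea of segmentation concrete, here is a minimal sketch that handles only the simplest case, combining marks. It is a deliberately incomplete illustration built on the standard library, not the UAX #29 algorithm and not how translit works internally; `naive_clusters` is a hypothetical helper name.

```python
import unicodedata


def naive_clusters(text: str) -> list[str]:
    """Attach each combining mark to the character before it.

    Illustration only: ZWJ sequences, regional indicator pairs,
    emoji modifiers, and Hangul jamo are NOT handled here.
    """
    clusters: list[str] = []
    for ch in text:
        # Mn = nonspacing mark, Me = enclosing mark
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


decomposed = "cafe\u0301"                  # e + combining acute accent
print(len(decomposed))                     # 5 codepoints
print(len(naive_clusters(decomposed)))     # 4 clusters

flag = "\U0001F1EC\U0001F1E7"              # two regional indicators
print(len(naive_clusters(flag)))           # 2: this sketch cannot pair them
```

The flag result shows why a hand-rolled approach falls short: each additional cluster type needs its own rule, which is what a full UAX #29 implementation provides.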
Functions¶
grapheme_len¶
Count the number of user-perceived characters:
from translit import grapheme_len
grapheme_len("café") # => 4
grapheme_len("cafe\u0301") # => 4 (NFD: e + combining accent = 1 grapheme)
# Emoji
grapheme_len("👨👩👧👦") # => 1 (family ZWJ sequence)
grapheme_len("🇬🇧") # => 1 (flag = 2 regional indicators = 1 grapheme)
grapheme_len("👋🏽") # => 1 (hand + skin tone modifier)
grapheme_len("🏳️🌈") # => 1 (rainbow flag)
# Complex scripts
grapheme_len("\u1100\u1161\u11A8") # => 1 (Hangul jamo sequence = 1 syllable)
grapheme_len("नमस्ते") # => 4 (Devanagari with conjuncts)
grapheme_split¶
Split text into individual grapheme clusters:
from translit import grapheme_split
grapheme_split("café") # => ['c', 'a', 'f', 'é']
grapheme_split("cafe\u0301") # => ['c', 'a', 'f', 'é'] (combining accent stays with e)
grapheme_split("👨👩👧👦!") # => ['👨👩👧👦', '!']
grapheme_split("🇫🇷🇬🇧") # => ['🇫🇷', '🇬🇧'] (two flags, not four indicators)
grapheme_split("Hi 👋🏽") # => ['H', 'i', ' ', '👋🏽']
Note
Input is limited to 10 MB to prevent excessive memory allocation. Larger inputs raise TranslitError.
grapheme_truncate¶
Truncate text to a maximum number of grapheme clusters without splitting any cluster:
from translit import grapheme_truncate
grapheme_truncate("Hello World", 5) # => "Hello"
grapheme_truncate("café", 3) # => "caf"
grapheme_truncate("cafe\u0301s", 4) # => "café" (combining accent stays with the e)
# Emoji are never split
grapheme_truncate("👨👩👧👦🎉", 1) # => "👨👩👧👦" (family emoji = 1 grapheme)
grapheme_truncate("Hi 👩👩👧👦!", 4) # => "Hi 👩👩👧👦" (family counts as 1)
grapheme_truncate("🇬🇧🇫🇷🇩🇪", 2) # => "🇬🇧🇫🇷" (two flags)
Unlike codepoint-level slicing (text[:n]) or byte-level slicing of encoded text, grapheme_truncate never cuts through a cluster: no broken emoji, no orphaned combining marks, no split Hangul syllables.
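For contrast, the following standard-library sketch shows what naive slicing does to the same kinds of input. It does not call translit; `decodes_cleanly` is a hypothetical helper used only for this illustration.

```python
import unicodedata

# Codepoint slicing: the cut lands between "e" and its combining accent.
text = "cafe\u0301s"                 # "cafés" with a decomposed é
sliced = text[:4]
print(sliced == "cafe")              # True: the accent is silently dropped

# Codepoint slicing inside a ZWJ sequence leaves a dangling joiner.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
cut = family[:2]
print([unicodedata.name(c) for c in cut])   # ['MAN', 'ZERO WIDTH JOINER']


def decodes_cleanly(data: bytes) -> bool:
    """Return True if data is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Byte slicing can cut through a single codepoint.
encoded = "\u00e9".encode("utf-8")   # precomposed é is 2 bytes in UTF-8
print(decodes_cleanly(encoded[:1]))  # False: half of a codepoint
```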
Text Builder¶
All grapheme functions are also available on the Text builder:
from translit import Text
t = Text("Hello 👨👩👧👦!")
# Predicates (non-chaining)
t.grapheme_len() # => 8
t.grapheme_split() # => ['H', 'e', 'l', 'l', 'o', ' ', '👨👩👧👦', '!']
# Transform (chaining)
t.grapheme_truncate(7).value # => "Hello 👨👩👧👦"
When to Use Grapheme Functions¶
Use grapheme_len instead of len() when:¶
- Enforcing character limits — user-facing limits like "280 characters" should count what users see, not codepoints
- Validating input length — username or field length validation
- Character-level ML tokenization — splitting text into "characters" for character-level models
- Display width estimation — though note that display width also depends on font metrics, not just grapheme count
Use grapheme_truncate instead of slicing when:¶
- Truncating user-visible text — preview snippets, title shortening
- Database field length enforcement — preventing corruption of combining sequences at boundaries
- API response truncation — ensuring valid Unicode output
- Slug length limits — though slugify(max_length=) already handles this for ASCII output
Use grapheme_split instead of list() when:¶
- Character-level tokenization — NLP pipelines that need individual characters
- Character frequency analysis — counting character distributions
- Grapheme-aware iteration — processing text one user-perceived character at a time
Codepoints vs Graphemes vs Bytes¶
A comparison showing how different counting methods diverge:
| Text | len(b) bytes | len(s) codepoints | grapheme_len(s) |
|---|---|---|---|
| "hello" | 5 | 5 | 5 |
| "café" (NFC) | 5 | 4 | 4 |
| "café" (NFD) | 6 | 5 | 4 |
| "👨👩👧👦" | 25 | 7 | 1 |
| "🇬🇧" | 8 | 2 | 1 |
| "👋🏽" | 8 | 2 | 1 |
| "नमस्ते" | 18 | 6 | 4 |
| "한" (precomposed) | 3 | 1 | 1 |
| "한" (jamo) | 9 | 3 | 1 |
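The byte and codepoint columns can be reproduced with the standard library alone. This sketch assumes the byte column refers to UTF-8, which is consistent with the figures above; `measure` and the sample labels are names chosen for this illustration. Every string is written with explicit escapes so that no joiner or combining mark is lost when copying.

```python
import unicodedata


def measure(s: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, codepoint count) for s."""
    return len(s.encode("utf-8")), len(s)


samples = {
    "hello": "hello",
    "cafe (NFC)": "caf\u00e9",
    "cafe (NFD)": "cafe\u0301",
    "family ZWJ sequence": "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466",
    "GB flag": "\U0001F1EC\U0001F1E7",
    "wave + skin tone": "\U0001F44B\U0001F3FD",
    "Hangul (precomposed)": "\uD55C",
    # NFD decomposes the precomposed syllable into its three jamo
    "Hangul (jamo)": unicodedata.normalize("NFD", "\uD55C"),
}

for label, s in samples.items():
    n_bytes, n_codepoints = measure(s)
    print(f"{label:22} {n_bytes:>3} bytes {n_codepoints:>2} codepoints")
```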
Normalization Interaction¶
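NFC and NFD forms of the same text contain different codepoints, so the strings returned by grapheme_split and grapheme_truncate can differ between the two forms even when they render identically. For consistent results, normalize first: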
from translit import normalize, grapheme_len
text = "é" # might be NFC or NFD depending on source
normalized = normalize(text, form="NFC")
count = grapheme_len(normalized) # => 1 (regardless of original form)
In practice, grapheme_len gives the same count for NFC and NFD forms of the same text — the grapheme cluster algorithm handles both. But normalizing first ensures deterministic byte-level results from grapheme_split and grapheme_truncate.
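The byte-level difference is easy to see with the standard library. This sketch uses `unicodedata.normalize` as a stand-in for translit's `normalize`, on the assumption that the two agree for these forms.

```python
import unicodedata

nfc = unicodedata.normalize("NFC", "cafe\u0301")   # composes e + U+0301
nfd = unicodedata.normalize("NFD", "caf\u00e9")    # decomposes U+00E9

# Same rendered text, different codepoints and bytes.
print(nfc == nfd)                                  # False
print(len(nfc), len(nfd))                          # 4 5
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))   # 5 6

# Normalizing both to one form makes them compare equal.
print(unicodedata.normalize("NFC", nfd) == nfc)    # True
```

Without this step, two visually identical strings can produce different values from an equality check, a hash, or a unique index.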
Best Practices¶
Username validation¶
Sanitize input first, then enforce a grapheme-aware length limit:
from translit import sanitize_user_input, grapheme_len, grapheme_truncate
def validate_username(raw: str, max_graphemes: int = 30) -> str:
    clean = sanitize_user_input(raw)
    if grapheme_len(clean) > max_graphemes:
        clean = grapheme_truncate(clean, max_graphemes)
    return clean
Post/tweet fields¶
Use display_clean for lightweight sanitization and grapheme_truncate for the character limit:
from translit import display_clean, grapheme_truncate
def prepare_post(raw: str, max_graphemes: int = 280) -> str:
    clean = display_clean(raw)
    return grapheme_truncate(clean, max_graphemes)
Database column truncation¶
When storing text in a column with a character limit, truncate by grapheme clusters — never by bytes or codepoints, which can split emoji or combining sequences:
from translit import security_clean, grapheme_truncate
def safe_for_db(raw: str, max_graphemes: int = 255) -> str:
    clean = security_clean(raw)
    return grapheme_truncate(clean, max_graphemes)
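One caveat: many databases measure a column limit in codepoints or bytes, so a string of 255 graphemes may still exceed a 255-unit column. A minimal sketch of one way to handle this, assuming the text has already been split into clusters (for example by grapheme_split), is to keep whole clusters until a codepoint budget is reached. `fit_clusters` is a hypothetical helper, not part of translit.

```python
from typing import Iterable


def fit_clusters(clusters: Iterable[str], max_codepoints: int) -> str:
    """Join whole clusters without exceeding a codepoint budget."""
    kept: list[str] = []
    used = 0
    for cluster in clusters:
        if used + len(cluster) > max_codepoints:
            break                      # the next cluster would not fit whole
        kept.append(cluster)
        used += len(cluster)
    return "".join(kept)


# Hand-written clusters standing in for grapheme_split("cafe\u0301s")
clusters = ["c", "a", "f", "e\u0301", "s"]
print(fit_clusters(clusters, 4) == "caf")          # True: the 2-codepoint é does not fit
print(fit_clusters(clusters, 5) == "cafe\u0301")   # True
```

For a byte-limited column, the same loop works with `len(cluster.encode("utf-8"))` in place of `len(cluster)`.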
ML corpus preparation¶
Normalize text before truncating to a token-budget-friendly length:
from translit import ml_normalize, grapheme_truncate
def prepare_for_model(raw: str, max_graphemes: int = 4096) -> str:
    clean = ml_normalize(raw)
    return grapheme_truncate(clean, max_graphemes)
Limitations¶
- Display width is not grapheme count. East Asian characters (CJK) are typically double-width in monospace fonts, but grapheme_len counts them as 1. For terminal column-width calculation, you need a separate width estimation library.
- Newer emoji sequences. The unicode-segmentation crate's tables must be updated to correctly segment newly standardized ZWJ emoji sequences. Between updates, a brand-new emoji may be split across multiple clusters.
- Rendering varies. "User-perceived character" is ultimately a rendering question. Not all systems agree on cluster boundaries, particularly for complex emoji. See Limitations for details.
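To illustrate the first limitation, here is a rough column-width estimate built on the standard library's `unicodedata.east_asian_width`. It is a sketch under simplifying assumptions (wide and fullwidth characters take two columns, combining marks take none, everything else takes one) and does not handle emoji sequences; `estimate_columns` is a hypothetical helper, and a dedicated width library is the better choice for real terminals.

```python
import unicodedata


def estimate_columns(text: str) -> int:
    """Rough monospace column count for text."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                   # combining marks add no column
        # "W" (wide) and "F" (fullwidth) occupy two columns
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


print(estimate_columns("abc"))                  # 3
print(estimate_columns("\u65e5\u672c\u8a9e"))   # 6: three CJK characters
print(estimate_columns("cafe\u0301"))           # 4
```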
Performance¶
Grapheme operations use the Rust unicode-segmentation crate, which implements UAX #29 with precomputed lookup tables. Performance is in the sub-microsecond range for typical inputs:
| Function | Input | Time |
|---|---|---|
| grapheme_len | ASCII string | ~100 ns |
| grapheme_len | Emoji string | ~260 ns |
| grapheme_split | ASCII string | ~285 ns |
| grapheme_split | Emoji string | ~516 ns |
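Timings like these depend on hardware and build, so it can be worth measuring on your own machine. The following is a minimal sketch using the standard library's `timeit`; `ns_per_call` is a hypothetical helper, and the built-in `len` stands in for the function under test so the snippet runs without translit installed.

```python
import timeit
from typing import Callable


def ns_per_call(fn: Callable[[str], object], arg: str, number: int = 100_000) -> float:
    """Best-of-5 average time per call, in nanoseconds.

    The lambda wrapper adds a small fixed overhead to every call,
    so treat the result as an upper bound.
    """
    best = min(timeit.repeat(lambda: fn(arg), number=number, repeat=5))
    return best / number * 1e9


# Stand-in measurement; pass grapheme_len or grapheme_split instead of len.
print(f"{ns_per_call(len, 'Hello World'):.0f} ns per call")
```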