[Python-ideas] unicodedata.itergraphemes (or str.itergraphemes / str.graphemes)

Mon Jul 8 20:52:39 CEST 2013

On Sun, Jul 7, 2013 at 3:29 AM, David Kendal <me at dpk.io> wrote:

> Python provides a way to iterate characters of a string by using the
> string as an iterable. But there's no way to iterate over Unicode graphemes
> (a cluster of characters consisting of a base character plus a number of
> combining marks and other modifiers -- or what the human eye would consider
> to be one "character").
>
> I think this ought to be provided either in the unicodedata library,
> (unicodedata.itergraphemes(string)) which exposes the character database
> information needed to make this work, or as a method on the built-in str
> type. (str.itergraphemes() or str.graphemes())

A common case is wanting to extract the current grapheme or move forward or
backward one. Please consider these other use cases rather than just adding
an iterator.

# Extract the cluster that includes index i (i may be in the middle of
# the cluster).
g = unicodedata.grapheme_cluster(str, i)

# If i is the start of the cluster, return i; otherwise back up to the
# start of the cluster.
i = unicodedata.grapheme_start(str, i)

# Move i to the first index of the previous cluster; return None if
# there is no previous cluster in the string.
i = unicodedata.previous_cluster(str, i)

# Move i to the first index of the next cluster; return None if there
# is no next cluster in the string.
i = unicodedata.next_cluster(str, i)

I think these belong in unicodedata, not str.
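
To make the intended semantics concrete, here is a rough pure-Python sketch
of the four functions, plus the iterator from the original proposal built on
top of them. None of these functions exist in unicodedata; the names are just
the ones proposed above. The boundary rule is a deliberate simplification (a
cluster is one character followed by any characters in the mark categories
Mn, Mc, Me) and is not the full segmentation algorithm from UAX #29, so it
will get things like Hangul jamo sequences and CR LF wrong.

```python
import unicodedata


def _is_extender(ch):
    # ASSUMPTION: simplified rule, not UAX #29. Any mark (Mn, Mc, Me)
    # is treated as extending the character before it.
    return unicodedata.category(ch).startswith("M")


def grapheme_start(s, i):
    """Return i if it starts a cluster, else back up to the cluster start."""
    while i > 0 and _is_extender(s[i]):
        i -= 1
    return i


def next_cluster(s, i):
    """Return the first index of the next cluster, or None if there is none."""
    j = i + 1
    while j < len(s) and _is_extender(s[j]):
        j += 1
    return j if j < len(s) else None


def previous_cluster(s, i):
    """Return the first index of the previous cluster, or None."""
    start = grapheme_start(s, i)
    return grapheme_start(s, start - 1) if start > 0 else None


def grapheme_cluster(s, i):
    """Return the cluster that includes index i."""
    start = grapheme_start(s, i)
    end = next_cluster(s, start)  # None slices through to the end
    return s[start:end]


def itergraphemes(s):
    """Yield each cluster of s in order, using the helpers above."""
    i = 0 if s else None
    while i is not None:
        yield grapheme_cluster(s, i)
        i = next_cluster(s, i)
```

With this in place, the iterator falls out of the index-based functions for
free, which is part of why I'd rather have the lower-level operations than
only an iterator.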

--- Bruce