[Python-ideas] unicodedata.itergraphemes (or str.itergraphemes / str.graphemes)

Stephen J. Turnbull stephen at xemacs.org
Tue Jul 9 07:30:55 CEST 2013


Bruce Leban writes:

 > On Sun, Jul 7, 2013 at 3:29 AM, David Kendal <me at dpk.io> wrote:
 >> But there's no way to iterate over Unicode graphemes

 > A common case is wanting to extract the current grapheme or move
 > forward or backward one.  Please consider these other use cases
 > rather than just adding an iterator.

 >    g = unicodedata.grapheme_cluster(str, i)
 >    # extracts cluster that includes index i (i may be in the middle
 >    # of the cluster)

Why is indexing a string and returning a grapheme a common case?  I
would think the common case would be indexing or iterating over a
grapheme sequence.  At least, if we provided such a type, it would
be.[1]
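
A rough sketch of what such a type might look like follows: it splits a
str into approximate clusters once, then indexes and iterates by
cluster. The names iter_clusters and GraphemeSeq are hypothetical, and
the segmentation rule (a character plus any following characters with a
nonzero unicodedata.combining() class) is a deliberate simplification,
not the full UAX #29 algorithm.

```python
import unicodedata


def iter_clusters(s):
    """Yield approximate grapheme clusters of s (hypothetical helper).

    Simplification: a cluster is one character plus any following
    characters whose canonical combining class is nonzero.  Real
    grapheme segmentation (UAX #29) has more rules than this.
    """
    cluster = ""
    for ch in s:
        # A non-combining character starts a new cluster.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


class GraphemeSeq:
    """Hypothetical sequence type indexed by cluster, not code point."""

    def __init__(self, s):
        # Segment once up front so indexing needs no rescanning.
        self._clusters = list(iter_clusters(s))

    def __len__(self):
        return len(self._clusters)

    def __getitem__(self, i):
        return self._clusters[i]

    def __iter__(self):
        return iter(self._clusters)


text = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
seq = GraphemeSeq(text)
# len(text) counts 3 code points; len(seq) counts 2 clusters.
```

With a type like this, an application would index seq rather than text,
and would never land in the middle of a cluster.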

Also, for 20 years I've worked with Emacs/Mule, which has a multibyte
internal representation of characters, and therefore does a lot of byte
index <-> character index conversion in the internals.  I would like to
avoid imposing that confusion on application programmers, unless they
really need it for some reason.
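
To give a flavor of the kind of conversion meant here, the sketch below
maps between byte offsets and character indexes. UTF-8 is assumed purely
for illustration (it is not a claim about the Emacs/Mule internal
encoding), and both helper names are hypothetical.

```python
def char_to_byte_index(s, char_index):
    """Map a code point index to the matching UTF-8 byte offset."""
    return len(s[:char_index].encode("utf-8"))


def byte_to_char_index(s, byte_index):
    """Map a UTF-8 byte offset to a code point index.

    Assumes byte_index falls on a character boundary; otherwise the
    decode raises UnicodeDecodeError.
    """
    return len(s.encode("utf-8")[:byte_index].decode("utf-8"))


s = "na\u00efve"  # the 'i' with diaeresis takes two bytes in UTF-8
# Character index 3 ('v') sits at byte offset 4, not 3.
```

Byte offset 3 falls inside the two-byte character, so
byte_to_char_index(s, 3) raises instead of returning an index. A
str-index-based grapheme API would expose application code to the same
class of "index into the middle of a unit" problem.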


Footnotes: 
[1]  Well, of course a lot of applications would continue to work with
strs, just as today some applications work directly with bytes even
though the content is readable text that could sensibly be translated
to str.  What I mean is that I expect that indexing a str to get a
grapheme would be rare in applications if grapheme iterators and
arrays were available.



