[Python-ideas] unicodedata.itergraphemes (or str.itergraphemes / str.graphemes)
Bruce Leban
bruce at leapyear.org
Mon Jul 8 22:26:50 CEST 2013
On Mon, Jul 8, 2013 at 1:02 PM, David Mertz <mertz at gnosis.cx> wrote:
> I think the API Bruce suggests, along with its module location in
> 'unicodedata' makes more sense than the iterator only.
>
> But it seems to me that it would still be useful to explicitly break a
> string into its component clusters with a similar function. E.g.:
>
>     # Returns an iterator of strings, often single characters
>     graphemes = unicodedata.grapheme_clusters(str)
>     for g in graphemes: ...
>
> It wouldn't be very hard to implement 'grapheme_clusters' in terms of the
> API Bruce suggests, but I feel like it should have a standard name and API
> along with those others. Actually, I guess the implementation is just:
>
>     def grapheme_clusters(s):
>         for i in range(len(s)):
>             if i == unicodedata.grapheme_start(s, i):
>                 yield unicodedata.grapheme_cluster(s, i)
>
Yes, I still think the iterator is useful. I'd use the following
implementation instead, since the one above looks up the start of each
multi-character grapheme multiple times (once for each code point in it).

    def grapheme_clusters(s):
        if len(s):
            i = 0
            while i is not None:
                yield unicodedata.grapheme_cluster(s, i)
                i = unicodedata.grapheme_next(s, i)

This does "if len(s)" at the top rather than just "if s", so it raises
if passed something without a length, like None, rather than silently
accepting it.
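
For anyone who wants to try the iterator today, here is a rough, runnable sketch. The grapheme_start / grapheme_next / grapheme_cluster functions are the API proposed in this thread, so they are defined locally here as stand-ins rather than called on unicodedata. The cluster rule is also an assumption made for brevity: a code point extends the previous cluster only if unicodedata.combining() is nonzero. Real grapheme segmentation (UAX #29) has many more rules, so treat this as an illustration of the loop, not of correct segmentation.

```python
import unicodedata


def _extends(ch):
    # Simplifying assumption: only combining marks extend a cluster.
    return unicodedata.combining(ch) != 0


def grapheme_start(s, i):
    """Stand-in: index where the cluster containing s[i] begins."""
    while i > 0 and _extends(s[i]):
        i -= 1
    return i


def grapheme_next(s, i):
    """Stand-in: index of the following cluster, or None at the end."""
    j = i + 1
    while j < len(s) and _extends(s[j]):
        j += 1
    return j if j < len(s) else None


def grapheme_cluster(s, i):
    """Stand-in: the cluster containing s[i]."""
    # grapheme_next returns None for the last cluster, which slices to the end.
    return s[grapheme_start(s, i):grapheme_next(s, i)]


def grapheme_clusters(s):
    # len(s) rather than truthiness, so an object without a length raises.
    if len(s):
        i = 0
        while i is not None:
            yield grapheme_cluster(s, i)
            i = grapheme_next(s, i)


if __name__ == "__main__":
    # "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a"
    for g in grapheme_clusters("e\u0301a"):
        print(repr(g))
```

One thing to keep in mind: because grapheme_clusters is a generator, the len(s) check runs when iteration starts, not when the function is called. So grapheme_clusters(None) returns a generator without complaint, and the TypeError only shows up on the first next().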
--- Bruce