'Straße' ('Strasse') and Python 2

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Jan 15 19:43:21 EST 2014


On Wed, 15 Jan 2014 12:00:51 +0000, Robin Becker wrote:

> so two 'characters' are 3 (or 2 or more) codepoints.

Yes.
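
A small illustration of the point, assuming Python 3 (in Python 2 the same
thing holds for unicode objects). The sample strings are chosen only for
demonstration: an "e" with an acute accent can be stored either as one
code point or as two.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT:
# one visible character, two code points.
decomposed = "e\u0301"

# U+00E9 LATIN SMALL LETTER E WITH ACUTE:
# the same visible character as a single code point.
composed = "\u00e9"

print(len(decomposed))         # 2
print(len(composed))           # 1
print(decomposed == composed)  # False: different code point sequences

# Normalising to NFC composes the pair into the single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Normalisation helps with comparisons, but not every base-plus-combining
sequence has a precomposed form, so len() still cannot be relied on to
count visible characters.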


> If I want to isolate so called graphemes I need an algorithm even 
> for python's unicode

Correct. Graphemes are language dependent: in Dutch, "ij" is usually a 
single grapheme, while in English it would be counted as two. Likewise, 
in Czech, "ch" is a single grapheme. The Latin form of Serbo-Croatian has 
three two-letter graphemes, Dž, Lj and Nj (it used to have four, but Dj 
is now written as Đ).
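
A rough sketch of what "language dependent" means for code, assuming
Python 3. The function name split_graphemes and the per-language digraph
tables are hypothetical, made up for this example; this is not a standard
library API and not a full grapheme algorithm.

```python
def split_graphemes(text, digraphs=()):
    """Split text into units, treating each listed digraph as one unit.

    `digraphs` is a hypothetical per-language table supplied by the caller.
    """
    lowered = {d.lower() for d in digraphs}
    units = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair.lower() in lowered:
            # The next two code points form one grapheme in this language.
            units.append(pair)
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units

print(split_graphemes("ijs", {"ij"}))    # Dutch table: ['ij', 's']
print(split_graphemes("ijs"))            # no table:    ['i', 'j', 's']
print(split_graphemes("chata", {"ch"}))  # Czech table: ['ch', 'a', 't', 'a']
```

The same string splits differently depending on which table is passed in,
which is why the string type alone cannot answer the question.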

Worse, linguists sometimes disagree as to what counts as a grapheme. For 
instance, some authorities consider the English "sh" to be a separate 
grapheme. As a native English speaker, I'm not sure about that. Certainly 
it isn't a separate letter of the alphabet, but on the other hand I can't 
think of any words containing "sh" that should be considered as two 
graphemes "s" followed by "h". Wait, no, that's not true... words built 
from parts, such as "glasshouse" (glass + house) or "disheartened" 
(dis + heartened), are counterexamples.
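
To make the problem concrete, here is a deliberately naive sketch,
assuming Python 3. The helper mark_digraph is hypothetical; it brackets
every occurrence of "sh" as if it were always one grapheme.

```python
def mark_digraph(word, digraph="sh"):
    """Bracket every occurrence of `digraph`, ignoring word structure."""
    return word.replace(digraph, "[" + digraph + "]")

for word in ("ship", "glasshouse", "disheartened"):
    print(word, "->", mark_digraph(word))

# ship -> [sh]ip
# glasshouse -> glas[sh]ouse
# disheartened -> di[sh]eartened
```

The last two results join letters that belong to different parts of the
word. Getting them right needs knowledge of how the word is built, not
just pattern matching on code points.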


> ie when it really matters, python3 str is just another encoding.

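I'm not entirely sure how a programming language data type (str) can be 
considered an encoding. An encoding is a transformation between code 
points and bytes; a Python 3 str is just a sequence of code points.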
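
A short illustration of the distinction, assuming Python 3. The three
codecs are picked only as examples: the str stays the same, and the bytes
depend on which encoding is applied to it.

```python
text = "Stra\u00dfe"  # 'Straße', six code points

print(len(text))  # 6

# Encoding transforms the same code points into different byte strings.
sizes = {}
for codec in ("utf-8", "utf-16-le", "latin-1"):
    data = text.encode(codec)
    sizes[codec] = len(data)
    # Decoding with the same codec round-trips to the original str.
    assert data.decode(codec) == text

print(sizes)  # {'utf-8': 7, 'utf-16-le': 12, 'latin-1': 6}
```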



-- 
Steven


