'Straße' ('Strasse') and Python 2
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Wed Jan 15 19:43:21 EST 2014
On Wed, 15 Jan 2014 12:00:51 +0000, Robin Becker wrote:
> so two 'characters' are 3 (or 2 or more) codepoints.
Yes.
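To make that concrete, here is a small Python 3 sketch (my own illustration, with a sample string I picked): "ño" written with a combining tilde shows two characters on screen but is three code points, and len() counts code points.

```python
import unicodedata

# "n" + COMBINING TILDE + "o": displays as two characters, "ño"
decomposed = "n\u0303o"
print(len(decomposed))          # 3 code points

# NFC normalisation composes n + tilde into the single code point U+00F1
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed))            # 2 code points

# The two strings look the same but do not compare equal
print(composed == decomposed)   # False
```

Normalising helps in this case only because a precomposed "ñ" exists; many base-plus-mark combinations have no precomposed form at all.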
> If I want to isolate so called graphemes I need an algorithm even
> for python's unicode
Correct. Graphemes are language dependent, e.g. in Dutch "ij" is usually
a single grapheme, whereas in English it would be counted as two. Likewise,
in Czech, "ch" is a single grapheme. The Latin form of Serbo-Croatian has
three two-letter graphemes, Dž, Lj and Nj (it used to have four, but Dj is
now written as Đ).
Worse, linguists sometimes disagree as to what counts as a grapheme. For
instance, some authorities consider the English "sh" to be a separate
grapheme. As a native English speaker, I'm not sure about that. Certainly
it isn't a separate letter of the alphabet, but on the other hand I can't
think of any words containing "sh" that should be considered as two
graphemes "s" followed by "h". Wait, no, that's not true... compound
words such as "glasshouse" or "disheartened" are counterexamples.
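None of that language-specific judgement can be recovered from the code points alone, but the narrower job of keeping combining marks attached to their base character is easy enough to sketch. The function name naive_clusters is made up for this example. It relies only on unicodedata.combining, it is not the full Unicode text segmentation algorithm (UAX #29), and it knows nothing about digraphs such as "ij" or "ch".

```python
import unicodedata

def naive_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper for illustration only: a rough approximation,
    not a real grapheme segmenter.
    """
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Three code points, two clusters: "n" + tilde, then "o"
print(len(naive_clusters("n\u0303o")))   # 2

# The Dutch "ij" still comes out as two separate items
print(naive_clusters("ijs"))             # ['i', 'j', 's']
```

Treating "ij" as one unit would need a per-language table on top of this, which is exactly the part that linguists argue about.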
> ie when it really matters, python3 str is just another encoding.
I'm not entirely sure how a programming language data type (str) can be
considered a transformation.
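The distinction I have in mind, sketched in Python 3 with the word from the subject line: a str is a sequence of code points, and an encoding is what turns that sequence into bytes. The choice of codecs below is just for illustration.

```python
word = "Stra\u00dfe"   # 'Straße'

# The str itself: six code points, whatever the storage underneath
print(len(word))                   # 6

# Encodings map those code points to bytes, each in its own way
utf8 = word.encode("utf-8")        # ß takes two bytes here
latin1 = word.encode("latin-1")    # ß takes one byte here
print(len(utf8), len(latin1))      # 7 6

# Decoding with the matching codec gives back the same str
print(utf8.decode("utf-8") == latin1.decode("latin-1") == word)  # True
```

Different byte sequences, same str, so the str is what gets encoded rather than being an encoding itself.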
--
Steven