'Straße' ('Strasse') and Python 2

Ned Batchelder ned at nedbatchelder.com
Wed Jan 15 07:13:36 EST 2014


On 1/15/14 7:00 AM, Robin Becker wrote:
> On 12/01/2014 07:50, wxjmfauth at gmail.com wrote:
>> >>> sys.version
>> 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>> >>> s = 'Straße'
>> >>> assert len(s) == 6
>> >>> assert s[5] == 'e'
>> >>>
>>
>> jmf
>>
>
> On my utf8 based system
>
>
>> robin at everest ~:
>> $ cat ooo.py
>> if __name__=='__main__':
>>     import sys
>>     s='A̅B'
>>     print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
>> robin at everest ~:
>> $ python ooo.py
>> version_info=sys.version_info(major=3, minor=3, micro=3,
>> releaselevel='final', serial=0)
>> len(A̅B)=3
>> robin at everest ~:
>> $
>
>
> so two 'characters' are 3 (or 2 or more) codepoints. If I want to
> isolate so called graphemes I need an algorithm even for python's
> unicode ie when it really matters, python3 str is just another encoding.

You are right that a grapheme can be made up of more than one codepoint,
and that you'll need code to deal with the correspondence between them.
But let's not muddy these already confusing waters by referring to that
mapping as an encoding.
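
For what it's worth, here is a rough sketch of the kind of code I mean.
It assumes the overline in your example is U+0305 COMBINING OVERLINE, and
the helper name rough_graphemes is made up for illustration. It only
attaches combining marks to the codepoint before them, which is a
simplification: real grapheme segmentation follows the rules in Unicode
Standard Annex #29 and handles many more cases.

```python
import unicodedata


def rough_graphemes(text):
    """Group each codepoint with the combining marks that follow it.

    Hypothetical helper, not a full UAX #29 implementation.
    """
    clusters = []
    for ch in text:
        # combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Assumed spelling of the example: 'A', COMBINING OVERLINE, 'B'.
s = "A\u0305B"
print(len(s))                   # 3 codepoints
print(len(rough_graphemes(s)))  # 2 clusters
```

Note that nothing here touches bytes. Both the input and the output are
sequences of codepoints; the function only regroups them.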

In Unicode terms, an encoding is a mapping between codepoints and bytes.
Python 3's str is a sequence of codepoints.
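
A small Python 3 sketch of the distinction, using the string from the
subject line. The choice of UTF-8 and Latin-1 is just for illustration;
any codec that can represent the text would make the same point.

```python
s = "Stra\u00dfe"  # 'Straße', written with an escape for clarity

# A str is a sequence of codepoints: six of them here.
print(len(s))  # 6
print([hex(ord(c)) for c in s])

# An encoding maps those codepoints to bytes, and different
# encodings produce different byte sequences for the same str.
utf8 = s.encode("utf-8")
latin1 = s.encode("latin-1")
print(len(utf8))    # 7, because U+00DF takes two bytes in UTF-8
print(len(latin1))  # 6, one byte per codepoint in Latin-1

# Decoding with the matching codec recovers the same codepoints.
assert utf8.decode("utf-8") == s
assert latin1.decode("latin-1") == s
```

The length of the str never changes; only the byte counts depend on
which encoding you pick.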

-- 
Ned Batchelder, http://nedbatchelder.com



