'Straße' ('Strasse') and Python 2

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Jan 15 19:32:25 EST 2014


On Thu, 16 Jan 2014 02:14:38 +1100, Chris Angelico wrote:

> On Thu, Jan 16, 2014 at 1:55 AM,  <wxjmfauth at gmail.com> wrote:
>> Le mercredi 15 janvier 2014 13:13:36 UTC+1, Ned Batchelder a écrit :
>>
>>
>>> ... more than one codepoint makes up a grapheme ...
>>
>> No
> 
> Yes.
> http://www.unicode.org/faq/char_combmark.html
> 
>>> In Unicode terms, an encoding is a mapping between codepoints and
>>> bytes.
>>
>> No
> 
> Yes.
> http://www.unicode.org/reports/tr17/
> Specifically:
> "Character Encoding Form: a mapping from a set of nonnegative integers
> that are elements of a CCS to a set of sequences of particular code
> units of some specified width, such as 32-bit integers"

Technically, Unicode talks about mapping code points to code *units*, but 
since code units are defined in terms of bytes, I think it is fair to cut 
out one layer of indirection and talk about mapping code points to bytes. 
For instance, UTF-32 uses 4-byte code units, and every code point U+0000 
through U+10FFFF is mapped to a single code unit, which is always a 
four-byte quantity. UTF-8, on the other hand, uses single-byte code units 
and maps each code point to a variable number of them, so UTF-8 maps a 
code point to 1, 2, 3 or 4 bytes.
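
For anyone who wants to check the byte counts for themselves, here is a 
minimal sketch. It assumes Python 3, where string literals are Unicode by 
default. The helper name code_unit_counts and the four sample characters 
are my own choices for illustration, not anything taken from the Unicode 
documents quoted above.

```python
# Sample code points: one each that needs 1, 2, 3 and 4 bytes in UTF-8.
samples = ["A", "ß", "€", "\U0001F600"]

def code_unit_counts(ch):
    """Return (UTF-8 byte count, UTF-32 byte count) for one code point."""
    utf8 = ch.encode("utf-8")
    # Name the byte order explicitly so that no BOM is prepended.
    utf32 = ch.encode("utf-32-be")
    return len(utf8), len(utf32)

for ch in samples:
    n8, n32 = code_unit_counts(ch)
    print("U+%04X: UTF-8 = %d byte(s), UTF-32 = %d bytes" % (ord(ch), n8, n32))
```

The plain "utf-32" codec would prepend a byte order mark and report 8 
bytes per character, which is why the sketch uses "utf-32-be" instead.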


> Or are you saying that www.unicode.org is wrong about the definitions of
> Unicode terms?

No, I think he is saying that he doesn't know Unicode anywhere near as 
well as he thinks he does. The question is, will he cherish his 
ignorance, or learn from this thread?




-- 
Steven


