How to waste computer memory?

Steven D'Aprano steve at pearwood.info
Sun Mar 20 01:01:47 EDT 2016


On Sun, 20 Mar 2016 03:12 am, Marko Rauhamaa wrote:

> Steven D'Aprano <steve at pearwood.info>:
> 
>> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
>>> Yes, but UTF-16 produces 16-bit values that are outside Unicode.
>>
>> Show me.
>>
>> Before you answer, if your answer is "surrogate pairs", that is
>> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.
> 
> UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers.
> Thus, the output of UTF-16 is not Unicode.

I'm not sure what point you think you are making.

Unicode (the character set part of it) is a set of abstract 21-bit numbers,
or code points, representing (among other things) characters, and numbered
from U+0000 to U+10FFFF. Any UTF is, by definition, a transformation from
such abstract code points to sequences of machine words or bytes (and vice
versa). What's your point?
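
To make the distinction concrete, here is a minimal sketch in Python 3. The
helper name utf16_code_units and the sample string are my own choices for
illustration; only str.encode, bytes.decode and the struct module are
standard. It shows the same text as abstract code points and as the 16-bit
words UTF-16 produces for it:

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units that UTF-16 produces for text."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

s = "a\u20ac\U0001F600"            # U+0061, U+20AC, U+1F600
code_points = [ord(c) for c in s]   # abstract code points
units = utf16_code_units(s)         # concrete 16-bit words

print([hex(cp) for cp in code_points])
print([hex(u) for u in units])

# Packing the same words back into bytes and decoding them
# recovers the original code points.
round_trip = struct.pack(">%dH" % len(units), *units).decode("utf-16-be")
```

Three code points go in and four 16-bit words come out, because the astral
code point U+1F600 is encoded as a surrogate pair. Decoding the words gives
back exactly the original three code points.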

If your point is that the data you get from running UTF-16 on a sequence of
code points is "not Unicode, but 2-byte words", then I agree, but I'm not
sure why you think that's significant.

If you want to call those words "numbers", I cannot really object, but if
so, they aren't abstract numbers (like code points, which may have any
implementation you like), but have their actual base-2 structure specified
by the standard.
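
As a rough sketch of what "specified base-2 structure" means here, the
following hand-written function (to_utf16_units is a hypothetical name, not
a standard library function) lays out the bits of one code point as UTF-16
code units, and the loop then checks the result against Python's built-in
codec:

```python
def to_utf16_units(cp):
    """Encode one code point as a list of UTF-16 code units (sketch)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %r" % (cp,))
    if cp < 0x10000:
        return [cp]                    # BMP: one 16-bit word, value unchanged
    v = cp - 0x10000                   # 20 bits remain
    return [0xD800 | (v >> 10),        # high surrogate carries the top 10 bits
            0xDC00 | (v & 0x3FF)]      # low surrogate carries the bottom 10 bits

def units_to_bytes(units, byteorder):
    """Serialise 16-bit words with an explicit byte order ('big' or 'little')."""
    return b"".join(u.to_bytes(2, byteorder) for u in units)

# Compare against the built-in codecs for a few sample code points.
for cp in (0x41, 0x20AC, 0x10000, 0x1F600, 0x10FFFF):
    units = to_utf16_units(cp)
    assert units_to_bytes(units, "big") == chr(cp).encode("utf-16-be")
    assert units_to_bytes(units, "little") == chr(cp).encode("utf-16-le")
```

The code point itself says nothing about byte order or word size. The
encoded words do, which is why UTF-16 comes in big-endian and little-endian
flavours.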

If your point is that a UTF-16 encoded stream of bytes is not the same as an
abstract sequence of code points, then I can't disagree, but I don't
understand why you think that's important.


-- 
Steven



