How to waste computer memory?

Marko Rauhamaa marko at pacujo.net
Sun Mar 20 03:30:53 EDT 2016


Steven D'Aprano <steve at pearwood.info>:

> On Sun, 20 Mar 2016 03:12 am, Marko Rauhamaa wrote:
>> Steven D'Aprano <steve at pearwood.info>:
>>> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
>>>> Yes, but UTF-16 produces 16-bit values that are outside Unicode.
>>>
>>> Show me.
>>>
>>> Before you answer, if your answer is "surrogate pairs", that is
>>> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.
>> 
>> UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers.
>> Thus, the output of UTF-16 is not Unicode.
>
> [...]
>
> If your point is that the data you get from running UTF-16 on a
> sequence of code points is "not Unicode, but 2-byte words", then I
> agree, but I'm not sure why you think that's significant.

I say the surrogate characters are not Unicode. You say they are because
they are used to encode astral characters. I say that point is
irrelevant.

I'm saying the surrogate code points (U+D800 through U+DFFF) are not
Unicode because you are not allowed to store or communicate them: no
well-formed UTF-8, UTF-16 or UTF-32 stream may encode one. They are a
hole in the Unicode fabric.
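
For what it's worth, here is a small sketch of that hole as seen from
Python 3. It assumes a reasonably recent CPython (3.4 or later, where the
UTF-16 and UTF-32 codecs also reject surrogates); the helper name
can_encode is just something made up for the illustration:

```python
import struct

# An astral character such as U+1F600 is encoded by UTF-16 as two
# 16-bit code units, both in the surrogate range 0xD800..0xDFFF.
astral = "\U0001F600"
units = struct.unpack(">2H", astral.encode("utf-16-be"))

# A lone surrogate code point can be held in a Python str...
lone = "\ud83d"


def can_encode(text, encoding):
    """Return True if the strict codec accepts the text (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# ...but the strict codecs refuse to write it out.
print([hex(u) for u in units])
print(can_encode(astral, "utf-8"), can_encode(lone, "utf-8"))
```

So the 16-bit values exist on the wire, but the code points with the same
numbers cannot themselves be encoded.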

The Unicode designers could have, and probably should have, specified a
UTF-16 encoding for the surrogate code points as well. That would have
left the Unicode range uninterrupted. Well, the silver lining is that
Python gained a number of extra code points it was free to use for
special purposes, although to be faithful to Unicode, Python should
refuse to store them.
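
One such special purpose, as far as I can tell, is the "surrogateescape"
error handler (PEP 383), which smuggles undecodable bytes through a str
as lone low surrogates. A minimal sketch, with a made-up byte string as
the input:

```python
# Bytes that are not valid UTF-8 (0xFF can never appear in UTF-8).
raw = b"name\xff.txt"

# The offending byte 0xFF is mapped to the lone surrogate U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
roundtrip = text.encode("utf-8", errors="surrogateescape")

print(ascii(text))
print(roundtrip == raw)
```

The resulting str is exactly the kind of thing a strictly conforming
implementation would not hold, yet it lets arbitrary bytes survive a
round trip.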


Marko


