How to waste computer memory?

Sun Mar 20 03:30:53 EDT 2016

Steven D'Aprano <steve at pearwood.info>:

> On Sun, 20 Mar 2016 03:12 am, Marko Rauhamaa wrote:
>> Steven D'Aprano <steve at pearwood.info>:
>>> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
>>>> Yes, but UTF-16 produces 16-bit values that are outside Unicode.
>>>
>>> Show me.
>>>
>>> Before you answer, if your answer is "surrogate pairs", that is
>>> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.
>> 
>> UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers.
>> Thus, the output of UTF-16 is not Unicode.
>
> [...]
>
> If your point is that the data you get from running UTF-16 on a
> sequence of code points is "not Unicode, but 2-byte words", then I
> agree, but I'm not sure why you think that's significant.

I say the surrogate code points (U+D800 to U+DFFF) are not Unicode. You
say they are because UTF-16 uses them to encode astral characters. I say
that point is irrelevant.
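
For reference, here is a small sketch of what "used to encode astral
characters" means in practice. The helper name utf16_code_units is made
up for this illustration; it only splits the big-endian UTF-16 output of
Python's standard codec into 16-bit numbers.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units UTF-16 produces for text.

    Hypothetical helper for illustration only; big-endian is chosen so
    that no byte order mark is emitted.
    """
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16
# represents it as two code units, both from the surrogate range.
units = utf16_code_units("\U0001F600")
print([hex(u) for u in units])   # ['0xd83d', '0xde00']
```

Both numbers fall in the range 0xD800 to 0xDFFF, which is exactly the
range under dispute.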

I'm saying the surrogate code points are not Unicode because you are not
allowed to store or communicate them on their own: no well-formed UTF-8,
UTF-16 or UTF-32 sequence can contain a lone surrogate. They are a hole
in the Unicode fabric.
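
A quick way to see the hole from Python, assuming CPython 3.4 or later
(earlier 3.x releases still let the UTF-16 and UTF-32 encoders pass
surrogates through). The function name can_encode is invented for this
sketch.

```python
def can_encode(text, encoding):
    """Return True if text encodes cleanly with the strict error handler."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

lone = "\ud800"   # a single surrogate code point, not part of a pair

# Assumption: Python 3.4+, where all three encoders reject surrogates.
results = {enc: can_encode(lone, enc)
           for enc in ("utf-8", "utf-16", "utf-32")}
print(results)
```

On such an interpreter every entry comes back False: the str object
holds the code point, but none of the standard encoding forms will
carry it.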

The Unicode designers could have, and probably should have, specified a
UTF-16 encoding for the surrogate code points as well. That would have
left the Unicode range uninterrupted. Well, the silver lining is that
Python gained a number of extra code points it was free to use for
special purposes, although to be faithful to Unicode, Python should
refuse to store them.
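
One such special purpose is the "surrogateescape" error handler
(PEP 383), which smuggles undecodable bytes through a str as lone
surrogates. A minimal sketch; the sample bytes are made up:

```python
# b"\xff" can never appear in valid UTF-8, so strict decoding would fail.
raw = b"caf\xff"

# surrogateescape maps each undecodable byte 0xNN to U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))   # 'caf\udcff'

# The same handler restores the original bytes when encoding.
roundtrip = text.encode("utf-8", errors="surrogateescape")
print(roundtrip == raw)   # True
```

So Python does store these code points in str objects, and relies on
that to round-trip arbitrary bytes such as file names.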

Marko