How to waste computer memory?

Steven D'Aprano steve at pearwood.info
Sat Mar 19 11:47:37 EDT 2016


On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:

> Steven D'Aprano <steve at pearwood.info>:
> 
>> On Sat, 19 Mar 2016 08:31 pm, Marko Rauhamaa wrote:
>>
>>
>>>    Using the surrogate mechanism, UTF-16 can support all 1,114,112
>>>    potential Unicode characters.
>>> 
>>> But Unicode doesn't contain 1,114,112 characters—the surrogates are
>>> excluded from Unicode, and definitely cannot be encoded using
>>> UTF-anything.
>>
>> Surrogates are most certainly part of the Unicode standard, and they are
>> necessary in UTF-16.
> 
> Yes, but UTF-16 produces 16-bit values that are outside Unicode. 

Show me.

Before you answer, if your answer is "surrogate pairs", that is incorrect.
Surrogate pairs are how UTF-16 encodes astral characters.

For example, the UTF-16 *code unit sequence* 0xD800 0xDC00 does not
represent the two code points U+D800, U+DC00. It represents the *single*
code point U+10000 "LINEAR B SYLLABLE B008 A". The code points U+D800 and
U+DC00 are reserved for use by UTF-16 as surrogates.
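Here's the same thing at the Python level (Python 3 shown; the codec names
are Python's, not Unicode's):

```python
# One astral code point, U+10000 LINEAR B SYLLABLE B008 A:
s = "\U00010000"

# UTF-16 (big-endian, no BOM) encodes it as the surrogate pair
# 0xD800 0xDC00 -- two 16-bit code units, four bytes in all:
assert s.encode("utf-16-be") == b"\xd8\x00\xdc\x00"

# Decoding the pair gives back the single code point, not two:
assert b"\xd8\x00\xdc\x00".decode("utf-16-be") == "\U00010000"
```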

This means that UTF-16 cannot encode lone surrogates. It cannot encode,
say, the code point U+D800 on its own, because that looks like half of an
SMP code point, which is an error. And it cannot encode U+D800 immediately
followed by U+DC00, because that pair would be interpreted as U+10000. So
there is a range of code points which cannot be represented in UTF-16.
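You can watch Python's utf-16 codec enforce exactly that:

```python
# A lone surrogate cannot be encoded; the utf-16 codec rejects
# it rather than emit half a surrogate pair:
try:
    "\ud800".encode("utf-16")
except UnicodeEncodeError:
    print("lone surrogate rejected")
```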

Where UTF-16 goes, UTF-8 and UTF-32 must follow. It is a requirement of
Unicode that you must be able to freely and losslessly convert between the
three UTFs. (I'm not sure if that also applies to UTF-7.) Since UTF-16
*cannot* represent this specific range of code points, UTF-8 and UTF-32
must be *forbidden* from representing them too.

Note that the UTF-8 and UTF-32 formats are perfectly capable of
representing lone surrogates. UTF-32, for example, would simply pad the
code point with zeroes: U+D800 would be represented as the four bytes
0x0000D800. UTF-8 has a well-defined 3-byte sequence that corresponds to
it. But those sequences are invalid, since they violate the requirement
that the UTFs be freely and losslessly translatable into one another.
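Python will show you those raw byte patterns if you ask for them via the
'surrogatepass' error handler (a Python convenience, not something the
standard sanctions):

```python
# The 3-byte UTF-8 pattern and the zero-padded UTF-32 pattern
# for U+D800 exist as bit patterns...
assert "\ud800".encode("utf-8", "surrogatepass") == b"\xed\xa0\x80"
assert "\ud800".encode("utf-32-be", "surrogatepass") == b"\x00\x00\xd8\x00"

# ...but with strict handling (the default) they are invalid:
try:
    b"\xed\xa0\x80".decode("utf-8")
except UnicodeDecodeError:
    print("rejected, as the standard requires")
```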

Invalid Unicode strings have their uses, but they are not valid :-)



> UTF-16 can encode *any* valid Unicode, but it cannot encode surrogate
> characters.

Correct. 

But encoding of surrogates is not required in Unicode. Strictly speaking, it
is forbidden. Did you read the link from the Unicode consortium that I
provided?


>>> We still don't know if the final result will be UCS-4 everywhere (with
>>> all 2**32 code points allowed?!) or UTF-8 everywhere.
>>
>> Unicode does not have 2**32 code points. It is guaranteed to never
>> exceed the 2**21 code points already allocated. (Many of those are
>> still unused.)
> 
> Never say never.

The Unicode standard has published this guarantee. It is not going to
change. If somebody wants more than 2**21 code points, they can start their
own new, competing, standard.



>> In the future, we'll have so much memory that the idea of using
>> variable width in-memory formats will seem absurd.
> 
> I'm starting to think that future is already here.

I'm not *quite* ruling out the possibility that UTF-8 as internal
representation for in-memory strings is a good idea, but I think that for
non-embedded systems, it is very probably a waste of time.




-- 
Steven



