How to waste computer memory?

Sat Mar 19 11:02:29 EDT 2016

Steven D'Aprano <steve at pearwood.info>:

> On Sat, 19 Mar 2016 08:31 pm, Marko Rauhamaa wrote:
>
>
>>    Using the surrogate mechanism, UTF-16 can support all 1,114,112
>>    potential Unicode characters.
>> 
>> But Unicode doesn't contain 1,114,112 characters—the surrogates are
>> excluded from Unicode, and definitely cannot be encoded using
>> UTF-anything.
>
> Surrogates are most certainly part of the Unicode standard, and they are
> necessary in UTF-16.

Yes, but UTF-16 produces 16-bit values (the surrogate code units) that
are not Unicode characters in their own right. UTF-16 can encode *any*
valid Unicode text, but it cannot encode a lone surrogate code point.

   >>> '\udc10'.encode('utf-8')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-8' codec can't encode character '\udc10' in position 0: surrogates not allowed
   >>> '\udc10'.encode('utf-16')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-16' codec can't encode character '\udc10' in position 0: surrogates not allowed
   >>> '\udc10'.encode('utf-32')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-32' codec can't encode character '\udc10' in position 0: surrogates not allowed
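
For contrast, here is a rough sketch of the other half of the argument: a
code point above U+FFFF does encode in UTF-16, as a pair of surrogate
code units, while a lone surrogate is rejected. The helper names
(utf16_code_units, can_encode) and the choice of U+1F600 are mine, purely
for illustration, and I am assuming Python 3.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints (illustrative helper)."""
    data = text.encode('utf-16-be')  # big-endian, no byte order mark
    return list(struct.unpack('>%dH' % (len(data) // 2), data))

def can_encode(text, codec):
    """Return True if text encodes under codec with the default strict handler."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# U+1F600 lies outside the BMP, so UTF-16 needs two code units for it.
units = utf16_code_units('\U0001F600')
print([hex(u) for u in units])

# A lone surrogate code point is refused by all three codecs.
print([can_encode('\udc10', c) for c in ('utf-8', 'utf-16', 'utf-32')])

# The arithmetic behind the two figures quoted above.
TOTAL_CODE_POINTS = 0x110000           # 1,114,112 code points, U+0000..U+10FFFF
SURROGATES = 0xE000 - 0xD800           # 2,048 surrogate code points
SCALAR_VALUES = TOTAL_CODE_POINTS - SURROGATES
print(TOTAL_CODE_POINTS, SURROGATES, SCALAR_VALUES)
```

So the 1,114,112 figure counts code points including the surrogates, and
what the UTFs can actually carry is the remainder.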

>> We still don't know if the final result will be UCS-4 everywhere (with
>> all 2**32 code points allowed?!) or UTF-8 everywhere.
>
> Unicode does not have 2**32 code points. It is guaranteed to never
> exceed the 2**21 code points already allocated. (Many of those are
> still unused.)

Never say never.
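
For the record, today's ceiling is easy to observe. A small sketch,
assuming CPython 3.3 or later (where sys.maxunicode is always 0x10FFFF);
is_valid_code_point is a made-up helper name, not anything from the
standard library.

```python
import sys

def is_valid_code_point(n):
    """Return True if chr() accepts n as a code point (illustrative helper)."""
    try:
        chr(n)
        return True
    except (ValueError, OverflowError):
        return False

MAX_CODE_POINT = sys.maxunicode        # 0x10FFFF on CPython 3.3+

print(hex(MAX_CODE_POINT))
print(is_valid_code_point(0x10FFFF))   # last code point in the current code space
print(is_valid_code_point(0x110000))   # one past the end
print(0x110000 < 2**21 < 2**32)        # the code space fits in 21 bits
```

Note that chr() happily accepts a surrogate such as 0xDC10; it is the
codecs that refuse it.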

> In the future, we'll have so much memory that the idea of using
> variable width in-memory formats will seem absurd.

I'm starting to think that future is already here.

Marko