How to waste computer memory?

Steven D'Aprano steve at pearwood.info
Sat Mar 19 10:56:56 EDT 2016


On Sat, 19 Mar 2016 08:31 pm, Marko Rauhamaa wrote:


>    Using the surrogate mechanism, UTF-16 can support all 1,114,112
>    potential Unicode characters.
> 
> But Unicode doesn't contain 1,114,112 characters—the surrogates are
> excluded from Unicode, and definitely cannot be encoded using
> UTF-anything.

Surrogates are most certainly part of the Unicode standard, and they are
necessary in UTF-16. (You cannot represent astral characters without them!)
So in a UTF-16 stream, a *pair* of surrogates is nothing unusual. They just
represent an SMP code point.
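
To make that concrete, here is a quick Python 3 sketch (using U+1F600
purely as an arbitrary example of an SMP code point) showing the surrogate
pair that UTF-16 produces for it:

# Python 3: encode one SMP code point (U+1F600, an arbitrary example)
# as UTF-16 and look at the surrogate pair that carries it.
ch = '\U0001F600'

data = ch.encode('utf-16-be')       # big-endian, no BOM
print(data.hex())                   # d83dde00: a high/low surrogate pair

# The same pair computed by hand from the UTF-16 algorithm:
value = ord(ch) - 0x10000           # 20-bit value 0x0F600
high = 0xD800 + (value >> 10)       # 0xD83D, high (lead) surrogate
low = 0xDC00 + (value & 0x3FF)      # 0xDE00, low (trail) surrogate
assert data == high.to_bytes(2, 'big') + low.to_bytes(2, 'big')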

However, *single* surrogates are an error. For example, the Unicode FAQ says:


Q: How do I convert an unpaired UTF-16 surrogate to UTF-32?

A: If an unpaired surrogate is encountered when converting ill-formed UTF-16
data, any conformant converter must treat this as an error. By representing
such an unpaired surrogate on its own, the resulting UTF-32 data stream
would become ill-formed. While it faithfully reflects the nature of the
input, Unicode conformance requires that encoding form conversion always
results in a valid data stream.

http://www.unicode.org/faq/utf_bom.html#utf32-7


But nobody says that programming languages must deal only with conformant
converters and valid Unicode sequences. It is an unfortunate fact of life
that even if you don't generate ill-formed sequences yourself, somebody else
will, so you need to be able to deal with them.
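
Python itself is pragmatic about this: the strict UTF-8 codec refuses lone
surrogates, but the standard 'surrogatepass' and 'surrogateescape' error
handlers let ill-formed data survive a round trip when you have no choice
but to handle it. A rough sketch:

# Python 3: a strict codec rejects a lone surrogate, but the standard
# 'surrogatepass' and 'surrogateescape' error handlers let ill-formed
# data survive a round trip when it has to be handled anyway.
lone = '\ud800'                        # an unpaired high surrogate

try:
    lone.encode('utf-8')               # conformant behaviour: refuse it
except UnicodeEncodeError as err:
    print('strict codec says no:', err.reason)

# 'surrogatepass' writes the surrogate out as (ill-formed) UTF-8 anyway.
raw = lone.encode('utf-8', 'surrogatepass')
print(raw)                             # b'\xed\xa0\x80'
assert raw.decode('utf-8', 'surrogatepass') == lone

# 'surrogateescape' maps un-decodable bytes to lone surrogates and back,
# which is how Python smuggles arbitrary byte sequences through str.
junk = b'\xff\xfe\xfd'
text = junk.decode('utf-8', 'surrogateescape')   # '\udcff\udcfe\udcfd'
assert text.encode('utf-8', 'surrogateescape') == junk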



[...]
> We still don't know if the final result will be UCS-4 everywhere (with
> all 2**32 code points allowed?!) or UTF-8 everywhere.

Unicode does not have 2**32 code points. The code space is fixed at
1,114,112 code points (U+0000 through U+10FFFF, a little over 2**20), and it
is guaranteed never to grow beyond that. (Many of those code points are
still unassigned.)
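
The arithmetic, as it looks at the Python prompt (sys.maxunicode has been
pinned to the top of the code space since Python 3.3):

# The Unicode code space: 17 planes of 2**16 code points each.
import sys

print(17 * 2**16)           # 1114112 code points, U+0000 .. U+10FFFF
print(0x110000)             # the same number, written as the hex limit
print(hex(sys.maxunicode))  # 0x10ffff on Python 3.3 or later
print(2**21)                # 2097152, so "2**21" is only a round upper bound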


As far as I am concerned, the future is clear:

UTF-8 for transmission and storage formats, where fast random access is not
necessary;

UTF-32 for in-memory formats, where O(1) random access is advantageous.
Possibly with certain in-memory optimizations to save space, where such can
be done transparently.
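
CPython 3.3+ already does something very much like that: with the PEP 393
flexible string representation, each str transparently uses 1, 2 or 4 bytes
per code point depending on the widest character it contains. A rough
illustration (exact byte counts are CPython-specific and include per-object
overhead):

# CPython 3.3+ (PEP 393) picks the narrowest fixed width that fits the
# widest code point in the string: 1, 2 or 4 bytes per character.
import sys

ascii_s  = 'a' * 1000           # all code points < 256: 1 byte each
bmp_s    = '\u0101' * 1000      # widest code point < 2**16: 2 bytes each
astral_s = '\U0001F600' * 1000  # SMP code point present: 4 bytes each

for s in (ascii_s, bmp_s, astral_s):
    print(len(s), sys.getsizeof(s))  # same length, roughly 1x / 2x / 4x memory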

In the future, we will no more balk at using four whole bytes for a code
point than we now balk at using eight bytes for floating point numbers. The
mathematical advantages of double-precision floats are just overwhelming,
and the only reason for using fewer than 64 bits is if you care more about
getting a fast answer than an accurate answer.

(I'm reminded of one of my wife's former roadies, back in the 70s, crossing
the US desert in a van. On being told that he was heading in the wrong
direction for their next gig, he replied "Who cares? We're making great
time!")

In the future, we'll have so much memory that the idea of using variable
width in-memory formats will seem absurd. 




-- 
Steven



