How to waste computer memory?

Chris Angelico rosuav at gmail.com
Sun Mar 20 07:22:45 EDT 2016


On Sun, Mar 20, 2016 at 10:06 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> The Unicode standard does not, as far as I am aware, care how you represent
> code points in memory, only that there are 0x110000 of them, numbered from
> U+0000 to U+10FFFF. That's what I mean by abstract. The obvious
> implementation is to use 32-bit integers, where 0x00000000 represents code
> point U+0000, 0x00000001 represents U+0001, and so forth. This is
> essentially equivalent to UTF-16, but it's not mandated or specified by the
> Unicode standard, you could, if you choose, use something else.

(You mean UTF-32, not UTF-16.)

The codepoints are not defined in terms of *memory*; they are, by
definition, just integers. If you choose to represent those integers
as little-endian 32-bit values, then yes, the layout in memory will
look like UTF-32LE, but that's because UTF-32LE is defined in exactly
that simple way. In fact, that's how the layers work - Unicode
defines a mapping of characters to code points, and then UTF-x
defines a mapping of code points to bytes.
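
To make the layering concrete, here's a quick sketch at the Python
prompt (U+20AC EURO SIGN is just an arbitrary example character):

>>> ch = "\u20ac"                    # the character itself
>>> ord(ch)                          # character -> code point (an integer)
8364
>>> hex(ord(ch))
'0x20ac'
>>> ch.encode("utf-32-le")           # code point -> bytes, per UTF-32LE
b'\xac \x00\x00'
>>> int.from_bytes(ch.encode("utf-32-le"), "little") == ord(ch)
True

(The second byte really is 0x20; Python just displays it as a space.)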

> On the other hand, I believe that the output of the UTF transformations is
> explicitly described in terms of 8-bit bytes and 16- or 32-bit words. For
> instance, the UTF-8 encoding of "A" has to be a single byte with value 0x41
> (decimal 65). It isn't that this is the most obvious implementation, it's
> that it can't be anything else and still be UTF-8.

Exactly. Aside from the way UTF-16 and UTF-32 have LE and BE variants,
there is only one bit pattern for any given character sequence in any
given UTF-x (so if you work with, e.g., "UTF-16LE", there's only one).
This is no accident. Unlike some encodings, which have one "most
obvious" way to encode things but also a number of other legal ways,
UTF-x text can be compared for equality [1] using simple byte-for-byte
comparisons. This means you don't have to worry about someone sneaking
a magic character past your filter: if you're checking a UTF-8 stream
for the character U+003C LESS-THAN SIGN, the only byte value to look
for is 0x3C - the sequence 0xC0 0xBC, despite mathematically
representing the value 0x3C, is an over-long encoding and is
explicitly forbidden.
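
A short demonstration (the exact wording of the exception may differ
between Python versions, but the over-long form is always rejected):

>>> "A".encode("utf-8")                  # must be the single byte 0x41
b'A'
>>> "\u003c".encode("utf-8")             # LESS-THAN SIGN -> 0x3C, nothing else
b'<'
>>> bytes([0xC0, 0xBC]).decode("utf-8")  # over-long encoding of 0x3C
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte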

ChrisA

[1] Though not for ordering comparisons - lexical sorting doesn't
follow codepoint order, and codepoint order won't always match byte
order (notably in UTF-16). But equality is easy.
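
For instance, with two arbitrarily chosen characters (one from the
BMP, one astral), codepoint order and UTF-16 byte order disagree,
while equality checks stay trivial:

>>> a, b = "\uff00", "\U00010000"
>>> a < b                                          # codepoint order
True
>>> a.encode("utf-16-be") < b.encode("utf-16-be")  # byte order disagrees
False
>>> a.encode("utf-8") == "\uff00".encode("utf-8")  # equality is still easy
True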


