How to waste computer memory?

Steven D'Aprano steve at pearwood.info
Sun Mar 20 08:14:39 EDT 2016


On Sun, 20 Mar 2016 10:22 pm, Chris Angelico wrote:

> On Sun, Mar 20, 2016 at 10:06 PM, Steven D'Aprano <steve at pearwood.info>
> wrote:
>> The Unicode standard does not, as far as I am aware, care how you
>> represent code points in memory, only that there are 0x110000 of them,
>> numbered from U+0000 to U+10FFFF. That's what I mean by abstract. The
>> obvious implementation is to use 32-bit integers, where 0x00000000
>> represents code point U+0000, 0x00000001 represents U+0001, and so forth.
>> This is essentially equivalent to UTF-16, but it's not mandated or
>> specified by the Unicode standard; you could, if you choose, use
>> something else.
> 
> (UTF-32)

D'oh!

I mean, yes, well done, you have passed my little test to see if anyone is
paying attention. Have a gold star.


> The codepoints are not representable in *memory*; they are, by
> definition, representable in a field of integers. 

They're not directly representable in memory because the definition of code
points is not given in terms of memory values. Hence, they are abstract
values, numbered in a certain way, and given certain semantics.

In other words, there's nothing in the Unicode standard that says that code
point U+0020 has to be stored as a byte 0x20, or a word 0x0020. But the
standard does say that the code point U+0020 represents a space character.
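
To make that concrete, here is a rough sketch in Python 3. The variable
names are mine and purely illustrative; the only assumption is the standard
codecs that ship with CPython. The same abstract code point comes out as
different bytes depending on which encoding you ask for:

```python
# A code point is just a number; how it is stored depends on the encoding.
space = "\u0020"
assert ord(space) == 0x20  # the abstract code point value

# Encode the same code point three ways (big-endian, so no BOM is added).
encoded = {name: space.encode(name)
           for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    print(name, data.hex())

# In UTF-32 the stored 32-bit value is numerically equal to the code point,
# even outside the Basic Multilingual Plane.
astral = "\U0001F600"
utf32_value = int.from_bytes(astral.encode("utf-32-be"), "big")
assert utf32_value == ord(astral) == 0x1F600
```

The space comes out as one byte in UTF-8, two in UTF-16 and four in UTF-32,
yet ord() reports 0x20 in every case.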


[...]
>> On the other hand, I believe that the output of the UTF transformations
>> is explicitly described in terms of 8-bit bytes and 16- or 32-bit words.
>> For instance, the UTF-8 encoding of "A" has to be a single byte with
>> value 0x41 (decimal 65). It isn't that this is the most obvious
>> implementation, it's that it can't be anything else and still be UTF-8.
> 
> Exactly. Aside from the way UTF-16 and UTF-32 have LE and BE variants,

Blame the chip manufacturers for that. Actually, I think we can blame Intel
specifically for that, for reversing the normal layout of words in memory.
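
For anyone who wants to see the difference, a small Python 3 sketch (again,
the names are just mine for illustration). UTF-8 works in single bytes, so
"A" is always 0x41; the byte-order question only arises once the code units
are 16 or 32 bits wide:

```python
import sys

a = "A"

# UTF-8: one byte, 0x41, regardless of the machine's byte order.
utf8 = a.encode("utf-8")

# UTF-16 with an explicit byte order: the same 16-bit unit 0x0041,
# laid out low byte first (LE) or high byte first (BE).
le = a.encode("utf-16-le")
be = a.encode("utf-16-be")

# Plain "utf-16" in CPython prepends a byte order mark and then uses the
# machine's native order, so the result depends on sys.byteorder.
with_bom = a.encode("utf-16")
native = le if sys.byteorder == "little" else be

print(utf8.hex(), le.hex(), be.hex(), with_bom.hex())
```

Decoding with the matching codec gets you back "A" in every case; it is
only the layout in memory or on disk that differs.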



-- 
Steven



