How to waste computer memory?

Steven D'Aprano steve at pearwood.info
Sun Mar 20 07:06:58 EDT 2016


On Sun, 20 Mar 2016 05:20 pm, Rustom Mody wrote:

> On Sunday, March 20, 2016 at 10:32:07 AM UTC+5:30, Steven D'Aprano wrote:
>> On Sun, 20 Mar 2016 03:12 am, Marko Rauhamaa wrote:
>> 
>> > Steven D'Aprano :
>> > 
>> >> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
>> >>> Yes, but UTF-16 produces 16-bit values that are outside Unicode.
>> >>
>> >> Show me.
>> >>
>> >> Before you answer, if your answer is "surrogate pairs", that is
>> >> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.
>> > 
>> > UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers.
>> > Thus, the output of UTF-16 is not Unicode.
>> 
>> I'm not sure what point you think you are making.
>> 
>> Unicode (the character set part of it) is a set of abstract 23-bit
>> numbers,
> 
> 23? Or 21?

Oops, you're right, it's 21 bits.


> More pertinently if the number of bits signifies, whatever is the sense of
> the word 'abstract'?

The Unicode standard does not, as far as I am aware, care how you represent
code points in memory, only that there are 0x110000 of them, numbered from
U+0000 to U+10FFFF. That's what I mean by abstract. The obvious
implementation is to use 32-bit integers, where 0x00000000 represents code
point U+0000, 0x00000001 represents U+0001, and so forth. This is
essentially equivalent to UTF-32, but it's not mandated or specified by the
Unicode standard: you could, if you chose, use something else.
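
To make that concrete, here is a rough Python 3 sketch of the "obvious
implementation". It assumes Python 3.3 or later, where sys.maxunicode is
always 0x10FFFF, and the helper name utf32_unit is something I made up for
illustration, not anything from the standard or the stdlib:

```python
import struct
import sys

# Code points run from U+0000 to U+10FFFF inclusive, so there are
# 0x110000 of them, and the largest one fits in 21 bits.
code_point_count = sys.maxunicode + 1
bits_needed = sys.maxunicode.bit_length()


def utf32_unit(ch):
    """Return the single 32-bit code unit of ch's big-endian UTF-32 encoding.

    Hypothetical helper for illustration only; ch must be one character
    and must not be a lone surrogate (those cannot be encoded).
    """
    (unit,) = struct.unpack(">I", ch.encode("utf-32-be"))
    return unit


# Storing a code point as a plain 32-bit integer gives the same number
# that UTF-32 produces for it.
samples = ["\u0000", "\u0001", "A", "\U0001F600", "\U0010FFFF"]
matches = [utf32_unit(ch) == ord(ch) for ch in samples]
```

Every entry in matches comes out True, which is all I mean by "essentially
equivalent": the integer you would naturally store is the UTF-32 code unit.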

On the other hand, I believe that the output of the UTF transformations is
explicitly described in terms of 8-bit bytes and 16- or 32-bit words. For
instance, the UTF-8 encoding of "A" has to be a single byte with value 0x41
(decimal 65). It isn't that this is the most obvious implementation, it's
that it can't be anything else and still be UTF-8.
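
You can check this from the interpreter. The following is only a sketch using
the standard str.encode method with the explicit big-endian codec names (so
no byte order mark gets prepended); the variable names are mine:

```python
# The UTF encodings of "A" are fixed byte sequences, not a matter of taste.
text = "A"
utf8 = text.encode("utf-8")        # one 8-bit byte
utf16 = text.encode("utf-16-be")   # one 16-bit code unit, big-endian
utf32 = text.encode("utf-32-be")   # one 32-bit code unit, big-endian

# An astral character (outside the Basic Multilingual Plane) takes two
# 16-bit code units in UTF-16: a surrogate pair.
astral = "\U0001F600"
astral_utf16 = astral.encode("utf-16-be")
high = int.from_bytes(astral_utf16[:2], "big")   # high (lead) surrogate
low = int.from_bytes(astral_utf16[2:], "big")    # low (trail) surrogate

# Decoding the pair gives back the single original code point.
round_trip = astral_utf16.decode("utf-16-be")
```

Here utf8 is the single byte 0x41, while the UTF-16 form of U+1F600 is the
pair 0xD83D 0xDE00, which decodes back to one code point.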



-- 
Steven



