Language design

Steven D'Aprano steve+comp.lang.python at pearwood.info
Wed Sep 11 22:33:06 EDT 2013


On Thu, 12 Sep 2013 10:31:26 +1000, Chris Angelico wrote:

> On Thu, Sep 12, 2013 at 10:25 AM, Mark Janssen
> <dreamingforward at gmail.com> wrote:

>> Well now, this is an area that is not actually well-defined.  I would
>> say 16-bit Unicode is binary data if you're encoding in base 65,536,
>> just as 8-bit ascii is binary data if you're encoding in base-256.
>> Which is to say:  there is no intervening data to suggest a TYPE.
> 
> Unicode is not 16-bit any more than ASCII is 8-bit. And you used the
> word "encod[e]", which is the standard way to turn Unicode into bytes
> anyway. No, a Unicode string is a series of codepoints - it's more
> similar to a list of ints than to a stream of bytes.

And not necessarily ints, for that matter.

Let's be clear: the most obvious, simple, hardware-efficient way to 
implement a Unicode string holding arbitrary characters is as an array of 
32-bit signed integers restricted to the range 0x0 - 0x10FFFF. That gives 
you a one-to-one mapping of int <-> code point.
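
As a rough sketch in Python (the make_string name and the choice of 
array typecode are just for illustration; 'I' is an unsigned C int, 
which is 32 bits on common platforms):

    from array import array

    def make_string(code_points):
        # Reject anything outside the Unicode code space up front.
        for cp in code_points:
            if not 0 <= cp <= 0x10FFFF:
                raise ValueError("not a Unicode code point: %r" % (cp,))
        # One unsigned C int (usually 32 bits) per code point.
        return array('I', code_points)

so make_string([72, 105, 0x10FFFF]) stores exactly one int per code 
point, giving the one-to-one mapping described above.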

But it's not the only way. One could implement Unicode strings using any 
similar one-to-one mapping. Taking a leaf out of the lambda calculus, I 
might implement each code point like this:

NULL pointer <=> Code point 0
^NULL <=> Code point 1
^^NULL <=> Code point 2
^^^NULL <=> Code point 3

and so on, where ^ means "pointer to".

Obviously this is mathematically neat but hopelessly impractical: code 
point U+10FFFF would require a chain of 1114111 pointers-to-pointers 
before reaching the NULL. But it would work. Or alternatively, I might 
choose to use floats, mapping (say) 0.25 <=> U+0376. Or whatever.
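
If you wanted to actually play with that pointer scheme, here's a toy 
version in Python, using nested one-element tuples to stand in for 
pointers and None for the NULL (the function names are my own 
invention):

    def to_unary(code_point):
        # Build a chain of "pointers" (nested tuples) of length
        # code_point, terminated by None.
        node = None
        for _ in range(code_point):
            node = (node,)
        return node

    def from_unary(node):
        # Count the links back to None to recover the code point.
        n = 0
        while node is not None:
            node = node[0]
            n += 1
        return n

from_unary(to_unary(0x41)) duly gives back 65, at the cost of building 
65 tuples to represent a single "A".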

What we can say, though, is that to represent the full Unicode charset 
requires 21 bits per code-point, although you can get away with fewer 
bits if you have some out-of-band mechanism for recognising restricted 
subsets of the charset. (E.g. you could use just 7 bits if you only 
handled the characters in ASCII, or just 4 bits if you only cared about 
decimal digits.) In practice, computers tend to be much faster when 
working with multiples of 8 bits, so we use 32 bits instead of 21. In 
that sense, Unicode is a 32-bit character set.
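
You can check those figures from the interactive interpreter, since 
int.bit_length reports the minimum number of bits needed:

    >>> (0x10FFFF).bit_length()   # full Unicode code space
    21
    >>> (0x7F).bit_length()       # highest ASCII code point
    7
    >>> (9).bit_length()          # highest decimal digit
    4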

But Unicode is absolutely not a 16-bit character set.

And of course you can use *more* bits than 21, or 32. If you had a 
computer where the native word-size was (say) 50 bits, it would make 
sense to use 50 bits per character.

As for the question of "binary data versus text", well, that's a thorny 
one, because really *everything* in a computer is binary data, since it's 
stored using bits. But we can choose to *interpret* some binary data as 
text, just as we interpret some binary data as pictures, sound files, 
video, Powerpoint presentations, and so forth. A reasonable way of 
defining a text file might be:

    If you decode the bytes making up an alleged text file into 
    code-points, using the correct encoding (which needs to be 
    known a priori, or stored out of band somehow), then provided 
    that none of the code-points have Unicode General Category Cc,
    Cf, Cs, Co or Cn (control, format, surrogate, private-use, 
    non-character/reserved), you can claim that it is at least 
    plausible that the file contains text.

Whether that text is meaningful is another story.

You might wish to allow Cf and possibly even Co (format and private-use), 
depending on the application.
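
For what it's worth, a quick-and-dirty checker along those lines might 
look like this (the name plausibly_text and the allowed parameter are 
my own invention; note also that newline and tab are themselves 
category Cc, so a practical version would whitelist at least those 
characters):

    import unicodedata

    def plausibly_text(data, encoding, allowed=frozenset()):
        # 'allowed' lets the caller permit extra categories, e.g.
        # frozenset({'Cf', 'Co'}) as suggested above.
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            return False
        forbidden = {'Cc', 'Cf', 'Cs', 'Co', 'Cn'} - set(allowed)
        return all(unicodedata.category(ch) not in forbidden
                   for ch in text)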


-- 
Steven


