Language design

Chris Angelico rosuav at gmail.com
Wed Sep 11 20:31:26 EDT 2013


On Thu, Sep 12, 2013 at 10:25 AM, Mark Janssen
<dreamingforward at gmail.com> wrote:
>>> On Tue, 10 Sep 2013, Ben Finney wrote:
>>> >  The sooner we replace the erroneous
>>> >  “text is ASCII” in the common wisdom with “text is Unicode”, the
>>> >  better.
>>>
>>> I'd actually argue that it's better to replace the common wisdom with
>>> "text is binary data, and we should normally look at that text through
>>> Unicode eyes". A little less catchy, but more accurate ;)
>>
>> No, that's inaccurate. A sequence of bytes is binary data. Unicode is
>> not binary data.
>
> Well now, this is an area that is not actually well-defined.  I would
> say 16-bit Unicode is binary data if you're encoding in base 65,536,
> just as 8-bit ascii is binary data if you're encoding in base-256.
> Which is to say:  there is no intervening data to suggest a TYPE.

Unicode is not 16-bit any more than ASCII is 8-bit. And you used the
word "encod[e]", which is the standard way to turn Unicode into bytes
anyway. No, a Unicode string is a series of codepoints - it's more
similar to a list of ints than to a stream of bytes.
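
To make that concrete, here is a minimal Python 3 sketch. The sample
string and the variable names are only illustrative assumptions; the
point is that the same string has one length in codepoints and a
different length in bytes for each encoding, and that some codepoints
do not fit in 16 bits at all.

```python
# Python 3: a str is a sequence of code points, not a sequence of bytes.
# The sample text is arbitrary: 'café € ' followed by U+1F40D (a snake emoji).
text = "caf\u00e9 \u20ac \U0001F40D"

# Viewing the string as a list of ints (one int per code point).
code_points = [ord(ch) for ch in text]

# Encoding is what turns those code points into bytes; the byte count
# depends on the encoding chosen, the code point count does not.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

print(len(text), len(utf8), len(utf16))   # 8 14 18
print(max(code_points) > 0xFFFF)          # True: not everything fits in 16 bits

# Rebuilding the string from the ints loses nothing.
round_trip = "".join(chr(cp) for cp in code_points)
print(round_trip == text)                 # True
```

Note that the last character takes four bytes in both UTF-8 and UTF-16,
so neither "8-bit" nor "16-bit" describes the string itself.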

ChrisA


