Ascii to Unicode.

Thu Jul 29 15:46:43 EDT 2010

John Nagle wrote:
> On 7/28/2010 3:58 PM, Joe Goldthwaite wrote:
>> This still seems odd to me.  I would have thought that the unicode function
>> would return a properly encoded byte stream that could then simply be
>> written to disk. Instead it seems like you have to re-encode the byte stream
>> to some kind of escaped Ascii before it can be written back out.
> 
>    Here's what's really going on.
> 
>    Unicode strings within Python have to be indexable.  So the internal
> representation of Unicode has (usually) two bytes for each character,
> so they work like arrays.
> 
>    UTF-8 is a stream format for Unicode.  It's slightly compressed;
> each character occupies 1 to 4 bytes, and the base ASCII characters
> (0..127 only, not 128..255) occupy one byte each.  The format is
> described in "http://en.wikipedia.org/wiki/UTF-8".  A UTF-8 file or
> stream has to be parsed from the beginning to keep track of where each
> Unicode character begins.  So it's not a suitable format for
> data being actively worked on in memory; it can't be easily indexed.
> 
Not entirely correct. The advantage of UTF-8 is that although different
codepoints might be encoded into different numbers of bytes, it's easy to
tell whether a particular byte is the first in its sequence (continuation
bytes always have the bit pattern 10xxxxxx), so you don't have to parse
from the start of the file. It is true, however, that it can't be easily
indexed.
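
To illustrate, here is a minimal sketch (Python 3 syntax, where indexing a
bytes object yields an int) of backing up from an arbitrary byte offset to
the start of the codepoint that contains it. The names is_sequence_start
and codepoint_start are made up for this example; they are not part of any
library.

```python
def is_sequence_start(byte):
    """Return True if this byte begins a UTF-8 sequence."""
    # Continuation bytes always look like 10xxxxxx; anything else
    # (an ASCII byte or a lead byte) starts a new sequence.
    return (byte & 0xC0) != 0x80


def codepoint_start(data, pos):
    """Back up from byte offset pos to the first byte of its codepoint."""
    while pos > 0 and not is_sequence_start(data[pos]):
        pos -= 1
    return pos


# "na\u00efve \u20acuro": the i-diaeresis takes 2 bytes, the euro sign takes 3.
data = "na\u00efve \u20acuro".encode("utf-8")

# Offset 9 is the last byte of the euro sign, which starts at offset 7.
start = codepoint_start(data, 9)
print(start, data[start:start + 3].decode("utf-8"))
```

Landing in the middle of a sequence costs at most three steps backwards,
with no need to scan from the beginning. What this does not give you is the
Nth codepoint by index; for that you still have to count sequences from the
start.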

>    That's why it's necessary to convert to UTF-8 before writing
> to a file or socket.
>