Ascii to Unicode.

Steven D'Aprano steve-REMOVE-THIS at cybersource.com.au
Wed Jul 28 23:27:05 EDT 2010


On Wed, 28 Jul 2010 15:58:01 -0700, Joe Goldthwaite wrote:

> This still seems odd to me.  I would have thought that the unicode
> function would return a properly encoded byte stream that could then
> simply be written to disk. Instead it seems like you have to re-encode
> the byte stream to some kind of escaped Ascii before it can be written
> back out.

I'm afraid that's not even wrong. The unicode function returns a unicode 
string object, not a byte-stream, just as the list function returns a 
sequence of objects, not a byte-stream.

Perhaps this will help:

http://www.joelonsoftware.com/articles/Unicode.html


Summary:

ASCII is not a synonym for bytes, no matter what some English-speakers 
think. ASCII is an encoding from bytes like \x41 to characters like "A".
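
To make that concrete, here is a minimal sketch. It is written so that it should run on Python 3 as well as Python 2.6+, which is why it spells bytes as b"..." and text as u"..."; the variable names are only illustrative.

```python
# ASCII maps the byte 0x41 to the character "A".
raw = b"\x41"
text = raw.decode("ascii")
print(text == u"A")

# ASCII only defines the byte values 0x00-0x7F, so a byte such as 0xE9
# has no meaning under it and decoding fails.
try:
    b"\xe9".decode("ascii")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)
```

The bytes are just numbers; it is the choice of encoding that turns them into characters.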

Unicode strings are a sequence of code points. A code point is a number, 
implemented in some complex fashion that you don't need to care about. 
Each code point maps conceptually to a letter; for example, the English 
letter A is represented by the code point U+0041 and the Arabic letter 
Ain is represented by the code point U+0639.
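
You can inspect code points from the interactive interpreter. This is only a sketch using the two letters mentioned above; ord and the unicodedata module are standard, and the variable names are mine.

```python
import unicodedata

letter_a = u"A"
letter_ain = u"\u0639"

# ord() gives the code point as a plain integer.
print(hex(ord(letter_a)))    # 0x41
print(hex(ord(letter_ain)))  # 0x639

# unicodedata.name() gives the official Unicode name of the character.
print(unicodedata.name(letter_a))    # LATIN CAPITAL LETTER A
print(unicodedata.name(letter_ain))  # ARABIC LETTER AIN
```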

You shouldn't make any assumptions about the size of each code point, or 
how they are put together. You shouldn't expect to write code points to a 
disk and have the result make sense, any more than you could expect to 
write a sequence of tuples or sets or dicts to disk in any sensible 
fashion. You have to serialise it to bytes first, and that's what the 
encode method does. Decode does the opposite, taking bytes and creating 
unicode strings from them.
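
A short sketch of the round trip, assuming UTF-8 and UTF-16 (little-endian) as the example encodings; any other encoding that can represent the characters would do just as well.

```python
text = u"A\u0639"  # two code points: Latin A and Arabic Ain

# encode: unicode string -> bytes
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

print(len(text))         # 2 code points
print(len(utf8_bytes))   # 3 bytes: one for A, two for Ain
print(len(utf16_bytes))  # 4 bytes: two for each of these code points

# decode: bytes -> unicode string, using the same encoding
print(utf8_bytes.decode("utf-8") == text)
print(utf16_bytes.decode("utf-16-le") == text)
```

The same two code points come out as different byte sequences of different lengths, which is why you can't assume anything about their size on disk.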

For historical reasons -- backwards compatibility with files already 
created, back in the Bad Old Days before unicode -- there are a whole 
slew of different encodings available. There is no 1:1 mapping between 
bytes and strings. If all you have are the bytes, there is literally no 
way of knowing what string they represent (although sometimes you can 
guess). You need to know which encoding was used, or take a guess, or 
try one decoding after another until something doesn't fail and hope 
that's the right one.
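
Here is a sketch of both points. The helper guess_decode and its list of candidate encodings are hypothetical, made up for illustration; they are not part of Python.

```python
data = b"\xe9"

# The same byte is a different character under different encodings.
as_latin1 = data.decode("latin-1")  # U+00E9, e with acute accent
as_cp1251 = data.decode("cp1251")   # U+0439, Cyrillic short i
print(as_latin1 == as_cp1251)       # False


def guess_decode(data, candidates=("utf-8", "cp1251", "latin-1")):
    """Return (text, encoding) for the first candidate that doesn't fail."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")


text, used = guess_decode(data)
print(used)  # cp1251, simply because it was tried before latin-1
```

Note that "didn't fail" is not the same as "correct": the answer here depends entirely on the order of the candidates.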

As a general rule, Python will try to encode or decode using the ASCII 
encoding unless you tell it otherwise.
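
The implicit conversions are Python 2 behaviour (for example, calling str on a unicode string). The sketch below spells the ASCII step out explicitly so that it behaves the same way on Python 3; the sample text is just an assumption for illustration.

```python
text = u"caf\u00e9"  # contains one non-ASCII character

# Encoding with ASCII fails as soon as a character falls outside 0x00-0x7F.
try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True
print(failed)  # True

# Telling Python which encoding to use avoids the problem.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes == b"caf\xc3\xa9")  # True
```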

Any time you are writing to disk, you need to serialise the objects, 
regardless of whether they are floats, or dicts, or unicode strings.
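
For unicode strings, one way to do that is to let the file object encode for you. This is a sketch only: it assumes UTF-8 as the on-disk encoding and writes to a throwaway temporary file whose name is made up for the example. io.open is available on Python 2.6+ and is the same as the built-in open on Python 3.

```python
import io
import os
import tempfile

text = u"A\u0639"
path = os.path.join(tempfile.mkdtemp(), "example.txt")  # hypothetical file

# Text mode with an explicit encoding: the file object encodes on write.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Binary mode shows what actually landed on disk: bytes, not code points.
with io.open(path, "rb") as f:
    raw = f.read()
print(raw == b"A\xd8\xb9")  # True

# Reading back with the same encoding recovers the original string.
with io.open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()
print(round_tripped == text)  # True
```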


-- 
Steven


