Ascii to Unicode.
Steven D'Aprano
steve-REMOVE-THIS at cybersource.com.au
Wed Jul 28 23:27:05 EDT 2010
On Wed, 28 Jul 2010 15:58:01 -0700, Joe Goldthwaite wrote:
> This still seems odd to me. I would have thought that the unicode
> function would return a properly encoded byte stream that could then
> simply be written to disk. Instead it seems like you have to re-encode
> the byte stream to some kind of escaped Ascii before it can be written
> back out.
I'm afraid that's not even wrong. The unicode function returns a unicode
string object, not a byte-stream, just as the list function returns a
sequence of objects, not a byte-stream.
Perhaps this will help:
http://www.joelonsoftware.com/articles/Unicode.html
Summary:
ASCII is not a synonym for bytes, no matter what some English-speakers
think. ASCII is an encoding from bytes like \x41 to characters like "A".
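A minimal sketch of that mapping, written in Python 3 syntax (an
assumption on my part; there the text type is called str and bytes
literals carry a b prefix):

```python
# Assumes Python 3, where bytes and text (str) are separate types.
raw = b"\x41"                      # one byte with the value 0x41
text = raw.decode("ascii")         # interpret that byte as ASCII
assert text == "A"
assert text.encode("ascii") == raw  # and back again

# ASCII only defines byte values 0-127; anything else is not ASCII.
try:
    b"\xe9".decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False
print(is_ascii)
```

The byte \xe9 is a perfectly good byte; it just has no meaning under
the ASCII encoding.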
Unicode strings are a sequence of code points. A code point is a number,
implemented in some complex fashion that you don't need to care about.
Each code point maps conceptually to a character; for example, the
English letter A is represented by the code point U+0041 and the Arabic
letter Ain is represented by the code point U+0639.
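The two code points mentioned here can be inspected with the built-ins
ord and chr. This is only an illustration in Python 3 syntax (my
assumption; in Python 2 the equivalent of chr for unicode is unichr):

```python
# Assumes Python 3. "\u0639" is the Arabic letter Ain, written as an
# escape so the example does not depend on the source file's encoding.
latin_a = "A"
ain = "\u0639"

code_points = [ord(ch) for ch in (latin_a, ain)]
for cp in code_points:
    print("U+%04X" % cp)

# chr goes the other way, from a code point number to a character.
assert chr(0x0041) == latin_a
assert chr(0x0639) == ain
```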
You shouldn't make any assumptions about the size of each code-point, or
how they are put together. You shouldn't expect to write code points to a
disk and have the result make sense, any more than you could expect to
write a sequence of tuples or sets or dicts to disk in any sensible
fashion. You have to serialise it to bytes first, and that's what the
encode method does. Decode does the opposite, taking bytes and creating
unicode strings from them.
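As a rough sketch of encode and decode in action (Python 3 syntax
assumed; the choice of UTF-8 and UTF-16 is mine, purely for
illustration):

```python
# Assumes Python 3, where str is the unicode string type.
text = "A\u0639"                    # two code points: U+0041, U+0639

# Serialise the same string to bytes in two different encodings.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Same string, different bytes, different lengths.
assert utf8_bytes == b"A\xd8\xb9"
assert utf16_bytes == b"A\x00\x39\x06"
print(len(text), len(utf8_bytes), len(utf16_bytes))

# Decoding with the matching encoding recovers the original string.
assert utf8_bytes.decode("utf-8") == text
assert utf16_bytes.decode("utf-16-le") == text
```

Note that the length of the string (2 code points) tells you nothing
about the number of bytes each encoding produces.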
For historical reasons -- backwards compatibility with files already
created, back in the Bad Old Days before unicode -- there are a whole
slew of different encodings available. There is no 1:1 mapping between
bytes and strings. If all you have are the bytes, there is literally no
way of knowing what string they represent (although sometimes you can
guess). You need to know which encoding was used, or take a guess, or
try one decoding after another until one doesn't fail and hope that's
the right one.
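To make the ambiguity concrete, here is a sketch in Python 3 syntax.
The helper guess_decode and its list of candidate encodings are
hypothetical, made up for this example; they are not anything built
into Python:

```python
# Assumes Python 3. The same single byte means different things
# depending on which encoding you assume.
data = b"\xe9"
as_latin1 = data.decode("latin-1")  # U+00E9, e with acute accent
as_cp1251 = data.decode("cp1251")   # U+0439, Cyrillic short i
assert as_latin1 != as_cp1251

def guess_decode(data, candidates=("utf-8", "cp1251", "latin-1")):
    """Hypothetical helper: return (encoding, text) for the first
    candidate that decodes without error. Success does NOT prove the
    guess is correct, only that the bytes were valid in that encoding.
    """
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")

encoding, text = guess_decode(data)
print(encoding)
```

Here UTF-8 is rejected because a lone \xe9 is not valid UTF-8, and the
next candidate wins simply because it comes first in the list. Reorder
the candidates and you get a different "answer".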
As a general rule, Python (2.x) will try encoding/decoding using the
ASCII encoding unless you tell it otherwise.
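The failure this rule tends to produce can be reproduced by naming the
ASCII codec explicitly. The sketch below uses Python 3 syntax (my
assumption), which as far as I know does not convert implicitly
between bytes and text, so the codec has to be spelled out:

```python
# Assumes Python 3; the ASCII codec is requested explicitly here to
# mimic what an implicit ASCII conversion would attempt.
text = "A\u0639"

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # U+0639 has no representation in ASCII.
    ascii_ok = False

# Telling Python which encoding to use avoids the error.
encoded = text.encode("utf-8")
print(ascii_ok, len(encoded))
```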
Any time you are writing to disk, you need to serialise the objects,
regardless of whether they are floats, or dicts, or unicode strings.
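Putting it together for the write-to-disk case, a minimal sketch in
Python 3 syntax. The file name and the choice of UTF-8 are assumptions
for the example:

```python
import os
import tempfile

# Assumes Python 3. The path is a throwaway temporary file.
text = "A\u0639"
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Serialise explicitly: encode to bytes, then write the bytes.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# What is on disk is bytes, not code points.
with open(path, "rb") as f:
    on_disk = f.read()

# Reading it back as text requires knowing the encoding again.
with open(path, "r", encoding="utf-8") as f:
    restored = f.read()

assert restored == text
os.remove(path)
```

Opening the file in text mode with an encoding argument does the same
encode or decode step for you; either way the serialisation happens
somewhere.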
--
Steven