Ascii to Unicode.

Thu Jul 29 07:50:25 EDT 2010

Joe Goldthwaite wrote:
>   import unicodedata
>    
>   input = file('ascii.csv', 'rb')
>   output = file('unicode.csv','wb')
> 
>   for line in input.xreadlines():
>      unicodestring = unicode(line, 'latin1')
>      # This second encode is what I was missing.
>      output.write(unicodestring.encode('utf-8'))

Actually, I see two problems here:
1. "ascii.csv" is not an ASCII file but a Latin-1 encoded file, and that is
where the first confusion starts.
2. "unicode.csv" is not a "Unicode" file, because Unicode is not a file
format. Rather, it is a UTF-8 encoded file, and UTF-8 is one encoding of
Unicode. This is the second confusion.
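
To make the two points concrete, here is a minimal sketch. It assumes
Python 3, where bytes and str take the roles of Python 2's str and unicode,
and the sample text is made up:

```python
# Hypothetical sample text with one non-ASCII character (e with acute accent).
text = u"caf\u00e9"

latin1_bytes = text.encode("latin-1")  # what a Latin-1 file would contain
utf8_bytes = text.encode("utf-8")      # what a UTF-8 file would contain

# The Latin-1 bytes are not ASCII: byte 0xE9 is outside the 7-bit range.
try:
    latin1_bytes.decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False

print(repr(latin1_bytes), repr(utf8_bytes), is_ascii)
```

Same text, two different byte sequences, and neither of them is "Unicode"
as such.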

> A number of you pointed out what I was doing wrong but I couldn't
> understand it until I realized that the write operation didn't work until
> it was using a properly encoded Unicode string.

The write function wants bytes! Encoding a string in your favourite encoding
yields bytes.
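
A small sketch of that, assuming Python 3 and using an in-memory io.BytesIO
as a stand-in for a file opened in binary mode (the sample text is made up):

```python
import io

buffer = io.BytesIO()  # stands in for a file opened with mode 'wb'
text = u"caf\u00e9"

# Writing the string itself is rejected, because it is not bytes.
try:
    buffer.write(text)
    wrote_text = True
except TypeError:
    wrote_text = False

# Encoding the string yields bytes, which write() accepts.
buffer.write(text.encode("utf-8"))
data = buffer.getvalue()
```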

> This still seems odd to me.  I would have thought that the unicode
> function would return a properly encoded byte stream that could then
> simply be written to disk.

No, unicode() takes a byte stream and decodes it according to the given
encoding. You then get an internal representation of the string, a unicode
object. This representation typically resembles UCS-2 or UCS-4, which are
more suitable for internal manipulation than UTF-8. This object is a string
btw, so typical stuff like concatenation etc. is supported. However, the
internal representation is a sequence of Unicode code points, not a
guaranteed sequence of bytes, and bytes are what you want in a file.
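
Roughly, in code (a sketch assuming Python 3, where raw.decode('latin-1')
does what unicode(raw, 'latin1') does in Python 2; the data is made up):

```python
raw = b"caf\xe9"              # Latin-1 encoded bytes, as read from a file
text = raw.decode("latin-1")  # now a string of code points

longer = text + u" au lait"   # ordinary string operations work

n_chars = len(text)                      # counts code points
n_utf8 = len(text.encode("utf-8"))       # byte count depends on the encoding
n_utf16 = len(text.encode("utf-16-le"))  # a different encoding, other count
```

The string has one length in code points, but how many bytes it takes
depends entirely on which encoding you pick when writing it out.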

> Instead it seems like you have to re-encode the byte stream to some
> kind of escaped Ascii before it can be written back out.

As mentioned above, you have a string. For writing, that string needs to be
transformed to bytes again.

Note: You can also open a file so that it decodes from one encoding on read
or encodes to another on write. You then get unicode objects from the input,
which you can feed directly to the output. The important difference is that
you specify each encoding in only one place, and it will probably even be
faster. I'd have to search to find you the corresponding library calls, but
a starting point is http://docs.python.org.
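
One candidate for such a call is io.open (the built-in open in Python 3),
which takes an encoding argument; codecs.open is another. This is only a
sketch under that assumption, with hypothetical file names and sample data
written to a temporary directory:

```python
import io
import os
import tempfile

workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "latin1.csv")  # hypothetical file names
dst_path = os.path.join(workdir, "utf8.csv")

# Create a small Latin-1 encoded input file to convert.
with io.open(src_path, "wb") as f:
    f.write(b"name,city\nJos\xe9,M\xe1laga\n")

# Each encoding is named once, when its file is opened.
# newline="" keeps the line endings exactly as they are in the input.
with io.open(src_path, "r", encoding="latin-1", newline="") as src, \
     io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
    for line in src:     # line is a decoded string
        dst.write(line)  # encoded to UTF-8 by the file object

with io.open(dst_path, "rb") as f:
    converted = f.read()
```

The loop itself never calls decode or encode; the file objects do that.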

Good luck!

Uli

-- 
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932