Ascii to Unicode.

Joe Goldthwaite joe at goldthwaites.com
Wed Jul 28 18:58:01 EDT 2010


Thanks to all of you who responded. I guess I was working from the wrong
premise.  I was thinking that a file could write any kind of data and that
once I had my Unicode string, I could just write it out with a standard
file.write() operation.

What was actually happening is that the file.write() operation was raising
the error until I encoded the string as UTF-8.  This is what worked:

  # Python 2
  infile = open('ascii.csv', 'rb')
  outfile = open('unicode.csv', 'wb')

  for line in infile:
      # Decode the Latin-1 bytes into a unicode object.
      unicodestring = unicode(line, 'latin1')
      # This encode step is what I was missing.
      outfile.write(unicodestring.encode('utf-8'))

  infile.close()
  outfile.close()
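
For comparison, here is a rough sketch of the same conversion that lets the
file objects do the decoding and encoding. It assumes Python 2.6 or later
(or Python 3), where io.open() accepts an encoding argument. The function
name latin1_to_utf8 is made up for illustration, and I am assuming the input
really is Latin-1.

```python
import io


def latin1_to_utf8(src_path, dst_path):
    """Copy a Latin-1 text file to a UTF-8 text file, line by line."""
    # io.open() decodes on read and encodes on write, so the loop
    # needs no explicit .decode() or .encode() calls.
    # newline='' leaves the original line endings untouched.
    with io.open(src_path, 'r', encoding='latin-1', newline='') as src:
        with io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
            for line in src:
                dst.write(line)
```

Calling latin1_to_utf8('ascii.csv', 'unicode.csv') should then produce the
same bytes as the loop above.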

A number of you pointed out what I was doing wrong, but I couldn't understand
it until I realized that the write operation didn't work until it was given
a properly encoded string. I thought I was getting the error in the
initial Latin-1 to Unicode conversion, not in the write operation.
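
A minimal sketch of where the failure comes from, as far as I understand it:
decoding Latin-1 always succeeds, and the error only shows up when the
resulting text is encoded as ASCII, which is what Python 2's file.write()
attempts implicitly when handed a unicode object. The sample bytes below are
made up for illustration.

```python
raw = b'caf\xe9'              # Latin-1 bytes, as read from the input file
text = raw.decode('latin-1')  # like unicode(raw, 'latin1'); this step works

try:
    # Roughly what Python 2's file.write() does with a unicode object.
    text.encode('ascii')
    failed_on_encode = False
except UnicodeEncodeError:
    failed_on_encode = True   # the non-ASCII character cannot be encoded
```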

This still seems odd to me.  I would have thought that the unicode() function
would return a properly encoded byte stream that could then simply be
written to disk. Instead, it seems like you have to re-encode the byte stream
to some kind of escaped ASCII before it can be written back out.
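
A small sketch that may help with the distinction, using made-up sample
bytes: the decoded value is a sequence of characters rather than bytes, and
encoding it as UTF-8 produces a different byte sequence from the Latin-1
input, with the plain ASCII characters left unchanged.

```python
latin1_bytes = b'caf\xe9'              # 4 bytes in Latin-1
text = latin1_bytes.decode('latin-1')  # 4 characters, no byte encoding yet
utf8_bytes = text.encode('utf-8')      # 5 bytes: b'caf\xc3\xa9'

ascii_part_unchanged = utf8_bytes[:3] == latin1_bytes[:3]  # True
round_trips = utf8_bytes.decode('utf-8') == text           # True
```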

Thanks to all of you who took the time to respond.  I really do appreciate
it.  With my mental block, I don't think I could have figured it out without
your help.