ASCII to Unicode.

Joe Goldthwaite joe at goldthwaites.com
Thu Jul 29 13:59:48 EDT 2010


Hi Ulrich,

Ascii.csv isn't really a latin-1 encoded file.  It's an ASCII file with a
few characters above 127 that are causing PostgreSQL Unicode errors.
Those characters work fine in the Windows world, but they're not valid
UTF-8 byte sequences.  What I'm attempting to do is translate those
upper-range characters into the correct UTF-8 representations so that
they look the same in the PostgreSQL database as they did in the CSV
file.
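
A quick illustration of what I mean (0x93 and 0x94 are just example
cp1252 "smart quote" bytes, not necessarily the ones in my file):

	# 0x93/0x94 are curly quotes in cp1252, but on their own they
	# are not valid UTF-8 byte sequences, so a UTF-8 decode fails.
	data = '\x93hello\x94'
	print data.decode('cp1252')   # works: u'\u201chello\u201d'
	data.decode('utf-8')          # raises UnicodeDecodeError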

I wrote up the source of my confusion to Steven so I won't duplicate it
here.  Your comment on defining the encoding of the file directly, instead
of using functions to encode and decode the data, led me to the codecs
module.  Using it, I can define the encoding at file open time and then
just read and write the lines.  I ended up with this:

	import codecs

	# codecs.open returns wrapped file objects that decode/encode
	# transparently: reads come back as unicode, writes are encoded.
	input = codecs.open('ascii.csv', encoding='cp1252')
	output = codecs.open('unicode.csv', mode='wb', encoding='utf-8')

	# Read the cp1252 file and write it back out as UTF-8.
	output.writelines(input.readlines())

	input.close()
	output.close()

This does exactly the same thing, but it's much clearer to me.  readlines()
decodes the input using the cp1252 codec, and writelines() encodes it to
UTF-8 and writes it out.  And as you mentioned, it's probably faster.  I
haven't tested that, but since both programs do the job in seconds,
performance isn't an issue.
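
If the files ever got large, the same codecs approach could stream a line
at a time instead of buffering everything with readlines().  A sketch,
untested on my data:

	import codecs

	infile = codecs.open('ascii.csv', encoding='cp1252')
	outfile = codecs.open('unicode.csv', mode='wb', encoding='utf-8')

	# Iterating the reader yields decoded unicode lines; each write
	# re-encodes to UTF-8, so memory use stays roughly constant.
	for line in infile:
	    outfile.write(line)

	infile.close()
	outfile.close()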

Thanks again to everyone who posted.  I really do appreciate it.




