Ascii to Unicode.

Wed Jul 28 15:20:33 EDT 2010

Joe Goldthwaite wrote:
> Hi,
> 
> I've got an ASCII file with some Latin characters, specifically \xe1 and
> \xfc. I'm trying to import it into a PostgreSQL database that's running in
> Unicode mode. The Unicode converter chokes on those two characters.
> 

> I could just manually replace those two characters with something valid but
> if any other invalid characters show up in later versions of the file, I'd
> like to handle them correctly.
> 
> 
> I've been playing with the Unicode stuff and I found out that I could
> convert both those characters correctly using the latin1 encoder like this:
> 
> 
> 	import unicodedata
> 
> 	s = '\xe1\xfc'
> 	print unicode(s,'latin1')
> 
> 
> The above works. When I try to convert my file, however, I still get an
> error:
> 
> 	import unicodedata
> 
> 	input = file('ascii.csv', 'r')
> 	output = file('unicode.csv','w')
> 
> 	for line in input.xreadlines():
> 		output.write(unicode(line,'latin1'))
> 
> 	input.close()
> 	output.close()
> 
> Traceback (most recent call last):
>   File "C:\Users\jgold\CloudmartFiles\UnicodeTest.py", line 10, in __main__
>     output.write(unicode(line,'latin1'))
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position
> 295: ordinal not in range(128)
> 
> I'm stuck using Python 2.4.4, which may be handling the strings differently
> depending on whether they're in the program or coming from the file.  I just
> haven't been able to figure out how to get the Unicode conversion working
> from the file data.
> 
> Can anyone explain what is going on?
> 
What you need to remember is that files contain bytes.

When you say "ASCII file" what you mean is that the file contains bytes
which represent text encoded as ASCII, and such a file by definition
can't contain bytes outside the range 0-127. Therefore your file isn't
an ASCII file. So then you've decided to treat it as a file containing
bytes which represent text encoded as Latin-1.
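
To see this for yourself, here is a small sketch. The helper name is_ascii
is made up for illustration, and binascii.unhexlify is used only so that the
same lines give a byte string on both old and new Pythons. It checks that the
two bytes from the question are rejected by the ASCII codec but decode
cleanly as Latin-1:

```python
import binascii

# The two problem bytes from the question (0xE1 and 0xFC), built as a
# byte string rather than typed as a literal.
raw = binascii.unhexlify('e1fc')

def is_ascii(data):
    """Return True if every byte in data is valid ASCII (0-127)."""
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return False
    return True

# Latin-1 maps every byte value to a character, so this decode cannot fail.
text = raw.decode('latin-1')
```

Here is_ascii(raw) comes back false, while text ends up as a two-character
Unicode string.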

You're reading bytes from a file, decoding them to Unicode, and then
trying to write the resulting Unicode string to a file. But the output
file expects bytes (did I say that files contain bytes? :-)), so Python
tries to encode the string back to bytes using the default encoding,
which is ASCII. u'\xe1' can't be encoded as ASCII, so UnicodeEncodeError
is raised.
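
One way out, sketched under some assumptions: encode explicitly before
writing, so that Python never falls back to ASCII. The function name
convert_latin1_to_utf8 is hypothetical, and UTF-8 as the target encoding is
an assumption (it is what a PostgreSQL database in Unicode mode would
normally expect). The input is taken to be Latin-1, as in the question. Both
files are opened in binary mode, and try/finally is used rather than a with
block because with is not available on 2.4.

```python
def convert_latin1_to_utf8(src_path, dst_path):
    """Copy src_path to dst_path, re-encoding Latin-1 bytes as UTF-8."""
    src = open(src_path, 'rb')          # read raw bytes
    try:
        dst = open(dst_path, 'wb')      # write raw bytes
        try:
            for line in src:
                # bytes -> Unicode, assuming the input really is Latin-1
                text = line.decode('latin-1')
                # Unicode -> bytes, with an explicit encoding (assumed UTF-8)
                dst.write(text.encode('utf-8'))
        finally:
            dst.close()
    finally:
        src.close()
```

With the file names from the question, the call would be
convert_latin1_to_utf8('ascii.csv', 'unicode.csv').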