Ascii to Unicode.

Wed Jul 28 15:29:36 EDT 2010

On 7/28/2010 11:32 AM, Joe Goldthwaite wrote:
> Hi,
>
> I've got an Ascii file with some latin characters. Specifically \xe1 and
> \xfc.  I'm trying to import it into a Postgresql database that's running in
> Unicode mode. The Unicode converter chokes on those two characters.
>
> I could just manually replace those two characters with something valid but
> if any other invalid characters show up in later versions of the file, I'd
> like to handle them correctly.
>
>
> I've been playing with the Unicode stuff and I found out that I could
> convert both those characters correctly using the latin1 encoder like this;
>
>
> 	import unicodedata
>
> 	s = '\xe1\xfc'
> 	print unicode(s,'latin1')
>
>
> The above works.  When I try to convert my file however, I still get an
> error;
>
> 	import unicodedata
>
> 	input = file('ascii.csv', 'r')
> 	output = file('unicode.csv','w')
>
> 	for line in input.xreadlines():
> 		output.write(unicode(line,'latin1'))
>
> 	input.close()
> 	output.close()
>
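Try this in place of your loop. It will get you a UTF-8 file, the usual
encoding for Unicode text in a file. Your version fails because
output.write() is handed a unicode object, which Python 2 implicitly
encodes as ASCII, and \xe1 and \xfc are not ASCII.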

     for rawline in input:
         unicodeline = unicode(rawline, 'latin1')    # Latin-1 bytes to Unicode
         output.write(unicodeline.encode('utf-8'))   # Unicode to UTF-8 bytes
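
If you would rather not decode and encode by hand, the same conversion can be sketched with io.open (available in Python 2.6 and later), which does both steps for you. This is only a sketch: the function name latin1_to_utf8 is made up for illustration, and it assumes the whole input really is Latin-1.

```python
import io

def latin1_to_utf8(src_path, dst_path):
    """Copy a Latin-1 text file to a UTF-8 file; return the number of lines copied.

    Hypothetical helper, assuming every byte of the input is Latin-1.
    """
    count = 0
    # newline='' leaves the original line endings untouched in both directions.
    with io.open(src_path, 'r', encoding='latin-1', newline='') as src:
        with io.open(dst_path, 'w', encoding='utf-8', newline='') as dst:
            for line in src:        # each line is decoded to Unicode on read
                dst.write(line)     # and encoded to UTF-8 on write
                count += 1
    return count
```

Calling latin1_to_utf8('ascii.csv', 'unicode.csv') would then stand in for the loop above.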


				John Nagle