ascii to latin1

Tue May 9 14:27:10 EDT 2006

>> When I used the "NFD" option, I came across many errors on these and
>> possibly other codes: \xba, \xc9, \xcd.
> 
> What errors? The normalize method is not supposed to raise any errors. Do
> you mean it doesn't work as expected? Well, I have to admit that using
> normalize is a far from perfect way to implement search. The most
> advanced algorithm is published by the Unicode Consortium:
> <http://www.unicode.org/reports/tr10/> If you read it you'll understand
> it's not so easy.
> 
>> I tried to use "NFKD" instead, and the number of errors was only about
>> half a dozen, for a universe of 600000+ names, on code \xbf.
>> It looks like I have to do a search and substitute using regular
>> expressions for these cases.  Or is there a better way to do it?
> 
> Perhaps you can use the unicode translate method to map the characters
> that still give you problems to whatever you want.
> 
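
The translate suggestion quoted above could look roughly like the sketch below. It is written in Python 3 syntax (the script in this message is Python 2, where unicode.translate accepts the same kind of mapping). The FALLBACKS table and the to_ascii name are hypothetical, and the replacements chosen are only illustrative assumptions.

```python
import unicodedata

# Hypothetical fallback table: code point -> ASCII replacement.
# These choices are illustrative assumptions, not a standard mapping.
FALLBACKS = {
    0xBA: "o",  # masculine ordinal indicator
    0xAA: "a",  # feminine ordinal indicator
    0xBF: "?",  # inverted question mark
}

def to_ascii(text):
    """Strip combining marks after NFD, then apply the fallback table."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        cp for cp in decomposed
        if not unicodedata.category(cp).startswith("M")
    )
    return stripped.translate(FALLBACKS)

# "\xba" is the ordinal sign from the sample data.
print(to_ascii("DA 1\xba DE MO N\xba 2"))
```

Any character that is neither decomposed nor listed in the table would still pass through unchanged, so the table has to cover every leftover character found in the data.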

Errors occur when I assign the result of ''.join(cp for cp in de_str if
not unicodedata.category(cp).startswith('M')) to a variable.  The same
happens with de_str.  When I print the strings everything is ok.
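
To see what that expression keeps and drops, it may help to list the Unicode category of each character after normalization. The sketch below is in Python 3 syntax, and the describe helper is a hypothetical name introduced only for illustration. Note that in the traceback further down, the exception is raised by the .encode("ascii") call rather than by the assignment itself.

```python
import unicodedata

def describe(text, form="NFD"):
    """Return (character, category) pairs for the normalized text."""
    normalized = unicodedata.normalize(form, text)
    return [(cp, unicodedata.category(cp)) for cp in normalized]

# "\xc7" (C with cedilla): NFD splits it into "C" plus a combining
# cedilla, whose category starts with "M", so the filter drops it.
print(describe("\xc7"))

# "\xba" (masculine ordinal indicator): NFD leaves it as a single
# non-mark character, so it survives the filter and stays non-ASCII.
print(describe("\xba"))
```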

Here's a short sample of the data:
115448,DAÇÃO
117788,DA 1º DE MO Nº 2

I used the following script to convert the data:
# -*- coding: iso8859-15 -*-

class Latin1ToAscii:

	def abreFicheiro(self):
		import csv
		self.reader = csv.reader(open(self.input_file, "rb"))

	def converter(self):
		import unicodedata
		self.lista_csv = []
		for row in self.reader:
			s = unicode(row[1],"latin-1")
			de_str = unicodedata.normalize("NFD", s)
			nome = ''.join(cp for cp in de_str
				if not unicodedata.category(cp).startswith('M'))

			linha_ascii = row[0] + "," + nome  # *
			print linha_ascii.encode("ascii")
			self.lista_csv.append(linha_ascii)

	def __init__(self):
		self.input_file = 'nome_latin1.csv'
		self.output_file = 'nome_ascii.csv'

if __name__ == "__main__":
	f = Latin1ToAscii()
	f.abreFicheiro()
	f.converter()

And I got the following result:
$ python latin1_to_ascii.py
115448,DACAO
Traceback (most recent call last):
  File "latin1_to_ascii.py", line 44, in ?
    f.converter()
  File "latin1_to_ascii.py", line 22, in converter
    print linha_ascii.encode("ascii")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in
position 11: ordinal not in range(128)

The script converted the ÇÃ in the first line, but not the º in the
second one.  Also, at the line marked *, I don't get a list such as
[115448, DAÇÃO] but a single element, [u'115448,DAÇÃO'], which doesn't
suit my needs.
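
One possible direction, offered as a sketch and not a definitive fix: º has only a compatibility decomposition, so NFKD turns it into "o" while NFD leaves it alone. Keeping each row as a two-item list, instead of concatenating with ",", gives the list shape and lets the csv module do the joining on output. The code is in Python 3 syntax; strip_marks and convert_rows are hypothetical names, and the in-memory buffer merely stands in for a real output file.

```python
import csv
import io
import unicodedata

def strip_marks(text):
    """Apply NFKD, then drop combining marks (categories starting with 'M')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        cp for cp in decomposed
        if not unicodedata.category(cp).startswith("M")
    )

def convert_rows(rows):
    """Return each row as a two-item list: [code, folded name]."""
    return [[code, strip_marks(name)] for code, name in rows]

# Sample rows from the data above, written with escapes for the
# accented letters and the ordinal sign.
sample = [["115448", "DA\xc7\xc3O"], ["117788", "DA 1\xba DE MO N\xba 2"]]
converted = convert_rows(sample)

# Written to an in-memory buffer here; a file object opened for
# writing would be used the same way.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(converted)
print(buffer.getvalue())
```

Characters with no decomposition at all, such as the \xbf mentioned earlier in the thread, would still come through unchanged and need separate handling.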

Would you mind telling me what I should change?

Luis P. Mendes