ascii to latin1

Wed May 10 05:42:20 EDT 2006

Luis P. Mendes wrote:
> Errors occur when I assign the result of ''.join(cp for cp in de_str if
> not unicodedata.category(cp).startswith('M')) to a variable.  The same
> happens with de_str.  When I print the strings everything is ok.
>
> Here's a short example of data:
> 115448,DAÇÃO
> 117788,DA 1º DE MO Nº 2
>
> I used the following script to convert the data:
> # -*- coding: iso8859-15 -*-
>
> class Latin1ToAscii:
>
> 	def abreFicheiro(self):
> 		import csv
> 		self.reader = csv.reader(open(self.input_file, "rb"))
>
> 	def converter(self):
> 		import unicodedata
> 		self.lista_csv = []
> 		for row in self.reader:
> 			s = unicode(row[1],"latin-1")
> 			de_str = unicodedata.normalize("NFD", s)
> 			nome = ''.join(cp for cp in de_str if not \
>   			unicodedata.category(cp).startswith('M'))
>
> 			linha_ascii = row[0] + "," + nome  # *
> 			print linha_ascii.encode("ascii")
> 			self.lista_csv.append(linha_ascii)
>
>
> 	def __init__(self):
> 		self.input_file = 'nome_latin1.csv'
> 		self.output_file = 'nome_ascii.csv'
>
> if __name__ == "__main__":
> 	f = Latin1ToAscii()
> 	f.abreFicheiro()
> 	f.converter()
>
>
> And I got the following result:
> $ python latin1_to_ascii.py
> 115448,DACAO
> Traceback (most recent call last):
>   File "latin1_to_ascii.py", line 44, in ?
>     f.converter()
>   File "latin1_to_ascii.py", line 22, in converter
>     print linha_ascii.encode("ascii")
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in
> position 11: ordinal not in range(128)
>
>
> The script converted the ÇÃ from the first line, but not the º from the
> second one.  Still in *, I also don't get a list as [115448,DAÇÃO] but a
> [u'115448,DAÇÃO'] element, which doesn't suit my needs.
>
> Would you mind telling me what should I change?

Calling this process "latin1 to ascii" was a misnomer; sorry that I
used that phrase. It should be called "latin1 to search key". There is
no requirement that the key be ASCII, so change the corresponding
lines in your code:

linha_key = row[0] + "," + nome
print linha_key
self.lista_csv.append(linha_key.encode("latin-1"))

With regard to º, Richie already gave you food for thought. If you
want "1 DE MO" to match "1º DE MO", remove that symbol from the key
(linha_key = linha_key.translate({ord(u"º"): None}); note that the
mapping is keyed by the character's ordinal, not the character itself).
If you don't want such fuzzy matching, keep it.
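
Putting the two steps together, here is a minimal sketch of the key
building, not a definitive implementation. The function name make_key
and its drop parameter are hypothetical names chosen for illustration;
they do not appear in your script. It assumes the input is already a
unicode string (i.e. decoded from latin-1 beforehand), and it uses
escapes instead of literal accented characters so it does not depend on
the source file's coding line.

```python
import unicodedata

def make_key(text, drop=u"\xba"):
    """Build a search key from a unicode string.

    text: a unicode string (decode bytes from latin-1 first).
    drop: characters to delete outright; the default, U+00BA
          (masculine ordinal indicator), is an assumption made
          for this example.
    """
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Discard the combining marks (categories Mn, Mc, Me).
    stripped = u"".join(
        cp for cp in decomposed
        if not unicodedata.category(cp).startswith("M")
    )
    # translate() wants a mapping keyed by ordinals; None deletes.
    table = dict((ord(ch), None) for ch in drop)
    return stripped.translate(table)

key1 = make_key(u"DA\xc7\xc3O")               # u"DACAO"
key2 = make_key(u"DA 1\xba DE MO N\xba 2")    # u"DA 1 DE MO N 2"
```

Passing drop=u"" keeps º in the key, which is the non-fuzzy variant; in
that case encode the result as latin-1 rather than ascii before writing
it out.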