ascii to latin1
Serge Orlov
Serge.Orlov at gmail.com
Wed May 10 05:42:20 EDT 2006
Luis P. Mendes wrote:
> Errors occur when I assign the result of ''.join(cp for cp in de_str if
> not unicodedata.category(cp).startswith('M')) to a variable. The same
> happens with de_str. When I print the strings everything is ok.
>
> Here's a short example of data:
> 115448,DAÇÃO
> 117788,DA 1º DE MO Nº 2
>
> I used the following script to convert the data:
> # -*- coding: iso8859-15 -*-
>
> class Latin1ToAscii:
>
>     def abreFicheiro(self):
>         import csv
>         self.reader = csv.reader(open(self.input_file, "rb"))
>
>     def converter(self):
>         import unicodedata
>         self.lista_csv = []
>         for row in self.reader:
>             s = unicode(row[1],"latin-1")
>             de_str = unicodedata.normalize("NFD", s)
>             nome = ''.join(cp for cp in de_str if not \
>                 unicodedata.category(cp).startswith('M'))
>
>             linha_ascii = row[0] + "," + nome # *
>             print linha_ascii.encode("ascii")
>             self.lista_csv.append(linha_ascii)
>
>
>     def __init__(self):
>         self.input_file = 'nome_latin1.csv'
>         self.output_file = 'nome_ascii.csv'
>
> if __name__ == "__main__":
>     f = Latin1ToAscii()
>     f.abreFicheiro()
>     f.converter()
>
>
> And I got the following result:
> $ python latin1_to_ascii.py
> 115448,DACAO
> Traceback (most recent call last):
>   File "latin1_to_ascii.py", line 44, in ?
>     f.converter()
>   File "latin1_to_ascii.py", line 22, in converter
>     print linha_ascii.encode("ascii")
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in
> position 11: ordinal not in range(128)
>
>
> The script converted the ÇÃ from the first line, but not the º from the
> second one. Still in *, I also don't get a list as [115448,DAÇÃO] but a
> [u'115448,DAÇÃO'] element, which doesn't suit my needs.
>
> Would you mind telling me what should I change?

Calling this process "latin1 to ascii" was a misnomer; sorry that I
used this phrase. It should be called "latin1 to search key": there is
no requirement that the key be ASCII, so change the corresponding
lines in your code:
    linha_key = row[0] + "," + nome
    print linha_key
    self.lista_csv.append(linha_key.encode("latin-1"))
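
As a rough sketch of the same idea outside your class: the helper name
make_search_key is made up for illustration, and the sample strings are
the two rows from your message written with escapes. It is meant to run
on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import unicodedata


def make_search_key(text):
    """Strip combining marks from a unicode string (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return u"".join(cp for cp in decomposed
                    if not unicodedata.category(cp).startswith("M"))


# u"\xc7\xc3" is "ÇÃ" and u"\xba" is "º"; escapes avoid any dependence
# on the encoding of this source file.
key1 = make_search_key(u"DA\xc7\xc3O")
key2 = make_search_key(u"DA 1\xba DE MO N\xba 2")

# "º" has no canonical decomposition, so NFD leaves it in the key.
# Encoding to latin-1 instead of ascii therefore succeeds for both rows.
row1 = (u"115448," + key1).encode("latin-1")
row2 = (u"117788," + key2).encode("latin-1")
```

The keys stay unicode until the very last step, and only the final
encode decides which byte encoding ends up in the list.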
With regard to º, Richie already gave you food for thought: if you
want "1 DE MO" to match "1º DE MO", remove that symbol from the key
(linha_key = linha_key.translate({ord(u"º"): None}); the keys of the
table must be ordinals). If you don't want such fuzzy matching, keep it.
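
A small sketch of the fuzzy option, assuming the key is a unicode
string. For unicode strings the translation table is keyed by code
point (an integer), hence the ord() call. The names drop_ordinal and
fuzzy_key are only for illustration.

```python
# -*- coding: utf-8 -*-
ORDINAL_MASCULINE = u"\xba"  # the "º" character, written as an escape

# Keys are integer ordinals; a value of None deletes the character.
drop_ordinal = {ord(ORDINAL_MASCULINE): None}

with_symbol = u"DA 1\xba DE MO N\xba 2"
fuzzy_key = with_symbol.translate(drop_ordinal)

# After the translation, a search for "1 DE MO" finds the row.
matches = u"1 DE MO" in fuzzy_key
```

Without the translation the same search on with_symbol fails, because
the "º" sits between "1" and " DE MO".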