ascii to latin1
Luis P. Mendes
luis_lupe2XXX at netvisaoXXX.pt
Tue May 9 14:27:10 EDT 2006
>> When I used the "NFD" option, I came across many errors on these and
>> possibly other codes: \xba, \xc9, \xcd.
>
> What errors? normalize method is not supposed to give any errors. You
> mean it doesn't work as expected? Well, I have to admit that using
> normalize is a far from perfect way to implement search. The most
> advanced algorithm is published by Unicode guys:
> <http://www.unicode.org/reports/tr10/> If you read it you'll understand
> it's not so easy.
>
>> I tried to use "NFKD" instead, and the number of errors was only about
>> half a dozen, for a universe of 600000+ names, on code \xbf.
>> It looks like I have to do a search and substitute using regular
>> expressions for these cases. Or is there a better way to do it?
>
> Perhaps you can use unicode translate method to map the characters that
> still give you problems to whatever you want.
>
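For reference, a minimal sketch of the suggestion quoted above: decompose with NFKD, drop the combining marks, then use translate for whatever is left. It is written for Python 3, which is an assumption (the script in this message is Python 2). The names to_ascii and FALLBACK are hypothetical, and the replacement chosen for \xbf is only an example. The reason NFKD helps is that º (\xba) has only a compatibility decomposition, to a plain "o", so NFD leaves it untouched, while ¿ (\xbf) has no decomposition at all and needs an explicit mapping.

```python
import unicodedata

# Hypothetical fallback table for characters with no decomposition at all.
# Mapping the inverted question mark (0xBF) to "?" is only an example choice.
FALLBACK = {0xBF: "?"}


def to_ascii(text):
    """Fold a str towards ASCII: NFKD, drop combining marks, map leftovers."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    return stripped.translate(FALLBACK)


print(to_ascii("DAÇÃO"))
print(to_ascii("DA 1º DE MO Nº 2"))
```

Any character that is neither decomposable nor listed in FALLBACK passes through unchanged, so a later encode("ascii") can still fail on it; the table has to be extended for the data at hand.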
Errors occur when I assign the result of ''.join(cp for cp in de_str if
not unicodedata.category(cp).startswith('M')) to a variable. The same
happens with de_str. When I print the strings everything is ok.
Here's a short example of data:
115448,DAÇÃO
117788,DA 1º DE MO Nº 2
I used the following script to convert the data:
# -*- coding: iso8859-15 -*-

class Latin1ToAscii:

    def abreFicheiro(self):
        import csv
        self.reader = csv.reader(open(self.input_file, "rb"))

    def converter(self):
        import unicodedata
        self.lista_csv = []
        for row in self.reader:
            s = unicode(row[1],"latin-1")
            de_str = unicodedata.normalize("NFD", s)
            nome = ''.join(cp for cp in de_str if not \
                unicodedata.category(cp).startswith('M'))
            linha_ascii = row[0] + "," + nome    # *
            print linha_ascii.encode("ascii")
            self.lista_csv.append(linha_ascii)

    def __init__(self):
        self.input_file = 'nome_latin1.csv'
        self.output_file = 'nome_ascii.csv'

if __name__ == "__main__":
    f = Latin1ToAscii()
    f.abreFicheiro()
    f.converter()
And I got the following result:
$ python latin1_to_ascii.py
115448,DACAO
Traceback (most recent call last):
  File "latin1_to_ascii.py", line 44, in ?
    f.converter()
  File "latin1_to_ascii.py", line 22, in converter
    print linha_ascii.encode("ascii")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xba' in position 11: ordinal not in range(128)
The script converted the ÇÃ in the first line, but not the º in the
second one. Also, at the line marked *, I don't get a list such as
[115448,DAÇÃO] but a single [u'115448,DAÇÃO'] element, which doesn't
suit my needs.
Would you mind telling me what I should change?
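One possible way to get a list per row, sketched under assumptions: it targets Python 3 rather than the Python 2 used above, reads from an in-memory stand-in for nome_latin1.csv, and uses hypothetical helper names (fold, convert). The idea is to append [row[0], nome] instead of joining the two fields into one string, and to let csv.writer put the comma back when writing.

```python
import csv
import io
import unicodedata


def fold(text):
    """NFKD-decompose, drop combining marks, replace what is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    # Anything left outside ASCII becomes "?" instead of raising an error.
    return stripped.encode("ascii", "replace").decode("ascii")


def convert(reader):
    """Return each row as an [id, name] list, not as one joined string."""
    return [[row[0], fold(row[1])] for row in reader]


# In-memory stand-in for the input file, using the two sample rows above.
sample = io.StringIO("115448,DAÇÃO\n117788,DA 1º DE MO Nº 2\n")
rows = convert(csv.reader(sample))
print(rows)

# csv.writer joins the fields again when the rows are written out.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue(), end="")
```

With a real file, the StringIO objects would be replaced by open('nome_latin1.csv', encoding='latin-1', newline='') for reading and open('nome_ascii.csv', 'w', newline='') for writing.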
Luis P. Mendes