encoding problems (é and è)
John Machin
sjmachin at lexicon.net
Thu Mar 23 16:14:00 EST 2006
On 23/03/2006 10:07 PM, bussiere bussiere wrote:
> hi, I'm making a program for formatting strings.
> However, I've added:
> #!/usr/bin/python
> # -*- coding: utf-8 -*-
>
> at the beginning of my script, but
>
> str = str.replace('Ç', 'C')
> str = str.replace('é', 'E')
> str = str.replace('É', 'E')
> str = str.replace('è', 'E')
> str = str.replace('È', 'E')
> str = str.replace('ê', 'E')
>
>
> doesn't work: it gives me " and , instead of replacing é with E
>
>
> if someone has an idea, that would be great
Hi, I've added some comments below ... I hope they help.
Cheers,
John
>
> regards
> Bussiere
> ps : i've added the whole script under :
> __________________________________________________________________________
[snip]
>
> if ligneA != "":
> str = ligneA
> str = str.replace('a', 'A')
[snip]
> str = str.replace('z', 'Z')
>
> str = str.replace('ç', 'C')
> str = str.replace('Ç', 'C')
> str = str.replace('é', 'E')
> str = str.replace('É', 'E')
> str = str.replace('è', 'E')
[snip]
> str = str.replace('Ú','U')
You can replace ALL of this upshifting and accent removal in one blow by
using the string translate() method with a suitable table.
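A minimal sketch of that idea, assuming Python 3 text strings (where str.maketrans() accepts a dictionary) rather than the byte strings of the script above. The names ACCENT_MAP and upshift_and_strip_accents are made up for illustration, and the table covers only a handful of letters; extend it as needed.

```python
# Hypothetical, partial table: each accented letter maps to its
# unaccented upper-case equivalent. Add the remaining letters as required.
ACCENT_MAP = str.maketrans({
    'ç': 'C', 'Ç': 'C',
    'é': 'E', 'É': 'E',
    'è': 'E', 'È': 'E',
    'ê': 'E', 'Ê': 'E',
    'ù': 'U', 'Ù': 'U',
    'ú': 'U', 'Ú': 'U',
})

def upshift_and_strip_accents(strg):
    """Remove the accents listed in ACCENT_MAP, then upshift everything."""
    # translate() handles the accented letters in one pass;
    # upper() then takes care of plain a-z.
    return strg.translate(ACCENT_MAP).upper()

print(upshift_and_strip_accents('ça va, élève'))
```

Letters that are not in the table pass through translate() untouched and are only upshifted, so a missing entry shows up as a surviving accent rather than as an error.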
> str = str.replace(' ', ' ')
> str = str.replace(' ', ' ')
> str = str.replace(' ', ' ')
The standard Python idiom for normalising whitespace is
strg = ' '.join(strg.split())
>>> strg = ' ALLO BUSSIERE\tCA VA? '
>>> strg.split()
['ALLO', 'BUSSIERE', 'CA', 'VA?']
>>> ' '.join(strg.split())
'ALLO BUSSIERE CA VA?'
>>>
[snip]
> if normalisation2 == "O":
> str = str.replace('MONSIEUR', 'M')
> str = str.replace('MR', 'M')
You need to be very careful with this approach. You are changing EVERY
occurrence of "MR" in the string, not just where it is a whole "word"
meaning "Monsieur".
Constructed example of what can go wrong:
>>> strg = 'MR IMRE NAGY, 123 PRIMROSE STREET, SHAMROCK VALLEY'
>>> strg.replace('MR', 'M')
'M IME NAGY, 123 PRIMOSE STREET, SHAMOCK VALLEY'
>>>
A real, non-constructed history lesson: A certain database indicated
duplicate records by having the annotation "DUP" in the surname field
e.g. "SMITH DUP". Fortunately it was detected in testing that the
so-called clean-up was causing DUPLESSIS to become PLESSIS and DUPRAT to
become RAT!
Two points here: (1) Split up your strings into "words" or "tokens".
Using strg.split() is a start but you may need something more
sophisticated e.g. "-" as an additional token separator. (2) Instead of
writing out all those lines of code, consider putting those
substitutions in a dictionary:
title_substitution = {
    'MONSIEUR': 'M',
    'MR': 'M',
    'MADAME': 'MME',
    # etc
    }
Next level of improvement is to read that stuff from a file.
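The two points above might be combined roughly as follows. This is only a sketch: substitute_titles is a hypothetical helper name, the table holds just the three entries shown earlier, and tokens are split on whitespace only.

```python
# Sample table; in practice it would hold many more entries.
title_substitution = {
    'MONSIEUR': 'M',
    'MR': 'M',
    'MADAME': 'MME',
}

def substitute_titles(strg, table=title_substitution):
    """Replace whole whitespace-separated tokens found in the table."""
    # dict.get(word, word) leaves any token that is not a key unchanged,
    # so 'IMRE' and 'PRIMROSE' are never touched.
    return ' '.join(table.get(word, word) for word in strg.split())

print(substitute_titles('MR IMRE NAGY, 123 PRIMROSE STREET, SHAMROCK VALLEY'))
```

Because the lookup is on whole tokens, a title with punctuation attached (for example 'MR.') would not match; that is where a more sophisticated tokeniser comes in. As a side effect, the split/join also normalises whitespace.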
[snip]
>
> if normalisation4 == "O":
> str = str.replace(';\"', ' ')
> str = str.replace('\"', ' ')
> str = str.replace('\'', ' ')
> str = str.replace('-', ' ')
> str = str.replace(',', ' ')
> str = str.replace('\\', ' ')
> str = str.replace('\/', ' ')
> str = str.replace('&', ' ')
[snip]
Again, consider the string translate() method.
Also, consider that some of those characters may have some meaning that
you perhaps shouldn't blow away e.g. compare 'SMITH & WESSON' with
'SMITH ET WESSON' :-)
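For the punctuation case, a translate() table could look something like the sketch below. It assumes Python 3 strings, the names PUNCT_TO_SPACE and blank_punctuation are invented for illustration, and '&' is deliberately left out of the table in line with the caution above.

```python
# Single characters to be replaced by a space. '&' is intentionally
# omitted so that 'SMITH & WESSON' keeps its meaning.
PUNCTUATION = ['"', "'", '-', ',', '\\', '/']

# Hypothetical table built once and reused for every string.
PUNCT_TO_SPACE = str.maketrans({ch: ' ' for ch in PUNCTUATION})

def blank_punctuation(strg):
    """Replace each listed punctuation character with a single space."""
    return strg.translate(PUNCT_TO_SPACE)

cleaned = blank_punctuation('O\'BRIEN-SMITH, "JOE" & CO')
# Follow up with the whitespace idiom to collapse the runs of spaces.
print(' '.join(cleaned.split()))
```

translate() works one character at a time, so a two-character pattern such as ';"' would still need its own replace() call if the ';' should go as well.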