encoding problems (é and è)

John Machin sjmachin at lexicon.net
Thu Mar 23 16:14:00 EST 2006


On 23/03/2006 10:07 PM, bussiere bussiere wrote:
> hi i'am making a program for formatting string,
> or
> i've added :
> #!/usr/bin/python
> # -*- coding: utf-8 -*-
> 
> in the begining of my script but
> 
>  str = str.replace('Ç', 'C')
>         str = str.replace('é', 'E')
>         str = str.replace('É', 'E')
>         str = str.replace('è', 'E')
>         str = str.replace('È', 'E')
>         str = str.replace('ê', 'E')
> 
> 
> doesn't work it put me " and , instead of remplacing é by E
> 
> 
> if someone have an idea it could be great

Hi, I've added some comments below ... I hope they help.
Cheers,
John

> 
> regards
> Bussiere
> ps : i've added the whole script under :
> __________________________________________________________________________
[snip]
> 
>     if ligneA != "":
>         str = ligneA
>         str = str.replace('a', 'A')
[snip]
>         str = str.replace('z', 'Z')
>
>         str = str.replace('ç', 'C')
>         str = str.replace('Ç', 'C')
>         str = str.replace('é', 'E')
>         str = str.replace('É', 'E')
>         str = str.replace('è', 'E')
[snip]
>         str = str.replace('Ú','U')

You can replace ALL of this upshifting and accent removal in one blow by 
using the string translate() method with a suitable table.
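
A minimal sketch of such a table, assuming a current Python 3 where 
str.maketrans() builds the mapping. The names ACCENTED, PLAIN, 
FOLD_TABLE and fold() are made up for illustration, and the accented 
letters listed are only the ones visible in your quoted code; extend 
both strings together for the rest.

```python
import string

# Assumption: only the accented letters shown in the quoted script.
# ACCENTED and PLAIN must stay the same length, position for position.
ACCENTED = 'çÇéÉèÈêÊúÚ'
PLAIN = 'CCEEEEEEUU'

# One table does both jobs: a-z are upshifted, accented letters are
# mapped to their unaccented upper-case equivalents.
FOLD_TABLE = str.maketrans(
    string.ascii_lowercase + ACCENTED,
    string.ascii_uppercase + PLAIN,
)

def fold(strg):
    """Upshift and strip accents in a single pass over the string."""
    return strg.translate(FOLD_TABLE)

print(fold('ça va, élève?'))  # CA VA, ELEVE?
```

Characters that are not in the table (digits, punctuation, spaces) pass 
through unchanged.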

>         str = str.replace('  ', ' ')
>         str = str.replace('   ', ' ')
>         str = str.replace('    ', ' ')

The standard Python idiom for normalising whitespace is
strg = ' '.join(strg.split())

 >>> strg = '  ALLO    BUSSIERE\tCA  VA?     '
 >>> strg.split()
['ALLO', 'BUSSIERE', 'CA', 'VA?']
 >>> ' '.join(strg.split())
'ALLO BUSSIERE CA VA?'
 >>>

[snip]
>         if normalisation2 == "O":
>             str = str.replace('MONSIEUR', 'M')
>             str = str.replace('MR', 'M')

You need to be very careful with this approach. You are changing EVERY 
occurrence of "MR" in the string, not just where it is a whole "word" 
meaning "Monsieur".
Constructed example of what can go wrong:
 >>> strg = 'MR IMRE NAGY, 123 PRIMROSE STREET, SHAMROCK VALLEY'
 >>> strg.replace('MR', 'M')
'M IME NAGY, 123 PRIMOSE STREET, SHAMOCK VALLEY'
 >>>

A real, non-constructed history lesson: A certain database indicated 
duplicate records by having the annotation "DUP" in the surname field 
e.g. "SMITH DUP". Fortunately it was detected in testing that the 
so-called clean-up was causing DUPLESSIS to become PLESSIS and DUPRAT to 
become RAT!

Two points here: (1) Split up your strings into "words" or "tokens". 
Using strg.split() is a start but you may need something more 
sophisticated e.g. "-" as an additional token separator. (2) Instead of 
writing out all those lines of code, consider putting those 
substitutions in a dictionary:

title_substitution = {
     'MONSIEUR': 'M',
     'MR': 'M',
     'MADAME': 'MME',
     # etc
     }
The next level of improvement is to read those substitutions from a file.
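
Points (1) and (2) together could look something like the sketch below. 
The function name substitute_titles() is made up, and plain 
strg.split() is used as the tokeniser, so a title glued to punctuation 
or a hyphen would need the more sophisticated splitting mentioned 
above.

```python
# Whole-word substitutions, keyed by the token to be replaced.
title_substitution = {
    'MONSIEUR': 'M',
    'MR': 'M',
    'MADAME': 'MME',
}

def substitute_titles(strg, table=title_substitution):
    """Replace a token only when the WHOLE token is in the table."""
    words = strg.split()
    # dict.get(word, word) leaves unknown tokens untouched.
    return ' '.join(table.get(word, word) for word in words)

strg = 'MR IMRE NAGY, 123 PRIMROSE STREET, SHAMROCK VALLEY'
print(substitute_titles(strg))
# M IMRE NAGY, 123 PRIMROSE STREET, SHAMROCK VALLEY
```

IMRE, PRIMROSE and SHAMROCK survive because the lookup is done per 
token, not per substring. As a side effect the split/join also 
normalises the whitespace.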
[snip]
> 
>         if normalisation4 == "O":
>             str = str.replace(';\"', ' ')
>             str = str.replace('\"', ' ')
>             str = str.replace('\'', ' ')
>             str = str.replace('-', ' ')
>             str = str.replace(',', ' ')
>             str = str.replace('\\', ' ')
>             str = str.replace('\/', ' ')
>             str = str.replace('&', ' ')
[snip]
Again, consider the string translate() method.
Also, consider that some of those characters may have some meaning that 
you perhaps shouldn't blow away e.g. compare 'SMITH & WESSON' with 
'SMITH ET WESSON' :-)
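
A sketch of the translate() version, again assuming Python 3 and using 
made-up names (PUNCTUATION, PUNCT_TABLE, strip_punctuation). It covers 
only the single characters visible in your quoted code and deliberately 
leaves '&' alone. Note that translate() works one character at a time, 
so the two-character ';"' replacement is not expressed here.

```python
# Single characters from the quoted code, minus '&', which may carry
# meaning: double quote, single quote, hyphen, comma, backslash, slash.
PUNCTUATION = '"\'-,\\/'

# Map every listed character to a space.
PUNCT_TABLE = str.maketrans(PUNCTUATION, ' ' * len(PUNCTUATION))

def strip_punctuation(strg):
    """Blank out the listed punctuation, then normalise whitespace."""
    return ' '.join(strg.translate(PUNCT_TABLE).split())

print(strip_punctuation("O'BRIEN, SMITH & WESSON"))
# O BRIEN SMITH & WESSON
```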


