character encoding conversion

Sun Dec 12 14:29:59 EST 2004

Martin v. Löwis wrote:
> Dylan wrote:
> 
>> Things I have tried include encode()/decode()
> 
> 
> This should work. If you somehow manage to guess the encoding,
> e.g. guess it as cp1252, then
> 
>   htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
> 
> will give you a file that contains only ASCII characters, and
> character references for everything else.
> 
> Now, how should you guess the encoding? Here is a strategy:
> 1. use the encoding that was sent through the HTTP header. Be
>    absolutely certain to not ignore this encoding.
> 2. use the encoding in the XML declaration (if any).
> 3. use the encoding in the http-equiv meta element (if any)
> 4. use UTF-8
> 5. use Latin-1, and check that there are no characters in the
>    range(128,160)
> 6. use cp1252
> 7. use Latin-1
> 
> In the order from 1 to 6, check whether you manage to decode
> the input. Notice that in step 5, you will definitely get successful
> decoding; consider this a failure if you get any control
> characters (from range(128, 160)); then try in step 7 latin-1
> again.
> 
> When you find the first encoding that decodes correctly, encode
> it with ascii and xmlcharrefreplace, and you won't need to worry
> about the encoding, anymore.
> 
> Regards,
> Martin
I have a similar problem with characters like äöüAÖÜß and so on. I am
extracting content from web pages, and they deliver whatever they like,
sometimes without any encoding information in the header. Your
solution sounds quite good, but I do not know:
- whether it works with the characters I mentioned
- what encoding you end up with
- how exactly you do all this. Is it all done with somestring.decode(),
  or something else? Can you please give an example for these 7 steps?
Thanks in advance for the help,
Chris
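
For reference, the seven steps quoted above could be sketched roughly as
follows. This is only an illustration, not Martin's own code. It assumes
Python 3, where the raw page is a bytes object and bytes.decode() plays
the role of the str.decode() call in the quoted one-liner. The names
declared_encodings, decode_html and to_ascii are made up for this sketch,
the two regular expressions are deliberately simple, and the HTTP charset
is assumed to have been extracted from the Content-Type header by the
caller.

```python
import re

# Simplified patterns (assumptions of this sketch, not a full HTML/XML parser).
XML_DECL = re.compile(rb"""<\?xml[^>]*encoding=["']([A-Za-z0-9._-]+)["']""")
META_CHARSET = re.compile(rb"""<meta[^>]+charset=["']?([A-Za-z0-9._-]+)""", re.I)


def declared_encodings(data, http_charset=None):
    """Steps 1-3: encodings announced by the server or the document."""
    if http_charset:                       # step 1: HTTP header
        yield http_charset
    head = data[:2048]                     # declarations sit near the top
    for pattern in (XML_DECL, META_CHARSET):   # steps 2 and 3
        match = pattern.search(head)
        if match:
            yield match.group(1).decode("ascii")


def has_c1_controls(text):
    """True if text contains characters from range(128, 160)."""
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)


def decode_html(data, http_charset=None):
    """Return (text, encoding) for the first attempt that succeeds."""
    # Each attempt is (encoding name, reject result containing C1 controls).
    attempts = [(name, False) for name in declared_encodings(data, http_charset)]
    attempts += [
        ("utf-8", False),    # step 4
        ("latin-1", True),   # step 5: counts as failure on control characters
        ("cp1252", False),   # step 6
        ("latin-1", False),  # step 7: accepts any byte sequence
    ]
    for name, reject_c1 in attempts:
        try:
            text = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue         # wrong or unknown encoding, try the next one
        if reject_c1 and has_c1_controls(text):
            continue
        return text, name
    raise ValueError("no encoding matched")  # not reached: step 7 always decodes


def to_ascii(text):
    """Pure-ASCII text with character references for everything else."""
    return text.encode("us-ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    text, name = decode_html("äöüÄÖÜß".encode("cp1252"))
    print(to_ascii(text), name)
    # prints: &#228;&#246;&#252;&#196;&#214;&#220;&#223; latin-1
```

So the umlauts and ß survive as numeric character references, and the
result is plain ASCII no matter which encoding the page arrived in. A
declared encoding that turns out to be wrong or unknown simply fails to
decode and the loop falls through to the next step.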