character encoding conversion
Christian Ergh
christian.ergh at gmail.com
Mon Dec 13 03:41:56 EST 2004
Martin v. Löwis wrote:
> Dylan wrote:
>
>> Things I have tried include encode()/decode()
>
>
> This should work. If you somehow manage to guess the encoding,
> e.g. guess it as cp1252, then
>
> htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
>
> will give you a file that contains only ASCII characters, and
> character references for everything else.
>
> Now, how should you guess the encoding? Here is a strategy:
> 1. use the encoding that was sent through the HTTP header. Be
> absolutely certain to not ignore this encoding.
> 2. use the encoding in the XML declaration (if any).
> 3. use the encoding in the http-equiv meta element (if any)
> 4. use UTF-8
> 5. use Latin-1, and check that there are no characters in the
> range(128,160)
> 6. use cp1252
> 7. use Latin-1
>
> Going through steps 1 to 6 in order, check whether you manage to
> decode the input. Notice that in step 5, decoding will always
> succeed; consider it a failure if you get any control characters
> (from range(128, 160)). If step 6 fails as well, use latin-1 again
> in step 7.
>
> When you find the first encoding that decodes correctly, encode
> the result with ascii and xmlcharrefreplace, and you won't need to
> worry about the encoding anymore.
>
> Regards,
> Martin
Something like this?
Chris
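import urllib2

url = 'http://www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the page itself, how do i get the encoding of the page?
# one candidate: the charset parameter of the HTTP Content-Type header
pageencoding = f.info().getparam('charset')  # None if no charset was sent
xmlencoding = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'
f.close()

# steps 1 to 6 as (encoding, reject control characters?) pairs
candidates = [(pageencoding, False), (xmlencoding, False),
              (htmlmetaencoding, False), ('utf-8', False),
              ('latin-1', True), ('cp1252', False)]
text = None
for encoding, reject_controls in candidates:
    try:
        decoded = data.decode(encoding)
    except (LookupError, TypeError, UnicodeError):
        # missing or unknown encoding name, or bytes not valid in it
        continue
    if reject_controls:
        # step 5: latin-1 always decodes, so treat any character in
        # range(128, 160) as a failed guess
        flag = True
        for char in decoded:
            if 127 < ord(char) < 160:
                flag = False
        if not flag:
            continue
    text = decoded
    break
if text is None:
    # step 7: latin-1 again, this time accepting whatever comes out
    text = data.decode('latin-1')
data = text.encode('ascii', 'xmlcharrefreplace')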
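For what it's worth, here is a small sketch of why step 5 has to look at
range(128, 160), and what the final xmlcharrefreplace step produces. The
sample bytes and the helper name has_c1_controls are made up for
illustration; they are not part of any library, and the snippet only
assumes a byte string that happens to contain cp1252 curly quotes.

```python
# Hypothetical sample: 0x93 and 0x94 are curly quotes in cp1252, but they
# land in the control range(128, 160) when the bytes are read as Latin-1.
raw = b"\x93caf\xe9\x94"


def has_c1_controls(text):
    """Return True if any character of text falls in range(128, 160)."""
    for ch in text:
        if 128 <= ord(ch) < 160:
            return True
    return False


# Step 4: UTF-8 is strict, so a wrong guess shows up as an exception.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Step 5: Latin-1 maps every byte to a character, so decoding never fails;
# the only sign of a wrong guess is the presence of control characters.
as_latin1 = raw.decode("latin-1")
latin1_rejected = has_c1_controls(as_latin1)

# Step 6: cp1252 assigns printable characters to most of that byte range.
as_cp1252 = raw.decode("cp1252")
cp1252_rejected = has_c1_controls(as_cp1252)

# Final step: pure ASCII, with character references for everything else.
result = as_cp1252.encode("us-ascii", "xmlcharrefreplace")
print(result)  # b'&#8220;caf&#233;&#8221;' on Python 3
```

With this input UTF-8 fails outright, Latin-1 "succeeds" but is rejected
by the control-character check, and cp1252 is the first guess that
survives, which is the order the strategy above is meant to enforce.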