character encoding conversion

Mon Dec 13 03:41:56 EST 2004

Martin v. Löwis wrote:
> Dylan wrote:
> 
>> Things I have tried include encode()/decode()
> 
> 
> This should work. If you somehow manage to guess the encoding,
> e.g. guess it as cp1252, then
> 
>   htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")
> 
> will give you a file that contains only ASCII characters, and
> character references for everything else.
> 
> Now, how should you guess the encoding? Here is a strategy:
> 1. use the encoding that was sent through the HTTP header. Be
>    absolutely certain to not ignore this encoding.
> 2. use the encoding in the XML declaration (if any).
> 3. use the encoding in the http-equiv meta element (if any)
> 4. use UTF-8
> 5. use Latin-1, and check that there are no characters in the
>    range(128,160)
> 6. use cp1252
> 7. use Latin-1
> 
> In order from 1 to 6, check whether you manage to decode
> the input. Notice that in step 5, you will definitely get successful
> decoding; consider this a failure if you get any control
> characters (from range(128, 160)); then in step 7, try Latin-1
> again.
> 
> When you find the first encoding that decodes correctly, encode
> the result with ascii and xmlcharrefreplace, and you won't need to
> worry about the encoding anymore.
> 
> Regards,
> Martin

Something like this?
Chris

import urllib2

url = 'http://www.someurl.com'  # urlopen needs the scheme
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = None  # ??? (step 1, the HTTP header; still unknown)
xmlencoding = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'
f.close()

text = None
# steps 1-4: take the first encoding that decodes without an error
for enc in (pageencoding, xmlencoding, htmlmetaencoding, 'UTF-8'):
    if not enc:
        continue
    try:
        text = data.decode(enc)
        break
    except (LookupError, UnicodeError):
        pass  # unknown encoding name or undecodable bytes; try the next

if text is None:
    # step 5: latin-1 always decodes, so count it as a failure if the
    # result has control characters from range(128, 160)
    text = data.decode('latin-1')
    if [c for c in text if 128 <= ord(c) < 160]:
        try:
            text = data.decode('cp1252')  # step 6
        except UnicodeError:
            pass  # step 7: keep the latin-1 result

data = text.encode('ascii', 'xmlcharrefreplace')
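
For comparison, here is a minimal sketch of the seven quoted steps as
functions. It is written for Python 3 rather than the urllib2-era
Python above, because the bytes/str split there makes each decode
explicit. The names charset_from_content_type, has_c1_controls,
decode_html and to_ascii are made up for this illustration. Parsing the
XML declaration and the meta element is left out, so the caller is
assumed to pass those names in already extracted. The header helper
only parses a Content-Type value such as the one the HTTP response
carries; it does not fetch anything.

```python
from email.message import Message


def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def has_c1_controls(text):
    """True if text contains code points from range(128, 160)."""
    return any(0x80 <= ord(ch) < 0xA0 for ch in text)


def decode_html(data, declared=()):
    """Decode bytes by the quoted strategy; return (text, encoding_used).

    declared holds the candidates for steps 1-3 in order (HTTP header,
    XML declaration, meta element); None entries are skipped.
    """
    # Steps 1-4: the first candidate that decodes cleanly wins.
    for enc in list(declared) + ["utf-8"]:
        if not enc:
            continue
        try:
            return data.decode(enc), enc
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or undecodable bytes

    # Step 5: Latin-1 never fails, so reject it if it yields controls.
    text = data.decode("latin-1")
    if not has_c1_controls(text):
        return text, "latin-1"

    # Step 6: cp1252 maps most of 128-159 to printable characters.
    try:
        return data.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        # Step 7: fall back to the Latin-1 result after all.
        return text, "latin-1"


def to_ascii(text):
    """Encode to ASCII bytes, using character references for the rest."""
    return text.encode("ascii", "xmlcharrefreplace")


raw = b"\x93caf\xe9\x94"  # curly quotes and e-acute as cp1252 bytes
text, used = decode_html(raw, [charset_from_content_type("text/html")])
print(used, to_ascii(text))  # cp1252 b'&#8220;caf&#233;&#8221;'
```

In the sample call the header carries no charset, UTF-8 rejects the
bytes, and Latin-1 produces control characters, so the cascade settles
on cp1252. One caveat of this sketch: a declared encoding that accepts
every byte (Latin-1, for example) always wins in steps 1-3 without the
control-character check.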