how to detect the character encoding in a web page ?

python培训 51mmj.com at gmail.com
Fri Dec 28 09:30:46 EST 2012


On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding  in a web page ?
> 
> such as this page 
> 
> 
> 
> http://python.org/

First install chardet (pip install chardet), then use it like this:


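import urllib2
import chardet

# Fetch the page HTML (Python 2; line holds the URL of the page)
html_1 = urllib2.urlopen(line, timeout=120).read()
#print html_1
# chardet.detect() returns a dict such as {'encoding': 'utf-8', 'confidence': 0.99}
mychar = chardet.detect(html_1)
#print mychar
bianma = mychar['encoding']  # detected encoding name; may be None
if bianma == 'utf-8' or bianma == 'UTF-8':
    # Already UTF-8, so use the bytes as they are
    html = html_1
else:
    # Anything else is assumed to be gb2312 here and is re-encoded as UTF-8
    html = html_1.decode('gb2312', 'ignore').encode('utf-8')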
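
The snippet above treats every page that is not UTF-8 as gb2312. A more
general variant could look roughly like the sketch below. It is written for
Python 3 rather than the Python 2 used above. The names guess_encoding,
to_unicode and META_CHARSET are made up for this illustration and are not part
of chardet. The order of checks (HTTP header, then <meta> tag, then chardet) is
an assumption about what usually works, not a rule, and chardet is treated as
optional.

```python
import re

try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None

# Matches charset=... inside a <meta> tag (hypothetical helper pattern)
META_CHARSET = re.compile(br'<meta[^>]*charset=["\']?\s*([\w.:-]+)', re.I)


def guess_encoding(raw, header_charset=None, default='utf-8'):
    """Return a best-guess encoding name for the raw bytes of a page."""
    # 1. A charset from the HTTP Content-Type header, if the caller has one
    if header_charset:
        return header_charset
    # 2. A charset declared in a <meta> tag near the top of the document
    match = META_CHARSET.search(raw[:4096])
    if match:
        return match.group(1).decode('ascii')
    # 3. Statistical detection with chardet, when it is installed
    if chardet is not None:
        detected = chardet.detect(raw)['encoding']
        if detected:
            return detected
    return default


def to_unicode(raw, header_charset=None):
    """Decode page bytes to text using the guessed encoding."""
    encoding = guess_encoding(raw, header_charset)
    try:
        return raw.decode(encoding, 'replace')
    except LookupError:
        # Unknown encoding name: fall back to UTF-8
        return raw.decode('utf-8', 'replace')
```

With urllib.request, the header charset could be taken from
response.headers.get_content_charset() and passed in as header_charset. The
result of to_unicode() is a str, which can then be encoded to UTF-8 if bytes
are needed.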


