How to detect the character encoding of a web page?

Kurt Mueller kurt.alfred.mueller at gmail.com
Mon Dec 24 03:34:16 EST 2012


On 24.12.2012 at 04:03, iMath wrote:
> But how can I let Python do it for me?
> For example, take these 2 pages:
> http://python.org/
> http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
> How can I detect the character encoding of these 2 pages with Python?


If you have the HTML source, let
chardetect.py
make an educated guess for you.

http://pypi.python.org/pypi/chardet

Example:
$ wget -q -O - http://python.org/ | chardetect.py 
stdin: ISO-8859-2 with confidence 0.803579722043
$ 

$ wget -q -O - 'http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx' | chardetect.py 
stdin: utf-8 with confidence 0.87625
$ 
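
The same guess can be made from inside Python. What follows is only a
sketch, not a definitive recipe. It assumes Python 3 and that the chardet
package from the link above is installed. The names guess_encoding,
guess_encoding_of_url and META_CHARSET are made up for this example. As an
addition that goes beyond chardetect.py, it first looks for a charset the
page declares itself (HTTP header or <meta> tag) and only falls back to
chardet.detect() when nothing is declared.

```python
import re
import urllib.request

try:
    import chardet  # third-party package from PyPI
except ImportError:
    chardet = None  # the sketch still works for pages that declare a charset

# Hypothetical helper pattern: finds charset=... inside a <meta> tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w.:-]+)', re.IGNORECASE)


def guess_encoding(raw, header_charset=None):
    """Return (encoding, source) for the raw bytes of a page."""
    # 1. A charset announced in the HTTP Content-Type header.
    if header_charset:
        return header_charset.lower(), "http-header"
    # 2. A charset declared in a <meta> tag near the top of the page.
    match = META_CHARSET.search(raw[:4096])
    if match:
        return match.group(1).decode("ascii").lower(), "meta-tag"
    # 3. No declaration: let chardet make an educated guess.
    if chardet is not None:
        result = chardet.detect(raw)  # dict with 'encoding' and 'confidence'
        return result["encoding"], "chardet"
    return None, "unknown"


def guess_encoding_of_url(url):
    """Fetch url and guess its encoding (needs network access)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        header_charset = response.headers.get_content_charset()
    return guess_encoding(raw, header_charset)
```

Calling guess_encoding_of_url('http://python.org/') would then return a
pair such as (encoding, source). If only the statistical guess is wanted,
chardet.detect(raw) alone gives the encoding and the confidence that
chardetect.py prints.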


Greetings
-- 
kurt.alfred.mueller at gmail.com



