How to detect the character encoding in a web page?

Chris Angelico rosuav at gmail.com
Wed Jun 5 13:55:11 EDT 2013


On Thu, Jun 6, 2013 at 1:14 AM, iMath <redstone-cold at 163.com> wrote:
> On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
>> how to detect the character encoding  in a web page ?
>>
>> such as this page
>>
>>
>>
>> http://python.org/
>
> By the way, we cannot get the character encoding programmatically from the meta data without knowing the character encoding ahead of time!

The rules for web pages are (massively oversimplified):

1) HTTP header
2) ASCII-compatible encoding and meta tag

The HTTP header is completely out of band. This is the best way to
transmit encoding information. Otherwise, you assume 7-bit ASCII and
start parsing. Once you find a meta tag that declares a charset, you
stop parsing and go back to the top, decoding with the declared
encoding. "ASCII-compatible" covers a huge number of encodings, so
the tag itself is readable before you know the real encoding, and
it's not actually much of a problem to do this.
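
Here's a rough sketch of those two steps in Python, in case it helps.
It only looks at the raw bytes, so no decoding is needed to find the
meta tag. The function names, the regexes, the 4096-byte scan limit
and the "ascii" fallback are all my own assumptions for illustration;
real browsers follow a much more involved sniffing algorithm.

```python
import re

# Finds "charset=..." in a Content-Type value or inside a <meta> tag.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)",
                         re.IGNORECASE)
# Finds whole <meta ...> tags in the raw, still-undecoded bytes.
_META_RE = re.compile(rb"<meta\b[^>]*>", re.IGNORECASE)


def charset_from_header(content_type):
    """Step 1: charset parameter of the HTTP Content-Type header, or None."""
    if not content_type:
        return None
    m = _CHARSET_RE.search(content_type.encode("ascii", "ignore"))
    return m.group(1).decode("ascii").lower() if m else None


def charset_from_meta(body, limit=4096):
    """Step 2: scan the first `limit` bytes for a meta tag naming a charset.

    This works on bytes because, in any ASCII-compatible encoding, the
    tag itself is made of plain ASCII bytes.
    """
    for tag in _META_RE.finditer(body[:limit]):
        m = _CHARSET_RE.search(tag.group(0))
        if m:
            return m.group(1).decode("ascii").lower()
    return None


def detect_encoding(content_type, body, default="ascii"):
    """Header first, then meta tag, then the assumed fallback."""
    return (charset_from_header(content_type)
            or charset_from_meta(body)
            or default)
```

You'd call it as detect_encoding(header_value, raw_bytes) and then do
raw_bytes.decode(result). With urllib.request, the header value would
come from response.headers.get("Content-Type") and the bytes from
response.read().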

ChrisA


