save gb-2312 web page in a .html file

"Martin v. Löwis" martin at v.loewis.de
Wed Dec 26 18:33:15 EST 2007


> .read() returns the bytes exactly how it downloads them. It doesn't
> interpret them. If those bytes are GB-2312-encoded text, that's what
> they are. There's no need to reencode them. Just .write(page) (of
> course, this way you don't verify that it's correct).
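
A minimal sketch of the point quoted above, assuming Python 3 (where
bytes and str are distinct types). The helper name save_raw and the
sample data are made up for illustration. The bytes are written in
binary mode, so nothing is decoded or re-encoded on the way to disk.

```python
def save_raw(page: bytes, path: str) -> None:
    """Write downloaded bytes to disk unchanged (hypothetical helper)."""
    # Binary mode ("wb") avoids implicit text encoding and newline translation.
    with open(path, "wb") as f:
        f.write(page)


# Stand-in for the result of urlopen(...).read(): GB2312-encoded bytes.
sample = "中文".encode("gb2312")
```

Calling save_raw(sample, "page.html") would store the bytes exactly as
received; as the quoted text notes, nothing here checks that they
really are valid GB2312.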

Alternatively, if the page is *not* gb-2312, you must first *decode*
it from its original encoding and then *encode* it as gb-2312. If the
original encoding is windows-1252, for example, you do

  page = page.decode("windows-1252")
  page = page.encode("gb2312")  # Python's codec name has no hyphen
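
The same two steps can be wrapped in a small function. This is only a
sketch, assuming Python 3; the name recode and its defaults are made up
for illustration. Note that encoding raises UnicodeEncodeError when the
text contains a character that GB2312 cannot represent.

```python
def recode(page: bytes, source: str = "windows-1252",
           target: str = "gb2312", errors: str = "strict") -> bytes:
    """Decode bytes from `source`, then encode them as `target` (hypothetical helper)."""
    text = page.decode(source)           # bytes -> str
    return text.encode(target, errors)   # str -> bytes in the new encoding


# b"caf\xe9" is "café" in windows-1252; GB2312 has an é, so this succeeds.
converted = recode(b"caf\xe9")

# The euro sign (0x80 in windows-1252) is not in GB2312, so strict mode fails.
try:
    recode(b"\x80")
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False
```

Passing errors="replace" or errors="xmlcharrefreplace" instead of the
default trades the exception for substituted characters; which is
appropriate depends on the page.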

Of course, for HTML, that may be tricky, as the file may include
an encoding declaration (an XML declaration or a <meta http-equiv>
element). So if you recode it, you might have to change such
declarations as well.
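
One rough way to keep such declarations in sync is to patch them in the
decoded text before encoding it again. The sketch below assumes
Python 3; recode_html and both regular expressions are illustrative
only and will miss unusual markup, so a real HTML parser would be the
more robust choice.

```python
import re

# Rough patterns for <meta ... charset=...> and <?xml ... encoding="..."?>.
META_CHARSET = re.compile(r"""(<meta\b[^>]*?charset\s*=\s*["']?)[\w.:-]+""",
                          re.IGNORECASE)
XML_ENCODING = re.compile(r"""(<\?xml\b[^>]*?encoding\s*=\s*["'])[\w.:-]+""",
                          re.IGNORECASE)


def recode_html(page: bytes, source: str, target: str = "gb2312") -> bytes:
    """Recode a page and rewrite its encoding declarations (hypothetical helper)."""
    text = page.decode(source)
    # Keep everything up to the encoding name, then substitute the new name.
    text = META_CHARSET.sub(lambda m: m.group(1) + target, text)
    text = XML_ENCODING.sub(lambda m: m.group(1) + target, text)
    return text.encode(target)


html = (b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=windows-1252"><p>caf\xe9</p>')
result = recode_html(html, "windows-1252")
```

An encoding announced by the server in the HTTP Content-Type header is
outside the file and is not touched by this.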

Regards,
Martin


