save gb-2312 web page in a .html file
"Martin v. Löwis"
martin at v.loewis.de
Wed Dec 26 18:33:15 EST 2007
> .read() returns the bytes exactly how it downloads them. It doesn't
> interpret them. If those bytes are GB-2312-encoded text, that's what
> they are. There's no need to reencode them. Just .write(page) (of
> course, this way you don't verify that it's correct).
Alternatively, if the page is *not* gb-2312, you must first *decode*
it from its original encoding and then encode the result. If the
original encoding is, say, windows-1252, you do
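    page = page.decode("windows-1252")   # original bytes -> Unicode text
    page = page.encode("gb2312")         # Unicode text -> GB2312 bytes; Python spells the codec without a hyphen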
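As a rough sketch of that round trip, assuming Python 3 (where the
downloaded page is a bytes object): the helper names recode and
save_page are hypothetical, and the choice of error policy is an
assumption of mine, not something the steps above prescribe.

```python
def recode(data, source="windows-1252", target="gb2312", errors="strict"):
    """Decode bytes from `source`, then re-encode the text as `target`.

    Hypothetical helper, not a library function. With errors="strict",
    a character the target encoding cannot represent raises
    UnicodeEncodeError; errors="xmlcharrefreplace" writes it as a
    numeric character reference instead, which is harmless in HTML text.
    """
    text = data.decode(source)            # bytes -> str
    return text.encode(target, errors)    # str -> bytes


def save_page(path, data):
    """Write already-encoded bytes to `path` without further conversion."""
    with open(path, "wb") as f:           # binary mode: no implicit re-encoding
        f.write(data)
```

For example, save_page("page.html", recode(page)) would store the
recoded bytes. Note that not every windows-1252 character exists in
GB2312 (the euro sign is one that does not), so strict mode can fail
on real pages.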
Of course, for HTML, that may be tricky, as the file may include
an encoding declaration (an XML declaration or a meta http-equiv
element). So if you recode it, you might have to change such
declarations as well.
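One way to adjust such declarations, sketched under the assumption
that a regular expression is good enough for the page at hand (a real
HTML parser would be more robust). It should run on the decoded text,
before encoding; the function name replace_declared_encoding is
hypothetical.

```python
import re

# charset=... inside a <meta ...> tag; covers the http-equiv form
# and the shorter <meta charset="..."> form.
_META_CHARSET = re.compile(
    r'(<meta\b[^>]*?\bcharset\s*=\s*["\']?)([\w.:-]+)', re.IGNORECASE)

# encoding="..." inside an <?xml ...?> declaration.
_XML_ENCODING = re.compile(
    r'(<\?xml\b[^>]*?\bencoding\s*=\s*["\'])([\w.:-]+)', re.IGNORECASE)


def replace_declared_encoding(text, new_name):
    """Return `text` with any declared encoding name replaced by `new_name`.

    Pages without a declaration are returned unchanged.
    """
    text = _META_CHARSET.sub(lambda m: m.group(1) + new_name, text)
    text = _XML_ENCODING.sub(lambda m: m.group(1) + new_name, text)
    return text
```

This is only a sketch: it does not add a declaration where none
exists, and it will also rewrite a matching string that happens to sit
inside a comment or script.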
Regards,
Martin