Need help with file encoding-decoding

Philip Semanchuk philip at semanchuk.com
Fri Sep 23 09:14:15 EDT 2011


On Sep 23, 2011, at 7:44 AM, Yaşar Arabacı wrote:

> Hi,
> 
> I'm trying to write a mass HTML downloader that processes files after it
> has downloaded them. I have problems with encoding and decoding. Sometimes I
> get UnicodeDecodeErrors, or I get half-pages in the after-processing part.
> Or, more generally, some things don't feel right. Can you check my approach
> and give me some feedback, please? Here is what I am doing.
> 
> 1) send a HEAD request to file's source to get file encoding, set encoding
> variable accordingly.

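Hi Yaşar,
This is a pretty optimistic algorithm. At least by the statistics from 2008 (see below), most of the pages I crawled declared their encoding in a META tag rather than in the HTTP headers.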
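
For what it's worth, pulling the charset out of a Content-Type header value could look roughly like the sketch below. It assumes Python 3, the function name charset_from_content_type is made up for illustration, and it leans on the standard library's email.message.Message to parse the header parameters. It returns None when no charset is present, which is a case your code has to expect.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset from a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when there is no charset parameter.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html"))
```

The header value itself would come from whatever HTTP client you use for the HEAD request.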


> 2) if server doesn't provide an encoding, set encoding variable as utf-8

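This is statistically a good guess, but it doesn't follow the HTTP specification: HTTP/1.1 (RFC 2616, section 3.7.1) makes ISO-8859-1 the default for text/* responses sent without a charset parameter.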
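
One way to hedge your bets is to try the declared encoding first and fall back to a default when it is missing, unknown to Python, or simply wrong for the bytes. This is only a sketch, assuming Python 3; the function name decode_with_fallback and the choice of ISO-8859-1 as the default (the HTTP/1.1 default for text/* in RFC 2616) are my assumptions, not something your code has to adopt.

```python
def decode_with_fallback(raw, declared=None, default="iso-8859-1"):
    """Return (text, encoding_used) for the raw bytes of a page."""
    for candidate in (declared, default):
        if not candidate:
            continue
        try:
            return raw.decode(candidate), candidate
        except (LookupError, UnicodeDecodeError):
            # LookupError: Python does not know this encoding name.
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            continue
    # Only reached if a caller-supplied default also fails:
    # keep going, but mark the undecodable bytes.
    return raw.decode("utf-8", errors="replace"), "utf-8 (with replacements)"
```

Returning the encoding that actually worked matters for your later steps, because re-encoding the processed page has to use that same encoding, not the one the server claimed.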


> 4) in this step, I need to parse the content I get, because I will search
> for further links. I feed content to parser (subclass of
> HTMLParser.HTMLParser) like

Does HTMLParser.HTMLParser handle broken HTML? Because there's lots of it out there.
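
A quick way to find out is to feed the parser some deliberately malformed markup and see what comes out. The sketch below assumes Python 3, where the module is called html.parser (your code uses the Python 2 name, HTMLParser); the LinkCollector class and the sample markup are made up for illustration. The sample has an unclosed <a>, an unclosed <p>, an unquoted attribute value, and no closing </html>.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Deliberately broken sample markup (hypothetical).
BROKEN = '<html><body><a href="a.html">one<p><a href=b.html>two</body>'

collector = LinkCollector()
collector.feed(BROKEN)
collector.close()
print(collector.links)
```

If links go missing on real pages, that tells you the parser, not the encoding, is dropping content.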

I used to run an automated site validator, and I wrote a couple of articles you might find interesting. One is about how to get the encoding of a Web page:
http://NikitaTheSpider.com/articles/EncodingDivination.html

I also wrote an article examining the statistics I'd seen run through the crawler/validator. One thing I saw was that almost two-thirds of Web pages specified the encoding in the META HTTP-EQUIV Content-Type tag rather than in the HTTP Content-Type header. Mind you, this was three years ago, so the character of the Web has likely changed since then, but probably not too dramatically.
http://NikitaTheSpider.com/articles/ByTheNumbers/fall2008.html
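
Since so many pages declare the encoding only in a META tag, it is worth looking for one in the raw bytes before decoding. Here is a rough sketch, assuming Python 3; the function name charset_from_meta, the regular expression, and the 4096-byte scan limit are my own choices for illustration, and a regex like this is far cruder than a real HTML-aware sniffing algorithm.

```python
import re

# Matches both <meta http-equiv="Content-Type" content="...; charset=X">
# and the shorter <meta charset="X"> form.
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def charset_from_meta(raw, scan_limit=4096):
    """Return the lowercased charset named in a META tag, or None."""
    match = _META_CHARSET.search(raw[:scan_limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```

It works on bytes on purpose: you cannot decode the page to text until you know the encoding, and the tag itself is plain ASCII in most encodings you will meet.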

You can also do some straightforward debugging. Save the raw bytes you get from each site, and when you encounter a decode error, check the raw bytes. Are they really in the encoding specified? Webmasters make all kinds of mistakes. 
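
As a sketch of that kind of debugging, assuming Python 3 and a made-up helper name (describe_decode_error), something like this reports where decoding fails and shows the bytes around the failure:

```python
def describe_decode_error(raw, encoding, context=16):
    """Return details about the first decode failure, or None if raw decodes."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as err:
        lo = max(err.start - context, 0)
        hi = err.end + context
        return {
            "offset": err.start,     # index of the first bad byte
            "reason": err.reason,    # the codec's own explanation
            "nearby": raw[lo:hi],    # raw bytes surrounding the failure
        }
    # Note: an unknown encoding name raises LookupError, not caught here.
    return None
```

Run it against the saved raw bytes of a page that failed, and compare the "nearby" bytes with what the page is supposed to say at that spot.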


Hope this helps
Philip



> this -> content.decode(encoding)
> 5) open a file in binary mode: open(file_path,"wb")
> 6) I write as I read, without modifying.
> 
> ##########
> # After processing part....
> ##########
> 
> (Note: encoding variable is same as the downloading part)
> 
> 1) open local file in binary mode for reading: file_name =
> open(file_path,"rb")
> 2) decode the file contents into a variable => decoded_content =
> file_name.read().decode(encoding)
> 3) send decoded content to a parser; the parser constructs new HTML content
> (as str)
> 4) open same file for writing, in binary mode, write the parser's output like
> this: file_name.write(parser.output.encode(encoding))
> -- 
> http://yasar.serveblog.net/
> -- 
> http://mail.python.org/mailman/listinfo/python-list



