Need help with file encoding-decoding

Yaşar Arabacı yasar11732 at gmail.com
Fri Sep 23 07:44:27 EDT 2011


Hi,

I'm trying to write a mass HTML downloader that processes files after
downloading them. I have problems with encoding and decoding: sometimes I
get UnicodeDecodeErrors, or I end up with half pages after the
post-processing step. More generally, some things don't feel right. Can you
check my approach and give me some feedback, please? Here is what I am doing.

1) Send a HEAD request to the file's source to get the file encoding, and set
the encoding variable accordingly.
2) If the server doesn't provide an encoding, set the encoding variable to
utf-8.
3) Read the HTML page from the internet into a variable, let's say content.
4) Parse the content, because I need to search it for further links. I feed
the content to a parser (a subclass of HTMLParser.HTMLParser) like this:
content.decode(encoding)
5) Open a file in binary mode: open(file_path, "wb")
6) Write the content as I read it, without modifying it.
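
A minimal sketch of steps 1, 2 and 4 above, assuming Python 3 (where the
HTMLParser module is named html.parser). The names charset_from_content_type,
LinkCollector and extract_links are made up for illustration and are not from
my actual code; the network request itself is left out, so the functions take
the Content-Type header value and the raw bytes as arguments.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Falls back to `default` (utf-8, as in step 2) when the header is
    missing or carries no charset parameter.
    """
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None.
    return msg.get_content_charset() or default


class LinkCollector(HTMLParser):
    """Hypothetical parser subclass that collects href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(content, encoding):
    """Decode the raw bytes (step 4) and return the links found in them.

    A strict decode raises UnicodeDecodeError when the bytes are not
    valid in the given encoding.
    """
    parser = LinkCollector()
    parser.feed(content.decode(encoding))
    parser.close()
    return parser.links
```

The raw bytes in content stay untouched here, so they can still be written
to disk unchanged as in steps 5 and 6; only the copy handed to the parser is
decoded.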

##########
# Post-processing part
##########

(Note: the encoding variable is the same as in the downloading part.)

1) Open the local file in binary mode for reading: file_name =
open(file_path, "rb")
2) Decode the file contents into a variable: decoded_content =
file_name.read().decode(encoding)
3) Send the decoded content to a parser; the parser constructs new HTML
content (as str).
4) Open the same file for writing, in binary mode, and write the parser's
output like this: file_name.write(parser.output.encode(encoding))
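
The four steps above can be sketched roughly like this, again assuming
Python 3. rewrite_file and its transform argument are hypothetical names;
transform stands in for whatever the parser does to produce its output. The
read handle is closed before the file is reopened for writing.

```python
def rewrite_file(file_path, encoding, transform):
    """Read a file as bytes, decode, transform the text, encode, write back.

    `transform` is a placeholder for the parser step: it takes the decoded
    text (str) and returns the new HTML content (str).
    """
    # Steps 1 and 2: open in binary mode and decode with the known encoding.
    with open(file_path, "rb") as f:
        decoded_content = f.read().decode(encoding)

    # Step 3: build the new content from the decoded text.
    new_content = transform(decoded_content)

    # Step 4: reopen the same path in binary mode and write encoded bytes.
    with open(file_path, "wb") as f:
        f.write(new_content.encode(encoding))
```

Because the same encoding is used for decoding and encoding, text that the
transform leaves alone comes back as the same bytes it was read from.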
-- 
http://yasar.serveblog.net/