urllib2.urlopen(url) pulling something other than HTML
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Mon Aug 20 16:18:00 EDT 2007
On 20 Aug, 15:44, "dogatemycompu... at gmail.com"
<dogatemycompu... at gmail.com> wrote:
> ----------------------------------------------------------
> f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
> parser = htmllib.HTMLParser(f)
> parser.feed(html)
> parser.close()
> return parser.anchorlist
> ----------------------------------------------------------
The htmllib.HTMLParser class is hard to use. I would replace those
lines with:
from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.anchorlist = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.anchorlist.append(href)

parser = MyHTMLParser()
parser.feed(htmltext)
print parser.anchorlist
The anchorlist attribute, which I define here myself, is a list containing
every href value found in the page's <a> tags.
See <http://docs.python.org/lib/module-HTMLParser.html>
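For anyone reading this on a newer interpreter, here is a minimal sketch of the same idea, assuming Python 3, where the module is html.parser. The names AnchorParser and extract_links and the sample markup are made up for illustration; they are not part of any library.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []  # our own attribute, not something HTMLParser provides

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.anchorlist.append(href)


def extract_links(htmltext):
    """Hypothetical helper: parse htmltext and return the collected hrefs."""
    parser = AnchorParser()
    parser.feed(htmltext)
    parser.close()
    return parser.anchorlist


# Illustrative input only: two links and one anchor without an href.
sample = ('<p><a href="/jobs">Jobs</a> <a name="top">Top</a> '
          '<A HREF="http://example.com/">Example</A></p>')
links = extract_links(sample)
print(links)
```

The anchor without an href is skipped, and the upper-case tag is still matched because the parser normalizes names before calling handle_starttag.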
> I get the idea that we're allocating some memory that looks like a
> file so formatter.dumbwriter can manipulate it. The results are
> passed to formatter.abstractformatter which does something else to the
> HTML code. The results are then passed to "f" which is then passed to
> htmllib.HTMLParser so it can parse the html for links. I guess I
> don't understand with any great detail as to why this is happening.
> I know someone is going to say that I should RTFM so here is the gist
> of the documentation:
Don't even try to understand it - it's a mess. Use the HTMLParser
module instead.
> The last question is.. I can't find any documentation to explain
> where the "anchorlist" attribute came from? Here is the only
> reference to this attribute that I can find anywhere in the Python
> documentation.
And that's all you will find.
> So .. How does an average developer figure out that parser returns a
> list of hyperlinks in an attribute called anchorlist? Is this
Usually, those attributes are hyperlinked and you can find them in the
documentation index. Not for this one :(
> something that you just "figure out" or is there some book I should be
> reading that documents all of the attributes for a particular
> method? It just seems a bit obscure and certainly not something I
> would have figured out on my own. Does this make me a poor developer
> who should find another hobby? I just need to know if there is
> something wrong with me or if this is a reasonable question to ask.
It's a very reasonable question. The attribute should be documented
properly. But the class itself is a bit old; I never use it anymore.
> The last question I have is about debugging. The spider is capable
> of parsing links until it reaches:
>
> "html = get_page(http://www.google.com/jobs/fortune)" which returns
> the contents of a pdf document, assigns the pdf contents to html which
> is later passed to parser.feed(html) which crashes.
You can verify the Content-Type header before processing. Quoting the
get_page method:
> def get_page(url, log):
>     """Retrieve URL and return comments, log errors."""
>     try:
>         page = urllib2.urlopen(url)
>     except urllib2.URLError:
>         log("Error retrieving: " + url)
>         return ''
>     body = page.read()
>     page.close()
>     return body
From <http://docs.python.org/lib/module-urllib2.html>: urlopen returns a
file-like object with an additional info() method, which returns the
response headers. You can get the Content-Type using
page.info().gettype(); for an HTML page it should be text/html (or
application/xhtml+xml for XHTML). For any other type, just return '' as
you do for any error.
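The Content-Type check could look roughly like the sketch below. It assumes Python 3, where urllib2 became urllib.request and the headers object offers get_content_type() in place of gettype(). The HTML_TYPES set, the is_html_type helper, and the decoding step are my own additions for illustration, not part of the original get_page.

```python
import urllib.error
import urllib.request

# Media types treated as HTML. This set is an assumption; adjust as needed.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_html_type(content_type):
    """Hypothetical helper: True if the media type is one we want to parse."""
    return content_type.lower() in HTML_TYPES


def get_page(url, log):
    """Retrieve URL and return its text; return '' on error or non-HTML content."""
    try:
        page = urllib.request.urlopen(url)
    except urllib.error.URLError:
        log("Error retrieving: " + url)
        return ''
    try:
        headers = page.info()
        # Check the declared type before reading, so a PDF is never downloaded.
        if not is_html_type(headers.get_content_type()):
            log("Skipping non-HTML content: " + url)
            return ''
        # Assumes the declared charset, if any, is one Python knows about.
        charset = headers.get_content_charset() or "utf-8"
        return page.read().decode(charset, errors="replace")
    finally:
        page.close()
```

With this in place, a URL that serves a PDF yields '' just like a failed request, so nothing but HTML text ever reaches parser.feed(). The check trusts the server's header, so a mislabeled response can still slip through.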
--
Gabriel Genellina