urllib2.urlopen(url) pulling something other than HTML

Mon Aug 20 15:38:27 EDT 2007

"dogatemycomputer at gmail.com" <dogatemycomputer at gmail.com> writes:
[...]
> ----------------------------------------------------------
>     f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
>     parser = htmllib.HTMLParser(f)
>     parser.feed(html)
>     parser.close()
>     return parser.anchorlist
> ----------------------------------------------------------
>
> I get the idea that we're allocating some memory that looks like a
> file so formatter.dumbwriter can manipulate it.

Don't worry too much about memory.  A fresh "StringIO()" probably only
allocates the memory needed for the "bookkeeping" that StringIO does
for its own internal purposes, not the memory needed to store any
text.  Later, when you use the object, Python will dynamically (== at
run time) allocate the necessary memory each time the .write() method
is called on the StringIO instance.  Python handles the memory
allocation for you -- though of course the code you write affects how
much memory gets used.
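
A minimal sketch of that behaviour, written for Python 3 (where the
class is io.StringIO; the code in this thread is Python 2).  The
variable names are only for illustration:

```python
from io import StringIO

buf = StringIO()                      # a file-like object that lives in memory
empty_length = len(buf.getvalue())    # nothing has been written to it yet

buf.write("some formatted text\n")    # storage grows as .write() is called
buf.write("more text\n")
contents = buf.getvalue()             # everything written so far, as one string
```

Nothing about the eventual size has to be declared up front; the
buffer simply holds whatever has been written to it.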

Note:

 - The StringIO is where the *output* goes: the formatted text that
   the writer produces, not your input HTML.

 - The formatter.DumbWriter likely doesn't do anything with the
   StringIO() at the time it's passed (it hasn't even seen your HTML
   yet, so how could it?).  Instead, it just squirrels away the
   StringIO() for later use.

> The results are
> passed to formatter.abstractformatter which does something else to the
> HTML code.

Again, nothing much happens right away on the "f = ..." line.  The
formatter.AbstractFormatter just keeps the writer so it can use it
to write out formatted text later on.

> The results are then passed to "f" which is then passed to

The results are not "passed" to f.  Instead, the result (the new
AbstractFormatter object) is given a name, "f".  You can give a single
object as many names as you like.
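
A small sketch of what "giving an object a name" means, using a
StringIO as the example object (Python 3 spelling; the names f and g
are arbitrary):

```python
from io import StringIO

f = StringIO()        # create one object and give it the name "f"
g = f                 # a second name for the same object, not a copy

g.write("hello")      # write through one name...
same_object = f is g  # both names refer to the very same object
seen_through_f = f.getvalue()   # ...and the text is visible through the other
```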

> htmllib.HTMLParser so it can parse the html for links.   I guess I

htmllib.HTMLParser wants the formatter so it can format output
(e.g. you might want to write out the same page with some of the links
removed).  It doesn't need the formatter to parse the HTML.
HTMLParser itself is responsible for the parsing -- as the name
implies.

> don't understand with any great detail as to why this is happening.
> I know someone is going to say that I should RTFM so here is the gist
> of the documentation:
>
> formatter.DumbWriter = "This class is suitable for reflowing a
> sequence of paragraphs."
> formatter.AbstractFormatter = "The standard formatter. This
> implementation has demonstrated wide applicability to many writers,
> and may be used directly in most circumstances. It has been used to
> implement a full-featured World Wide Web browser." <-- huh?

The web browser in question was called "Grail".  Grail has been
resting for some time now.  By today's standards, "full-featured" is a
bit of a stretch.

But I wouldn't worry too much about what they're trying to say there
yet (it has to do with the way the formatter.AbstractFormatter class
is structured, not what it actually does "out of the box").

> So.. What is dumbwriter and abstractformatter doing with this HTML and
> why does it need to be done before parser.feed() gets a hold of it?

The "heavy lifting" only really starts when you call parser.feed().
Before that, you're just setting the stage.
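
To see the "setting the stage" point in isolation, here is a rough
Python 3 sketch.  It does not use htmllib or formatter (neither exists
in current Python 3); TextDumper is a made-up class standing in for
the whole parser/formatter/writer chain, built on html.parser:

```python
from html.parser import HTMLParser
from io import StringIO


class TextDumper(HTMLParser):
    """Hypothetical stand-in: a parser that writes the page text to a file-like object."""

    def __init__(self, out):
        super().__init__()
        self.out = out              # just remember the object for later use

    def handle_data(self, data):
        self.out.write(data)        # the real work happens here, during feed()


out = StringIO()
parser = TextDumper(out)            # setting the stage: nothing is parsed yet
before_feed = out.getvalue()

parser.feed("<p>Hello, <b>world</b></p>")
parser.close()
after_feed = out.getvalue()         # the text only shows up after feed()
```

Constructing the objects only wires them together; the output buffer
stays empty until feed() is called.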

> The last question is..   I can't find any documentation to explain
> where the "anchorlist" attribute came from?   Here is the only
> reference to this attribute that I can find anywhere in the Python
> documentation.
>
> ----------------------
> anchor_bgn(  	href, name, type)
>     This method is called at the start of an anchor region. The
> arguments correspond to the attributes of the <A> tag with the same
> names. The default implementation maintains a list of hyperlinks
> (defined by the HREF attribute for <A> tags) within the document. The
> list of hyperlinks is available as the data attribute anchorlist.
> ----------------------

That is indeed the (only) documentation for .anchorlist.  What more
were you expecting to see?

> So ..  How does an average developer figure out that parser returns a
> list of hyperlinks in an attribute called anchorlist?  Is this

They keep the Library Reference under their pillow :-)

And strictly it doesn't *return* a list of links.  And that's
certainly not HTMLParser's main function in life.  It merely makes
such a list available as a convenience.  In fact, many people instead
use module sgmllib, which provides no such convenience, but otherwise
does the same parsing work as module htmllib.
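
If you use a parser that offers no such convenience, collecting the
links yourself is not much work.  A sketch for Python 3's html.parser
module (not sgmllib or htmllib themselves); the class name
AnchorCollector is made up, and the attribute is called anchorlist
only to mirror htmllib:

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical parser that records the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.anchorlist = []        # filled in as a side effect of parsing

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.anchorlist.append(value)


parser = AnchorCollector()
parser.feed('<a href="http://example.com/">one</a> '
            '<A HREF="/two">two</A> <a name="x">no link</a>')
parser.close()
links = parser.anchorlist           # read the attribute; nothing is "returned"
```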

> something that you just "figure out" or is there some book I should be
> reading that documents all of the attributes for a particular
> method?   It just seems a bit obscure and certainly not something I
> would have figured out on my own.  Does this make me a poor developer
> who should find another hobby?   I just need to know if there is
> something wrong with me or if this is a reasonable question to ask.

But you *did* figure it out.  How else is it that you come to be
explaining it to us?

Keep in mind that *nobody* knows all of the standard library.  I've
been writing Python code full time for years, and I often bump into
whole standard library modules whose existence I'd forgotten about, or
was never really aware of in the first place.  The more you know about
what it can do, the more convenience you'll get out of it, is all.

> The last question I have is about debugging.   The spider is capable
> of parsing links until it reaches:
>
> "html = get_page(http://www.google.com/jobs/fortune)" which returns
> the contents of a pdf document, assigns the pdf contents to html which
> is later passed to parser.feed(html) which crashes.
[...]
> How would an experienced python developer check the contents of "html"
> to make sure its not something else other than a blob of HTML code?  I
> should note an obviously catch-22..   How do I check the HTML in such
> a way that the check itself doesn't possibly crash the app?  I thought
> about:
>
> try:
>     parser.feed(html)
> except parser.HTMLParseError:
>     parser.close()
>
>
> .... but i'm not sure if that is right or not?  The app still crashes
> so obviously i'm doing something wrong.

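That kind of idea is often the best way.  Note that the exception
class is htmllib.HTMLParseError: it lives in the module, not on the
parser instance, so "except parser.HTMLParseError" raises an
AttributeError of its own as soon as any exception reaches it.  In
this case, though, you probably want to do an up-front check by
looking at the HTTP Content-Type header (Google for it), something
like this: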

import urllib2

response = urllib2.urlopen(url)
html = response.read()
# gettype() gives the bare media type, without "; charset=..." parameters
if response.info().gettype() == "text/html":
    parse(html)
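
For anyone on Python 3, where urllib2 has become urllib.request, the
same up-front check might look roughly like the sketch below.  The
helper names is_html and fetch_html are made up for illustration.
Keep in mind that a server can mislabel its content, so this is a
filter, not a guarantee:

```python
from email.message import Message
from urllib.request import urlopen


def is_html(content_type_header):
    """Return True if a Content-Type header value names text/html."""
    if not content_type_header:
        return False
    msg = Message()
    msg["Content-Type"] = content_type_header
    # get_content_type() lower-cases the type and drops "; charset=..." etc.
    return msg.get_content_type() == "text/html"


def fetch_html(url):
    """Return the body if the server says it is HTML, otherwise None."""
    with urlopen(url) as response:
        # response.headers offers the same get_content_type() method.
        if response.headers.get_content_type() != "text/html":
            return None
        return response.read()
```

With these, is_html("text/html; charset=UTF-8") is true and
is_html("application/pdf") is false, so a PDF is skipped before it
ever reaches the parser.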

John