HTMLparsing abnormal html pages

Mark Pilgrim f8dy at diveintopython.org
Fri Mar 16 19:30:53 EST 2001


in article 98ua8j$a6h$1 at panix2.panix.com, Aahz Maruch at aahz at panix.com
wrote on 3/16/01 7:14 PM:

> In article <98pvp1$15t$1 at news.netmar.com>,  <asle at spam.com> wrote:
>> 
>> Considering the small program below. Running it will show that the
>> HTMLparser
>> is truncating urls in the HTML page. Now, most of you will probably say that
>> the page and in particular the URL's of this page are not valid according to
>> the RFC1738 protocol --bad luck. But there must be a work-around for this?
> 
> For this specific case, Mark's solution may well work (haven't tested it
> myself).  But you cannot easily find a generic solution because of all
> the different ways to mangle HTML.

My solution does work, but only because the page the original poster was
trying to parse had unquoted attribute values, like <a href=index.html>.
sgmllib works fine in these cases.

In fact, the BaseHTMLProcessor class I define in my book can be used to
properly quote all attribute values, since it works by breaking down the
entire HTML (via sgmllib) and building up equivalent HTML with proper quotes
around attribute values.  This is decidedly *not* why it was written; I just
happened to notice it one day when I was testing the code for other reasons.
  http://diveintopython.org/dialect_divein.html
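
To illustrate the break-down-and-rebuild idea, here is a minimal sketch. It
is not the BaseHTMLProcessor code from the book. It assumes a modern Python
where sgmllib is no longer available, so it swaps in html.parser from the
standard library instead. The names AttributeQuoter and quote_attributes are
made up for this example.

```python
from html import escape
from html.parser import HTMLParser


class AttributeQuoter(HTMLParser):
    """Hypothetical stand-in, not the BaseHTMLProcessor from the book.

    It collects every piece of the parsed document and re-emits it with
    double quotes around all attribute values.
    """

    def __init__(self):
        # convert_charrefs=False makes entity and character references
        # arrive through their own handlers so they pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def _format_attrs(self, attrs):
        out = []
        for name, value in attrs:
            if value is None:
                # Bare attribute with no value, such as <input disabled>.
                out.append(" " + name)
            else:
                # The parser hands over unescaped values, so re-escape them.
                out.append(' %s="%s"' % (name, escape(value, quote=True)))
        return "".join(out)

    def handle_starttag(self, tag, attrs):
        self.pieces.append("<%s%s>" % (tag, self._format_attrs(attrs)))

    def handle_startendtag(self, tag, attrs):
        self.pieces.append("<%s%s/>" % (tag, self._format_attrs(attrs)))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.pieces.append("<!%s>" % decl)

    def output(self):
        return "".join(self.pieces)


def quote_attributes(html_text):
    """Return html_text rebuilt with every attribute value quoted."""
    parser = AttributeQuoter()
    parser.feed(html_text)
    parser.close()
    return parser.output()


if __name__ == "__main__":
    print(quote_attributes("<a href=index.html>home</a>"))
```

Note that html.parser lowercases tag and attribute names, so the rebuilt
page is equivalent to the original rather than byte-for-byte identical, and
badly mangled pages can still trip it up.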

-M
You're smart; why haven't you learned Python yet?
http://diveintopython.org/





