Web page from hell breaks BeautifulSoup, almost

John Nagle nagle at animats.com
Tue Dec 11 01:19:13 EST 2007


This web page:

http://azultralights.com/ulclass.html

parses OK with BeautifulSoup, but "prettify" will hit the
recursion limit if you try to display it.  I raised the
recursion limit to a large number, and it was converted
to 5MB of text successfully, in about a minute.
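
For anyone who wants to try the same workaround, here is a minimal
sketch of raising the limit only around the deep call.  It assumes
the limit is raised with sys.setrecursionlimit; "recursion_limit" and
"render" are hypothetical names, and "render" is only a stand-in for
a recursive pretty-printer, not BeautifulSoup's actual "prettify".

```python
import sys
from contextlib import contextmanager

@contextmanager
def recursion_limit(limit):
    """Temporarily raise the interpreter's recursion limit (hypothetical helper)."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(max(old, limit))
    try:
        yield
    finally:
        # Restore the previous limit once the deep call has returned.
        sys.setrecursionlimit(old)

def render(node, indent=0):
    """Recursive pretty-printer over (name, children) tuples; a stand-in only."""
    name, children = node
    lines = [" " * indent + "<" + name + ">"]
    for child in children:
        lines.extend(render(child, indent + 1))
    return lines

# Build a chain 3001 elements deep, like thousands of unclosed "font" tags.
tree = ("font", [])
for _ in range(3000):
    tree = ("font", [tree])

try:
    render(tree)
except RecursionError:
    print("hit the recursion limit")

with recursion_limit(10000):
    print(len(render(tree)), "lines rendered")
```

The Python documentation warns that a limit set too high can overflow
the C stack and crash the interpreter, so restoring the old value
afterwards seems the safer habit.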

The page has real problems.  1901 errors from the W3C validator,
and that's after forcing an encoding and a doctype.  "body" tags
nested 3 deep.  "head" element inside two "body" tags.  Tags
opened in upper case and closed in lower case.  All "font"
tags unclosed.  Hundreds of "li" tags outside any "ol" or
"ul".  Yet Firefox is quite happy to display it.
It looks even better in IE, according to comments on the page.

The page consists of a long list of classified ads, all with
unclosed tags.  Each unclosed tag ends up nested inside the one
before it, so the maximum depth of the parse tree is huge.
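
The effect is easy to reproduce.  Below is a rough sketch that
measures how deep the open-tag stack gets, using the standard
library's html.parser rather than BeautifulSoup.  The nesting model
is a naive assumption (every unclosed tag stays open until an
ancestor is closed), so the numbers will not match BeautifulSoup's
tree exactly.  "DepthMeter", "max_nesting_depth" and the sample ad
markup are made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they add no depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

class DepthMeter(HTMLParser):
    """Track the deepest open-tag stack under a naive nesting model."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        self.max_depth = max(self.max_depth, len(self.stack))

    def handle_endtag(self, tag):
        # HTMLParser lowercases tag names, so <B>...</b> still matches.
        # Close back to the nearest matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def max_nesting_depth(html):
    meter = DepthMeter()
    meter.feed(html)
    meter.close()
    return meter.max_depth

# Made-up stand-in for the ad list: each ad opens "li" and "font"
# and closes neither.
ads = "".join("<li><font size=2>Ad %d<br>" % i for i in range(500))
print(max_nesting_depth("<body>" + ads + "</body>"))  # prints 1001
```

Under this model 500 ads give a depth of 1001, just over Python's
default recursion limit of 1000, so a recursive walk over a tree
shaped like this would fail on a page of quite ordinary length.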

Worst HTML I've seen in a while.

(We use BeautifulSoup to parse hostile web sites in bulk,
so we tend to discover more hard cases than most users.)

				John Nagle
				SiteTruth


