Some <head> clauses cause BeautifulSoup to choke?

Mon Nov 19 17:49:45 EST 2007

On Nov 19, 2007 3:29 PM, Marc Christiansen <usenet at solar-empire.de> wrote:
> Frank Stutzman <stutzman at skywagon.kjsl.com> wrote:
> > I've got a simple script that looks like (watch the wrap):
> > ---------------------------------------------------
> > import BeautifulSoup,urllib
> >
> > ifile = urllib.urlopen("http://www.naco.faa.gov/digital_tpp_search.asp?fldIdent=klax&fld_ident_type=ICAO&ver=0711&bnSubmit=Complete+Search").read()
> >
> > soup=BeautifulSoup.BeautifulSoup(ifile)
> > print soup.prettify()
> > ----------------------------------------------------
> >
> > and all I get out of it is garbage.
>
> Same for me.
>
> > I did some poking and prodding and it seems that there is something in the
> > <head> clause that is causing the problem.  Heck if I can see what it is.
>
> The problem is this line:
>  <META http-equiv="Content-Type" content="text/html; charset=UTF-16">
>
> Which is wrong. The content is not utf-16 encoded. The line after that
> declares the charset as utf-8, which is correct, although ascii would be
> ok too.
>
> If I save the search result and remove this line, everything works. So,
> you could:
> - ignore problematic pages
> - save and edit them, then reparse them (not always practical)
> - use the fromEncoding argument:
>  soup=BeautifulSoup.BeautifulSoup(ifile, fromEncoding="utf-8")
> (or 'ascii'). Of course this only works if you guess/predict the
> encoding correctly ;) Which can be difficult. Since BeautifulSoup uses
> "an encoding discovered in the document itself" (quote from
> <http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful Soup Gives You Unicode, Dammit>)
> when the encoding you supply does not work, using fromEncoding="ascii"
> should not hurt too much. But this being usenet, I'm sure someone will
> tell me that I'm wrong and there is some weird 7bit encoding in use
> somewhere on the web...
>
> > I'm new to BeautifulSoup (heck, I'm new to python).  If I'm doing something
> > dumb, you don't need to be gentle.
>
> No, you did nothing dumb. The server sent you broken content.
>

Correct. However, this is the sort of real-life broken HTML that BS is
tasked to handle. It looks like the major browsers handle this by using
the last content type (header or meta tag) encountered before other
content. Right now, it looks like BS has a number of fallback
mechanisms, but its meta-tag fallback only looks at the first tag.
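
As a workaround in the meantime, here is a rough sketch of the "last
declaration wins" behaviour I described above. It is not BS code and I
have not checked it against what any browser really does. The names
last_meta_charset and decode_page are made up for illustration, it only
looks at meta tags (not the HTTP header), and it is written for current
Python 3 rather than the Python 2 in the original script.

```python
import re

# Hypothetical helpers for illustration; not part of BeautifulSoup.
# Matches charset=... inside a <meta ...> tag, quoted or unquoted.
META_CHARSET_RE = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)
# Rough marker for where the head stops and "other content" begins.
HEAD_END_RE = re.compile(rb"</head\s*>|<body\b", re.IGNORECASE)


def last_meta_charset(raw):
    """Return the charset named by the last <meta> tag in the head, or None."""
    end = HEAD_END_RE.search(raw)
    head = raw[:end.start()] if end else raw
    found = META_CHARSET_RE.findall(head)
    return found[-1].decode("ascii").lower() if found else None


def decode_page(raw, default="utf-8"):
    """Decode raw bytes with the last declared charset, else the default."""
    charset = last_meta_charset(raw) or default
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset name: fall back rather than fail.
        return raw.decode(default, errors="replace")
```

The name returned by last_meta_charset could then be passed to BS as
the fromEncoding argument, as Marc suggested, instead of a hard-coded
guess.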

It would probably be worth posting a feature request for this through
whatever mechanism BS uses to handle this sort of thing.