Some <head> clauses cause BeautifulSoup to choke?

Frank Stutzman stutzman at skywagon.kjsl.com
Tue Nov 20 11:07:40 EST 2007


Some kind person replied:
> You have the same URL as both your good and bad example.

Oops, dang emacs cut buffer (yeah, that's what did it).  A working 
example URL would be (again, mind the wrap):

http://www.naco.faa.gov/digital_tpp_search.asp?fldIdent=ksfo&fld_ident_type=ICAO&ver=0711&bnSubmit=Complete+Search 


Marc Christiansen <usenet at solar-empire.de> wrote:

> The problem is this line:
> <META http-equiv="Content-Type" content="text/html; charset=UTF-16">
> 
> Which is wrong. The content is not utf-16 encoded. The line after that
> declares the charset as utf-8, which is correct, although ascii would be
> ok too.

Ah, er, hmmm.  Take a look at the 'good' URL I mentioned above.  You will 
notice that it has the same utf-16, utf-8 charset declarations that the
'bad' one has.  And BeautifulSoup works great on it.  

I'm still scratchin' ma head...

> If I save the search result and remove this line, everything works. So,
> you could:
> - ignore problematic pages

Not an option for my application.

> - save and edit them, then reparse them (not always practical)

That's what I'm doing at the moment during my development.  Sure
seems inelegant.
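
A less manual take on the save-and-edit approach might be to drop the
misleading META line in memory before parsing. This is only a sketch: the
regular expression is an assumption modeled on the single line quoted above,
and the names BOGUS_META and strip_bogus_charset are made up for illustration.

```python
import re

# Assumed pattern: a Content-Type META tag that declares UTF-16, as in the
# line quoted above. Real pages may order or quote attributes differently.
BOGUS_META = re.compile(
    br'<meta\s+http-equiv="Content-Type"[^>]*charset=utf-16[^>]*>',
    re.IGNORECASE,
)

def strip_bogus_charset(raw_html):
    """Return the page bytes with any UTF-16 Content-Type META tag removed."""
    return BOGUS_META.sub(b"", raw_html)

# Hypothetical sample mimicking the two conflicting declarations.
sample = (
    b'<html><head>\n'
    b'<META http-equiv="Content-Type" content="text/html; charset=UTF-16">\n'
    b'<META http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
    b'</head><body>ok</body></html>'
)
cleaned = strip_bogus_charset(sample)
```

The cleaned bytes could then be passed to the parser in place of the
original page, with no intermediate file.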

> - use the fromEncoding argument:
> soup=BeautifulSoup.BeautifulSoup(ifile, fromEncoding="utf-8")
> (or 'ascii'). Of course this only works if you guess/predict the
> encoding correctly ;) Which can be difficult. Since BeautifulSoup uses
> "an encoding discovered in the document itself" (quote from
> <http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful Soup Gives You Unicode, Dammit>)

I'll try that.  For what I'm doing it ought to be safe enough.  
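
Besides passing fromEncoding as suggested above, a related idea is to decode
the bytes yourself and hand the parser text rather than bytes. The sketch
below uses only the standard library. The helper name decode_page and the
sample bytes are hypothetical, and whether already-decoded text bypasses the
parser's own charset detection is an assumption to check against the
BeautifulSoup version in use.

```python
def decode_page(raw_html, encoding="utf-8"):
    """Decode page bytes with a caller-chosen encoding.

    Any charset declared inside the document is ignored; undecodable
    bytes are replaced instead of raising an error.
    """
    return raw_html.decode(encoding, "replace")

# Hypothetical page: declares UTF-16 but is really UTF-8 encoded.
raw = (
    b'<META http-equiv="Content-Type" content="text/html; charset=UTF-16">'
    b'<p>caf\xc3\xa9</p>'
)
text = decode_page(raw)
```

As with fromEncoding, this only helps when the guessed encoding is right.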

Much appreciate all the comments so far.

-- 
Frank Stutzman
Bonanza N494B     "Hula Girl"
Boise, ID



