BeautifulSoup vs. loose & chars

John Nagle nagle at animats.com
Mon Dec 25 20:00:39 EST 2006


   I've been parsing existing HTML with BeautifulSoup, and occasionally
hit content which has something like "Design & Advertising", that is,
an "&" instead of an "&amp;".  Is there some way I can get BeautifulSoup
to clean those up?  There are various parsing options related to "&"
handling, but none of them seem to do quite the right thing.

   If I write the BeautifulSoup parse tree back out with "prettify",
the loose "&" is still in there, so the output is rejected by XML
parsers.  I need valid XML out, even if what went in wasn't quite valid.
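
   To make the problem concrete, here is a minimal sketch of one possible
workaround that uses only the standard "re" module and is not a
BeautifulSoup feature: escape any "&" that does not already begin
something shaped like an entity reference. The helper name
escape_loose_ampersands is hypothetical. It only checks the shape of a
reference, so an HTML-only entity such as "&nbsp;" passes through
untouched and could still be rejected by an XML parser.

```python
import re

# Matches an "&" that is NOT the start of something shaped like an entity
# reference: a named reference (&name;), a decimal character reference
# (&#38;), or a hexadecimal character reference (&#x26;).
_LOOSE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_loose_ampersands(markup):
    """Return markup with each loose "&" rewritten as "&amp;".

    Hypothetical helper for illustration. It does not check that a named
    reference is actually defined, only that it looks like one.
    """
    return _LOOSE_AMP.sub("&amp;", markup)


if __name__ == "__main__":
    print(escape_loose_ampersands("Design & Advertising"))
```

   A filter like this could be run over the raw markup before it is handed
to the parser, or over the serialized output afterwards. Text that merely
looks like a reference (for example "R&D;") is left alone, which is a
limitation of the sketch.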

				John Nagle


