BeautifulSoup vs. loose & chars

Tue Dec 26 07:22:38 EST 2006

John Nagle wrote:
> I've been parsing existing HTML with BeautifulSoup, and occasionally
> hit content which has something like "Design & Advertising", that is,
> an "&" instead of an "&amp;".  Is there some way I can get BeautifulSoup
> to clean those up?  There are various parsing options related to "&"
> handling, but none of them seem to do quite the right thing.
>
>    If I write the BeautifulSoup parse tree back out with "prettify",
> the loose "&" is still in there.  So the output is
> rejected by XML parsers.  Which is why this is a problem.
> I need valid XML out, even if what went in wasn't quite valid.
>
> 				John Nagle

So do you want to remove the "&" characters or replace them with "&amp;"?
If you want to replace them, try the following:

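import urllib, sys

# url must already hold the address of the page to fetch
try:
  location = urllib.urlopen(url)
except IOError, (errno, strerror):
  sys.exit("I/O error(%s): %s" % (errno, strerror))

content = location.read()
# escape every "&" as the entity "&amp;"
content = content.replace("&", "&amp;")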
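One caveat with the plain replace above: it also rewrites the "&" in
entities that are already escaped, so "&amp;" would come out as
"&amp;amp;". A minimal sketch of a more careful version, assuming a
regular expression is acceptable here; the name escape_loose_ampersands
is made up for this example and is not part of BeautifulSoup:

```python
import re

# Match "&" only when it does NOT start something that looks like an
# entity: a named reference (&name;), a decimal reference (&#38;) or a
# hexadecimal reference (&#x26;).
LOOSE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_loose_ampersands(text):
    """Replace loose "&" characters with "&amp;", leaving entities alone."""
    return LOOSE_AMP.sub("&amp;", text)

# Hypothetical sample input, not taken from a real page
sample = "Design & Advertising, R&amp;D"
cleaned = escape_loose_ampersands(sample)
```

This only looks at the shape of each reference, not at whether the name
is known. XML predefines just five named entities (amp, lt, gt, quot,
apos), so something like "&nbsp;" is left as-is and would still be
rejected by a strict XML parser.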

To do this with BeautifulSoup, I think you need to go through every
Tag, get its content, see if it contains an "&", and then replace the
Tag with the same Tag whose content has "&amp;" instead.

Hope this helps.
Cheers