BeautifulSoup vs. loose & chars

John Nagle nagle at animats.com
Tue Dec 26 13:26:54 EST 2006


Felipe Almeida Lessa wrote:
> On 26 Dec 2006 04:22:38 -0800, placid <Bulkan at gmail.com> wrote:
> 
>> So do you want to remove "&" or replace them with "&amp;" ? If you want
>> to replace it try the following;
> 
> 
> I think he wants to replace them, but just the invalid ones. I.e.,
> 
> This & this &amp; that
> 
> would become
> 
> This &amp; this &amp; that
> 
> 
> No, i don't know how to do this efficiently. =/...
> I think some kind of regex could do it.

    Yes, and the appropriate one is:

	import re
	krefindamp = re.compile(r'&(?!(\w|#)+;)')
	...
	xmlsection = re.sub(krefindamp,'&amp;',xmlsection)

This will replace an '&' with '&amp;' if the '&' isn't
immediately followed by some combination of letters, digits,
underscores, and '#' ending with a ';'.  Admittedly this would let
something like '&xx#2;', which isn't a legal entity, through unmodified.
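As a quick check of that behavior, here is a minimal sketch. The helper
name escape_loose_amps is made up for illustration and is not part of
BeautifulSoup or any other library:

```python
import re

# Matches an '&' that is NOT followed by something shaped like an entity
# reference: one or more word characters or '#', then a ';'.
krefindamp = re.compile(r'&(?!(\w|#)+;)')

def escape_loose_amps(text):
    """Escape bare '&' characters, leaving entity-shaped references alone.

    Hypothetical helper name, used only for this example.
    """
    return krefindamp.sub('&amp;', text)

# The bare '&' is escaped, the existing '&amp;' is left untouched.
print(escape_loose_amps('This & this &amp; that'))
# prints: This &amp; this &amp; that

# '&T' is not followed by ';', so it is escaped. '&#38;', '&xx#2;' and
# '&bogus;' all look like entities to this pattern and pass through.
print(escape_loose_amps('AT&T &#38; &xx#2; &bogus;'))
# prints: AT&amp;T &#38; &xx#2; &bogus;
```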

There's still a potential problem with unknown entities in the output XML, but
at least they're recognized as entities.
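If unknown entities in the output do matter, one possible tightening is
to leave an '&' alone only when it starts one of the five entity names
that XML predefines, or a numeric character reference. This is a sketch
under that assumption, not something tested against real feeds, and the
names XML_ENTITIES, kstrictamp and escape_for_xml are hypothetical:

```python
import re

# The five entity names XML predefines; any other name needs a DTD.
XML_ENTITIES = ('amp', 'lt', 'gt', 'quot', 'apos')

# Assumed stricter pattern: an '&' is kept only before a predefined name,
# a decimal reference such as '&#38;', or a hex reference such as '&#x26;'.
kstrictamp = re.compile(
    r'&(?!(?:%s|#[0-9]+|#x[0-9A-Fa-f]+);)' % '|'.join(XML_ENTITIES))

def escape_for_xml(text):
    """Escape every '&' that does not begin a reference XML can resolve."""
    return kstrictamp.sub('&amp;', text)

print(escape_for_xml('&nbsp; &lt; &#38; &#x26; &xx#2; &'))
# prints: &amp;nbsp; &lt; &#38; &#x26; &amp;xx#2; &amp;
```

The trade-off is that an HTML-only entity such as '&nbsp;' ends up as the
literal text '&nbsp;' in the XML rather than as a non-breaking space, so
this only suits cases where that loss is acceptable.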

				John Nagle




