Removing certain tags from html files

Marc 'BlackJack' Rintsch bj_666 at gmx.net
Fri Jul 27 14:45:29 EDT 2007


On Fri, 27 Jul 2007 17:40:23 +0000, sebzzz wrote:

> My question, since I'm quite new to python, is about what tool I
> should use to remove the table, tr and td tags, but not what's
> enclosed in it. I think BeautifulSoup isn't good for that because it
> removes what's enclosed as well.

Then take hold of the contents and add them to the parent.  Something like
this should work:

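from BeautifulSoup import BeautifulSoup


def remove(soup, tagname):
    """Remove every `tagname` tag but keep its contents in the same place."""
    for tag in soup.findAll(tagname):
        parent = tag.parent
        # Copy the list: moving a child out of `tag` changes `tag.contents`.
        contents = list(tag.contents)
        # Remember where the tag sits so its children are not moved to the end.
        position = parent.contents.index(tag)
        tag.extract()
        for offset, child in enumerate(contents):
            parent.insert(position + offset, child)


def main():
    source = '<a><b>This is a <c>Test</c></b></a>'
    soup = BeautifulSoup(source)
    print soup
    remove(soup, 'b')
    print soup


if __name__ == '__main__':
    main()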
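
If BeautifulSoup is not at hand, the same idea (drop the tag, keep what it
encloses) can be sketched with the standard library only.  This is a rough
sketch for Python 3, not a replacement for the function above: the names
`TagStripper` and `strip_tags` are made up for illustration, and comments,
doctypes and processing instructions are silently dropped because no handler
re-emits them.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Re-emit HTML, dropping the named tags but keeping what they enclose."""

    def __init__(self, tagnames):
        # convert_charrefs=False so entities such as &amp; pass through as-is.
        super().__init__(convert_charrefs=False)
        self.tagnames = set(tagnames)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.tagnames:
            # get_starttag_text() gives the start tag as written in the source.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        if tag not in self.tagnames:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.tagnames:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)


def strip_tags(source, tagnames):
    """Return `source` with the given tags removed and their contents kept."""
    parser = TagStripper(tagnames)
    parser.feed(source)
    parser.close()
    return ''.join(parser.parts)


if __name__ == '__main__':
    html = '<table><tr><td>Hello <b>world</b></td></tr></table>'
    print(strip_tags(html, ['table', 'tr', 'td']))
```

Because the parser only passes text through and never builds a tree, badly
nested input comes out just as badly nested; that is the price for not
needing a third-party module.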

> Is re the good module for that? Basically, if I make an iteration that
> scans the text and tries to match every occurrence of a given regular
> expression, would it be a good idea?

No, regular expressions are not a very good idea.  They get very
complicated very quickly while often still missing some corner cases.
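
To make one such corner case concrete, here is a small sketch (Python 3).
The pattern `NAIVE_TAG` and the helper `naive_strip` are invented for this
illustration; they stand for the kind of expression one would probably write
first, not for any particular recommended regex.

```python
import re

# Naive idea: a tag is "<", an optional "/", the tag name, anything that is
# not ">", and then ">".
NAIVE_TAG = re.compile(r'</?(?:table|tr|td)\b[^>]*>')


def naive_strip(source):
    """Remove table, tr and td tags with the naive pattern above."""
    return NAIVE_TAG.sub('', source)


simple = '<table><tr><td>cell</td></tr></table>'
# A ">" inside a quoted attribute value is legal HTML, but the pattern
# treats it as the end of the tag.
tricky = '<td title="a > b">cell</td>'

print(repr(naive_strip(simple)))   # 'cell'
print(repr(naive_strip(tricky)))   # ' b">cell'
```

The simple input comes out fine, but the second one leaves half of the
attribute behind.  Patching the pattern for quoted attributes is possible,
but then comments, script blocks and the like each need their own patch.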

Ciao,
	Marc 'BlackJack' Rintsch


