remove begin and end tags and its content in between

Alex Martelli aleaxit at yahoo.com
Fri Apr 6 06:24:18 EDT 2001


"Jennifer Jeng" <jeng at cliffie.nosc.mil> wrote in message
news:9ai3vs$mok$1 at newpoisson.nosc.mil...
> Hi,
>
> Is there a way to search for specific HTML tags and remove the begin and
> end tags, including their content (the begin tag can be several lines
> before the end tag)?
>
> example:
>
> replace:
>
> this text is before the begin title tags and <title>dddfdfdkfdjfdjfldjfl
> dkfdlfjdlf jdkjfdk
> djkfd
> dfdkfdkf</title> this text is
> at the end of the
> end title tag
>
> to:
>
> this text is before the begin title tags and this text is
> at the end of the
> end title tag

It's not clear exactly which tags qualify here, but assuming you want
to remove everything that IS between ANY pair of tags, and leave only
what is outside them, this might work:

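import sgmllib

class afilter(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.inTag = 0      # nesting depth: how many tags are currently open
        self.data = []      # pieces of text found outside all tags
    def unknown_starttag(self, tag, attributes):
        # called for every start tag, since no start_xxx/do_xxx methods exist
        self.inTag += 1
    def unknown_endtag(self, tag):
        # ignore stray end tags, so the depth never goes negative
        if self.inTag > 0:
            self.inTag -= 1
    def handle_data(self, data):
        # keep text only while no tag is open; note that a tag which is
        # never closed (e.g. <br>) suppresses all the text that follows it
        if self.inTag: return
        self.data.append(data)

if __name__=='__main__':
    sometext =  """
this text is before the begin title tags and <title>dddfdfdkfdjfdjfldjfl
dkfdlfjdlf jdkjfdk
djkfd
dfdkfdkf</title> this text is
at the end of the
end title tag"""
    filt = afilter()
    filt.feed(sometext)
    filt.close()    # flush any data still buffered in the parser
    print ''.join(filt.data)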


The only difference wrt your desired output in your example is
that, of course, TWO spaces will be between 'and' and 'this',
since that is the number of spaces outside of tags in the
string being processed:-).
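
For anyone reading this with a later Python: sgmllib was removed in
Python 3, so here is a rough sketch of the same depth-counting idea on
top of html.parser.HTMLParser. It is only an illustration under a few
assumptions, not a drop-in replacement: the names TagContentFilter and
strip_tags_with_content and the tags parameter are made up for this
example, and it filters only the tag names you pass in (which is
closer to the "specific html tags" in the question). Markup for all
other tags is still dropped, since only text data is collected; their
text content is kept.

```python
from html.parser import HTMLParser


class TagContentFilter(HTMLParser):
    """Collect only the text that lies outside the given tags."""

    def __init__(self, tags):
        # convert_charrefs=True delivers entities such as &amp; as plain text
        super().__init__(convert_charrefs=True)
        self.tags = set(tags)   # hypothetical parameter: tag names to strip
        self.depth = 0          # how many of those tags are currently open
        self.chunks = []        # text seen outside them

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        # ignore stray end tags so the depth never goes negative
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def strip_tags_with_content(text, tags):
    """Return text with the given tags and everything inside them removed."""
    parser = TagContentFilter(tags)
    parser.feed(text)
    parser.close()              # flush text still buffered by the parser
    return ''.join(parser.chunks)


if __name__ == '__main__':
    sample = "before and <title>some\ntitle text</title> this text is\nafter"
    print(strip_tags_with_content(sample, ['title']))
```

As with the original, a listed tag that is never closed would suppress
all the text after it, so this is best kept to tags that always come
in begin/end pairs.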


Alex





