[Tutor] lstrip() question

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Mon Feb 2 21:47:48 EST 2004



On Mon, 2 Feb 2004, Tim Johnson wrote:

>     I'd like to remove all leading occurrences of an HTML break tag in
> a string, such that "<br><br>test" => "test" and "<br>test<br>this"
> => "test<br>this"


Hi Tim,


Just out of curiosity, why are you trying to do this?  Would it be
possible to use something like HTMLParser?

    http://www.python.org/doc/lib/module-HTMLParser.html

I know it sounds like using the library might be overkill, but HTMLParser
is meant to deal with the ugliness that is HTML.  It can handle some
strange situations like


###
s = """<br
     ><Br/><bR       class="f<o><o>!">this is a test"""
###


where getting a regular expression right is subtler than we might expect.
(The example above is meant to be a nightmare case.  *grin*)
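
For comparison, here's a rough sketch of what a first-try regular
expression might look like.  The names LEADING_BREAKS and
strip_leading_breaks_regex are made up for illustration; they are not from
Tim's code.  It copes with the two examples in the question, but not with
attributes on the tag:

```python
import re

# Hypothetical pattern: one or more <br> or <br/> tags at the very start,
# in any letter case, with optional whitespace before and inside each tag.
# It does NOT allow attributes such as class="...".
LEADING_BREAKS = re.compile(r'^(?:\s*<br\s*/?>)+', re.IGNORECASE)

def strip_leading_breaks_regex(text):
    """Remove leading <br> tags that the simple pattern recognizes."""
    return LEADING_BREAKS.sub('', text)
```

On the nightmare string above, this strips the first two tags and then
stops at the third one, because of its class attribute.  Teaching the
pattern about quoted attribute values is where things start getting hairy.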


Using a real HTML parser normalizes this wackiness so that we don't see
it.  Here's a subclass of HTMLParser that shows how we might use it for
the problem:


###
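try:
    from html.parser import HTMLParser    # Python 3
except ImportError:
    from HTMLParser import HTMLParser     # Python 2

class IgnoreLeadingBreaksParser(HTMLParser):
    """Rebuilds the input, skipping <br> tags before the first content."""

    def __init__(self):
        HTMLParser.__init__(self)
        # Becomes True once we see text or any tag other than <br>.
        self.seen_nonbreak_tag = False
        self.text = []

    def get_text(self):
        return ''.join(self.text)

    def handle_starttag(self, tag, attrs):
        if tag != 'br':
            self.seen_nonbreak_tag = True
        if self.seen_nonbreak_tag:
            # Emit the tag exactly as it appeared in the source.
            self.text.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # By default, <br/> triggers handle_starttag AND handle_endtag,
        # which would emit "<br/></br>".  Emit it only once.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag != 'br':
            self.seen_nonbreak_tag = True
        if self.seen_nonbreak_tag:
            self.text.append('</%s>' % tag)

    def handle_data(self, data):
        # Any text, even whitespace between tags, counts as content.
        self.seen_nonbreak_tag = True
        self.text.append(data)


def ignore_leading_breaks(text):
    parser = IgnoreLeadingBreaksParser()
    parser.feed(text)
    parser.close()    # flush anything the parser is still buffering
    return parser.get_text()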
###


Note: this is not quite production-quality yet.  In particular, it doesn't
handle comments or character references, so we may need to add more
methods to the IgnoreLeadingBreaksParser so that it handles those cases
too.
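
As a rough idea of what those extra methods could look like, here's a
standalone sketch written against Python 3's html.parser.  The class name
LeadingBreakStripper and the function strip_leading_breaks are invented for
this example.  It assumes that comments and character references should be
copied through unchanged, and that a comment counts as content that ends
the run of leading breaks; both are my guesses about what's wanted here.

```python
from html.parser import HTMLParser


class LeadingBreakStripper(HTMLParser):
    """Hypothetical variant that also keeps comments and references."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_entityref
        # and handle_charref instead of decoding references into text.
        HTMLParser.__init__(self, convert_charrefs=False)
        self.seen_content = False
        self.pieces = []

    def get_text(self):
        return ''.join(self.pieces)

    def _emit(self, raw):
        # Everything emitted counts as content, ending the leading run.
        self.seen_content = True
        self.pieces.append(raw)

    def handle_starttag(self, tag, attrs):
        if tag != 'br' or self.seen_content:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Treat <br/> as a single tag so it is emitted only once.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag != 'br' or self.seen_content:
            self._emit('</%s>' % tag)

    def handle_data(self, data):
        self._emit(data)

    def handle_comment(self, data):
        self._emit('<!--%s-->' % data)

    def handle_entityref(self, name):
        # Named reference such as &amp; (name is "amp").
        self._emit('&%s;' % name)

    def handle_charref(self, name):
        # Numeric reference such as &#169; (name is "169").
        self._emit('&#%s;' % name)


def strip_leading_breaks(text):
    parser = LeadingBreakStripper()
    parser.feed(text)
    parser.close()
    return parser.get_text()
```

One caveat: references are re-emitted with a trailing semicolon, so input
that left the semicolon off won't round-trip exactly.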


Hope this helps!



