[Tutor] lstrip() question

don arnold darnold02 at sprynet.com
Mon Feb 2 18:48:02 EST 2004


----- Original Message -----
From: "Karl Pflästerer" <sigurd at 12move.de>
To: <tutor at python.org>
Sent: Monday, February 02, 2004 5:27 PM
Subject: Re: [Tutor] lstrip() question


> On  2 Feb 2004, don arnold <- darnold02 at sprynet.com wrote:
>
> > But this doesn't seem to quite work if there are multiple leading <br>'s.
>
> > >>> tmp = '<br><br>real estate<br>broker<br>'
> > >>> import re
> > >>> re.sub('^<br>*','',tmp)
> > '<br>real estate<br>broker<br>'
>
>
> What Python version do you have; it seems to be broken.
>
> $ python
> Python 2.3.3 (#1, Dec 30 2003, 08:29:25)
> [GCC 3.3.1 (cygming special)] on cygwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import re
> >>> s = '<br><br>foobar'
> >>> re.sub('<br>*', '', s)
> 'foobar'
> >>> tmp = '<br><br>real estate<br>broker<br>'
> >>> re.sub('<br>*', '', tmp)
> 'real estatebroker'
> >>>
>

Yes, but this regex doesn't have the initial caret ('^') that your original
did. As a result, it removes all occurrences of the tag, not just the
leading one(s).
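
To make the difference concrete, here is a minimal sketch (checked against
current Python 3, not 2.3). The key point is that '*' applies only to the
character immediately before it, so '<br>*' means '<br' followed by zero or
more '>' rather than zero or more whole tags. The variable names and the
"tolerant" pattern are my own illustrations, not something from the earlier
posts.

```python
import re

tmp = '<br><br>real estate<br>broker<br>'

# '*' binds only to the '>' before it, so with the caret this matches
# exactly one tag at the start of the string.
first_only = re.sub('^<br>*', '', tmp)

# Grouping the tag makes the quantifier repeat the whole '<br>',
# so every leading tag is removed and later ones are kept.
leading = re.sub('^(?:<br>)+', '', tmp)

# Without the caret, every occurrence anywhere in the string is removed.
everywhere = re.sub('<br>*', '', tmp)

# Illustrative pattern (an assumption about which spellings to accept):
# also strips leading '<br/>' and '<br />'.
tolerant = re.sub(r'^(?:<br\s*/?\s*>)+', '', '<br><br />real estate<br/>broker')

print(first_only)   # <br>real estate<br>broker<br>
print(leading)      # real estate<br>broker<br>
print(everywhere)   # real estatebroker
print(tolerant)     # real estate<br/>broker
```

So the anchored and grouped form is the one that strips only the leading
tags, which is what the original question asked for.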

> I can't reproduce your observation here.
>
> The regexp could be made better since a <br/ > tag should be written
> like this (HTML 4.01 and XHTML)
>
> >>> re.sub('<br( */ *)?>*', '', tmp)
> 'real estatebroker'
> >>>
>
>
> > I don't know much about regexes, but is this because only the very first
> > occurrence is considered to be at the beginning of the line? I'm sure there
>
> No, that can't be.  Something else is wrong.  Did you type it in
> exactly the way you posted it here?
>
> > is a regex way to do it, but you could just use a simple loop and
> > startswith():
>
> That's possible but IMO extremely inefficient. Also, with tags written as
> a mixture of <br> and <br/ >, your loop couldn't be written so simply.
>
>    Karl

No argument there. But the truth of the matter is that neither string
methods nor regexes are very well-suited for parsing full-blown HTML. For
that, you're probably better off using the HTMLParser module.
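
As a rough sketch of what that could look like: in current Python 3 the
module lives at html.parser (in 2.3 it is the top-level HTMLParser module).
The class and function names below are hypothetical, and treating every
<br> variant as a line separator is just one possible design, not the only
way to use the parser.

```python
from html.parser import HTMLParser


class BrSplitter(HTMLParser):
    """Hypothetical helper: collects text, starting a new line at each <br>."""

    def __init__(self):
        super().__init__()
        self.lines = ['']

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names; self-closing forms such as
        # '<br/>' also reach this method via the default handle_startendtag.
        if tag == 'br':
            self.lines.append('')

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current line.
        self.lines[-1] += data


def split_on_br(html):
    """Return the non-empty text segments separated by <br> tags."""
    parser = BrSplitter()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return [line for line in parser.lines if line]


print(split_on_br('<br><br />real estate<br/>broker<br>'))
# ['real estate', 'broker']
```

The parser takes care of recognising the tag however it is spelled, so the
leading (and trailing) breaks simply show up as empty segments that get
filtered out.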

Don



