need help with re module

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Wed Jun 20 20:43:59 EDT 2007


On Wed, 20 Jun 2007 17:56:30 -0300, David Wahler <dwahler at gmail.com>
wrote:

> On 6/20/07, Gabriel Genellina <gagsl-py2 at yahoo.com.ar> wrote:
>> On Wed, 20 Jun 2007 13:58:34 -0300, linuxprog <linuxprog at gmail.com>
>> wrote:
>>
>> > I have the string "<html>hello</a>world<anytag>ok" and I want to
>> > extract all the text, without the HTML tags. The result should be
>> > something like: helloworldok
>>
>> You can't use a regular expression for this task (no matter how
>> complicated you make it).
> [snip]
>
> I agree that BeautifulSoup is probably the best tool for the job, but
> this doesn't sound right to me. Since the OP doesn't care about tags
> being properly nested, I don't see why a regex (albeit a tricky one)
> wouldn't work. For example:
>
> import re
>
> regex = re.compile(r'''
>     <[^!]             # beginning of normal tag
>         ([^'">]*        # unquoted text...
>         |'[^']*'        # or single-quoted text...
>         |"[^"]*")*      # or double-quoted text
>     >                 # end of tag
>    |<!--              # beginning of comment
>         ([^-]|-[^-])*
>     --\s*>            # end of comment
> ''', re.VERBOSE)
> text = regex.sub('', html)
>
> Granted, this misses out a few things (e.g. DOCTYPE declarations), but
> those should be straightforward to handle.

It fails on a lot of ordinary input. Take this example (nothing special,
just a few simple mistakes: an unclosed attribute quote and some bare "<"
characters in the text):

<html>
<a href="http://foo.com/baz.html>click here</a>
<p>What if price<100? You lose.
<p>What if HitPoints<-10? You are dead.
<p>Assignment: target <-- any_expression
Just a few last words.
</html>

the BeautifulSoup version gives:

click here
What if price<100? You lose.
What if HitPoints<-10? You are dead.
Assignment: target <-- any_expression
Just a few last words.

and the regular expression version gives:

<a href="http://foo.com/baz.html>click here
What if priceWhat if HitPointsAssignment: target
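
To check the regular-expression result yourself, here is a minimal sketch
that applies the quoted pattern, unchanged, to the sample input above. The
names TAG_RE, SAMPLE and strip_tags are only illustrative; they do not
appear anywhere in the thread.

```python
import re

# The pattern quoted earlier in this thread, reproduced unchanged.
TAG_RE = re.compile(r'''
    <[^!]             # beginning of normal tag
        ([^'">]*        # unquoted text...
        |'[^']*'        # or single-quoted text...
        |"[^"]*")*      # or double-quoted text
    >                 # end of tag
   |<!--              # beginning of comment
        ([^-]|-[^-])*
    --\s*>            # end of comment
''', re.VERBOSE)

# The sample input from this message: one unclosed attribute quote and
# several bare "<" characters inside ordinary text.
SAMPLE = """<html>
<a href="http://foo.com/baz.html>click here</a>
<p>What if price<100? You lose.
<p>What if HitPoints<-10? You are dead.
<p>Assignment: target <-- any_expression
Just a few last words.
</html>
"""


def strip_tags(html):
    """Delete everything the pattern considers a tag or a comment."""
    return TAG_RE.sub('', html)


stripped = strip_tags(SAMPLE)
print(stripped)
```

Each bare "<" makes the pattern swallow everything up to the next ">",
newlines included, which is why whole sentences vanish and the last line
never shows up.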

Clearly the BeautifulSoup version gives the "right" result, or at least
the "expected" one. That is hard to get with a regular expression alone:
the pattern above treats everything from a bare "<" up to the next ">" as
a tag. You need more power, and BeautifulSoup fills the gap.
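
As a rough illustration of the parser-based approach, here is a sketch
that uses the standard library's html.parser module as a stand-in. It is
not BeautifulSoup and is not the code behind the output shown above, and I
make no claim that it gives the same result on the broken sample. The
names TextExtractor and extract_text are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and ignore every tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush anything still buffered
    return ''.join(parser.chunks)


# The string from the original question.
print(extract_text('<html>hello</a>world<anytag>ok'))  # prints helloworldok
```

The parser does not care that the tags are unknown or badly nested; it
just reports the text between them, which is all the original question
asked for.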

-- 
Gabriel Genellina



