need help with re module
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Wed Jun 20 20:43:59 EDT 2007
On Wed, 20 Jun 2007 17:56:30 -0300, David Wahler <dwahler at gmail.com>
wrote:
> On 6/20/07, Gabriel Genellina <gagsl-py2 at yahoo.com.ar> wrote:
>> On Wed, 20 Jun 2007 13:58:34 -0300, linuxprog <linuxprog at gmail.com>
>> wrote:
>>
>> > i have that string "<html>hello</a>world<anytag>ok" and i want to
>> > extract all the text , without html tags , the result should be some
>> > thing like that : helloworldok
>>
>> You can't use a regular expression for this task (no matter how
>> complicated you write it).
> [snip]
>
> I agree that BeautifulSoup is probably the best tool for the job, but
> this doesn't sound right to me. Since the OP doesn't care about tags
> being properly nested, I don't see why a regex (albeit a tricky one)
> wouldn't work. For example:
>
> regex = re.compile(r'''
> <[^!] # beginning of normal tag
> ([^'">]* # unquoted text...
> |'[^']*' # or single-quoted text...
> |"[^"]*")* # or double-quoted text
> > # end of tag
> |<!-- # beginning of comment
> ([^-]|-[^-])*
> --\s*> # end of comment
> ''', re.VERBOSE)
> text = regex.sub('', html)
>
> Granted, this misses out a few things (e.g. DOCTYPE declarations), but
> those should be straightforward to handle.
It doesn't handle a lot of things. For this input (not very special, just
a few simple mistakes):
<html>
<a href="http://foo.com/baz.html>click here</a>
<p>What if price<100? You lose.
<p>What if HitPoints<-10? You are dead.
<p>Assignment: target <-- any_expression
Just a few last words.
</html>
the BeautifulSoup version gives:
click here
What if price<100? You lose.
What if HitPoints<-10? You are dead.
Assignment: target <-- any_expression
Just a few last words.
and the regular expression version gives:
<a href="http://foo.com/baz.html>click here
What if priceWhat if HitPointsAssignment: target
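The regular expression result can be reproduced with a short script. This is
only a sketch: the pattern is the one quoted earlier in this message,
reindented, and the names TAG_RE, SAMPLE and strip_tags are made up here for
illustration. It assumes a current Python 3.

```python
import re

# The pattern quoted earlier in this message, reindented for readability.
TAG_RE = re.compile(r'''
    <[^!]                  # beginning of normal tag
    ([^'">]*               # unquoted text...
    |'[^']*'               # or single-quoted text...
    |"[^"]*")*             # or double-quoted text
    >                      # end of tag
    |<!--                  # beginning of comment
    ([^-]|-[^-])*
    --\s*>                 # end of comment
    ''', re.VERBOSE)

# The sample input from this message (note the unterminated href quote).
SAMPLE = '''<html>
<a href="http://foo.com/baz.html>click here</a>
<p>What if price<100? You lose.
<p>What if HitPoints<-10? You are dead.
<p>Assignment: target <-- any_expression
Just a few last words.
</html>
'''


def strip_tags(html):
    """Remove everything the pattern considers a tag or a comment."""
    return TAG_RE.sub('', html)


if __name__ == '__main__':
    print(strip_tags(SAMPLE))
```

Every bare "<" in the text is taken as the start of a tag, so the match runs
on to the next ">" and swallows the words in between, while the unterminated
quote keeps the <a ...> tag from matching at all.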
Clearly the BeautifulSoup version gives the "right" result, or the
"expected" one.
Getting the expected result with only a regular expression is hard. You need
more power than that, and BeautifulSoup fills the gap.
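As a rough illustration of what "more power" can look like, here is a minimal
sketch that uses an actual parser instead of a pattern. It is not
BeautifulSoup: it swaps in html.parser.HTMLParser from the Python 3 standard
library, and TextExtractor and extract_text are hypothetical names. How it
copes with broken markup, such as the unterminated quote in the sample above,
may well differ from what BeautifulSoup does.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data and ignore every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def extract_text(html):
    """Return the text of html with the tags dropped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text the parser is still holding
    return ''.join(parser.chunks)


if __name__ == '__main__':
    print(extract_text('<html>hello</a>world<anytag>ok'))
```

For the string in the original question this should print helloworldok, with
no need for the tags to be properly nested or even known.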
--
Gabriel Genellina