need help with re module

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Wed Jun 20 15:11:59 EDT 2007


On Wed, 20 Jun 2007 13:58:34 -0300, linuxprog <linuxprog at gmail.com>  
wrote:

> I have the string "<html>hello</a>world<anytag>ok" and I want to
> extract all the text, without the HTML tags. The result should be
> something like: helloworldok
>
> I have tried this:
>
>         from re import findall
>
>         chaine = """<html>hello</a>world<anytag>ok"""
>
>         print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
>       >>> ['html', 'hell', 'worl', 'anyt', 'ag>o']
>
> The result is not correct! What would be the correct regex to use?

You can't reliably use a regular expression for this task (no matter how  
complicated you make it).
Use BeautifulSoup, which can handle invalid HTML like yours:

py> from BeautifulSoup import BeautifulSoup
py> chaine = """<html>hello</a>world<anytag>ok"""
py> soup = BeautifulSoup(chaine)
py> soup.findAll(text=True)
[u'hello', u'world', u'ok']

Get it from <http://www.crummy.com/software/BeautifulSoup/>
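
If installing a third-party package is not an option, the same idea
(let a tolerant parser find the tags, and keep only the text between them)
can be sketched with the standard library's html.parser module in Python 3.
This is only a minimal sketch and not the BeautifulSoup approach shown above:
the names TextCollector and extract_text are made up for illustration, and
the code assumes that simply dropping every tag is the desired behaviour.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes and ignore every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags; tags never arrive here.
        self.chunks.append(data)


def extract_text(markup):
    """Return the text content of `markup` with all tags dropped."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at end of input
    return "".join(parser.chunks)


chaine = """<html>hello</a>world<anytag>ok"""
print(extract_text(chaine))
```

For the string from the question this should print helloworldok. The
parser does not check that tags are balanced or known, which is why the
stray </a> and the unknown <anytag> are skipped without complaint.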

-- 
Gabriel Genellina



