need help with re module

Wed Jun 20 14:02:12 EDT 2007

On Jun 20, 9:58 am, linuxprog <linuxp... at gmail.com> wrote:
> hello
>
> i have that string "<html>hello</a>world<anytag>ok" and i want to
> extract all the text , without html tags , the result should be some
> thing like that : helloworldok
>
> i have tried that :
>
>         from re import findall
>
>         chaine = """<html>hello</a>world<anytag>ok"""
>
>         print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
>
>        >>> ['html', 'hell', 'worl', 'anyt', 'ag>o']
>
> the result is not correct ! what would be the correct regex to use ?

The expression [^(<.*>)] is a negated character set: it matches any
single character except "(", "<", ".", "*", ">" and ")". Inside
square brackets those characters are taken literally, so it is not a
group containing <.*>, and it certainly doesn't do what you want.
Is it absolutely necessary that you use a regular expression? There
are a few HTML parsing libraries out there. The easiest approach using
re might be to do a search and replace on all tags: just replace each
tag with the empty string.
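
Here is a rough sketch of both suggestions, in case it helps. It is
only an illustration, not the one right answer. It assumes Python 3
(print as a function, html.parser from the standard library), and the
names TAG, TextCollector and strip_tags are made up for this example.
The pattern <[^>]*> is a simple guess at what a "tag" looks like and
will go wrong on input with ">" inside attribute values or comments.

```python
import re
from html.parser import HTMLParser

chaine = "<html>hello</a>world<anytag>ok"

# What the original set really does: it rejects six literal characters
# and accepts anything else, one character at a time.
leftover = re.findall(r"[^(<.*>)]", "a(<.*>)b")
print(leftover)  # ['a', 'b']

# Approach 1: search and replace. "<[^>]*>" means "<", then any run of
# characters other than ">", then ">". Each match is replaced by "".
TAG = re.compile(r"<[^>]*>")
text_via_re = TAG.sub("", chaine)
print(text_via_re)  # helloworldok


# Approach 2: let a parser find the tags and keep only the text.
class TextCollector(HTMLParser):
    """Hypothetical helper: collects the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for each piece of text outside a tag.
        self.parts.append(data)


def strip_tags(markup):
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flush any text still buffered at the end
    return "".join(collector.parts)


text_via_parser = strip_tags(chaine)
print(text_via_parser)  # helloworldok
```

The regex version is shorter and fine for a string like yours. The
parser version is probably the safer choice once the input is real
HTML rather than a small, known string.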

Matt