Q: how to extract only text from a html ?

Moshe Zadka moshez at math.huji.ac.il
Wed Nov 1 10:47:08 EST 2000


On 1 Nov 2000, Gerrit Holl wrote:

> On Tue, 31 Oct 2000 13:50:54 -0600, Hwanjo Yu wrote:
> > Could someone please tell me how to get rid of all the tags in a html ?
> > It seems that the htmllib.HTMLParser is not helpful to do it.
> 
> Maybe you should have a look at regular expressions, the re module.
> There's extremely much possible with it. Have you had a look at it?

No! HTML is very hard to parse with regular expressions.
Consider

<A HREF=">">fff</A>

where a ">" sits inside a quoted attribute value. Then there are
CDATA sections, comments, etc.
Do yourself a favour: learn to use htmllib instead of reinventing the
wheel, while making it square.
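
For what it's worth, here is a minimal sketch of the parser-based approach.
htmllib only exists in Python 2, so this assumes Python 3 and swaps in
html.parser.HTMLParser from the standard library. The names TextExtractor
and extract_text are made up for illustration. It also runs a naive
tag-stripping regex over the example above for comparison.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps character data, drops tags and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags; tags and comments go to
        # other handlers, which are left as no-ops here.
        self.chunks.append(data)


def extract_text(html):
    """Return the text content of an HTML string (illustrative sketch)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


sample = '<A HREF=">">fff</A>'

# Naive approach: treat everything from "<" to the next ">" as a tag.
# The match ends at the ">" inside the quotes, so attribute debris survives.
naive = re.sub(r"<[^>]*>", "", sample)

# Parser approach: the quoted attribute value is read as part of the tag.
parsed = extract_text(sample)

print(repr(naive))
print(repr(parsed))
```

Note that this sketch will also hand back the contents of SCRIPT and STYLE
elements as text; a fuller version would probably want to track those tags
in handle_starttag/handle_endtag and skip what is inside them.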
--
Moshe Zadka <moshez at math.huji.ac.il> -- 95855124
http://advogato.org/person/moshez




