Q: how to extract only the text from an HTML file?

Fredrik Lundh fredrik at effbot.org
Wed Nov 1 19:24:03 EST 2000


Alex wrote:
> I think htmllib (a solution based on which has already been
> posted) is a much better idea to handle HTML, than trying to
> do it with re's.  HTML syntax is not parsable with re's,  while
> htmllib does a decent job of it, I think.

footnote: htmllib (or rather sgmllib, which it is built on) uses
regular expressions to parse HTML (SGML).  maybe you meant
"cannot be parsed with a single re"?
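
For the original question (getting only the text out of an HTML page), here is a minimal sketch of the parser-based approach. It assumes a current Python 3, where htmllib and sgmllib are no longer in the standard library and html.parser.HTMLParser fills the same role. The names TextExtractor and extract_text are made up for this illustration, and skipping script/style contents is an assumption about what "only text" should mean.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}  # assumption: these never count as "text"

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; etc. into characters
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Calling extract_text(page_source) on a page you have already read into a string returns one whitespace-normalized line of text; how block elements should map to line breaks is left open here.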

(on the other hand, you can parse XML with a single RE, and
I don't see why you cannot use a similar technique to parse
HTML...)
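
A rough sketch of what the single-RE idea could look like for HTML: one compiled pattern whose alternatives cover comments, tags, and text, applied in a loop. This is an illustration only, not the pattern from any existing module. TOKEN and text_only are hypothetical names, and the sketch does not treat script or style contents specially, nor does it cope with unbalanced quotes inside a tag.

```python
import re
from html import unescape

# One pattern, several alternatives; each match is exactly one token.
TOKEN = re.compile(r"""
    (?P<comment><!--.*?-->)                   # <!-- ... -->
  | (?P<tag><(?:[^>"']|"[^"]*"|'[^']*')*>)    # a tag; '>' may appear inside quotes
  | (?P<text>[^<]+)                           # character data up to the next '<'
  | (?P<stray><)                              # a lone '<' that opens no tag
""", re.DOTALL | re.VERBOSE)


def text_only(html):
    """Keep only the text tokens, decode entities, collapse whitespace."""
    pieces = []
    for match in TOKEN.finditer(html):
        # lastgroup names the alternative that matched this token
        if match.lastgroup in ("text", "stray"):
            pieces.append(match.group())
    return " ".join(unescape("".join(pieces)).split())
```

The same loop could dispatch on match.lastgroup to handler functions instead of collecting strings, which is roughly how a regex-driven parser is organized.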

</F>




