Q: how to extract only text from a html ?

Thu Nov 2 06:18:21 EST 2000

"Fredrik Lundh" <fredrik at effbot.org> wrote in message
news:Du2M5.3455$QH2.340287 at newsb.telia.net...
> Alex wrote:
> > I think htmllib (a solution based on which has already been
> > posted) is a much better idea to handle HTML, than trying to
> > do it with re's.  HTML syntax is not parsable with re's,  while
> > htmllib does a decent job of it, I think.
>
> footnote: htmllib (or rather, sgmllib) uses regular expressions
> to parse HTML (SGML).  maybe you meant "cannot be parsed
> with a single re"?

I did not mean that the lexical level cannot be handled
with RE's -- but I thought sgmllib was adding substantial
'parsing' value to the 'lexical' RE's it uses by arranging
to call the right one[s] at various points in goahead
and the various parse_* methods.  I believed the relationship
was something like that of lex (which uses RE's, but only
handles lexical issues) to yacc (which superimposes a more
general LALR(1) structure and handles non-lexical syntax
issues).
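
To make the lex/yacc analogy concrete, here is a minimal sketch of the
two layers. It is not sgmllib's actual code: the TOKEN pattern and the
check_nesting helper are names made up for illustration, and the
pattern ignores comments, quoted '>' characters, and empty-element
tags. The RE only recognizes one token at a time; the stack kept by
the surrounding loop is what tracks nesting.

```python
import re

# Lexical layer: one regular expression that recognizes a single token,
# either a tag (groups 1 and 2) or a run of character data (group 3).
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)")


def check_nesting(markup):
    """Return True if every start tag is closed, in the right order."""
    stack = []  # syntactic layer: state that the RE itself does not carry
    for match in TOKEN.finditer(markup):
        closing, name, text = match.groups()
        if text is not None:
            continue  # character data does not affect nesting
        if closing:
            # An end tag must match the most recently opened element.
            if not stack or stack.pop() != name.lower():
                return False
        else:
            stack.append(name.lower())
    return not stack  # anything left on the stack was never closed


if __name__ == "__main__":
    print(check_nesting("<b><i>x</i></b>"))
    print(check_nesting("<b><i>x</b></i>"))
```

The same RE is used for both inputs; only the bookkeeping around it
tells the properly nested one from the badly nested one.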

> (on the other hand, you can parse XML with a single RE, and
> I don't see why you cannot use a similar technique to parse
> HTML...)

Now that is interesting indeed -- I didn't know that!  I
thought some recursion, &c, would be needed in general.
Or do you mean the so-called "shallow parsing" of, e.g.,
http://www.cs.sfu.ca/~cameron/REX.html?
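
If shallow parsing is what was meant, the idea can be sketched roughly
as follows. This is a much-simplified illustration and NOT the REX
expression from the page above: the SHALLOW pattern and shallow_parse
are hypothetical names, and the pattern ignores DOCTYPE declarations
and '>' inside attribute values. A single RE splits the document into
markup and text items without checking that elements nest correctly.

```python
import re

# Simplified illustration only; not the REX expression itself.
# Alternatives are tried left to right, so the more specific forms
# (comments, CDATA, processing instructions) come before generic tags.
SHALLOW = re.compile(
    r"<!--.*?-->"             # comment
    r"|<!\[CDATA\[.*?\]\]>"   # CDATA section
    r"|<\?.*?\?>"             # processing instruction
    r"|<[^<>]*>"              # start tag, end tag, or empty-element tag
    r"|[^<]+",                # character data
    re.DOTALL,
)


def shallow_parse(document):
    """Split a document into a flat list of markup and text items."""
    return SHALLOW.findall(document)


if __name__ == "__main__":
    items = shallow_parse('<a href="x">hi<!-- c --><b/></a>')
    print(items)
    # Text items are the ones that do not start with '<'.
    print([item for item in items if not item.startswith("<")])
```

Nothing here needs recursion, because the result is a flat list of
items; deciding whether the tags in that list balance is left to
whatever code consumes it.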

Anyway, I see I'd better change my line from "can't be
done" to "you'd better not even try", as in "Please don't
use regular expressions on XML, in the *very* short run you
will be bitten", quoting directly from
http://www.xmltwig.cx/perl_survey/perl_survey.html
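
As an illustration of how quickly one gets bitten, here is a small
sketch for the original question of extracting only the text. It
assumes a current Python, where htmllib no longer exists, so it uses
html.parser.HTMLParser from the standard library as a stand-in;
TextExtractor, extract_text, and naive_strip are names made up for
this example. The naive RE breaks on a '>' inside an attribute value
and leaves entities undecoded, while the parser-based version copes
with both.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data, skipping script and style contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html):
    """Return the text content of an HTML string, using a real parser."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def naive_strip(html):
    """Tag stripping with a single RE, shown only for comparison."""
    return re.sub(r"<[^>]*>", "", html)


if __name__ == "__main__":
    sample = '<p title="a > b">Hello &amp; <b>bye</b></p>'
    print(repr(naive_strip(sample)))
    print(repr(extract_text(sample)))
```

The RE could of course be patched for this particular case, but each
patch tends to uncover the next one, which is rather the point of the
quotation.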

As with the Turing-completeness of C++ templates, I
think much of the dark fascination of RE's comes from
the fact that it's hard to find something you _cannot_
do with them...:-).

Alex