Q: how to extract only text from a html ?

D-Man dsh8290 at rit.edu
Thu Nov 2 20:39:24 EST 2000


On Thu, 02 Nov 2000 06:18:21 Alex Martelli wrote:
 | 
 | Like for the Turing-completeness of C++ templates, I
 | think much of the dark fascination of RE's comes from
 | the fact that it's hard to find something you _cannot_
 | do with them...:-).
 | 


It's not that hard: try to match parentheses with unlimited nesting, which no true regular expression can do.  Limiting the nesting to a fixed depth, say only 1 level, makes it possible, but the pattern gets ugly fast.  Ex:   ((3 + 2) - (1 + 0))
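
To make the nesting point concrete, here is a minimal Python sketch. The names balanced and ONE_LEVEL are made up for illustration. The counter handles any nesting depth, while the regex only copes with an outer pair holding non-nested inner pairs:

```python
import re

def balanced(s):
    """Return True if every '(' has a matching ')' at any nesting depth."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # a ')' appeared before its '('
                return False
    return depth == 0

# Fixed depth only: an outer pair that may contain non-nested inner pairs.
ONE_LEVEL = re.compile(r"\((?:[^()]|\([^()]*\))*\)")

print(balanced("((3 + 2) - (1 + 0))"))                    # True
print(bool(ONE_LEVEL.fullmatch("((3 + 2) - (1 + 0))")))   # True
print(bool(ONE_LEVEL.fullmatch("(((1)))")))               # False, too deep
```

Each extra level of nesting means writing the inner alternative one more time inside itself, which is why the counter is the usual choice.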


For the HTML stripping, the following RE (adapted from matching C comments) may do the job.  No guarantees though and I haven't tested it ;-)

<[^>]+>


The following perl substitution deletes all tags (read: text matched by the regex); sed needs the -E flag to accept it, and vi wants the + written as \+:

s/<[^>]+>//g

The python code is (I think):

import re
text = file.read()   # file is an already-open file object
text = re.sub( r"<[^>]+>", "", text )  # args are (pattern, replacement, string)
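
For anyone who wants to try it without a file, here is a small self-contained sketch of the same substitution. The names TAG and strip_tags and the sample string are made up for illustration:

```python
import re

# Same pattern as above: '<', one or more non-'>' characters, then '>'.
TAG = re.compile(r"<[^>]+>")

def strip_tags(html):
    """Delete everything the tag pattern matches and return the rest."""
    return TAG.sub("", html)

sample = "<p>Hello, <b>world</b>!</p>"
print(strip_tags(sample))  # Hello, world!
```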


On second thought here, suppose you have some text in your html file like this:

<html><body>  Some examples of tautologies are 3 < 5 ;  5 > 3 </body></html>

the text "< 5 ;  5 >" will be matched as a tag and stripped along with the real ones.
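
The false match is easy to reproduce. The sketch below first lists what the RE considers a tag, then, as one possible alternative that is not part of the RE approach above, collects the text with Python 3's html.parser module. The class name TextOnly and the function extract_text are made up for illustration:

```python
import re
from html.parser import HTMLParser

html = "<html><body>  Some examples of tautologies are 3 < 5 ;  5 > 3 </body></html>"

# The RE treats the comparison in the middle as one more "tag".
print(re.findall(r"<[^>]+>", html))

class TextOnly(HTMLParser):
    """Collect the character data and ignore the markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def extract_text(source):
    parser = TextOnly()
    parser.feed(source)
    parser.close()
    return "".join(parser.chunks)

# The parser keeps the stray '<' and '>' as ordinary text.
print(extract_text(html))
```

A real parser is slower than a one-line substitution, but it does not have to guess where a tag starts.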



As a side comment, is <> a legal tag?  It will not be matched by my RE, since the + requires at least one character between the angle brackets.
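
A quick check of that in Python (a sketch; the variable names are made up). Swapping + for * would make the empty pair match as well:

```python
import re

TAG = re.compile(r"<[^>]+>")       # needs at least one character inside
TAG_OR_EMPTY = re.compile(r"<[^>]*>")  # also accepts an empty pair

print(TAG.search("<>"))                   # None
print(TAG_OR_EMPTY.search("<>").group())  # <>
```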

-D




