Q: how to extract only text from a html ?
D-Man
dsh8290 at rit.edu
Thu Nov 2 20:39:24 EST 2000
On Thu, 02 Nov 2000 06:18:21 Alex Martelli wrote:
|
| Like for the Turing-completeness of C++ templates, I
| think much of the dark fascination of RE's comes from
| the fact that it's hard to find something you _cannot_
| do with them...:-).
|
It's not that hard: try to match parentheses with unlimited nesting. OK, maybe that's a little too difficult for you; how about parens with only 1 level of nesting? Ex: ((3 + 2) - (1 + 0))
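To make the nesting point concrete, here is a minimal Python sketch that checks parenthesis balance with a depth counter instead of a pattern. The function name is_balanced is made up for this illustration, and it only looks at ( and ), nothing else.

```python
def is_balanced(text):
    """Return True if every '(' in text has a matching ')'."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    # Balanced only if every opened paren was closed again.
    return depth == 0


print(is_balanced("((3 + 2) - (1 + 0))"))
print(is_balanced("((3 + 2) - (1 + 0)"))
```

The counter can grow as deep as the input needs, which is exactly the unbounded memory a plain regular expression does not have.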
For the HTML stripping, the following RE (adapted from matching C comments) may do the job. No guarantees though and I haven't tested it ;-)
<[^>]+>
The following sed/vi/perl command will remove all tags (read: text matched by the regex), replacing each with nothing. In sed and vi the + usually has to be written \+ (or use sed -E):
s/<[^>]+>//g
The Python code is (I think):
import re
text = file.read()  # file is an already-open file object
text = re.sub( "<[^>]+>", "", text )  # args: pattern, replacement, string
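For reference, re.sub takes its arguments as pattern, replacement, string and returns the new string. A small self-contained sketch, using an in-memory sample instead of a file; the names TAG_RE, strip_tags and sample are made up for this example:

```python
import re

# Same pattern as above: '<', one or more non-'>' characters, then '>'.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(html):
    """Return html with everything matched by TAG_RE removed."""
    return TAG_RE.sub("", html)


sample = "<html><body><p>Hello, <b>world</b>!</p></body></html>"
print(strip_tags(sample))  # Hello, world!
```

Strings are immutable in Python, so the result has to be assigned or printed; calling re.sub and discarding the return value changes nothing.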
On second thought here, suppose you have some text in your html file like this:
<html><body> Some examples of tautologies are 3 < 5 ; 5 > 3 </body></html>
the text "< 5 ; 5 >" will be matched as a tag.
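The false match is easy to reproduce. The sketch below runs the naive pattern on that exact line and then, as a swapped-in alternative that is not part of the regex approach above, collects the text with html.parser from the modern Python standard library. The class name TextExtractor and the variable names are made up for this example, and how a parser treats a stray < can vary, so take it as an illustration only.

```python
import re
from html.parser import HTMLParser

sample = ("<html><body> Some examples of tautologies are "
          "3 < 5 ; 5 > 3 </body></html>")

# The naive pattern treats "< 5 ; 5 >" as a tag and deletes it too.
naive = re.sub(r"<[^>]+>", "", sample)
print(repr(naive))


class TextExtractor(HTMLParser):
    """Collect only the character data the parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = TextExtractor()
parser.feed(sample)
parser.close()
parsed = "".join(parser.chunks)
print(repr(parsed))
```

A real parser only recognises a tag when < is followed by something that can start one, so the comparison survives there while the regex swallows it.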
As a side comment, is <> a legal tag? This will not be matched by my re.
-D