Newby: How do I strip HTML tags?
Harvey Thomas
hst at empolis.co.uk
Fri Jun 7 12:12:21 EDT 2002
netvegetable wrote:
> I'm mucking around with cgi, and I'm trying to work out a way to strip the
> html tags of a string. e.g., I want to convert this...
>
> ><font size = 12><b><big>Really Big String</big></b></font>
>
> to this ...
>
> >Really Big String
>
> ... and store it as a value.
>
> I worked out a crude, but effective way of doing it (see code below), but I
> can't escape the feeling there must be a built in way of doing it more. If
> nothing else, I'm sure somebody who knows their regular expressions could
> neaten it up (please?).
>
> def strip_html_tags(it):
>     left = it[:(len(it)/2)]
>     right = it[(len(it)/2):]
>     final = left[left.rfind('>')+1:] + right[:right.find('<')]
>     return final
>
If your HTML is reasonably legal, then you can use something along the lines of the following very quick and very dirty program:
import re
import sys

s = open(sys.argv[1]).read()
o = open('tmp.tmp', 'w')
# Alternative 1 matches a comment; alternative 2 matches a tag plus the text after it.
r = re.compile(r'(<!--.*?-->)|(<[^>]*>)([^<]+)', re.DOTALL)
for x, y, z in r.findall(s):
    # Skip comments (z is empty for them) and whitespace-only content.
    if z and not z.isspace():
        o.write(z + '\n')
o.close()
Note that you have to test first for HTML comments as a comment can contain a '>' character.
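For the original question (turning a tagged string into a plain value rather than writing a file), the same idea can be sketched with re.sub. This is only an illustration, assuming Python 3 and reasonably well-formed HTML; the names TAG_OR_COMMENT and TAG_ONLY are made up here, and strip_html_tags just reuses the name from the quoted post.

```python
import re

# Hypothetical pattern names, not from the original post.
# The comment alternative is listed first, so a '>' inside a comment
# does not end the match early.
TAG_OR_COMMENT = re.compile(r'<!--.*?-->|<[^>]*>', re.DOTALL)

# For contrast: a tag-only pattern that does not know about comments.
TAG_ONLY = re.compile(r'<[^>]*>')


def strip_html_tags(text):
    """Return text with HTML comments and tags removed (a rough sketch)."""
    return TAG_OR_COMMENT.sub('', text)


if __name__ == '__main__':
    sample = '<font size = 12><b><big>Really Big String</big></b></font>'
    print(strip_html_tags(sample))

    commented = '<!-- a > b --><b>kept</b>'
    print(repr(strip_html_tags(commented)))     # comment removed whole
    print(repr(TAG_ONLY.sub('', commented)))    # comment cut at its first '>'
```

The last two lines show why the comment test has to come first: the tag-only pattern stops at the '>' inside the comment and leaves the rest of the comment in the output.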