Strip HTML tags?

EP EP at zomething.com
Sat Dec 13 23:57:05 EST 2003


>Hello,
>
>I was wondering what would be the easiest way to strip away HTML tags from 
>a string?
>
>Or how would I remove everything between < and >, along with the < and > 
>themselves, using regex?


I'm a newbie, but here's a way I did it:

import re

## Compile regular expressions for the HTML tags I knew about; line breaks
## get a separate pattern.
htmltags = re.compile(r'<p.*?>|</p>|<tr>|</tr>|<td.*?>|</td>|<a.*?>|</a>|<i>|</i>|<b>|</b>|<hr.*?>')
linebreaks = re.compile(r'<br\s*/?>')
## Not shown: some lines to input or iterate over files go here (they set
## nextfile); then you read the HTML file in as a string:
with open(nextfile, 'r') as wwwf:
    strng = wwwf.read()
## Execute some regular expression methods.
## The first one substitutes an empty string for each known HTML tag.
nohtml = htmltags.sub('', strng)
## This one splits the string into lines on <br>, <br/> or <br />, consuming
## the HTML tag in the process.
textlines = linebreaks.split(nohtml)
## You could then print it to stdout or write it to a file.
for line in textlines:
    print(line)

OK, on second thought it was probably not the easiest way, which is what 
you asked for; but it wasn't hard and I understood how it worked.  :-)
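
For the literal question (remove everything between < and >, brackets 
included), a single pattern is probably enough. This is only a minimal 
sketch, not a full HTML parser: the names TAG_RE and strip_tags are made up 
for illustration, and it assumes no tag contains a literal > inside an 
attribute value. It also leaves the text inside <script> and <style> 
elements in place, since only the tags themselves are removed.

```python
import re

# Hypothetical name: matches '<', then any run of characters that are not
# '>', then the closing '>'. [^>] also matches newlines, so a tag that is
# wrapped across several lines is still removed.
TAG_RE = re.compile(r'<[^>]*>')

def strip_tags(text):
    """Return text with every <...> span replaced by an empty string."""
    return TAG_RE.sub('', text)

sample = '<p class="x">Hello, <b>world</b>!</p>'
print(strip_tags(sample))  # prints: Hello, world!
```

Unlike the list of known tags above, this removes any tag at all, so there 
is nothing to update when a page uses a tag you did not anticipate.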





