HTML Parser

Fri Dec 29 10:26:31 EST 2000

Hello,

I am trying to write an HTML parser. I am starting off with a simple one
like so:

# html_parser.py
import re
import string

newline=re.compile('\n')
HTMLtags=re.compile('<.*>')

file=open('C:\Documents and Settings\dvoitenko\My Documents\Python\index.jsp', 'r')
input_file = file.read()
file.close()

jsp_content = newline.split(input_file)

# loop thru lines ...
for line in jsp_content[:]: 
	result=HTMLtags.search(line)
	tag_content = line[result.start()+1:result.end()-1]
	print '<'+string.upper(tag_content)+'>'

This is supposed to uppercase only the HTML tags. In practice it also
uppercases everything else on the line, which I do not want. What is wrong
here? For example, the link <a href=hello.jsp>Hello</a> comes out as
<A HREF=HELLO.JSP>HELLO</A>, which is not right.
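
A likely cause, sketched here rather than confirmed against the real
index.jsp: the pattern '<.*>' is greedy, so on the example link it matches
from the first '<' to the last '>' on the line, and the text between the tags
gets uppercased along with them. The snippet below is a minimal illustration
in current Python syntax. The names GREEDY, TAG and upper_tags are made up
for this example and are not part of the script above.

```python
import re

GREEDY = re.compile(r'<.*>')    # original pattern: runs to the LAST '>' on the line
TAG = re.compile(r'<[^>]*>')    # stops at the FIRST '>' after each '<'

def upper_tags(line):
    """Uppercase each <...> tag, leaving the text between tags unchanged."""
    return TAG.sub(lambda m: m.group(0).upper(), line)

line = '<a href=hello.jsp>Hello</a>'
print(GREEDY.search(line).group(0))  # the whole line comes back as one match
print(upper_tags(line))              # <A HREF=HELLO.JSP>Hello</A>
```

Using sub() with a function also means lines that contain no tag pass
through unchanged, whereas calling search() on such a line returns None.
Note that this still uppercases attribute values such as hello.jsp; touching
only the tag name would need a narrower pattern.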