HTML Parser
Voitenko, Denis
dvoitenko at qode.com
Fri Dec 29 10:26:31 EST 2000
Hello,
I am trying to write an HTML parser. I am starting off with a simple one
like so:
# html_parser.py
import re
import string
newline=re.compile('\n')
HTMLtags=re.compile('<.*>')
file=open('C:\Documents and Settings\dvoitenko\My Documents\Python\index.jsp', 'r')
input_file = file.read()
file.close()
jsp_content = newline.split(input_file)
# loop thru lines ...
for line in jsp_content[:]:
    result=HTMLtags.search(line)
    tag_content = line[result.start()+1:result.end()-1]
    print '<'+string.upper(tag_content)+'>'
This is supposed to uppercase only the HTML tags, but it also uppercases
the text between them, which I do not want. What is wrong here? For example,
given the link <a href=hello.jsp>Hello</a>, it produces <A
HREF=HELLO.JSP>HELLO</A>, which is not right.
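
A likely cause is that '<.*>' is greedy: '.*' runs from the first '<' on the
line to the last '>', so the whole link, including the text between the tags,
is treated as one tag. Separately, search() returns None for a line with no
tag at all, so result.start() would fail there. The sketch below shows one
possible fix, assuming a tag never contains a literal '>' inside an attribute
value. The names TAG, TAG_NAME, uppercase_tags and uppercase_tag_names are
made up for illustration, and the code uses current Python syntax rather than
the print statement and string module from the script above.

```python
import re

# Match '<', then any run of characters that are not '>', then '>'.
# Unlike '<.*>', this cannot run past the end of the first tag.
TAG = re.compile(r'<[^>]*>')

# Match only the start of a tag: '<', an optional '/', and the tag name.
TAG_NAME = re.compile(r'<(/?)(\w+)')

def uppercase_tags(text):
    """Uppercase every <...> tag in full; leave text outside tags alone."""
    return TAG.sub(lambda m: m.group(0).upper(), text)

def uppercase_tag_names(text):
    """Uppercase only the tag names; attributes and text stay as written."""
    return TAG_NAME.sub(lambda m: '<' + m.group(1) + m.group(2).upper(), text)

line = '<a href=hello.jsp>Hello</a>'

# The greedy pattern from the script matches the entire line as one "tag".
print(re.search('<.*>', line).group(0))   # <a href=hello.jsp>Hello</a>

print(uppercase_tags(line))               # <A HREF=HELLO.JSP>Hello</A>
print(uppercase_tag_names(line))          # <A href=hello.jsp>Hello</A>
```

Because sub() simply returns the input unchanged when nothing matches, lines
without tags need no special handling. The second function may be the safer
choice if attribute values such as file names are case-sensitive. For anything
beyond simple pages, a real HTML parser is probably more robust than regular
expressions.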