Parsing html
wes weston
wweston at att.net
Thu Jul 8 19:18:32 EDT 2004
C Gillespie wrote:
> Dear All,
>
> I have hopefully a very simple problem. I wish to parse an html page and
> extract everything between the <body> tags.
>
> E.g.
> <head>
> <body>
> <b>afsdf</b>
> </body>
> </head>
>
> Would give
> <body>
> <b>afsdf</b>
> </body>
>
> I've been playing about with htmllib with no success. Any suggestions?
>
> Thanks
>
> Colin
>
>
#--------------------------------------------------------------------------
def TokenizeHTML( s ):
    #return a list containing two types of tokens:
    #   1. html tokens starting with '<' and ending with '>'
    #   2. strings between '>' and '<'
    #note: an unterminated tag at the very end of s is dropped
    state = 0
    htmlStr = ""    #tag currently being collected
    text = ""       #plain text currently being collected
    tokens = []
    for ch in s:
        if state == 0:      #initial state; detection state
            if ch == '<':
                state = 1
                htmlStr += ch
            else:
                state = 2
                text += ch
        elif state == 1:    #html state; in a <> pair
            htmlStr += ch
            if ch == '>':
                state = 0
                tokens.append(htmlStr)
                htmlStr = ""
        elif state == 2:    #non html state; not in a <> pair
            if ch == '<':
                state = 1
                tokens.append(text)
                text = ""
                htmlStr = "<"
            else:
                text += ch
    #flush any text that follows the last tag
    if len(text) > 0:
        tokens.append(text)
    return tokens
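
The tokenizer above only splits the page into tokens; it does not by itself pull out the <body> section that was asked about. Here is a minimal sketch of that last step. The names tokenize_html and extract_body are hypothetical and not part of the original post; the tokenizer is a compact restatement of the same state machine so the example runs on its own. It assumes a single, well-formed <body>...</body> pair and matches the tag names case-insensitively.

```python
def tokenize_html(s):
    """Split s into tag tokens ('<...>') and the text runs between them."""
    tokens = []
    text = ""
    tag = ""
    in_tag = False
    for ch in s:
        if in_tag:
            tag += ch
            if ch == '>':          # end of tag: emit it
                tokens.append(tag)
                tag = ""
                in_tag = False
        elif ch == '<':            # start of tag: emit pending text first
            if text:
                tokens.append(text)
                text = ""
            tag = "<"
            in_tag = True
        else:
            text += ch
    if text:                       # text after the last tag
        tokens.append(text)
    return tokens


def extract_body(s):
    """Return everything from the opening <body> tag through </body>.

    Returns '' if either tag is missing (an assumption made for this sketch).
    """
    tokens = tokenize_html(s)
    start = end = None
    for i, tok in enumerate(tokens):
        low = tok.lower()
        if start is None and (low == "<body>" or low.startswith("<body ")):
            start = i
        elif start is not None and low == "</body>":
            end = i
            break
    if start is None or end is None:
        return ""
    return "".join(tokens[start:end + 1])


page = "<head>\n<body>\n<b>afsdf</b>\n</body>\n</head>"
print(extract_body(page))
```

On the example page from the question this prints the three lines from <body> through </body>. For real-world pages with comments, scripts, or a '>' inside attribute values, a proper HTML parser is likely a safer choice than a character-level state machine.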