Parsing html

Thu Jul 8 19:18:32 EDT 2004

C Gillespie wrote:
> Dear All,
> 
> I have hopefully a very simple problem. I wish to parse an html page and
> extract everything between the <body> tags.
> 
> E.g.
> <head>
>     <body>
>         <b>afsdf</b>
>     </body>
> </head>
> 
> Would give
> <body>
>     <b>afsdf</b>
> </body>
> 
> I've been playing about with htmllib with no success. Any suggestions?
> 
> Thanks
> 
> Colin
> 
> 

#--------------------------------------------------------------------------
def TokenizeHTML( s ):
    #return a list containing two types of tokens:
    #   1. html tokens starting with '<' and ending with '>'
    #   2. strings between '>' and '<'
    #names like 'str' and 'list' are avoided so the builtins are not shadowed
    state  = 0   #0 = between tokens, 1 = inside a tag, 2 = inside text
    tag    = ""
    text   = ""
    tokens = []
    for ch in s:
        if state == 0: #initial state; detection state
            if ch == '<':
                state = 1
                tag += ch
            else:
                state = 2
                text += ch
        elif state == 1: #html state; in a <> pair
            tag += ch
            if ch == '>':
                state = 0
                tokens.append(tag)
                tag = ""
        elif state == 2: #non html state; not in a <> pair
            if ch == '<':
                state = 1
                tokens.append(text)
                text = ""
                tag = "<"
            else:
                text += ch
    #flush whatever is left over at the end of the input
    if text:
        tokens.append(text)
    if tag: #unterminated tag, e.g. "<b" with no closing '>'
        tokens.append(tag)
    return tokens
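
The tokenizer above only splits the page into tags and text; it does not pick out the <body> section by itself. One way to finish the job, offered as a rough sketch rather than a definitive answer, is to walk the token list and keep everything from the opening body tag through the closing one. The helper name extract_body is hypothetical (it is not part of the original post), and it assumes its argument is a list of tokens shaped like the ones TokenizeHTML returns.

```python
def extract_body(tokens):
    # Hypothetical helper, not from the original post.
    # Assumes 'tokens' is a list of strings where each tag ("<...>")
    # and each run of text is its own item.
    collected = []
    inside = False
    for tok in tokens:
        lowered = tok.lower()
        # Match "<body>" or "<body attr=...>", but not e.g. "<b>"
        if not inside and (lowered == "<body>" or lowered.startswith("<body ")):
            inside = True
        if inside:
            collected.append(tok)
            if lowered.startswith("</body"):
                break
    # Returns "" if no opening body tag was found
    return "".join(collected)


tokens = ["<head>", "<body>", "<b>", "afsdf", "</b>", "</body>", "</head>"]
print(extract_body(tokens))   # <body><b>afsdf</b></body>
```

With the real tokenizer this would be called as extract_body(TokenizeHTML(page)). Whitespace between the tags is kept, since the tokenizer returns it as ordinary text tokens. If the page has no closing </body>, the sketch simply returns everything from the opening tag to the end of the input.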