Help Parsing an HTML File

Sat Feb 16 04:10:33 EST 2008

Stefan Behnel wrote:

> egonslokar at gmail.com wrote:
>> I have a single unicode file that has descriptions of hundreds of
>> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>> 
>> I need to parse the file in such a way to extract data out of the html
>> and to come up with a tab separated file that would look like OUTPUT-
>> FILE below.
>> 
>> =====OUTPUT-FILE=====
>> /please note that the first line of the file contains column headers/
>> ------Tab Separated Output File Begin------
>> H1   H2      DIV     Segment1        Segment2        Segment3
>> RoséH1-1     RoséH2-1        RoséDIV-1       RoséSegmentDIV1-1       RoséSegmentDIV2-1
>> ------Tab Separated Output File End------
>> 
>> =====HTML-EXAMPLE=====
>> ------HTML Example Begin------
>> <html>
>> 
>> <h1>RoséH1-1</h1>
>> <h2>RoséH2-1</h2>
>> <div>RoséDIV-1</div>
>> <div "segment1">RoséSegmentDIV1-1</div><br>
>> <div "segment2">RoséSegmentDIV2-1</div><br>
>> <div "segment3">RoséSegmentDIV3-1</div><br>
>> <br>
>> <br>
>> 
>> </html>
>> ------HTML Example End------
> 
> Now, what ugly markup is that? You will never manage to get any HTML
> compliant parser return the "segmentX" stuff in there. I think your best
> bet is really going for pyparsing or regular expressions (and I actually
> recommend pyparsing here).
> 
> Stefan

In practice the following might be sufficient:

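from BeautifulSoup import BeautifulSoup

def chunks(bs):
    # Group the tags into records; every <h1> starts a new record.
    chunk = []
    for tag in bs.findAll(["h1", "h2", "div"]):
        if tag.name == "h1":
            if chunk:
                yield chunk
                chunk = []
        chunk.append(tag)
    if chunk:
        yield chunk

def process(filename):
    bs = BeautifulSoup(open(filename))
    for chunk in chunks(bs):
        # tag.string is None for empty or nested tags, so fall back to u"".
        columns = [tag.string or u"" for tag in chunk]
        # Pad short rows on the right to the six expected columns.
        columns += ["No Value"] * (6 - len(columns))
        # UTF-8 is an assumed output encoding; it keeps non-ASCII text
        # intact when stdout is redirected to a file.
        print u"\t".join(columns).encode("utf-8")

if __name__ == "__main__":
    process("example.html")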

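The biggest caveat is that values are assigned to columns by position, so
only columns at the end of a row may be left out. A record that lacks,
say, its <h2> would have every later value shifted one column to the left.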
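
One possible way around that caveat, offered as a sketch rather than
something tested against the real file, is to key each value by its tag
name or quoted "segmentN" label instead of by its position. The version
below works on the raw text with a regular expression (one of the options
Stefan mentions) and is written for current Python using only the re
module. COLUMNS, TAG_RE and rows() are names made up for illustration, and
the pattern assumes the markup always looks like the example above, with
no nested tags.

```python
import re

# Column order taken from the header line of the requested output file.
COLUMNS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]

# Matches <h1>..</h1>, <h2>..</h2>, <div>..</div> and <div "segmentN">..</div>.
# Assumption: tags are never nested and the label is always double-quoted.
TAG_RE = re.compile(
    r'<(h1|h2|div)(?:\s+"(segment\d+)")?\s*>(.*?)</\1>',
    re.IGNORECASE | re.DOTALL,
)

def rows(html, missing="No Value"):
    """Yield one list of column values per <h1>-delimited record."""
    record = {}
    for name, label, text in TAG_RE.findall(html):
        # Every <h1> starts a new record; emit the previous one first.
        if name.lower() == "h1" and record:
            yield [record.get(col, missing) for col in COLUMNS]
            record = {}
        # "segment2" -> "Segment2"; unlabeled tags use their own name.
        key = label.capitalize() if label else name.upper()
        record[key] = text.strip()
    if record:
        yield [record.get(col, missing) for col in COLUMNS]
```

Joining each row with "\t", after first writing "\t".join(COLUMNS), would
give the header line asked for in the original post. Because values are
looked up by name, a record with no <h2> or no "segment1" gets "No Value"
in that column only.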

Peter