Help Parsing an HTML File

Sat Feb 16 02:41:45 EST 2008

egonslokar at gmail.com wrote:
> I have a single Unicode file that has descriptions of hundreds of
> objects. The file closely resembles HTML-EXAMPLE pasted below.
> 
> I need to parse the file to extract data from the HTML and produce a
> tab-separated file that looks like OUTPUT-FILE below.
> 
> =====OUTPUT-FILE=====
> /please note that the first line of the file contains column headers/
> ------Tab Separated Output File Begin------
> H1	H2	DIV	Segment1	Segment2	Segment3
> RoséH1-1	RoséH2-1	RoséDIV-1	RoséSegmentDIV1-1	RoséSegmentDIV2-1	RoséSegmentDIV3-1
> ------Tab Separated Output File End------
> 
> =====HTML-EXAMPLE=====
> ------HTML Example Begin------
> <html>
> 
> <h1>RoséH1-1</h1>
> <h2>RoséH2-1</h2>
> <div>RoséDIV-1</div>
> <div "segment1">RoséSegmentDIV1-1</div><br>
> <div "segment2">RoséSegmentDIV2-1</div><br>
> <div "segment3">RoséSegmentDIV3-1</div><br>
> <br>
> <br>
> 
> </html>
> ------HTML Example End------

Now, what ugly markup is that? A bare quoted string like "segment1" inside a
tag is not valid attribute syntax, so you will never get an HTML-compliant
parser to return the "segmentX" stuff in there. I think your best bet is really
going for pyparsing or regular expressions (and I actually recommend pyparsing
here).
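
For the regular expression route, a minimal sketch could look like the
following. It uses only the standard library "re" module. The names HEADERS,
FIELD_PATTERNS, extract_row and html_to_tsv are made up for illustration, and
the code assumes that every object sits in its own <html>...</html> block and
that the tags always look exactly like the ones in HTML-EXAMPLE. If the real
file differs, the patterns will need adjusting.

```python
import re

# Column headers, taken from the OUTPUT-FILE example in the original post.
HEADERS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]

# One pattern per column, in header order. These are assumptions based on
# the HTML-EXAMPLE only; the bare "segmentN" string is matched literally.
FIELD_PATTERNS = [
    re.compile(r"<h1>(.*?)</h1>", re.S),
    re.compile(r"<h2>(.*?)</h2>", re.S),
    re.compile(r"<div>(.*?)</div>", re.S),
    re.compile(r'<div\s+"segment1">(.*?)</div>', re.S),
    re.compile(r'<div\s+"segment2">(.*?)</div>', re.S),
    re.compile(r'<div\s+"segment3">(.*?)</div>', re.S),
]

# Assumption: each object is wrapped in its own <html>...</html> block.
OBJECT_PATTERN = re.compile(r"<html>(.*?)</html>", re.S)


def extract_row(chunk):
    """Return one list of column values for a single object's markup."""
    row = []
    for pattern in FIELD_PATTERNS:
        match = pattern.search(chunk)
        # A missing field becomes an empty column rather than an error.
        row.append(match.group(1).strip() if match else "")
    return row


def html_to_tsv(text):
    """Build the tab-separated output, header line first."""
    lines = ["\t".join(HEADERS)]
    for chunk in OBJECT_PATTERN.findall(text):
        lines.append("\t".join(extract_row(chunk)))
    return "\n".join(lines) + "\n"
```

To use it, read the input file as text with the right encoding, pass the
string to html_to_tsv(), and write the result back out. Regular expressions
like these are brittle if the markup varies, which is one reason to prefer a
pyparsing grammar for anything less regular.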

Stefan