Help Parsing an HTML File

Fri Feb 15 17:06:42 EST 2008

On Feb 15, 3:28 pm, egonslo... at gmail.com wrote:
> Hello Python Community,
>
> It'd be great if someone could provide guidance or sample code for
> accomplishing the following:
>
> I have a single Unicode file that contains descriptions of hundreds of
> objects. The file closely resembles HTML-EXAMPLE pasted below.
>
> I need to parse the file to extract data from the HTML and produce a
> tab-separated file that looks like OUTPUT-FILE below.
>
> Any tips, advice, and guidance are greatly appreciated.
>
> Thanks,
>
> Egon
>
> =====OUTPUT-FILE=====
> /please note that the first line of the file contains column headers/
> ------Tab Separated Output File Begin------
> H1      H2      DIV     Segment1        Segment2        Segment3
> RoséH1-1       RoséH2-1       RoséDIV-1      RoséSegmentDIV1-1      RoséSegmentDIV2-1      RoséSegmentDIV3-1
> PinkH1-2        PinkH2-2        PinkDIV2-2      PinkSegmentDIV1-2       No-Value        No-Value
> BlackH1-3       BlackH2-3       BlackDIV2-3     BlackSegmentDIV1-3      No-Value        No-Value
> YellowH1-4      YellowH2-4      YellowDIV2-4    YellowSegmentDIV1-4     YellowSegmentDIV2-4     No-Value
> ------Tab Separated Output File End------
>
> =====HTML-EXAMPLE=====
> ------HTML Example Begin------
> <html>
>
> <h1>RoséH1-1</h1>
> <h2>RoséH2-1</h2>
> <div>RoséDIV-1</div>
> <div "segment1">RoséSegmentDIV1-1</div><br>
> <div "segment2">RoséSegmentDIV2-1</div><br>
> <div "segment3">RoséSegmentDIV3-1</div><br>
> <br>
> <br>
>
> <h1>PinkH1-2</h1>
> <h2>PinkH2-2</h2>
> <div>PinkDIV2-2</div>
> <div "segment1">PinkSegmentDIV1-2</div><br>
> <br>
> <comment></comment>
>
> <h1>BlackH1-3</h1>
> <h2>BlackH2-3</h2>
> <div>BlackDIV2-3</div>
> <div "segment1">BlackSegmentDIV1-3</div><br>
>
> <h1>YellowH1-4</h1>
> <h2>YellowH2-4</h2>
> <div>YellowDIV2-4</div>
> <div "segment1">YellowSegmentDIV1-4</div><br>
> <div "segment2">YellowSegmentDIV2-4</div><br>
>
> </html>
> ------HTML Example End------

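Pyparsing, ElementTree, and lxml are all good candidates as well.
BeautifulSoup copes with malformed HTML, though, and the sample above is
malformed: <div "segment1"> has a bare quoted string where an attribute
should be.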

http://pyparsing.wikispaces.com/
http://effbot.org/zone/element-index.htm
http://codespeak.net/lxml/
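
If you would rather avoid installing anything, here is a rough sketch of
the same idea using only the standard library's html.parser module (a
swap-in for illustration, not one of the libraries listed above). It
assumes that every record starts at an <h1>, that a <div> is a segment
when "segmentN" appears anywhere in its attributes, and that missing
cells become "No-Value". The names RecordParser and html_to_tsv are made
up for this example.

```python
import re
from html.parser import HTMLParser

COLUMNS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]
MISSING = "No-Value"


class RecordParser(HTMLParser):
    """Collect one dict per record; a record begins at each <h1>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.records = []
        self._column = None  # column currently receiving text, if any

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.records.append({})  # each <h1> opens a new record
            self._column = "H1"
        elif tag == "h2":
            self._column = "H2"
        elif tag == "div":
            self._column = self._div_column(attrs)
        else:
            self._column = None  # <br>, <comment>, etc. carry no data

    def handle_endtag(self, tag):
        self._column = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._column and self.records:
            row = self.records[-1]
            row[self._column] = row.get(self._column, "") + text

    @staticmethod
    def _div_column(attrs):
        # Look for "segmentN" in attribute names and values alike, so both
        # <div "segment1"> and <div class="segment1"> are recognised.
        for name, value in attrs:
            match = re.search(r"segment(\d+)", f"{name} {value or ''}",
                              re.IGNORECASE)
            if match:
                return "Segment" + match.group(1)
        return "DIV"


def html_to_tsv(html_text):
    """Return the tab-separated table, header line first."""
    parser = RecordParser()
    parser.feed(html_text)
    parser.close()
    lines = ["\t".join(COLUMNS)]
    for row in parser.records:
        lines.append("\t".join(row.get(col, MISSING) for col in COLUMNS))
    return "\n".join(lines) + "\n"
```

To use it, read the file with an explicit encoding (utf-8 is only a
guess here), pass the text to html_to_tsv(), and write the result out
with the same encoding. Segments numbered above 3 are collected but not
written, since the header only has three segment columns.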

Mike