Help Parsing an HTML File

Fri Feb 15 21:56:47 EST 2008

On Feb 15, 2:28 pm, egonslo... at gmail.com wrote:
> Hello Python Community,
>
> It'd be great if someone could provide guidance or sample code for
> accomplishing the following:
>
> I have a single unicode file that has  descriptions of hundreds of
> objects. The file fairly resembles HTML-EXAMPLE pasted below.
>
> I need to parse the file in such a way to extract data out of the html
> and to come up with a tab separated file that would look like OUTPUT-
> FILE below.
>
> Any tips, advice and guidance is greatly appreciated.
>
> Thanks,
>
> Egon
>
> =====OUTPUT-FILE=====
> /please note that the first line of the file contains column headers/
> ------Tab Separated Output File Begin------
> H1      H2      DIV     Segment1        Segment2        Segment3
> RoséH1-1       RoséH2-1       RoséDIV-1      RoséSegmentDIV1-1      RoséSegmentDIV2-1      RoséSegmentDIV3-1
> PinkH1-2        PinkH2-2        PinkDIV2-2      PinkSegmentDIV1-2       No-Value        No-Value
> BlackH1-3       BlackH2-3       BlackDIV2-3     BlackSegmentDIV1-3      No-Value        No-Value
> YellowH1-4      YellowH2-4      YellowDIV2-4    YellowSegmentDIV1-4     YellowSegmentDIV2-4     No-Value
> ------Tab Separated Output File End------
>
> =====HTML-EXAMPLE=====
> ------HTML Example Begin------
> <html>
>
> <h1>RoséH1-1</h1>
> <h2>RoséH2-1</h2>
> <div>RoséDIV-1</div>
> <div "segment1">RoséSegmentDIV1-1</div><br>
> <div "segment2">RoséSegmentDIV2-1</div><br>
> <div "segment3">RoséSegmentDIV3-1</div><br>
> <br>
> <br>
>
> <h1>PinkH1-2</h1>
> <h2>PinkH2-2</h2>
> <div>PinkDIV2-2</div>
> <div "segment1">PinkSegmentDIV1-2</div><br>
> <br>
> <comment></comment>
>
> <h1>BlackH1-3</h1>
> <h2>BlackH2-3</h2>
> <div>BlackDIV2-3</div>
> <div "segment1">BlackSegmentDIV1-3</div><br>
>
> <h1>YellowH1-4</h1>
> <h2>YellowH2-4</h2>
> <div>YellowDIV2-4</div>
> <div "segment1">YellowSegmentDIV1-4</div><br>
> <div "segment2">YellowSegmentDIV2-4</div><br>
>
> </html>
> ------HTML Example End------

BeautifulSoup won't help much here, because the quoted tokens in the
tags (e.g. <div "segment1">) are not really attributes, and therefore
BeautifulSoup ignores them. As a result, you'll end up processing the
file line by line anyway, and that can be done just as easily without
BeautifulSoup. Based on the example file you posted, all that is
required is a simple regex to match the text between the opening and
closing tag on each line, and then to output the data in the order you
find it. Pad the end of each block of data with No-Value entries, and
you have your desired results.
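
As a rough sketch of that approach in Python 3, something like the
following might do. The names TAG_RE, HEADERS, html_to_rows and to_tsv
are made up for illustration, and the code assumes that every record
starts at an <h1> and that each line holds at most one tag pair of
interest, as in the posted example.

```python
import re

# One opening tag, its text, and the matching closing tag on a line,
# e.g. <div "segment1">text</div>. The stray quoted token is skipped.
TAG_RE = re.compile(r'<(h1|h2|div)\b[^>]*>(.*?)</\1>')

# Column headers taken from the desired output in the question.
HEADERS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]


def html_to_rows(lines):
    """Group tag texts into one row per <h1>, padded with No-Value."""
    rows = []
    for line in lines:
        match = TAG_RE.search(line)
        if match is None:
            continue  # <br>, <comment>, <html>, blank lines
        tag, text = match.group(1), match.group(2).strip()
        if tag == "h1":
            rows.append([])  # assumption: an <h1> opens a new record
        if rows:
            rows[-1].append(text)
    width = len(HEADERS)
    # Pad short rows, and truncate anything longer than the header.
    return [(row + ["No-Value"] * width)[:width] for row in rows]


def to_tsv(rows):
    """Join the header and the rows into tab-separated text."""
    return "\n".join("\t".join(r) for r in [HEADERS] + rows) + "\n"
```

To run it on a real file, html_to_rows(open("objects.html",
encoding="utf-8")) should work; the file name and the UTF-8 encoding
are guesses, since the question only says the file is "unicode". Note
that values are placed purely by order, so a block that has a segment3
but no segment2 would end up in the wrong column.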

Post some code with your efforts.