Help Parsing an HTML File

egonslokar at gmail.com
Fri Feb 15 16:28:51 EST 2008


Hello Python Community,

It'd be great if someone could provide guidance or sample code for
accomplishing the following:

I have a single Unicode file that contains descriptions of hundreds of
objects. The file closely resembles the HTML-EXAMPLE pasted below.

I need to parse the file, extract the data from the HTML, and produce a
tab-separated file that looks like the OUTPUT-FILE below.

Any tips, advice and guidance are greatly appreciated.

Thanks,

Egon




=====OUTPUT-FILE=====
/please note that the first line of the file contains column headers/
------Tab Separated Output File Begin------
H1	H2	DIV	Segment1	Segment2	Segment3
RoséH1-1	RoséH2-1	RoséDIV-1	RoséSegmentDIV1-1	RoséSegmentDIV2-1	RoséSegmentDIV3-1
PinkH1-2	PinkH2-2	PinkDIV2-2	PinkSegmentDIV1-2	No-Value	No-Value
BlackH1-3	BlackH2-3	BlackDIV2-3	BlackSegmentDIV1-3	No-Value	No-Value
YellowH1-4	YellowH2-4	YellowDIV2-4	YellowSegmentDIV1-4	YellowSegmentDIV2-4	No-Value
------Tab Separated Output File End------



=====HTML-EXAMPLE=====
------HTML Example Begin------
<html>

<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>
<br>
<br>

<h1>PinkH1-2</h1>
<h2>PinkH2-2</h2>
<div>PinkDIV2-2</div>
<div "segment1">PinkSegmentDIV1-2</div><br>
<br>
<comment></comment>

<h1>BlackH1-3</h1>
<h2>BlackH2-3</h2>
<div>BlackDIV2-3</div>
<div "segment1">BlackSegmentDIV1-3</div><br>

<h1>YellowH1-4</h1>
<h2>YellowH2-4</h2>
<div>YellowDIV2-4</div>
<div "segment1">YellowSegmentDIV1-4</div><br>
<div "segment2">YellowSegmentDIV2-4</div><br>

</html>
------HTML Example End------
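
For illustration, here is a rough sketch of one way the conversion above
might be approached. It is not a definitive solution. It assumes every
object starts with an <h1> tag, that the segment divs always look like
<div "segmentN">, and that tags are never nested; because the markup is
not valid HTML, it uses a regular expression rather than a real HTML
parser. The names html_to_rows and rows_to_tsv are made up for this
example.

```python
import re

NO_VALUE = "No-Value"
HEADER = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]

# Matches <h1>..</h1>, <h2>..</h2>, a bare <div>..</div>,
# and the non-standard <div "segmentN">..</div> form from the example.
TAG_RE = re.compile(
    r'<(h1|h2|div)(?:\s+"segment(\d+)")?\s*>(.*?)</\1>',
    re.IGNORECASE | re.DOTALL,
)


def html_to_rows(text):
    """Return one dict per object; each <h1> starts a new object."""
    rows = []
    row = None
    for tag, seg, value in TAG_RE.findall(text):
        tag = tag.lower()
        value = value.strip()
        if tag == "h1":
            # Pre-fill every column so missing segments become No-Value.
            row = dict.fromkeys(HEADER, NO_VALUE)
            row["H1"] = value
            rows.append(row)
        elif row is None:
            continue  # ignore anything that appears before the first <h1>
        elif tag == "h2":
            row["H2"] = value
        elif seg:
            key = "Segment" + seg
            if key in row:  # segments beyond Segment3 are dropped
                row[key] = value
        else:
            row["DIV"] = value
    return rows


def rows_to_tsv(rows):
    """Render the rows as tab-separated text with a header line."""
    lines = ["\t".join(HEADER)]
    lines += ["\t".join(row[col] for col in HEADER) for row in rows]
    return "\n".join(lines) + "\n"
```

To use it, one would read the input with an explicit encoding (for
example open(path, encoding="utf-8"), assuming the file is UTF-8), pass
the text to html_to_rows, and write the result of rows_to_tsv to the
output file with the same encoding.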


