Help Parsing an HTML File
egonslokar at gmail.com
Fri Feb 15 16:28:51 EST 2008
Hello Python Community,
It'd be great if someone could provide guidance or sample code for
accomplishing the following:
I have a single Unicode file that contains descriptions of hundreds of
objects. The file closely resembles the HTML-EXAMPLE pasted below.
I need to parse the file to extract the data from the HTML and produce
a tab-separated file that looks like the OUTPUT-FILE below.
Any tips, advice, and guidance are greatly appreciated.
Thanks,
Egon
=====OUTPUT-FILE=====
/please note that the first line of the file contains column headers/
------Tab Separated Output File Begin------
H1 H2 DIV Segment1 Segment2 Segment3
RoséH1-1 RoséH2-1 RoséDIV-1 RoséSegmentDIV1-1 RoséSegmentDIV2-1 RoséSegmentDIV3-1
PinkH1-2 PinkH2-2 PinkDIV2-2 PinkSegmentDIV1-2 No-Value No-Value
BlackH1-3 BlackH2-3 BlackDIV2-3 BlackSegmentDIV1-3 No-Value No-Value
YellowH1-4 YellowH2-4 YellowDIV2-4 YellowSegmentDIV1-4 YellowSegmentDIV2-4 No-Value
------Tab Separated Output File End------
=====HTML-EXAMPLE=====
------HTML Example Begin------
<html>
<h1>RoséH1-1</h1>
<h2>RoséH2-1</h2>
<div>RoséDIV-1</div>
<div "segment1">RoséSegmentDIV1-1</div><br>
<div "segment2">RoséSegmentDIV2-1</div><br>
<div "segment3">RoséSegmentDIV3-1</div><br>
<br>
<br>
<h1>PinkH1-2</h1>
<h2>PinkH2-2</h2>
<div>PinkDIV2-2</div>
<div "segment1">PinkSegmentDIV1-2</div><br>
<br>
<comment></comment>
<h1>BlackH1-3</h1>
<h2>BlackH2-3</h2>
<div>BlackDIV2-3</div>
<div "segment1">BlackSegmentDIV1-3</div><br>
<h1>YellowH1-4</h1>
<h2>YellowH2-4</h2>
<div>YellowDIV2-4</div>
<div "segment1">YellowSegmentDIV1-4</div><br>
<div "segment2">YellowSegmentDIV2-4</div><br>
</html>
------HTML Example End------
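
One possible approach to the conversion described above, offered as a rough sketch rather than a definitive solution. It assumes Python 3, that the file is UTF-8 encoded, that every record starts with an <h1>, and that the segment divs always look like <div "segmentN"> as in the example. The names html_to_rows, to_tsv and convert_file are made up for illustration. Because the segment "attribute" is not valid HTML, the sketch uses a regular expression instead of a real HTML parser; a messier real-world file may need a tolerant parser instead.

```python
import re

# Matches <h1>, <h2>, plain <div>, and the non-standard <div "segmentN">
# form used in the example. Group 2 holds N when a segment is present.
TAG_RE = re.compile(
    r'<(h1|h2|div)(?:\s+"?segment(\d+)"?)?\s*>(.*?)</\1>',
    re.IGNORECASE | re.DOTALL,
)

COLUMNS = ["H1", "H2", "DIV", "Segment1", "Segment2", "Segment3"]


def html_to_rows(html, missing="No-Value"):
    """Return one list of column values per <h1> record found in html."""
    records = []
    current = None
    for tag, segment, text in TAG_RE.findall(html):
        tag = tag.lower()
        text = text.strip()
        if tag == "h1":
            # Assumption: each <h1> starts a new record.
            current = {"H1": text}
            records.append(current)
        elif current is None:
            continue  # ignore anything before the first <h1>
        elif tag == "h2":
            current["H2"] = text
        elif segment:
            current["Segment" + segment] = text
        else:
            current["DIV"] = text
    # Fill in the placeholder for any column a record does not have.
    return [[rec.get(col, missing) for col in COLUMNS] for rec in records]


def to_tsv(rows):
    """Join the header and the rows into tab-separated text."""
    lines = ["\t".join(COLUMNS)]
    lines += ["\t".join(row) for row in rows]
    return "\n".join(lines) + "\n"


def convert_file(src, dst, encoding="utf-8"):
    """Read src, write the tab-separated result to dst (encoding is assumed)."""
    with open(src, encoding=encoding) as f:
        rows = html_to_rows(f.read())
    with open(dst, "w", encoding=encoding) as f:
        f.write(to_tsv(rows))
```

Something like convert_file("objects.html", "objects.tsv") would then produce the output file (both file names are placeholders). Segments numbered above 3 are silently dropped here; extending COLUMNS would be one way to keep them.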