Tips Re Pattern Matching / REGEX

egonslokar at gmail.com
Thu Mar 27 17:02:53 EDT 2008


Hello Python Community,

I have a large text file (about 1 GB) with a structure similar to the
HTML example below.

I have to extract content (the text between the div and tr tags) from
this file and put it into a spreadsheet or a database. Given my limited
Python knowledge, I was going to try to do this with regex pattern
matching.

Would someone be able to provide pointers on how to approach this?
Any code samples would be greatly appreciated.

Thanks.

Sam



<html>

\\ there are hundreds of thousands of items

\\Item1

<div class="ItemHead">123</div>
....
<div class="special">Text1: What do I do with these lines
That span several rows? </div>
...
<tr tag="ItemFoot">Foot</tr>

\\Item2

<div class="ItemHead">First Line Can go here
But the second line can go here</div>
...
<tr tag="ItemFoot">Foot
Can span
Over several <b>pages</b></tr>


\\Item3

<div class="ItemHead">First Line Can go here
But the second line can go here</div>
...
<div class="special">This can
Span several rows</div>

</html>
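
For reference, here is a minimal sketch of one way the extraction
described above might be approached with regex matching, using only the
Python 3 standard library. It is not a definitive solution. The names
iter_fields, iter_items, write_csv and chunk_size are made up for
illustration, and the sketch assumes that the file is UTF-8 text, that
every item starts with an ItemHead div, that these elements are never
nested inside one another, and that each item has at most one "special"
div. It reads the file in chunks so the whole 1 GB never has to sit in
memory at once. For HTML less regular than the sample, a real HTML
parser would probably be more robust than a regex.

```python
import csv
import re

# Matches the three element kinds shown in the sample.
# re.DOTALL lets "." cross line breaks, so multi-line content is captured.
FIELD_RE = re.compile(
    r'<div class="(ItemHead|special)">(.*?)</div>'
    r'|<tr tag="(ItemFoot)">(.*?)</tr>',
    re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def clean(text):
    """Drop inner tags such as <b> and collapse whitespace to single spaces."""
    return " ".join(TAG_RE.sub("", text).split())


def iter_fields(stream, chunk_size=1 << 20):
    """Yield (kind, text) pairs while reading the stream chunk by chunk."""
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        end = 0
        for m in FIELD_RE.finditer(buf):
            kind = m.group(1) or m.group(3)
            text = m.group(2) or m.group(4) or ""
            yield kind, clean(text)
            end = m.end()
        # Keep the unmatched tail: it may hold an element that is
        # completed by the next chunk.
        buf = buf[end:]


def iter_items(fields):
    """Group fields into one dict per item; an ItemHead starts a new item."""
    item = None
    for kind, text in fields:
        if kind == "ItemHead":
            if item is not None:
                yield item
            item = {"ItemHead": text, "special": "", "ItemFoot": ""}
        elif item is not None:
            # Assumption: one "special" and one foot per item;
            # a repeated kind overwrites the earlier value.
            item[kind] = text
    if item is not None:
        yield item


def write_csv(in_path, out_path):
    """Write one CSV row per item; the CSV opens in any spreadsheet program."""
    with open(in_path, encoding="utf-8", errors="replace") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(
            dst, fieldnames=["ItemHead", "special", "ItemFoot"])
        writer.writeheader()
        for item in iter_items(iter_fields(src)):
            writer.writerow(item)
```

With hypothetical file names, it would be called as
write_csv("items.html", "items.csv"). Items that lack a "special" div or
a foot, like Item2 and Item3 in the sample, simply get an empty cell in
that column.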




