Tips Re Pattern Matching / REGEX

Thu Mar 27 17:13:16 EDT 2008

Hello,

> I have a large text file (1GB or so) with structure similar to the
> html example below.
>
> I have to extract content (text between div and tr tags) from this
> file and put it into a spreadsheet or a database - given my limited
> python knowledge I was going to try to do this with regex pattern
> matching.
>
> Would someone be able to provide pointers regarding how do I approach
> this? Any code samples would be greatly appreciated.
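
The ultimate tool for handling HTML is BeautifulSoup
(http://www.crummy.com/software/BeautifulSoup/), where you can do
stuff like:

    from BeautifulSoup import BeautifulSoup  # 3.x; with bs4: from bs4 import BeautifulSoup

    soup = BeautifulSoup(html)  # html: the markup as a string
    for div in soup("div", {"class" : "special"}):
        ...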

Not sure how fast it is though, and it builds the whole document tree
in memory, which matters for a 1GB file.

There is also the htmllib module that comes with Python (Python 2 only;
in Python 3 the counterpart is html.parser); it might do the work as
well and maybe a bit faster.
If the file is well-formed XML (XHTML, say) and you need some speed,
have a look at xml.sax.
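
Here is a rough sketch of the xml.sax route, assuming the input is
well-formed XML (plain tag-soup HTML will make the parser raise
xml.sax.SAXParseException). The class name TextCollector, the emit
callback and the sample markup are made up for illustration; only the
xml.sax calls come from the standard library.

```python
import io
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler: hand the text inside <div>/<tr> to `emit`."""

    def __init__(self, emit, wanted=("div", "tr")):
        super().__init__()
        self.emit = emit            # called with (tag, text) per element
        self.wanted = set(wanted)
        self.depth = 0              # nesting level inside wanted elements
        self.current = None         # tag name of the outermost wanted element
        self.parts = []             # text chunks seen so far

    def startElement(self, name, attrs):
        if name in self.wanted:
            if self.depth == 0:
                self.current = name
                self.parts = []
            self.depth += 1

    def characters(self, content):
        # The parser may deliver one text run in several chunks,
        # so just collect them and join later.
        if self.depth:
            self.parts.append(content)

    def endElement(self, name):
        if name in self.wanted:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self.parts).split())
                self.emit((self.current, text))


# Made-up sample standing in for the real file.
sample = b"""<html><body>
<div class="special">Hello <b>world</b></div>
<table>
<tr><td>a</td> <td>b</td></tr>
</table>
</body></html>"""

rows = []
# xml.sax.parse reads the stream incrementally instead of building a tree.
xml.sax.parse(io.BytesIO(sample), TextCollector(rows.append))
print(rows)
```

For the real file, pass the file name (or an open file) to
xml.sax.parse, and use something like csv.writer(out).writerow as emit
so that rows go straight to disk instead of piling up in a list.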

HTH,
--
Miki <miki.tebeka at gmail.com>
http://pythonwise.blogspot.com