[Tutor] Parsing HTML file

Daniel Ehrenberg littledanehren at yahoo.com
Thu Dec 11 18:11:09 EST 2003


Chris Heisel wrote:
> Hi,
> 
> I'm working on a Python script that will go through
> a series of 
> directories and parse some HTML files.
> 
> I'd like to be able to read the HTML and extract
> certain components and 
> put them into a MySQL database.
> 
> For instance, in these files there will be a
> document title like this:
> <h2 class="header">This is the documents header</h2>
> 
> There would be content marked like this:
> <!--START CONTENT-->
> <p>Some content</p>
> <p>Some more content</p>
> <h4>A sub head</h4>
> <p>Again</p>
> <!--END CONTENT-->
> 
> I'm wondering what the best way to approach this
> problem is?
> 
> I was reading up on htmllib and HTMLParser. Should I
> use them or do some regexp searches of the files for
> "<h2 class="header">*</h2>"?
> 
> If I should use htmllib and HTMLParser any
> suggestions on their use?
> 
> I gather that I can set event handlers for, say, an
> <h2> tag, but can I set event handlers for classes,
> like <h2 class="header">, or for blocks of comments
> like <!--START CONTENT--> and <!--END CONTENT-->?
> 
> In a perfect world I would have gotten all this
> data in an XML format, which would make my life
> easier, but the files are already there in HTML
> and I've got to figure out how to extract some of
> the semantic content and stuff it into a MySQL DB...
> 
> Many, many thanks in advance for your help,
> 
> Chris

I think htmllib or HTMLParser would be overkill. They
both go through the whole document and pass every tag
to user-defined handler functions, so the program does
a lot of work on tags you don't care about. Since you
only need two specific pieces of each file, it should
be faster to just search for them with regexps. Use
the re module, not the older regex module.
The regexp you gave wouldn't do what you want: the *
applies to the ">" before it, so it matches zero or
more ">" characters rather than arbitrary text. Here's
what it should have been:
(?s)<h2 class="header">(.*?)</h2>
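The (?s) flag lets "." match newlines, and the non-greedy
.*? stops at the first closing tag instead of the last one.
Here is a minimal sketch of how that pattern might be used
on the sample from the question. The function name extract
and the second pattern for the START/END CONTENT comments
are assumptions for illustration, not something taken from
the original files, and the database step is left out:

```python
import re

# Pattern from above: (?s) makes "." match newlines; ".*?" is non-greedy.
HEADER_RE = re.compile(r'(?s)<h2 class="header">(.*?)</h2>')

# Assumed pattern (hypothetical) for the comment-delimited content block.
CONTENT_RE = re.compile(r'(?s)<!--START CONTENT-->(.*?)<!--END CONTENT-->')


def extract(html):
    """Return (title, content) for one document; None where a part is missing."""
    header = HEADER_RE.search(html)
    content = CONTENT_RE.search(html)
    title = header.group(1).strip() if header else None
    body = content.group(1).strip() if content else None
    return title, body


# Sample markup copied from the question.
sample = """<h2 class="header">This is the documents header</h2>
<!--START CONTENT-->
<p>Some content</p>
<p>Some more content</p>
<h4>A sub head</h4>
<p>Again</p>
<!--END CONTENT-->"""

title, body = extract(sample)
print(title)
print(body)
```

This only works while the markup is as regular as in the
sample. If the attribute order, quoting or spacing varies
between files, the patterns would need loosening, or a real
parser might be the safer choice after all.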

Daniel Ehrenberg
