[Tutor] Extracting data from HTML files

motorolaguy@gmx.net motorolaguy at gmx.net
Wed Dec 28 22:26:59 CET 2005


Hello,
I'm very new to Python and to programming in general. I've been reading Dive
Into Python as an introduction to the language and I think I'm doing pretty
well, but I'm stuck on this problem.
I'm trying to write a Python script that extracts certain data from HTML
files. These files are generated from a template, so they all have the same
formatting. I just want to extract the data from certain fields. It would also
be nice to insert it into a MySQL database, but I'll leave that for later
since I'm stuck on just reading the files.
Say, for example, the HTML file has the following format:

<strong>Category:</strong>Category1<br><br>
[...]
<strong>Name:</strong>Filename.exe<br><br>
[...]
<strong>Description:</strong>Description1.<br><br>

Keeping in mind that each HTML file has a lot of other markup in place of each
[...], what I want to do is extract the value of each field. In this case I
want the script to read Category1, Filename.exe and Description1. Later on I
want to insert these into a MySQL database, or generate a CSV file from them
to make the database insertion easier.
Since all the files are generated by a script, each field I want to read is,
from what I've seen, on the same line number in every file, which could make
things easier. But the fields are not all the same length.
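
For illustration, here is a minimal sketch of the extraction described above,
using a regular expression and the standard csv module. It assumes every field
looks exactly like the sample lines (a <strong> label followed by the value
and a <br>); the names FIELD_RE, extract_fields and sample are made up for
this example, and the CSV is written to an in-memory buffer rather than a real
file. Because the pattern stops at the first <br>, the differing field lengths
do not matter.

```python
import csv
import io
import re

# Matches lines such as: <strong>Name:</strong>Filename.exe<br><br>
# Group 1 is the field label, group 2 is the value up to the first <br>.
FIELD_RE = re.compile(r"<strong>(Category|Name|Description):</strong>(.*?)<br>")


def extract_fields(html):
    """Return a dict mapping each field label to its value."""
    return {label: value.strip() for label, value in FIELD_RE.findall(html)}


# Stand-in for one template-generated file (hypothetical content).
sample = """<strong>Category:</strong>Category1<br><br>
<p>unrelated markup</p>
<strong>Name:</strong>Filename.exe<br><br>
<p>more unrelated markup</p>
<strong>Description:</strong>Description1.<br><br>
"""

fields = extract_fields(sample)

# Write a header plus one CSV row per file; a real script would open a file
# on disk instead of using StringIO.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Category", "Name", "Description"])
writer.writeheader()
writer.writerow(fields)
csv_text = buffer.getvalue()
```

For real files, the same function could be applied to the text read from each
file, with one writerow call per file.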
I've read Chapter 8 of Dive Into Python, so I'm basing my work on that.
I also thought regexes might be useful for this, but I'm not good at using
regexes, so that's another problem.
Do any of you have an idea of where I could get a good start on this, and
are there any modules (like sgmllib.py) that might come in handy?
Thanks!
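
As a rough sketch of the parser-module approach asked about above: sgmllib was
removed in Python 3, so this example swaps in html.parser.HTMLParser from the
standard library, which works in a similar callback style. The class name
FieldParser and the sample text are hypothetical, and the sketch assumes each
label sits inside <strong> and ends with a colon, with its value being the
first non-empty text that follows.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collect the text that follows each <strong>Label:</strong> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_strong = False
        self._pending_label = None

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        if self._in_strong and text.endswith(":"):
            # Text inside <strong> that ends with a colon is a field label.
            self._pending_label = text[:-1]
        elif self._pending_label is not None:
            # First non-empty text after a label is that field's value.
            self.fields[self._pending_label] = text
            self._pending_label = None


# Stand-in for one template-generated file (hypothetical content).
sample = """<strong>Category:</strong>Category1<br><br>
<p>unrelated markup</p>
<strong>Name:</strong>Filename.exe<br><br>
<p>more unrelated markup</p>
<strong>Description:</strong>Description1.<br><br>
"""

parser = FieldParser()
parser.feed(sample)
parser.close()
```

After parsing, parser.fields holds one entry per label found, so it does not
depend on the fields being on fixed line numbers.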


