[Tutor] Extracting data from HTML files

motorolaguy at gmx.net
Thu Dec 29 08:46:58 CET 2005


Hello,
I was taking a look at BeautifulSoup as recommended by Bob, and from what I
can tell it's just what I'm looking for, but it's a bit beyond my current
skills with Python, I'm afraid. I'll still keep playing with it and see what I
can come up with.
I'll also take a look at regexes, as recommended by Kent Johnson, to see if
they'll work here. My guess is this is the way to go, since the data I need is
always on the same line number in the HTML source. So I could just go to the
specific line numbers, look for my data and strip out the unnecessary tags.
Thanks for the help, guys. If anyone's got more tips, they are more than
welcome :)
Thanks again and happy holidays!
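
A minimal sketch of the line-number idea described above, in case it helps anyone reading along. The sample HTML, the line numbers in FIELD_LINES and the helper name extract_by_line are all assumptions made up for illustration; a real template would need its own line numbers, and the file would be read from disk rather than from a string.

```python
import re

# Hypothetical sample standing in for one template-generated HTML file.
SAMPLE_HTML = """<html>
<body>
<strong>Category:</strong>Category1<br><br>
<p>other markup</p>
<strong>Name:</strong>Filename.exe<br><br>
<p>other markup</p>
<strong>Description:</strong>Description1.<br><br>
</body>
</html>"""

# Assumed 1-based line numbers of each field in the template.
FIELD_LINES = {"Category": 3, "Name": 5, "Description": 7}

# Matches any single HTML tag such as <strong> or <br>.
TAG_RE = re.compile(r"<[^>]+>")


def extract_by_line(html, field_lines):
    """Pick fixed lines out of the source and strip the tags from them."""
    lines = html.splitlines()
    result = {}
    for field, lineno in field_lines.items():
        text = TAG_RE.sub("", lines[lineno - 1]).strip()
        # Drop the "Label:" prefix that remains once the tags are gone.
        prefix = field + ":"
        if text.startswith(prefix):
            text = text[len(prefix):].strip()
        result[field] = text
    return result


fields = extract_by_line(SAMPLE_HTML, FIELD_LINES)
print(fields)
```

This only works while every file keeps exactly the same line layout; a single extra line in the template would shift all the numbers.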


> --- Original Message ---
> From: Kent Johnson <kent37 at tds.net>
> To: Python Tutor <tutor at python.org>
> Subject: Re: [Tutor] Extracting data from HTML files
> Date: Wed, 28 Dec 2005 22:16:47 -0500
> 
> motorolaguy at gmx.net wrote:
> > I'm trying to make a Python script for extracting certain data from HTML
> > files. These files are from a template, so they all have the same
> > formatting. I just want to extract the data from certain fields. It would
> > also be nice to insert it into a MySQL database, but I'll leave that for
> > later since I'm stuck on just reading the files.
> > Say for example the HTML file has the following format:
> > 
> > <strong>Category:</strong>Category1<br><br>
> > [...]
> > <strong>Name:</strong>Filename.exe<br><br>
> > [...]
> > <strong>Description:</strong>Description1.<br><br>
> 
> 
> Since your data is all in the same form, I think a regex will easily 
> find this data. Something like
> 
> import re
> catRe = re.compile(r'<strong>Category:</strong>(.*?)<br><br>')
> data = ...read the HTML file here
> m = catRe.search(data)
> category = m.group(1)
> 
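
The same regex idea can probably be stretched to pick up every labelled field in one pass. The following is only a sketch under assumptions: the sample string, the FIELD_RE pattern and the extract_fields helper are hypothetical names, and it assumes every field follows the `<strong>Label:</strong>value<br><br>` shape shown in the example.

```python
import re

# Hypothetical sample in the format quoted above.
SAMPLE_HTML = (
    "<strong>Category:</strong>Category1<br><br>\n"
    "<p>unrelated markup</p>\n"
    "<strong>Name:</strong>Filename.exe<br><br>\n"
    "<strong>Description:</strong>Description1.<br><br>\n"
)

# Group 1 captures the label, group 2 the value up to the first <br><br>.
# re.DOTALL lets a value span several lines if the template wraps it.
FIELD_RE = re.compile(r"<strong>(\w+):</strong>(.*?)<br><br>", re.DOTALL)


def extract_fields(html):
    """Return a dict mapping each label found in the HTML to its value."""
    return {label: value.strip() for label, value in FIELD_RE.findall(html)}


fields = extract_fields(SAMPLE_HTML)
# .get() returns None for a missing field instead of raising,
# which avoids calling .group() on a failed search.
print(fields.get("Category"))
```

Unlike the single search() call, a field that is absent from a file simply does not appear in the dict, so there is no None match object to trip over.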
> > I also thought regexes might be useful for this, but I suck at using
> > regexes, so that's another problem.
> 
> Regexes take some effort to learn, but it is worth it: they are a very
> useful tool in many contexts, not just Python. Have you read the regex
> HOW-TO?
> http://www.amk.ca/python/howto/regex/
> 
> Kent
