Extracting data from HTML

Hazel lailian98 at hotmail.com
Sat Jun 1 06:01:48 EDT 2002


Geoff Gerrietts <geoff at gerrietts.net> wrote in message news:<mailman.1022908262.30070.python-list at python.org>...
> Quoting Ian Bicking (ianb at colorstudy.com):
> > On Fri, 2002-05-31 at 14:52, Hazel wrote:
> > > how do i write a program that
> > > will extract info from an HTML and print
> > > of a list of TV programmes, its Time, and Duration
> > > using urllib?
> > 
> > You can get the page with urllib.  You can use htmllib to parse it, but
> > I often find that regular expressions (the re module) are an easier way
> > -- since you aren't looking for specific markup, but specific
> > expressions.  You'll get lots of false negatives (and positives), but
> > when you are parsing a page that isn't meant to be parsed (like most web
> > pages) no technique is perfect.
> 
> Definitely agree with this sentiment.
> 
> I'll go a step farther, and do a little compare/contrast.
> 
> Once upon a time, I wanted to grab data from the
> weatherunderground.com website. I know there are lots of better ways
> to go about getting this information, these days, but I was not so
> well-informed back then.
> 
> So I wanted to grab this information, and I tried using regular
> expressions to mangle the page. But truthfully, it was just too hard
> to do. I could guess about where in the file the table with all the
> info would appear, but getting a regular expression that was inclusive
> enough to catch all the quirks, yet exclusionary enough to filter out
> all the other embedded tables, proved a very large challenge.
> 
> That's when the idea of a parser made a lot of sense.
> 
> I could push the whole page through a parser, looking for one
> particular phrase in a <TH> element, and from that point forward, map
> <TH> elements to <TD> elements effectively. It became a very simple
> exercise, because I knew how to find that info.
> 
> But as Ian rightly points out, htmllib and a real parser can be very
> heavy if you're just looking to grab unformatted info -- or if you
> can't rely on the formatting to be reliable.
> 
> Both techniques are worth knowing -- but better than either would be
> finding a way to get the information you're after via XML-RPC or some
> other protocol that's designed to carry data rather than rendering
> instructions.
> 
> Best of luck,
> --G.
Dear Geoff, Ian,
I'm relying on sgmllib to do the work,
since htmllib requires heavy coding.
Here is an instance of what I want to extract:
the time of the TV programme >> 12:15:00 AM

"<TR>
   <TD align=right bgColor=#000033><FONT color=#ffffff 
   face="verdana, arial, helvetica" size=1>12:15:00 
   AM</FONT></TD>"
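
For readers trying this today, here is a rough sketch of the parser approach applied to the fragment above. It is only an illustration under some assumptions: sgmllib and htmllib exist only in Python 2, so the sketch uses html.parser from the Python 3 standard library instead, and the names CellTextParser and extract_cells are made up for this example. Fetching the page (for instance with urllib.request.urlopen) is left out because the page URL is not given here.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td> cell, with whitespace normalised.

    Hypothetical helper for illustration; not part of any library.
    """

    def __init__(self):
        super().__init__()
        self.cells = []     # finished cell texts, in document order
        self._parts = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TD> arrives as "td".
        if tag == "td":
            self._parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            # Collapse the line break between "12:15:00" and "AM".
            self.cells.append(" ".join("".join(self._parts).split()))
            self._parts = None


# The fragment quoted in the message above.
SAMPLE = """<TR>
   <TD align=right bgColor=#000033><FONT color=#ffffff
   face="verdana, arial, helvetica" size=1>12:15:00
   AM</FONT></TD>"""


def extract_cells(html):
    """Return the text of each table cell found in html."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    return parser.cells


print(extract_cells(SAMPLE))
```

On a full listings page this would return every cell, so the programme name, time, and duration would still have to be picked out by position or by matching the time format; that part depends on the real page layout, which is not shown here.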

So what do you think?

-Thanks
Hazel


