[Tutor] (no subject)

Daniel Yoo dyoo@hkn.eecs.berkeley.edu
Thu, 29 Mar 2001 03:22:38 -0800 (PST)


On Wed, 28 Mar 2001, wong chow cheok wrote:

> thanks a lot daniel. i think i am doing it wrong. i still have no idea how 
> to extract the information. i looked over the tutorial but i can't find much 
> that can help. still new to this language. if you have any ideas i would 
> really appreciate it. i tried using:
> 
> params = urllib.urllencode
> 
> but it seems that the parameters do not have any significance. i am still 
> learning as i go. very sorry if i cannot give you an answer yet but i will 
> get back to you if i find one.

That's ok; take it step by step.

You might find the regular expression functions useful.  Regular
expressions are tools that let us search text and pull out the portions
we're interested in.  For example, we can use a regular expression to pull
titles out of html fairly easily.

Let's say that a "title" begins with a '<title>' and ends with a
'</title>'.  If we're given a whole web page, we can use regular
expressions to encode this idea:

###
import re
title_re = re.compile('<title>.*</title>')
###

This says to make a "regular expression" that knows that a title is made
up of a beginning title tag "<title>", a bunch of other characters ".*",
and an ending tag "</title>".  With this information encoded, we can call
the regular expression's search() method on any string to look for this
pattern.
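
One thing worth trying for yourself: ".*" is greedy, so it grabs as much
text as it can, while ".*?" stops at the first place the rest of the
pattern can match.  Here's a small sketch of the difference.  The sample
string is made up just for this illustration, and the parentheses in the
second pattern mark the part we want to pull out with group(1):

```python
import re

# Made-up sample text with two title tags, just for illustration.
sample = '<title>First</title> some text <title>Second</title>'

# Greedy: ".*" runs all the way to the LAST '</title>' in the string.
greedy = re.compile('<title>.*</title>')

# Non-greedy: ".*?" stops at the FIRST '</title>'.  The parentheses
# capture the text between the tags so group(1) can return it.
lazy = re.compile('<title>(.*?)</title>')

greedy_match = greedy.search(sample).group(0)   # the whole matched text
lazy_match = lazy.search(sample).group(1)       # only the captured part

print(greedy_match)
print(lazy_match)
```

The greedy pattern should match the entire sample string, while the
non-greedy one should give back just 'First'.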


Here's a program that tries to grab the title out of any web page we give
it:

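###
import re
import urllib
import sys

def getTitle(url):
    # Non-greedy (.*?) stops at the first closing tag; IGNORECASE also
    # matches <TITLE>, and DOTALL lets '.' match across newlines.
    title_re = re.compile('<title>(.*?)</title>',
                          re.IGNORECASE | re.DOTALL)
    html = urllib.urlopen(url).read()
    result = title_re.search(html)
    if result:
        # group(1) is the text captured by the parentheses.
        return result.group(1)
    # No title found: we fall off the end and return None.

if __name__ == '__main__':
    print getTitle(sys.argv[1])
###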

There's some extra stuff in here that isn't strictly necessary, but I
wanted to show a slightly realistic example that you'll be able to play
around with.  It sounds like your project will become easier if you use
regular expressions, so I'd recommend experimenting with them.
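
If you'd like to experiment without fetching anything from the network,
one option is to run the same pattern on a string you type in yourself.
This is only a sketch: extract_title() and the sample page are names and
data I've made up for the example, and the strip() call is an extra step
that trims the whitespace around the title.

```python
import re

# Same pattern as in getTitle(): non-greedy capture, case-insensitive,
# with '.' allowed to match newlines.
TITLE_RE = re.compile('<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)

def extract_title(html):
    """Return the text between the first pair of title tags, or None."""
    result = TITLE_RE.search(html)
    if result:
        # strip() removes the whitespace and newlines around the title.
        return result.group(1).strip()
    return None

# Made-up page: upper-case tags, with the title split across lines.
page = '<html><head><TITLE>\n  My Home Page\n</TITLE></head><body></body></html>'

print(extract_title(page))
print(extract_title('<html><body>no title here</body></html>'))
```

The first call should find the title even though the tags are in upper
case and the text spans several lines; the second has no title tag, so it
returns None.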

It has to be said, though, that regular expressions are weird at first.  
You'll want to take a look at a few introductions that talk about regular
expressions.  Here are a few references:

    http://python.org/doc/current/lib/module-re.html
    http://py-howto.sourceforge.net/regex/regex.html

It doesn't hurt to look at Perl documentation on regular expressions
either, since the idea (and most of the syntax!) is the same:

    http://language.perl.com/all_about/regexps.html

and the rest of:

    http://www.perl.com/reference/query.cgi?regexp

has lots of good stuff on regular expressions.

If you have more questions, feel free to ask us on tutor@python.org.  
Good luck to you.