[Tutor] Regular Expressions

Karl Pflästerer sigurd at 12move.de
Tue Mar 2 11:41:44 EST 2004


On  2 Mar 2004, Karl Pflästerer <- sigurd at 12move.de wrote:

[...]
> So you could write your regexp as:

> reg = re.compile(r'href\s*=\s*(?P<url>"[^"]*"|\S*(?:\s|$))')

I made a mistake here.

reg = re.compile(r'href\s*=\s*"?(?P<url>[^"\s]*)', re.I)

should be better.
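
To see the difference between the two regexps, here is a small sketch
(assuming Python 3; the sample markup is made up for illustration).  The
first version captures the quotes or the trailing whitespace as part of
the url and is case sensitive; the second one does neither.

```python
import re

# First attempt: the group also captures the quotes (or a trailing blank).
old = re.compile(r'href\s*=\s*(?P<url>"[^"]*"|\S*(?:\s|$))')
# Corrected version: an optional opening quote is matched outside the group,
# and re.I makes HREF= match as well.
new = re.compile(r'href\s*=\s*"?(?P<url>[^"\s]*)', re.I)

# Made-up sample: one quoted, one unquoted and one upper-case attribute.
sample = ('<a href="foo.html">x</a> '
          '<a href=bar.html >y</a> '
          '<A HREF="baz.html">z</A>')

old_urls = old.findall(sample)
new_urls = new.findall(sample)

print(old_urls)   # ['"foo.html"', 'bar.html ']
print(new_urls)   # ['foo.html', 'bar.html', 'baz.html']
```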

Reading the whole file with open(filename).readlines(), joining the
result with join() and running findall() on it is probably the fastest
solution (unless you have really big files or scarce memory).  That way
you might also find strings that look like an href but weren't written
as one; e.g. if someone wrote: <p> An href is written like href="foo"
</p> you would find that as well.  For that a parser is better.  The
disadvantage of a parser might be that it doesn't play well with invalid
documents.
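
A rough sketch of that trade-off, assuming Python 3 and its standard
html.parser module.  The text below stands in for the joined file
contents, and HrefCollector is a hypothetical helper name, not something
from the original question.  The plain regexp also reports the href that
only appears in the running text; the parser reports just the one that
belongs to an <a> tag.

```python
import re
from html.parser import HTMLParser

HREF = re.compile(r'href\s*=\s*"?(?P<url>[^"\s]*)', re.I)

# Stand-in for ''.join(open(filename).readlines()); the markup is made up.
text = (
    '<html><body>\n'
    '<p>An href is written like href="foo"</p>\n'
    '<a\n'
    '   href="real.html">a real link</a>\n'
    '</body></html>\n'
)

# The regexp looks at the raw text, so it cannot tell markup from prose.
regex_urls = HREF.findall(text)


class HrefCollector(HTMLParser):
    """Collect the href values of <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.urls.append(value)


collector = HrefCollector()
collector.feed(text)
collector.close()
parser_urls = collector.urls

print(regex_urls)    # ['foo', 'real.html']
print(parser_urls)   # ['real.html']
```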

As I found that interesting I played a bit and wrote a solution which is
a bit safer than a plain regexp, although it is no real parser; so it
might also find hrefs which are not real ones.  That is very unlikely,
though, since it could only happen in <script> or <style> sections, or
perhaps in text after the </html> tag, or maybe in comments (I'm not
sure at the moment).


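import re

def href_vals(itr):
    # Each state maps to the regexp tried in that state.  The name of the
    # single named group in a regexp is the state entered after a match.
    actions = {
        'start' : re.compile(r'.*?<\s*a(?P<href1>)?'),
        'href1' : re.compile(r'[^>]*?href\s*(?P<href2>)'),
        'href2' : re.compile(r'=\s*"?(?P<href3>)'),
        'href3' : re.compile(r'(?P<start>[^\s"]*)')
        }
    state = 'start'
    for line in itr:
        pos = 0
        while True:
            m = actions[state].match(line, pos)
            if not m: break
            pos = m.end()    # continue where the previous match ended
            # list(...) works with Python 2 lists and Python 3 dict views
            state = list(m.groupdict().keys())[0]
            # back in 'start' means that 'href3' just captured a value
            if state == 'start':
                yield m.groupdict()[state]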


This is a little state machine which searches first for an opening `<a'
tag, then for the href= in it and then yields the text after the href=
as value.

In case that is new to you (you wrote that you are just starting with
Python): Python has something called generators.  They are used in
places where you iterate over objects, like:

for foo in bar: ...

Perhaps you know that from files.  But you can also write your own
generator functions; you recognize them by the keyword `yield'.
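
A minimal generator, just to show the idea (countdown is a made-up
example name, and the sketch assumes Python 3):

```python
def countdown(n):
    """Yield n, n-1, ..., 1 one value at a time."""
    while n > 0:
        yield n    # execution pauses here until the next value is asked for
        n -= 1

# A generator can be used anywhere you would iterate over a list or a file.
for value in countdown(3):
    print(value)

values = list(countdown(3))
print(values)   # [3, 2, 1]
```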

What happens in the above function?

It iterates over an object and tries some regexps in turn on the lines
of the object (let's think of it as a file).

The regexps are values in a dictionary; the keys are the states.

It starts with the regexp which is the value of the state 'start'.  If
no match is found a new line is taken from the file and again that
regexp gets tried.  On every new line it starts looking at position 0.

But what happens if a match succeeds?

The new state is the name of the group in the regexp.  So you can see
that after 'start' the state 'href1' will be tried, then 'href2' and
then 'href3'.  The new start position will be the end of the previous
match.  So we don't scan pieces twice.

If after state 'href3' we had a match, the new state will again be
'start' but now we will have an href.  That is the point where

            if state == 'start':
                yield m.groupdict()[state]

will be true and the generator yields a value.  Then it tries again to
find a match until the file is completely scanned.
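
To make the walk through the states visible, here is a sketch of the
same state machine that also records every transition.  It assumes
Python 3; href_vals_traced, the trace list and the sample lines are made
up for illustration.  Note that the first <a tag is split over two
lines: the state survives the line break, which a per-line regexp could
not handle.

```python
import re

# Same regexps as in the function above; group name = next state.
ACTIONS = {
    'start': re.compile(r'.*?<\s*a(?P<href1>)?'),
    'href1': re.compile(r'[^>]*?href\s*(?P<href2>)'),
    'href2': re.compile(r'=\s*"?(?P<href3>)'),
    'href3': re.compile(r'(?P<start>[^\s"]*)'),
}


def href_vals_traced(lines, trace):
    """Yield href values and append (old_state, new_state) to trace."""
    state = 'start'
    for line in lines:
        pos = 0
        while True:
            m = ACTIONS[state].match(line, pos)
            if not m:
                break            # keep the state, go on with the next line
            pos = m.end()
            new_state = next(iter(m.groupdict()))
            trace.append((state, new_state))
            state = new_state
            if state == 'start':
                yield m.group('start')


# Made-up input; any iterable of lines works, e.g. an open file.
lines = [
    '<p>see <a class="x"\n',
    '   href="one.html">one</a> and <a href="two.html">two</a></p>\n',
]
trace = []
urls = list(href_vals_traced(lines, trace))

print(urls)        # ['one.html', 'two.html']
print(trace[:4])   # the four transitions needed for the first url
```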

To use it you could write:

f = open('test.html')
for url in href_vals(f):
    print(url)
f.close()


You have to decide for yourself whether you really need something like
that or whether a simple regexp is enough; anyway, it was fun to test it.



   Karl
-- 
Please do *not* send copies of replies to me.
I read the list



