regular expression help

Jonathan Gardner jgardn at alumni.washington.edu
Thu Feb 28 03:11:32 EST 2002


Drew Fisher wrote:

> Hello,
> 
> I'm having some issues doing a small but irritating regular expression
> match.
> 
> Here's an example of text that I want to search through:
> 
> str = """
> Location:Home
> 555-1212
> 555-3434
> 555-5656
> 
> Location:Work
> 555-9999
> """
> 
> Just for reference, there are newline characters at the end of each line
> and the phone numbers have a single tab character (\t) in front of them.
> 
> I'm trying to get the location of the record and the associated phone
> number(s) that go along with it.
> 
> I've tried things like:
> 
> import re
> test1 = re.compile ('Location:(\w+)\s+(\d+\-\d+)', re.DOTALL)
> print test1.findall (str)
> [('Home', '555-1212'), ('Work', '555-9999')]
> 
> But this only gets the location and the first phone number in the
> record.  How do I get all of the phone numbers for each location without
> writing a second regular expression?
> 

APPROACH 1:
You'll need to do something fun like this:
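re.compile(r'Location:(\w+)$\s((?:\s\d+-\d+$\s)*)', re.MULTILINE)

Of course, this is a bit strict: each phone line must be exactly one
whitespace character, the number, and a newline, or the match stops there.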

You'll get (with findall):
[('Home', '\t555-1212\n\t555-3434\n\t555-5656\n'),...]

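Take the second element of each pair and use split('\n'), then strip() each
piece. The trailing newline leaves an empty last piece, so drop that one.
Calling split() with no arguments does all of this in one step.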
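
Put together, Approach 1 might look like the sketch below. The names text,
record_re and records are mine, and the sample string assumes the tabs and
newlines are laid out the way the question describes them.

```python
import re

# Sample text from the question; the leading tabs are assumed as described there.
text = (
    "\n"
    "Location:Home\n"
    "\t555-1212\n"
    "\t555-3434\n"
    "\t555-5656\n"
    "\n"
    "Location:Work\n"
    "\t555-9999\n"
)

# Raw string so the backslashes reach the regex engine untouched.
record_re = re.compile(r'Location:(\w+)$\s((?:\s\d+-\d+$\s)*)', re.MULTILINE)

records = {}
for location, block in record_re.findall(text):
    # split() with no arguments splits on any whitespace and drops empty
    # pieces, so it covers both the split('\n') and the strip() steps.
    records[location] = block.split()

print(records)
```

This should give each location mapped to a plain list of its numbers, with
no tabs or newlines left in them.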

APPROACH 2:
Write a basic parser.
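import re

# Raw strings keep the backslashes intact for the regex engine.
loc_re = re.compile(r'Location:(\w+)')
pnum_re = re.compile(r'(\d+-\d+)')

locs = {}        # location name -> list of phone numbers
last_loc = None  # list for the most recently seen location
for i in str.split():  # str is the text from the question
    found = loc_re.findall(i)
    if found:
        locs[found[0]] = []
        last_loc = locs[found[0]]
        continue
    found = pnum_re.findall(i)
    # Test against None: a new, still-empty list is falsy.
    if found and last_loc is not None:
        last_loc.append(found[0])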
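
To try the parser on the sample text, the same loop can be wrapped in a
function. This is only a sketch: parse_locations and sample are names I made
up for illustration, and the sample again assumes tabs before the numbers.

```python
import re

loc_re = re.compile(r'Location:(\w+)')
pnum_re = re.compile(r'(\d+-\d+)')

def parse_locations(text):
    """Map each location name to the list of phone numbers that follow it."""
    locs = {}
    last_loc = None
    # split() breaks the text on any whitespace, so each token is either
    # a "Location:..." header or a phone number.
    for token in text.split():
        found = loc_re.findall(token)
        if found:
            # Start a fresh list for this location and remember it.
            last_loc = locs[found[0]] = []
            continue
        found = pnum_re.findall(token)
        # Compare with None, since a freshly created empty list is falsy.
        if found and last_loc is not None:
            last_loc.append(found[0])
    return locs

sample = (
    "\nLocation:Home\n\t555-1212\n\t555-3434\n\t555-5656\n"
    "\nLocation:Work\n\t555-9999\n"
)
print(parse_locations(sample))
```

Numbers that show up before any Location line are skipped, because there is
no list to put them in yet.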

I think the parser approach is a bit easier to maintain, extend, and 
customize.

Jonathan


