regular expression help
Jonathan Gardner
jgardn at alumni.washington.edu
Thu Feb 28 03:11:32 EST 2002
Drew Fisher wrote:
> Hello,
>
> I'm having some issues doing a small but irritating regular expression
> match.
>
> Here's an example of text that I want to search through:
>
> str = """
> Location:Home
> 555-1212
> 555-3434
> 555-5656
>
> Location:Work
> 555-9999
> """
>
> Just for reference, there are newline characters at the end of each line
> and the phone numbers have a single tab character (\t) in front of them.
>
> I'm trying to get the location of the record and the associated phone
> number(s) that go along with it.
>
> I've tried things like:
>
> import re
> test1 = re.compile ('Location:(\w+)\s+(\d+\-\d+)', re.DOTALL)
> print test1.findall (str)
> [('Home', '555-1212'), ('Work', '555-9999')]
>
> But this only gets the location and the first phone number in the
> record. How do I get all of the phone numbers for each location without
> writing a second regular expression?
>
APPROACH 1:
You'll need to do something fun like this:
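re.compile(r'Location:(\w+)$\s((?:\s\d+-\d+$\s)*)', re.MULTILINE)
Of course, this is a bit strict: it expects exactly one whitespace
character (the tab) before each number and nothing after the number on
its line.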
You'll get (with findall):
[('Home', '\t555-1212\n\t555-3434\n\t555-5656\n'),...]
Take the second element of each pair and split it on '\n', then strip()
each piece. The trailing newline leaves an empty string at the end of the
list, so drop that.
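Here is a minimal sketch of that post-processing in one place. The names
text, block_re and phones_by_location are made up for illustration, and the
sample text assumes the tab-prefixed layout described in the question.

```python
import re

# Sample data, assuming each phone number line starts with a single tab.
text = (
    "Location:Home\n"
    "\t555-1212\n"
    "\t555-3434\n"
    "\t555-5656\n"
    "\n"
    "Location:Work\n"
    "\t555-9999\n"
)

# Group 1 is the location name, group 2 is the whole block of number lines.
block_re = re.compile(r'Location:(\w+)$\s((?:\s\d+-\d+$\s)*)', re.MULTILINE)

def phones_by_location(data):
    """Map each location to a list of its phone numbers."""
    result = {}
    for location, block in block_re.findall(data):
        # Split the captured block into lines, strip the tab from each,
        # and skip the empty piece left by the trailing newline.
        result[location] = [
            line.strip() for line in block.split('\n') if line.strip()
        ]
    return result

print(phones_by_location(text))
```

The filter on line.strip() is what removes the empty last piece, so no
separate cleanup step is needed.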
APPROACH 2:
Write a basic parser.
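import re
loc_re = re.compile(r'Location:(\w+)')
pnum_re = re.compile(r'(\d+-\d+)')
locs = {}
last_loc = None
# split() breaks on any whitespace, so each item is one location or number.
for i in str.split():
    found = loc_re.findall(i)
    if found:
        # New location: start an empty list of numbers for it.
        locs[found[0]] = []
        last_loc = locs[found[0]]
        continue
    found = pnum_re.findall(i)
    # Compare against None: a fresh, empty list is false in a boolean test,
    # so a plain "and last_loc" would never append anything.
    if found and last_loc is not None:
        last_loc.append(found[0])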
I think the parser approach is a bit easier to maintain, extend, and
customize.
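As a rough sketch, the parser can also be wrapped in a function so it is easy
to reuse and test. The names parse_records, LOC_RE and PNUM_RE are made up for
illustration, and this version walks the text line by line with splitlines()
rather than splitting on all whitespace, which is a small change from the loop
above.

```python
import re

LOC_RE = re.compile(r'Location:(\w+)')
PNUM_RE = re.compile(r'(\d+-\d+)')

def parse_records(text):
    """Return a dict mapping each location to its list of phone numbers."""
    locs = {}
    current = None  # list for the most recent location; None before the first
    for line in text.splitlines():
        loc = LOC_RE.search(line)
        if loc:
            # A location line starts a new, empty list of numbers.
            current = []
            locs[loc.group(1)] = current
            continue
        num = PNUM_RE.search(line)
        # Test against None explicitly: an empty list is falsy.
        if num and current is not None:
            current.append(num.group(1))
    return locs

sample = "Location:Home\n\t555-1212\n\t555-3434\n\nLocation:Work\n\t555-9999\n"
print(parse_records(sample))
```

Adding another kind of line, such as a fax number or an address, then only
takes one more pattern and one more branch in the loop.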
Jonathan