Using re to get data from text file

Andrew Dalke adalke at mindspring.com
Fri Sep 10 13:19:04 EDT 2004


Jocknerd wrote:
> How would I set this up to read in the data using Regular expressions?
> 
> I've tried this:
> 
> while True:
>     line = file.readline ()
>     if not line: break
>     game = {}
>     datePattern = re.compile('^(\d{2})\D+(\d{2})\D+(\d{4})')

Regular expressions are tricky.  Luckily, there are plenty
of resources available for learning them.  Here's a suggestion
for how to read your data.

The subtle parts are:
   - I'm using re.X so I can document each of the fields in the re
   - The team name must only contain letters
        [a-zA-Z]+  means "one or more letters" (that is, a word)
        [a-zA-Z]+(\s+[a-zA-Z]+)*  means "one or more words separated
                      by spaces"

I also use the ^  and $ symbols to make sure the match is
complete across the whole line.
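
As a quick check of the team-name piece on its own, here is a minimal
sketch.  The names name_pat and loose_pat and the sample strings are
made up for illustration; the sketch only shows that the sub-pattern
accepts one or more words, and that the ^ and $ anchors reject a string
with anything extra on it.

```python
import re

# Hypothetical standalone version of the team-name sub-pattern,
# anchored at both ends so the whole string has to be a name.
name_pat = re.compile(r"^[a-zA-Z]+(\s+[a-zA-Z]+)*$")

print(bool(name_pat.match("Temple")))               # one word
print(bool(name_pat.match("University of Miami")))  # several words
print(bool(name_pat.match("Oregon State 21")))      # trailing score, no match

# Without the $ anchor, match() is free to stop before the end.
loose_pat = re.compile(r"^[a-zA-Z]+(\s+[a-zA-Z]+)*")
print(loose_pat.match("Oregon State 21").group(0))  # only the leading words
```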

If you have teams with digits in the name (eg, "49ers") then
you'll have to change the definition of 'word' appropriately.
I made it a strict test to make sure there wasn't an accidental
confusion with a score.
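
For example, if names like "49ers" had to be accepted, one possible
change is to allow digits inside a word as long as the word contains
at least one letter.  This is an assumption about what such data would
look like, not something the sample lines need.  A word made only of
digits is still rejected, so a score cannot be taken for part of a name.
The names word and name_pat are hypothetical.

```python
import re

# Hypothetical looser definition of a "word": letters and digits,
# with at least one letter somewhere, so "49ers" is a word but "44" is not.
word = r"[a-zA-Z0-9]*[a-zA-Z][a-zA-Z0-9]*"
name_pat = re.compile(r"^%s(\s+%s)*$" % (word, word))

for candidate in ["49ers", "San Francisco 49ers", "44", "Virginia 44"]:
    print(candidate, "->", bool(name_pat.match(candidate)))
```

The same word definition could be substituted for [a-zA-Z]+ in the
full pattern.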


import re

pat = re.compile(r"""
^\s*                       # allow spaces at the start
(\d\d)/(\d\d)/(\d\d\d\d)   # the month, day, and year

\s+                        # spaces to the first team name
([a-zA-Z]+(\s+[a-zA-Z]+)*) # one or more words separated by spaces
\s+                        # spaces to the first score
(\d+)                      # the score

\s+                        # spaces to the second team name
([a-zA-Z]+(\s+[a-zA-Z]+)*) # one or more words separated by spaces
\s+                        # spaces to the second score
(\d+)                      # the score

\s*$                       # allow spaces, up to the end
""", re.X)
tests = [
"09/04/2004  Virginia              44   Temple             14",
"09/04/2004  LSU                   22   Oregon State       21",
"09/09/2004  Troy State            24   Missouri           14",
"01/02/2003  Florida State        103   University of Miami 2",
]

for test in tests:
    m = pat.match(test)
    if not m:
        raise AssertionError("test failure")
    print("Match results:")
    print(" month", m.group(1), "day", m.group(2), "year", m.group(3))
    # groups 5 and 8 are the inner (\s+[a-zA-Z]+) groups, so skip them
    print(" team #1", m.group(4), "score", m.group(6))
    print(" team #2", m.group(7), "score", m.group(9))


Here's the output:

Match results:
  month 09 day 04 year 2004
  team #1 Virginia score 44
  team #2 Temple score 14
Match results:
  month 09 day 04 year 2004
  team #1 LSU score 22
  team #2 Oregon State score 21
Match results:
  month 09 day 09 year 2004
  team #1 Troy State score 24
  team #2 Missouri score 14
Match results:
  month 01 day 02 year 2003
  team #1 Florida State score 103
  team #2 University of Miami score 2
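
To tie this back to the loop in the original question, here is a sketch
of how each line could be turned into a dictionary.  It uses named
groups, (?P<name>...), and non-capturing groups, (?:...), so the fields
are fetched by name rather than by counting parentheses.  The group
names, the function parse_games, and the choice to skip lines that do
not match are all assumptions made for this example.

```python
import re

# Same layout as the pattern above, with names instead of numbers.
game_pat = re.compile(r"""
^\s*
(?P<month>\d\d)/(?P<day>\d\d)/(?P<year>\d\d\d\d)
\s+ (?P<team1>[a-zA-Z]+(?:\s+[a-zA-Z]+)*)
\s+ (?P<score1>\d+)
\s+ (?P<team2>[a-zA-Z]+(?:\s+[a-zA-Z]+)*)
\s+ (?P<score2>\d+)
\s*$
""", re.X)

def parse_games(lines):
    """Return one dict per line that matches; other lines are skipped."""
    games = []
    for line in lines:
        m = game_pat.match(line)
        if m is None:
            continue
        game = m.groupdict()
        # groupdict() gives strings; convert the scores to integers
        game["score1"] = int(game["score1"])
        game["score2"] = int(game["score2"])
        games.append(game)
    return games

games = parse_games([
    "09/04/2004  LSU                   22   Oregon State       21",
    "not a game line",
])
print(games[0]["team2"], games[0]["score2"])
```

Since iterating over an open file yields its lines, a file object can
be passed in place of the list.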


				Andrew
				dalke at dalkescientific.com


