Using re to get data from text file
Andrew Dalke
adalke at mindspring.com
Fri Sep 10 13:19:04 EDT 2004
Jocknerd wrote:
> How would I set this up to read in the data using Regular expressions?
>
> I've tried this:
>
> while True:
>     line = file.readline ()
>     if not line: break
>     game = {}
>     datePattern = re.compile('^(\d{2})\D+(\d{2})\D+(\d{4})')
Regular expressions are tricky. Luckily, there are plenty
of resources available for learning them. Here's a suggestion
for how to read your data.
The subtle parts are:
- I'm using re.X so I can document each of the fields in the re
- The team name must only contain letters
    [a-zA-Z]+ means "one or more letters" (that is, a word)
    [a-zA-Z]+(\s+[a-zA-Z]+)* means "one or more words separated
    by spaces"
I also use the ^ and $ symbols to make sure the match is
complete across the whole line.
If you have teams with digits in the name (e.g., "49ers") then
you'll have to change the definition of 'word' appropriately.
I made it a strict test to ensure there wasn't an accidental
confusion with a score.
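To make the 'word' definition concrete, here is a small sketch in Python 3 syntax that tries the team-name part on its own. The name strict_name is just the letters-only definition described above; loose_name is a hypothetical variant of my own for names like "49ers", which assumes every word contains at least one letter so that a bare score can never be taken for a word. Treat it as one possible adjustment, not the only way to do it.

```python
import re

# Letters-only definition: one or more words separated by spaces.
strict_name = re.compile(r"^[a-zA-Z]+(\s+[a-zA-Z]+)*$")

# Hypothetical looser definition: a word may start with digits but must
# contain at least one letter, so a bare score such as "44" is rejected.
loose_word = r"\d*[a-zA-Z][a-zA-Z0-9]*"
loose_name = re.compile(r"^%s(\s+%s)*$" % (loose_word, loose_word))

for name in ["Oregon State", "49ers", "San Francisco 49ers", "44"]:
    # Show whether each definition accepts the candidate team name.
    print(name, bool(strict_name.match(name)), bool(loose_name.match(name)))
```

The strict form accepts only "Oregon State"; the looser form also accepts the two "49ers" names while still rejecting "44". The full pattern follows.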
import re
pat = re.compile(r"""
^\s*                        # allow spaces at the start
(\d\d)/(\d\d)/(\d\d\d\d)    # the month, day, and year
\s+                         # spaces to the first team name
([a-zA-Z]+(\s+[a-zA-Z]+)*)  # one or more words separated by spaces
\s+                         # spaces to the first score
(\d+)                       # the first score
\s+                         # spaces to the second team name
([a-zA-Z]+(\s+[a-zA-Z]+)*)  # one or more words separated by spaces
\s+                         # spaces to the second score
(\d+)                       # the second score
\s*$                        # allow spaces, up to the end
""", re.X)
tests = [
    "09/04/2004 Virginia 44 Temple 14",
    "09/04/2004 LSU 22 Oregon State 21",
    "09/09/2004 Troy State 24 Missouri 14",
    "01/02/2003 Florida State 103 University of Miami 2",
]
for test in tests:
    m = pat.match(test)
    if not m:
        raise AssertionError("test failure")
    # Groups 5 and 8 are the inner repeated-word groups, so they are skipped.
    print("Match results:")
    print(" month", m.group(1), "day", m.group(2), "year", m.group(3))
    print(" team #1", m.group(4), "score", m.group(6))
    print(" team #2", m.group(7), "score", m.group(9))
Here's the output:
Match results:
 month 09 day 04 year 2004
 team #1 Virginia score 44
 team #2 Temple score 14
Match results:
 month 09 day 04 year 2004
 team #1 LSU score 22
 team #2 Oregon State score 21
Match results:
 month 09 day 09 year 2004
 team #1 Troy State score 24
 team #2 Missouri score 14
Match results:
 month 01 day 02 year 2003
 team #1 Florida State score 103
 team #2 University of Miami score 2
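Since you wanted each line to end up in a game dictionary, here is a rough sketch in Python 3 syntax of how the same pattern could be wired up for that. The group names (month, team1, score1, and so on), the name game_pat, and the helper parse_games are all my own choices for illustration, and raising ValueError on a bad line is only one possible policy. Named groups also avoid having to count past the inner repeated-word groups.

```python
import re

# Same pattern as above, with named groups; the inner word groups are
# made non-capturing with (?:...) so they do not show up in the result.
game_pat = re.compile(r"""
    ^\s*
    (?P<month>\d\d)/(?P<day>\d\d)/(?P<year>\d\d\d\d)  # the date
    \s+
    (?P<team1>[a-zA-Z]+(?:\s+[a-zA-Z]+)*)             # first team name
    \s+
    (?P<score1>\d+)                                   # first score
    \s+
    (?P<team2>[a-zA-Z]+(?:\s+[a-zA-Z]+)*)             # second team name
    \s+
    (?P<score2>\d+)                                   # second score
    \s*$
""", re.X)

def parse_games(lines):
    """Yield one dict per non-blank line; raise ValueError on a mismatch."""
    for lineno, line in enumerate(lines, 1):
        if not line.strip():
            continue  # skip blank lines
        m = game_pat.match(line)
        if m is None:
            raise ValueError("line %d does not match: %r" % (lineno, line))
        game = m.groupdict()
        # Scores are more useful as integers than as strings.
        game["score1"] = int(game["score1"])
        game["score2"] = int(game["score2"])
        yield game

# A list of strings stands in for the data file here.
sample = [
    "09/04/2004 LSU 22 Oregon State 21\n",
    "\n",
    "09/09/2004 Troy State 24 Missouri 14\n",
]
games = list(parse_games(sample))
print(games[0]["team2"], games[0]["score2"])
```

Because parse_games accepts any iterable of lines, an open file object can be passed in place of the sample list, which replaces the readline loop.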
Andrew
dalke at dalkescientific.com