Using re to get data from text file
William Park
opengeometry at yahoo.ca
Fri Sep 10 10:53:32 EDT 2004
Jocknerd <jocknerd1 at yahoo.com> wrote:
> I'm a Python newbie and I'm having trouble with Regular Expressions when
> reading in a text file. Here is a sample layout of the input file:
>
> 09/04/2004 Virginia 44 Temple 14
> 09/04/2004 LSU 22 Oregon State 21
> 09/09/2004 Troy State 24 Missouri 14
>
> As you can see, the text file contains a list of games. Each game has a
> date, a winning team, the winning team's score, the losing team, and the
> losing team's score. If I set up my program to import the data with fixed
> length formats, it's no problem. But some of my text files have different
> layouts. For instance, some only have one space between a team name and
> their score.
>
> Here's how I read in the file using fixed length fields:
>
> import sys
>
> filename = sys.argv[1]
> file = open(filename, 'r')
>
> schedule = []  # make a list called schedule
>
> while True:
>     line = file.readline()
>     if not line: break
>     game = {}  # make a dictionary called game
>     game['date'] = line[0:10]  # fixed length field
>     game['team1'] = line[12:40].strip()
>     game['score1'] = line[40:42]
>     game['team2'] = line[44:72].strip()
>     game['score2'] = line[72:74]
>     schedule.append(game)
>
> file.close()
>
> Note: I'm stripping whitespace from the team names because I don't want
> the team name to actually be a fixed length.
>
> How would I set this up to read in the data using Regular expressions?
>
> I've tried this:
>
> while True:
>     line = file.readline()
>     if not line: break
>     game = {}
>     datePattern = re.compile(r'^(\d{2})\D+(\d{2})\D+(\d{4})')
>
> Here's where I get stuck. What do I do from here? I just don't know how
> to import the text and assign it to the proper fields using the re module.
Your format is a bit complicated, since a team's name can span a variable
number of words. But I'm assuming that team names never contain digits.
So, use '\d+' to separate the fields. E.g.
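    re.split(r'\d+', line)         # splits on every digit run, date included; scores are dropped
    re.split(r'(\d+)', line)       # capturing group keeps the digit runs in the result
    re.split(r'(\d+)', line[10:])  # skip the 10-character date so only the scores split the line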
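
Putting the last form to work, a minimal sketch of the whole parse might look like the following. The function name parse_game and the conversion of scores to int are my own choices for illustration, not something from the original post. It assumes that team names contain no digits and that the date always occupies the first 10 characters of the line.

```python
import re

def parse_game(line):
    """Parse one 'date team1 score1 team2 score2' line into a dict.

    Returns None for lines that do not contain two scores.
    """
    date = line[:10]  # assumed fixed width: MM/DD/YYYY
    # Splitting on a captured digit run keeps the scores in the result:
    # [team1, score1, team2, score2, trailing text]
    parts = re.split(r'(\d+)', line[10:])
    if len(parts) < 5:
        return None
    team1, score1, team2, score2 = parts[:4]
    return {
        'date': date,
        'team1': team1.strip(),   # surrounding whitespace may vary in width
        'score1': int(score1),
        'team2': team2.strip(),
        'score2': int(score2),
    }

sample = """09/04/2004 Virginia 44 Temple 14
09/04/2004 LSU 22 Oregon State 21
09/09/2004 Troy State 24 Missouri 14
"""

# Build the schedule, skipping any line that does not parse.
schedule = [g for g in map(parse_game, sample.splitlines()) if g]
```

Because the split happens on the digits rather than on column positions, the same function should cope with files that put one space or many between a team name and its score.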
--
William Park <opengeometry at yahoo.ca>
Open Geometry Consulting, Toronto, Canada