Using re to get data from text file

Fri Sep 10 10:53:32 EDT 2004

Jocknerd <jocknerd1 at yahoo.com> wrote:
> I'm a Python newbie and I'm having trouble with Regular Expressions when
> reading in a text file.  Here is a sample layout of the input file:
> 
> 09/04/2004  Virginia              44   Temple               14
> 09/04/2004  LSU                   22   Oregon State         21
> 09/09/2004  Troy State            24   Missouri             14
> 
> As you can see, the text file contains a list of games.  Each game has a
> date, a winning team, the winning team's score, the losing team, and the
> losing team's score.  If I set up my program to import the data with fixed
> length formats, it's no problem.  But some of my text files have different
> layouts.  For instance, some only have one space between a team name and
> their score.
> 
> Here's how I read in the file using fixed length fields:
> 
> filename = sys.argv[1]
> file = open (filename, 'r')
> 
> schedule = []     # make a list called schedule
> 
> while True:
>     line = file.readline()
>     if not line: break
>     game = {}     # make a dictionary called game
>     game['date']   = line[0:10]   # fixed length field
>     game['team1']  = string.strip (line[12:40])
>     game['score1'] = line[40:42]
>     game['team2']  = string.strip (line[44:72])
>     game['score2'] = line[72:74]
>     schedule.append(game)
> 
> file.close()
> 
> Note:  I'm stripping whitespace from the team names because I don't want
> the team name to actually be a fixed length.
> 
> How would I set this up to read in the data using Regular expressions?
> 
> I've tried this:
> 
> while True:
>     line = file.readline ()
>     if not line: break
>     game = {}
>     datePattern = re.compile('^(\d{2})\D+(\d{2})\D+(\d{4})')
> 
> Here's where I get stuck.  What do I do from here?  I just don't know how
> to import the text and assign it to the proper fields using the re module.


Your format is a bit awkward because a team name can be any number of
words.  But, assuming that no team name contains a digit, you can split
each line on runs of digits with '\d+'.  E.g.
    re.split(r'\d+', line)           # drops the digits, and also splits the date
    re.split(r'(\d+)', line)         # the group keeps the digits in the result
    re.split(r'(\d+)', line[10:])    # skip the 10-character date first
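
To show how the last form might feed the 'game' dictionary from your
original code, here is a minimal sketch.  It assumes the date always
fills the first 10 characters and that no team name contains a digit.
The function name parse_game and the conversion of scores to int are my
own choices for illustration, not something from your program.

```python
import re

def parse_game(line):
    """Build a game dict from one schedule line (hypothetical helper)."""
    date = line[:10]                        # assumption: date is always 10 chars
    # The capturing group makes re.split keep the scores in the result:
    # [team1, score1, team2, score2, whatever trails the last score]
    parts = re.split(r'(\d+)', line[10:])
    team1, score1, team2, score2 = parts[:4]
    return {
        'date': date,
        'team1': team1.strip(),             # strip the padding around names
        'score1': int(score1),              # assumption: scores wanted as ints
        'team2': team2.strip(),
        'score2': int(score2),
    }

sample = "09/04/2004  Virginia              44   Temple               14\n"
game = parse_game(sample)
```

Since the split depends only on where the digits are, the same function
should also handle the layouts with a single space between a team name
and its score.  A line with fewer than two scores after the date will
raise ValueError at the unpacking step, so you may want to catch that
for malformed input.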

-- 
William Park <opengeometry at yahoo.ca>
Open Geometry Consulting, Toronto, Canada