regular expression for integer and decimal numbers

gary gary.wilson at gmail.com
Sun Sep 26 12:59:59 EDT 2004


bokr at oz.net (Bengt Richter) wrote in message news:<cj4tfm$seh$0$216.39.172.122 at theriver.com>...
> On 25 Sep 2004 13:13:22 -0700, gary.wilson at gmail.com (gary) wrote:
> 
> >Peter Hansen <peter at engcorp.com> wrote in message news:<pbadnZrDHOinY87cRVn-jg at powergate.ca>...
> >> gary wrote:
> >> > I want to pick all integers and decimal numbers out of a string.
> >> > Would this be the most correct regular expression to use?
> >> > 
> >> > "\d+\.?\d*"
> >> 
> >> Examples, including the most extreme cases you want to handle,
> >> are always a good idea.
> >> 
> >> -Peter
> >
> >Here is an example of what I will be dealing with:
> >"""
> >TOTAL FIRST DOWNS                                     19        21
> >   By Rushing                                         11         6
> >   By Passing                                          6        10
> >   By Penalty                                          2         5
> >THIRD DOWN EFFICIENCY                           4-11-36%  6-14-43%
> >FOURTH DOWN EFFICIENCY                            0-1-0%    0-0-0%
> >TOTAL NET YARDS                                      379       271
> >   Total Offensive Plays (inc. times thrown passing)  58        63
> >   Average gain per offensive play                   6.5       4.3
> >NET YARDS RUSHING                                    264       115
> >"""

> Are you sure you want to throw away all the info implicit in the structure of that data?
> How about the columns? Will you get other input with more columns?

There are several other instances in the files that I am extracting
data from where the numbers are not so nicely arranged in columns, so
I am really looking for something that could be used in all instances.
(http://www.nfl.com/gamecenter/gamebook/NFL_20020929_TEN@OAK)
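
For the inputs that are laid out in columns, the structure mentioned above could be kept rather than discarded. The following is only a sketch in modern Python 3, not something posted in this thread: it assumes the label and each value are separated by runs of two or more spaces, and the names `sample` and `rows` are made up for illustration.

```python
import re

# A few rows copied from the sample above.
sample = """\
TOTAL FIRST DOWNS                                     19        21
   By Rushing                                         11         6
THIRD DOWN EFFICIENCY                           4-11-36%  6-14-43%
   Average gain per offensive play                   6.5       4.3
"""

rows = []
for line in sample.splitlines():
    # Assumption: two or more whitespace characters separate the fields.
    label, *values = re.split(r"\s{2,}", line.strip())
    rows.append((label, values))

for label, values in rows:
    print(label, values)
```

This keeps each statistic paired with its label, but it breaks down on exactly the kind of input described above, where the numbers are not arranged in columns.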

I do, however, still need to convert everything from strings to numbers.
I was thinking about using the following for that, unless someone has a
better solution:

>>> def StrToNum(s):  # parameter renamed so it no longer shadows the built-in str
...     try: return int(s)
...     except ValueError:
...         try: return float(s)
...         except ValueError: return s

>>> statlist = ['10', '6', '2002', 'tampa bay buccaneers',
'atlanta falcons', 'the georgia dome', '1', '03', 'pm', 'est', 'artificial',
'0', '3', '7', '10', '0', '20', '3', '0', '3', '0', '0', '6', '15',
'14', '5', '2', '9', '10', '1', '2', '4', '13', '31', '3', '14', '21',
'1', '1', '100', '0', '1', '0', '327', '243', '59', '64', '5.5',
'3.8', '74', '70', '26', '22', '2.8', '3.2', '2', '3', '2', '3',
'253', '173', '2', '8', '4', '14', '261', '187', '31', '17', '1',
'38', '17', '4', '7.7', '4.1', '5', '3', '0', '3', '2', '2', '5',
'43.2', '5', '45.6', '0', '0', '0', '0', '0', '0', '31.2', '41.6',
'50', '40', '0', '0', '3', '40', '0', '0', '5', '120', '4', '50', '1',
'0', '6', '35', '6', '41', '1', '1', '0', '0', '2', '0', '0', '0',
'1', '0', '1', '0', '2', '2', '0', '0', '2', '2', '0', '0', '2', '2',
'2', '3', '0', '2', '0', '0', '2', '0', '0', '1', '0', '0', '0', '0',
'0', '0', '20', '6', '29', '34', '30', '26', '3', '37', '9', '59',
'9', '35', '6', '23', 0, 0, '11', '23', '5', '01', '5', '25', '8',
'37', 0, 0, '26']
>>> [StrToNum(item) for item in statlist]
[10, 6, 2002, 'tampa bay buccaneers', 'atlanta falcons',
'the georgia dome', 1, 3, 'pm', 'est', 'artificial', 0, 3, 7, 10, 0, 20, 3, 0, 3,
0, 0, 6, 15, 14, 5, 2, 9, 10, 1, 2, 4, 13, 31, 3, 14, 21, 1, 1, 100,
0, 1, 0, 327, 243, 59, 64, 5.5, 3.7999999999999998, 74, 70, 26, 22,
2.7999999999999998, 3.2000000000000002, 2, 3, 2, 3, 253, 173, 2, 8, 4,
14, 261, 187, 31, 17, 1, 38, 17, 4, 7.7000000000000002,
4.0999999999999996, 5, 3, 0, 3, 2, 2, 5, 43.200000000000003, 5,
45.600000000000001, 0, 0, 0, 0, 0, 0, 31.199999999999999,
41.600000000000001, 50, 40, 0, 0, 3, 40, 0, 0, 5, 120, 4, 50, 1, 0, 6,
35, 6, 41, 1, 1, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0, 2, 2, 0, 0, 2, 2, 0, 0,
2, 2, 2, 3, 0, 2, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 20, 6, 29, 34,
30, 26, 3, 37, 9, 59, 9, 35, 6, 23, 0, 0, 11, 23, 5, 1, 5, 25, 8, 37,
0, 0, 26]
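
The int-then-float fallback can also be written as a loop over converters. This is a minimal sketch in modern Python 3; the name `to_number` is made up for illustration and does not come from this thread. The long digit strings in the output above (3.7999999999999998 and so on) are how older Python versions displayed the nearest binary float, not a conversion error; newer versions print the shortest string that round-trips, such as 3.8.

```python
def to_number(value):
    """Return value as an int if possible, else a float, else unchanged."""
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            continue  # try the next converter
    return value  # not numeric: hand the original value back


print([to_number(item) for item in ['10', '03', '5.5', '-6.3', 'pm', 0]])
```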

Another thing: I found a negative number, which kind of screws up
the regexes previously discussed.  So I came up with the workaround
below:
>>> import re
>>> text = """
... FGs - PATs Had Blocked                         0-0    0-0
... Net Punting Average                           -6.3   33.3
... TOTAL RETURN YARDAGE (Not Including Kickoffs)   14    257
...    No. and Yards Punt Returns                 1-14  2-157
... """
>>> # replace a number followed by a dash with the number followed by a space
>>> text = re.sub(r"(\d+)-", r"\1 ", text)
>>> # regex discussed before, but with an optional negative sign in front
>>> teamstats = re.findall(r"-?\d+\.?\d*", text)
>>> teamstats
['0', '0', '0', '0', '-6.3', '33.3', '14', '257', '1', '14', '2',
'157']
>>> [StrToNum(item) for item in teamstats]
[0, 0, 0, 0, -6.2999999999999998, 33.299999999999997, 14, 257, 1, 14,
2, 157]
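
A possible alternative to the two-step workaround is a single pattern with a negative lookbehind, so that a dash only counts as a minus sign when it does not directly follow a digit. This is a sketch in modern Python 3, not something proposed in the thread; the names `NUMBER` and `text` are made up for illustration. Unlike `\d+\.?\d*`, this pattern requires digits after the decimal point, so a trailing dot as in "5." is not included in the match.

```python
import re

# Hypothetical pattern: (?<!\d) rejects a match that starts right after a
# digit, so the "-" in "1-14" is treated as a separator, not a sign.
NUMBER = re.compile(r"(?<!\d)-?\d+(?:\.\d+)?")

text = """
FGs - PATs Had Blocked                         0-0    0-0
Net Punting Average                           -6.3   33.3
TOTAL RETURN YARDAGE (Not Including Kickoffs)   14    257
   No. and Yards Punt Returns                 1-14  2-157
"""

teamstats = NUMBER.findall(text)
print(teamstats)
```

On this sample it yields the same twelve strings as the substitution approach above, with '-6.3' keeping its sign.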

Gary


