Newbie Question: Regular Expressions

gbreed at cix.compulink.co.uk gbreed at cix.compulink.co.uk
Thu Jul 12 12:42:09 EDT 2001


In article <mailman.994953021.32263.python-list at python.org>, 
fett at tradersdata.com () wrote:

> I have a really dumb program that I would like to make smarter.  I need
> to take a file on my hard drive and filter out everything except for the
> standings which are written in it.  I have tried to use regular
> expressions with no success, but I still think that they are probably
> the best way.  I created the following simple fix, but it is unreliable
> if the data changes position.
> 
> 
> input = open('rawdata', 'r')
> S = input.read()
> print S[4021:6095]
> 
> Output :
>    League Standings
>    American League
>      EAST W L PCT GB HOME ROAD EAST CENT WEST NL L10 STRK
>      Red Sox 43 29 .597 - 23-15 20-14 23-13 8-7 6-6 6-3 6-4 L2
>      Yankees 41 31 .569 2.0 21-15 20-16 19-11 12-9 5-7 5-4 6-3 W2
>      Blue Jays 35 38 .479 8.5 18-20 17-18 14-13 6-7 11-13 4-5 5-5 W3
>      Orioles 34 39 .466 9.5 20-20 14-19 15-17 9-12 6-5 4-5 5-5 L1
> ........( it continues with all the standings)

Even without regular expressions, I think input.readlines()[4:] or the 
like would work, and be simpler than what you do now.
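
As a rough sketch of the line-based approach, the helper below keeps 
everything from the first line containing a heading onward.  The function 
name standings_lines and the choice of "League Standings" as the marker 
are assumptions made for illustration (the marker is taken from the 
quoted output), and the code is written for current Python 3.

```python
def standings_lines(lines, heading="League Standings"):
    """Return the lines from the first one containing `heading` onward.

    `heading` is an assumed marker; returns an empty list if it is absent.
    """
    for i, line in enumerate(lines):
        if heading in line:
            return lines[i:]
    return []
```

With the file from the question this might be called as 
standings_lines(open('rawdata').readlines()).  Unlike S[4021:6095], it 
does not depend on the exact character offsets of the data.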

re.findall(r'((?:[A-Z]\w+ ){1,2}[-0-9. ]+\w\d)', S) does the trick on this 
data.  (The r prefix makes it a raw string, so the backslashes reach the 
re module unchanged.)
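
A small self-contained check of that pattern, using the sample lines 
quoted above in place of the real file (the real S would come from 
reading 'rawdata').  This is written for current Python 3, so print is 
a function here.

```python
import re

# Sample copied from the quoted output; stands in for the real file contents.
S = """\
   League Standings
   American League
     EAST W L PCT GB HOME ROAD EAST CENT WEST NL L10 STRK
     Red Sox 43 29 .597 - 23-15 20-14 23-13 8-7 6-6 6-3 6-4 L2
     Yankees 41 31 .569 2.0 21-15 20-16 19-11 12-9 5-7 5-4 6-3 W2
     Blue Jays 35 38 .479 8.5 18-20 17-18 14-13 6-7 11-13 4-5 5-5 W3
     Orioles 34 39 .466 9.5 20-20 14-19 15-17 9-12 6-5 4-5 5-5 L1
"""

pattern = re.compile(r'((?:[A-Z]\w+ ){1,2}[-0-9. ]+\w\d)')

# findall returns the text of the single capturing group for each match.
rows = pattern.findall(S)
for row in rows:
    print(row)
```

On this sample the headings and the column-title line are skipped and 
only the four team rows are returned.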


(?:[A-Z]\w+ )

matches a capital letter followed by one or more word characters and then 
a space.  The (?: ... ) form groups without capturing.  It could perhaps 
be tightened to (?:[A-Z][a-z]+ )

{1,2}

matches 1 or 2 such words.  This would fail on a team with a three-word 
name.

[-0-9. ]+

matches one or more digits, hyphens, periods or spaces.  That covers the 
stuff in the middle.  You may like to make it more specific.

\w\d

then a word character followed by a digit to end (the L2 or W3 in the 
STRK column).  If that first character is always a capital letter, it 
could be [A-Z]\d and if it's always W or L, [WL]\d


)

closes the outer group, so findall returns the whole match as a single 
group.
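
Putting the stricter alternatives mentioned above together, a commented 
version might look like the sketch below.  It assumes team names are 
capitalised words in plain lower case and that the final column always 
starts with W or L; the name STANDINGS_ROW is made up for illustration.

```python
import re

# re.VERBOSE lets the pattern carry comments.  Whitespace outside a
# character class is ignored, so the literal space is written as [ ].
STANDINGS_ROW = re.compile(r"""
    (                          # capture the whole row
      (?:[A-Z][a-z]+[ ]){1,2}  # one or two capitalised words, each plus a space
      [-0-9. ]+                # digits, hyphens, periods and spaces in the middle
      [WL]\d                   # W or L followed by a digit to end
    )
""", re.VERBOSE)

line = "     Blue Jays 35 38 .479 8.5 18-20 17-18 14-13 6-7 11-13 4-5 5-5 W3"
header = "     EAST W L PCT GB HOME ROAD EAST CENT WEST NL L10 STRK"

match = STANDINGS_ROW.search(line)
no_match = STANDINGS_ROW.search(header)
```

Here match.group(1) holds the row without the leading spaces, and 
no_match is None because the column titles contain no lower-case letters.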

> Also could you tell me if it's possible to download the data from the
> web page in Python so that it doesn't even have to deal with opening the
> file.

Sure is!  Check out the urllib module.
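
A minimal sketch of how that could look in current Python 3, where the 
function lives in urllib.request.  The helper name fetch_standings, the 
latin-1 default encoding and whatever URL gets passed in are all 
assumptions, since the question does not say which page the data comes 
from.

```python
import re
from urllib.request import urlopen

ROW = re.compile(r'((?:[A-Z]\w+ ){1,2}[-0-9. ]+\w\d)')

def fetch_standings(url, encoding="latin-1"):
    """Download `url` and return the standings rows found in its text.

    `url` and `encoding` are placeholders; adjust them for the real page.
    """
    with urlopen(url) as response:
        text = response.read().decode(encoding)
    return ROW.findall(text)
```

If the page is HTML rather than plain text, the pattern would probably 
need adjusting to cope with the markup around each row.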


                       Graham


