Reading a file, sans whitespace

Michael Geary Mike at DeleteThis.Geary.com
Mon May 24 12:31:16 EDT 2004


> Michael Geary wrote:
> > For example, these do exactly the same thing:
> >
> > import re
> > for line in file( 'inputFile' ).readlines():
> >     print re.split( '\s+', line.strip() )
> >
> > import re
> > reWhitespace = re.compile( '\s+' )
> > for line in file( 'inputFile' ).readlines():
> >     print reWhitespace.split( line.strip() )
> >
> > But for a large file, the second version will be faster because
> > the regular expression is compiled only once instead of every
> > time through the loop.

Terry Reedy wrote:
> I am curious whether you have actually timed this or seen others
> timings. My impression (from other posts and from reading the
> code a year ago) is that the current re implementation caches
> compiled re's (recache[hash(restring)] = re.compile(restring))
> just so that the first example will *not* recompile every time thru
> the loop.  If so, I think one should name an re for pretty much the
> same reasons as for anything else: conceptual chunking and reuse
> in multiple places.

Oh man, is my face red! No, I didn't know about the caching, and I hadn't
timed this. One should never make assumptions about performance issues! :-)
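For what it's worth, the caching Terry describes is visible directly in CPython (it is an implementation detail, not a documented guarantee, so the exact behavior may differ between versions). In current Python 3, compiling the same pattern string twice even hands back the same cached object:

```python
import re

# CPython's re module keeps an internal cache of compiled patterns,
# so compiling the same pattern string twice yields the same object.
# (Implementation detail -- not something to rely on in code.)
a = re.compile(r'\s+')
b = re.compile(r'\s+')
print(a is b)

# re.split(pattern, s) goes through the same cache, so the pattern
# is only looked up, not recompiled, on each call.
print(re.split(r'\s+', 'a  b   c'))
```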

Also, as Konstantin pointed out, file( 'inputFile' ).readlines() should be
just file( 'inputFile' ), and I just noticed that I didn't use raw strings
for the regular expressions. '\s+' happens to work, but it would be better
to be in the habit of writing r'\s+' instead. This was not my day for
posting good code samples!
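The raw-string point matters because ordinary string literals interpret backslash escapes before the re module ever sees them. A rough illustration (Python 3 syntax here; recent versions actually warn about accidental escapes like a bare '\s' in a plain string):

```python
# '\n' is one character (a newline); r'\n' is two characters
# (a backslash and an 'n') that the regex engine interprets itself.
print(len('\n'))    # 1
print(len(r'\n'))   # 2

# r'\s+' passes a literal backslash-s to the regex engine, which is
# what you want. A plain '\s+' only works by luck, because \s happens
# not to be a recognized string escape.
import re
print(re.split(r'\s+', 'abc   def  ghi'))
```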

Now that you've shamed me into actually testing the performance, it turns
out that precompiling the regular expression does make a difference.
Consider these examples:

import re, time

# Build a large list of sample lines to split.
input = []
for i in xrange( 1000000 ):
    input.append( '%d abc   def   ghi  jkl mno    pqr    stu' % i )

# Version 1: pass the pattern string to re.split() on every iteration.
start = time.time()
for line in input:
    result = re.split( r'\s+', line )
print time.time() - start

import re, time

# Same setup as above.
input = []
for i in xrange( 1000000 ):
    input.append( '%d abc   def   ghi  jkl mno    pqr    stu' % i )

# Version 2: compile the pattern once, then reuse the compiled object.
start = time.time()
reWhitespace = re.compile( r'\s+' )
for line in input:
    result = reWhitespace.split( line )
print time.time() - start

On my PIII-1.2GHz system, the first version runs in 27 seconds, and the
second version runs in 18 seconds, quite an improvement. I would guess that
the hash lookup for the cached regular expression is what's taking the extra
time in the first version, but I don't want to assume that's what it is. :-)
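For anyone repeating the experiment today, the standard timeit module is a cleaner harness than hand-rolled time.time() calls. A sketch in Python 3 syntax (the absolute numbers will of course differ by machine, but the relative gap should still show up):

```python
import re
import timeit

# Smaller sample than the original post, just to keep the run short.
lines = ['%d abc   def   ghi  jkl mno    pqr    stu' % i
         for i in range(10000)]

def with_module_function():
    # re.split must find the compiled pattern in re's internal
    # cache on every call.
    for line in lines:
        re.split(r'\s+', line)

def with_precompiled():
    # The compiled object is bound once; no per-call cache lookup.
    pattern = re.compile(r'\s+')
    for line in lines:
        pattern.split(line)

t1 = timeit.timeit(with_module_function, number=10)
t2 = timeit.timeit(with_precompiled, number=10)
print('re.split:     %.3f s' % t1)
print('precompiled:  %.3f s' % t2)
```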

-Mike
