fast text processing

Steve Holden steve at holdenweb.com
Tue Feb 21 04:10:22 EST 2006


Alexis Gallagher wrote:
> (I tried to post this yesterday but I think my ISP ate it. Apologies if 
> this is a double-post.)
> 
> Is it possible to do very fast string processing in python? My 
> bioinformatics application needs to scan very large ASCII files (80GB+), 
> compare adjacent lines, and conditionally do some further processing. I 
> believe the disk i/o is the main bottleneck so for now that's what I'm 
> optimizing. What I have now is roughly as follows (on python 2.3.5).
> 
> filehandle = open("data",'r',buffering=1000)

This buffer size seems, shall we say, unadventurous? It's likely to slow 
things down considerably, since the filesystem is probably going to 
naturally want to use a rather larger value. I'd suggest a 64k minimum.
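If you want to see what difference the buffer size actually makes before 
committing to a value, a rough timing sketch along these lines will tell 
you (the file name and sizes are just placeholders -- try it on a smaller 
sample file first):

import time

def scan(buffer_size):
    # Read the file once, line by line, with the given buffer size.
    start = time.time()
    fh = open("data", 'r', buffering=buffer_size)
    for line in fh:
        pass
    fh.close()
    return time.time() - start

print "1000 bytes:", scan(1000)
print "64k:       ", scan(64 * 1024)

Bear in mind the OS cache will flatter whichever run goes second, so 
alternate the order (or use separate copies of the file) before trusting 
the numbers.
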
> 
> lastLine = filehandle.readline()
> 
I'd suggest

lastTokens = filehandle.readline().strip().split(delimiter)

here. You have no need of the line other than to split it into tokens.

> for currentLine in filehandle.readlines():
> 
Note that this is going to read the whole file into (virtual) memory 
before entering the loop. I somehow suspect you'd rather avoid this if 
you could. I further suspect your testing has been with smaller files 
than 80GB ;-). You might want to consider

for currentLine in filehandle:

as an alternative. This uses the file object's iterator protocol to 
produce the next line each time round the loop.

>      lastTokens = lastLine.strip().split(delimiter)

The line above goes away if you adopt the loop initialization suggestion 
above. Otherwise you end up splitting each line twice, once as the 
current line and again as the last line.

>      currentTokens = currentLine.strip().split(delimiter)
> 
>      lastGeno = extract(lastTokens[0])
>      currentGeno = extract(currentTokens[0])
> 
If the extract() operation is stateless (in other words, if it always 
produces the same output for a given input) then again you are 
unnecessarily repeating yourself here: extract() gets run on the same 
data twice, once when it is the current line's first token and again, 
one loop later, when it has become the last line's first token.
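Concretely, hoisting the first extract() call out of the loop and 
carrying the previous result forward means each token only gets 
extracted once (this is the shape the summary further down takes):

lastGeno = extract(lastTokens[0])     # once, before the loop

for currentLine in filehandle:
     currentTokens = currentLine.strip().split(delimiter)
     currentGeno = extract(currentTokens[0])
     # ... comparison and marking as before ...
     lastGeno = currentGeno            # reuse, rather than re-extract
     lastTokens = currentTokens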

I might also observe that you seem to expect only two tokens per line. 
If this is invariably the case then you might want to consider writing 
an unpacking assignment instead, such as

        cToken0, cToken1 = currentLine.strip().split(delimiter)

to save the indexing. Not a big deal, though, and it *will* break if you 
have more than one delimiter in a line, as the interpreter won't then 
know what to do with the third and subsequent elements.
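For example, at the interactive prompt (with ":" standing in for 
whatever your delimiter actually is):

>>> cToken0, cToken1 = "AB123:456".split(":")
>>> cToken0, cToken1
('AB123', '456')
>>> cToken0, cToken1 = "AB123:456:789".split(":")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: too many values to unpack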

>      # prepare for next iteration
>      lastLine = currentLine
> 
Of course now you are going to strip and split that line all over again 
when you loop around. You should now just be able to say

     lastTokens = currentTokens

instead.

>      if lastGeno == currentGeno:
>         table.markEquivalent(int(lastTokens[1]),int(currentTokens[1]))
> 
> So on every iteration I'm processing mutable strings -- this seems 
> wrong. What's the best way to speed this up? Can I switch to some fast 
> byte-oriented immutable string library? Are there optimizing compilers? 
> Are there better ways to prep the file handle?
> 
I'm sorry but I am not sure where the mutable strings come in. Python 
strings are immutable anyway. Well-known for it. It might be a slight 
problem that you are creating a second terminator-less copy of each 
line, but that's probably inevitable.
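If in doubt, it's easy to confirm at the interactive prompt; any attempt 
to assign into a string raises a TypeError:

s = "ACGT"        # an arbitrary string for illustration
s[0] = "T"        # raises TypeError -- strings cannot be modified in place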

Of course you leave us in the dark about the nature of 
table.markEquivalent as well. Depending on the efficiency of the 
extract() operation you might want to consider short-circuiting the loop 
if the two tokens have already been marked as equivalent. Whether that's 
a big win depends on the relative costs.
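For instance, if your table can be queried as well as updated (the 
areEquivalent() method below is purely hypothetical -- substitute 
whatever your table object actually exposes):

     if lastGeno == currentGeno:
         a = int(lastTokens[1])
         b = int(currentTokens[1])
         # Hypothetical query method: skip the marking call if the
         # pair has already been recorded as equivalent.
         if not table.areEquivalent(a, b):
             table.markEquivalent(a, b)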

> Perhaps this is a job for C, but I am of that soft generation which 
> fears memory management. I'd need to learn how to do buffered reading in 
> C, how to wrap the C in python, and how to let the C call back into 
> python to call markEquivalent(). It sounds painful. I _have_ done some 
> benchmark comparisons of only the underlying line-based file reading 
> against a Common Lisp version, but I doubt I'm using the optimal 
> construct in either language so I hesitate to trust my results, and 
> anyway the interlanguage bridge will be even more obscure in that case.
> 
Probably the biggest gain will come simply from not reading the whole 
file into memory with .readlines().

Summarising, consider something more like:

filehandle = open("data",'r',buffering=64*1024)
# You could also try just leaving the buffering spec out

lastTokens = filehandle.readline().strip().split(delimiter)
lastGeno = extract(lastTokens[0])

for currentLine in filehandle:

     currentTokens = currentLine.strip().split(delimiter)
     currentGeno = extract(currentTokens[0])
     if lastGeno == currentGeno:
         table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))

     lastGeno = currentGeno
     lastTokens = currentTokens

> Much obliged for any help,
> Alexis

Hope this helps.

One last thought: the bioPython package will potentially save you huge 
amounts of time. The guys who developed and maintain it have done a lot 
of genome-slinging, and they appear to know what they are doing.

They may have already solved several problems you are struggling with.

regards
  Steve
-- 
Steve Holden       +44 150 684 7255  +1 800 494 3119
Holden Web LLC                     www.holdenweb.com
PyCon TX 2006                  www.python.org/pycon/



