fast text processing

Larry Bates larry.bates at websafe.com
Tue Feb 21 12:42:24 EST 2006



Alexis Gallagher wrote:
> Steve,
> 
> First, many thanks!
> 
> Steve Holden wrote:
>> Alexis Gallagher wrote:
>>>
>>> filehandle = open("data",'r',buffering=1000)
>>
>> This buffer size seems, shall we say, unadventurous? It's likely to
>> slow things down considerably, since the filesystem is probably going
>> to naturally want to use a rather larger value. I'd suggest a 64k minimum.
> 
> Good to know. I should have dug into the docs deeper. Somehow I thought
> it listed lines not bytes.
> 
>>> for currentLine in filehandle.readlines():
>>>
>> Note that this is going to read the whole file in to (virtual) memory
>> before entering the loop. I somehow suspect you'd rather avoid this if
>> you could. I further suspect your testing has been with smaller files
>> than 80GB ;-). You might want to consider
>>
> 
> Oops! Thanks again. I thought that readlines() was the generator form,
> based on the docstring comments about the deprecation of xreadlines().
> 
>>> So on every iteration I'm processing mutable strings -- this seems
>>> wrong. What's the best way to speed this up? Can I switch to some
>>> fast byte-oriented immutable string library? Are there optimizing
>>> compilers? Are there better ways to prep the file handle?
>>>
>> I'm sorry but I am not sure where the mutable strings come in. Python
>> strings are immutable anyway. Well-known for it.
> 
> I misspoke. I think I was mixing this up with the issue of object-creation
> overhead for all of the string handling in general. Is this a bottleneck
> to string processing in python, or is this a hangover from my Java days?
> I would have thought that dumping the standard string processing
> libraries in favor of byte manipulation would have been one of the
> biggest wins.
> 
>> Of course you leave us in the dark about the nature of
>> table.markEquivalent as well.
> 
> markEquivalent() implements union-find (a.k.a. up-trees) to generate
> equivalence classes. Optimising that was going to be my next task.
> 
> I feel a bit silly for missing the double-processing of everything.
> Thanks for pointing that out. And I will check out the biopython package.
> 
> I'm still curious if optimizing compilers are worth examining. For
> instance, I saw Pyrex and Psyco mentioned on earlier threads. I'm
> guessing that both this tokenizing and the up-tree implementation sound
> like good candidates for one of those tools, once I shake out these
> algorithmic problems.
> 
> 
> alexis
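
On the union-find (up-tree) point above: once the I/O side is sorted
out, path compression plus union by rank keeps each find/union close
to constant time.  Just as a sketch (the class and method names below
are mine, not the poster's actual markEquivalent()):

class UnionFind:
    """Up-tree equivalence classes: path compression + union by rank."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def find(self, item):
        # Unseen items start out as their own singleton class.
        if item not in self.parent:
            self.parent[item] = item
            self.rank[item] = 0
            return item
        # Walk up to the root, then compress the path behind us.
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def mark_equivalent(self, a, b):
        # Merge the classes containing a and b (union by rank).
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1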

When your problem is I/O bound there is almost nothing that can be
done to speed it up without some sort of refactoring of the input
data itself.  Python reads bytes off a hard drive just as fast as
any compiled language.  A good test is to copy the file and measure
the time.  You can't make your program run any faster than a copy
of the file itself without making hardware changes (e.g. RAID
arrays, etc.).
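
As a quick way to get that baseline without actually copying 80GB,
you can time one plain sequential pass over the file.  The sketch
below also folds in Steve's two suggestions (a 64k buffer and
iterating over the file object instead of readlines()); the filename
and buffer size are only placeholders:

import time

# One sequential pass to establish the I/O baseline.  Iterating over
# the file object reads a line at a time, so the 80GB file is never
# pulled into memory at once.
start = time.time()
nbytes = 0
filehandle = open("data", "r", 65536)   # 64k buffer
for line in filehandle:
    nbytes += len(line)
filehandle.close()
elapsed = time.time() - start
print("read %d bytes in %.1f seconds" % (nbytes, elapsed))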

You might also want to take a look at the csv module.  Reading lines
and splitting on delimiters is almost always handled well by csv.
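
For example (the tab delimiter and the per-row handling here are only
a guess at what the data looks like):

import csv

# Stream the file row by row; the csv reader does the splitting and
# only ever holds one line in memory at a time.
filehandle = open("data", "r", 65536)
reader = csv.reader(filehandle, delimiter="\t")
nrows = 0
for row in reader:
    # row is already a list of fields, no manual split() needed
    nrows += 1
filehandle.close()
print("%d rows" % nrows)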

-Larry Bates


