fast text processing
Alexis Gallagher
public at alexisgallagher.com
Tue Feb 21 05:18:15 EST 2006
Steve,
First, many thanks!
Steve Holden wrote:
> Alexis Gallagher wrote:
>>
>> filehandle = open("data",'r',buffering=1000)
>
> This buffer size seems, shall we say, unadventurous? It's likely to slow
> things down considerably, since the filesystem is probably going to
> naturally want to use a rather larger value. I'd suggest a 64k minimum.
Good to know. I should have dug deeper into the docs. Somehow I thought
the buffering argument was measured in lines, not bytes.
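
A minimal sketch of what the larger-buffer version might look like. The
64 KiB figure is the minimum suggested above; the count_lines name and
the line counting are only illustrative assumptions, standing in for the
real per-line work.

```python
BUFFER_SIZE = 64 * 1024  # bytes, not lines; 64 KiB is the suggested minimum


def count_lines(path, buffer_size=BUFFER_SIZE):
    """Count the lines in a text file opened with an explicit buffer size."""
    count = 0
    # buffering > 1 sets the size in bytes of the underlying read buffer
    with open(path, "r", buffering=buffer_size) as filehandle:
        for _ in filehandle:
            count += 1
    return count
```

Whether a larger buffer helps in practice depends on the filesystem, so
it is worth timing a few sizes on a sample of the real data.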
>> for currentLine in filehandle.readlines():
>>
> Note that this is going to read the whole file in to (virtual) memory
> before entering the loop. I somehow suspect you'd rather avoid this if
> you could. I further suspect your testing has been with smaller files
> than 80GB ;-). You might want to consider
>
Oops! Thanks again. I thought that readlines() was the generator form,
based on the docstring comments about the deprecation of xreadlines().
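
For the record, a small sketch of the lazy form, assuming plain
newline-terminated text. Iterating over the file object itself yields
one line at a time, whereas readlines() builds the full list first. The
iter_stripped_lines name is hypothetical.

```python
def iter_stripped_lines(path, buffer_size=64 * 1024):
    """Yield lines one at a time, without loading the whole file."""
    with open(path, "r", buffering=buffer_size) as filehandle:
        # The file object is its own lazy iterator; no readlines() needed.
        for current_line in filehandle:
            yield current_line.rstrip("\n")
```

Used as "for line in iter_stripped_lines('data'): ...", memory use stays
roughly constant regardless of file size.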
>> So on every iteration I'm processing mutable strings -- this seems
>> wrong. What's the best way to speed this up? Can I switch to some fast
>> byte-oriented immutable string library? Are there optimizing
>> compilers? Are there better ways to prep the file handle?
>>
> I'm sorry but I am not sure where the mutable strings come in. Python
> strings are immutable anyway. Well-known for it.
I misspoke. I think I was mixing this up with the issue of object-creation
overhead for all of the string handling in general. Is this a bottleneck
to string processing in Python, or is this a hangover from my Java days?
I would have thought that dumping the standard string processing
libraries in favor of byte manipulation would have been one of the
biggest wins.
> Of course you leave us in the dark about the nature of
> table.markEquivalent as well.
markEquivalent() implements union-find (a.k.a. uptrees) to generate
equivalence classes. Optimising that was going to be my next task.
I feel a bit silly for missing the double-processing of everything.
Thanks for pointing that out. And I will check out the biopython package.
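
To make the uptree idea concrete, here is a rough sketch of a union-find
(disjoint-set) table with path compression and union by size. This is
not the actual markEquivalent() code; the EquivalenceTable class and its
method names are hypothetical stand-ins.

```python
class EquivalenceTable:
    """Hypothetical stand-in for the table behind markEquivalent()."""

    def __init__(self):
        self.parent = {}  # item -> parent item; a root is its own parent
        self.size = {}    # root -> number of items in its tree

    def find(self, item):
        """Return the root representing item's equivalence class."""
        if item not in self.parent:
            self.parent[item] = item
            self.size[item] = 1
            return item
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def mark_equivalent(self, a, b):
        """Merge the classes containing a and b (union by size)."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]

    def classes(self):
        """Return the equivalence classes as a list of sets."""
        groups = {}
        for item in self.parent:
            groups.setdefault(self.find(item), set()).add(item)
        return list(groups.values())
```

With both optimisations each operation is close to constant time when
amortised, so the table itself should not dominate the run.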
I'm still curious if optimizing compilers are worth examining. For
instance, I saw Pyrex and Psyco mentioned on earlier threads. I'm
guessing that both this tokenizing and the uptree implementations sound
like good candidates for one of those tools, once I shake out these
algorithmic problems.
alexis