speeding up reading files (possibly with cython)

John Machin sjmachin at lexicon.net
Sat Mar 7 19:35:26 EST 2009


On Mar 8, 9:06 am, per <perfr... at gmail.com> wrote:
> hi all,
>
> i have a program that essentially loops through a text file that's
> about 800 MB in size containing tab separated data... my program
> parses this file and stores its fields in a dictionary of lists.
>
> for line in file:
>   split_values = line.strip().split('\t')

line.strip() is NOT a very good idea because it strips all leading and
trailing whitespace, including tabs; if the last field on a line is empty,
the trailing tab goes with it and that field silently disappears.

line.rstrip('\n') is sufficient.

BUT as Skip has pointed out, you should be using the csv module
anyway.
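
If the file really is plain tab-delimited text, the csv version is only a
few lines. Rough sketch (the filename and the choice of column 0 as the
key are made up, since you didn't say which field you index by; opening in
'rb' is the Python 2 way of feeding a file to the csv module):

import csv
from collections import defaultdict

data = defaultdict(list)
f = open('data.txt', 'rb')  # hypothetical filename
for row in csv.reader(f, delimiter='\t'):
    if row:  # skip any blank lines
        data[row[0]].append(row)  # index by the first field (a guess)
f.close()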

An 800 MB file is unlikely to have been written by Excel. Excel has
this stupid idea of wrapping quotes around fields that contain commas
(and quotes) even when the field delimiter is NOT a comma.

Experiment: open Excel, enter the following 4 strings in cells A1:D1
normal
embedded,comma
"Hello"embedded-quote
normality returns
then save as Text (Tab-delimited).

Here's what you get:
| >>> open('Excel_tab_delimited.txt', 'rb').read()
| 'normal\t"embedded,comma"\t"embedded""Hello""quote"\tnormality
returns\r\n'
| >>>
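
The csv module's built-in 'excel-tab' dialect undoes that quoting for you.
A quick sketch, re-reading the file written above:

| >>> import csv
| >>> list(csv.reader(open('Excel_tab_delimited.txt', 'rb'), dialect='excel-tab'))
| [['normal', 'embedded,comma', 'embedded"Hello"quote', 'normality returns']]
| >>>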

>   # do stuff with split_values
>
> currently, this is very slow in python, even if all i do is break up
> each line using split() and store its values in a dictionary, indexing
> by one of the tab separated values in the file.
>
> is this just an overhead of python that's inevitable? do you guys
> think that switching to cython might speed this up, perhaps by
> optimizing the main for loop?  or is this not a viable option?

You are unlikely to get much speed-up from Cython: I'd expect the loop
overhead itself to be a tiny part of the execution time; most of it will
be in splitting the lines and building the dictionary.
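
A crude way to check that for yourself before reaching for Cython: time the
bare iteration over the file against the full split-and-store loop (sketch
only; 'data.txt' and the key column are made up, as above):

import time

t0 = time.time()
for line in open('data.txt'):  # bare iteration: mostly I/O
    pass
print 'raw iteration: %.1f seconds' % (time.time() - t0)

t0 = time.time()
d = {}
for line in open('data.txt'):
    fields = line.rstrip('\n').split('\t')       # parse one row
    d.setdefault(fields[0], []).append(fields)   # store it, keyed on field 0
print 'split + store: %.1f seconds' % (time.time() - t0)

If the first number is already a large fraction of the second, Cython-ising
the loop machinery isn't going to buy you much.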



