[Numpy-discussion] Possible roadmap addendum: building better text file readers
Wes McKinney
wesmckinn at gmail.com
Thu Feb 23 14:32:13 EST 2012
dear all,
I haven't read all 180 e-mails, but I didn't see this on Travis's
initial list.
All of the existing flat file reading solutions I have seen are
not suitable for many applications, and they compare very unfavorably
to tools present in other languages, like R. Here are some of the
main issues I see:
- Memory usage: creating millions of Python objects when reading
a large file results in horrendous memory usage, and the
Python interpreter is loath to return that memory to the
operating system. Any solution using the CSV module (like
pandas's parsers, which are a lot faster than anything else I
know of in Python) suffers from this problem because the data
come out boxed in tuples of PyObjects. Try loading a 1,000,000
x 20 CSV file into a structured array using np.genfromtxt or
into a DataFrame using pandas.read_csv and you will immediately
see the problem. R, by contrast, uses very little memory.
- Performance: post-processing of Python objects results in poor
performance. Also, for the actual parsing, anything regular
expression based (like the loadtable effort over the summer,
all apologies to those who worked on it) is doomed to
failure. I do think a tool with a high degree of
compatibility and intelligence for parsing unruly small files
makes sense, but it's not appropriate for large,
well-behaved files.
- Need to "factorize": as soon as there is an enum dtype in
NumPy, we will want to enable the file parsers for structured
arrays and DataFrame to be able to "factorize" / convert to
enum certain columns (for example, all string columns) during
the parsing process, and not afterward. This is very important
for enabling fast groupby on large datasets and reducing
unnecessary memory usage up front (imagine a column with a
million values, with only 10 unique values occurring). This
would be trivial to implement using a C hash table
library like khash.h.
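To make the "factorize" idea concrete, here is a minimal sketch in pure Python. It only illustrates the logic: a plain dict stands in for a C hash table such as khash, and the function name read_factorized and its arguments are hypothetical, not part of NumPy or pandas. Because it still goes through the csv module, it creates a Python string per field and so does not deliver the memory savings a C implementation would; it only shows codes being assigned while rows are parsed rather than afterward.

```python
import csv
import io
from array import array


def read_factorized(lines, factor_cols):
    """Parse CSV text, storing the chosen columns as integer codes.

    Hypothetical helper for illustration only. Returns (columns,
    categories): columns maps each header name to either an array of
    int codes (factorized) or a list of raw strings; categories maps
    each factorized header name to its unique values in code order.
    """
    reader = csv.reader(lines)
    header = next(reader)
    # One value -> code table per factorized column (the dict plays
    # the role a C hash table would play in a real parser).
    lookup = {name: {} for name in factor_cols}
    columns = {
        name: (array("i") if name in lookup else []) for name in header
    }
    for row in reader:
        for name, value in zip(header, row):
            if name in lookup:
                table = lookup[name]
                # First sighting of a value gets the next free code.
                code = table.setdefault(value, len(table))
                columns[name].append(code)
            else:
                columns[name].append(value)
    # Dicts keep insertion order, so position i holds the value for code i.
    categories = {name: list(table) for name, table in lookup.items()}
    return columns, categories


text = "city,price\nNYC,1.5\nSF,2.0\nNYC,3.25\n"
columns, categories = read_factorized(io.StringIO(text), {"city"})
```

Here columns["city"] holds small integer codes and categories["city"] holds each distinct string once, which is the shape of data a groupby would want to consume directly.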
To be clear: I'm going to do this eventually whether or not it
happens in NumPy because it's an existing problem for heavy
pandas users. I see no reason why the code can't emit structured
arrays, too, so we might as well have a common library component
that I can use in pandas and specialize to the DataFrame internal
structure.
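As a rough sketch of the "common component" idea, the following writes parsed values directly into typed storage and returns a structured array. The function read_structured is hypothetical, and it assumes a header row, no quoting, and a dtype supplied by the caller (no type inference). The per-line Python loop is slow and is only there to show the shape of the interface; the real work would happen in C or Cython.

```python
import io

import numpy as np

# Map NumPy dtype kinds to Python converters; anything else stays a string.
_CONVERTERS = {"i": int, "u": int, "f": float}


def read_structured(lines, dtype, delimiter=","):
    """Parse delimited text into a NumPy structured array.

    Hypothetical helper for illustration only: assumes a header row,
    no quoted fields, and a caller-supplied dtype.
    """
    dtype = np.dtype(dtype)
    converters = [
        _CONVERTERS.get(dtype[name].kind, str) for name in dtype.names
    ]
    next(lines)  # skip the header row
    out = np.empty(16, dtype=dtype)
    n = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if n == len(out):
            # Double the capacity, as a C parser might do with realloc.
            out = np.concatenate([out, np.empty(len(out), dtype=dtype)])
        fields = line.split(delimiter)
        out[n] = tuple(conv(f) for conv, f in zip(converters, fields))
        n += 1
    return out[:n]


text = "id,x,name\n1,2.5,abc\n2,3.5,de\n"
arr = read_structured(
    io.StringIO(text), [("id", "i8"), ("x", "f8"), ("name", "U8")]
)
```

A DataFrame-oriented wrapper could reuse the same parsing core and simply store each field as its own column instead of packing rows into one structured array.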
It seems clear to me that this work needs to be done at the
lowest level possible, probably all in C (or C++?) or maybe
Cython plus C utilities.
If anyone wants to get involved in this particular problem right
now, let me know!
best,
Wes