[Numpy-discussion] Possible roadmap addendum: building better text file readers

Thu Feb 23 14:32:13 EST 2012

dear all,

I haven't read all 180 e-mails, but I didn't see this on Travis's
initial list.

All of the existing flat file reading solutions I have seen are
not suitable for many applications, and they compare very unfavorably
to tools present in other languages, like R. Here are some of the
main issues I see:

- Memory usage: creating millions of Python objects when reading
  a large file results in horrendously bad memory utilization,
  and the Python interpreter is loath to return that memory to
  the operating system. Any solution using the csv module (like
  pandas's parsers, which are a lot faster than anything else I
  know of in Python) suffers from this problem because the data
  come out boxed in tuples of PyObjects. Try loading a 1,000,000
  x 20 CSV file into a structured array using np.genfromtxt or
  into a DataFrame using pandas.read_csv and you will immediately
  see the problem. R, by contrast, uses very little memory.

- Performance: post-processing of Python objects results in poor
  performance. Also, for the actual parsing, anything based on
  regular expressions (like the loadtable effort over the summer,
  all apologies to those who worked on it) is doomed to failure.
  I think having a tool with a high degree of compatibility and
  intelligence for parsing unruly small files does make sense,
  but it's not appropriate for large, well-behaved files.

- Need to "factorize": as soon as there is an enum dtype in
  NumPy, we will want to enable the file parsers for structured
  arrays and DataFrame to be able to "factorize" / convert to
  enum certain columns (for example, all string columns) during
  the parsing process, and not afterward. This is very important
  for enabling fast groupby on large datasets and reducing
  unnecessary memory usage up front (imagine a column with a
  million values, with only 10 unique values occurring). This
  would be trivial to implement using a C hash table
  implementation like khash.h.
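
To make the "factorize" idea concrete, here is a rough pure-Python
sketch, not the proposed C implementation. The function name
factorize and the sample column are made up for illustration, and a
Python dict stands in for a C hash table such as khash:

```python
from array import array


def factorize(values):
    """Map each value to a small integer code in a single pass.

    Returns (codes, uniques) such that uniques[codes[i]] == values[i].
    A dict plays the role that a C hash table would play in a real
    parser; this is only an illustration of the idea.
    """
    table = {}          # value -> integer code
    uniques = []        # code -> value, in order of first appearance
    codes = array("i")  # compact, unboxed C ints instead of PyObjects
    for value in values:
        code = table.get(value)
        if code is None:
            # First time we see this value: assign the next code.
            code = len(uniques)
            table[value] = code
            uniques.append(value)
        codes.append(code)
    return codes, uniques


# Hypothetical string column with few distinct values.
column = ["NY", "CA", "NY", "TX", "CA", "NY"]
codes, uniques = factorize(column)
print(list(codes), uniques)
```

Doing this while parsing means the full column of strings never has
to be held in memory; only the integer codes and the short list of
unique values are kept.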

To be clear: I'm going to do this eventually whether or not it
happens in NumPy because it's an existing problem for heavy
pandas users. I see no reason why the code can't emit structured
arrays, too, so we might as well have a common library component
that I can use in pandas and specialize to the DataFrame internal
structure.

It seems clear to me that this work needs to be done at the
lowest level possible, probably all in C (or C++?) or maybe
Cython plus C utilities.
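
A small way to see why the boxed-object route is so costly is the
following sketch, which uses only the standard library. The variable
names and the value of n are arbitrary, the exact sizes depend on the
interpreter and platform, and array("d") is just a stand-in for a
NumPy float64 column:

```python
import sys
from array import array

n = 100_000

# Boxed: a list of pointers, each pointing at a separate float object.
boxed = [float(i) for i in range(n)]

# Unboxed: one contiguous buffer of C doubles.
unboxed = array("d", boxed)

# Rough accounting: the list itself plus every float object it holds.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
unboxed_bytes = unboxed.itemsize * len(unboxed)

print("boxed:  ", boxed_bytes, "bytes")
print("unboxed:", unboxed_bytes, "bytes")
```

A parser written at the C level can write straight into buffers like
the second one and never create the per-value objects at all.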

If anyone wants to get involved in this particular problem right
now, let me know!

best,
Wes