[Numpy-discussion] Possible roadmap addendum: building better text file readers

Erin Sheldon erin.sheldon at gmail.com
Thu Feb 23 15:23:11 EST 2012


Wes -

I designed the recfile package to fill this need.  It might be a start.  

Some features: 

    - efficiently reads any subset of the data without loading the
      whole file.
    - reads directly into a recarray, so there is no conversion
      overhead.
    - object-oriented interface that mimics recarray slicing.
    - supports writing as well as reading.

Currently it supports fixed-width fields only.  It is written in C++,
but it wouldn't be hard to convert it to C if that is a requirement.
It works for both binary and ASCII files.
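
For illustration, reading a subset looks roughly like this (the dtype
and exact call signature here are illustrative, going by the
description above rather than the package docs):

    import numpy as np
    import recfile

    # Describe the fixed-width records (illustrative dtype).
    dtype = np.dtype([('id', 'i8'), ('x', 'f8'), ('y', 'f8')])

    # Open the file; the returned object mimics recarray slicing,
    # so only the requested rows are read from disk.
    robj = recfile.Open('data.dat', dtype=dtype)
    subset = robj[1000:2000]   # rows 1000-1999 as a recarray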

    http://code.google.com/p/recfile/

The trunk is pretty far ahead of the most recent release.

Erin Scott Sheldon

Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
> dear all,
> 
> I haven't read all 180 e-mails, but I didn't see this on Travis's
> initial list.
> 
> All of the existing flat file reading solutions I have seen are
> not suitable for many applications, and they compare very unfavorably
> to tools present in other languages, like R. Here are some of the
> main issues I see:
> 
> - Memory usage: creating millions of Python objects when reading
>   a large file results in horrendously bad memory utilization,
>   which the Python interpreter is loath to return to the
>   operating system. Any solution using the csv module (like
>   pandas's parsers, which are a lot faster than anything else I
>   know of in Python) suffers from this problem because the data
>   come out boxed in tuples of PyObjects. Try loading a 1,000,000
>   x 20 CSV file into a structured array using np.genfromtxt or
>   into a DataFrame using pandas.read_csv and you will immediately
>   see the problem (a reproduction is sketched after this list).
>   R, by contrast, uses very little memory.
> 
> - Performance: post-processing of Python objects results in poor
>   performance. Also, for the actual parsing, anything regular
>   expression based (like the loadtable effort over the summer,
>   all apologies to those who worked on it) is doomed to
>   failure. A tool with a high degree of compatibility and
>   intelligence does make sense for parsing unruly small files,
>   but it is not appropriate for large, well-behaved files.
> 
> - Need to "factorize": as soon as there is an enum dtype in
>   NumPy, we will want the file parsers for structured arrays and
>   DataFrames to be able to "factorize" certain columns (for
>   example, all string columns), converting them to enum during
>   the parsing process rather than afterward. This is very
>   important for enabling fast groupby on large datasets and for
>   reducing unnecessary memory usage up front (imagine a column
>   with a million values but only 10 unique values occurring).
>   This would be trivial to implement using a C hash table
>   implementation like khash.h (sketched below).
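
A sketch of the factorize step in pure Python (a dict stands in for
the khash.h string-to-int table a C implementation would use; all
names here are illustrative):

    def factorize(tokens):
        # Map each parsed string to a small integer code on the
        # fly, so only int codes plus the unique strings are kept.
        table = {}      # would be a khash str -> int map in C
        codes = []
        uniques = []
        for tok in tokens:
            code = table.get(tok)
            if code is None:
                code = len(uniques)
                table[tok] = code
                uniques.append(tok)
            codes.append(code)
        return codes, uniques

    # factorize(['a', 'b', 'a', 'a']) -> ([0, 1, 0, 0], ['a', 'b'])

For a million-row column with 10 distinct values, this stores a
million small integers plus 10 strings instead of a million boxed
PyObjects.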
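
And a minimal reproduction of the memory problem from the first
point (the file size matches the example in the message; watch the
process's resident memory while each reader runs):

    import numpy as np
    import pandas as pd

    # ~160 MB of float64 data written as a 1,000,000 x 20 CSV.
    np.savetxt('big.csv', np.random.randn(1000000, 20),
               delimiter=',')

    # Per the message, both readers go through Python-object
    # intermediaries, so peak memory climbs well past the ~160 MB
    # the data itself needs.
    arr = np.genfromtxt('big.csv', delimiter=',')
    df = pd.read_csv('big.csv', header=None)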
> 
> To be clear: I'm going to do this eventually whether or not it
> happens in NumPy because it's an existing problem for heavy
> pandas users. I see no reason why the code can't emit structured
> arrays, too, so we might as well have a common library component
> that I can use in pandas and specialize to the DataFrame internal
> structure.
> 
> It seems clear to me that this work needs to be done at the
> lowest level possible, probably all in C (or C++?) or maybe
> Cython plus C utilities.
> 
> If anyone wants to get involved in this particular problem right
> now, let me know!
> 
> best,
> Wes
-- 
Erin Scott Sheldon
Brookhaven National Laboratory


