[Numpy-discussion] Possible roadmap addendum: building better text file readers

Thu Feb 23 15:24:44 EST 2012

On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon <erin.sheldon at gmail.com> wrote:
> Wes -
>
> I designed the recfile package to fill this need.  It might be a start.
>
> Some features:
>
>    - the ability to efficiently read any subset of the data without
>      loading the whole file.
>    - reads directly into a recarray, so no overheads.
>    - object oriented interface, mimicking recarray slicing.
>    - also supports writing
>
> Currently it supports fixed-width fields only.  It is C++, but it
> wouldn't be hard to convert it to C if that is a requirement.  Also,
> it works for binary or ASCII.
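
To illustrate what reading a subset of fixed-width records without loading the whole file can look like, here is a minimal sketch using NumPy's own np.memmap with a structured dtype. This is not the recfile API; the record layout, field names, and file name are made up for the example, and it covers only the binary case.

```python
import os
import tempfile

import numpy as np

# Hypothetical fixed-width binary record layout (20 bytes per row).
dtype = np.dtype([("id", "<i4"), ("x", "<f8"), ("name", "S8")])

# Write a small example file so the sketch is self-contained.
rows = np.zeros(1000, dtype=dtype)
rows["id"] = np.arange(1000)
rows["x"] = np.arange(1000) * 0.5
rows["name"] = b"obj"
path = os.path.join(tempfile.mkdtemp(), "records.bin")
rows.tofile(path)

# Map the file instead of reading it; data is paged in only when touched.
mm = np.memmap(path, dtype=dtype, mode="r")

# Copy just rows 100..109 into an ordinary in-memory structured array.
subset = np.array(mm[100:110])

# Copy a single field from every 250th row.
ids = np.array(mm["id"][::250])

del mm  # release the mapping
```

Because every row has the same byte width, row i starts at offset i * dtype.itemsize, which is what makes this kind of random access cheap. Delimited text files do not have that property.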
>
>    http://code.google.com/p/recfile/
>
> the trunk is pretty far past the most recent release.
>
> Erin Scott Sheldon

Can you relicense as BSD-compatible?

> Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
>> dear all,
>>
>> I haven't read all 180 e-mails, but I didn't see this on Travis's
>> initial list.
>>
>> All of the existing flat file reading solutions I have seen are
>> not suitable for many applications, and they compare very unfavorably
>> to tools present in other languages, like R. Here are some of the
>> main issues I see:
>>
>> - Memory usage: creating millions of Python objects when reading
>>   a large file results in horrendously bad memory utilization, and
>>   the Python interpreter is loath to return that memory to the
>>   operating system. Any solution using the csv module (like
>>   pandas's parsers, which are a lot faster than anything else I
>>   know of in Python) suffers from this problem because the data
>>   come out boxed in tuples of PyObjects. Try loading a 1,000,000
>>   x 20 CSV file into a structured array using np.genfromtxt or
>>   into a DataFrame using pandas.read_csv and you will immediately
>>   see the problem. R, by contrast, uses very little memory.
>>
>> - Performance: post-processing of Python objects results in poor
>>   performance. Also, for the actual parsing, anything regular
>>   expression based (like the loadtable effort over the summer,
>>   all apologies to those who worked on it) is doomed to
>>   failure. I think having a tool with a high degree of
>>   compatibility and intelligence for parsing unruly small files
>>   does make sense, but it's not appropriate for large,
>>   well-behaved files.
>>
>> - Need to "factorize": as soon as there is an enum dtype in
>>   NumPy, we will want to enable the file parsers for structured
>>   arrays and DataFrame to be able to "factorize" / convert to
>>   enum certain columns (for example, all string columns) during
>>   the parsing process, and not afterward. This is very important
>>   for enabling fast groupby on large datasets and reducing
>>   unnecessary memory usage up front (imagine a column with a
>>   million values, with only 10 unique values occurring). This
>>   would be trivial to implement using a C hash table library
>>   like khash.h.
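
A rough sketch of the "factorize" idea from the last bullet, written in pure Python for readability rather than in C. A dict stands in for the C hash table (khash or similar), and the function name and sample column are hypothetical. A real parser would do this per column while tokenizing, not on an already-built list.

```python
import numpy as np


def factorize_column(values):
    """Map each distinct value to a small integer code, in first-seen order."""
    table = {}  # stand-in for a C hash table: value -> integer code
    codes = np.empty(len(values), dtype=np.int32)
    for i, value in enumerate(values):
        code = table.get(value)
        if code is None:
            # First time this value is seen: assign the next free code.
            code = len(table)
            table[value] = code
        codes[i] = code
    # Relies on dicts preserving insertion order (Python 3.7+).
    uniques = list(table)
    return codes, uniques


column = ["ny", "sf", "ny", "la", "sf", "ny"]
codes, uniques = factorize_column(column)
```

For a column with a million rows and 10 distinct strings, the result is one int32 array of a million codes plus 10 strings, instead of a million separate string objects.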
>>
>> To be clear: I'm going to do this eventually whether or not it
>> happens in NumPy because it's an existing problem for heavy
>> pandas users. I see no reason why the code can't emit structured
>> arrays, too, so we might as well have a common library component
>> that I can use in pandas and specialize to the DataFrame internal
>> structure.
>>
>> It seems clear to me that this work needs to be done at the
>> lowest level possible, probably all in C (or C++?) or maybe
>> Cython plus C utilities.
>>
>> If anyone wants to get involved in this particular problem right
>> now, let me know!
>>
>> best,
>> Wes
> --
> Erin Scott Sheldon
> Brookhaven National Laboratory