[Numpy-discussion] Possible roadmap addendum: building better text file readers
Wes McKinney
wesmckinn at gmail.com
Thu Feb 23 15:24:44 EST 2012
On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon <erin.sheldon at gmail.com> wrote:
> Wes -
>
> I designed the recfile package to fill this need. It might be a start.
>
> Some features:
>
> - the ability to efficiently read any subset of the data without
> loading the whole file.
> - reads directly into a recarray, so no overheads.
> - object-oriented interface, mimicking recarray slicing.
> - also supports writing
>
> Currently it is fixed-width fields only. It is C++, but wouldn't be
> hard to convert it to C if that is a requirement. Also, it works for
> binary or ascii.
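
A minimal sketch of why fixed-width records make subset reads cheap: the byte offset of row i is simply i times the record size, so a reader can seek straight to it without scanning the file. This is not recfile's API. It uses only the Python standard library, and the record layout (one int32 plus one float64) and the helper name read_rows are assumptions made up for illustration.

```python
import os
import struct
import tempfile

# Hypothetical fixed-width binary record: int32 id + float64 value.
# "<" selects little-endian with no padding, so each record is 12 bytes.
RECORD = struct.Struct("<id")

def read_rows(path, rows):
    """Read only the requested row indices by seeking to each record."""
    out = []
    with open(path, "rb") as f:
        for i in rows:
            f.seek(i * RECORD.size)  # offset is computable: rows are fixed width
            out.append(RECORD.unpack(f.read(RECORD.size)))
    return out

# Write a small demo file of 1000 records, then read three of them back.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(1000):
        tmp.write(RECORD.pack(i, i * 0.5))
    demo_path = tmp.name

subset = read_rows(demo_path, [0, 10, 999])
os.remove(demo_path)
```

Only three 12-byte reads touch the file here; the other 997 records are never loaded.
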
>
> http://code.google.com/p/recfile/
>
> the trunk is pretty far past the most recent release.
>
> Erin Scott Sheldon
Can you relicense as BSD-compatible?
> Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
>> dear all,
>>
>> I haven't read all 180 e-mails, but I didn't see this on Travis's
>> initial list.
>>
>> All of the existing flat file reading solutions I have seen are
>> not suitable for many applications, and they compare very unfavorably
>> to tools present in other languages, like R. Here are some of the
>> main issues I see:
>>
>> - Memory usage: creating millions of Python objects when reading
>> a large file results in horrendously bad memory utilization,
>> which the Python interpreter is loath to return to the
>> operating system. Any solution using the CSV module (like
>> pandas's parsers, which are a lot faster than anything else I
>> know of in Python) suffers from this problem because the data
>> come out boxed in tuples of PyObjects. Try loading a 1,000,000
>> x 20 CSV file into a structured array using np.genfromtxt or
>> into a DataFrame using pandas.read_csv and you will immediately
>> see the problem. R, by contrast, uses very little memory.
>>
>> - Performance: post-processing of Python objects results in poor
>> performance. Also, for the actual parsing, anything
>> regular-expression-based (like the loadtable effort over the
>> summer, all apologies to those who worked on it) is doomed to
>> failure. I think a tool with a high degree of compatibility and
>> intelligence for parsing unruly small files does make sense,
>> but it's not appropriate for large, well-behaved files.
>>
>> - Need to "factorize": as soon as there is an enum dtype in
>> NumPy, we will want to enable the file parsers for structured
>> arrays and DataFrame to be able to "factorize" / convert to
>> enum certain columns (for example, all string columns) during
>> the parsing process, and not afterward. This is very important
>> for enabling fast groupby on large datasets and reducing
>> unnecessary memory usage up front (imagine a column with a
>> million values, with only 10 unique values occurring). This
>> would be trivial to implement using a C hash table
>> implementation like khash.h
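
To make the "factorize" idea concrete, here is a rough sketch in pure Python, not the C/khash implementation proposed above. A dict stands in for the hash table, and the codes go into a compact array.array of 64-bit integers rather than a list of boxed objects. The function name and return shape are assumptions for illustration only.

```python
from array import array

def factorize(values):
    """Map each value to a small integer code, in order of first appearance.

    Returns (codes, uniques): codes[i] is the index into uniques for values[i].
    """
    table = {}          # value -> code; stands in for a C hash table
    codes = array("q")  # unboxed 64-bit integers, 8 bytes per entry
    for v in values:
        # setdefault assigns the next free code the first time v is seen
        codes.append(table.setdefault(v, len(table)))
    uniques = list(table)  # dicts preserve insertion order (Python 3.7+)
    return codes, uniques

codes, uniques = factorize(["a", "b", "a", "c", "b"])
```

A column of a million strings with only 10 distinct values would then be stored as a million small integers plus 10 strings, and a groupby can operate on the integer codes alone.
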
>>
>> To be clear: I'm going to do this eventually whether or not it
>> happens in NumPy because it's an existing problem for heavy
>> pandas users. I see no reason why the code can't emit structured
>> arrays, too, so we might as well have a common library component
>> that I can use in pandas and specialize to the DataFrame internal
>> structure.
>>
>> It seems clear to me that this work needs to be done at the
>> lowest level possible, probably all in C (or C++?) or maybe
>> Cython plus C utilities.
>>
>> If anyone wants to get involved in this particular problem right
>> now, let me know!
>>
>> best,
>> Wes
> --
> Erin Scott Sheldon
> Brookhaven National Laboratory