[Numpy-discussion] Possible roadmap addendum: building better text file readers
Erin Sheldon
erin.sheldon at gmail.com
Thu Feb 23 15:23:11 EST 2012
Wes -
I designed the recfile package to fill this need. It might be a start.
Some features:
- the ability to efficiently read any subset of the data without
loading the whole file.
- reads directly into a recarray, so there is no intermediate
Python-object overhead.
- object-oriented interface, mimicking recarray slicing.
- also supports writing
Currently it supports fixed-width fields only. It is written in
C++, but it wouldn't be hard to convert to C if that is a
requirement. It works for both binary and ASCII files.
http://code.google.com/p/recfile/
The trunk is pretty far ahead of the most recent release.
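For a feel of the interface described above, here is a
hypothetical usage sketch in Python; the constructor and slicing
below are illustrative of the listed features only, not recfile's
exact API (see the project page for that):

    import numpy as np
    import recfile

    # Layout of one fixed-width record in the file.
    dtype = np.dtype([('id', 'i8'), ('x', 'f8'), ('name', 'S16')])

    # Hypothetical entry point; recfile's real constructor may differ.
    rf = recfile.Open('data.rec', dtype=dtype)

    # Read a subset of columns and rows without touching the rest
    # of the file; the result comes back as a recarray.
    sub = rf[['id', 'x']][1000:2000]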
Erin Scott Sheldon
Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
> dear all,
>
> I haven't read all 180 e-mails, but I didn't see this on Travis's
> initial list.
>
> All of the existing flat file reading solutions I have seen are
> not suitable for many applications, and they compare very unfavorably
> to tools present in other languages, like R. Here are some of the
> main issues I see:
>
> - Memory usage: creating millions of Python objects when reading
> a large file results in horrendously bad memory utilization,
> which the Python interpreter is loath to return to the
> operating system. Any solution using the CSV module (like
> pandas's parsers, which are a lot faster than anything else I
> know of in Python) suffers from this problem because the data
> come out boxed in tuples of PyObjects. Try loading a 1,000,000
> x 20 CSV file into a structured array using np.genfromtxt or
> into a DataFrame using pandas.read_csv and you will immediately
> see the problem (a sketch reproducing it follows this list). R,
> by contrast, uses very little memory.
>
> - Performance: post-processing of Python objects results in poor
> performance. Also, for the actual parsing, anything regular
> expression based (like the loadtable effort over the summer,
> all apologies to those who worked on it) is doomed to
> failure. A tool with a high degree of compatibility and
> intelligence for parsing unruly small files does make sense,
> but that approach is not appropriate for large, well-behaved
> files.
>
> - Need to "factorize": as soon as there is an enum dtype in
> NumPy, we will want the file parsers for structured arrays
> and DataFrame to be able to "factorize" (convert to enum)
> certain columns (for example, all string columns) during
> the parsing process, and not afterward. This is very
> important for enabling fast groupby on large datasets and
> for reducing unnecessary memory usage up front (imagine a
> column with a million values but only 10 unique values).
> This would be trivial to implement using a C hash table
> implementation like khash.h (a sketch of the idea follows
> this list).
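A minimal Python sketch of the "factorize" contract described in
the last bullet; the proposal is a one-pass C hash table
(khash.h), and np.unique below merely shows the same inputs and
outputs:

    import numpy as np

    # A large column with few distinct values.
    col = np.array(['red', 'blue', 'red', 'green', 'blue', 'red'])

    # Replace strings with small integer codes plus a table of
    # unique values; this is what enum conversion would yield.
    uniques, codes = np.unique(col, return_inverse=True)
    # uniques: ['blue' 'green' 'red']    codes: [2 0 2 1 0 2]

A khash.h-based version would avoid np.unique's sort, assign
codes in order of first appearance, and run in O(n) during the
parse itself.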
>
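Similarly, a rough way to reproduce the memory problem from the
first bullet (the file name and sizes are only for illustration):

    import numpy as np
    import pandas as pd

    # Build a 1,000,000 x 20 CSV of random floats (several
    # hundred MB of text).
    np.savetxt('big.csv', np.random.randn(1000000, 20),
               delimiter=',')

    # Watch the interpreter's resident memory (e.g. with top)
    # during each call; peak usage far exceeds the ~160 MB the
    # final float64 data requires, because values are boxed as
    # Python objects while parsing.
    arr = np.genfromtxt('big.csv', delimiter=',')
    df = pd.read_csv('big.csv', header=None)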
> To be clear: I'm going to do this eventually whether or not it
> happens in NumPy because it's an existing problem for heavy
> pandas users. I see no reason why the code can't emit structured
> arrays, too, so we might as well have a common library component
> that I can use in pandas and specialize to the DataFrame internal
> structure.
>
> It seems clear to me that this work needs to be done at the
> lowest level possible, probably all in C (or C++?) or maybe
> Cython plus C utilities.
>
> If anyone wants to get involved in this particular problem right
> now, let me know!
>
> best,
> Wes
--
Erin Scott Sheldon
Brookhaven National Laboratory