[Numpy-discussion] Possible roadmap addendum: building better text file readers

Thu Feb 23 15:15:54 EST 2012

On Thu, Feb 23, 2012 at 3:08 PM, Travis Oliphant <travis at continuum.io> wrote:
> This is actually on my short-list as well --- it just didn't make it to the list.
>
> In fact, we have someone starting work on it this week.  It is his first project so it will take him a little time to get up to speed on it, but he will contact Wes and work with him and report progress to this list.
>
> Integration with np.loadtxt is a high priority.  I think loadtxt is now the 3rd or 4th "text-reading" interface I've seen in NumPy.  I have no interest in making a new one if we can avoid it.  But we do need to make it faster, with less memory overhead, for simple cases like the ones Wes describes.
>
> -Travis

Yeah, what I envision is just an infrastructural "parsing engine" to
replace the pure Python guts of np.loadtxt and np.genfromtxt, and the
csv module + Cython guts of pandas.read_{csv, table, excel}. It needs
to be somewhat adaptable to the domain-specific decisions of
structured arrays vs. DataFrames. For example, I use Python objects
for strings, and one consequence is that I can "intern" strings (only
one PyObject per unique string value) where structured arrays cannot,
so you get much better performance and memory usage that way. I gather
that's soon to change, though, at which point I'll almost definitely
(!) move to pointer arrays instead of dtype=object arrays.
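
To make the interning point concrete, here is a minimal sketch, not the
actual pandas code. The helper name intern_column and the sample values
are made up for illustration; the only assumption is that the parser
hands back a fresh str object for every field it reads.

```python
import numpy as np

def intern_column(tokens):
    """Return a dtype=object array sharing one str object per unique value.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    seen = {}  # string value -> the one str object kept for that value
    out = np.empty(len(tokens), dtype=object)
    for i, tok in enumerate(tokens):
        # setdefault returns the first object stored for an equal string,
        # so later duplicates point at it instead of keeping their own copy.
        out[i] = seen.setdefault(tok, tok)
    return out, len(seen)

# Decoding raw bytes mimics a parser creating a new str for each field.
fields = [b"AAPL", b"GOOG", b"AAPL", b"AAPL", b"GOOG"]
tokens = [f.decode("ascii") for f in fields]
col, n_unique = intern_column(tokens)
```

With a column that has few unique values, the array ends up holding
many pointers to a handful of PyObjects, which is where the memory
saving comes from. A fixed-width string field in a structured array
stores the characters inline in every row, so it cannot share them.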

- Wes

>
>
> On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote:
>
>> Hi,
>>
>> 23.02.2012 20:32, Wes McKinney kirjoitti:
>> [clip]
>>> To be clear: I'm going to do this eventually whether or not it
>>> happens in NumPy because it's an existing problem for heavy
>>> pandas users. I see no reason why the code can't emit structured
>>> arrays, too, so we might as well have a common library component
>>> that I can use in pandas and specialize to the DataFrame internal
>>> structure.
>>
>> If you do this, one useful aim could be to design the code such that it
>> can be used in loadtxt, at least as a fast path for common cases. I'd
>> really like to avoid increasing the number of APIs for text file loading.
>>
>> --
>> Pauli Virtanen
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion