[Numpy-discussion] genloadtxt : last call

Jarrod Millman millman at berkeley.edu
Tue Dec 9 04:34:29 EST 2008


On Fri, Dec 5, 2008 at 3:59 PM, Pierre GM <pgmdevlist at gmail.com> wrote:
> All,
> Here's the latest version of genloadtxt, with some recent corrections. With
> just a couple of tweaks, we end up with decent speed: it's still
> slower than np.loadtxt, but only by 15% according to the test at the end of
> the package.
>
> And so, now what?  Should I put the module in numpy.lib.io?  Elsewhere?

Thanks for working on this.  I think that having simple, easy-to-use,
flexible, and fast IO code is extremely important; so I really
appreciate this work.

I have a few general comments about the IO code and where I would like
to see it going:

Where should IO code go?
------------------------------------

From the user's perspective, I would like all the NumPy IO code to be
in the same place in NumPy; and all the SciPy IO code to be in the
same place in SciPy.  So, for instance, the user shouldn't get
`mloadtxt` from `numpy.ma.io`.  Another way of saying this is that in
IPython, I should be able to see all NumPy IO functions by
tab-completing once.

Slightly less important to me is that I would like to be able to do:
  from numpy import io as npio
  from scipy import io as spio

What is the difference between NumPy and SciPy IO?
------------------------------------------------------------------------

It was decided last year that numpy io should provide simple, generic,
core io functionality, while scipy io would provide more domain- or
application-specific io code (e.g., Matlab IO, WAV IO, etc.).  My
vision for scipy io, which I know isn't shared by everyone, is for it
to be more or less all-inclusive (e.g., all image, sound, and data
formats).  (That is a different discussion; I just wanted to be clear
where I stand.)

For numpy io, it should include (see the short sketch after this list):
 - generic helper routines for data io (i.e., datasource, etc.)
 - a standard, supported binary format (i.e., npy/npz)
 - generic ascii file support (i.e., loadtxt, etc.)
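
To make the second and third items concrete, here is a minimal sketch
using the pieces numpy already ships (save/savez/load for the npy/npz
format, savetxt/loadtxt for text); the file names are just
placeholders, and genloadtxt would slot in alongside the text readers:

  import numpy as np

  a = np.arange(12).reshape(3, 4)

  # standard binary format: one array per .npy file, several per .npz
  np.save('a.npy', a)
  np.savez('arrays.npz', a=a, b=2 * a)
  arch = np.load('arrays.npz')      # dict-like access to arch['a'], arch['b']

  # generic ascii support: savetxt/loadtxt round-trip a 2-D array
  np.savetxt('a.txt', a, fmt='%d', delimiter=',')
  a2 = np.loadtxt('a.txt', delimiter=',')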

What about AstroAsciiData?
-------------------------------------

I sent an email asking about AstroAsciiData last week.  The only
response I got was from Manuel Metz saying that he was switching to
AstroAsciiData since it did exactly what he needed.  In my mind, I
would prefer that numpy io had the best ascii data handling.  So I
wonder if it would make sense to incorporate AstroAsciiData?

As far as I know, it is pure Python with a BSD license.  Maybe the
authors would be willing to help integrate the code and continue
maintaining it in numpy.  If others are supportive of this general
approach, I would be happy to approach them.  It is possible that we
won't want all their functionality, but it would be good to avoid
duplicating effort.

I realize that this may not be persuasive to everyone, but I really
feel that IO code is special and that it is an area where numpy/scipy
should devote some effort to consolidating the community around some
standard packages and approaches.

What about datasource?
--------------------------------

On a related note, I wanted to point out datasource.  DataSource is a
file interface for handling local and remote data files:
http://projects.scipy.org/scipy/numpy/browser/trunk/numpy/lib/_datasource.py

It was originally developed by Jonathan Taylor and then modified by
Brian Hawthorne and Chris Burns.  It is fairly well-documented and
tested, so it would be easier to take a look at it than for me to
re-explain it here.  The basic idea is to have a drop-in replacement
for file handling, which would abstract away whether the file was
remote or local, compressed or not, etc.  The hope was that it would
allow us to simplify support for remote file access and handling
compressed files by merely using a datasource instead of a filename:
  def loadtxt(fname, ...)
vs.
  def loadtxt(datasource, ...)
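
For anyone who hasn't looked at the module, here is a rough sketch of
how the drop-in idea plays out with the current DataSource class
(based on a quick read of the file linked above; the URL is just a
placeholder, and the exact caching behaviour is worth double-checking
in the source):

  import numpy as np
  from numpy.lib._datasource import DataSource

  ds = DataSource()   # remote files are downloaded and cached under destpath
                      # (defaults to the current directory)
  # open() returns a file-like object whether the path is a plain local
  # file, a gzip/bz2-compressed file, or a remote URL:
  fh = ds.open('http://example.com/data/table.txt.gz')
  data = np.loadtxt(fh)
  fh.close()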

I would appreciate hearing whether this seems doable or useful.
Should we remove datasource?  Start using it more?  Does it need to be
slightly or dramatically improved/overhauled?  Renamed `datafile` or
paired with a `datadestination`?  Support
versioning/checksumming/provenance tracking (a tad ambitious;))?  Is
anyone interested in picking up where we left off and improving it?

Thoughts? Suggestions?

Documentation
---------------------

The main reason that I am so interested in the IO code is that it
seems like it is one of the first areas that users will look.  ("I
have heard about this Python for scientific programming thing and I
wonder what all the fuss is about?  Let me try NumPy; this seems
pretty good.  Now let's see how to load in some of my data....")

I just took a quick look through the documentation; I couldn't find
any IO coverage in the User Guide, and this is the main IO page in the
reference manual:
  http://docs.scipy.org/doc/numpy/reference/routines.io.html

I would like to see a section on data IO in the user guide and have a
more prominent mention of IO code in the reference manual (i.e.,
http://docs.scipy.org/doc/numpy/reference/io.html ?).

Unfortunately, I don't have time to help out, but since it looks like
there has been some recent activity in this area I thought I'd mention
it.

As always--thanks to everyone who is actually putting in hard work!
Sorry I am not offering to actually help out here, but I hope that
someone will be interested and able to pursue some of these issues.

Thanks again,
Jarrod


On Thu, Dec 4, 2008 at 3:41 PM, Jarrod Millman <millman at berkeley.edu> wrote:
> I am not familiar with this, but it looks quite useful:
> http://www.stecf.org/software/PYTHONtools/astroasciidata/
> or (http://www.scipy.org/AstroAsciiData)
>
> "Within the AstroAsciiData project we envision a module which can be
> used to work on all kinds of ASCII tables. The module provides a
> convenient tool such that the user easily can:
>
>    * read in ASCII tables;
>    * manipulate table elements;
>    * save the modified ASCII table;
>    * read and write meta data such as column names and units;
>    * combine several tables;
>    * delete/add rows and columns;
>    * manage metadata in the table headers."
>
> Is anyone familiar with this package?  Would it make sense to
> investigate including this or adopting some of its interface/features?


