[Numpy-discussion] Fast Reading of ASCII files

Chris Barker chris.barker at noaa.gov
Tue Dec 13 13:08:44 EST 2011


NOTE:

Let's keep this on the list.

On Tue, Dec 13, 2011 at 9:19 AM, denis <denis-bz-gg at t-online.de> wrote:

> Chris,
>  unified, consistent save / load is a nice goal
>
> 1) header lines with date, pwd etc.: "where'd this come from ?"
>
>    # (5, 5)  svm.py  bz/py/ml/svm  2011-12-13 Dec 11:56  -- automatic
>    # 80.6 % correct -- user info
>      245    39     4     5    26
>    ...
>
I'm not sure I understand what you are expecting here: What would be
automatic? If it parses a datetime in the header, what would it do with it?
But anyway, this seems to me:
  - very application specific -- this is for the user's code to write
  - not what we are talking about at this point anyway -- I think this
discussion is about a lower-level, does-the-simple-things-fast reader --
that may or may not be able to form the basis of a higher-level,
fuller-featured reader.


> 2) read any CSVs: comma or blank-delimited, with/without column names,
>    a la loadcsv() below
>

yup -- though the column name reading would be part of a higher-level
reader as far as I'm concerned.
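
As a rough illustration of the lower-level piece, here is a minimal
pure-Python sketch of a reader that accepts comma- or blank-delimited
numeric rows and skips comment lines. The name read_numeric_table and its
comments parameter are hypothetical, not an existing numpy API, and column
names are deliberately left to a higher-level reader.

```python
import numpy as np


def read_numeric_table(lines, comments="#"):
    """Parse comma- or whitespace-delimited numeric rows into a 2-D array.

    Hypothetical helper for illustration only: no column names, no
    missing values, every field must be a valid float.
    """
    rows = []
    for line in lines:
        # Drop anything after the comment character, then surrounding blanks.
        line = line.split(comments, 1)[0].strip()
        if not line:
            continue
        # Treat commas as whitespace so both delimiter styles work.
        rows.append([float(tok) for tok in line.replace(",", " ").split()])
    return np.array(rows, dtype=float)
```

Any iterable of strings works as input, including an open text file. A loop
like this is the kind of simple case that could later be moved to C or
Cython without touching the higher-level features.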


> 3) sparse or masked arrays ?
>

Sparse probably not, that seems pretty domain dependent to me -- though
hopefully one could build such a thing on top of the lower-level reader.
Masked support would be good -- once we're convinced what the future of
masked arrays is in numpy. I was thinking that the masked array issue
would really be a higher-level feature -- it certainly could be if you need
to mask "special value" style files (e.g. 9999), but we may have to build
it into the lower-level reader for cases where the mask is specified by
non-numerical values -- e.g. there are some met files that use "MM" or some
other text, so you can't put it into a numerical array first.
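
The difference between the two masking cases can be sketched as follows.
This is only an illustration under stated assumptions: the function name
read_masked, the text_markers and sentinel parameters, and the choice of
0.0 as a placeholder are all hypothetical. A numeric sentinel such as 9999
could be masked after the array exists, but a text marker such as "MM" has
to be caught while parsing, because float("MM") fails.

```python
import numpy as np


def read_masked(lines, text_markers=("MM",), sentinel=9999.0):
    """Parse whitespace-delimited rows into a numpy masked array.

    Hypothetical sketch: text markers are masked at parse time, the
    numeric sentinel is masked by comparing the parsed value.
    """
    data, mask = [], []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        row, row_mask = [], []
        for tok in tokens:
            if tok in text_markers:
                # Cannot be converted to a number: store a placeholder.
                row.append(0.0)
                row_mask.append(True)
            else:
                value = float(tok)
                row.append(value)
                row_mask.append(value == sentinel)
        data.append(row)
        mask.append(row_mask)
    return np.ma.masked_array(np.array(data, dtype=float),
                              mask=np.array(mask, dtype=bool))
```

For example, read_masked(["1.0 MM 3.0", "9999 5.0 6.0"]) masks the second
entry of the first row and the first entry of the second row.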

>
> Longterm wishes: beyond the scope of one file <-> one array
> but essential for larger projects:
> 1) dicts / dotdicts:
>    Dotdict( A=anysizearray, N=scalar ... ) <-> a directory of little
> files
>    is easy, better than np.savez
>    (Haven't used hdf5, I believe Matlabv7  does.)
>
> 2) workflows: has anyone there used visTrails ?
>

outside of the scope of this thread...

>
> Anyway it seems to me (old grey cynic) that Numpy/scipy developers
> prefer to code first, spec and doc later. Too pessimistic ?
>
>
Well, I think many of us believe in a more agile-style approach --
incremental development. But as an open source project, it's really
about scratching an itch -- so there is usually a spec in mind for the itch
at hand. In this case, however, that has been a weakness -- clearly a
number of us have written small solutions to our particular problem at hand,
but we haven't arrived at a more general-purpose solution yet. So a bit
of spec-ing ahead of time may be called for.

On that:

I've been thinking from the bottom up -- imagining what I need for the simple
case, and how it might apply to more complex cases -- but maybe we should
think about this another way:

What we're talking about here is really about core software engineering --
optimization. It's easy to write a simple pure-Python file parser, and
reasonable to write a complex one (genfromtxt) -- the issue is performance
-- we need some more C (or Cython) code to really speed it up, but none of
us wants to write the complex-case code in C. So:

genfromtxt is really nice for many of the complex cases. So perhaps
another approach is to look at genfromtxt, and see what
high performance lower-level functionality we could develop that could make
it fast -- then we are done.

This actually mirrors exactly what we all usually recommend for Python
development in general -- write it in Python, then, if it's really not fast
enough, write the bottleneck in C.

So where are the bottlenecks in genfromtxt? Are there self-contained
portions that could be re-written in C/Cython?
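
One way to start answering that question is to profile genfromtxt on
synthetic data with the standard-library cProfile module. The sketch below
is an assumption about how one might measure it, not a result: the helper
name profile_genfromtxt and its parameters are hypothetical, and where the
time actually goes will depend on the numpy version and the input.

```python
import cProfile
import io
import pstats

import numpy as np


def profile_genfromtxt(n_rows=2000, n_cols=5, top=10):
    """Profile np.genfromtxt on a synthetic comma-delimited table.

    Returns the parsed array and a text report of the `top` entries
    sorted by cumulative time.
    """
    # Build an in-memory CSV so disk I/O does not distort the timings.
    text = "\n".join(
        ",".join(str(float(i * n_cols + j)) for j in range(n_cols))
        for i in range(n_rows)
    )
    profiler = cProfile.Profile()
    profiler.enable()
    arr = np.genfromtxt(io.StringIO(text), delimiter=",")
    profiler.disable()

    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(top)
    return arr, report.getvalue()
```

Printing the returned report lists the functions with the largest
cumulative time, which is a reasonable first list of candidates to
consider for C or Cython.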

-Chris






-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov