[Numpy-discussion] fromfile() improvements (was: planning for numpy 1.3.0 release)
Christopher Barker
Chris.Barker at noaa.gov
Wed Sep 10 01:44:38 EDT 2008
Stéfan van der Walt wrote:
> 2008/9/9 Christopher Barker <Chris.Barker at noaa.gov>:
>> Anyone want to help with improvements to fromfile() for text files?
>
> This is low hanging fruit for anyone with some experience in C. We
> can definitely get it done for 1.3. Chris, would you file a ticket
> and add the detail from your mailing list posts, if that hasn't
> already been done?
Done:
http://scipy.org/scipy/numpy/ticket/909
(By the way, is there a way to fix the typo in the ticket title? Oops!)
There are a few fromfile() related tickets that I referenced as well.
It's not totally straightforward what should be done, so I've included
the text of the ticket here to start a discussion:
Proposed Enhancements and bug fixes for fromfile() and fromstring() text
handling:
Motivation:
The goal of the fromfile() text file handling capability is to enable
users to write code that can read a lot of numbers from a text file into
an array. Python provides a lot of nifty text processing capabilities,
and there are a number of higher level facilities for reading blocks of
data (including numpy.loadtxt). These are very capable, but there really
is a significant performance hit, at least when loading tens of thousands
of numbers from a file.
We don't want to write all of loadtxt() and friends in C. Rather, the
goal is to allow the simple cases to be done very efficiently, and
hopefully fancier text reading packages can build on it to add more
features.
Unfortunately, the current version (numpy 1.2) has a few bugs
and limitations that keep it from being nearly as useful as it could be.
Possible features:
* Create fromtextfile() and fromtextstring() functions, distinct from
fromfile() and fromstring(). Reading text really is different functionality.
fromfile() could still call fromtextfile() for backward compatibility.
* Allow more than one separator? For example, a comma or
whitespace? In the general case, the user could perhaps specify any
number of separators, though I doubt that would be useful in practice.
At the very least, however, fromtextfile() should support reading files
that look like:
43.5, 345.6, 123.456, 234.33
34.5, 22.57, 2345, 2345, 252
...
That is, comma separated, but being able to read multiple lines in one shot.
The easiest way to support that would probably be to always allow
whitespace as a separator, and add the one passed in. I can't think of a
reason not to do this, but maybe I'm not very imaginative.
* Allow the user to specify a shape for the output array. There may
be little point, as all this does is save a call to reshape(), but it
may be another way to support the above. i.e. you could read that data
with:
a = np.fromtextfile(infile, dtype=np.float, sep=',', shape=(-1, 4))
Then it would know to skip the newlines every 4 elements.
* Allow the user to specify a comment string. The reader would then
skip everything in the file between the comment string and a newline.
Maybe Universal newline -- any of \r, \n or \r\n. Or simply expect that
the user has opened the file with mode 'U' if they want that. This could
also be extended to support C-style comments with an opening and closing
character sequence, but that's a lot less common.
* Allow the user to specify a locale. It may be best to be able to
specify a locale, rather than relying on the system one (whether '.' or
',' is the decimal separator, for instance). (ticket #884)
* Parsing of "Inf" and the like that doesn't depend on the system
(ticket #510). This would be nice, but maybe too difficult -- would we
need to write our own scanf?
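To make the separator, comment, and shape ideas above concrete, here is a
rough pure-Python sketch of the semantics described. The name
read_numbers() and its parameters are made up for illustration; it is not
an existing numpy function, and a real version would presumably be written
in C and return an array rather than lists. It assumes that empty fields
(e.g. "1,,2") are simply skipped, which is a choice that would need
discussion.

```python
def read_numbers(text, sep=",", comments=None, shape=None):
    """Hypothetical reference for the proposed text-reading semantics.

    Whitespace (including newlines) always separates values; `sep` is
    accepted as an additional separator. Returns a flat list of floats,
    or a list of rows when a 2-d `shape` is given.
    """
    values = []
    for line in text.splitlines():  # splitlines() handles \r, \n and \r\n
        if comments is not None:
            # Drop everything from the comment string to the end of the line.
            line = line.split(comments, 1)[0]
        if sep:
            # Treat the extra separator exactly like whitespace.
            line = line.replace(sep, " ")
        values.extend(float(token) for token in line.split())

    if shape is None:
        return values

    rows, cols = shape
    if cols <= 0:
        raise ValueError("number of columns must be positive")
    if rows == -1:
        # Infer the number of rows, as reshape() does for -1.
        rows = len(values) // cols
    if rows * cols != len(values):
        raise ValueError(
            "cannot arrange %d values as %r" % (len(values), (rows, cols))
        )
    return [values[i * cols:(i + 1) * cols] for i in range(rows)]
```

With this, read_numbers(data, comments="#", shape=(-1, 4)) would read a
comma-separated, multi-line block in one call, and something like it could
double as a reference implementation when writing tests for the C code.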
Bugs to be fixed:
* fromfile() and fromstring() handling malformed data poorly: ticket
#883
* Any others?
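For the malformed-data case, one possible behaviour is to refuse to guess
and raise an error that names the offending token. This is only an
assumption about what "handling it well" should mean; the ticket may settle
on something else. A pure-Python sketch, using the hypothetical name
parse_strict():

```python
def parse_strict(text, sep=","):
    """Hypothetical strict parser: raise on the first token that is not a number."""
    values = []
    # Whitespace always separates values; `sep` is an additional separator.
    tokens = text.replace(sep, " ").split()
    for index, token in enumerate(tokens):
        try:
            values.append(float(token))
        except ValueError:
            # Report which token failed instead of returning partial data.
            raise ValueError("cannot parse token %d: %r" % (index, token))
    return values
```

The alternative would be to return whatever was read up to the bad token,
but then the caller has no way to tell a short file from a broken one.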
NOTE: my C is pretty lame, or I'd do some of this. I could help out with
writing tests, etc. though.
Thanks all,
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov