[Numpy-discussion] fromfile() for reading text (one more time!)
Christopher Barker
Chris.Barker at noaa.gov
Thu Jan 7 15:08:23 EST 2010
Pauli Virtanen wrote:
> Mon, 2010-01-04 at 17:05 -0800, Christopher Barker wrote:
>> it also does odd things with spaces
>> embedded in the separator:
>>
>> ", $ #" matches all of: ",$#" ", $#" ",$ #"
> That's a documented feature:
Fair enough.
OK, I've written a patch that allows newlines to be interpreted as
separators in addition to whatever is specified in sep.
In the process of testing, I ran into these issues again; they are still
marked as "needs decision":
http://projects.scipy.org/numpy/ticket/883
In short: what to do with missing values?
I'd like to address this bug, but I need a decision to do so.
My proposal:
Raise a ValueError with missing values.
Justification:
No function should EVER return data that is not there. Period. It is
simply asking for hard-to-find bugs. Therefore:
fromstring("3, 4,,5", sep=",")
Should never, ever, return:
array([ 3., 4., 0., 5.])
Which is what it does now. bad. bad. bad.
Alternatives:
A) Raising a ValueError is the easiest way to get "proper" behavior.
Folks can use a more sophisticated file reader if they want missing
values handled. I'm willing to contribute this patch.
B) If the dtype is a floating point type, NaN could fill in the
missing values -- a fine idea, but you can't use it for integers, and
zero is a really bad replacement!
C) The user could specify what they want filled in for missing
values. This is a fine idea, though I'm not sure I want to take the time
to implement it.
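To make the three alternatives concrete, here is a minimal pure-Python sketch of the behavior being proposed. It is not the C patch and not NumPy API: `parse_strict` and its `missing` argument are hypothetical names used only for illustration. With `missing=None` an empty field raises (alternative A); passing a value fills it in (alternatives B and C).

```python
def parse_strict(text, sep=",", missing=None):
    """Parse sep-separated numbers into a list of floats.

    Hypothetical helper, not part of NumPy.
    missing=None  -> raise ValueError on an empty field (alternative A)
    missing=value -> substitute that value, e.g. float("nan") (B and C)
    """
    values = []
    for index, field in enumerate(text.split(sep)):
        field = field.strip()
        if not field:
            if missing is None:
                raise ValueError(f"missing value in field {index}")
            values.append(missing)
            continue
        # float() itself raises ValueError on anything that is not a number
        values.append(float(field))
    return values


print(parse_strict("3, 4, 5"))                           # [3.0, 4.0, 5.0]
print(parse_strict("3, 4,,5", missing=float("nan")))     # [3.0, 4.0, nan, 5.0]
```

Calling `parse_strict("3, 4,,5")` with no `missing` argument raises ValueError instead of silently inventing a zero.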
Oh, and this is a bug too, with probably the same solution:
In [20]: np.fromstring("hjba", sep=',')
Out[20]: array([ 0.])
In [26]: np.fromstring("34gytf39", sep=',')
Out[26]: array([ 34.])
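One plausible reading of these two outputs (an assumption, not something checked against the C source) is that the parser reads the longest numeric prefix of a field and ignores whatever is left over. The sketch below separates the two steps so the leftover text becomes visible; `parse_prefix` and `parse_whole` are hypothetical names, and the regular expression only covers plain decimal and exponent notation.

```python
import re

# Optional whitespace, sign, digits with optional fraction, optional exponent.
_NUMBER = re.compile(r"\s*[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?")


def parse_prefix(field):
    """Return (value, leftover): the leading number and the unconsumed text."""
    match = _NUMBER.match(field)
    if match is None:
        return None, field
    return float(match.group()), field[match.end():]


def parse_whole(field):
    """Accept a field only if the number consumes all of it."""
    value, leftover = parse_prefix(field)
    if value is None or leftover.strip():
        raise ValueError(f"not a number: {field!r}")
    return value


print(parse_prefix("34gytf39"))   # (34.0, 'gytf39')
print(parse_prefix("hjba"))       # (None, 'hjba')
```

Under the proposal, both inputs would go through something like `parse_whole` and raise ValueError rather than return 34 or 0.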
One more unresolved question:
what should:
np.fromstring("3, 4, 5,", sep=",")
return?
it currently returns:
array([ 3., 4., 5.])
which seems a bit inconsistent with missing value handling. I also found
a bug:
In [6]: np.fromstring("3, 4, 5 , ", sep=",")
Out[6]: array([ 3., 4., 5., 0.])
so if there is some extra whitespace in there, it does return a missing
value. With my proposal, that wouldn't happen, but you might get an
exception. I think you should, but it'll be easier to implement my
"allow newlines" code if not.
so, should I do (A) ?
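On the trailing-separator question, one way to make "3, 4, 5," and "3, 4, 5 , " behave the same is to drop a single empty final field and still raise on any other empty field. This is only a sketch of that one option in pure Python (the stricter option would raise in both cases); `parse_trailing_ok` is a hypothetical name, not NumPy API.

```python
def parse_trailing_ok(text, sep=","):
    """Parse sep-separated floats, tolerating one trailing separator.

    Hypothetical helper, not part of NumPy.
    """
    fields = [field.strip() for field in text.split(sep)]
    # A trailing separator (with or without whitespace after it) leaves
    # exactly one empty field at the end; drop that one field only.
    if len(fields) > 1 and fields[-1] == "":
        fields.pop()
    for index, field in enumerate(fields):
        if field == "":
            raise ValueError(f"missing value in field {index}")
    return [float(field) for field in fields]


print(parse_trailing_ok("3, 4, 5,"))     # [3.0, 4.0, 5.0]
print(parse_trailing_ok("3, 4, 5 , "))   # [3.0, 4.0, 5.0]
```

An interior gap such as "3, 4,,5", or a doubled trailing separator such as "3, 4,,", still raises ValueError.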
Another question:
I've got a patch mostly working (except for the above issues) that will
allow fromfile/string to read multiline non-whitespace separated data in
one shot:
In [15]: str
Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
In [16]: np.fromstring(str, sep=',', allow_newlines=True)
Out[16]:
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11.,
12.])
I think this is a very helpful enhancement, and, as it is a new kwarg,
backward compatible:
1) Might it be accepted for inclusion?
2) Is the name for the flag OK: "allow_newlines"? It's pretty explicit,
but also long -- I used it for the flag name in the C code, too.
3) What C datatype should I use for a boolean flag? I used a char, but I
don't know what the numpy standard is.
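For reference, the intended effect of the flag can be approximated in pure Python by splitting on either the separator or a newline. This is a rough sketch of the behavior only, not the patch itself; `parse_allow_newlines` is a hypothetical name, and it returns a plain list that could then be handed to np.array().

```python
import re


def parse_allow_newlines(text, sep=","):
    """Split on sep or on a newline, then convert every field to float.

    Hypothetical helper, not part of NumPy. float() ignores surrounding
    whitespace and raises ValueError on an empty or non-numeric field,
    so a separator directly followed by a newline is reported as an error.
    """
    fields = re.split(f"{re.escape(sep)}|\n", text)
    return [float(field) for field in fields]


data = "1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12"
print(parse_allow_newlines(data))
```

For the three-line string above this yields the twelve values 1.0 through 12.0 in order, matching the array shown in Out[16].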
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov