[Numpy-discussion] fromfile() for reading text (one more time!)

Thu Jan 7 15:32:55 EST 2010

On Thu, Jan 7, 2010 at 3:08 PM, Christopher Barker
<Chris.Barker at noaa.gov> wrote:
> Pauli Virtanen wrote:
>> Mon, 2010-01-04 at 17:05 -0800, Christopher Barker wrote:
>>> it also does odd things with spaces
>>> embedded in the separator:
>>>
>>> ", $ #" matches all of:  ",$#"   ", $#"  ",$ #"
>
>> That's a documented feature:
>
> Fair enough.
>
> OK, I've written a patch that allows newlines to be interpreted as
> separators in addition to whatever is specified in sep.
>
> In the process of testing, I found again these issues, which are still
> marked as "needs decision".
>
> http://projects.scipy.org/numpy/ticket/883
>
> In short: what to do with missing values?
>
> I'd like to address this bug, but I need a decision to do so.
>
>
> My proposal:
>
> Raise a ValueError when values are missing.
>
>
> Justification:
>
> No function should EVER return data that is not there. Period. It is
> simply asking for hard-to-find bugs. Therefore:
>
> fromstring("3, 4,,5", sep=",")
>
> Should never, ever, return:
>
> array([ 3.,  4.,  0.,  5.])
>
> Which is what it does now. bad. bad. bad.
>
>
>
>
> Alternatives:
>
>   A) Raising a ValueError is the easiest way to get "proper" behavior.
> Folks can use a more sophisticated file reader if they want missing
> values handled. I'm willing to contribute this patch.
>
>   B) If the dtype is a floating point type, NaN could fill in the
> missing values -- a fine idea, but you can't use it for integers, and
> zero is a really bad replacement!
>
>   C) The user could specify what they want filled in for missing
> values. This is a fine idea, though I'm not sure I want to take the time
> to implement it.
>
> Oh, and this is a bug too, with probably the same solution:
>
> In [20]: np.fromstring("hjba", sep=',')
> Out[20]: array([ 0.])
>
> In [26]: np.fromstring("34gytf39", sep=',')
> Out[26]: array([ 34.])
>
>
> One more unresolved question:
>
> what should:
>
> np.fromstring("3, 4, 5,", sep=",")
>
> return?
>
> it currently returns:
>
> array([ 3.,  4.,  5.])
>
> which seems a bit inconsistent with missing value handling. I also found
> a bug:
>
> In [6]: np.fromstring("3, 4, 5 , ", sep=",")
> Out[6]: array([ 3.,  4.,  5.,  0.])
>
> so if there is some extra whitespace in there, it does return a missing
> value. With my proposal, that wouldn't happen, but you might get an
> exception. I think you should, but it'll be easier to implement my
> "allow newlines" code if not.
>
>
> so, should I do (A) ?
>
>
> Another question:
>
> I've got a patch mostly working (except for the above issues) that will
> allow fromfile/string to read multiline non-whitespace separated data in
> one shot:
>
>
> In [15]: str
> Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>
> In [16]: np.fromstring(str, sep=',', allow_newlines=True)
> Out[16]:
> array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
>         12.])
>
>
> I think this is a very helpful enhancement, and, as it is a new kwarg,
> backward compatible:
>
> 1) Might it be accepted for inclusion?
>
> 2) Is the name for the flag OK: "allow_newlines"? It's pretty explicit,
> but also long -- I used it for the flag name in the C code, too.
>
> 3) What C datatype should I use for a boolean flag? I used a char, but I
> don't know what the numpy standard is.
>
>
> -Chris
>
>

I don't know much about this, but here are a few more test cases:

comma and newline
str =  '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12'

extra comma at end of file
str =  '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,'

extra newlines at end of file
str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'

It would be nice if these cases went through without missing
values or exceptions, but I don't often have files that are clean
enough for fromfile().
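
To make the desired behavior for these cases concrete, here is a minimal sketch in plain Python. It is not how the C parser behind fromfile()/fromstring() works, and `parse_numbers` is a hypothetical helper, not a numpy function. It assumes one possible rule: newlines always act as separators, blank lines and a single trailing separator on a line are ignored, and an empty field in the middle of a line raises ValueError (alternative A from the quoted message).

```python
def parse_numbers(text, sep=","):
    """Hypothetical helper: parse floats separated by `sep` and newlines."""
    values = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank line, e.g. extra newlines at end of file
        fields = [field.strip() for field in line.split(sep)]
        if fields[-1] == "":
            fields.pop()  # assumption: one trailing separator per line is OK
        for field in fields:
            if field == "":
                raise ValueError(f"missing value on line {lineno}")
            values.append(float(field))  # float() raises ValueError on junk
    return values


# The three test cases from this message all parse to the same 12 values.
print(parse_numbers('1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12'))
print(parse_numbers('1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,'))
print(parse_numbers('1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'))
```

Under the same rule, "3, 4,,5" and "hjba" both raise ValueError instead of silently producing a zero.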

I'm in favor of nan for missing values with floating point numbers. It
would make it easy to read correctly formatted csv files, even if the
data is not complete.
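
As a rough illustration of the nan option (alternative B in the quoted message), the sketch below fills every empty field with nan. It is plain Python with a hypothetical helper name, `parse_floats_with_nan`, and it only makes sense for floating point output, since integer dtypes have no nan.

```python
import math


def parse_floats_with_nan(text, sep=","):
    """Hypothetical helper: parse floats, filling empty fields with nan."""
    values = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        for field in line.split(sep):
            field = field.strip()
            # An empty field is treated as a missing value, not as zero.
            values.append(float(field) if field else math.nan)
    return values


print(parse_floats_with_nan("3, 4,,5"))  # [3.0, 4.0, nan, 5.0]
```

Note that under this rule a trailing separator, as in "3, 4, 5,", also produces a final nan, which is one consistent answer to the trailing-comma question raised above, though not the only one.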

Josef

> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov