[Numpy-discussion] fromfile() for reading text (one more time!)

Thu Jan 7 16:11:01 EST 2010

On Thu, Jan 7, 2010 at 2:32 PM,  <josef.pktd at gmail.com> wrote:
> On Thu, Jan 7, 2010 at 3:08 PM, Christopher Barker
> <Chris.Barker at noaa.gov> wrote:
>> Pauli Virtanen wrote:
>>> Mon, 2010-01-04 at 17:05 -0800, Christopher Barker wrote:
>>>> it also does odd things with spaces
>>>> embedded in the separator:
>>>>
>>>> ", $ #" matches all of:  ",$#"   ", $#"  ",$ #"
>>
>>> That's a documented feature:
>>
>> Fair enough.
>>
>> OK, I've written a patch that allows newlines to be interpreted as
>> separators in addition to whatever is specified in sep.
>>
>> In the process of testing, I found again these issues, which are still
>> marked as "needs decision".
>>
>> http://projects.scipy.org/numpy/ticket/883
>>
>> In short: what to do with missing values?
>>
>> I'd like to address this bug, but I need a decision to do so.
>>
>>
>> My proposal:
>>
>> Raise a ValueError with missing values.
>>
>>
>> Justification:
>>
>> No function should EVER return data that is not there. Period. It is
>> simply asking for hard to find bugs. Therefore:
>>
>> fromstring("3, 4,,5", sep=",")
>>
>> Should never, ever, return:
>>
>> array([ 3.,  4.,  0.,  5.])
>>
>> Which is what it does now. bad. bad. bad.
>>
>>
>>
>>
>> Alternatives:
>>
>>   A) Raising a ValueError is the easiest way to get "proper" behavior.
>> Folks can use a more sophisticated file reader if they want missing
>> values handled. I'm willing to contribute this patch.
>>
>>   B) If the dtype is a floating point type, NaN could fill in the
>> missing values -- a fine idea, but you can't use it for integers, and
>> zero is a really bad replacement!
>>
>>   C) The user could specify what they want filled in for missing
>> values. This is a fine idea, though I'm not sure I want to take the time
>> to implement it.
>>
>> Oh, and this is a bug too, with probably the same solution:
>>
>> In [20]: np.fromstring("hjba", sep=',')
>> Out[20]: array([ 0.])
>>
>> In [26]: np.fromstring("34gytf39", sep=',')
>> Out[26]: array([ 34.])
>>
>>
>> One more unresolved question:
>>
>> what should:
>>
>> np.fromstring("3, 4, 5,", sep=",")
>>
>> return?
>>
>> it currently returns:
>>
>> array([ 3.,  4.,  5.])
>>
>> which seems a bit inconsistent with missing value handling. I also found
>> a bug:
>>
>> In [6]: np.fromstring("3, 4, 5 , ", sep=",")
>> Out[6]: array([ 3.,  4.,  5.,  0.])
>>
>> so if there is some extra whitespace in there, it does return a missing
>> value. With my proposal, that wouldn't happen, but you might get an
>> exception. I think you should, but it'll be easier to implement my
>> "allow newlines" code if not.
>>
>>
>> so, should I do (A) ?
>>
>>
>> Another question:
>>
>> I've got a patch mostly working (except for the above issues) that will
>> allow fromfile/string to read multiline non-whitespace separated data in
>> one shot:
>>
>>
>> In [15]: str
>> Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>>
>> In [16]: np.fromstring(str, sep=',', allow_newlines=True)
>> Out[16]:
>> array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
>>         12.])
>>
>>
>> I think this is a very helpful enhancement, and, as it is a new kwarg,
>> backward compatible:
>>
>> 1) Might it be accepted for inclusion?
>>
>> 2) Is the name for the flag OK: "allow_newlines"? It's pretty explicit,
>> but also long -- I used it for the flag name in the C code, too.
>>
>> 3) What C datatype should I use for a boolean flag? I used a char, but I
>> don't know what the numpy standard is.
>>
>>
>> -Chris
>>
>>
>
> I don't know much about this, just a few more test cases
>
> comma and newline
> str =  '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12'
>
> extra comma at end of file
> str =  '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,'
>
> extra newlines at end of file
> str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
>
> It would be nice if these cases would go through without missing
> values or exception, but I don't often have files that are clean
> enough for fromfile().
>
> I'm in favor of nan for missing values with floating point numbers. It
> would make it easy to read correctly formatted csv files, even if the
> data is not complete.
>

Using the numpy NaN or similar for missing values (noting R's approach
to missing values, which in turn allows it to have the above
functionality) is just a very bad idea, because you always have to
check which NaN is a missing value and which is the result of some
numerical calculation. It is also unnecessary, because we have masked
arrays that handle this situation nicely, if slowly.
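
To make the ambiguity concrete, here is a minimal sketch in plain Python
(no numpy, so it runs anywhere). parse_with_mask is a made-up helper for
illustration, not an existing numpy function, and the separate boolean
list only stands in for what a masked array's mask would record.

```python
import math


def parse_with_mask(text, sep=","):
    """Hypothetical helper: parse floats and record which fields were empty.

    Returns (values, mask), where mask[i] is True if field i was missing.
    """
    values, mask = [], []
    for field in text.split(sep):
        token = field.strip()
        if token == "":
            # Missing field: store a placeholder and flag it in the mask.
            values.append(math.nan)
            mask.append(True)
        else:
            # Real data, which may itself legitimately be "nan".
            values.append(float(token))
            mask.append(False)
    return values, mask


values, mask = parse_with_mask("3, 4,,nan,5")

# Both positions 2 and 3 hold NaN, so the values alone cannot tell a
# missing field from a NaN that was actually present in the data.
print([math.isnan(v) for v in values])  # [False, False, True, True, False]

# The mask can: only position 2 was missing.
print(mask)  # [False, False, True, False, False]
```

With NaN as the only marker, positions 2 and 3 look the same; the mask
is what keeps them apart.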

From what I can see, you expect that fromfile() should only
split at the supplied delimiters, optionally(?) strip any whitespace,
and force a specific dtype. I would agree that a failure of any one of
these should raise an exception by default rather than making a
best guess. So 'missing data' would potentially fail when forcing the
specified dtype. Thus, you should either raise an exception for
invalid data (with the appropriate location) or use masked arrays.
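
As a rough sketch of the "exception with location" behaviour, in plain
Python rather than the C code in question: strict_fromstring is a
hypothetical name, and reporting the zero-based field index is just one
assumed way of giving the location.

```python
def strict_fromstring(text, sep=",", dtype=float):
    """Hypothetical strict parser: split, strip, convert, never guess.

    Raises ValueError naming the zero-based field index for an empty
    field or for a token that cannot be converted to dtype.
    """
    result = []
    for index, field in enumerate(text.split(sep)):
        token = field.strip()
        if token == "":
            raise ValueError(f"missing value in field {index}")
        try:
            result.append(dtype(token))
        except ValueError:
            raise ValueError(
                f"cannot convert {token!r} to {dtype.__name__} in field {index}"
            ) from None
    return result


print(strict_fromstring("3, 4, 5"))  # [3.0, 4.0, 5.0]

for bad in ("3, 4,,5", "34gytf39", "3, 4, 5,"):
    try:
        strict_fromstring(bad)
    except ValueError as exc:
        print(f"{bad!r}: {exc}")
```

Note that under this sketch a trailing separator ("3, 4, 5,") is also
reported as a missing value; whether that case should be tolerated is
exactly the open question above.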

Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
actually assumes multiple delimiters, because there is no comma between
4 and 5 or between 8 and 9. So I think it would be better if fromfile
accepted multiple delimiters. In Josef's last case, how many 'missing
values' should there be?
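
For what it is worth, here is a small sketch of what treating both ','
and '\n' as delimiters would mean, using the standard re module.
split_multi is a hypothetical helper, and counting every empty field as
a missing value is an assumption made only to show the question.

```python
import re


def split_multi(text, delimiters=(",", "\n")):
    """Hypothetical helper: split on any of several delimiters.

    Each delimiter is escaped, so it is matched literally. Fields are
    stripped of surrounding whitespace; empty fields are kept.
    """
    pattern = "|".join(re.escape(d) for d in delimiters)
    return [field.strip() for field in re.split(pattern, text)]


def count_empty(fields):
    """Number of fields that contain nothing after stripping."""
    return sum(1 for field in fields if field == "")


clean = "1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12"
trailing_newlines = "1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n"

print(len(split_multi(clean)), count_empty(split_multi(clean)))  # 12 0
print(len(split_multi(trailing_newlines)),
      count_empty(split_multi(trailing_newlines)))  # 15 3
```

Read literally, the three trailing newlines give three empty fields, so
a reader with multiple delimiters still needs a rule for whether those
count as missing values.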

Bruce