[Numpy-discussion] genfromtxt - the return

Wed Oct 7 15:54:51 EDT 2009

On 10/07/2009 02:14 PM, Christopher Barker wrote:
> Pierre GM wrote:
>    
>> On Oct 6, 2009, at 10:08 PM, Bruce Southey wrote:
>>      
>>> option to merge delimiters - actually in SAS it is default
>>>        
> Wow! that sure strikes me as a bad choice.
>
>    
>> Ahah! I get it. Well, I remember that we discussed something like that a
>> few months ago when I started working on np.genfromtxt, and the
>> default of *not* merging whitespaces was requested. I'm gonna check
>> whether we can't put this option somewhere now...
>>      
> I'd think you might want to have two options: either "whitespace" which
> would be any type or amount of whitespace, or a specific delimiter: say
> "\t" or " " or "  " (two spaces), etc. In that case, it would mean "one
> and only one of these".
>
> Of course, this would fail in Bruce's example:
>
>   >>>>  A B C D
>   >>>>  1 2 3 4
>   >>>>  1     4 5
>
> as there is a space for the delimiter, and one for the data! This looks
> like fixed-format to me. If it were single-space delimited, it would
> look more like this when the delimiter is whitespace:
>
> A B C D E
> 1 2 3 4 5
> 1   4 5
>
> which is the same as:
>
> A, B, C, D, E
> 1, 2, 3, 4, 5
> 1,  ,  , 4, 5
>
>
> If something like SAS actually does merge delimiters, which I interpret
> to mean that if there are a few empty fields and you call for
> tab-delimited, you only get one tab, then information has simply been
> lost -- there is no way to recover it!
>
> -Chris
>
>    
To use fixed-length fields you really need nicely formatted data, and I
usually do not have that. As a default, merging delimiters does not always
work for non-whitespace delimiters such as:
A,B,C
,,1
1,2,3

There is an option to override that behavior. But merging is very useful
when you have extra whitespace, especially when reading in text strings that
have different lengths or different levels of whitespace padding.
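
The comma example above can be sketched in plain Python to show what merging
costs. This is only an illustration: `split_merged` is a hypothetical helper
written for this message, not how SAS or genfromtxt implement anything.

```python
import re

def split_merged(line, delimiter=","):
    # Hypothetical helper: treat a run of repeated delimiters as one.
    return re.split("(?:" + re.escape(delimiter) + ")+", line)

rows = ["A,B,C", ",,1", "1,2,3"]

# One delimiter, one field boundary: empty fields are preserved.
strict = [row.split(",") for row in rows]

# Merged delimiters: the second row collapses from three fields to two,
# so the value 1 no longer lines up with column C.
merged = [split_merged(row) for row in rows]
```

With strict splitting every row has three fields; with merging the row `,,1`
has only two, so the column position of the value can no longer be recovered.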

The following result is correct in that Python merges whitespace delimiters by default, which is also what SAS does by default for any delimiter. But it would be incorrect if each whitespace character were treated as a separate delimiter:

>>> import numpy as np
>>> from StringIO import StringIO
>>> s = StringIO('''
...   1 10 100\r\n
... 10  1 1000''')
>>> np.genfromtxt(s)
array([[    1.,    10.,   100.],
       [   10.,     1.,  1000.]])

>>> np.genfromtxt(s, delimiter=' ')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib64/python2.6/site-packages/numpy/lib/io.py", line 1048, in genfromtxt
     raise IOError('End-of-file reached before encountering data.')
IOError: End-of-file reached before encountering data.
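
The same distinction can be seen without NumPy. The following is a minimal
sketch using plain string splitting, with illustrative variable names that are
not part of genfromtxt. It also shows that a StringIO object is consumed by its
first read, so the second call above may be failing on the already exhausted
`s` rather than on the delimiter itself. Rewinding with `s.seek(0)` would be
needed for a fair comparison.

```python
from io import StringIO

rows = ["  1 10 100", "10  1 1000"]

# No argument: runs of whitespace are merged and leading padding is dropped.
merged = [row.split() for row in rows]

# Explicit ' ': every single space is a delimiter, so padding
# produces empty fields and the rows end up with different lengths.
strict = [row.split(" ") for row in rows]

# A file-like object is consumed by the first read.
s = StringIO("\n".join(rows))
first = s.read()
second = s.read()   # empty string: nothing is left to read
s.seek(0)           # rewind before reading again
third = s.read()    # same content as the first read
```

Here `merged` has three fields per row, while `strict` has five fields in the
first row and four in the second, which is why a single-space delimiter does
not suit whitespace-padded data like this.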

Anyhow, I do like what genfromtxt is doing, so merging multiple delimiters of the same type is not really needed.

Bruce