[Numpy-discussion] Question about improving genfromtxt errors

Wed Sep 30 12:56:52 EDT 2009

On 09/30/2009 10:22 AM, Skipper Seabold wrote:
> On Tue, Sep 29, 2009 at 4:36 PM, Bruce Southey<bsouthey at gmail.com>  wrote:
> <snip>
>    
>> Hi,
>> The first case just has to handle a missing delimiter - actually I expect
>> that most of my cases would relate to this. So here is simple Python code to
>> generate an arbitrarily large list with the occasional missing delimiter.
>>
>> I set it so it reads the desired number of rows and the frequency of bad
>> rows from the Linux command line.
>> $ time python tbig.py 1000000 100000
>>
>> If I comment out the extra prints in io.py that I put in, it takes about 22
>> seconds to finish if the delimiters are correct. If I have the missing
>> delimiter it takes 20.5 seconds to crash.
>>
>>
>> Bruce
>>
>>      
> I think this would actually cover most of the problems I was running
> into.  The only other one I can think of is when I used a converter
> that I thought would work, but it got unexpected data.  For example,
>
> from StringIO import StringIO
> import numpy as np
>
> strip_rand = lambda x : float(('r' in x.lower() and x.split()[-1]) or
> (not 'r' in x.lower() and x.strip() or 0.0))
>
> # Example usage
> strip_rand('R 40')
> strip_rand('  ')
> strip_rand('')
> strip_rand('40')
>
> strip_per = lambda x : float(('%' in x.lower() and x.split()[0]) or
> (not '%' in x.lower() and x.strip() or 0.0))
>
> # Example usage
> strip_per('7 %')
> strip_per('7')
> strip_per(' ')
> strip_per('')
>
> # Unexpected usage
> strip_per('R 1')
>    
Does this work for you?
I get:
ValueError: invalid literal for float(): R 1

> s = StringIO('D01N01,10/1/2003 ,1 %,R 75,400,600\r\nL24U05,12/5/2003\
> ,2 %,1,300, 150.5\r\nD02N03,10/10/2004 ,R 1,,7,145.55')
>    
Can you provide the correct line before the bad line?
It just makes it easier to understand why a line is bad.

> data = np.genfromtxt(s, converters = {2 : strip_per, 3 : strip_rand},
> delimiter=",", dtype=None)
>
> I don't have a clean install right now, but I think this returned a
> converter is locked for upgrading error.  I would just like to know
> where the problem occurred (line and column, preferably not
> zero-indexed), so I can go and have a look at my data.
>    
I have a rather limited understanding here. I think the problem is that
Python is raising a ValueError because your strip_per() is wrong. It is
not informative to you because _iotools.py is not aware that an invalid
converter will raise a ValueError. Therefore there needs to be some way
to test whether the converter is correct.
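
One way to make a converter failure more informative, as a rough sketch only:
wrap each converter so that a ValueError names the column and the offending
value. Note that checked_converter() is a hypothetical helper, not anything
in numpy or _iotools.py, and the column number passed in is assumed to be
one-indexed. strip_per() is rewritten as a plain function with the same logic
as the lambda quoted above.

```python
def checked_converter(func, column):
    """Wrap a converter so a failure names the column and the bad value.

    Hypothetical helper for illustration; not part of numpy.
    `column` is only used in the error message (one-indexed by assumption).
    """
    def wrapper(value):
        try:
            return func(value)
        except ValueError as err:
            raise ValueError(
                "converter for column %d failed on %r: %s"
                % (column, value, err)
            )
    return wrapper


def strip_per(x):
    # Same logic as the lambda in the quoted message.
    return float(('%' in x.lower() and x.split()[0]) or
                 (not '%' in x.lower() and x.strip() or 0.0))


safe_strip_per = checked_converter(strip_per, column=3)

# A well-formed value converts as before.
print(safe_strip_per('7 %'))

# An unexpected value now reports the column and the raw string.
try:
    safe_strip_per('R 1')
except ValueError as err:
    print(err)
```

This does not report the row, since the converter only ever sees a single
field; that information would have to come from the code that loops over
the lines.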

In this case I think it is the delimiter, so checking the column
numbers should occur before the application of the converter to that row.
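
Something along these lines is what I mean by checking the columns first.
This is only a sketch under my own assumptions: find_bad_rows() is a made-up
name, the sample lines are invented, and the expected number of fields is
taken from the first non-empty line. It is not how genfromtxt works
internally.

```python
def find_bad_rows(lines, delimiter=",", expected=None):
    """Return (line_number, field_count) for rows with the wrong field count.

    Hypothetical helper for illustration. Line numbers are one-indexed.
    If `expected` is None, the first non-empty line sets the field count.
    """
    bad = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        nfields = len(line.split(delimiter))
        if expected is None:
            expected = nfields
        elif nfields != expected:
            bad.append((lineno, nfields))
    return bad


# Invented data: the second line is missing one delimiter.
sample = ["a,1,2.5", "b,2 3.5", "c,3,4.5"]
print(find_bad_rows(sample))  # [(2, 2)]
```

A row that has the right number of fields but a delimiter in the wrong place
would still pass this check, so it only covers the missing (or extra)
delimiter case.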

Bruce