[Numpy-discussion] Question about improving genfromtxt errors

Bruce Southey bsouthey at gmail.com
Fri Oct 2 10:34:39 EDT 2009


On 09/30/2009 12:44 PM, Skipper Seabold wrote:
> On Wed, Sep 30, 2009 at 12:56 PM, Bruce Southey<bsouthey at gmail.com>  wrote:
>    
>> On 09/30/2009 10:22 AM, Skipper Seabold wrote:
>>      
>>> On Tue, Sep 29, 2009 at 4:36 PM, Bruce Southey<bsouthey at gmail.com>    wrote:
>>> <snip>
>>>
>>>        
>>>> Hi,
>>>> Hi,
>>>> The first case just has to handle a missing delimiter - actually I expect
>>>> that most of my cases would relate to this. So here is simple Python code to
>>>> generate an arbitrarily large list with the occasional missing delimiter.
>>>>
>>>> I set it up so it reads the desired number of rows and the frequency of bad
>>>> rows from the Linux command line.
>>>> $time python tbig.py 1000000 100000
>>>>
>>>> If I comment out the extra prints in io.py that I put in, it takes about 22
>>>> seconds to finish when the delimiters are correct. With the missing
>>>> delimiter it takes 20.5 seconds to crash.
>>>>
>>>>
>>>> Bruce
>>>>
>>>>
>>>>          
>>> I think this would actually cover most of the problems I was running
>>> into.  The only other one I can think of is when I used a converter
>>> that I thought would work, but it got unexpected data.  For example,
>>>
>>> from StringIO import StringIO
>>> import numpy as np
>>>
>>> strip_rand = lambda x : float(('r' in x.lower() and x.split()[-1]) or
>>> (not 'r' in x.lower() and x.strip() or 0.0))
>>>
>>> # Example usage
>>> strip_rand('R 40')
>>> strip_rand('  ')
>>> strip_rand('')
>>> strip_rand('40')
>>>
>>> strip_per = lambda x : float(('%' in x.lower() and x.split()[0]) or
>>> (not '%' in x.lower() and x.strip() or 0.0))
>>>
>>> # Example usage
>>> strip_per('7 %')
>>> strip_per('7')
>>> strip_per(' ')
>>> strip_per('')
>>>
>>> # Unexpected usage
>>> strip_per('R 1')
>>>
>>>        
>> Does this work for you?
>> I get an:
>> ValueError: invalid literal for float(): R 1
>>
>>      
> No, that's the idea.  Sorry this was a bit opaque.
>
>    
>>      
>>> s = StringIO('D01N01,10/1/2003 ,1 %,R 75,400,600\r\nL24U05,12/5/2003\
>>> ,2 %,1,300, 150.5\r\nD02N03,10/10/2004 ,R 1,,7,145.55')
>>>
>>>        
>> Can you provide the correct line before the bad line?
>> It just makes it easy to understand why a line is bad.
>>
>>      
> The idea is that I have a column which I expect to be percentages,
> but these are coded in by different data collectors, so some code a 0
> for 0, some just leave it missing (which could just as well be 0), and
> some use the %.  What I didn't expect was that some put in a money
> amount, hence the 'R 1', which my converter doesn't catch.
>
>    
>>> data = np.genfromtxt(s, converters = {2 : strip_per, 3 : strip_rand},
>>> delimiter=",", dtype=None)
>>>
>>> I don't have a clean install right now, but I think this returned a
>>> "converter is locked for upgrading" error.  I would just like to know
>>> where the problem occurred (line and column, preferably not
>>> zero-indexed), so I can go and have a look at my data.
>>>
>>>        
>> I have a rather limited understanding here. I think the problem is that
>> Python is raising a ValueError because your strip_per() is wrong. It is not
>> informative to you because _iotools.py is not aware that an invalid
>> converter will raise a ValueError. Therefore there needs to be some way
>> to test whether the converter is correct or not.
>>
>>      
> _iotools does catch this I believe, though I don't understand the
> upgrading and locking properly.  The kludgy fix that I provided in the
> first post ("I do not report the error from
> _iotools.StringConverter...") catches that an error is raised from
> _iotools and tells me exactly where the converter fails, so I can go
> to, say, line 750,000, column 250 (and converter with key 249) instead
> of not knowing anything except that one of my ~500 converters failed
> somewhere in a 1 million line data file.  If you still want to keep
> the error messages from _iotools.StringConverter, then maybe they
> could have a (%s, %s) added, which could then be filled in by
> genfromtxt when it knows (line, column), or something similar, as was
> kind of suggested in a post in this thread I believe.  Then again,
> this might not be possible.  I haven't tried.
>
>    
I added another patch to ticket 1212
http://projects.scipy.org/numpy/ticket/1212

I tried to rework my first patch because I had forgotten that the header 
of the file that I was using was missing a delimiter. (Something I need 
to investigate more.) Hopefully it helps towards a better solution.

I added a try/except block around the 'converter.upgrade(item)' line, 
which appears to provide the results for your file, although it is not 
the best solution. In addition, I modified the loop to enumerate the 
converter list so I could find which converter in the list fails. The 
output for your example:

Row Number: 3 Failed Converter 2 in list of converters
[('D01N01', '10/1/2003 ', 1.0, 75.0, 400, 600.0)
  ('L24U05', '12/5/2003', 2.0, 1.0, 300, 150.5)
  ('D02N03', '10/10/2004 ', 0.0, 0.0, 7, 145.55000000000001)]
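
For reference, here is a small standalone sketch of the idea (it is not 
the io.py code from the patch; the convert_rows helper and the toy data 
are made up just to illustrate it): enumerate the converters while 
applying them row by row, catch the ValueError, and report the row and 
the failing converter.

from StringIO import StringIO

def convert_rows(fh, converters, delimiter=','):
    # Apply one converter per column; on failure, report which row and
    # which converter (0-indexed, as in the patch output) went wrong.
    rows = []
    for (i, line) in enumerate(fh):
        values = line.strip().split(delimiter)
        converted = []
        for (j, conv) in enumerate(converters):
            try:
                converted.append(conv(values[j]))
            except ValueError:
                raise ValueError("Row Number: %d Failed Converter %d "
                                 "on value %r" % (i + 1, j, values[j]))
        rows.append(tuple(converted))
    return rows

# The third converter cannot handle 'R 1' in the third row, so this raises:
# ValueError: Row Number: 3 Failed Converter 2 on value 'R 1'
strip_per = lambda x: float(('%' in x and x.split()[0]) or (x.strip() or 0.0))
s = StringIO("a,1,1 %\nb,2,2 %\nc,3,R 1\n")
convert_rows(s, [str, int, strip_per])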

>> In this case I think it is the delimiter, so checking the column
>> numbers should occur before the application of the converter to that row.
>>
>>      
> Sometimes it was the case that I had an extra comma in a number, say
> 1,000, and then the converter tried to work on the wrong column, and
> sometimes it was because my converter didn't cover every use case,
> because I didn't know them all yet.  Either way, I just needed a gentle
> nudge in the right direction.
>
> If that doesn't clear up what I was after, I can try to provide a more
> detailed code sample.
>
> Skipper
I do not see how to write code to determine when a delimiter has more 
than one meaning. When there are more columns than expected, it can be 
very hard to determine which column is incorrect without additional 
information. We might be able to do that if we associate a format with 
each column, but then you would have to split the columns one by one and 
check each one as you go. That is probably not hard to do but a lot of 
work to validate. For example, I have numerous problems with dates in 
SAS because you can have 2 or 4 digit years and 1 or 2 digit days and 
months, and any variation from what is expected leads to errors, say 
when the format expects a 2 digit year and gets a 4 digit one. So I 
usually read dates as strings and then parse them as I want.
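
As an illustration of that last point, a minimal sketch (not code from 
the patch or from the thread; the column names and date formats are just 
assumptions for the example): read the date column as a string with 
genfromtxt and convert it afterwards, trying a few candidate formats so 
that 2 and 4 digit years both work.

import datetime
from StringIO import StringIO
import numpy as np

def parse_date(value, formats=('%m/%d/%Y', '%m/%d/%y')):
    # Try each candidate format in turn; the list of formats is only an
    # example and would depend on what the data collectors actually use.
    value = value.strip()
    for fmt in formats:
        try:
            return datetime.datetime.strptime(value, fmt).date()
        except ValueError:
            pass
    raise ValueError("Unrecognized date: %r" % value)

# Read the date column as a plain string first, then parse it separately.
s = StringIO("D01N01,10/1/2003,400\nL24U05,12/5/03,300\n")
data = np.genfromtxt(s, delimiter=',', dtype=None,
                     names=['label', 'date', 'amount'])
dates = [parse_date(d) for d in data['date']]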

Bruce


