[Numpy-discussion] Question about improving genfromtxt errors

Wed Sep 30 11:22:39 EDT 2009

On Tue, Sep 29, 2009 at 4:36 PM, Bruce Southey <bsouthey at gmail.com> wrote:
<snip>
>
> Hi,
> The first case just has to handle a missing delimiter - actually I expect
> that most of my cases would relate to this. So here is simple Python code to
> generate an arbitrarily large list with the occasional missing delimiter.
>
> I set it so it reads the desired number of rows and frequency of bad rows
> from the Linux command line.
> $ time python tbig.py 1000000 100000
>
> If I comment out the extra prints in io.py that I put in, it takes about 22
> seconds to finish if the delimiters are correct. If I have the missing
> delimiter it takes 20.5 seconds to crash.
>
>
> Bruce
>

I think this would actually cover most of the problems I was running
into.  The only other one I can think of is when I used a converter
that I thought would work, but it got unexpected data.  For example,

from StringIO import StringIO
import numpy as np

strip_rand = lambda x: float(('r' in x.lower() and x.split()[-1]) or
                             ('r' not in x.lower() and x.strip() or 0.0))

# Example usage
strip_rand('R 40')
strip_rand('  ')
strip_rand('')
strip_rand('40')

strip_per = lambda x: float(('%' in x.lower() and x.split()[0]) or
                            ('%' not in x.lower() and x.strip() or 0.0))

# Example usage
strip_per('7 %')
strip_per('7')
strip_per(' ')
strip_per('')

# Unexpected usage: there is no '%' here, so float('R 1') raises ValueError
strip_per('R 1')

s = StringIO('D01N01,10/1/2003 ,1 %,R 75,400,600\r\nL24U05,12/5/2003\
,2 %,1,300, 150.5\r\nD02N03,10/10/2004 ,R 1,,7,145.55')

data = np.genfromtxt(s, converters = {2 : strip_per, 3 : strip_rand},
delimiter=",", dtype=None)

I don't have a clean install right now, but I think this returned an
error saying the converter is locked and cannot be upgraded.  I would
just like to know where the problem occurred (line and column,
preferably not zero-indexed), so I can go and have a look at my data.
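
To make the request concrete, here is a rough sketch in plain Python of
the kind of report I have in mind.  It is not how genfromtxt works
internally; the helper name apply_converters and the wording of the
messages are made up for illustration, and strip_per is rewritten as a
regular function.  It splits each line on the delimiter, applies the
per-column converters, and re-raises any failure with a 1-based line
and column.

```python
def strip_per(x):
    """Convert '7 %' or '7' to 7.0; blank fields become 0.0."""
    x = x.strip()
    if not x:
        return 0.0
    return float(x.split()[0] if '%' in x else x)


def apply_converters(lines, converters, delimiter=","):
    """Hypothetical helper: convert fields and say where a failure occurs.

    `converters` maps a zero-based column index to a callable, as in
    genfromtxt.  Line and column numbers in error messages are 1-based.
    """
    rows = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split(delimiter)
        for col, conv in converters.items():
            if col >= len(fields):
                # Covers the missing-delimiter case: the row is too short.
                raise ValueError(
                    "line %d: expected at least %d columns, found %d"
                    % (lineno, col + 1, len(fields)))
            try:
                fields[col] = conv(fields[col])
            except ValueError as err:
                raise ValueError(
                    "line %d, column %d: cannot convert %r (%s)"
                    % (lineno, col + 1, fields[col], err))
        rows.append(fields)
    return rows


text = ("D01N01,10/1/2003 ,1 %,R 75,400,600\r\n"
        "L24U05,12/5/2003,2 %,1,300, 150.5\r\n"
        "D02N03,10/10/2004 ,R 1,,7,145.55")

message = None
try:
    apply_converters(text.splitlines(), {2: strip_per})
except ValueError as err:
    message = str(err)
    print(message)
```

With the sample data above, the failure is reported for the third line
and third column, which is where the stray 'R 1' sits.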

One more note: being able to autostrip whitespace turned out to be
very helpful.  I didn't realize how much memory strings of spaces
could take up, and as soon as I turned this on, I was able to process
an array with a lot of whitespace without filling up my memory.  So I
think maybe autostrip should be turned on by default?
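
A back-of-the-envelope sketch of why the padding matters: NumPy string
arrays are fixed-width, so every element reserves room for the longest
string in the column.  The field values and the 40-character padding
below are invented, and the estimate assumes one byte per character,
so this only illustrates the effect and is not a measurement of
genfromtxt itself.

```python
def fixed_width_bytes(fields):
    """Estimate storage for a fixed-width string column (1 byte per char)."""
    width = max(len(f) for f in fields)
    return width * len(fields)


# Invented example: short codes padded with trailing spaces to 40 characters.
padded = [code.ljust(40) for code in ("D01N01", "L24U05", "D02N03")] * 1000
stripped = [f.strip() for f in padded]

padded_bytes = fixed_width_bytes(padded)      # 40 bytes for each of 3000 fields
stripped_bytes = fixed_width_bytes(stripped)  # 6 bytes for each of 3000 fields
print(padded_bytes, stripped_bytes)
```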

I will post anything else if it occurs to me.

Skipper