[Numpy-discussion] Question about improving genfromtxt errors

Mon Sep 28 13:54:54 EDT 2009

On Mon, Sep 28, 2009 at 1:36 PM, Pierre GM <pgmdevlist at gmail.com> wrote:
>
> On Sep 28, 2009, at 12:51 PM, Skipper Seabold wrote:
>
>> This was probably due to the way that I timed it, honestly.  I only
>> did it once.  The only differences I made for that part were in the
>> first post of the thread.  Two incremented scalars for line numbers
>> and column numbers and a try/except block.
>>
>> I'm really not against a debug mode if someone wants to do it, and
>> it's deemed necessary.  If it could be made to log all of the errors
>> that would be extremely helpful.  I still need to post some of my use
>> cases though.  Anything to help make data cleaning less of a chore...
>
> I was thinking about something this week-end: we could create a second
> list when looping on the rows, where we would store the length of each
> split row. After the loop, we can check whether these values match
> the expected number of columns `nbcols` and, if not, where they differ.
> Then we can decide either to strip the invalid rows from the `rows`
> list (which corresponds to skipping) or to raise an exception, but in
> both cases we know where the problem is.
> My only concern is that we'd be creating yet another list of integers,
> which would increase memory usage. Would that be a problem?
> In other news, I should eventually be able to tackle that this week...
>
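
For concreteness, a rough sketch of the bookkeeping described above. This is not genfromtxt's actual code: the helper name `find_bad_rows`, the `delimiter` argument, and the sample data are made up for illustration; only `rows` and `nbcols` come from the description.

```python
def find_bad_rows(lines, nbcols, delimiter=","):
    """Split each line and record how many fields it produced.

    Hypothetical helper for illustration, not part of numpy.
    """
    rows = []
    lengths = []  # the "second list": one integer per split row
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        rows.append(fields)
        lengths.append(len(fields))
    # After the loop: (1-based line number, field count) for each mismatch.
    bad = [(i + 1, n) for i, n in enumerate(lengths) if n != nbcols]
    return rows, bad


lines = ["1,2,3", "4,5", "6,7,8", "9,10,11,12"]
rows, bad = find_bad_rows(lines, nbcols=3)
print(bad)  # [(2, 2), (4, 4)]

# Skipping: strip the invalid rows from `rows`.
bad_lines = {line_no for line_no, _ in bad}
good_rows = [r for i, r in enumerate(rows, start=1) if i not in bad_lines]
print(good_rows)  # [['1', '2', '3'], ['6', '7', '8']]
```

Raising instead of skipping would just mean checking `if bad:` and putting the collected line numbers into the exception message.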

I don't think it would be prohibitively large.  One of the datasets I
was working with was about a million lines with about 500 columns in
each.  So, if this is how you would actually measure it, you get:

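import sys

# One integer per row: 1201798 rows, each recorded as 500 columns.
L = [500] * 1201798

# getsizeof counts the list's own array of references, not the int objects.
print(sys.getsizeof(L) / 1e6, "MB")

# Roughly 9.6 MB; the exact digits and format vary with the Python version.
# (9.6144560000000006, 'MB')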
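
If the extra list ever did become a concern, one possible alternative (an assumption here, not something proposed in the thread) is the standard-library `array` module, which stores the lengths as packed C ints rather than as references to Python objects. A minimal comparison; the exact sizes depend on the Python build, so none are quoted:

```python
import sys
from array import array

n = 1201798

as_list = [500] * n               # one reference per row length
as_array = array("i", [500]) * n  # packed C ints, typically 4 bytes each

list_mb = sys.getsizeof(as_list) / 1e6
array_mb = sys.getsizeof(as_array) / 1e6
print("list:  %.2f MB" % list_mb)
print("array: %.2f MB" % array_mb)
```

On a typical 64-bit CPython the array should come out at about half the size of the list, since each slot holds a C int instead of an 8-byte pointer.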

I can't think of a case where I would want to just skip bad rows.
Also, I'd definitely like to know about each line that had problems in
an error log if we're going to go through the whole file anyway.  No
hurry on this, just getting my thoughts out there after my experience.
 I will post some test cases tonight probably.
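
As an illustration of the kind of error log meant here, a minimal sketch using the standard `logging` module. The function name `log_bad_lines`, the logger name, and the message format are all hypothetical; nothing here is an existing genfromtxt feature.

```python
import logging

logger = logging.getLogger("genfromtxt_debug")  # hypothetical logger name


def log_bad_lines(lines, nbcols, delimiter=","):
    """Log every line whose field count differs from nbcols.

    Returns the number of bad lines. Nothing is raised inside the loop,
    so the whole input is scanned before the caller decides what to do.
    """
    nbad = 0
    for line_no, line in enumerate(lines, start=1):
        found = len(line.rstrip("\n").split(delimiter))
        if found != nbcols:
            nbad += 1
            logger.warning("line %d: expected %d columns, found %d",
                           line_no, nbcols, found)
    return nbad


logging.basicConfig(format="%(levelname)s %(message)s")
sample = ["a,b,c", "d,e", "f,g,h", "i,j,k,l"]
nbad = log_bad_lines(sample, nbcols=3)
if nbad:
    print("%d bad line(s); see log for details" % nbad)
```

With the sample above, lines 2 and 4 are reported, and the caller can still raise a single exception at the end once every problem has been logged.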

Skipper