[Numpy-discussion] Question about improving genfromtxt errors

Tue Sep 29 13:57:19 EDT 2009

On 09/29/2009 11:37 AM, Christopher Barker wrote:
> Pierre GM wrote:
>    
>> I was thinking about something this week-end: we could create a second
>> list when looping on the rows, where we would store the length of each
>> splitted row. After the loop, we can find if these values don't match
>> the expected number of columns `nbcols` and where. Then, we can decide
>> to strip the `rows` list of its invalid values (that corresponds to
>> skipping) or raise an exception, but in both cases we know where the
>> problem is.
>> My only concern is that we'd be creating yet another list of integers,
>> which would increase memory usage. Would it be a problem ?
>>      
> I doubt it would be that big deal, however...
>    
Probably a bigger concern than memory is the execution time involved in
printing these problem rows.

There are already two loops over the data where you can measure the
number of elements in each row, but the first may be more appropriate.

So a simple solution is that in the first loop you could append the
'bad' rows to one list and the 'good' rows to the existing rows list,
or just store the row numbers that are bad.

Untested code for corresponding part of io.py:

     row_bad = []  # store bad rows
     bad_row_numbers = []  # store just the row number
     # simple row counter; it probably should count from the first data
     # row, not the first line of the file
     row_number = 0
     for line in itertools.chain([first_line,], fhd):
         # increment row counter for every line read, including empty ones
         row_number = row_number + 1
         values = split_line(line)
         # Skip an empty line
         if len(values) == 0:
             continue
         # Select only the columns we need
         if usecols:
             values = [values[_] for _ in usecols]
         # Check whether we need to update the converter
         if dtype is None:
             for (converter, item) in zip(converters, values):
                 converter.upgrade(item)
         if len(values) != nbcols:
             # store bad row so the user can search for that line
             row_bad.append(line)
             # store just the bad row number so the user can go to the
             # appropriate line(s) in the file
             bad_row_numbers.append(row_number)
         else:
             append_to_rows(tuple(values))

Note I assume that nbcols is the expected number of columns but I seem 
to be one off with my counting.
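
To try the idea outside of io.py, here is a minimal standalone sketch. It
is not the genfromtxt code: the function name partition_rows, the plain
str.split on a delimiter, and the 1-based line numbering are all
assumptions made only for illustration.

```python
def partition_rows(lines, nbcols, delimiter=","):
    """Separate rows with exactly nbcols fields from rows that do not match.

    Hypothetical helper for illustration; it is not part of numpy.
    """
    rows = []      # rows with the expected number of columns
    bad_rows = []  # (line_number, raw_line) pairs for mismatched rows
    for line_number, line in enumerate(lines, start=1):
        # Skip an empty line, but still count it in line_number
        if not line.strip():
            continue
        values = [v.strip() for v in line.split(delimiter)]
        if len(values) != nbcols:
            bad_rows.append((line_number, line))
        else:
            rows.append(tuple(values))
    return rows, bad_rows


sample = ["1,2,3", "4,5", "", "6,7,8", "9,10,11,12"]
rows, bad_rows = partition_rows(sample, nbcols=3)
```

Keeping the line number next to the raw line means the user can either
search for the text or jump straight to the line in the file.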

Then if len(row_bad) is greater than zero you could print out a warning
listing the bad rows, and then either raise an exception or continue.
The problem with continuing is that a user may not notice the warning.
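
The choice between raising and continuing could be left to the caller.
The sketch below assumes a hypothetical helper report_bad_rows with a
hypothetical raise_on_error flag; neither name comes from io.py. It only
uses the standard warnings module.

```python
import warnings


def report_bad_rows(bad_row_numbers, raise_on_error=True):
    """Raise or warn about rows with the wrong number of columns.

    Hypothetical helper for illustration; it is not part of numpy.
    """
    if not bad_row_numbers:
        return  # nothing to report
    message = "Wrong number of columns at line(s): " + ", ".join(
        str(number) for number in bad_row_numbers
    )
    if raise_on_error:
        # Stop here so the problem cannot go unnoticed
        raise ValueError(message)
    # Otherwise continue, but tell the user which lines were skipped
    warnings.warn(message)
```

Defaulting to an exception is one way to address the concern that a
warning alone may be missed.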

Bruce