[Numpy-discussion] Question about improving genfromtxt errors
Bruce Southey
bsouthey at gmail.com
Tue Sep 29 13:57:19 EDT 2009
On 09/29/2009 11:37 AM, Christopher Barker wrote:
> Pierre GM wrote:
>
>> I was thinking about something this week-end: we could create a second
>> list when looping on the rows, where we would store the length of each
>> splitted row. After the loop, we can find if these values don't match
>> the expected number of columns `nbcols` and where. Then, we can decide
>> to strip the `rows` list of its invalid values (that corresponds to
>> skipping) or raise an exception, but in both cases we know where the
>> problem is.
>> My only concern is that we'd be creating yet another list of integers,
>> which would increase memory usage. Would it be a problem ?
>>
> I doubt it would be that big deal, however...
>
The execution time involved in printing these problem rows is probably
a bigger concern than the memory.
There are already two loops over the data where you could check the
number of elements in each row, but the first may be more appropriate.
So a simple solution is to do the check in the first loop: append the
'bad' rows to one list and the 'good' rows to the existing rows list,
or just store the row numbers of the bad rows.
Untested code for corresponding part of io.py:
row_bad = []          # store bad rows
bad_row_numbers = []  # store just the row number
# simple row counter that probably should be the first data row,
# not the first line of the file
row_number = 0
for line in itertools.chain([first_line,], fhd):
    values = split_line(line)
    # Skip an empty line
    if len(values) == 0:
        continue
    # Select only the columns we need
    if usecols:
        values = [values[_] for _ in usecols]
    # Check whether we need to update the converter
    if dtype is None:
        for (converter, item) in zip(converters, values):
            converter.upgrade(item)
    if len(values) != nbcols:
        # store bad row so the user can search for that line
        row_bad.append(line)
        # store just the bad row number so the user can go to the
        # appropriate line(s) in the file
        bad_row_numbers.append(row_number)
    else:
        append_to_rows(tuple(values))
    row_number = row_number + 1  # increment row counter
Note that I assume nbcols is the expected number of columns, but I seem
to be off by one in my counting.
Then, if len(row_bad) is greater than zero, you could print a warning
that lists the bad rows and then either raise an exception or continue.
The problem with continuing is that a user may not notice the warning.
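To make the raise-or-continue choice concrete, here is a minimal
standalone sketch of the same idea outside io.py. It is untested against
genfromtxt itself, and the function name split_rows, the raise_on_bad
flag, and the comma delimiter are all assumptions made for illustration,
not existing numpy names. Unlike the snippet above, it counts physical
lines from 1 so the reported numbers match the file.

```python
import warnings


def split_rows(lines, nbcols, delimiter=",", raise_on_bad=True):
    """Split lines into fields, setting aside rows with the wrong length.

    Hypothetical helper for illustration only; not part of numpy.
    """
    rows = []      # tuples of fields for rows that have nbcols columns
    bad_rows = []  # (line number, raw line) for every mismatched row
    for line_number, line in enumerate(lines, start=1):
        stripped = line.strip()
        # Skip an empty line; enumerate still counts it, so the
        # reported numbers stay aligned with the file.
        if not stripped:
            continue
        values = [v.strip() for v in stripped.split(delimiter)]
        if len(values) != nbcols:
            bad_rows.append((line_number, line.rstrip("\n")))
        else:
            rows.append(tuple(values))

    if bad_rows:
        details = "\n".join(
            "    line %d: %r" % (number, text) for number, text in bad_rows
        )
        message = "%d row(s) do not have %d columns:\n%s" % (
            len(bad_rows), nbcols, details)
        if raise_on_bad:
            raise ValueError(message)
        # Continuing is allowed, but the caller also gets bad_rows back
        # so the problem is not only visible in the warning.
        warnings.warn(message)
    return rows, bad_rows
```

Returning the bad rows alongside the good ones is one way around the
concern that a warning alone can go unnoticed: the caller can check
whether the second list is empty.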
Bruce