[Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

Torgil Svensson torgil.svensson at gmail.com
Sun Jul 8 16:15:15 EDT 2007


On 7/8/07, Timothy Hochberg <tim.hochberg at ieee.org> wrote:
>
>
> On 7/8/07, Torgil Svensson <torgil.svensson at gmail.com> wrote:
> > Given that both your script and the mlab version preload the whole
> > file before calling the numpy constructor, I'm curious how that
> > compares in speed to using numpy's fromiter function on your data.
> > Using fromiter should improve memory usage (by roughly 50%?).
> >
> > The drawback is for string columns, where we no longer know the
> > width of the largest item; I made it fall back to "object" in that
> > case.
> >
> > Attached is a fromiter version of your script. Possible speedups
> > could come from trying different approaches in the "convert_row"
> > function, for example using "zip" or "enumerate" instead of "izip".
>
> I suspect that you'd do better here if you removed a bunch of layers from
> the conversion functions. Right now it looks like:
> imap->chain->convert_row->tuple->generator->izip. That's
> five levels deep, and Python function calls are reasonably expensive. I
> would try to be a lot less clever and do something like:
>
>     def data_iterator(row_iter, delim):
>         row0 = row_iter.next().split(delim)
>         converters = find_formats(row0) # left as an exercise
>         yield tuple(f(x) for f, x in zip(converters, row0))
>         for row in row_iter:
>             yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
>


That sounds sane. I may have picked up some bad habits here and gotten
away with them since I'm very I/O-bound in these cases. My main
objective has been reducing the memory footprint to reduce swapping.
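
For reference, here's a minimal sketch of the fromiter approach (the
two-column schema, field names and converters below are placeholders,
not Vincent's actual data; a real string column would get an "object"
field as mentioned above):

import numpy as np

def load_csv_fromiter(path, delim=','):
    # fromiter needs the dtype up front; string columns fall back to
    # dtype "object" since the widest item isn't known until the end
    dtype = np.dtype([('a', np.int32), ('b', np.float64)])  # assumed schema
    converters = (int, float)
    f = open(path)
    f.readline()  # skip the header row
    rows = (tuple(c(x) for c, x in zip(converters, line.split(delim)))
            for line in f)
    # builds the array incrementally instead of materialising a full
    # list of parsed rows first
    return np.fromiter(rows, dtype=dtype)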


> That's just a sketch and I haven't timed it, but it cuts a few levels out of
> the call chain, so it has a reasonable chance of being faster. If you wanted
> to be really clever, you could use some exec magic after you figure out the
> conversion functions to compile a special function that generates the tuples
> directly without any use of tuple or zip. I don't have time to work through
> the details right now, but the code you would compile would end up looking
> like this:
>
> for (x0, x1, x2) in row_iter:
>    yield (int(x0), float(x1), float(x2))
>
> Here we've assumed that find_formats determined that there are three fields:
> an int and two floats. Once you have this info you can build an appropriate
> function and exec it. This would cut another couple of levels out of the
> call chain. Again, I haven't timed it or tried it, but it looks like it
> would be fun to try.
>
> -tim
>


Thank you for the lesson! Great tip. This opens up a variety of new
coding options. I've made an attempt at the fun part. Attached is a
version that generates the following generator code for the
__name__ == '__main__' section of Vincent's script:

def get_data_iterator(row_iter,delim):
    yield (int('1'),int('3'),datestr2num('1/97'),float('1.12'),float('2.11'),float('1.2'))
    for row in row_iter:
        x0,x1,x2,x3,x4,x5=row.split(delim)
        yield (int(x0),int(x1),datestr2num(x2),float(x3),float(x4),float(x5))
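
Since the list archive scrubs attachments, here's a minimal sketch of
how such a specialised generator can be built and exec'd (the helper
name and exact structure are a reconstruction, not necessarily what
the attached script does):

def make_data_iterator(first_row, converters, delim):
    # compile a generator specialised to this file's column count and
    # converter functions, producing code like the snippet above
    names = ['x%d' % i for i in range(len(first_row))]
    first = ','.join('%s(%r)' % (f.__name__, x)
                     for f, x in zip(converters, first_row))
    calls = ','.join('%s(%s)' % (f.__name__, n)
                     for f, n in zip(converters, names))
    src = ('def get_data_iterator(row_iter,delim):\n'
           '    yield (%s)\n'
           '    for row in row_iter:\n'
           '        %s=row.split(delim)\n'
           '        yield (%s)\n' % (first, ','.join(names), calls))
    # exec into a namespace that holds the converters, so non-builtins
    # like datestr2num resolve when the generated code runs
    namespace = dict((f.__name__, f) for f in converters)
    exec(src, namespace)
    return namespace['get_data_iterator']

Called as make_data_iterator(row0, [int, int, datestr2num, float,
float, float], delim) on Vincent's first data row, this should
reproduce the generated code quoted above.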

Best Regards,

//Torgil
-------------- next part --------------
A non-text attachment was scrubbed...
Name: load_gen_iter.py
Type: text/x-python
Size: 2823 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20070708/e60bc636/attachment.py>

