[Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

Mon Jul 9 23:18:05 EDT 2007

On 7/9/07, Timothy Hochberg <tim.hochberg at ieee.org> wrote:
>
>
>
> On 7/9/07, Torgil Svensson <torgil.svensson at gmail.com> wrote:
> >
> > Elegant solution. Very readable and takes care of row0 nicely.
> >
> > I want to point out that this is much more efficient than my version
> > for random/late string representation changes throughout the
> > conversion, but it suffers from a 2*n memory footprint and large block
> > copying if the string rep change arrives very early on huge datasets.
>
>
> Yep.
>
> > I think we can't have the best of both, and Tim's solution is better in the
> > general case.
>
>
> It probably would not be hard to do a hybrid version. One issue is that
> one doesn't, in general, know the size of the dataset in advance, so you'd
> have to use an absolute criterion (less than 100 lines) instead of a relative
> criterion (less than 20% done). I suppose you could stat the file or
> something, but that seems like overkill.
>
>
> > Maybe "use one_alt if rownumber < xxx else use other_alt" can
> > fine-tune performance for some cases. But even then, with many cols,
> > it's nearly impossible to know.
>
>
> That sounds sensible. I have an interesting thought on how to do this that's
> a bit hard to describe. I'll try to throw it together and post another
> version today or tomorrow.
>

OK, as promised, here's an approach that rebuilds the array if the format
changes, as long as fewer than 'restart_length' lines have been processed.
Otherwise, it uses the old strategy. Perhaps not the most efficient way, but
it reuses what I'd already written with minimal changes. It's still pretty
rough -- once again I didn't bother to polish it.
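
The control flow described above may be easier to follow with NumPy out of the way. What follows is only a minimal pure-Python sketch of the same idea, assuming plain tuples stand in for record-array rows. The names guess_converter, widen, load_rows and merge are hypothetical and are not part of the posted code.

```python
def guess_converter(field):
    """Return the narrowest of int, float, str that can parse `field`."""
    for cvt in (int, float):
        try:
            cvt(field)
        except ValueError:
            continue
        return cvt
    return str

RANK = {int: 0, float: 1, str: 2}  # int < float < str

def widen(old, new):
    """Pick the more general of two converters."""
    return old if RANK[old] >= RANK[new] else new

def load_rows(lines, delim=',', restart_length=20):
    """Convert delimited lines into a list of (converters, rows) blocks."""
    chunks = []        # finished blocks
    converters = None  # converters of the block being built
    rows = []          # rows of the block being built
    for line in lines:
        fields = [x.strip() for x in line.split(delim)]
        if converters is None:
            converters = [guess_converter(x) for x in fields]
        try:
            row = tuple(f(x) for f, x in zip(converters, fields))
        except ValueError:
            widened = [widen(c, guess_converter(x))
                       for c, x in zip(converters, fields)]
            if not chunks and len(rows) < restart_length:
                # Early change: redo the few rows converted so far.
                rows = [tuple(f(v) for f, v in zip(widened, r)) for r in rows]
            else:
                # Late change: keep the finished block and start a new one.
                chunks.append((converters, rows))
                rows = []
            converters = widened
            row = tuple(f(x) for f, x in zip(converters, fields))
        rows.append(row)
    if converters is not None:
        chunks.append((converters, rows))
    return chunks

def merge(chunks):
    """Bring every block up to the final converters and join the rows."""
    if not chunks:
        return []
    final = chunks[-1][0]
    return [tuple(f(v) for f, v in zip(final, row))
            for _, rows in chunks for row in rows]
```

For the four lines '1,a', '2,b', '3.5,c', '4,d', the default restart_length gives a single block because the first two rows are simply reconverted. With restart_length=2 the first two rows are kept as their own block, and merge brings both blocks to the final format.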

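import numpy as N

def find_formats(items, last):
    # Guess a (dtype, converter) pair for each field; 'last' holds the
    # pairs from the previous pass, or None on the first pass.
    formats = []
    for i, x in enumerate(items):
        dt, cvt = string_to_dt_cvt(x)
        if last is not None:
            # Pairs are stored as (dtype, converter), so unpack in that order.
            last_dt, last_cvt = last[i]
            if last_cvt is float and cvt is int:
                # Never narrow a float column back to int.
                dt, cvt = last_dt, float
        formats.append((dt, cvt))
    return formats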

class LoadInfo(object):
    def __init__(self, row0):
        self.done = False
        self.lastcols = None
        self.row0 = row0
        self.predata = ()

def data_iterator(lines, converters, delim, info):
    for x in info.predata:
        yield x
    info.predata = ()
    yield tuple(f(x) for f, x in zip(converters, info.row0.split(delim)))
    try:
        for row in lines:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
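    except ValueError:
        # A converter rejected this row: remember it so the caller can
        # re-detect the formats starting from here.
        info.row0 = row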
    else:
        info.done = True

def load2(fname, delim=',', has_varnm=True, prn_report=True,
          restart_length=20):
    """
    Loading data from a delimited text file, guessing column types.
    Returns a recarray.
    """
    f=open(fname,'rb')

    if has_varnm:
        varnames = [i.strip() for i in f.next().split(delim)]
    else:
        varnames = None

    info = LoadInfo(f.next())
    chunks = []

    while not info.done:

        row0 = info.row0.split(delim)
        formats = find_formats(row0, info.lastcols)
        # Remember these formats so the next pass does not narrow a column.
        info.lastcols = formats
        if varnames is None:
            varnames = ['col%s' % (i+1) for i, _ in enumerate(formats)]
        descr=[]
        conversion_functions=[]
        for name, (dtype, cvt_fn) in zip(varnames, formats):
            descr.append((name,dtype))
            conversion_functions.append(cvt_fn)

        if len(chunks) == 1 and len(chunks[0]) < restart_length:
            info.predata = chunks[0].astype(descr)
            chunks = []

        chunks.append(N.fromiter(
            data_iterator(f, conversion_functions, delim, info), descr))

    if len(chunks) > 1:
        n = sum(len(x) for x in chunks)
        data = N.zeros([n], chunks[-1].dtype)
        offset = 0
        for x in chunks:
            delta = len(x)
            data[offset:offset+delta] = x
            offset += delta
    else:
        [data] = chunks

    # load report
    if prn_report:
        print "##########################################\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (
                i[0], i[1], str(data[i[0]][0:3]))
        print "\n##########################################\n"

    return data
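
The code above also needs string_to_dt_cvt, which was posted earlier in the thread and is not repeated in this message. A minimal stand-in, assuming it returns a (dtype string, converter) pair in that order, might look like the following. The dtype strings 'i8', 'f8' and 'S<n>' are assumptions, not necessarily what the earlier version returned.

```python
def string_to_dt_cvt(s):
    """Guess a (dtype string, converter) pair for one field (hypothetical)."""
    s = s.strip()
    for dt, cvt in (('i8', int), ('f8', float)):
        try:
            cvt(s)
        except ValueError:
            continue
        return dt, cvt
    # Fall back to a fixed-width string sized from this one sample;
    # longer values seen later would not fit in a column of this width.
    return 'S%d' % max(len(s), 1), str
```

Sizing the string width from a single sample is the weakest part of this guess, so a real version would probably want to widen string columns as well.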