[Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names
Timothy Hochberg
tim.hochberg at ieee.org
Mon Jul 9 23:18:05 EDT 2007
On 7/9/07, Timothy Hochberg <tim.hochberg at ieee.org> wrote:
>
>
>
> On 7/9/07, Torgil Svensson <torgil.svensson at gmail.com> wrote:
> >
> > Elegant solution. Very readable and takes care of row0 nicely.
> >
> > I want to point out that this is much more efficient than my version
> > for random/late string representation changes throughout the
> > conversion, but it suffers from a 2*n memory footprint and large block
> > copying if the string representation change arrives very early on huge
> > datasets.
>
>
> Yep.
>
> > I think we can't have the best of both, and Tim's solution is better in the
> > general case.
>
>
> It probably would not be hard to do a hybrid version. One issue is that
> one doesn't, in general, know the size of the dataset in advance, so you'd
> have to use an absolute criterion (fewer than 100 lines) instead of a relative
> criterion (less than 20% done). I suppose you could stat the file or
> something, but that seems like overkill.
>
>
> > Maybe "use one_alt if rownumber < xxx else use other_alt" can
> > fine-tune performance for some cases. But even then, with many cols,
> > it's nearly impossible to know.
>
>
> That sounds sensible. I have an interesting thought on how to do this that's
> a bit hard to describe. I'll try to throw it together and post another
> version today or tomorrow.
>
OK, as promised, here's an approach that rebuilds the array if the format
changes, as long as fewer than 'restart_length' lines have been processed.
Otherwise, it uses the old strategy. Perhaps not the most efficient way, but
it reuses what I'd already written with minimal changes. It's still pretty
rough -- once again I didn't bother to polish it.
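
To make the strategy concrete before the real code, here is a minimal pure-Python sketch for a single numeric column. It is not the implementation below: convert_column is a hypothetical name, lists stand in for arrays, and the only format change it handles is int to float. If the change shows up while fewer than restart_length rows have been converted, those rows are rebuilt in the new format; otherwise they are kept as they are and a new chunk is started, to be merged at the end.

```python
def convert_column(values, restart_length=20):
    """Convert strings to numbers, restarting or chunking on a format change.

    Hypothetical illustration only: handles a single int -> float change.
    Returns (chunks, merged).
    """
    chunks = [[]]
    convert = int
    for text in values:
        try:
            item = convert(text)
        except ValueError:
            # Format change: parse as float from here on. If float() also
            # fails, the ValueError propagates to the caller.
            convert = float
            item = convert(text)
            if len(chunks) == 1 and len(chunks[0]) < restart_length:
                # Few rows so far: cheap to rebuild them in the new format.
                chunks[0] = [float(v) for v in chunks[0]]
            else:
                # Many rows done: keep them and start a new chunk.
                chunks.append([])
        chunks[-1].append(item)
    # Final merge: copy every chunk into one sequence in the last format.
    merged = [convert(v) for chunk in chunks for v in chunk]
    return chunks, merged
```

With ['1', '2', '3.5', '4'] and the default restart_length, the first two rows are rebuilt as floats and everything ends up in one chunk. With restart_length=2 the two int rows stay in their own chunk and are only promoted in the final merge, which is where the extra copy comes from.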
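
import numpy as N

# string_to_dt_cvt is not defined in this message; it is assumed to be
# carried over from the earlier version and to return (dtype, converter).

def find_formats(items, last):
    formats = []
    for i, x in enumerate(items):
        dt, cvt = string_to_dt_cvt(x)
        if last is not None:
            # Entries are stored as (dtype, converter), so unpack in that order.
            last_dt, last_cvt = last[i]
            if last_cvt is float and cvt is int:
                # Never demote a float column back to int; keep the dtype too.
                dt, cvt = last_dt, float
        formats.append((dt, cvt))
    return formats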

class LoadInfo(object):
    def __init__(self, row0):
        self.done = False
        self.lastcols = None
        self.row0 = row0
        self.predata = ()
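
def data_iterator(lines, converters, delim, info):
    # First replay any rows rebuilt from an early, discarded chunk.
    for x in info.predata:
        yield x
    info.predata = ()
    yield tuple(f(x) for f, x in zip(converters, info.row0.split(delim)))
    try:
        for row in lines:
            yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
    except ValueError:
        # A converter rejected this row: remember it so the caller can
        # pick new formats and resume from here.
        info.row0 = row
    else:
        info.done = True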

def load2(fname, delim=',', has_varnm=True, prn_report=True,
          restart_length=20):
    """
    Load delimited data from a file. Returns a NumPy record array.
    """
    f = open(fname, 'rb')
    if has_varnm:
        varnames = [i.strip() for i in f.next().split(delim)]
    else:
        varnames = None
    info = LoadInfo(f.next())
    chunks = []
    while not info.done:
        row0 = info.row0.split(delim)
        formats = find_formats(row0, info.lastcols)
        # Remember the formats so later chunks never demote float to int.
        info.lastcols = formats
        if varnames is None:
            varnames = ['col%s' % str(i+1) for i, _ in
                        enumerate(formats)]
        descr = []
        conversion_functions = []
        for name, (dtype, cvt_fn) in zip(varnames, formats):
            descr.append((name, dtype))
            conversion_functions.append(cvt_fn)
        if len(chunks) == 1 and len(chunks[0]) < restart_length:
            # Format changed early: rebuild the few rows read so far.
            info.predata = chunks[0].astype(descr)
            chunks = []
        chunks.append(N.fromiter(data_iterator(f, conversion_functions,
                                               delim, info), descr))
    if len(chunks) > 1:
        # Format changed late: copy all chunks into one array of the final dtype.
        n = sum(len(x) for x in chunks)
        data = N.zeros([n], chunks[-1].dtype)
        offset = 0
        for x in chunks:
            delta = len(x)
            data[offset:offset+delta] = x
            offset += delta
    else:
        [data] = chunks
    # load report
    if prn_report:
        print "##########################################\n"
        print "Loaded file: %s\n" % fname
        print "Nr obs: %s\n" % data.shape[0]
        print "Variables and datatypes:\n"
        for i in data.dtype.descr:
            print "Varname: %s, Type: %s, Sample: %s" % (i[0], i[1],
                                                         str(data[i[0]][0:3]))
        print "\n##########################################\n"
    return data
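
The listing calls string_to_dt_cvt, which is not defined in this message. For anyone who wants to try the code on its own, here is a minimal stand-in. It is a guess at the interface, not the original helper: it assumes the function takes one field as a string and returns a (dtype, converter) pair, and the dtype strings 'i8', 'f8' and 'S<n>' are my choice.

```python
def string_to_dt_cvt(text):
    """Guess a (dtype, converter) pair for one field (hypothetical stand-in)."""
    text = text.strip()
    # Try the narrowest numeric type first, then fall back.
    for dtype, converter in (('i8', int), ('f8', float)):
        try:
            converter(text)
        except ValueError:
            continue
        return dtype, converter
    # Fixed-width string sized to this sample; longer values seen later
    # would be truncated, which a real helper would need to handle.
    return 'S%d' % max(len(text), 1), str
```

Under these assumptions '42' maps to ('i8', int), '3.5' to ('f8', float), and 'abc' to ('S3', str).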