[Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

Vincent Nijs v-nijs at kellogg.northwestern.edu
Wed Jul 18 21:47:39 EDT 2007


Hi Torgil,

1. I got an email from Tim about this issue:

"I finally got around to doing some more quantitative comparisons between
your code and the more complicated version that I proposed. The idea behind
my code was to minimize memory usage -- I figured that keeping the memory
usage low would make up for any inefficiencies in the conversion process
since it's been my experience that memory bandwidth dominates a lot of
numeric problems as problem sizes get reasonably large. I was mostly wrong.
While it's true that for very large file sizes I can get my code to
outperform yours, in most instances it lags behind. And the range where it
does better is a fairly small range right before the machine dies with a
memory error. So my conclusion is that the extra hoops my code goes through
to avoid allocating extra memory isn't worth it for you to bother with."

The approach in my code is simple and robust to most data issues I could
come up with. It will actually do an appropriate conversion if there are
missing values or ints and floats in the same column. It will select an
appropriate string length as well. It may not be the most memory-efficient
setup, but given Tim's comments it is a pretty decent solution for the types
of data I have access to.

2. Fixed the spelling error :)

3. I guess that is the same thing. I am not very familiar with zip, izip,
map etc. just yet :) Thanks for the tip!

4. iter() is the name I gave to the function generated using exec. I need
that function to transform the data using the types provided by the user.
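
The points above can be sketched roughly as follows. This is not the
attached program, only a minimal illustration written for current Python 3
and NumPy. The names infer_type, convert_rows, and load_csv, the sample data,
and the exact promotion rules are all assumptions made for the example. The
conversion is done by a plain generator rather than a function built with
exec, so the built-in iter() is left alone.

```python
import csv
import io

import numpy as np

# Hypothetical sample data: a header row, one missing field, and a column
# that mixes ints and floats.
SAMPLE = "id,price,name\n1,2,apple\n2,,kiwi\n3,4.5,banana\n"


def infer_type(values):
    """Pick a type for one column of strings (illustrative rules only)."""
    present = [v for v in values if v != ""]
    missing = len(present) < len(values)
    try:
        for v in present:
            int(v)
        # An int column cannot hold nan, so a column with gaps becomes float.
        return float if missing else int
    except ValueError:
        pass
    try:
        for v in present:
            float(v)
        return float
    except ValueError:
        # Fall back to a string wide enough for the longest entry.
        return "U%d" % max([len(v) for v in values] + [1])


def convert_rows(rows, kinds):
    """Yield one tuple per row, converted with the per-column types."""
    for row in rows:
        out = []
        for v, kind in zip(row, kinds):
            if kind is float:
                out.append(float(v) if v != "" else np.nan)
            elif kind is int:
                out.append(int(v))
            else:
                out.append(v)
        yield tuple(out)


def load_csv(text):
    """Read CSV text with a header row into a record array."""
    rows = list(csv.reader(io.StringIO(text)))
    names, body = rows[0], rows[1:]
    kinds = [infer_type(col) for col in zip(*body)]
    # zip pairs names with types directly; in Python 3 it is lazy, so it is
    # wrapped in list() to get the list of (name, type) pairs.
    dtype = list(zip(names, kinds))
    # convert_rows(...) is already an iterator; wrapping it in the built-in
    # iter() would just return the same object.
    arr = np.fromiter(convert_rows(body, kinds), dtype=dtype, count=len(body))
    return arr.view(np.recarray)


data = load_csv(SAMPLE)
print(data.dtype)
print(data.price)
```

The whole file is read into memory first so that each column can be
inspected before a type is chosen, which is the simple but less
memory-efficient trade-off discussed in point 1.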

Best,

Vincent

On 7/18/07 7:57 PM, "Torgil Svensson" <torgil.svensson at gmail.com> wrote:

> Nice,
> 
> I haven't gone through all details. That's a nice new "missing"
> feature, maybe all instances where we can't find a conversion should
> be "nan". A few comments:
> 
> 1. The "load_search" function contains all the memory/performance
> overhead that we wanted to avoid with the fromiter function. Does this
> mean that you no longer have large text files that change string
> representation in the columns (aka "0" floats)?
> 
> 2. ident=" "*4
> This has the same spelling error as in my first compile try .. it was
> meant to be "indent"
> 
> 3. types = list((i,j) for i, j in zip(varnm, types2))
> Isn't this the same as "types = zip(varnm, types2)" ?
> 
> 4.  return N.fromiter(iter(reader),dtype = types)
> Isn't "reader" an iterator already? What does the "iter()" operator do
> in this case?
> 
> Best regards,
> 
> //Torgil
> 
> 
> On 7/18/07, Vincent Nijs <v-nijs at kellogg.northwestern.edu> wrote:
>> 
>>  I combined some of the very useful comments/code from Tim and Torgil and
>> came up with the attached program to read csv files and convert the data
>> into a recarray. I couldn't use all of their suggestions because, frankly, I
>> didn't understand all of them :)
>> 
>>  The program uses variable names if provided in the csv-file and can
>> auto-detect data types. However, I also wanted to make it easy to specify
>> data types and/or variable names if so desired. Examples are at the bottom
>> of the file. Comments are very welcome.
>> 
>>  Thanks,
>> 
>>  Vincent
>> _______________________________________________
>> Numpy-discussion mailing list
>> Numpy-discussion at scipy.org
>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>> 
>> 
>> 

-- 
Vincent R. Nijs
Assistant Professor of Marketing
Kellogg School of Management, Northwestern University
2001 Sheridan Road, Evanston, IL 60208-2001
Phone: +1-847-491-4574 Fax: +1-847-491-2498
E-mail: v-nijs at kellogg.northwestern.edu
Skype: vincentnijs



