[Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names
Torgil Svensson
torgil.svensson at gmail.com
Thu Jul 19 07:34:51 EDT 2007
Hi again,
On 7/19/07, Torgil Svensson <torgil.svensson at gmail.com> wrote:
> If memory really is an issue, you have the nice "load_spec" version
> and can always convert the files once by iterating over the file twice
> like the attached script does.
I discovered that my script was broken and too complex. The attached
script is much cleaner and has better error messages.
Best regards,
//Torgil
On 7/19/07, Torgil Svensson <torgil.svensson at gmail.com> wrote:
> Hi,
>
> 1. Your code is fast because you convert whole columns at once in
> numpy. The first step with the lists is also very fast (python
> implements lists as arrays). I like your version, I think it's as fast
> as it gets in pure python and has to keep only two versions of the
> data at once in memory (since the string versions can be garbage
> collected).
>
> If memory really is an issue, you have the nice "load_spec" version
> and can always convert the files once by iterating over the file twice
> like the attached script does.
>
>
> 4. Okay, that makes sense. I was confused by the fact that your
> generated function had the same name as the builtin iter() operator.
>
>
> //Torgil
>
>
> On 7/19/07, Vincent Nijs <v-nijs at kellogg.northwestern.edu> wrote:
> >
> > Hi Torgil,
> >
> > 1. I got an email from Tim about this issue:
> >
> > "I finally got around to doing some more quantitative comparisons between
> > your code and the more complicated version that I proposed. The idea behind
> > my code was to minimize memory usage -- I figured that keeping the memory
> > usage low would make up for any inefficiencies in the conversion process
> > since it's been my experience that memory bandwidth dominates a lot of
> > numeric problems as problem sizes get reasonably large. I was mostly wrong.
> > While it's true that for very large file sizes I can get my code to
> > outperform yours, in most instances it lags behind. And the range where it
> > does better is a fairly small range right before the machine dies with a
> > memory error. So my conclusion is that the extra hoops my code goes through
> > to avoid allocating extra memory aren't worth it for you to bother with."
> >
> > The approach in my code is simple and robust to most data issues I could
> > come up with. It actually will do an appropriate conversion if there are
> > missing values or ints and floats in the same column. It will select an
> > appropriate string length as well. It may not be the most memory efficient
> > setup but given Tim's comments it is a pretty decent solution for the types
> > of data I have access to.
> >
> > 2. Fixed the spelling error :)
> >
> > 3. I guess that is the same thing. I am not very familiar with zip, izip,
> > map etc. just yet :) Thanks for the tip!
> >
> > 4. I called the function generated using exec, iter(). I need that function
> > to transform the data using the types provided by the user.
> >
> > Best,
> >
> > Vincent
> >
> >
> > On 7/18/07 7:57 PM, "Torgil Svensson" <torgil.svensson at gmail.com> wrote:
> >
> > > Nice,
> > >
> > > I haven't gone through all details. That's a nice new "missing"
> > > feature, maybe all instances where we can't find a conversion should
> > > be "nan". A few comments:
> > >
> > > 1. The "load_search" function contains all the memory/performance
> > > overhead that we wanted to avoid with the fromiter function. Does this
> > > mean that you no longer have large text-files that change string
> > > representation in the columns (aka "0" floats) ?
> > >
> > > 2. ident=" "*4
> > > This has the same spelling error as in my first compile try .. it was
> > > meant to be "indent"
> > >
> > > 3. types = list((i,j) for i, j in zip(varnm, types2))
> > > Isn't this the same as "types = zip(varnm, types2)" ?
> > >
> > > 4. return N.fromiter(iter(reader),dtype = types)
> > > Isn't "reader" an iterator already? What does the "iter()" operator do
> > > in this case?
> > >
> > > Best regards,
> > >
> > > //Torgil
> > >
> > >
> > > On 7/18/07, Vincent Nijs <v-nijs at kellogg.northwestern.edu> wrote:
> > >>
> > >> I combined some of the very useful comments/code from Tim and Torgil and
> > >> came up with the attached program to read csv files and convert the data
> > >> into a recarray. I couldn't use all of their suggestions because, frankly,
> > >> I didn't understand all of them :)
> > >>
> > >> The program uses variable names if provided in the csv-file and can
> > >> auto-detect data types. However, I also wanted to make it easy to specify
> > >> data types and/or variable names if so desired. Examples are at the bottom
> > >> of the file. Comments are very welcome.
> > >>
> > >> Thanks,
> > >>
> > >> Vincent
> > >> _______________________________________________
> > >> Numpy-discussion mailing list
> > >> Numpy-discussion at scipy.org
> > >>
> > >> http://projects.scipy.org/mailman/listinfo/numpy-discussion
> > >>
> > >>
> > >>
> >
> > --
> > Vincent R. Nijs
> > Assistant Professor of Marketing
> > Kellogg School of Management, Northwestern University
> > 2001 Sheridan Road, Evanston, IL 60208-2001
> > Phone: +1-847-491-4574 Fax: +1-847-491-2498
> > E-mail: v-nijs at kellogg.northwestern.edu
> > Skype: vincentnijs
> >
> >
> >
> >
>
>
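
Points 3 and 4 in the quoted thread can be checked with a small sketch. The
column names, dtype strings and sample data below are made up for
illustration; only zip(), the builtin iter() and csv.reader come from the
discussion. Note that the thread predates Python 3, where zip() returns an
iterator rather than a list, so a list() call is needed there.

```python
import csv
import io

varnm = ["id", "price", "label"]   # hypothetical column names
types2 = ["i8", "f8", "U5"]        # hypothetical dtype strings

# Point 3: the generator expression only rebuilds the pairs zip() already yields.
verbose = list((i, j) for i, j in zip(varnm, types2))
direct = list(zip(varnm, types2))  # list() is required on Python 3
same_pairs = verbose == direct

# Point 4: the builtin iter() applied to an iterator returns that same object,
# so wrapping a csv reader in it changes nothing.
reader = csv.reader(io.StringIO("1,2.5,abc\n2,3.5,def\n"))
builtin_iter_is_noop = iter(reader) is reader
```

A user-defined function that is also called iter, as in Vincent's generated
code, shadows the builtin within its scope, which is why the call looked
redundant at first glance.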
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix_tricky_columns.py
Type: text/x-python
Size: 2668 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20070719/5ebbc4ea/attachment.py>
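
The attachment itself is not preserved in this archive. The following is a
minimal sketch of the "iterate over the file twice" idea from the thread, not
a reconstruction of fix_tricky_columns.py: the first pass finds the narrowest
type that fits every value in a column, the second pass feeds converted rows
to numpy.fromiter. All function names here (narrowest_kind, infer_kinds,
convert, load_two_pass) and the open_csv argument are hypothetical, and
treating empty fields as nan in float columns is an assumption.

```python
import csv
import io
import numpy as np

RANK = {int: 0, float: 1, str: 2}  # int < float < str

def narrowest_kind(text):
    """Return the narrowest of int, float, str that can hold this one field."""
    if text == "":
        return float               # assumption: a missing value becomes nan
    for cast in (int, float):
        try:
            cast(text)
            return cast
        except ValueError:
            pass
    return str

def infer_kinds(rows, ncols):
    """First pass: per-column kind and maximum string width."""
    kinds, widths = [int] * ncols, [1] * ncols
    for row in rows:
        for k, text in enumerate(row):
            kinds[k] = max(kinds[k], narrowest_kind(text), key=RANK.get)
            widths[k] = max(widths[k], len(text))
    return kinds, widths

def convert(text, kind):
    """Convert one field; empty text only reaches here for float or str columns."""
    if kind is str:
        return text
    return np.nan if text == "" else kind(text)

def load_two_pass(open_csv):
    """open_csv: zero-argument callable returning a fresh file-like object."""
    with open_csv() as fh:
        reader = csv.reader(fh)
        names = next(reader)       # assumes the first row holds column names
        kinds, widths = infer_kinds(reader, len(names))
    dtype = [(n, "U%d" % w if k is str else k)
             for n, k, w in zip(names, kinds, widths)]
    with open_csv() as fh:
        reader = csv.reader(fh)
        next(reader)               # skip the header on the second pass
        rows = (tuple(convert(t, k) for t, k in zip(row, kinds)) for row in reader)
        return np.fromiter(rows, dtype=dtype)

sample = "id,price,label\n1,2.5,abc\n2,,de\n3,4,f\n"
arr = load_two_pass(lambda: io.StringIO(sample))
```

Only one converted copy of the data is held in memory, at the cost of parsing
the file twice. For a file on disk, open_csv could be something like
lambda: open(path, newline=""), where path is whatever file you are loading.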