[Tutor] Processing CSV files

Leena Gupta gupta.leena at gmail.com
Wed Oct 9 01:26:34 CEST 2013


Dave, Alan - thanks for replying.

We have a box with 16 GB of RAM, so memory should hopefully not be an issue.

The datastore is Cassandra and I'm hoping to use the pycassa library for
interaction.

I do have an additional question related to Cassandra and Python. As part of
data processing, I need to fetch slices of data from Cassandra and run
computations such as sums and percentile calculations on them. The sums,
along with other attributes, need to be stored back in another Cassandra
table that will be queried by end users of a reporting system. This is
because Cassandra does not provide any aggregation functions, so we will
precompute the aggregations and store them in Cassandra.

So, for calculating the sum and percentile in Python, some of the data
slices in Cassandra could fetch a lot of rows (e.g. 750,000 to 1 million
rows), and since I need to compute a sum and a percentile, I need to
consider all the rows. I am planning to do this in Python.
Do you foresee any issues with this approach? Any advice on this would be
greatly appreciated.
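For what it's worth, here is a minimal sketch of what I have in mind. It
assumes each fetched row can be turned into a dict holding a numeric
"value" column (the column name and the row shape are just placeholders,
not the real schema), and it keeps one float per row in memory, roughly
8 MB per million rows, which should be fine on a 16 GB box:

```python
def percentile(values, pct):
    """Nearest-rank percentile (pct in 0-100) of a list of numbers."""
    if not values:
        raise ValueError("no data")
    ordered = sorted(values)
    # Index of the value sitting at the pct-th position of the sorted data.
    k = max(0, min(len(ordered) - 1, round(pct / 100.0 * (len(ordered) - 1))))
    return ordered[k]

def summarize(rows, value_key="value"):
    """Pull one numeric column out of an iterable of row dicts and
    return (total, 95th percentile). The rows iterable can be a lazy
    generator, e.g. wrapping a pycassa range query."""
    vals = [float(r[value_key]) for r in rows]
    return sum(vals), percentile(vals, 95)
```

The idea would then be to feed it the slice fetched from Cassandra and
write the resulting total and p95 back into the reporting table.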

Thanks a ton!


On Tue, Oct 8, 2013 at 2:28 PM, Dave Angel <davea at davea.name> wrote:

> On 8/10/2013 16:46, Leena Gupta wrote:
>
> > Hello,
> >
> > Looking for some inputs on Python's csv processing feature.
> >
> > I need to process a large CSV file every 5-10 minutes. The file could
> > contain 3 million to 10 million rows and could be 6 MB to 10 MB (or
> > more) in size. As part of the processing, I need to sum a numeric value,
> > grouping on certain attributes, and store the output in a datastore. I
> > wanted to know whether Python is recommended and whether it can be used
> > for processing data in CSV files of this size. Any issues that we need
> > to be aware of? I believe Python has a csv library as well.
> >
> > Thanks!
> >
>
> Please use text messages here, not html.  It not only wastes space, but
> frequently messes up formatting.
>
> Python's csv logic should have no problem dealing with a file of 10
> million rows.  As long as you're not trying to keep all 10 million of
> them in some internal data structure, the csv logic will deal you a row
> at a time, in a most incremental fashion.
>
> Just make sure the particular datastore you require is supported in
> Python.
>
>
> --
> DaveA
>
>
>
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> https://mail.python.org/mailman/listinfo/tutor
>

