Numpy and Terabyte data

Irving Duran irving.duran at gmail.com
Tue Jan 2 14:32:03 EST 2018


I've never heard of or done that kind of testing on a large dataset solely in
Python, so I don't know where the cap is on how much data Python can handle
based on memory availability.  Now, if I understand what you are trying to do,
you can achieve it by leveraging Apache Spark and invoking "pyspark", which
lets you store data in memory and/or on hard disk.  Also, if you are working
with Hadoop, you can use Spark to move/transfer data back and forth.
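
For what it's worth, a minimal PySpark sketch of that approach might look like
the one below; the input path and the "value" column are hypothetical
placeholders, and MEMORY_AND_DISK simply lets Spark spill partitions to disk
when they don't fit in RAM.

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("terabyte-stats").getOrCreate()

    # Hypothetical path to a Parquet dataset far larger than available RAM.
    df = spark.read.parquet("hdfs:///data/measurements")

    # Keep partitions in memory where possible, spill the rest to disk.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    # Spark computes these aggregates out-of-core, across the cluster.
    stats = df.agg(F.min("value"), F.max("value"), F.avg("value")).collect()
    print(stats)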


Thank You,

Irving Duran

On Tue, Jan 2, 2018 at 12:06 PM, <jason at apkudo.com> wrote:

> I'm not sure if I'll be laughed at, but statistics computed on a randomized
> sample should resemble those of the whole.
>
> If you need min/max then take min( min(each node) ) and max( max(each node) ).
> If you need the average then you need sum( sum(each node) ) / sum( count(each
> node) )*
>
> *You'll likely need to use log here, as you'll probably overflow.
>
> It doesn't really matter how much numpy can handle; you just need to collate
> the data properly and defer the actual calculation until the node
> calculations are complete.
>
> Also, numpy should store values more densely than python itself.
>
>
>
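
For the per-node scheme quoted above, a rough NumPy sketch might look like the
following; the in-memory chunks are hypothetical stand-ins for the data held on
each node, and accumulating the sums in a wider dtype is one alternative to the
log trick for avoiding overflow.

    import numpy as np

    def partial_stats(chunk):
        # Per-node pass: reduce one chunk to (min, max, sum, count).
        return chunk.min(), chunk.max(), chunk.sum(dtype=np.float64), chunk.size

    def combine(partials):
        # Defer the final calculation until every node has reported its partials.
        mins, maxs, sums, counts = zip(*partials)
        return min(mins), max(maxs), sum(sums) / sum(counts)

    # Three random float32 chunks standing in for three nodes' worth of data;
    # float32 stores each value in 4 bytes, far less than a boxed Python float.
    chunks = [np.random.rand(1_000_000).astype(np.float32) for _ in range(3)]
    print(combine(partial_stats(c) for c in chunks))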


