Numpy and Terabyte data

Albert-Jan Roskam sjeik_appie at hotmail.com
Wed Jan 3 15:03:30 EST 2018


On Jan 2, 2018 18:27, Rustom Mody <rustompmody at gmail.com> wrote:
>
> Someone who works in hadoop asked me:
>
> If our data is in terabytes, can we do statistical (i.e. numpy, pandas, etc.)
> analysis on it?
>
> I said: No (I don't think so, at least!), i.e. I expect numpy (pandas etc.)
> not to work if the data does not fit in memory.
>
> Well, sure, *python* can handle (streams of) terabyte data, I guess;
> *numpy* cannot.
>
> Is there a more sophisticated answer?
>
> ["Terabyte" is a just a figure of speech for "too large for main memory"]

Have a look at PySpark and pyspark.ml. PySpark has its own kind of DataFrame, which is partitioned across a cluster (or spilled to disk locally) rather than held in one machine's memory. Very, very cool stuff.
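
A minimal sketch of what that looks like (the file path and column names are made up for illustration; assumes a local Spark install):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stats").getOrCreate()

    # Spark reads the file lazily and in partitions, so the whole
    # dataset never has to fit in memory on a single machine.
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    # Descriptive statistics, computed in a distributed fashion.
    df.describe().show()

    # Grouped aggregation, roughly like pandas groupby().mean().
    df.groupBy("some_column").mean("some_value").show()

    spark.stop()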

Dask DataFrames have been mentioned already; they expose a pandas-like API over many smaller pandas partitions and process them out of core.
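
A minimal sketch (the glob pattern and column names are hypothetical):

    import dask.dataframe as dd

    # Dask builds a lazy task graph over chunked pandas DataFrames,
    # so the data is processed piece by piece instead of all at once.
    df = dd.read_csv("data/*.csv")

    # Mostly the same API as pandas; compute() triggers the actual work.
    result = df.groupby("some_column")["some_value"].mean().compute()
    print(result)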

numpy also has memory-mapped arrays (numpy.memmap), which let you index into on-disk data larger than RAM: https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.memmap.html
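
A minimal sketch, assuming a raw float64 file at data.bin (the filename, shape, and chunk size are made up):

    import numpy as np

    # Map the file without loading it; pages are read from disk
    # only as the corresponding slices are touched.
    arr = np.memmap("data.bin", dtype=np.float64, mode="r",
                    shape=(1_000_000_000,))

    # Work on bounded slices to keep memory use under control.
    step = 10_000_000
    chunk_means = [arr[i:i + step].mean() for i in range(0, len(arr), step)]
    print(sum(chunk_means) / len(chunk_means))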

