best way to read a huge ascii file.

Rolando Espinoza darkrho at gmail.com
Wed Nov 30 17:17:21 EST 2016


Hi,

Yes, working with binary formats is the way to go when you have large data.
But for further reference, Dask[1] fits your use case well; see below how I
process a 7 GB text file in under 17 seconds (on a laptop: mbp + quad-core + ssd).

# Create roughly ~7 GB worth of text data.

In [40]: import numpy as np

In [41]: x = np.random.random((60, 5000000))

In [42]: %time np.savetxt('data.txt', x)
CPU times: user 4min 28s, sys: 14.8 s, total: 4min 43s
Wall time: 5min

In [43]: %time y = np.loadtxt('data.txt')
CPU times: user 6min 31s, sys: 1min, total: 7min 31s
Wall time: 7min 44s

# Then we proceed to use dask to read the big file. The key here is to
# use a block size so we process the file in ~120 MB chunks (approx. one
# line each). By default dask splits on the line separator \n, so the
# partitions don't break the lines.

In [1]: import dask.bag

In [2]: data = dask.bag.read_text('data.txt', blocksize=120*1024*1024)

In [3]: data
dask.bag<bag-fro..., npartitions=60>

# Rather than passing the entire 100+ MB line to np.loadtxt, we slice
# the first 128 bytes, which is enough to grab the first 4 columns.
# You could speed this up further by not reading the entire line, but
# instead reading just 128 bytes from each line offset.

In [4]: from io import StringIO

In [5]: def to_array(line):
    ...:     return np.loadtxt(StringIO(line[:128]))[:4]
    ...:
    ...:

In [6]: %time y = np.asarray(data.map(to_array).compute())
CPU times: user 190 ms, sys: 60.8 ms, total: 251 ms
Wall time: 16.9 s

In [7]: y.shape
(60, 4)

In [8]: y[:2, :]

array([[ 0.17329305,  0.36584998,  0.01356046,  0.6814617 ],
       [ 0.3352684 ,  0.83274823,  0.24399607,  0.30103352]])
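The comment above In [4] hints at skipping the whole-line reads entirely. A
minimal sketch of that idea (the helper name and the precomputed line
offsets are my own assumptions, not part of dask):

```python
# Sketch: read only the first `nbytes` of each line, given each line's
# starting byte offset, so a 100+ MB line is never loaded in full.
def first_columns(path, offsets, nbytes=128, ncols=4):
    rows = []
    with open(path, 'rb') as f:
        for off in offsets:
            f.seek(off)  # jump straight to the start of this line
            chunk = f.read(nbytes).decode('ascii')
            # The last token may be cut mid-number; taking only the
            # first `ncols` tokens avoids parsing the truncated one.
            rows.append([float(tok) for tok in chunk.split()[:ncols]])
    return rows
```

This does O(rows) seeks instead of streaming gigabytes, at the cost of
having to know (or index once) where each line starts.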

You can also use dask to convert the entire file to HDF5.
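For instance, here is a rough sketch of the conversion using h5py directly
(the function name, dataset name, and explicit shape arguments are my own
choices, not anything dask-specific); it streams one row at a time so
memory stays flat:

```python
import numpy as np
import h5py

# Sketch: stream a text file of numeric rows into an HDF5 dataset,
# parsing one line at a time so only a single row is in memory at once.
def text_to_hdf5(txt_path, h5_path, n_rows, n_cols, name='data'):
    with open(txt_path) as src, h5py.File(h5_path, 'w') as h5:
        dset = h5.create_dataset(name, shape=(n_rows, n_cols), dtype='f8')
        for i, line in enumerate(src):
            dset[i, :] = np.array(line.split(), dtype='f8')
```

After that, reading any slice of the data back is fast and does not touch
the original text file again.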

Regards,

[1] http://dask.pydata.org/

Rolando

On Wed, Nov 30, 2016 at 1:16 PM, Heli <hemla21 at gmail.com> wrote:

> Hi all,
>
>  Writing my ASCII file once to either of pickle or npy or hdf data types
> and then working afterwards on the result binary file reduced the read time
> from 80(min) to 2 seconds.
>
> Thanks everyone for your help.
> --
> https://mail.python.org/mailman/listinfo/python-list
>


