[SciPy-Dev] Cannot generate very large very sparse random matrix

Emanuele Olivetti olivetti at fbk.eu
Fri Nov 13 18:28:29 EST 2020


Thank you for your response. Indeed numpy.random.Generator.choice() solves
the problem:
----
rng = np.random.default_rng()
rng.choice(1_000_0000_000_000_000, size=10, replace=False)

array([7363643319410659, 1001129358099623, 7384908776761990,
       3610742892883208, 9484192959193500, 6273686405826185,
       1550972534180773, 1845765940909299,  144504113475750,
       7853188631204629])
----
while:
----
np.random.choice(1_000_0000_000_000_000, size=10, replace=False)
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-11-95b556ac15b9> in <module>
----> 1 np.random.choice(1_000_0000_000_000_000, size=10, replace=False)

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

mtrand.pyx in numpy.random.mtrand.RandomState.permutation()

MemoryError: Unable to allocate 71.1 PiB for an array with shape
(10000000000000000,) and data type int64
----

According to the latest comment on the github issue you mentioned: "It
looks like np.random.Generator should be available from numpy 1.17 on, and
the current minimum numpy version is 1.16.5."... So this may require a
little while...

As a quick fix but also meaningful new feature, would it be possible to
extend the API of scipy.sparse.random() and to add the option
"replace=False" (then piped to np.random.choice()) which, if set to True,
would give the liberty to the user to solve the issue for very large very
sparse matrices at the cost of some (rare) collisions? I would gladly
accept it - and that's also my current fix on my local SciPy.

Best,

Emanuele



On Fri, Nov 13, 2020 at 4:23 PM CJ Carey <perimosocordiae at gmail.com> wrote:

> This is a known issue, see https://github.com/scipy/scipy/issues/9699.
>
> I haven't checked on the status of numpy.random.Generator.choice() in a
> while, so maybe the issue can be resolved now.
>
> On Wed, Nov 11, 2020 at 6:46 PM Emanuele Olivetti <olivetti at fbk.eu> wrote:
>
>> Hi,
>>
>> I've just noticed that it is not possible to generate very large very
>> sparse random matrices with scipy.sparse.random(). For example:
>>   scipy.sparse.random(1_000_000, 1_000_000, density = 1e-11)
>> should create a sparse matrix with only 10 non-zero values... but instead
>> triggers a MemoryError:
>> ----
>> MemoryError                               Traceback (most recent call
>> last)
>> <ipython-input-8-eb81d3aec480> in <module>
>> ----> 1 scipy.sparse.random(1_000_000, 1_000_000, density = 1e-11)
>>
>> ~/miniconda3/envs/lap/lib/python3.8/site-packages/scipy/sparse/construct.py
>> in random(m, n, density, format, dtype, random_state, data_rvs)
>>     787             data_rvs = partial(random_state.uniform, 0., 1.)
>>     788
>> --> 789     ind = random_state.choice(mn, size=k, replace=False)
>>     790
>>     791     j = np.floor(ind * 1. / m).astype(tp, copy=False)
>>
>> mtrand.pyx in numpy.random.mtrand.RandomState.choice()
>>
>> mtrand.pyx in numpy.random.mtrand.RandomState.permutation()
>>
>> MemoryError: Unable to allocate 7.28 TiB for an array with shape
>> (1000000000000,) and data type int64
>> ----
>>
>> Here is the problematic line in current master branch of SciPy:
>>
>> https://github.com/scipy/scipy/blob/master/scipy/sparse/construct.py#L806
>>
>> In short, the issue is due to random_state.choice(... replace=False)
>> which needs to allocate the humongous array in order to pick the ten random
>> numbers...
>>
>> I understand the technical difficulty of generating random numbers
>> without replacement, but it is quite counterintuitive that in order to
>> generate a sparse random matrix it is necessary to create an equally large
>> but *dense* vector first.
>>
>> Is there a solution to this problem?
>>
>> Thanks in advance,
>>
>> Emanuele
>>
>>
>>
>>
>> --
>> Le informazioni contenute nella presente comunicazione sono di natura privata
>> e come tali sono da considerarsi riservate ed indirizzate esclusivamente
>> ai destinatari indicati e per le finalità strettamente legate al
>> relativo contenuto. Se avete ricevuto questo messaggio per errore, vi
>> preghiamo di eliminarlo e di inviare una comunicazione all’indirizzo
>> e-mail del mittente.
>> --
>> The information transmitted is intended only for the person or entity to
>> which it is addressed and may contain confidential and/or privileged
>> material. If you received this in error, please contact the sender and
>> delete the material.
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at python.org
>> https://mail.python.org/mailman/listinfo/scipy-dev
>>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org
> https://mail.python.org/mailman/listinfo/scipy-dev
>

-- 
--
Le informazioni contenute nella presente comunicazione sono di natura 
privata e come tali sono da considerarsi riservate ed indirizzate 
esclusivamente ai destinatari indicati e per le finalità strettamente 
legate al relativo contenuto. Se avete ricevuto questo messaggio per 
errore, vi preghiamo di eliminarlo e di inviare una comunicazione 
all’indirizzo e-mail del mittente.

--
The information transmitted is 
intended only for the person or entity to which it is addressed and may 
contain confidential and/or privileged material. If you received this in 
error, please contact the sender and delete the material.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.python.org/pipermail/scipy-dev/attachments/20201114/2a318145/attachment.html>


More information about the SciPy-Dev mailing list