[SciPy-User] 64-bit matrix indices (Nathaniel Smith)

Ioan-Alexandru Lazar alexlz at lmn.pub.ro
Wed Nov 17 09:12:59 EST 2010


Hello Nathaniel,

First off -- thanks for your reply. Yep, I'm specifically asking about
indices in scipy.sparse matrices. I only need CSC sparse matrices; there
are a handful of other functions in SciPy I need but they aren't related
to sparse matrices.

> You're asking specifically about the indices in scipy.sparse matrices, yes?
> 
> At a quick glance, all the core matrix manipulation code in
> scipy.sparse seems to be templatized with respect to the type of the
> index -- you *might* be able to get 64-bit index support for all the
> core sparse matrix operations by editing
> scipy/sparse/sparsetools/sparsetools.i and adding the obvious stuff at
> around lines 145, 188, and 195, and then creating your matrices "by
> hand" (i.e., building your own indices and indptr arrays of 64-bit
> integers, and then passing them directly to the csc_matrix/csr_matrix
> constructors).

That ought to work -- I was thinking about something along these lines, but
it would have taken me quite a while to figure out how to edit sparsetools.i.
Thanks!
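
For the record, the "by hand" construction I have in mind looks roughly like
the sketch below -- a toy 2x2 system purely for illustration, and it assumes
sparsetools has actually been rebuilt with the 64-bit typemaps (stock SciPy
may well cast the index arrays back down to 32 bits):

  import numpy as np
  import scipy.sparse as sp

  data    = np.array([4.0, -1.0, -1.0, 4.0])
  indices = np.array([0, 1, 0, 1], dtype=np.int64)  # row index of each nonzero
  indptr  = np.array([0, 2, 4], dtype=np.int64)     # column pointers, length ncols + 1
  A = sp.csc_matrix((data, indices, indptr), shape=(2, 2))

In the real code the data/indices/indptr arrays would come straight from our
own assembly routines, which is where they grow past the 32-bit range.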

.mat file reading isn't exactly critical; I figured it might cause some
issues, being a binary format. However, the matrices are generated by a
Matlab-based tool that is also written by our team, so adding the option
of generating Harwell-Boeing matrices instead wouldn't be much of a
problem.
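
Then again, if we do keep the .mat route for a while, the fallback you
describe further down -- reading with 32-bit indices and widening them
afterwards -- looks simple enough. A rough sketch, with 'system.mat' and the
variable name 'A' standing in for whatever our tool actually writes:

  import numpy as np
  from scipy.io import loadmat

  A = loadmat('system.mat')['A'].tocsc()           # placeholder file/variable names
  A.indptr  = np.array(A.indptr, dtype=np.int64)   # widen the index arrays
  A.indices = np.array(A.indices, dtype=np.int64)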

> 
> You'd also need a way to call UMFPACK's 64-bit functions (the "zl/dl"
> variants instead of the "zi/di" variants). It looks like
> scikits.umfpack might let you do this easily, but I'm not sure.

scikits.umfpack already includes the option of using zl/dl, but I'm not
sure how well it works. Guess I'll soon find out :-).
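
As far as I can tell -- and I may well be misreading it; the names below
follow the old scipy.sparse.linalg.dsolve.umfpack interface and could differ
between scikits.umfpack versions -- the long-index routines are selected via
the context's family argument, roughly like this:

  import numpy as np
  import scipy.sparse as sp
  import scikits.umfpack as um

  A = sp.csc_matrix(np.array([[4.0, -1.0], [-1.0, 4.0]]))
  A.indptr  = A.indptr.astype(np.int64)    # the 'dl'/'zl' routines expect 64-bit indices
  A.indices = A.indices.astype(np.int64)
  b = np.array([1.0, 2.0])

  umf = um.UmfpackContext(family='dl')     # 'dl' = real, long indices; 'zl' = complex
  x = umf.linsolve(um.UMFPACK_A, A, b)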

In any case, I'll keep you posted. Thanks for your help!

Best regards,
Alex

> 
> -- Nathaniel
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Tue, 16 Nov 2010 11:20:08 -0800
> From: David Goldsmith <d.l.goldsmith at gmail.com>
> Subject: [SciPy-User] ImportError with Gohlke's 64-bit Windows build
> To: scipy-user at scipy.org
> Message-ID:
> 	<AANLkTik0KY_=n7KNaojt9ijWu31WxY7ZqLVqx0mStaMJ at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> Hi, folks.  I just installed C. Gohlke's 64-bit builds of Numpy and
> Scipy for Python 2.6.  The installations reported no errors, and I get
> no errors reported when simply importing the top-level packages:
> 
> C:\Users\Dad>python
> Python 2.6.6 (r266:84297, Aug 24 2010, 18:13:38) [MSC v.1500 64 bit (AMD64)] on
> win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import scipy as sp
> 
> But when I try to import optimize or interpolate, for example, I get:
> 
> >>> from scipy import optimize
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python26\lib\site-packages\scipy\optimize\__init__.py", line 7, in <m
> odule>
>     from optimize import *
>   File "C:\Python26\lib\site-packages\scipy\optimize\optimize.py", line 28, in <
> module>
>     import linesearch
>   File "C:\Python26\lib\site-packages\scipy\optimize\linesearch.py", line 2, in
> <module>
>     from scipy.optimize import minpack2
> ImportError: DLL load failed: The specified module could not be found.
> 
> >>> from scipy import interpolate
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "C:\Python26\lib\site-packages\scipy\interpolate\__init__.py", line 7, in
>  <module>
>     from interpolate import *
>   File "C:\Python26\lib\site-packages\scipy\interpolate\interpolate.py", line 13
> , in <module>
>     import scipy.special as spec
>   File "C:\Python26\lib\site-packages\scipy\special\__init__.py", line 8, in <mo
> dule>
>     from basic import *
>   File "C:\Python26\lib\site-packages\scipy\special\basic.py", line 6, in <modul
> e>
>     from _cephes import *
> ImportError: DLL load failed: The specified module could not be found.
> 
> Anyone else have this problem?  Anyone have a solution?
> 
> (I just noticed that the interpreter banner says "[MSC v.1500 64 bit (AMD64)]
> on win32" -- emphasis on "64 bit" next to "win32": not sure what this means,
> but could it be the source of the problem?)
> 
> Thanks!
> 
> DG
> -- 
> In science it often happens that scientists say, 'You know that's a
> really good argument; my position is mistaken,' and then they would
> actually change their minds and you never hear that old view from them
> again. They really do it. It doesn't happen as often as it should,
> because scientists are human and change is sometimes painful. But it
> happens every day. I cannot recall the last time something like that
> happened in politics or religion.
> 
> - Carl Sagan, 1987 CSICOP address
> 
> 
> ------------------------------
> 
> Message: 3
> Date: Tue, 16 Nov 2010 23:17:14 +0100
> From: braingateway <braingateway at gmail.com>
> Subject: Re: [SciPy-User] 64-bit matrix indices
> To: SciPy Users List <scipy-user at scipy.org>
> Message-ID: <4CE302EA.5090808 at gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> Nathaniel Smith :
> > On Tue, Nov 16, 2010 at 8:09 AM, Ioan-Alexandru Lazar <alexlz at lmn.pub.ro> wrote:
> >   
> >> I am trying to use SciPy for one of my HPC projects. A problem I am
> >> currently facing is that 32-bit indices are too small for the matrix sizes
> >> we require. Is there any way to use 64-bit ints for the indices?
> >>     
> > [...]
> >   
> >> I only need a few sparse operations and .mat file reading from it.
> >>     
> >
> > You're asking specifically about the indices in scipy.sparse matrices, yes?
> >
> > At a quick glance, all the core matrix manipulation code in
> > scipy.sparse seems to be templatized with respect to the type of the
> > index -- you *might* be able to get 64-bit index support for all the
> > core sparse matrix operations by editing
> > scipy/sparse/sparsetools/sparsetools.i and adding the obvious stuff at
> > around lines 145, 188, and 195, and then creating your matrices "by
> > hand" (i.e., building your own indices and indptr arrays of 64-bit
> > integers, and then passing them directly to the csc_matrix/csr_matrix
> > constructors). The .mat file reader is potentially more tricky, but it
> > sounds like you could read them in with 32-bit indices and then just
> > convert them to 64-bit:
> >   mymat.indptr = np.array(mymat.indptr, dtype=np.int64)
> >   mymat.indices = np.array(mymat.indices, dtype=np.int64)
> >
> >   
> >> If anyone is interested on the background story: the matrices themselves
> >> aren't too big *at first*, but due to the peculiar structure they have,
> >> the fill-in is mind-blowing. UMFPACK complains that it doesn't have
> >> enough memory for them; it does (our cluster's nodes have 24 GB of
> >> memory), but once the number of indices blows past the 32-bit limit, it
> >> hits the ceiling. Using a different solver is still not an option
> >>     
> >
> > You'd also need a way to call UMFPACK's 64-bit functions (the "zl/dl"
> > variants instead of the "zi/di" variants). It looks like
> > scikits.umfpack might let you do this easily, but I'm not sure.
> >
> > -- Nathaniel
> > _______________________________________________
> > SciPy-User mailing list
> > SciPy-User at scipy.org
> > http://mail.scipy.org/mailman/listinfo/scipy-user
> >   
> Hmm, let's wait for the try-out result. I might encounter the same problem
> in the near future.
> 
> LittleBigBrain
> 
> 
> ------------------------------
> 
> Message: 4
> Date: Tue, 16 Nov 2010 15:03:44 -0800
> From: Christoph Gohlke <cgohlke at uci.edu>
> Subject: Re: [SciPy-User] ImportError with Gohlke's 64-bit Windows
> 	build
> To: scipy-user at scipy.org
> Message-ID: <4CE30DD0.1060807 at uci.edu>
> Content-Type: text/plain; charset=UTF-8; format=flowed
> 
> 
> 
> On 11/16/2010 11:20 AM, David Goldsmith wrote:
> > Hi, folks.  I just installed C. Gohlke's 64-bit builds of Numpy and
> > Scipy for Python 2.6.  The installations reported no errors, and I get
> > no errors reported when simply importing the top-level packages:
> >
> > C:\Users\Dad>python
> > Python 2.6.6 (r266:84297, Aug 24 2010, 18:13:38) [MSC v.1500 64 bit (AMD64)] on
> > win32
> > Type "help", "copyright", "credits" or "license" for more information.
> >>>> import numpy as np
> >>>> import scipy as sp
> >
> > But when I try to import optimize or interpolate, for example, I get:
> >
> >>>> from scipy import optimize
> > Traceback (most recent call last):
> >    File "<stdin>", line 1, in<module>
> >    File "C:\Python26\lib\site-packages\scipy\optimize\__init__.py", line 7, in<m
> > odule>
> >      from optimize import *
> >    File "C:\Python26\lib\site-packages\scipy\optimize\optimize.py", line 28, in<
> > module>
> >      import linesearch
> >    File "C:\Python26\lib\site-packages\scipy\optimize\linesearch.py", line 2, in
> > <module>
> >      from scipy.optimize import minpack2
> > ImportError: DLL load failed: The specified module could not be found.
> >
> >>>> from scipy import interpolate
> > Traceback (most recent call last):
> >    File "<stdin>", line 1, in<module>
> >    File "C:\Python26\lib\site-packages\scipy\interpolate\__init__.py", line 7, in
> >   <module>
> >      from interpolate import *
> >    File "C:\Python26\lib\site-packages\scipy\interpolate\interpolate.py", line 13
> > , in<module>
> >      import scipy.special as spec
> >    File "C:\Python26\lib\site-packages\scipy\special\__init__.py", line 8, in<mo
> > dule>
> >      from basic import *
> >    File "C:\Python26\lib\site-packages\scipy\special\basic.py", line 6, in<modul
> > e>
> >      from _cephes import *
> > ImportError: DLL load failed: The specified module could not be found.
> >
> > Anyone else have this problem?  Anyone have a solution?
> >
> > (I just noticed that the interpreter banner says "[MSC v.1500 64 bit (AMD64)]
> > on win32" -- emphasis on "64 bit" next to "win32": not sure what this means,
> > but could it be the source of the problem?)
> >
> > Thanks!
> >
> > DG
> 
> You are likely using the non-MKL build (or an outdated build) of numpy. 
> Scipy-0.8.0.win-amd64-py2.6.exe requires
> numpy-1.5.0.win-amd64-py2.6-mkl.exe.
> 
> Christoph
> 
> 
> 
> 
> ------------------------------
> 
> Message: 5
> Date: Wed, 17 Nov 2010 08:10:19 +0800
> From: Ralf Gommers <ralf.gommers at googlemail.com>
> Subject: Re: [SciPy-User] Fisher exact test, anyone?
> To: SciPy Users List <scipy-user at scipy.org>
> Message-ID:
> 	<AANLkTimmt9YzsQ7Pe9A43YJhorTq=HKs5Yh0dZhvzdHJ at mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
> 
> On Tue, Nov 16, 2010 at 11:45 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> 
> >  On 11/16/2010 07:04 AM, Ralf Gommers wrote:
> >
> >
> >
> > On Mon, Nov 15, 2010 at 12:40 AM, Bruce Southey <bsouthey at gmail.com>wrote:
> >
> >> On Sat, Nov 13, 2010 at 8:50 PM,  <josef.pktd at gmail.com> wrote:
> >> > http://projects.scipy.org/scipy/ticket/956 and
> >> > http://pypi.python.org/pypi/fisher/ have Fisher's exact
> >> > test implementations.
> >> >
> >> > It would be nice to get a version in for 0.9. I spent a few
> >> > unsuccessful days on it earlier this year. But since there are two new
> >> > or corrected versions available, it looks like it just needs testing
> >> > and a performance comparison.
> >> >
> >> > I won't have time for this, so if anyone volunteers for this, scipy
> >> > 0.9 should be able to get Fisher's exact.
> >>
> >>  https://github.com/rgommers/scipy/tree/fisher-exact
> > All tests pass. There's only one usable version (see below) so I didn't do
> > performance comparison. I'll leave a note on #956 as well, saying we're
> > discussing on-list.
> >
> >   I briefly looked at the code at pypi link but I do not think it is
> >> good enough for scipy. Also, I do not like when people license code as
> >> 'BSD' and there is a comment in cfisher.pyx  '# some of this code is
> >> originally from the internet. (thanks)'. Consequently we can not use
> >> that code.
> >>
> >
> > I agree, that's not usable. The plain Python algorithm is also fast enough
> > that there's no need to bother with Cython.
> >
> >>
> >> The code with ticket 956 still needs work especially in terms of the
> >> input types and probably the API (like having a function that allows
> >> the user to select either 1 or 2 tailed tests).
> >>
> >
> > Can you explain what you mean by work on input types? I used np.asarray and
> > forced dtype to be int64. For the 1-tailed test, is it necessary? I note
> > that pearsonr and spearmanr also only do 2-tailed.
> >
> > Cheers,
> > Ralf
> >
> >  I have no problem including this if we can agree on the API, because
> > everything else is internal and can be fixed by the release date. So I would
> > accept a placeholder API that enables a user to select, in the future,
> > which tail(s) the test is performed on.
> >
> 
> It is always possible to add a keyword "tail" later that defaults to
> 2-tailed. As long as the behavior doesn't change this is perfectly fine, and
> better than having a placeholder.
> 
> >
> > 1) It just can not use np.asarray() without checking the input first. This
> > is particularly bad for masked arrays.
> >
> Don't understand this. The input array is not returned, only used
> internally. And I can't think of doing anything reasonable with a 2x2 table
> with masked values. If that's possible at all, it should probably just go
> into mstats.
> 
> 
> > 2) There is no dimension checking because, as I understand it, this can
> > only handle a '2 by 2' table. I do not know enough for general 'r by c'
> > tables or the 1-d case either.
> >
> Don't know how easy it would be to add larger tables. I can add dimension
> checking with an informative error message.
> 
> 
> > 3) The odds-ratio should be removed because it is not part of the test. It
> > is actually more general than this test.
> >
> Don't feel strongly about this either way. It comes almost for free, and R
> seems to do the same.
> 
> > 4) Variable names such as min and max should not shadow Python functions.
> >
> 
> Yes, Josef noted this already, will change.
> 
> >
> > 5) Is there a reference to the algorithm implemented? For example, SPSS
> > provides a simple 2 by 2 algorithm:
> >
> > http://support.spss.com/ProductsExt/SPSS/Documentation/Statistics/algorithms/14.0/app05_sig_fisher_exact_test.pdf
> >
> 
> Not supplied, will ask on the ticket and include it.
> 
> >
> > 6) Why exactly does the dtype need to int64? That is, is there something
> > wrong with hypergeom function? I just want to understand why the precision
> > change is required because the input should enter with sufficient precision.
> >
> This test:
> fisher_exact(np.array([[18000, 80000], [20000, 90000]]))
> becomes much slower and gives an overflow warning with int32. int32 is just
> not enough. This is just an implementation detail and does not in any way
> limit the accepted inputs, so I don't see a problem here.
> 
> Don't know what the behavior should be if a user passes in floats though?
> Just convert to int like now, or raise a warning?
> 
> Cheers,
> Ralf
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL: http://mail.scipy.org/pipermail/scipy-user/attachments/20101117/953dd683/attachment-0001.html 
> 
> ------------------------------
> 
> Message: 6
> Date: Tue, 16 Nov 2010 19:38:09 -0500
> From: josef.pktd at gmail.com
> Subject: Re: [SciPy-User] Fisher exact test, anyone?
> To: SciPy Users List <scipy-user at scipy.org>
> Message-ID:
> 	<AANLkTi=vgFd3AF-R5Lmw=r0HfJhtDRuvMYBZWYeo24rN at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> 
> On Tue, Nov 16, 2010 at 7:10 PM, Ralf Gommers
> <ralf.gommers at googlemail.com> wrote:
> >
> >
> > On Tue, Nov 16, 2010 at 11:45 PM, Bruce Southey <bsouthey at gmail.com> wrote:
> >>
> >> On 11/16/2010 07:04 AM, Ralf Gommers wrote:
> >>
> >> On Mon, Nov 15, 2010 at 12:40 AM, Bruce Southey <bsouthey at gmail.com>
> >> wrote:
> >>>
> >>> On Sat, Nov 13, 2010 at 8:50 PM, <josef.pktd at gmail.com> wrote:
> >>> > http://projects.scipy.org/scipy/ticket/956 and
> >>> > http://pypi.python.org/pypi/fisher/ have Fisher's exact
> >>> > test implementations.
> >>> >
> >>> > It would be nice to get a version in for 0.9. I spent a few
> >>> > unsuccessful days on it earlier this year. But since there are two new
> >>> > or corrected versions available, it looks like it just needs testing
> >>> > and a performance comparison.
> >>> >
> >>> > I won't have time for this, so if anyone volunteers for this, scipy
> >>> > 0.9 should be able to get Fisher's exact.
> >>>
> >> https://github.com/rgommers/scipy/tree/fisher-exact
> >> All tests pass. There's only one usable version (see below) so I didn't do
> >> performance comparison. I'll leave a note on #956 as well, saying we're
> >> discussing on-list.
> >>
> >>> I briefly looked at the code at pypi link but I do not think it is
> >>> good enough for scipy. Also, I do not like when people license code as
> >>> 'BSD' and there is a comment in cfisher.pyx '# some of this code is
> >>> originally from the internet. (thanks)'. Consequently we can not use
> >>> that code.
> >>
> >> I agree, that's not usable. The plain Python algorithm is also fast enough
> >> that there's no need to bother with Cython.
> >>>
> >>> The code with ticket 956 still needs work especially in terms of the
> >>> input types and probably the API (like having a function that allows
> >>> the user to select either 1 or 2 tailed tests).
> >>
> >> Can you explain what you mean by work on input types? I used np.asarray
> >> and forced dtype to be int64. For the 1-tailed test, is it necessary? I note
> >> that pearsonr and spearmanr also only do 2-tailed.
> >>
> >> Cheers,
> >> Ralf
> >>
> >> I have no problem including this if we can agree on the API, because
> >> everything else is internal and can be fixed by the release date. So I would
> >> accept a placeholder API that enables a user to select, in the future,
> >> which tail(s) the test is performed on.
> >
> > It is always possible to add a keyword "tail" later that defaults to
> > 2-tailed. As long as the behavior doesn't change this is perfectly fine, and
> > better than having a placeholder.
> >>
> >> 1) It just can not use np.asarray() without checking the input first. This
> >> is particularly bad for masked arrays.
> >>
> > Don't understand this. The input array is not returned, only used
> > internally. And I can't think of doing anything reasonable with a 2x2 table
> > with masked values. If that's possible at all, it should probably just go
> > into mstats.
> >
> >>
> >> 2) There is no dimension checking because, as I understand it, this can
> >> only handle a '2 by 2' table. I do not know enough for general 'r by c'
> >> tables or the 1-d case either.
> >>
> > Don't know how easy it would be to add larger tables. I can add dimension
> > checking with an informative error message.
> 
> There is some discussion in the ticket about more than 2x2; additions
> would be nice (and there are some examples on the Matlab File Exchange),
> but 2x2 is the most common case and has an unambiguous definition.
> 
> 
> >
> >>
> >> 3) The odds-ratio should be removed because it is not part of the test. It
> >> is actually more general than this test.
> >>
> > Don't feel strongly about this either way. It comes almost for free, and R
> > seems to do the same.
> 
> Same here: it's kind of traditional to return two things, and in this
> case the odds ratio is not the test statistic, but I don't see that it
> hurts either.
> 
> >
> >> 4) Variable names such as min and max should not shadow Python functions.
> >
> > Yes, Josef noted this already, will change.
> >>
> >> 5) Is there a reference to the algorithm implemented? For example, SPSS
> >> provides a simple 2 by 2 algorithm:
> >>
> >> http://support.spss.com/ProductsExt/SPSS/Documentation/Statistics/algorithms/14.0/app05_sig_fisher_exact_test.pdf
> >
> > Not supplied, will ask on the ticket and include it.
> 
> I thought I saw it somewhere -- some kind of bisection algorithm -- but I
> can't find the reference anymore; having one would be good.
> Whatever the algorithm is, it's fast, even for larger values.
> 
> >>
> >> 6) Why exactly does the dtype need to int64? That is, is there something
> >> wrong with hypergeom function? I just want to understand why the precision
> >> change is required because the input should enter with sufficient precision.
> >>
> > This test:
> > fisher_exact(np.array([[18000, 80000], [20000, 90000]]))
> > becomes much slower and gives an overflow warning with int32. int32 is just
> > not enough. This is just an implementation detail and does not in any way
> > limit the accepted inputs, so I don't see a problem here.
> 
> For large numbers like this the chisquare test should give almost the
> same results; it looks pretty "asymptotic" to me. (The usual
> recommendation for the chisquare is more than 5 expected observations
> in each cell.)
> I think the precision is required for some edge cases where the
> probabilities get very small. The main failing case, which I fought with
> for several days last winter and didn't manage to fix, had a zero in the
> first position. I didn't think of increasing the precision.
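
For what it's worth, the textbook two-sided computation can be written
directly on top of scipy.stats.hypergeom -- this is only the definition, not
necessarily the bisection-style algorithm in the ticket -- using the example
table from above:

  import numpy as np
  from scipy.stats import hypergeom

  table = np.array([[18000, 80000], [20000, 90000]], dtype=np.int64)
  (a, b), (c, d) = table
  n = a + b + c + d
  k = np.arange(min(a + b, a + c) + 1)       # all admissible top-left counts
  pmf = hypergeom.pmf(k, n, a + b, a + c)    # P(top-left == k) given the margins
  p_obs = hypergeom.pmf(a, n, a + b, a + c)
  p_two_sided = pmf[pmf <= p_obs * (1 + 1e-7)].sum()   # tables at least as extreme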
> 
> >
> > Don't know what the behavior should be if a user passes in floats though?
> > Just convert to int like now, or raise a warning?
> 
> I wouldn't do any type checking, and checking that floats are almost
> integers doesn't sound really necessary either, unless or until users
> complain. The standard usage should be pretty clear for contingency
> tables with count data.
> 
> Josef
> 
> >
> > Cheers,
> > Ralf
> >
> > _______________________________________________
> > SciPy-User mailing list
> > SciPy-User at scipy.org
> > http://mail.scipy.org/mailman/listinfo/scipy-user
> >
> >
> 
> 
> ------------------------------
> 
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
> 
> 
> End of SciPy-User Digest, Vol 87, Issue 38
> ******************************************




