[SciPy-User] creating sparse indicator arrays

Nathaniel Smith njs at pobox.com
Tue Nov 29 14:07:32 EST 2011


On Tue, Nov 29, 2011 at 7:14 AM,  <josef.pktd at gmail.com> wrote:
> Is there a simple or fast way to create a sparse indicator array, `a`
> below, without going through the dense matrix first?

The standard way is to use the LIL or DOK sparse formats. If you want
to use them then you'll have to do your construction "by hand", though
-- you can't do the nice broadcasting tricks you're using below.
Alternatively, constructing CSC or CSR format directly is not that
hard, though it may take some time to wrap your head around the
definitions...
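
To make those definitions a bit more concrete, here is a small sketch (not part of the original message) that builds a CSR matrix from its three arrays and then walks the rows by hand. The 3x3 matrix and its values are made up purely for illustration. CSC works the same way with the roles of rows and columns swapped.

```python
import numpy as np
from scipy import sparse

# Hypothetical example matrix (values chosen only for illustration):
#   [[1, 0, 2],
#    [0, 0, 3],
#    [4, 5, 0]]
data = np.array([1, 2, 3, 4, 5])      # nonzero values, stored row by row
indices = np.array([0, 2, 2, 0, 1])   # column index of each stored value
indptr = np.array([0, 2, 3, 5])       # row i owns data[indptr[i]:indptr[i + 1]]

m = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))

# Walk the rows using only the three arrays, collecting (column, value) pairs.
rows_seen = []
for i in range(m.shape[0]):
    start, stop = indptr[i], indptr[i + 1]
    cols = indices[start:stop].tolist()
    vals = data[start:stop].tolist()
    rows_seen.append(list(zip(cols, vals)))
```

Note that indptr has one more entry than there are rows: its last entry is the total number of stored values.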

>>>> from scipy import sparse
>>>> g = np.array([0, 0, 1, 1])   #categories, integers,
>>>> u = np.arange(2)    #unique's,  range(number_categories)

If 'u' is *always* going to be np.arange(number_categories), then
actually this is quite trivial (untested code):

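data = np.ones(len(g), dtype=np.int8)
indices = g                      # column index of the single 1 in each row
indptr = np.arange(len(g) + 1)   # row i owns data[indptr[i]:indptr[i+1]]
a = sparse.csr_matrix((data, indices, indptr), shape=(len(g), len(u)))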

This gives you a CSR matrix, which you can either use as is or convert to CSC.
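
As a sanity check, here is a self-contained sketch of the same idea, compared against a dense indicator array. It assumes g holds integers in range(number_categories); the dense broadcasting expression is my guess at what the dense construction looks like, not something taken from the quoted message.

```python
import numpy as np
from scipy import sparse

g = np.array([0, 0, 1, 1])   # categories from the example above
u = np.arange(2)             # assumed to be range(number_categories)

data = np.ones(len(g), dtype=np.int8)
indptr = np.arange(len(g) + 1)   # exactly one stored value per row
a = sparse.csr_matrix((data, g, indptr), shape=(len(g), len(u)))

# Dense reference built with broadcasting (an assumed form of the dense approach).
dense = (g[:, None] == u).astype(np.int8)

# Convert to CSC if column-oriented access is needed.
a_csc = a.tocsc()
```

Passing shape explicitly matters when the highest category happens not to appear in g, since otherwise the number of columns would be inferred from the largest index present.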

If you want to build CSC directly, and want to support an arbitrary
'u' vector, then you could do something like (untested code):

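data = np.ones(len(g), dtype=np.int8)
indices = np.empty(len(g), dtype=int)
write_offset = 0
# indptr needs one more entry than there are columns
indptr = np.empty(len(u) + 1, dtype=int)
for col_i, category in enumerate(u):
  indptr[col_i] = write_offset
  # rows whose category matches this column
  rows = (g == category).nonzero()[0]
  indices[write_offset:write_offset + len(rows)] = rows
  write_offset += len(rows)
indptr[-1] = write_offset
a = sparse.csc_matrix((data, indices, indptr), shape=(len(g), len(u)))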

Or you could just use a loop that fills in an LIL matrix :-)
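
For completeness, a minimal sketch of that LIL loop. The category values and the col_of lookup dict are made up for illustration, and u is taken from np.unique here rather than supplied by the caller.

```python
import numpy as np
from scipy import sparse

g = np.array([3, 3, 7, 7])   # hypothetical categories, not necessarily 0..k-1
u = np.unique(g)             # sorted unique categories

# Map each category value to its column number.
col_of = {category: j for j, category in enumerate(u.tolist())}

# LIL supports cheap element-wise assignment during construction.
a = sparse.lil_matrix((len(g), len(u)), dtype=np.int8)
for row, category in enumerate(g.tolist()):
    a[row, col_of[category]] = 1

# Convert once construction is finished; CSR/CSC are better for arithmetic.
a_csr = a.tocsr()
```

This is slower than building the CSR or CSC arrays directly because of the Python-level loop, but it is hard to get wrong.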

-- Nathaniel


