[SciPy-User] creating sparse indicator arrays
Nathaniel Smith
njs at pobox.com
Tue Nov 29 14:07:32 EST 2011
On Tue, Nov 29, 2011 at 7:14 AM, <josef.pktd at gmail.com> wrote:
> Is there a simple or fast way to create a sparse indicator array, `a`
> below, without going through the dense matrix first?
The standard way is to use the LIL or DOK sparse formats. If you want
to use them then you'll have to do your construction "by hand", though
-- you can't do the nice broadcasting tricks you're using below.
Alternatively, constructing CSC or CSR format directly is not that
hard, though it may take some time to wrap your head around the
definitions...
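As a rough illustration of those definitions, here is a minimal sketch showing the three arrays behind a CSR matrix. The small `dense` matrix is made up for this example and does not come from the thread.

```python
import numpy as np
from scipy import sparse

# Illustrative matrix only; not taken from the original question.
dense = np.array([[1, 0, 2],
                  [0, 0, 3],
                  [4, 5, 0]])

m = sparse.csr_matrix(dense)

# data:    the nonzero values, stored row by row
# indices: the column index of each entry in data
# indptr:  row i occupies data[indptr[i]:indptr[i + 1]],
#          so indptr has (number of rows + 1) entries
print(m.data)     # [1 2 3 4 5]
print(m.indices)  # [0 2 2 0 1]
print(m.indptr)   # [0 2 3 5]

# The same three arrays can be passed back in to rebuild the matrix.
rebuilt = sparse.csr_matrix((m.data, m.indices, m.indptr), shape=dense.shape)
```

CSC uses the same layout with the roles of rows and columns swapped: `indices` holds row numbers and `indptr` has one entry per column plus a final end marker.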
>>>> from scipy import sparse
>>>> g = np.array([0, 0, 1, 1]) #categories, integers,
>>>> u = np.arange(2) #unique's, range(number_categories)
If 'u' is *always* going to be np.arange(number_categories), then
actually this is quite trivial (untested code):
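import numpy as np
data = np.ones(len(g), dtype=np.int8)
indices = g
# one nonzero per row, so row i is data[i:i+1]; indptr needs len(g) + 1 entries
indptr = np.arange(len(g) + 1)
# pass shape explicitly so categories that never occur still get a column
a = sparse.csr_matrix((data, indices, indptr), shape=(len(g), len(u)))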
This gives you a CSR matrix, which you can either use as is or convert to CSC.
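A quick way to sanity-check the direct CSR construction is to compare it with a dense indicator on a tiny input. The dense expression `(g[:, None] == u)` below is an assumption about the kind of broadcasting construction being replaced; it is not quoted from the original question. Note that `indptr` needs `len(g) + 1` entries, one per row plus an end marker.

```python
import numpy as np
from scipy import sparse

g = np.array([0, 0, 1, 1])   # categories, as in the quoted example
u = np.arange(2)             # unique categories

data = np.ones(len(g), dtype=np.int8)
indptr = np.arange(len(g) + 1)   # exactly one stored entry per row
a = sparse.csr_matrix((data, g, indptr), shape=(len(g), len(u)))

# Assumed dense reference, built by broadcasting (only sensible for small inputs).
dense = (g[:, None] == u).astype(np.int8)
assert (a.toarray() == dense).all()

# Converting to CSC afterwards is a single call.
a_csc = a.tocsc()
```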
If you want to build CSC directly, and want to support an arbitrary
'u' vector, then you could do something like (untested code):
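data = np.ones(len(g), dtype=np.int8)
indices = np.empty(len(g), dtype=int)
write_offset = 0
# one indptr entry per column, plus a final end marker
indptr = np.empty(len(u) + 1, dtype=int)
for col_i, category in enumerate(u):
    indptr[col_i] = write_offset
    # rows of g that belong to this category (compare against g, not data)
    rows = (g == category).nonzero()[0]
    indices[write_offset:write_offset + len(rows)] = rows
    write_offset += len(rows)
indptr[-1] = write_offset
# slice in case some entries of g match nothing in u
a = sparse.csc_matrix((data[:write_offset], indices[:write_offset], indptr),
                      shape=(len(g), len(u)))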
Or you could just use a loop that fills in an LIL matrix :-)
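A minimal sketch of that LIL loop, assuming every value in `g` appears in `u`. The `col_of` lookup dict is a helper introduced for this example, not something from the thread.

```python
import numpy as np
from scipy import sparse

g = np.array([0, 0, 1, 1])   # categories, as in the quoted example
u = np.arange(2)             # unique categories; could be arbitrary labels

# Map each category label to its column number (hypothetical helper).
col_of = {category: col_i for col_i, category in enumerate(u.tolist())}

# LIL supports cheap element-wise assignment, unlike CSR/CSC.
a = sparse.lil_matrix((len(g), len(u)), dtype=np.int8)
for row_i, category in enumerate(g.tolist()):
    a[row_i, col_of[category]] = 1

# Convert once at the end for fast arithmetic.
a_csr = a.tocsr()
```

This is the simplest version to read, but the Python-level loop over every row will be slow for large `g` compared with building the CSR or CSC arrays directly.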
-- Nathaniel