Mapping with contiguous ranges of keys

Thomas Nyberg tomuxiong at gmx.com
Fri Dec 16 13:04:05 EST 2016


On 12/15/2016 11:57 PM, Terry Reedy wrote:
> On 12/15/2016 4:30 PM, Thomas Nyberg wrote:
>> On 12/15/2016 12:48 PM, Terry Reedy wrote:
>>> A sparse array has at least half missing values.  This one has none on
>>> the defined domain, but contiguous duplicates.
>>>
>>
>> I'm sorry for devolving into semantics, but there certainly isn't a
>> single definition of "sparse array" out there. For example, the
>> definition in wikipedia (https://en.wikipedia.org/wiki/Sparse_array)
>> doesn't agree with you:
>
> I think it does ;-).
>
>> "In computer science, a sparse array is an array in which most of the
>> elements have the default value (usually 0 or null)."
>
> ...
>
>> Personally my usage of sparse arrays in scipy has _always_ had all
>> defined values it's just that the default value was 0. I never deal
>> with "missing" values.
>
> Lucky you.  In statistics and analysis of real, experimental data,
> missing values are often a possibility.  They are always a nuisance,
> sometimes a major one.  When the data are represented by arrays,  rather
> than by some compacted form, some value has to be chosen to represent
> the absence of data.
>

I thought the definitions differed because yours said "at least 
half" while the other said "most", and yours said "missing values" 
while the other said "default value". Ignoring the first, minor 
point: if by "missing value" you basically mean "default value", 
then yes, I agree the definitions are the same. Also, I didn't mean 
that I was "lucky" to never have dealt with missing values (strictly 
speaking I have, just rarely); my point was that I deal with sparse 
matrices/arrays/structures quite often, but rarely with "missing 
values" like nulls/nans/Nones. That point is moot, though, given 
that you apparently meant the same thing by "missing value" as I did 
by "default value", so I don't think we disagree here.

Taking a step back, the real point of sparse anything is: can you 
represent it in a space/speed-efficient manner which is "stable" 
under whatever operations you care about. E.g. with a default value 
of 0 you have the properties 0 + x = x and 0 * x = 0, which means 
that sparse matrices with default value 0 are stable under addition 
and multiplication. If the default value is nan (or None or null), 
you usually have the convention that nan * x = nan (possibly 
depending on the sign of x) and nan + x = nan, which means that a 
sparse matrix with nan as the default is also stable under addition 
and multiplication. If you chose a default value of (say) 1, you 
would run into issues with the stability of these operations. That 
wouldn't mean it's not "sparse", but it would mean that the 
operations you care about might not work as well.
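
To make the stability point concrete, here is a minimal sketch (my 
own illustration, not anything from the thread) of a dict-backed 
sparse vector with default value 0; addition and multiplication only 
ever touch the explicitly stored entries:

	class SparseVector:
	    """Sparse vector: only non-default entries are stored."""
	    def __init__(self, entries=None, default=0):
	        self.default = default
	        self.entries = dict(entries or {})  # index -> non-default value

	    def __getitem__(self, i):
	        return self.entries.get(i, self.default)

	    def __add__(self, other):
	        # 0 + x = x, so only indices stored in either operand matter.
	        keys = self.entries.keys() | other.entries.keys()
	        sums = {i: self[i] + other[i] for i in keys}
	        return SparseVector({i: v for i, v in sums.items() if v != 0})

	    def __mul__(self, other):
	        # 0 * x = 0, so only indices stored in *both* operands matter.
	        keys = self.entries.keys() & other.entries.keys()
	        prods = {i: self[i] * other[i] for i in keys}
	        return SparseVector({i: v for i, v in prods.items() if v != 0})

	a = SparseVector({3: 2.0, 10**9: 5.0})
	b = SparseVector({3: 4.0})
	print((a + b)[3], (a * b)[10**9])  # 6.0 0 -- no huge dense array needed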

The extension to the thread at hand is just that now you have 
multiple default values, and the fact that they are not assigned at 
random but instead come in runs of constant values means that you 
can put a sparse structure on this (with a suitable definition of 
"sparse").
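
For what it's worth, here is a rough sketch of what such a structure 
might look like (the names and interface are just my illustration, 
not a concrete proposal from the thread): store only the start of 
each run together with its value, and look keys up with bisect:

	import bisect

	class RangeMap:
	    """Maps keys to values via runs: each run starts at a key and
	    extends until the next run's start key."""
	    def __init__(self, runs):
	        # runs: sorted list of (start_key, value) pairs
	        self.starts = [s for s, _ in runs]
	        self.values = [v for _, v in runs]

	    def __getitem__(self, key):
	        idx = bisect.bisect_right(self.starts, key) - 1
	        if idx < 0:
	            raise KeyError(key)
	        return self.values[idx]

	m = RangeMap([(0, 'a'), (100, 'b'), (10**6, 'c')])
	print(m[5], m[100], m[999999], m[10**6])  # a b b c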

 > Let's devolve to a memory-based language like C. A physical array
 > consisting of sequential memory locations must have *some* bit pattern
 > in every byte and hence in every multibyte block.  If I remember
 > correctly, C malloc initializes bytes to all 0 bits, which is an int
 > value of 0 also.  If there is no meaningful value, there must be a
 > default value that means 'missing'.  0 may mean 'missing'. Null values
 > are by definition non-values.  Python uses None as a Null.  If the
 > example had been populated largely with Nones, I would have called it
 > 'sparse'.

It's not really super pertinent to this discussion, but malloc does not 
guarantee that the values are zeroed. That is guaranteed by calloc though:

	http://man7.org/linux/man-pages/man3/malloc.3.html
	http://man7.org/linux/man-pages/man3/calloc.3p.html

Also, the point of a sparse representation is that your default value 
wouldn't exist in memory anywhere; instead the operations would 
understand that it is there by other means. For example, a sparse 
matrix with 0 as the default might* be represented by rows of the form

	i,j,v

where i and j are the row and column indices and v is the value at 
that position. In this representation we would have as many rows as 
there are non-default values.

*I say might because there are usually more compact forms than this. 
This isn't the internal representation of scipy.sparse.csr_matrix, for 
example.
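
To make the (i, j, v) point concrete, here is a toy coordinate-style 
store (purely illustrative; this is not scipy's data structure) in 
which the default value 0 is never stored at all:

	class CooMatrix:
	    """Toy COO-style sparse matrix: one (i, j, v) triplet per
	    non-default value; the default 0 is implied, never stored."""
	    def __init__(self, shape):
	        self.shape = shape
	        self.rows, self.cols, self.vals = [], [], []

	    def set(self, i, j, v):
	        if v != 0:  # only non-default values get a triplet
	            self.rows.append(i)
	            self.cols.append(j)
	            self.vals.append(v)

	    def get(self, i, j):
	        for r, c, v in zip(self.rows, self.cols, self.vals):
	            if r == i and c == j:
	                return v
	        return 0  # the default, implied rather than stored

	m = CooMatrix((1000, 1000))
	m.set(2, 7, 3.5)
	print(m.get(2, 7), m.get(500, 500))  # 3.5 0 -- one triplet, 10^6 cells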

Regardless, I think I wrote too much just to say that I don't think 
we're really disagreeing, except possibly slightly in perspective.

Cheers,
Thomas


