[SciPy-user] histogramdd bug fix

Fri Feb 15 08:47:19 EST 2008

Hi All,

There is a bug in histogramdd when more than three dimensions is used.
Essentially a permute operation must be performed at the end to align 
the output matrix with the input dimensions. Up to three dimensions, 
this can always be completed by a single round of axis swaps. However, 
in 4+ dimensions more than one round are required.
The following has been hacked so that it will not return until the 
correct permutation has been found. It is not smart and will certainly 
slow you down if you are using more than say 6 dimensions but otherwise 
it seems to work.
Since I know nothing about using svn or the rules associated with numpy, 
someone else will have to put the change in the numpy build

Cheers
Chris

def histogramdd(sample, bins=10, range=None, normed=False, weights=None):
    """histogramdd(sample, bins=10, range=None, normed=False, weights=None)

    Return the N-dimensional histogram of the sample.

    Parameters:

        sample : sequence or array
            A sequence containing N arrays or an NxM array. Input data.

        bins : sequence or scalar
            A sequence of edge arrays, a sequence of bin counts, or a scalar
            which is the bin count for all dimensions. Default is 10.

        range : sequence
            A sequence of lower and upper bin edges. Default is [min, max].

        normed : boolean
            If False, return the number of samples in each bin, if True,
            returns the density.

        weights : array
            Array of weights.  The weights are normed only if normed is 
True.
            Should the sum of the weights not equal N, the total bin 
count will
            not be equal to the number of samples.

    Returns:

        hist : array
            Histogram array.

        edges : list
            List of arrays defining the lower bin edges.

    SeeAlso:

        histogram

    Example

        >>> x = random.randn(100,3)
        >>> hist3d, edges = histogramdd(x, bins = (5, 6, 7))

    """

    try:
        # Sample is an ND-array.
        N, D = sample.shape
    except (AttributeError, ValueError):
        # Sample is a sequence of 1D arrays.
        sample = atleast_2d(sample).T
        N, D = sample.shape
    nbin = empty(D, int)
    edges = D*[None]
    dedges = D*[None]
    if weights is not None:
        weights = asarray(weights)

    try:
        M = len(bins)
        if M != D:
            raise AttributeError, 'The dimension of bins must be a equal 
to the dimension of the sample x.'
    except TypeError:
        bins = D*[bins]

    # Select range for each dimension
    # Used only if number of bins is given.
    if range is None:
        smin = atleast_1d(array(sample.min(0), float))
        smax = atleast_1d(array(sample.max(0), float))
    else:
        smin = zeros(D)
        smax = zeros(D)
        for i in arange(D):
            smin[i], smax[i] = range[i]

    # Make sure the bins have a finite width.
    for i in arange(len(smin)):
        if smin[i] == smax[i]:
            smin[i] = smin[i] - .5
            smax[i] = smax[i] + .5

    # Create edge arrays
    for i in arange(D):
        if isscalar(bins[i]):
            nbin[i] = bins[i] + 2 # +2 for outlier bins
            edges[i] = linspace(smin[i], smax[i], nbin[i]-1)
        else:
            edges[i] = asarray(bins[i], float)
            nbin[i] = len(edges[i])+1  # +1 for outlier bins
        dedges[i] = diff(edges[i])

    nbin =  asarray(nbin)

    # Compute the bin number each sample falls into.
    Ncount = {}
    for i in arange(D):
        Ncount[i] = digitize(sample[:,i], edges[i])

    # Using digitize, values that fall on an edge are put in the right bin.
    # For the rightmost bin, we want values equal to the right
    # edge to be counted in the last bin, and not as an outlier.
    outliers = zeros(N, int)
    for i in arange(D):
        # Rounding precision
        decimal = int(-log10(dedges[i].min())) +6
        # Find which points are on the rightmost edge.
        on_edge = where(around(sample[:,i], decimal) == 
around(edges[i][-1], decimal))[0]
        # Shift these points one bin to the left.
        Ncount[i][on_edge] -= 1

    # Flattened histogram matrix (1D)
    hist = zeros(nbin.prod(), int)

    # Compute the sample indices in the flattened histogram matrix.
    ni = nbin.argsort()
    shape = []
    xy = zeros(N, int)
    for i in arange(0, D-1):
        xy += Ncount[ni[i]] * nbin[ni[i+1:]].prod()
    xy += Ncount[ni[-1]]

    # Compute the number of repetitions in xy and assign it to the 
flattened histmat.
    if len(xy) == 0:
        return zeros(nbin-2, int), edges

    flatcount = bincount(xy, weights)
    a = arange(len(flatcount))
    hist[a] = flatcount
    # Shape into a proper matrix
    hist = hist.reshape(sort(nbin))
    if printOut:
        print ni
    mustPermute = True
    while mustPermute:
        nothingChanged = True
        for i in arange(nbin.size):
            j = ni[i]
            if j != i:
                nothingChanged = False
            hist = hist.swapaxes(i,j)
            ni[i],ni[j] = ni[j],ni[i]
        if nothingChanged:
            mustPermute = False
    #print "THis is after swapping the axis ", hist.shape
    # Remove outliers (indices 0 and -1 for each dimension).
    core = D*[slice(1,-1)]
    hist = hist[core]

    # Normalize if normed is True
    if normed:
        s = hist.sum()
        for i in arange(D):
            shape = ones(D, int)
            shape[i] = nbin[i]-2
            hist = hist / dedges[i].reshape(shape)
        hist /= s

    return hist, edges

-- 
**********************************************
*  Chris Lee                                 *
*  Laser physics and nonlinear optics group  *
*  MESA+ Institute                           *
*  University of Twente                      *
*  Phone: ++31 (0)53 489 3968                *
*  fax: ++31 (0) 53 489 1102                 *
**********************************************