[SciPy-User] Identify unique sequence data from array

Thu Dec 23 00:12:06 EST 2010

On Wed, Dec 22, 2010 at 11:49 PM, Skipper Seabold <jsseabold at gmail.com> wrote:
> On Wed, Dec 22, 2010 at 6:00 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
>>
>>
>> On Wed, Dec 22, 2010 at 2:51 PM, Robert Kern <robert.kern at gmail.com> wrote:
>>>
>>> On Wed, Dec 22, 2010 at 16:47, Paul Anton Letnes
>>> <paul.anton.letnes at gmail.com> wrote:
>>> >
>>> > On 22. des. 2010, at 21.18, otrov wrote:
>>> >
>>> >>>> The problem:
>>> >>
>>> >>>> I have 2D data sets (scipy/numpy arrays) of 10^7 to 10^8 rows, which
>>> >>>> consists of repeated sequences of one unique sequence, usually ~10^5 rows,
>>> >>>> but may differ in scale. Period is same for both columns, so there is not
>>> >>>> really difference if we consider 2D or 1D array.
>>> >>>> I want to track this data block.
>>> >>
>>> >>> for i in range(1, len(X)-1):
>>> >>>    if (X[i:] == X[:-i]).all():
>>> >>>        break
>>> >>
>>> >> Just look at that python beauty! Such a great language when in hand of
>>> >> a smart user.
>>> >> Thanks for you snippet, but unfortunately it takes forever to finish
>>> >> the task
>>> >
>>> > You could also check one element at a time. I think it will be faster,
>>> > because it will break if comparison of the first element doesn't hold. Then,
>>> > if you find such an occurrence, use Robert's method to double check that you
>>> > found the true repetition period.
>>>
>>> Excellent point.
>>>
>>
>> Why not do an FFT and look at the shape around the carrier frequency? The DC
>> level should probably be subtracted first. It shoud also be possible to
>> construct a Weiner filter to extract the sequences if they don't occur with
>> strict periods.
>>
>
> Could you give an example?  I've used a convolution to find the number
> of successive discrete events, but I'm not sure how to generalize it
> or if your suggestion is similar.
>
> For example, to count the number of three successes in a row for
> Bernoulli trials and to find where
>
> In [1]: import numpy as np
>
> In [2]: x = np.array([1,1,1,0,0,1,0,1,0,1,1,1,0,1,0,1])
>
> In [3]: y = np.array([1,1,1])
>
> In [4]: idx = np.convolve(x,y)
>
> In [5]: num_runs = len(np.where(idx==len(y))[0])
>
> In [6]: # Extract runs from original array
>
> In [6]: idx = np.where(idx==len(y))[0]-(len(y)-1)
>
> In [7]: idx = np.hstack([np.arange(i,i+3) for i in idx])
>
> In [8]: x[idx]
> Out[8]: array([1, 1, 1, 1, 1, 1])

It's different because in the original case you want to find the
periodicity, for example calculate the periodogram/fft and find the
spike. This should give the frequency and then the period length. If
there are small numerical problems, it would still narrow down the
range to search with direct comparison.

Josef
(finding runs sounds like a fun problem, more than multiple tests and
comparisons)

>
> Skipper
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>