[SciPy-User] Identify unique sequence data from array

Wed Dec 22 12:47:19 EST 2010

Hi,
I have asked for help on three other lists, but since this problem apparently can't be solved easily in matlab/octave(!?), I thought I would try scipy/numpy and maybe benefit from Python being a more feature-rich, expressive language.

The problem:

I have 2D data sets (scipy/numpy arrays) of 10^7 to 10^8 rows, each consisting of repetitions of one unique sequence. That sequence is usually ~10^5 rows long, though its length may differ. The period is the same for both columns, so it makes no real difference whether we treat the data as a 2D or a 1D array.
I want to identify this repeating data block.

Simplified problem:

X = array([[1, 2],
           [1, 2],
           [2, 2],
           [3, 1],
           [2, 3],
           [1, 2],
           [1, 2],
           [2, 2],
           [3, 1],
           [2, 3],
           [1, 2],
           [1, 2],
           [2, 2],
           [3, 1],
           [2, 3],
           ...,
           [1, 2],
           [1, 2],
           [2, 2],
           [3, 1],
           [2, 3]])

I would like to extract the repeated sequence:

Y = array([[1, 2],
           [1, 2],
           [2, 2],
           [3, 1],
           [2, 3]])

as a result.

Or presented more visually:

I want to identify the unique sequence that repeats:

A B C D D D A B C D D D A B C D D D
|_________| |_________| |_________|
     |           |           |
   unique      unique      unique
  sequence    sequence    sequence
    data        data        data
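
To make the goal concrete, here is a minimal numpy sketch of one possible approach, not a definitive solution. It assumes the data repeats exactly (no noise), that the array is non-empty, and that a block boundary sits at row 0. The function name find_repeating_block is made up for illustration. The idea: every period must be an offset at which the first row occurs again, so only those offsets are tested, and an offset p is a period if shifting the array by p rows leaves it unchanged.

```python
import numpy as np

def find_repeating_block(X):
    """Return the shortest leading block whose repetition reproduces X.

    Hypothetical helper: assumes X is non-empty and repeats exactly.
    """
    X = np.asarray(X)
    n = len(X)
    rows = X.reshape(n, -1)  # treat 1D and 2D input uniformly
    # Candidate periods: offsets at which the first row occurs again.
    matches = np.all(rows[1:] == rows[0], axis=1)
    candidates = np.flatnonzero(matches) + 1
    for p in candidates:
        # X has period p if shifting it by p rows changes nothing.
        if np.array_equal(X[p:], X[:-p]):
            return X[:p]
    return X  # no repetition found

# The simplified example from above, with the block repeated four times.
Y = np.array([[1, 2], [1, 2], [2, 2], [3, 1], [2, 3]])
X = np.tile(Y, (4, 1))
block = find_repeating_block(X)
```

Each candidate test is one vectorized comparison over the whole array, so the cost grows with the number of times the first row recurs before the true period is reached. For 10^8 rows with a very common first row this could be slow, and a string-matching style method (for example a prefix function) might be a better fit.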

Thanks for your time