Struggling for inspiration with lists
Tim Chase
python.list at tim.thechases.com
Tue Dec 17 23:50:36 EST 2013
On 2013-12-18 03:51, Denis McMahon wrote:
> I need to keep the timestamp / data association because I need to
> generate output that identifies (a) the longest repeated sequence
> (b) how many elements in the longest repeated sequence (c) at what
> timestamps each occurrence started.
>
> I'm not sure of the best way, programatically, to aproach this
> task, which means I'm unsure whether eg a list of tuples ( time,
> data ) or an OrderedDict keyed on the timestamp is the best
> starting point.
[snip]
> Is there a better way to do this?
You might have to define your criteria for "best". Things that
might play into it:
- what happens if there's more than one "longest" sequence?
- can you specify a minimum length that you don't have to track? (a
maximum sequence of length-1 might be uninteresting)
- how large can your input data grow to be? If it's huge, you're
looking at searching a problem space something like O(N^2) or
possibly O(N log N) if you're lucky. Which in turn can involve
lots of memory consumption.
- do you just want to know what that sequence is, or do you want to
know all the locations/offsets where it the longest sequence was
found?
I've played with two simple versions, the first is naive, while the
second tries to use a little bit of smarts to eliminate possibilities
when you know you have at least one match of a certain length. The
result of the function does allow for finding a plurality of longest
sequences, but my demo code just snags one with the max() function.
I leave the "find multiple maxes" as an exercise for the reader. :)
-tkc
def build1(d):
sequences = {} # sequence -> [list of start offsets]
for i in range(len(d)):
for j in range(i+1, len(d)):
key = tuple(value for timestamp, value in d[i:j])
sequences.setdefault(key, []).append(i)
return sequences
def build2(d):
sequences = {} # sequence -> [list of start offsets]
longest = 1
for i in range(len(d)):
for j in range(i+longest, len(d)):
key = tuple(value for timestamp, value in d[i:j])
if key in sequences:
sequences[key].append(i)
if len(sequences[key]) > longest:
longest = len(sequences[key])
print longest
else:
sequences[key] = [i]
return sequences
data = {
0: "d", 1: "x", 2: "y", 3: "t", 4: "d", 5: "y", 77: "g",
78: "h", 79: "x", 80: "y", 206: "t", 210: "d", 211: "x",
}
d = data.items()
d.sort()
for fn in (build1, build2):
print fn.__name__
sequences = fn(d)
longest_key, offsets = max((
pair
for pair in sequences.iteritems()
if len(pair[1]) > 1
), key=lambda (k,v): len(k))
print "Longest sequence %r found at %s" % (
longest_key,
[d[i][0] for i in offsets],
)
for k,v in sequences.iteritems():
if len(v) > 1:
print "%s: %r" % (k,v)
print '=' * 50
.
More information about the Python-list
mailing list