find overlapping lines & output times observed

Oscar Benjamin oscar.j.benjamin at gmail.com
Mon May 6 16:32:33 EDT 2013


On 6 May 2013 19:39, Linsey Raaijmakers <lm.raaijmakers at gmail.com> wrote:
> I have a file like this:
> action start    end
> 50      5321    5321
> 7       5323            5347
> 12      5339            5351
> 45      5373    5373
> 45      5420            5420
> 25      5425            5425
[snip]

your code below suggests that your file also has an "apex" column. If
that's correct what do you what is it and what do you want to do with
it?

[snip]
> I have a script now that identifies overlap between two actions (see bottom page), but how can I change this so that it outputs all possible combinations?

I looked at that code and I don't think it does what you describe. Are
you sure that it works?

> My desired output would be:
>
> action    times observed    apex
> 50         5                          5321, 5451, 5533, 5634,  5740
> 50,45    1                          5533;5533
> 7           4                          5347, 5689, 5688, 5845
> 7,25      2                          5347;5425, 5689;5682
> 7,25,26 1                          5689;5682;5690
>
> CODE:
>
> from collections import Counter
> f = open('and.txt','r');
>
> action_list = []
> onset_list = []
> apex_list = []
> offset_list = []
> action_counter = 0
> combination_list = []
>
>
> for line in f:
>   fields = line.split("\t")
>   for col in fields:
>     action = fields[0]
>     onset = fields[1]
>     apex = fields[2]
>     offset = fields[3]

The above code suggests that the file has four columns. Also since
you're not actually using the loop variable "col" you can just delete
the "for" line and unindent the rest. In fact the easiest way to do
all of this is just:

    action, onset, apex, offset = line.split()

>
>   action_list.append(action)
>   onset_list.append(onset)
>   apex_list.append(apex)
>   offset_list.append(offset)
>
> action_cnvrt = map(int, action_list)
> c = Counter(action_cnvrt)
>
> filtered = list(set(action_list))

There's no need to convert this back to a list if you're just going to
iterate over it again with map.

> filtered_cnvrt = map(int, filtered)
>
> for a in filtered_cnvrt:
>   action_count = str(a)+"\t"+str(c[a])
>   print action_count

The above counts the number of times each event occurs which is one
part of your desired output.

>
> for i in range (0,len(onset_list)):
>   combination_list.append(action_list[i])
>   for j in range(0,len(apex_list)):
>     if i != j:
>       if onset_list[j]>= onset_list[i] and apex_list[j] <= apex_list[i]:
>         print action_list[j]+","+action_list[i]+'\t'+onset_list[j]+'\t'+apex_list[j]+'\t'+onset_list[i]+'\t'+apex_list[i]

What is combination_list for? It should just end up containing the
same thing as action_list if I understand correctly.

It's generally better in Python to loop directly over things rather
than using indices so, instead of something like:

  for i in range(len(onset_list)):
    print offset_list[i] - onset_list[i]

you should do something like

  for offset, onset in zip(offset_list, onset_list):
    print offset - onset

and your code will be a lot clearer.

The algorithm you are using is to loop over all events and then to
loop over all other events comparing all pairs of events. This will
not scale very well if you want to look at a large file or to compare
simultaneous occurrences of more than two events.

It looks as if your input data is ordered by the onset column. Is that
the case? If so then you can use an algorithm that just loops once
over all the events. The way the algorithm works is that you store
which events are currently active and loop through the events keeping
track of the start time of the most recently added event adding and
removing events as they start and stop. In pseudocode:

now = start of first event
active = [first event]
for next_starting in events (not including first):
    next_ending = event that ends soonest from active
    while start of next_starting > end of next_ending:
        report active events from now to end of next_ending
        now = end of next_ending
        remove next_ending from active
    report active events from now until start of next_starting
    now = start of next_starting
    add next_starting to active

And some more code to deal with what happens when you get to the end
of the list of events...

The report steps probably mean adding to a Counter or dict to remember
that the currently active events were active during each particular
time window.


Oscar



More information about the Python-list mailing list