[SciPy-User] Again on Calculating (Conditional) Time Intervals

Fri Jun 4 06:16:53 EDT 2010

Dear Brandon,
Thanks a lot. I am reading carefully and tested your snippet; it seems
to be just doing fine.
Maybe a bit of clarification if in order
Let us say that an object A meets B at times t_1, t_2, t_3 (in
increasing order and there may be gaps between them). Then A meets C at
times t_5, t_6 and t_7. Then the quantity I am after is abs(t_1-t_5)
i.e. the time interval between the beginning of the A-B and A-C
interaction. This may sound obscure, but if A is capable of spreading
information, then this time interval is a measure of its activity (how
long before after talking to B it starts talking to C?).
Cheers

Lorenzo

On Tue, 2010-06-01 at 18:45 -0400, Nuttall, Brandon C wrote:
> Lorenzo,
> 
> I'm not sure I'm clear on what you want. However, the attached Python code produces a list of the observed delta times (time between meetings of different ID pairs). That list can then be analyzed using histogram or any of the probability density functions in scipy and numpy.
> 
> Hope this helps.
> 
> Brandon
> 
> -----Original Message-----
> From: scipy-user-bounces at scipy.org [mailto:scipy-user-bounces at scipy.org] On Behalf Of Lorenzo Isella
> Sent: Tuesday, June 01, 2010 11:17 AM
> To: scipy-user at scipy.org
> Subject: [SciPy-User] Again on Calculating (Conditional) Time Intervals
> 
> Dear All,
> I hope this is not too off-topic. I have digged up an old email I posted which went unanswered quite some time ago.
> I made some progress on a simpler problem than the one for which I initially asked for help and I am attaching at the end of the email 
> my own scripts. I anyone can help me to progress a bit further, I will be very grateful.
> Consider and array of this kind:
> 
> 1      12    45
> 2      7       12
> 2      15     37
> 3      25     89
> 3      8       13
> 3      13     44
> 4      77     89
> 4      77     89
> 5      12     22
> 8      12     22
> 9      15     22
> 11    22     37
> 23    3       12
> 24    18     37
> 25    1      12
> 
> 
> 
> where the first column is time measured in some units. The other two 
> columns are some ID's identifying  infected individuals establishing a 
> contact at the corresponding time. As you can see, there may be 
> time-gaps in my recorded times and there may be repeated times if 
> several contacts take place simultaneously. The ID's are always sorted 
> out in such a way the ID number of the 2nd column is always smaller than 
> the corresponding entry of the third column (I am obviously indexing 
> everything from 1).
> Now, this is my problem: I want to look at a specific ID I will call A 
> (let us say A is 12) and calculate all the time differences t_AC-t_AB 
> for B!=C, i.e. all the time intervals between the most recent contact 
> between A and B and the first subsequent contact between and A and C 
> (which has to be different from B).
> An example to fix the ideas: A=12, B=22, C=1, then
> t_AB=8 (pick the most recent one before t_AC)
> t_AC=25,
> hence t_AC-t_AB=25-8=17. (but let me say it again: I want to be able to 
> calculate all such intervals for any B and C on the fly).
> It should be clear at this point that the calculated t_AC-t_AB != 
> t_AB-t_AC as some time-ordering is implicit in the definition (in 
> t_AC-t_AB, AC contacts have to always be more recent than AB contacts).
> Even in the case of multiple disjointed AB and AC contacts, I always 
> have to look for the closest time intervals in time. E.g. if I had
> 
> 10    12       22
> 40    12       22
> 60    1         12
> 100  1         12
> 110  12       22
> 130  12       22
> 150  1         12 
> 
> then I would work out the time intervals 60-40=20 and 150-130=20.
> Now, thanks to the help I got from the list, I am able to calculate the 
> distribution of contact and interval durations between IDs (something simpler than the conditional 
> time interval). See the code at the end of the email which you can run on the first dataset I provide in this email to get 
> the contact/interval distributions. 
> Sorry for the long email, but any suggestion about how to calculate the conditional probability efficiently would help me a great deal.
> Many thanks
> 
> Lorenzo
> 
> 
> #!/usr/bin/env python
> import scipy as s
> import pylab as p
> import numpy as n
> import sys
> import string
> 
> 
> def single_tag_contact_times(sliced_data, tag_id):
> 
>  
>     #I can follow a given tag by selecting his ID number and looking
>     #for it through the data 
>     
>     sel=s.where(sliced_data==tag_id)
> 
>     #now I need to add a condition in case the id of the tag I have chosen is non-existing
>   
>     
>     if (len(sel[0])==0):
>         #print "the chosen tag does not exist"
>         return
>     
>     tag_contact_times=sliced_data[sel[0],0] #I select the times at which
>     #the tag I am tracking undergoes a contact.
> 
>    
> 
>     tag_no_rep=n.unique1d(tag_contact_times) #The idea is the following:
>     #in a given time interval delta_slice, a tag may undergo multiple contacts
>     #with different tags. This corresponds to different entries in the output
>     #of time_binned_interaction. That function does not allow for multiple contacts between
>     # the SAME two tags being reported in the same time slice, but it allows the same tag ID
>     #to appear twice in the same time window if it gets in touch with TWO different tags
>     #within the same delta_slice. It is fundamental to know that in a given time slice
>     #tag A has estabilished contact with tag B and tag C (if I discard any bit of this info,
>     #then I lose info about the state of the network at that time slice), but when it comes to
>     #simply having the time-dependent distribution of contact durations and intervals between
>     #any two contacts estabilished by packet A, I will simply say that tag A reported a contact
>     #in that given time-slice. More sophisticated statistics (e.g. the number of contacts
>     #estabilished by tag A in a given time slice), can be implemented if found useful/needed
>     #later on.
> 
> 
> 
>     #p.save("single_tag_contact_times_no_rep.dat",tag_no_rep,fmt='%d')
> 
>     return  tag_no_rep
> 
> 
> def contact_duration_and_interval_many_tags(sliced_interactions,\
>                                          delta_slice, counter):
> 
>     #I added this line since now there is no guarantee that in the edge list
>     # (contact list) tag_A---tag_B, the id of tag_A is <= id of tag_B.
> 
>     sliced_interactions[:,1:3]=s.sort(sliced_interactions[:,1:3])
> 
>     #This function iterates interval_between_contacts_single_tag on a all the tag ID`s
>     #thus outputting the distribution of time intervals between any two contacts in the system.
> 
>     tag_ids= n.unique1d(s.ravel(sliced_interactions[:,1:3])) #to get a list of
>     #all tag ID`s, which appear (repeated) on two rows of the matrix output by 
>     # time_binned_interaction
> 
> 
>     #n.savetxt("tag_IDs.dat", tag_ids ,  fmt='%d')
> 
>     
>     # tag_ids=tag_ids.astype('int')
> 
>     
> 
>     #print "tag_ids is, ", tag_ids
> 
>     overall_gaps=s.zeros(0) #this array will contain the time intervals between two consecutive
>     #contacts for all the tags in the system.
> 
> 
> 
>     overall_duration=s.zeros(0) #this array will contain the time duration of the
>     #contacts for all the tags in the system.
> 
> 
> 
>     for i in xrange(len(tag_ids)): 
>         track_tag_id=tag_ids[i]  #i.e. iterate on all tags
>         
>         contact_times=single_tag_contact_times(sliced_interactions, track_tag_id) #get
>         #an array with all the interactions of a given tag
> 
>         #print "contact_times is, ", contact_times
> 
>         results=contact_duration_and_interval_single_tag(contact_times, delta_slice)
> 
>         tag_duration=results[0]
> 
>         
>         tag_intervals=results[1] #get
>         #an array with the time intervals between two contacts for a given tag
> 
>         
>         #print "tag_intervals is, ", tag_intervals
>         
>         overall_gaps=s.hstack((overall_gaps,tag_intervals)) #collect
>         #the results on all tags
>         
>         
>         #print "overall_gaps is, ", overall_gaps
> 
>         overall_duration=s.hstack((overall_duration,tag_duration))
> 
>     #overall_gaps=overall_gaps[s.where(overall_gaps !=0)]
>     #overall_duration=overall_duration[s.where(overall_duration !=0)]
>     filename="many_tags_contact_interval_distr2_%01d"%(counter+1)
>     filename=filename+"_.dat"
> 
>     n.savetxt(filename, overall_gaps ,  fmt='%d')
> 
>     filename="many_tags_contact_duration_distr2_%01d"%(counter+1)
>     filename=filename+"_.dat"
> 
>     
>     n.savetxt(filename, overall_duration ,  fmt='%d')
> 
>     return overall_duration, overall_gaps
> 
> 
> 
> def contact_duration_and_interval_many_tags(sliced_interactions,\
>                                          delta_slice, counter):
> 
>     #I added this line since now there is no guarantee that in the edge list
>     # (contact list) tag_A---tag_B, the id of tag_A is <= id of tag_B.
> 
>     sliced_interactions[:,1:3]=s.sort(sliced_interactions[:,1:3])
> 
>     #This function iterates interval_between_contacts_single_tag on a all the tag ID`s
>     #thus outputting the distribution of time intervals between any two contacts in the system.
> 
>     tag_ids= n.unique1d(s.ravel(sliced_interactions[:,1:3])) #to get a list of
>     #all tag ID`s, which appear (repeated) on two rows of the matrix output by 
>     # time_binned_interaction
> 
> 
>     #n.savetxt("tag_IDs.dat", tag_ids ,  fmt='%d')
> 
>     
>     # tag_ids=tag_ids.astype('int')
> 
>     
> 
>     #print "tag_ids is, ", tag_ids
> 
>     overall_gaps=s.zeros(0) #this array will contain the time intervals between two consecutive
>     #contacts for all the tags in the system.
> 
> 
> 
>     overall_duration=s.zeros(0) #this array will contain the time duration of the
>     #contacts for all the tags in the system.
> 
> 
> 
>     for i in xrange(len(tag_ids)): 
>         track_tag_id=tag_ids[i]  #i.e. iterate on all tags
>         
>         contact_times=single_tag_contact_times(sliced_interactions, track_tag_id) #get
>         #an array with all the interactions of a given tag
> 
>         #print "contact_times is, ", contact_times
> 
>         results=contact_duration_and_interval_single_tag(contact_times, delta_slice)
> 
>         tag_duration=results[0]
> 
>         
>         tag_intervals=results[1] #get
>         #an array with the time intervals between two contacts for a given tag
> 
>         
>         #print "tag_intervals is, ", tag_intervals
>         
>         overall_gaps=s.hstack((overall_gaps,tag_intervals)) #collect
>         #the results on all tags
>         
>         
>         #print "overall_gaps is, ", overall_gaps
> 
>         overall_duration=s.hstack((overall_duration,tag_duration))
> 
>     #overall_gaps=overall_gaps[s.where(overall_gaps !=0)]
>     #overall_duration=overall_duration[s.where(overall_duration !=0)]
>     filename="many_tags_contact_interval_distr2_%01d"%(counter+1)
>     filename=filename+"_.dat"
> 
>     n.savetxt(filename, overall_gaps ,  fmt='%d')
> 
>     filename="many_tags_contact_duration_distr2_%01d"%(counter+1)
>     filename=filename+"_.dat"
> 
>     
>     n.savetxt(filename, overall_duration ,  fmt='%d')
> 
>     return overall_duration, overall_gaps
> 
> 
> def contact_duration_and_interval_single_tag(single_tag_no_rep, delta_slice):
> 
>     #the following if condition is useful only when I am really tracking a particular
>     #tag whose ID is given a priory but which may not exist at all (in the sense that
>     #it would not estabilish any contact) in the time window during which I am studying
>     #the system.
> 
> 
>     if (single_tag_no_rep==None):
>         print "The chosen tag does not exist hence no analysis can be performed on it"
>         return
> 
> 
> 
>     # delta_slice=int(delta_slice) #I do not need floating point arithmetic
> 
>     single_tag_no_rep=(single_tag_no_rep-single_tag_no_rep[0])/delta_slice
>     gaps=s.diff(single_tag_no_rep) #a bit more efficient than the line above
> 
>     #print "gaps is, ", gaps
> 
>     #gaps is now an array of integers. It either has a list of consecutive 1`s
>     # (which means a contact duration of delta_slice times the number of consecutive ones)
>     # of an entry higher than one which expresses (in units of delta_slice) the time during
>     #which the tag underwent no contact
>     
> 
>     #p.save("gaps.dat",gaps, fmt='%d')
> 
>     # find_gap=s.where(gaps != 1)[0]
> 
>     find_gap=s.where(gaps > 1)[0] #a better definition: a tag may estabilish
>     #several contacts within the same timeslice. So I may have some zeros in
>     #gaps due to different simultaneous contacts. a tag is truly disconnected
>     #from all the others when I see an increment larger than  one in the
>     #rescaled time.
> 
> 
>     gap_distr=(gaps[find_gap]-1)*delta_slice  #so, this is really the list of the
>     #time interval between two contacts for my tag. After the discussion with Ciro,
>     #I modified slightly the definition (now there is a -1) in the definition.
>     #It probably does not matter much for the calculated distribution.
> 
>     #print "gap_distr is, ", gap_distr
>     #NB: the procedure above does NOT break down is gap_distr is empty
> 
> 
>     #Now I calculate the duration of the contacts of my tag. I changed this bit since
>     #I had new discussions with Ciro
> 
>     #single_tag_no_rep=s.hstack((0,single_tag_no_rep))
> 
>     #print "single_tag_no_rep is, ", single_tag_no_rep, "and its length is, ", len(single_tag_no_rep)
>     
>     # e2=s.diff(single_tag_no_rep)
> 
>     # #print "e2 is, ", e2
>     
>     # sel=s.where(e2!=1)[0]
>     # #print "sel is, ", sel
> 
>     #sel=s.where(gaps!=1)[0]
> 
>     # res=0 #this will contain the results and will be overwritten
> 
>     #What is here needs to be tested very carefully! There may be some bugs
> 
> 
>     sol=s.hstack((0,find_gap,len(gaps)))
>     #print "sol is, ", sol
> 
>         
>     
>     res=s.diff(sol)
>     #print "res initially is, ", res
> 
> 
>     res[0]=res[0]+1 #to account for troubles I normally have at the beginning of the array
> 
>     #print "res is, ", res
>         
>         
> 
> 
>     res=res*delta_slice
> 
> 
>     #print "the sum of all the durations is, ", res.sum()
> 
>     return [res,gap_distr] 
> 
> 
> f = open(sys.argv[1])
> sliced_interactions = [map(int, string.split(line)) for line in f.readlines()]
> f.close()
> 
> print ("sliced_interactions is, ", sliced_interactions)
> 
> sliced_interactions = s.array(sliced_interactions, dtype="int64")
> 
> print ("sliced_interactions is now, ", sliced_interactions)
> 
> counter=0
> 
> delta_slice=1
> 
> contact_duration_and_interval_many_tags(sliced_interactions,\
>                                             delta_slice,counter)
> 
> 
> 
> 
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user