[Baypiggies] Fwd: manipulating lists question

Martin Falatic martin at falatic.com
Thu Dec 5 13:52:35 CET 2013


Actually, I'll amend that: if you want to "compress" the data in this way,
you need to intelligently expand a given field into a list when
conflicting data enters. This all assumes you cannot hold all the data in
full-blown form before trying to compress it; in effect, you're taking an
input stream that might not be ordered by key to begin with (the 1302 set
comes in, then the 6558 set, then maybe way later another 1302 set...).
If you could hold all of a given collection of data sets (say all the
1302s) in memory before processing them, it's relatively easy to expand
the fields into lists when necessary.

If you cannot, then there will need to be more code to expand a given flat
field into a list.

Suppose we have n=4 input lists:
['cat', 'NM123', 12]
['cat', 'NM123', 12]
['cat', 'NM123', 65]
['cat', 'NM456', 34]

You want to end up with the following (step-wise):
{'cat': ['NM123', 12]}
{'cat': ['NM123', 12]}
{'cat': ['NM123', [12,12,65]]}
{'cat': [['NM123','NM123','NM123','NM456'], [12,12,65,34]]}


It sounds easy at first, but you have to keep track of how many data sets
you have seen so far for a given key to properly inflate a given field.
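
As a rough sketch of just that inflation step (the helper name `promote`
and its parameters are made up for illustration): when a value arrives that
differs from the flat value stored so far, the flat value is repeated once
per data set already seen, and the new value goes on the end.

```python
def promote(flat_value, seen_before, new_value):
    """Expand a flat field into a list once a differing value arrives.

    flat_value  -- the single value shared by every data set seen so far
    seen_before -- how many data sets for this key came before the new one
    new_value   -- the differing value from the newest data set
    """
    return [flat_value] * seen_before + [new_value]


# The third 'cat' row arrives: two earlier rows both had 12, this one has 65.
field = promote(12, 2, 65)
print(field)  # [12, 12, 65]
```

Without the count of earlier data sets there would be no way to know how
many copies of the flat value the list should start with.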

Again, this all assumes the objective is a lossless compaction of a set of
large and mostly-repetitive data lists. If mangling or discarding some of
the data is OK then it's a simpler matter, but clearly there's a desire to
retain the differing data, and thus you probably need to be able to
reconstruct it in the original order too.
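
To make the reconstruction idea concrete, here is a minimal sketch of
re-inflating a compacted dictionary. The name `reinflate` and the separate
`counts` dict (data sets seen per key) are assumptions for illustration. It
also assumes no original field value is itself a list, and it restores the
order only within each key: rows for different keys that were interleaved
in the input come back grouped by key.

```python
def reinflate(compacted, counts):
    """Rebuild the original rows, key by key, from the compacted form."""
    rows = []
    for key, fields in compacted.items():
        for i in range(counts[key]):
            row = [key]
            for field in fields:
                # A list holds one value per data set (in arrival order);
                # a flat value is shared by every data set for this key.
                row.append(field[i] if isinstance(field, list) else field)
            rows.append(row)
    return rows


compacted = {'cat': [['NM123', 'NM123', 'NM123', 'NM456'], [12, 12, 65, 34]],
             'dog': ['NM56', 65]}
counts = {'cat': 4, 'dog': 1}

rows = reinflate(compacted, counts)
for row in rows:
    print(row)
```

The per-key count matters for keys whose fields never varied: with every
field still flat, nothing else records how many data sets there were.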

Note that if you have 30 data sets for a given key and one is different in
a given field, you still get all 29 identical + 1 different values in a
list for that field, but the other fields stay as single values, saving
space. As long as the data sets for a given key are all unique you're
being as efficient as possible without sacrificing data fidelity, and it's
still human-readable.

All that said, here's code that compacts your data without losing any of
it (that is, you could reinflate each non-identical data set accurately).
You can easily convert the sub-lists to comma/semicolon-delimited text
strings as desired. I left the debug prints in (commented out) in case you
want to inspect the flow of how it works more quickly. *It worked with
your inputs as well.*

    x = [['cat', 'NM123', 12], ['cat', 'NM123', 12], ['cat', 'NM123', 65],
         ['cat', 'NM456', 34], ['dog', 'NM56', 65]]

    y = dict() # Output values
    k = dict() # Number of data sets seen so far for each key
    for parts in x:
        #print()
        key = parts[0]
        if key in y:
            k[key] += 1
        else:
            k[key] = 1
            y[key] = []
        #print("Processing key", key, k[key])
        # range() works on both Python 2 and 3 (xrange is Python 2 only)
        for idx in range(len(parts)-1):
            new_data = parts[idx+1]
            #print("Processing index", idx, len(y[key]), new_data)
            if len(y[key]) <= idx:
                #print("new element")
                y[key].append(new_data)
            else:
                cur_data = y[key][idx]
                if isinstance(cur_data, list):
                    #print("Appending to list")
                    cur_data.append(new_data)
                elif cur_data != new_data:
                    # Promote the flat value to a list: one copy per
                    # earlier data set, then the new differing value
                    cur_data = [cur_data] * (k[key]-1)
                    cur_data.append(new_data)
                    y[key][idx] = cur_data
                    #print("promoting elements to list", cur_data)
                # else: not a contradicting data point, keep the flat value
    print(x)
    print(y)

for x = [['cat', 'NM123', 12], ['cat', 'NM123', 12], ['cat', 'NM123', 65],
['cat', 'NM456', 34], ['dog', 'NM56', 65]]

I got y = {'dog': ['NM56', 65], 'cat': [['NM123', 'NM123', 'NM123',
'NM456'], [12, 12, 65, 34]]}

Which matches what we would expect to see.
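
If you do want the delimited-string form for storage, a sketch along these
lines might do. The function name `stringify` and the default ';' separator
are only illustrative choices. Note that it turns every value into a str
(so 65 becomes '65'), and it is only reversible if no value contains the
separator itself.

```python
def stringify(compacted, sep=';'):
    """Join list-valued fields into delimited strings; str() the rest."""
    out = {}
    for key, fields in compacted.items():
        out[key] = [sep.join(str(v) for v in field)
                    if isinstance(field, list) else str(field)
                    for field in fields]
    return out


y = {'dog': ['NM56', 65],
     'cat': [['NM123', 'NM123', 'NM123', 'NM456'], [12, 12, 65, 34]]}

flat = stringify(y)
print(flat['cat'])  # ['NM123;NM123;NM123;NM456', '12;12;65;34']
print(flat['dog'])  # ['NM56', '65']
```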

 - Marty


On Thu, December 5, 2013 03:21, Martin Falatic wrote:
> One can leave that as an exercise for the reader. :-)
>
>
> I'm not sure why this gets a ';' versus a ',', nor is it clear if these
> field lists are supposed to be deduped or ordered or what... Consider if
> you have three sets of data for 1302, and two vary ONLY in field 28.
> If you go to reconstruct the data set you end up with a somewhat mangled
> thing.
>
> I take it this is an effort to compress the original data set to a more
> manageable size for output / internal representation, which suggests
> deduping isn't desirable (but which also suggests that you should simply
> include every instance for fields 1 (which we already do) and 28, and
> hope none of the other fields vary).
>
> On that note I'll throw this idea out there: given key field 0 as
> identical for n sets of data, for every subsequent field [1:] if the item
> is a str or int, consider it duplicated for all n sets. If the item is a
> list then it must have exactly n elements (in the order the n sets were
> parsed). That way if another field is found to vary unexpectedly, it'll
> simply become a list of n elements (many might be the same).
>
> You can always take that and stringify the elements and lists for storage
> or whatever. The idea is that your internal data representation is a
> much easier-to-work-with set of lists/strs/ints within dictionary entries.
>
>
>
> - Marty
>
>
>
> On Thu, December 5, 2013 03:06, Vikram K wrote:
>
>> Good catch. All the other elements remain the same except this one.
>> Element 28 needs to be changed (in the merged/collapsed list) so that
>> when we fuse or merge two elements of the larger list into one then
>> Element 28 of the new element is (just combine whatever is present in
>> element 28 in both the lists keeping a ';' as delimiter):
>>
>> '1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC';
>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC'
>>
>>
>>
>>
>>
>> On Thu, Dec 5, 2013 at 5:51 AM, Martin Falatic <martin at falatic.com>
>> wrote:
>>
>>
>>
>>> My solution works for the first three elements as stated, but what
>>> you do with the rest of the elements is tricky if they differ for a
>>> given key.
>>>
>>> For 1302 all the fields in the slice [3:] match each other *except*
>>> element 28:
>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC'
>>> '1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC'
>>>
>>> Does this potentially happen with other elements at times? At this
>>> point you're faced with either discarding data or mangling data
>>> together. The "collapse" just takes the last [3:] slice encountered
>>> (for that remainder of data). Is that acceptable?
>>>
>>> - Marty
>>>
>>>
>>>
>>>
>>> On Thu, December 5, 2013 02:33, Vikram K wrote:
>>>
>>>
>>>> In the example i have given, the second and third elements of the
>>>> larger list (comp[7] and comp[8]) have a 1:1 mapping after the
>>>> second element. So i would like to keep the first element as it is
>>>> and then collapse or merge the second and third elements (comp[7]
>>>> and comp[8]) into a single element:
>>>>
>>>>
>>>>
>>>> >>> comp[6]
>>>> ['6558', 'NM_001046.2', 'SLC12A2', '6037226', '2', 'chr5',
>>>> '127502453', '127502454', 'het-ref', 'snp', 'A', 'T', 'A', '185',
>>>> '113', '184', '112', 'VQHIGH', 'VQHIGH', '', '', '', '', '259974',
>>>> '9', '6', '6', '15', '6558:NM_001046.2:SLC12A2:CDS:MISSENSE',
>>>> '6558:NM_001046.2:SLC12A2:CDS:NO-CHANGE', 'PFAM:PF01490:Aa_trans',
>>>> '', '', '', '0.99', '2', '0.99', '0.998', '1.01', '1.000', '0.5',
>>>> '0.46', '0.5', '1', '18', '18', '19', 'ref-identical;onlyA', 'snp',
>>>> '0.072', '-1', 'SQHIGH']
>>>>
>>>>
>>>>
>>>>
>>>> >>> comp[7]
>>>> ['1302', 'NM_080679.2', 'COL11A2', '6525172', '2', 'chr6',
>>>> '33271374', '33271376', 'het-ref', 'del', 'GT', '', 'GT', '542',
>>>> '542', '458', '458', 'VQHIGH', 'VQHIGH', '', '', '', '', '71150',
>>>> '34', '106', '106', '140',
>>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC',
>>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080681.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;6257:NM_021976.3:RXRB:CDS:NO-CHANGE',
>>>> '', '', '', '', '0.95', '2', '0.98', '0.998', '0.99', '1.000',
>>>> '0.46', '0.42', '0.5', '0', '102', '102', '102',
>>>> 'ref-identical;onlyA', 'del', '0.990', '6', 'SQHIGH']
>>>>
>>>>
>>>>
>>>>
>>>> >>> comp[8]
>>>> ['1302', 'NM_080680.2', 'COL11A2', '6525172', '2', 'chr6',
>>>> '33271374', '33271376', 'het-ref', 'del', 'GT', '', 'GT', '542',
>>>> '542', '458', '458', 'VQHIGH', 'VQHIGH', '', '', '', '', '71150',
>>>> '34', '106', '106', '140',
>>>> '1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC',
>>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080681.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;6257:NM_021976.3:RXRB:CDS:NO-CHANGE',
>>>> '', '', '', '', '0.95', '2', '0.98', '0.998', '0.99', '1.000',
>>>> '0.46', '0.42', '0.5', '0', '102', '102', '102',
>>>> 'ref-identical;onlyA', 'del', '0.990', '6', 'SQHIGH']
>>>>
>>>>
>>>>
>>>>
>>>> After collapsing comp[7] and comp[8] i  get:
>>>>
>>>>
>>>>
>>>>
>>>> >>> collapsed = ['1302', 'NM_080679.2,NM_080680.2', 'COL11A2',
>>>> '6525172', '2', 'chr6', '33271374', '33271376', 'het-ref', 'del',
>>>> 'GT', '', 'GT', '542', '542', '458', '458', 'VQHIGH', 'VQHIGH', '',
>>>> '', '', '', '71150', '34', '106', '106', '140',
>>>> '1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC',
>>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080681.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;6257:NM_021976.3:RXRB:CDS:NO-CHANGE',
>>>> '', '', '', '', '0.95', '2', '0.98', '0.998', '0.99', '1.000',
>>>> '0.46', '0.42', '0.5', '0', '102', '102', '102',
>>>> 'ref-identical;onlyA', 'del', '0.990', '6', 'SQHIGH']
>>>>
>>>>
>>>>
>>>>
>>>> So in my larger list, after the modification, comp[6] is the first
>>>> element and collapsed the second element.
>>>>>>>
>>>>
>>>>
>>>> On Thu, Dec 5, 2013 at 5:22 AM, Martin Falatic <martin at falatic.com>
>>>>  wrote:
>>>>
>>>>
>>>>
>>>>
>>>>> Ah, genetics! Intriguing...
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Do you need anything beyond the third elements of each list? Does
>>>>> the third element always map 1:1 with the first, or could it
>>>>> vary? If so, what then?
>>>>>
>>>>> To refer to the simplified example, could you have this?
>>>>> x = [['cat', 'NM123', 12], ['cat', 'NM234', 43], ['dog', 'NM56', 65]]
>>>>>
>>>>>
>>>>>
>>>>> If so, what is the expected output?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> - Marty
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, December 5, 2013 02:11, Vikram K wrote:
>>>>>
>>>>>
>>>>>
>>>>>> i am having some difficulty in applying this to my actual
>>>>>> problem although i love the dictionary method. Imagine the
>>>>>> following three lists are the first, second and third elements
>>>>>> of a larger list:
>>>>>>
>>>>>>
>>>>>>
>>>>>> >>> comp[6]
>>>>>> ['6558', 'NM_001046.2', 'SLC12A2', '6037226', '2', 'chr5',
>>>>>> '127502453', '127502454', 'het-ref', 'snp', 'A', 'T', 'A', '185',
>>>>>> '113', '184', '112', 'VQHIGH', 'VQHIGH', '', '', '', '', '259974',
>>>>>> '9', '6', '6', '15', '6558:NM_001046.2:SLC12A2:CDS:MISSENSE',
>>>>>> '6558:NM_001046.2:SLC12A2:CDS:NO-CHANGE', 'PFAM:PF01490:Aa_trans',
>>>>>> '', '', '', '0.99', '2', '0.99', '0.998', '1.01', '1.000', '0.5',
>>>>>> '0.46', '0.5', '1', '18', '18', '19', 'ref-identical;onlyA', 'snp',
>>>>>> '0.072', '-1', 'SQHIGH']
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> >>> comp[7]
>>>>>> ['1302', 'NM_080679.2', 'COL11A2', '6525172', '2', 'chr6',
>>>>>> '33271374', '33271376', 'het-ref', 'del', 'GT', '', 'GT', '542',
>>>>>> '542', '458', '458', 'VQHIGH', 'VQHIGH', '', '', '', '', '71150',
>>>>>> '34', '106', '106', '140',
>>>>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC',
>>>>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080681.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;6257:NM_021976.3:RXRB:CDS:NO-CHANGE',
>>>>>> '', '', '', '', '0.95', '2', '0.98', '0.998', '0.99', '1.000',
>>>>>> '0.46', '0.42', '0.5', '0', '102', '102', '102',
>>>>>> 'ref-identical;onlyA', 'del', '0.990', '6', 'SQHIGH']
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> >>> comp[8]
>>>>>> ['1302', 'NM_080680.2', 'COL11A2', '6525172', '2', 'chr6',
>>>>>> '33271374', '33271376', 'het-ref', 'del', 'GT', '', 'GT', '542',
>>>>>> '542', '458', '458', 'VQHIGH', 'VQHIGH', '', '', '', '', '71150',
>>>>>> '34', '106', '106', '140',
>>>>>> '1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC',
>>>>>> '1302:NM_080679.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080680.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;1302:NM_080681.2:COL11A2:TSS-UPSTREAM:UNKNOWN-INC;6257:NM_021976.3:RXRB:CDS:NO-CHANGE',
>>>>>> '', '', '', '', '0.95', '2', '0.98', '0.998', '0.99', '1.000',
>>>>>> '0.46', '0.42', '0.5', '0', '102', '102', '102',
>>>>>> 'ref-identical;onlyA', 'del', '0.990', '6', 'SQHIGH']
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>>>
>>>>>>
>>>>>> ------
>>>>>> Can we apply the dictionary method to the problem where the key
>>>>>> of the dictionary is the first element of the three smaller
>>>>>> lists ('6558', '1302', '1302'). The second and third elements of
>>>>>> the larger list (starting with '1302') need to be collapsed into a
>>>>>> single element, based on their second element ('NM_080679.2' and
>>>>>> 'NM_080680.2') in a way similar to how we had tackled the toy
>>>>>> problem:
>>>>>>
>>>>>> x = [['cat', 'NM123', 12], ['cat', 'NM234', 12], ['dog', 'NM56', 65]]
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 5, 2013 at 4:18 AM, Michiel Overtoom
>>>>>> <motoom at xs4all.nl>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Dec 5, 2013, at 10:09, Vikram K wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> another option could have been to obtain a dictionary like so:
>>>>>>>> {'dog': ['NM56', 65], 'cat': ['NM123,NM234', 12]}
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Oh, in that case the code can become somewhat simpler:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> x = [['cat', 'NM123', 12], ['cat', 'NM234', 12], ['dog', 'NM56', 65]]
>>>>>>>
>>>>>>> d = {}
>>>>>>> for key, label, quant in x:
>>>>>>>     if key in d:
>>>>>>>         d[key][0] += ", " + label
>>>>>>>     else:
>>>>>>>         d[key] = [label, quant]
>>>>>>>
>>>>>>> print(d)
>>>>>>>
>>>>>>>
>>>>>>> I agree with Michael that the problem is somewhat
>>>>>>> underspecified, but it's a starting point.
>>>>>>>
>>>>>>> Greetings,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> "If you don't know, the thing to do is not to get scared, but
>>>>>>> to learn." - Ayn Rand
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> _______________________________________________
>>>>>> Baypiggies mailing list
>>>>>> Baypiggies at python.org
>>>>>> To change your subscription options or unsubscribe:
>>>>>> https://mail.python.org/mailman/listinfo/baypiggies
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
>



