sampling items from a nested list

Thu Feb 17 01:53:00 EST 2005

Steven Bethard wrote:
> Michael Spencer wrote:
> 
>> Steven Bethard wrote:
>>
>>> So, I have a list of lists, where the items in each sublist are of 
>>> basically the same form.  It looks something like:
>>>
>> ...
>>
>>>
>>> Can anyone see a simpler way of doing this?
>>>
>>> Steve
>>
>>
>> You just make these up to keep us amused, don't you? ;-)
> 
> 
> Heh heh.  I wish.  It's actually about resampling data read in the 
> Yamcha data format:
> 
> http://chasen.org/~taku/software/yamcha/
> 
> So each sublist is a "sentence" and each tuple is the feature vector for 
> a "word".  The point is to even out the number of positive and negative 
> examples because support vector machines typically work better with 
> balanced data sets.
> 
>> If you don't need to preserve the ordering, would the following work?:
>>
> [snip]
> 
>>
>>  >>> import random
>>  >>> def resample2(data):
>>  ...     bag = {}
>>  ...     random.shuffle(data)
>>  ...     return [[(item, label)
>>  ...                 for item, label in group
>>  ...                     if bag.setdefault(label,[]).append(item)
>>  ...                         or len(bag[label]) < 3]
>>  ...                            for group in data if not random.shuffle(group)]
> 
> 
> It would be preferable to preserve ordering, but it's not absolutely 
> crucial.  Thanks for the suggestion!
> 
> STeVe
Maybe combine this with a DSU (decorate-sort-undecorate) pattern?  Not sure
whether the result would be better than what you started with.
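
A minimal sketch of how that combination might look, not a tested drop-in.
It assumes each sublist holds (item, label) pairs, as resample2 does; the
function name resample_dsu, the per-label cap n, and the rng parameter are
all made up for illustration.  Each pair is decorated with its original
(group, index) position, the flat list is shuffled, at most n pairs are kept
per label, and the survivors are sorted back on the decoration before it is
stripped off again:

```python
import random

def resample_dsu(data, n, rng=random):
    """Keep at most n randomly chosen items per label, preserving order.

    Hypothetical helper: assumes data is a list of lists of (item, label).
    """
    # Decorate: tag every (item, label) pair with its original position.
    decorated = [(g, i, item, label)
                 for g, group in enumerate(data)
                 for i, (item, label) in enumerate(group)]
    rng.shuffle(decorated)

    # Select: walk the shuffled list, keeping the first n entries per label.
    counts = {}
    kept = []
    for entry in decorated:
        label = entry[3]
        if counts.get(label, 0) < n:
            counts[label] = counts.get(label, 0) + 1
            kept.append(entry)

    # Sort on the (group, index) decoration to restore the original order.
    kept.sort(key=lambda entry: entry[:2])

    # Undecorate: rebuild the nested structure, one sublist per group.
    result = [[] for _ in data]
    for g, _, item, label in kept:
        result[g].append((item, label))
    return result
```

Unlike resample2, this leaves data itself unshuffled, and a group whose items
were all dropped comes back as an empty list rather than disappearing.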

Michael