Strange behavior

Virgil Stokes vs at it.uu.se
Thu Aug 16 07:18:59 EDT 2012


On 15-Aug-2012 02:19, Steven D'Aprano wrote:
> On Tue, 14 Aug 2012 21:40:10 +0200, Virgil Stokes wrote:
>
>> You might find the following useful:
>>
>> def testFunc(startingList):
>>       xOnlyList = []; j = -1
>>       for xl in startingList:
>>           if (xl[0] == 'x'):
> That's going to fail in the starting list contains an empty string. Use
> xl.startswith('x') instead.
Yes, but this was by design (tacitly assumed that startingList was both a list 
and non-empty).
>
>
>>               xOnlyList.append(xl)
>>           else:
>>               j += 1
>>               startingList[j] = xl
> Very cunning, but I have to say that your algorithm fails the "is this
> obviously correct without needing to study it?" test. Sometimes that is
> unavoidable, but for something like this, there are simpler ways to solve
> the same problem.
Sorry, but I do not sure what you mean here.
>
>
>>       if j == -1:
>>           startingList = []
>>       else:
>>           del startingList[j:-1]
>>       return(xOnlyList)
>
>> And here is another version using list comprehension that I prefer
>> def testFunc2(startingList):
>>       return([x for x in startingList if x[0] == 'x'], [x for x in
>> startingList if x[0] != 'x'])
> This walks over the starting list twice, doing essentially the same thing
> both times. It also fails to meet the stated requirement that
> startingList is modified in place, by returning a new list instead.
This can meet the requirement that startingList is modified in place via the 
call to this function (see the attached code).
> Here's an example of what I mean:
>
> py> mylist = mylist2 = ['a', 'x', 'b', 'xx', 'cx']  # two names for one
> list
> py> result, mylist = testFunc2(mylist)
> py> mylist
> ['a', 'b', 'cx']
> py> mylist2  # should be same as mylist
> ['a', 'x', 'b', 'xx', 'cx']
Yes, I had a typo in my original posting --- sorry about that!
>
> Here is the obvious algorithm for extracting and removing words starting
> with 'x'. It walks the starting list only once, and modifies it in place.
> The only trick needed is list slice assignment at the end.
>
> def extract_x_words(words):
>      words_with_x = []
>      words_without_x = []
>      for word in words:
>          if word.startswith('x'):
>              words_with_x.append(word)
>          else:
>              words_without_x.append(word)
>      words[:] = words_without_x  # slice assignment
>      return words_with_x
Suppose words was not a list --- you have tacitly assumed that words is a list.
>
> The only downside of this is that if the list of words is so enormous
> that you can fit it in memory *once* but not *twice*, this may fail. But
> the same applies to the list comprehension solution.
But, this is not the only downside if speed is important --- it is slower than 
the list comprehension method (see results that follows).

Here is a summary of three algorithms (algorithm-1, algorithm-2, algorithm-2A) 
that I tested (see attached code). Note, algorithm-2A was obtained by removing 
the slice assignment in the above code and modifying the return as follows

def extract_x_words(words):
     words_with_x = []
     words_without_x = []
     for word in words:
         if word.startswith('x'):
             words_with_x.append(word)
         else:
             words_without_x.append(word)
     #words[:] = words_without_x  # slice assignment
     return words_with_x, words_without_x

Of course, one needs to modify the call for "in-place" update of startingList as 
follows:

    xOnlyList,startingList = extract_x_words(startingList)

Here is a summary of my timing results obtained for 3 different algorithms for 
lists with 100,000 strings of length 4 in each list:

Method
	average (sd) time in seconds
algorithm-1 (list comprehension)
	0.11630 (0.0014)
algorithm-2 (S. D'Aprano)
	0.17594 (0.0014)
algorithm-2A (modified S. D'Aprano)
	0.18217 (0.0023)


These values  were obtained from 100 independent runs (MC simulations) on lists 
that contain 100,000 strings. Approximately 50% of these strings contained a 
leading 'x'. Note, that the results show that algorithm-2 (suggested by S. 
D'Aprano) is approximately 51% slower than algorithm-1 (list comprehensions) and 
algorithm-2A (simple modification of algorithm-2) is approximately 57% slower 
than algorithm-1. Why is algorithm-2A slower than algorithm-2?

I would be interested in seeing code that is faster than algorithm-1 --- any 
suggestions are welcomed.  And of course, if there are any errors in my attached 
code please inform me of them and I will try to correct them as soon as 
possible. Note, some of the code is actually irrelevant for the original 
"Strange behavior" post.

Have a good day!

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20120816/9a1f6a64/attachment.html>
-------------- next part --------------
'''
  Purpose: Time three different algorithms for the same task
  
  Author: V. Stokes (vs at it.uu.se, 2012-08-16)
  Refs:
   python-list at python.org list
    * Strange behavior, 14-Aug-2012 17:38, light1quark at gmail.com
    * Re: Strange behavior, 14-Aug-2012 21:40, Stokes, Virgil
    * Re: Strange behavior, 15-Aug-2012 02:19, Steven D'Aprano
  
  Note:
   1. The mean and standard deviation over the runs (MC simulations)
      are estimated using recursive equations.
   2. A seed (syd) is used with the RNG for repeatability. Each run is 
      started with a new seed to force the generation of independent 
      random sequences.
   3. Warning! No checks are made on the parameters passed to
      the functions (this was by design).
   4. No effort has been made to make this code elegant. My focus was
      to make the code clear and easy to understand.
   5. This was executed on a Windows Vista 32-bit platform with Python 2.6.6
       Processor: Intel(R) core(TM)2 Duo CPU E8500 at 3.16GHz 3.17GHz
   6. The estimated time to completion is displayed after each run.    
      
'''
import random as random
import math as math
from time import clock # clock gives good resolution on MS Windows

def testFunc1(startingList): 
    '''
      Algorithm-1
      Note: 
        One should check for an empty startingList before 
        calling testFunc1 -- If this possibility exists!
    '''
    return([x for x in startingList if x[0] == 'x'], 
           [x for x in startingList if x[0] != 'x'])

def testFunc2(words):
    '''
      Algorithm-2
    '''
    words_with_x = []
    words_without_x = []
    for word in words:
        if word.startswith('x'):
            words_with_x.append(word)
        else:
            words_without_x.append(word)
    words[:] = words_without_x  # slice assignment
    return words_with_x

def testFunc2A(words):
    '''
      Algorithm-2A
    '''
    words_with_x = []
    words_without_x = []
    for word in words:
        if word.startswith('x'):
            words_with_x.append(word)
        else:
            words_without_x.append(word)
    #words[:] = words_without_x  # slice assignment
    return words_with_x, words_without_x


def genStrList(NChar,NStrng,Alph,leadChr):
    '''
     Purpose: Generate a list of NStrng elements with each element a string
              of length NChar and constrained such that approx. 50% of the 
              strings will begin with the character (symbol) leadChr
       Inputs: 
           NChar -- number of characters in each element
          NStrng -- number of elements in list (strgList)
            Alph -- list of characters to be used (must contain the leadChr)
         leadChr -- leading character for strings to be generated from Alph
       Otputs:
        strgList -- list with NString strings of NChar characters (from Alph) in each                    
    '''
    NSym = len(Alph)
    strgList = []
    for i in range(NStrng): 
        if random.random() < 0.5: 
            char = leadChr
        else: 
            while True:
                indx = random.randint(0,NSym-1)            
                char = Alph[indx]
                if char != leadChr: break
            
        for j in range(1,NChar):
            indx = random.randint(0,NSym-1)            
            char = char + Alph[indx]
            
        strgList.append(char)   
        
    return strgList

#++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Alph = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
        'n','o','p','q','r','s','t','u','v','w','x','y','z']

NChar = 4         # 4 chars per string
NStrng = 100000   # 100,000 strings in list 
NRuns = 100       # 100 random lists (number of MC simulations)

# For quick testing
#NChar = 4        # 4 chars per string
#NStrng = 10000   # 10,000 strings in list 
#NRuns = 20       # 20 random lists

syd = 1071       # initial seed for RNG (for repeatability)

printRunningMean = False

mu1,mu2,mu3,mug = 0.0,0.0,0.0,0.0
x1,x2,x3 = 0.0,0.0,0.0
print '%5d runs with lists containing %7d elements' %(NRuns,NStrng)
for irun in range(NRuns):
    tic = clock()
    k = irun + 1
    print ' run - %2d' %k
    syd += 19    # new seed for each run
    random.seed(syd)
    #print 'Genrating list of %7d strings...' %NStrng
    strgList = genStrList(NChar,NStrng,Alph,'x')   
    
    t0 = clock()
    xOnlyList,strgList = testFunc1(strgList) # algorithm-1
    t1 = clock()    
    
    #print 'Length of xOnlyList = %7d' %len(xOnlyList)
    #print ' Length of strgList = %7d' %len(strgList)    
    etime = t1 - t0
    delta = etime - mu1
    mu1 += delta/k
    if printRunningMean:
        print '  algorithm-1 -- %9.5f' %mu1
    x1 += delta*(etime-mu1)
    
    #print
    random.seed(syd)
    #print 'Genrating list of %7d strings...' %NStrng
    strgList = genStrList(NChar,NStrng,Alph,'x')
    
    t0 = clock()
    xOnlyList = testFunc2(strgList)          # algorithm-2   
    t1 = clock()
    
    #print 'Length of xOnlyList = %7d' %len(xOnlyList)
    #print ' Length of strgList = %7d' %len(strgList)    
    etime = t1 - t0
    delta = etime - mu2
    mu2 += delta/k
    if printRunningMean:
        print '  algorithm-2 -- %9.5f' %mu2
    x2 += delta*(etime-mu2)    
    
    random.seed(syd)    
    #print 'Genrating list of %7d strings...' %NStrng
    strgList = genStrList(NChar,NStrng,Alph,'x')
    
    t0 = clock()
    xOnlyList,strgList = testFunc2A(strgList)          # algorithm-2A
    t1 = clock()
    
    #print 'Length of xOnlyList = %7d' %len(xOnlyList)
    #print ' Length of strgList = %7d' %len(strgList)    
    etime = t1 - t0
    delta = etime - mu3
    mu3 += delta/k
    if printRunningMean:
        print '  algorithm-2A -- %9.5f' %mu3
    x3 += delta*(etime-mu3)
    
    toc = clock()
    loopTime = toc - tic
    delta = loopTime - mug
    mug += delta/k
    runsR = NRuns - k
    estToComplete = mug*runsR
    print '   ETTC = %8.2f seconds' %estToComplete
    

print
print '        Timing in seconds'
print ' ************************************'
print '    method     |     avg time (sd)'
print ' ------------------------------------'
print ' algorithm-1   | %9.5f (%8.6f)' %(mu1,math.sqrt(x1/NRuns))
print ' algorithm-2   | %9.5f (%8.6f)' %(mu2,math.sqrt(x2/NRuns))
print ' algorithm-2A  | %9.5f (%8.6f)' %(mu3,math.sqrt(x3/NRuns))

print
print "Th-tha-tha-that's all folks!"


More information about the Python-list mailing list