os walk() and threads problems (os.walk are thread safe?)

Marcus Alves Grando marcus at sbh.eng.br
Tue Nov 13 15:55:31 EST 2007


I made a new version that is closer to the original version:

--code--
#!/usr/bin/python

import os, sys, time
import glob, random, Queue
import threading

EXIT = False
BRANDS = {}
LOCK=threading.Lock()
EV=threading.Event()
POOL=Queue.Queue(0)
NRO_THREADS=20

def walkerr(err):
	print err

class Worker(threading.Thread):
	def run(self):
		EV.wait()
		while True:
			try:
				mydir=POOL.get(timeout=1)
				if mydir == None:
					continue

				for root, dirs, files in os.walk(mydir, onerror=walkerr):
					if EXIT:
						break

					terra_user = 'test'
					terra_brand = 'test'
					user_du = '0 a'
					user_total_files = 0

					LOCK.acquire()
					if not BRANDS.has_key(terra_brand):
						BRANDS[terra_brand] = {}
						BRANDS[terra_brand]['COUNT'] = 1
						BRANDS[terra_brand]['SIZE'] = int(user_du.split()[0])
						BRANDS[terra_brand]['FILES'] = user_total_files
					else:
						BRANDS[terra_brand]['COUNT'] = BRANDS[terra_brand]['COUNT'] + 1
						BRANDS[terra_brand]['SIZE'] = BRANDS[terra_brand]['SIZE'] + int(user_du.split()[0])
						BRANDS[terra_brand]['FILES'] = BRANDS[terra_brand]['FILES'] + user_total_files
					LOCK.release()

			except Queue.Empty:
				if EXIT:
					break
				else:
					continue
			except KeyboardInterrupt:
				break
			except Exception:
				print mydir
				raise

if len(sys.argv) < 2:
	print 'Usage: %s dir...' % sys.argv[0]
	sys.exit(1)

glob_dirs = []
for i in sys.argv[1:]:
	glob_dirs = glob_dirs + glob.glob(i+'/[a-z_]*')
random.shuffle(glob_dirs)

for x in xrange(NRO_THREADS):
	Worker().start()

try:
	for i in glob_dirs:
		POOL.put(i)

	EV.set()
	while not POOL.empty():
		time.sleep(1)
	EXIT = True

	while (threading.activeCount() > 1):
		time.sleep(1)
except KeyboardInterrupt:
	EXIT=True

for b in BRANDS:
	print '%s:%i:%i:%i' % (b, BRANDS[b]['SIZE'], BRANDS[b]['COUNT'], BRANDS[b]['FILES'])
--code--

And I ran it on several servers:

# uname -r
2.6.18-8.1.15.el5
# python test.py /usr
test:0:2267:0
# python test.py /usr
test:0:2224:0
# python test.py /usr
test:0:2380:0
# python -V
Python 2.4.3

# uname -r
7.0-BETA2
# python test.py /usr
test:0:1706:0
# python test.py /usr
test:0:1492:0
# python test.py /usr
test:0:1524:0
# python -V
Python 2.5.1

# uname -r
2.6.9-42.0.8.ELsmp
# python test.py /usr
test:0:1311:0
# python test.py /usr
test:0:1486:0
# python test.py /usr
test:0:1520:0
# python -V
Python 2.3.4

I really don't know what's happening.
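
One thing that may be worth checking in the script above (a guess from reading the code, not something confirmed in this thread): the main thread sets EXIT as soon as POOL is empty, but an empty queue only means every directory has been handed to a worker, not that every os.walk() has finished. Workers that are still walking would then hit `if EXIT: break` and stop early, which could make COUNT vary from run to run. Below is a minimal sketch of waiting for completion instead, using the queue's join() and task_done() methods (added in Python 2.5, so not available on the 2.3 and 2.4 servers). It is written for Python 3, where the module is named queue, and the names count_dirs and worker are made up for illustration:

```python
import os
import queue
import threading


def count_dirs(top_dirs, num_threads=4):
    """Walk each directory in top_dirs using worker threads.

    Returns the total number of directories visited by os.walk().
    """
    pool = queue.Queue()
    lock = threading.Lock()
    total = [0]  # one-element list so the nested function can update it

    def worker():
        while True:
            mydir = pool.get()
            try:
                if mydir is None:  # sentinel: no more work for this thread
                    return
                visited = sum(1 for _ in os.walk(mydir))
                with lock:
                    total[0] += visited
            finally:
                # Mark the item as done only after its walk has finished.
                pool.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()

    for d in top_dirs:
        pool.put(d)

    # Blocks until task_done() has been called for every queued directory,
    # i.e. until every walk is complete, not merely until the queue is empty.
    pool.join()

    for _ in threads:
        pool.put(None)  # one sentinel per worker
    for t in threads:
        t.join()

    return total[0]
```

With this structure, count_dirs(dirs, 1) and count_dirs(dirs, 20) should return the same number as long as the tree is not changing while it is being walked.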

Any other ideas?

Regards

Chris Mellon wrote:
> On Nov 13, 2007 1:06 PM, Marcus Alves Grando <marcus at sbh.eng.br> wrote:
>> Diez B. Roggisch wrote:
>>> Marcus Alves Grando wrote:
>>>
>>>> Diez B. Roggisch wrote:
>>>>> Marcus Alves Grando wrote:
>>>>>
>>>>>> Hello list,
>>>>>>
>>>>>> I have a strange problem with os.walk and threads in a Python script. I
>>>>>> have a script that creates some threads that consume a Queue. For every
>>>>>> value in the Queue, the script runs os.walk() and prints the root dir. But if I
>>>>>> increase the number of threads, the results are inconsistent compared with one
>>>>>> thread.
>>>>>>
>>>>>> For example, run this code with one thread and sort the output, then run it again
>>>>>> with ten threads and compare the two with diff(1).
>>>>> I don't see any difference. I ran it with 1 and 10 workers + sorted the
>>>>> output. No diff whatsoever.
>>>> Did you test in a directory with many subdirectories, like /usr or /usr/ports (on
>>>> FreeBSD) for example?
>>> Yes, over 1000 subdirs/files.
>> Strange, because for me it occurs every time.
>>
>>>>> And I don't know what you mean by diff(1) - was that supposed to be some
>>>>> output?
>>>> No. One thread produces one result and ten threads produce another result
>>>> with fewer lines.
>>>>
>>>> See the example below:
>>>>
>>>> @@ -13774,8 +13782,6 @@
>>>>   /usr/compat/linux/proc/44
>>>>   /usr/compat/linux/proc/45
>>>>   /usr/compat/linux/proc/45318
>>>> -/usr/compat/linux/proc/45484
>>>> -/usr/compat/linux/proc/45532
>>>>   /usr/compat/linux/proc/45857
>>>>   /usr/compat/linux/proc/45903
>>>>   /usr/compat/linux/proc/46
>>> I'm not sure what that directory is, but to me that looks like the
>>> linux /proc dir, containing process ids. Which incidentally changes
>>> between the two runs, as more threads will have process id aliases.
>> My example was not good enough. I ran this script on the ports directory of
>> FreeBSD and on IMAP folders on my Linux server, with the same result.
>>
>> @@ -182,7 +220,6 @@
>>   /usr/ports/archivers/p5-POE-Filter-Bzip2
>>   /usr/ports/archivers/p5-POE-Filter-LZF
>>   /usr/ports/archivers/p5-POE-Filter-LZO
>> -/usr/ports/archivers/p5-POE-Filter-LZW
>>   /usr/ports/archivers/p5-POE-Filter-Zlib
>>   /usr/ports/archivers/p5-PerlIO-gzip
>>   /usr/ports/archivers/p5-PerlIO-via-Bzip2
>> @@ -234,7 +271,6 @@
>>   /usr/ports/archivers/star-devel
>>   /usr/ports/archivers/star-devel/files
>>   /usr/ports/archivers/star/files
>> -/usr/ports/archivers/stuffit
>>   /usr/ports/archivers/szip
>>   /usr/ports/archivers/tardy
>>   /usr/ports/archivers/tardy/files
>>
>>
> 
> Are you just diffing the output? There's no guarantee that
> os.walk() will always have the same order, or that your different
> working threads will produce the same output in the same order. On my
> system, for example, I get a different order of subdirectory output
> when I run with 10 threads than with 1.
> 
> walk() requires that stat() works for the next directory that will be
> walked. It might be remotely possible that stat() is failing for some
> reason and some directories are being lost (this is probably not going
> to be reproducible). If you can reproduce it, try using pdb to see
> what's going on inside walk().
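
The points above about ordering and lost directories can be checked without depending on output order. A small sketch, assuming Python 3 and using hypothetical helper names (walk_roots, missing_between): it sorts the visited directories, compares two runs as sets, and collects whatever os.walk() reports through onerror. Note that onerror only receives the OSError raised when a directory cannot be listed, so a silent stat() failure on an entry would not necessarily show up there.

```python
import os


def walk_roots(top):
    """Return (sorted list of directories visited, list of errors from os.walk)."""
    errors = []
    # onerror is called with the OSError instance when listing a directory fails.
    roots = [root for root, dirs, files in os.walk(top, onerror=errors.append)]
    return sorted(roots), errors


def missing_between(run_a, run_b):
    """Directories present in run_a but absent from run_b, ignoring order."""
    return sorted(set(run_a) - set(run_b))
```

If missing_between() returns a non-empty list for two runs over an unchanged tree, the difference is real and not just a matter of ordering; the errors list then shows whether os.walk() reported any listing failure for those paths.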

-- 
Marcus Alves Grando
marcus(at)sbh.eng.br | Personal
mnag(at)FreeBSD.org  | FreeBSD.org
