[Python-ideas] Speed up os.walk() 5x to 9x by using file attributes from FindFirst/NextFile() and readdir()

Mike Meyer mwm at mired.org
Fri Nov 16 12:03:22 CET 2012


I'm pretty much convinced that - if the primary goal is to speed up
os.walk by leveraging the Windows calls and the existence of d_type on
some posix file systems - the proposed iterdir_stat interface is about
as good as we can do.

However, as a tool for making it easy to iterate through files in a
directory getting some/all stat information, I think it's ugly. It's
designed specifically for one system, with another common case sort of
wedged in. There's no telling how well it will handle any other
systems, but I can see that they might be problematic. Worse yet,
you wind up with stat information you can't trust, so you basically
have to write code like this to access multiple attributes:

    if st.attr1 is None:
        st = os.stat(...)
    func(st.attr1)
    if st.attr2 is None:
        st = os.stat(...)
    func(st.attr2)

Not bad if you only want one or two values, but ugly if you want four
or more.
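
To make the pattern concrete for several attributes at once, here is a
rough sketch. It assumes a hypothetical partial-stat object (called
partial here) whose unknown fields are None, as in the snippet above;
read_attrs is a made-up helper name and not part of any proposal.

```python
import os


def read_attrs(path, partial, names):
    """Return the named stat fields for `path` as a dict.

    `partial` is a hypothetical partial-stat object whose unknown
    fields are None. A real os.stat() is done once, the first time
    a requested field turns out to be missing.
    """
    st = partial
    result = {}
    for name in names:
        value = getattr(st, name, None)
        if value is None and st is partial:
            st = os.stat(path)  # fall back to the full stat, once
            value = getattr(st, name)
        result[name] = value
    return result
```

Every caller that wants more than a couple of fields would need
something along these lines, which is the cost I'm complaining about.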

I can see a number of alternatives to improve this situation:

1) Wrap the returned partial stat info in a proxy object that will do a
real stat if a request is made for a value that isn't there. This has
already been rejected.

2) Make iterdir_stat an os.walk internal tool, and don't export it.

3) Add some kind of "we have a full stat" indicator, so that clients
that want to use lots of attributes can just check that and do the
stat if needed.

4) Pick and document one of the stat values as a "we have a full
stat" indicator, to use like case 3.

5) Add a keyword argument to iterdir_stat that causes it to always
just do the full stat. Actually, having three modes might be useful:
the default is None, which is the currently proposed behavior. Setting
it to True causes the full stat to always be done, and setting it to
False just returns file names.

6) Deprecate os.walk, and provide os.itertree with an interface that
lets us leverage the available tools better. That's a whole other can
of worms, though.
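
As a rough illustration of the three modes in option 5, here is a
sketch only. The name iterdir_stat_sketch and the full_stat keyword are
made up for this example. It is built on os.scandir, which is available
in current Python versions, purely as a stand-in for the platform
calls; the "partial" result is simulated as an object that carries only
the file type, with the other fields set to None.

```python
import os
import stat
from types import SimpleNamespace


def iterdir_stat_sketch(path, full_stat=None):
    """Yield (name, st) pairs for the entries of `path`.

    full_stat=False -> st is always None (file names only)
    full_stat=True  -> st is a complete os.stat_result (from os.lstat)
    full_stat=None  -> st carries only the file type; the other
                       fields are None (simulated partial result)
    """
    with os.scandir(path) as entries:
        for entry in entries:
            if full_stat is False:
                yield entry.name, None
            elif full_stat is True:
                yield entry.name, os.lstat(entry.path)
            else:
                # Assumption: only the entry type is cheaply available.
                if entry.is_dir(follow_symlinks=False):
                    mode = stat.S_IFDIR
                elif entry.is_symlink():
                    mode = stat.S_IFLNK
                elif entry.is_file(follow_symlinks=False):
                    mode = stat.S_IFREG
                else:
                    mode = None
                yield entry.name, SimpleNamespace(
                    st_mode=mode, st_size=None, st_mtime=None)
```

A caller that needs lots of attributes would pass full_stat=True and
skip the None checks entirely; os.walk itself could stay with the
default.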

   Thanks,
   <mike


