[issue38764] Deterministic globbing.

Nathaniel Smith report at bugs.python.org
Sun Nov 10 23:26:19 EST 2019


Nathaniel Smith <njs at pobox.com> added the comment:

> I saw the article as well, but think auto-sorting would have just provided a thin mask over their serious data pipeline bugs.

This seems like an inappropriately elitist attitude. I'm sure their code has bugs; so does mine and yours. But in fact they did test their code thoroughly, demonstrated that it produced correct results on their system, and it would have produced correct results for their downstream users too if Python's 'glob' had behaved consistently across platforms. Python isn't just for folks who can "git gud" to meet your arbitrary standards.

In addition, auto-sorting has a number of ergonomic benefits, beyond this one case: it makes output more deterministic, makes it easy to estimate progress given just a list of filenames being processed ("hmm, this script has been running for ten minutes and it's only on the Cs, I guess this will take a few hours"), and so on.

> Also, this isn't the only pathway to seeing file lists:  os.listdir, fnmatch, etc.

os.listdir is a genuinely low-level OS-level tool, as indicated by the name, versus 'glob' which is designed as a convenient interface for high-level scripting.

fnmatch doesn't deal with either files or lists, so I'm not sure why you're bringing it up here.

> The reasons for the previous rejections still apply.

The reasons I see in the previous issues are:

- Sorting adds overhead
- It's hard to implement, esp. if iglob uses scandir internally
- iglob can't sort, so glob should be consistent
- it's easy to sort after the fact

But these are all false.

- The overhead added by sorting is negligible. In some informal benchmarks I measured it as ~4% extra CPU time in the worst case, if you're (a) running on Linux with it's ultra-optimized VFS caching, (b) all the directories are already cached in RAM, (c) you're not actually doing anything with the files except calling glob() and then discarding the result. Of course, it's always possible my benchmarks missed an important case; if you have benchmarks that show otherwise please post them so we can discuss.

- The implementation is trivial, as shown from the PR – literally just replacing two calls to 'list' with calls to 'sorted'.

- iglob being incremental doesn't actually pose any obstacles to sorting. Even if it's using scandir(), it still has to load each individual directory list into memory in one batch, to avoid the risk of leaking file descriptors.

- It's actually not trivial to sort filenames in the natural order afterwards, because you have to split each filename into components. It's also asymptotically slower than doing the sort as-you-go, O(total-files log total-files), versus O(total-files log maximum-individual-directory).

Given that there's at least one extremely clear case where not-sorting caused substantial harm, that there are lots of other minor cases where it's a nice win, and that the costs are negligible, sorting seems like the obviously correct default.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue38764>
_______________________________________


More information about the Python-bugs-list mailing list