how to remove oldest files up to a limit efficiently

linuxnow at gmail.com
Wed Jul 9 18:19:41 EDT 2008


On Jul 9, 8:46 am, Dan Stromberg <dstrombergli... at gmail.com> wrote:
> On Tue, 08 Jul 2008 15:18:23 -0700, linux... at gmail.com wrote:
> > I need to maintain a filesystem where I'll keep only the most recently
> > used (MRU) files; the least recently used (LRU) ones have to be removed
> > to leave space for newer ones. The filesystem in question is a clustered
> > fs (glusterfs) which is very slow on "find" operations. To add
> > complexity, there are more than 10^6 files in 2 levels: 16³ dirs with
> > roughly the same number of files in each.
>
> > My first idea was to "os.walk" the filesystem, find the oldest files
> > and remove them until I reach the threshold. But find proves to be too
> > slow.
>
> > My second thought was to run find -atime several times to remove the
> > oldest ones, and repeat the process with the most recent atime until
> > the threshold is reached. Again, this needs several walks through the fs.
>
> > Then I thought about tmpwatch, but, like find, it needs a date to start
> > removing from.
>
> > The ideal way is to keep a sorted list of files by atime, probably in a
> > cache, something like updatedb.
> > This list could also be built based only on the diratime of the first
> > level of dirs, seeking them in order and so on, but it still seems
> > expensive to get this first level of dirs sorted.
>
> > Any suggestions on how to do it efficiently?

> os.walk once.
>
> Build a list of all files in memory.

I was thinking of reusing updatedb, but it does not record atime.
Reimplementing it seems overkill just to remove a few files regularly.
Keeping such a list up to date cheaply would help a lot: old files would
already be in it, and the daily run (the one used to refresh the db) would
only add new ones, which, in this case, are not interesting.
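
Purely as a reference point, a minimal sketch of what such an atime cache
could look like, using the standard-library shelve module to map each path
to its last-seen atime; the cache location and mount point below are
placeholders, not anything from this thread:

    import os
    import shelve

    CACHE_PATH = "/var/tmp/atime_cache"   # placeholder location for the cache
    ROOT = "/mnt/gluster"                 # placeholder mount point

    cache = shelve.open(CACHE_PATH)
    try:
        # Record the current atime of every file seen during one walk;
        # a removal pass can later sort these values without re-walking.
        for dirpath, dirnames, filenames in os.walk(ROOT):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    cache[path] = os.stat(path).st_atime
                except OSError:
                    cache.pop(path, None)  # file vanished between listing and stat()
    finally:
        cache.close()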

> Sort them by whatever time you prefer - you can get times from os.stat.
>
> Then figure out how many you need to delete from one end of your list,
> and delete them.
>
> If the filesystem is especially slow (or the directories especially
> large), you might cluster the files to delete into groups by the
> directories they're contained in, and cd to those directories prior to
> removing them.
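
For reference, a minimal sketch of that suggestion (walk once, sort
everything, delete from the old end), assuming the goal is to free a fixed
number of bytes; the mount point and threshold are placeholders:

    import os

    ROOT = "/mnt/gluster"             # placeholder mount point
    BYTES_TO_FREE = 10 * 1024 ** 3    # placeholder threshold: free 10 GiB

    # Walk once, collecting (atime, size, path) for every file.
    entries = []
    for dirpath, dirnames, filenames in os.walk(ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue              # file disappeared mid-walk
            entries.append((st.st_atime, st.st_size, path))

    # Oldest atime first, then delete from that end until enough is freed.
    entries.sort()
    freed = 0
    for atime, size, path in entries:
        if freed >= BYTES_TO_FREE:
            break
        try:
            os.remove(path)
            freed += size
        except OSError:
            pass                      # ignore races with other writers

Sorting ~10^6 tuples in memory should be cheap next to the stat() calls
over a clustered fs; the walk itself is the expensive part.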

There are 4096 dirs with roughly the same number of files in each. I'd
probably play a trick with diratime: sort the directories by atime, then
search inside them in order and remove files until the threshold is
reached. Sorting all the files seems too expensive; in the end this will
run often and should only need to remove a few tens or hundreds of files
each time.
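
A rough sketch of that diratime idea, assuming directory atimes are a
usable proxy for the age of the files inside them; the paths and the
per-run limit are placeholders:

    import os

    ROOT = "/mnt/gluster"      # placeholder mount point
    FILES_TO_REMOVE = 200      # placeholder: a few hundred removals per run

    # Sort only the 4096 first-level dirs by atime (cheap next to stat()ing
    # every file), then work through the least recently used dirs first.
    top_dirs = [os.path.join(ROOT, d) for d in os.listdir(ROOT)
                if os.path.isdir(os.path.join(ROOT, d))]
    top_dirs.sort(key=lambda d: os.stat(d).st_atime)

    removed = 0
    for top in top_dirs:
        if removed >= FILES_TO_REMOVE:
            break
        # Within one first-level dir, collect (atime, path) pairs, sort,
        # and remove the oldest files first.
        entries = []
        for dirpath, dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    entries.append((os.stat(path).st_atime, path))
                except OSError:
                    continue   # file vanished mid-walk
        entries.sort()
        for atime, path in entries:
            if removed >= FILES_TO_REMOVE:
                break
            try:
                os.remove(path)
            except OSError:
                continue       # skip files that disappeared; don't count them
            removed += 1

Note that this only helps if the filesystem is not mounted with
noatime/nodiratime, since those options prevent the atime updates the
trick relies on.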


