how to remove oldest files up to a limit efficiently

Dan Stromberg dstromberglists at gmail.com
Wed Jul 9 02:46:09 EDT 2008


On Tue, 08 Jul 2008 15:18:23 -0700, linuxnow at gmail.com wrote:

> I need to maintain a filesystem where I'll keep only the most recently
> used (MRU) files; the least recently used (LRU) ones have to be removed to
> leave space for newer ones. The filesystem in question is a clustered fs
> (glusterfs) which is very slow on "find" operations. To add complexity,
> there are more than 10^6 files in 2 levels: 16³ dirs with an equally
> distributed number of files inside.
> 
> My first idea was to "os.walk" the filesystem, find the oldest files and
> remove them until I reach the threshold. But the walk proves to be too slow.
> 
> My second thought was to run find -atime several times to remove the
> oldest ones, and repeat the process with the most recent atime until the
> threshold is reached. Again, this needs several walks through the fs.
> 
> Then I thought about tmpwatch, but, like find, it needs a date to start
> removing from.
> 
> The ideal way is to keep a sorted list of files by atime, probably in a
> cache, something like updatedb.
> This list could also be built based only on the diratime of the first
> level of dirs, seeking them in order and so on, but it still seems
> expensive to get this first level of dirs sorted.
> 
> Any suggestions of how to do it effectively?

os.walk once.

Build a list of all files in memory.

Sort them by whatever time you prefer - you can get times from os.stat.

Then figure out how many you need to delete from one end of your list, 
and delete them.
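
Untested, but a minimal sketch of that single-walk approach might look like
this (prune_oldest and bytes_to_free are just illustrative names, and it
assumes you want to free a given number of bytes):

import os

def prune_oldest(top, bytes_to_free):
    """Walk the tree once, collect (atime, size, path) for every file,
    then remove the least recently used files until enough space is freed."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue          # file vanished or is unreadable; skip it
            entries.append((st.st_atime, st.st_size, path))

    entries.sort()                # oldest access time first

    freed = 0
    for atime, size, path in entries:
        if freed >= bytes_to_free:
            break
        try:
            os.remove(path)
            freed += size
        except OSError:
            pass                  # best effort: ignore files we cannot remove

Sorting 10^6 tuples in memory is cheap compared to the stat calls, so the
one walk should dominate the runtime.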

If the filesystem is especially slow (or the directories especially 
large), you might cluster the files to delete into groups by the 
directories they're contained in, and cd to those directories prior to 
removing them.
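
If the per-remove path lookups turn out to be the bottleneck, the grouping
could look something like this (again untested, the helper name is made up,
and it assumes the paths are absolute):

import os
from collections import defaultdict

def remove_grouped_by_dir(paths):
    """Group the doomed files by parent directory and chdir into each
    directory before unlinking, so every remove uses a short relative name."""
    by_dir = defaultdict(list)
    for path in paths:
        dirname, basename = os.path.split(path)
        by_dir[dirname].append(basename)

    saved_cwd = os.getcwd()
    try:
        for dirname, basenames in by_dir.items():
            os.chdir(dirname)
            for basename in basenames:
                try:
                    os.remove(basename)
                except OSError:
                    pass          # ignore files that have already gone away
    finally:
        os.chdir(saved_cwd)       # always restore the original directory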
