Processing huge datasets

William Park opengeometry at yahoo.ca
Wed May 12 00:13:58 EDT 2004


Anders S. Jensen <doozer at freakout.dk> wrote:
> I'm trying to produce a system that will analyze a filesystem for
> 'dead leaves', large data consumption, per-UID volume consumption,
> trend analysis, etc.
> 
> To do that, I first save each directory along with the sum of the C, M
> and A timestamps for the files in the directory (all in epoch secs.),
> the number of files, the volume of files, a list of UIDs, a list of
> GIDs, a dict of volume per UID, a dict of volume per GID, and the
> newest file.
> 
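For concreteness, that per-directory summarizing step might look
something like the sketch below (my names, not yours; the use of
os.lstat() and plain dicts is an assumption):

    import os
    import stat

    def summarize_dir(path):
        """One directory's figures: timestamp sums (epoch secs), file
        count, total volume, per-UID/per-GID volume, newest mtime."""
        n_files = volume = 0
        c_sum = m_sum = a_sum = newest = 0
        vol_by_uid, vol_by_gid = {}, {}
        for name in os.listdir(path):
            st = os.lstat(os.path.join(path, name))
            if not stat.S_ISREG(st.st_mode):   # files only, skip subdirs etc.
                continue
            n_files += 1
            volume += st.st_size
            c_sum += st.st_ctime
            m_sum += st.st_mtime
            a_sum += st.st_atime
            vol_by_uid[st.st_uid] = vol_by_uid.get(st.st_uid, 0) + st.st_size
            vol_by_gid[st.st_gid] = vol_by_gid.get(st.st_gid, 0) + st.st_size
            newest = max(newest, st.st_mtime)
        return dict(n_files=n_files, volume=volume, c_sum=c_sum,
                    m_sum=m_sum, a_sum=a_sum, vol_by_uid=vol_by_uid,
                    vol_by_gid=vol_by_gid, newest=newest)
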
> Then I build a hierarchy of instances, each of which knows its parent,
> children and siblings.  Each object is populated with the summarized
> file information.  When that is done, I traverse the hierarchy from
> the bottom up, accumulating average C, M and A times, volumes and
> numbers of files.
> 
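A rough sketch of such a hierarchy and the bottom-up pass, keyed to the
summarize_dir() fields above (the Node layout and function names are
illustrative, not your code):

    class Node:
        """Left-child/right-sibling (LCRS) node carrying one
        directory's summary."""
        def __init__(self, path, summary, parent=None):
            self.path = path
            self.summary = summary      # figures from summarize_dir()
            self.parent = parent
            self.child = None           # leftmost child
            self.sibling = None         # next sibling to the right
            self.total_files = 0        # subtree totals, set by accumulate()
            self.total_volume = 0
            self.total_m_sum = 0

    def accumulate(node):
        """Post-order pass: fold each child's subtree totals into its
        parent."""
        node.total_files = node.summary['n_files']
        node.total_volume = node.summary['volume']
        node.total_m_sum = node.summary['m_sum']
        child = node.child
        while child is not None:
            accumulate(child)
            node.total_files += child.total_files
            node.total_volume += child.total_volume
            node.total_m_sum += child.total_m_sum
            child = child.sibling

    def avg_mtime(node):
        """Average modification time (epoch secs) over the subtree."""
        return node.total_m_sum / node.total_files if node.total_files else 0.0
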
> This hierarchy allows me to instantly query, say, the average
> modification time for any given point in the directory structure and
> below.  That'll show where files that haven't been modified in a long
> time hide, and how much space they take, amongst other things.
> 
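Given those subtree totals, the "where do old files hide" query is a
linear walk over the tree; for example (the age_days cutoff is my
invention):

    import time

    def stale_subtrees(root, age_days=365):
        """(path, volume) for every subtree whose average mtime is
        older than the cutoff."""
        cutoff = time.time() - age_days * 86400
        hits = []
        def walk(node):
            if node.total_files and avg_mtime(node) < cutoff:
                hits.append((node.path, node.total_volume))
            child = node.child
            while child is not None:
                walk(child)
                child = child.sibling
        walk(root)
        return hits
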
> The LCRS (left-child, right-sibling) tree lends itself very well to
> recursion in terms of beauty and elegance of the code.  However,
> keeping that amount of data in memory obviously just doesn't fly.
> 
> I'm at the point where the system actually just *barely* might work.
> I'll know tomorrow, when it's done.  But the system might be used on
> much larger filesystems, and then the party is over.
> 
> I'm looking for the best-suited way of attacking the problem.  Is it a
> memory-mapped file?  A Berkeley DB?  A specially crafted metadata
> filesystem?  (That would be fun, but probably overkill... :)  Or
> something completely different.

How about a good old text file in the filesystem, i.e.
    /some/dir/.timestamp
?  This way you don't use up memory, you get a hierarchical data
structure automatically, there's no need to mess with a database
interface, you get a simple search mechanism, etc.  And, most of all,
no need to ask other people who have absolutely no idea what you're
talking about... :-)
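
A minimal sketch of that idea, assuming one hidden file per directory
and a JSON encoding (both choices are mine; any flat text format
would do):

    import json
    import os

    SUMMARY_NAME = '.summary'   # hypothetical filename, like .timestamp above

    def save_summary(path, summary):
        """Drop the directory's figures beside its files as plain text."""
        with open(os.path.join(path, SUMMARY_NAME), 'w') as f:
            json.dump(summary, f)

    def load_summary(path):
        """Read one directory's figures back.  Note that JSON turns
        the integer UID/GID dict keys into strings."""
        with open(os.path.join(path, SUMMARY_NAME)) as f:
            return json.load(f)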

> 
> The processing might be switched to iterative, but that's a 'minor'
> concern.  The main problem is how to handle the data in the fastest
> possible way.
> 
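For what it's worth, the switch to iterative is mechanical: a reversed
pre-order walk visits every child before its parent, so the same
roll-up runs on an explicit stack and deep trees can't hit Python's
recursion limit.  A sketch against the Node layout above:

    def accumulate_iterative(root):
        """Same roll-up as accumulate(), without recursion."""
        order = []                      # pre-order: parent before children
        stack = [root]
        while stack:
            node = stack.pop()
            order.append(node)
            child = node.child
            while child is not None:
                stack.append(child)
                child = child.sibling
        for node in reversed(order):    # children now come before parents
            node.total_files = node.summary['n_files']
            node.total_volume = node.summary['volume']
            node.total_m_sum = node.summary['m_sum']
            child = node.child
            while child is not None:
                node.total_files += child.total_files
                node.total_volume += child.total_volume
                node.total_m_sum += child.total_m_sum
                child = child.sibling
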
> Thanks for your pointers!
> 
> Cheers, Anders

-- 
William Park, Open Geometry Consulting, <opengeometry at yahoo.ca>
Linux solution/training/migration, Thin-client
