How to get the size of a file?

Bengt Richter bokr at oz.net
Sun Oct 17 02:29:36 EDT 2004


On Sun, 17 Oct 2004 03:13:46 GMT, User <1 at 2.3> wrote:

>Anyone have ideas which os command could be used to get the size of a
>file without actually opening it?  My intention is to write a script
>that identifies duplicate files with different names.  I have no
>trouble getting the names of all the files in the directory using the
>os.listdir() command, but that doesn't return the file size.  In order
>to be identical, files must be the same size, so I want to use file
>size as the first criterion, then, if they are the same size, actually
>open them up and compare the contents.  
>
>I have written such a script in the past, but had to resort to
>something like:
>
>os.system('dir *.* >> trash.txt')
>
>The next step was then to open up 'trash.txt', and piece together the
>information I need to compare file sizes.  The problems with this
>approach are that it is very platform dependent (worked on WIN 95, but
>don't know what else it will work on) and 8.3 filename limitations
>that apply within this environment.  That is the reason I'm looking
>for some other command to obtain file size before the files are ever
>opened.
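
The short answer to the question as asked: os.path.getsize(path) returns
the size in bytes from the file's metadata without opening the file, and
os.stat(path).st_size gives the same number. A minimal sketch of collecting
sizes for one directory (the helper name sizes_in_dir is made up for
illustration, it is not a library function):

```python
import os

def sizes_in_dir(thedir):
    """Map each regular file name in thedir to its size in bytes."""
    sizes = {}
    for name in os.listdir(thedir):
        path = os.path.join(thedir, name)   # listdir() returns bare names
        if os.path.isfile(path):
            # getsize() asks the OS for the size; the file is never opened.
            sizes[name] = os.path.getsize(path)
    return sizes
```

Subdirectories are skipped by the isfile() check, so only plain files end
up in the result.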

The script below should list duplicate files in the specified directory.
Hack to suit. It is not very tested, just what you see ;-)
------------------------------------------------
# get_dupes.py
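import os, hashlib
def get_dupes(thedir):
    # Group regular files by size; only same-size files can be identical.
    finfo = {}
    for f in os.listdir(thedir):
        # listdir() returns bare names, so join with thedir before touching the file
        path = os.path.join(thedir, f)
        if os.path.isfile(path):
            finfo.setdefault(os.path.getsize(path), []).append(f)

    result = []
    for size, flist in finfo.items():
        if len(flist)>1:
            # Within a size group, compare contents by md5 digest.
            dupes = {}
            for name in flist:
                with open(os.path.join(thedir, name), 'rb') as fobj:
                    digest = hashlib.md5(fobj.read()).hexdigest()
                dupes.setdefault(digest, []).append(name)
            for digest, names in dupes.items():
                if len(names)>1: result.append((size, digest, names))
    return result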

if __name__ == '__main__':
    import sys
    # Check the arguments up front so real errors are not reported as usage errors.
    if len(sys.argv) != 2:
        raise SystemExit('Usage: python get_dupes.py directory')
    dupes = get_dupes(sys.argv[1])
    if dupes:
        print('')
        print('%8s %32s %s' % ('size','md5 digest','files with the given size, digest'))
        print('%8s %32s %s' % ('----','-'*32      ,'---------------------------------'))
        for duped in dupes:
            print('%8s %32s %s' % duped)
    else:
        print('No duplicate files in %r' % sys.argv[1])
-------------------------------------------

(I was surprised at the amount of duplicated stuff ;-)

[23:23] C:\pywk\clp>python get_dupes.py .

    size                       md5 digest files with the given size, digest
    ---- -------------------------------- ---------------------------------
       0 d41d8cd98f00b204e9800998ecf8427e ['z3', 'zero_len.py']
     111 ea70a0f814917ef8861bebc085e5e7d0 ['MyConsts.py', 'MyConsts.py~']
     163 f8e4add20e45bb253bd46963f25a7057 ['ramb.txt', 'rambxx.txt']
    4096 d96633a4b58522ce5787ef80a18e9c7b ['yyy2', 'yyy3']
     786 05956208d5185259b47362afcf1812fd ['startmore.py', 'startmore.py~']
     851 3845f161fa93cbb9119c16fc43e7b62a ['quadratic.py', 'quadratic.py~']
    1536 72f5c05b7ea8dd6059bf59f50b22df33 ['virtest.txt', '~DF30EC.tmp']
    1028 fbedc511f9556a8a1dc2ecfa3d859621 ['PaulMoore.py', 'PaulMoore.py~']
    1515 568f9732866a9de698732616ae4f9c3b ['loopbreak.py', 'loopbreak.py~']
    1662 f54414637ed420fe61b78eeba59737b7 ['for_grodrigues.py', 'for_grodrigues.r1.py']
    1702 23fa57926e7fcf2487943acb10db7e2a ['bitfield.py', 'bitfield.py~', 'packbits.py']
    3765 e69bf6b018ba305cc3e190378f93e421 ['pythonHi.gif', 'showgif.gif']
    5874 bae87bbed53c1e6908bb5c37db9c4292 ['testyenc.py', 'testyenc.py~']
    3990 4a5096efaf136f901603a2e1be850eb3 ['pns.py', 'pns.r1.py']
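
One caveat if you hack on it: the script reads each candidate file into
memory whole in order to hash it, which gets uncomfortable with big files.
Below is a rough sketch of hashing in fixed-size chunks instead, plus a
byte-for-byte check with filecmp.cmp() for the "open them up and compare
the contents" step. The names chunked_md5 and same_contents and the 64 KB
chunk size are my own choices for illustration, nothing standard.

```python
import filecmp
import hashlib

def chunked_md5(path, chunk_size=64 * 1024):
    """Return the md5 hex digest of a file, reading chunk_size bytes at a time."""
    h = hashlib.md5()
    with open(path, 'rb') as fobj:
        while True:
            chunk = fobj.read(chunk_size)
            if not chunk:          # empty read means end of file
                break
            h.update(chunk)
    return h.hexdigest()

def same_contents(path_a, path_b):
    """Byte-for-byte comparison; shallow=False forces the contents to be read."""
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The digest is only a filter. If you want certainty that two files with
matching size and digest really are identical, same_contents() on the pair
settles it.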

Regards,
Bengt Richter


