How to get the size of a file?
Bengt Richter
bokr at oz.net
Sun Oct 17 02:29:36 EDT 2004
On Sun, 17 Oct 2004 03:13:46 GMT, User <1 at 2.3> wrote:
>Anyone have ideas which os command could be used to get the size of a
>file without actually opening it? My intention is to write a script
>that identifies duplicate files with different names. I have no
>trouble getting the names of all the files in the directory using the
>os.listdir() command, but that doesn't return the file size. In order
>to be identical, files must be the same size, so I want to use file
>size as the first criteria, then, if they are the same size, actually
>open them up and compare the contents.
>
>I have written such a script in the past, but had to resort to
>something like:
>
>os.system('dir *.* >> trash.txt')
>
>The next step was then to open up 'trash.txt', and piece together the
>information I need compare file sizes. The problems with this
>approach are that it is very platform dependent (worked on WIN 95, but
>don't know what else it will work on) and 8.3 filename limitations
>that apply within this environment. That is the reason I'm looking
>for some other command to obtain file size before the files are ever
>opened.
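This should list duplicate files in the specified directory. It uses
os.path.getsize() to get each size without opening the file, and only
reads the files that share a size with another file. Hack to suit.
Not very tested, just what you see ;-)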
------------------------------------------------
# get_dupes.py (Python 2)
import os, md5

def get_dupes(thedir):
    # Group plain files by size; only files of equal size can be duplicates.
    finfo = {}
    for f in os.listdir(thedir):
        path = os.path.join(thedir, f)  # listdir() returns bare names
        if os.path.isfile(path):
            finfo.setdefault(os.path.getsize(path), []).append(f)
    result = []
    for size, flist in finfo.items():
        if len(flist)>1:
            # Same size: now read the contents and group by md5 digest.
            dupes = {}
            for name in flist:
                data = open(os.path.join(thedir, name), 'rb').read()
                dupes.setdefault(md5.new(data).hexdigest(),[]).append(name)
            for digest, names in dupes.items():
                if len(names)>1: result.append((size, digest, names))
    return result

if __name__ == '__main__':
    import sys
    try:
        dupes = get_dupes(sys.argv[1])
        if dupes:
            print
            print '%8s %32s %s' % ('size','md5 digest','files with the given size, digest')
            print '%8s %32s %s' % ('----','-'*32 ,'---------------------------------')
            for duped in dupes:
                print '%8s %32s %s' % duped
        else:
            print 'No duplicate files in %r' % sys.argv[1]
    except IndexError:
        # No directory argument given; other errors propagate with a traceback.
        raise SystemExit, 'Usage: python get_dupes.py directory'
-------------------------------------------
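For the size question on its own, a minimal sketch follows. It assumes a current Python 3 interpreter (the script above is Python 2), and the helper name sizes_in_dir is made up for illustration. os.path.getsize() returns the same number as os.stat(path).st_size; both read filesystem metadata only and never open the file.

```python
import os

def sizes_in_dir(thedir):
    """Map each regular file name in thedir to its size in bytes.

    Hypothetical helper for illustration: only stat metadata is read,
    no file is opened.
    """
    sizes = {}
    for name in os.listdir(thedir):
        path = os.path.join(thedir, name)  # listdir() returns bare names
        if os.path.isfile(path):           # skip subdirectories and the like
            sizes[name] = os.path.getsize(path)  # same as os.stat(path).st_size
    return sizes
```

Calling sizes_in_dir('.') gives a dict such as {name: size_in_bytes, ...} for the current directory, which is portable and avoids parsing 'dir' output.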
(I was surprised at the amount of duplicated stuff ;-)
[23:23] C:\pywk\clp>python get_dupes.py .
    size                       md5 digest files with the given size, digest
    ---- -------------------------------- ---------------------------------
       0 d41d8cd98f00b204e9800998ecf8427e ['z3', 'zero_len.py']
     111 ea70a0f814917ef8861bebc085e5e7d0 ['MyConsts.py', 'MyConsts.py~']
     163 f8e4add20e45bb253bd46963f25a7057 ['ramb.txt', 'rambxx.txt']
    4096 d96633a4b58522ce5787ef80a18e9c7b ['yyy2', 'yyy3']
     786 05956208d5185259b47362afcf1812fd ['startmore.py', 'startmore.py~']
     851 3845f161fa93cbb9119c16fc43e7b62a ['quadratic.py', 'quadratic.py~']
    1536 72f5c05b7ea8dd6059bf59f50b22df33 ['virtest.txt', '~DF30EC.tmp']
    1028 fbedc511f9556a8a1dc2ecfa3d859621 ['PaulMoore.py', 'PaulMoore.py~']
    1515 568f9732866a9de698732616ae4f9c3b ['loopbreak.py', 'loopbreak.py~']
    1662 f54414637ed420fe61b78eeba59737b7 ['for_grodrigues.py', 'for_grodrigues.r1.py']
    1702 23fa57926e7fcf2487943acb10db7e2a ['bitfield.py', 'bitfield.py~', 'packbits.py']
    3765 e69bf6b018ba305cc3e190378f93e421 ['pythonHi.gif', 'showgif.gif']
    5874 bae87bbed53c1e6908bb5c37db9c4292 ['testyenc.py', 'testyenc.py~']
    3990 4a5096efaf136f901603a2e1be850eb3 ['pns.py', 'pns.r1.py']
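The same size-then-digest approach can be sketched for Python 3, where the md5 module is replaced by hashlib. This is an illustrative port under stated assumptions, not the script that produced the listing above: the names file_digest and find_dupes and the 64 KiB chunk size are made up here. Reading in chunks keeps memory use flat for large files.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    h = hashlib.md5()
    with open(path, 'rb') as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def find_dupes(thedir):
    """Return sorted (size, digest, names) tuples for duplicated contents."""
    by_size = defaultdict(list)
    for name in os.listdir(thedir):
        path = os.path.join(thedir, name)
        if os.path.isfile(path):
            by_size[os.path.getsize(path)].append(name)
    result = []
    for size, names in by_size.items():
        if len(names) < 2:
            continue  # a unique size cannot have a duplicate; never opened
        by_digest = defaultdict(list)
        for name in names:
            by_digest[file_digest(os.path.join(thedir, name))].append(name)
        for digest, same in by_digest.items():
            if len(same) > 1:
                result.append((size, digest, sorted(same)))
    return sorted(result)
```

Equal digests make identical content very likely; if certainty matters, a final byte-for-byte check with filecmp.cmp(a, b, shallow=False) from the standard library settles it.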
Regards,
Bengt Richter