program to generate data helpful in finding duplicate large files

David Alban extasia at extasia.org
Thu Sep 18 14:11:11 EDT 2014


greetings,

i'm a long-time perl programmer who is learning python.  i'd be interested
in any comments you might have on my code below.  feel free to respond
privately if you prefer.  i'd like to know if i'm on the right track.  the
program works, and does what i want it to do.  is there a different way a
seasoned python programmer would have done things?  i would like to learn
the culture as well as the language.  am i missing anything?  i know i'm
not doing error checking below.  i suppose comments would help, too.

i wanted a program to scan a tree and for each regular file, print a line
of text to stdout with information about the file.  this will be data for
another program i want to write which finds sets of duplicate files larger
than a given size.  that is, using output from this program, the sets
of files i want to find are on the same filesystem on the same host
(obviously, but i include hostname in the data to be sure), and must have
the same md5 sum, but different inode numbers (paths sharing an inode are
hard links to one file, not duplicate copies).
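
to make the matching rule concrete, here is a minimal sketch of how that second program might group records.  it is only an illustration, not something i have written yet.  it assumes one record per line with seven nul-separated fields in the order host, md5, dev, inode, nlink, size, path (the order the program below prints them), and that no path contains a newline.  the names parse_record, find_duplicate_sets and min_size are made up for this example.

```python
from collections import defaultdict

FIELD_COUNT = 7  # host, md5, dev, ino, nlink, size, path (assumed order)

def parse_record(line):
    # split on ascii nul; a path cannot contain nul, so this is unambiguous
    parts = line.rstrip("\n").split("\0", FIELD_COUNT - 1)
    host, md5sum, dev, ino, nlink, size, path = parts
    return host, md5sum, int(dev), int(ino), int(nlink), int(size), path

def find_duplicate_sets(lines, min_size=0):
    # (host, dev, md5) -> inode -> list of paths (hard links share an inode)
    groups = defaultdict(lambda: defaultdict(list))
    for line in lines:
        host, md5sum, dev, ino, nlink, size, path = parse_record(line)
        if size <= min_size:
            continue  # only files larger than the threshold are of interest
        groups[(host, dev, md5sum)][ino].append(path)
    # a real duplicate set needs at least two distinct inodes
    return {key: dict(by_ino)
            for key, by_ino in groups.items() if len(by_ino) > 1}
```

grouping by inode inside each (host, device, md5) key means two hard links to one file never count as duplicates of each other, while still listing every path that would be affected if one copy were removed.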

the output of the code below is easier for a human to read when paged
through 'less', which on my mac renders the ascii nuls as "^@" in reverse
video.
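
where 'less' is not handy, something like the following could make a record readable.  this is just a sketch; show_record and the "^@" default are my own choices for illustration, not part of the program.

```python
def show_record(line, visible="^@"):
    # replace each ascii nul with a printable stand-in, as 'less' does
    return line.replace("\0", visible)
```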

thanks,
david


usage: dupscan [-h] [--start-directory START_DIRECTORY]

scan files in a tree and print a line of information about each regular file

optional arguments:
  -h, --help            show this help message and exit
  --start-directory START_DIRECTORY, -d START_DIRECTORY
                        specifies the root of the filesystem tree to be
                        processed




#!/usr/bin/python

import argparse
import hashlib
import os
import re
import socket
import sys

from stat import S_ISREG

ascii_nul = chr(0)

# from: http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
# except that i use hexdigest() rather than digest()
def md5_for_file(f, block_size=2**20):
  md5 = hashlib.md5()
  while True:
    data = f.read(block_size)
    if not data:
      break
    md5.update(data)
  return md5.hexdigest()

thishost = socket.gethostname()

parser = argparse.ArgumentParser(description='scan files in a tree and print a line of information about each regular file')
parser.add_argument('--start-directory', '-d', default='.', help='specifies the root of the filesystem tree to be processed')
args = parser.parse_args()

# strip trailing slashes, but keep a bare "/" usable as the start directory
start_directory = re.sub( '/+$', '', args.start_directory ) or '/'

for directory_path, directory_names, file_names in os.walk( start_directory ):
  for file_name in file_names:
    file_path = os.path.join( directory_path, file_name )

    lstat_info = os.lstat( file_path )

    mode = lstat_info.st_mode

    # lstat() does not follow symlinks, so S_ISREG() alone excludes them
    if not S_ISREG( mode ):
      continue

    # binary mode so the digest covers the file's exact bytes;
    # 'with' closes the file once it has been hashed
    with open( file_path, 'rb' ) as f:
      md5sum = md5_for_file( f )

    dev   = lstat_info.st_dev
    ino   = lstat_info.st_ino
    nlink = lstat_info.st_nlink
    size  = lstat_info.st_size

    sep = ascii_nul

    fields = ( thishost, md5sum, dev, ino, nlink, size, file_path )
    print( sep.join( str( field ) for field in fields ) )

sys.exit( 0 )
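
as a sanity check on the block-wise hashing, the snippet below compares it with hashing the same bytes in one call.  it is a sketch under assumptions: io.BytesIO stands in for an open file, the payload is made-up test data, and md5_for_file is repeated (with the same logic) only so the snippet runs on its own.

```python
import hashlib
import io

def md5_for_file(f, block_size=2**20):
    # same read-and-update loop as in the program, repeated for a standalone run
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.hexdigest()

payload = b"duplicate me " * 1000  # made-up test data

# a tiny block size forces many update() calls
chunked = md5_for_file(io.BytesIO(payload), block_size=64)
one_shot = hashlib.md5(payload).hexdigest()
```

the two digests should be equal, since feeding data to update() in pieces is equivalent to hashing the concatenation; that is what lets the program hash large files without reading them into memory at once.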



-- 
Our decisions are the most important things in our lives.
***
Live in a world of your own, but always welcome visitors.


More information about the Python-list mailing list