program to generate data helpful in finding duplicate large files

Thu Sep 18 18:58:13 EDT 2014

thanks for the responses.   i'm having quite a good time learning python.

On Thu, Sep 18, 2014 at 11:45 AM, Chris Kaynor <ckaynor at zindagigames.com>
wrote:
>
> Additionally, you may want to specify binary mode by using open(file_path,
> 'rb') to ensure platform-independence ('r' uses Universal newlines, which
> means on Windows, Python will convert "\r\n" to "\n" while reading the
> file). Additionally, some platforms will treat binary files differently.
>

would it be good to use 'rb' all the time?
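
a small sketch of what the newline translation does to a checksum, assuming python 3 (where text mode applies universal newlines on every platform).  the helper name and sample bytes are made up for illustration:

```python
import hashlib
import os
import tempfile

def read_both_ways(raw_bytes):
    """Write raw_bytes to a temp file, then read it back in text and binary mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(raw_bytes)
        with open(path, 'r', encoding='ascii') as f:  # text mode: "\r\n" becomes "\n"
            as_text = f.read()
        with open(path, 'rb') as f:                   # binary mode: bytes come back untouched
            as_bytes = f.read()
    finally:
        os.remove(path)
    return as_text, as_bytes

as_text, as_bytes = read_both_ways(b"line one\r\nline two\r\n")
text_md5 = hashlib.md5(as_text.encode('ascii')).hexdigest()
bytes_md5 = hashlib.md5(as_bytes).hexdigest()
print(text_md5 == bytes_md5)  # the two digests differ
```

so for checksumming, 'rb' looks like the safe choice: the digest is computed over exactly the bytes on disk.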

On Thu, Sep 18, 2014 at 11:48 AM, Chris Angelico <rosuav at gmail.com> wrote:

> On Fri, Sep 19, 2014 at 4:11 AM, David Alban <extasia at extasia.org> wrote:
> > exit( 0 )
>
> Unnecessary - if you omit this, you'll exit 0 implicitly at the end of
> the script.
>

aha.  i've been doing this for years even with perl, and apparently it's
not necessary in perl either.  i was influenced by shell.

this shell code:

     if [[ -n $report_mode ]] ; then
        do_report
     fi

     exit 0

is an example of why i want the last normally executed shell statement to
be "exit 0".  in shell the last command executed determines the return code
to the os.  strictly, an if block whose condition is false returns zero, so
this example is safe even without the exit statement; but with the common
short form "[[ -n $report_mode ]] && do_report" as the last line, and
$report_mode not set, your shell program will give a non-zero return code
and appear to have terminated with an error.

ok, i don't need to do this in python.
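
a quick way to check this, as a sketch: run a snippet in a child interpreter and look at its exit status.  the helper name exit_status is made up for illustration:

```python
import subprocess
import sys

def exit_status(source):
    """Run a snippet of python in a child interpreter and return its exit status."""
    return subprocess.call([sys.executable, '-c', source])

fell_off_the_end = exit_status("x = 1")                    # no exit call at all
explicit_failure = exit_status("import sys; sys.exit(3)")  # explicit non-zero exit
print(fell_off_the_end, explicit_failure)
```

the first snippet simply runs off the end and reports 0; only an explicit sys.exit() with a non-zero value (or an uncaught exception) changes that.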

On Thu, Sep 18, 2014 at 1:23 PM, Peter Otten <__peter__ at web.de> wrote:
>
> file_path may contain newlines, therefore you should probably use "\0" to
> separate the records.

i chose to stick with ascii nul as the default field separator, but i added
a --field-separator option in case someone wants human readable output.
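
a quick sketch of why the separator matters, with made-up field values (the host name, device and inode numbers below are not from a real run):

```python
NUL = "\0"

# hypothetical record: host, md5, device, inode, link count, size, path
fields = ["host1", "d41d8cd98f00b204e9800998ecf8427e",
          "2049", "12345", "1", "0", "some dir/odd\nname.txt"]

record = NUL.join(fields)
round_trip = record.split(NUL)   # nul cannot occur in a posix path, so this is unambiguous

# the newline inside the path still breaks anything that reads line by line
lines = (record + "\n").splitlines()
print(round_trip == fields, len(lines))
```

so with nul between fields the fields split cleanly, but while records are still terminated by a newline, a path containing a newline spans two lines.  terminating the records with nul as well would avoid that.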

style question:  if there is only one, possibly short statement in a block,
do folks usually move it up to the line starting the block?

    if not S_ISREG( mode ) or S_ISLNK( mode ):
        return

vs.

    if not S_ISREG( mode ) or S_ISLNK( mode ): return

or even:

    with open( file_path, 'rb' ) as f: md5sum = md5_for_file( file_path )

fyi, here are my changes:

usage: dupscan [-h] [--start-directory START_DIRECTORY]
               [--field-separator FIELD_SEPARATOR]

scan files in a tree and print a line of information about each regular file

optional arguments:
  -h, --help            show this help message and exit
  --start-directory START_DIRECTORY, -d START_DIRECTORY
                        Specify the root of the filesystem tree to be
                        processed. The default is '.'
  --field-separator FIELD_SEPARATOR, -s FIELD_SEPARATOR
                        Specify the string to use as a field separator in
                        output. The default is the ascii nul character.

#!/usr/bin/python

import argparse
import hashlib
import os

from platform import node
from stat import S_ISREG, S_ISLNK

ASCII_NUL = chr(0)

# from: http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
# except that i use hexdigest() rather than digest()
def md5_for_file( path, block_size=2**20 ):
  md5 = hashlib.md5()
  with open( path, 'rb' ) as f:
    while True:
      data = f.read(block_size)
      if not data:
        break
      md5.update(data)
  return md5.hexdigest()

# note: relies on the global thishost, which is set in the main block below
def file_info( directory, basename, field_separator=ASCII_NUL ):
  file_path = os.path.join( directory, basename )
  st = os.lstat( file_path )

  mode = st.st_mode
  if not S_ISREG( mode ) or S_ISLNK( mode ):
    return

  # md5_for_file() opens the file itself, so no open() is needed here
  md5sum = md5_for_file( file_path )

  return field_separator.join( [ thishost, md5sum, str( st.st_dev ),
    str( st.st_ino ), str( st.st_nlink ), str( st.st_size ), file_path ] )

if __name__ == "__main__":
  parser = argparse.ArgumentParser(
    description='scan files in a tree and print a line of information '
                'about each regular file')
  parser.add_argument('--start-directory', '-d', default='.',
    help="Specify the root of the filesystem tree to be processed.  "
         "The default is '.'")
  parser.add_argument('--field-separator', '-s', default=ASCII_NUL,
    help='Specify the string to use as a field separator in output.  '
         'The default is the ascii nul character.')
  args = parser.parse_args()

  # rstrip('/') turns '/' into '', which os.walk() would silently skip
  start_directory = args.start_directory.rstrip('/') or '/'
  field_separator = args.field_separator

  thishost = node()
  if thishost == '':
    thishost='[UNKNOWN]'

  for directory_path, directory_names, file_names in os.walk( start_directory ):
    for file_name in file_names:
      info = file_info( directory_path, file_name, field_separator )
      if info is not None:  # file_info() returns None for non-regular files
        print( info )
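
one way the output could be consumed, as a rough sketch rather than part of dupscan: group the records by md5sum and size, and keep only the groups that span more than one (host, device, inode) triple, so that hard links to the same file are not reported as duplicates.  the function name find_duplicates and the sample records are made up for illustration:

```python
from collections import defaultdict

def find_duplicates(records, field_separator="\0"):
    """Group dupscan-style records by (md5sum, size).

    Returns {(md5sum, size): sorted paths} for groups whose members live on
    more than one (host, device, inode) triple.
    """
    groups = defaultdict(list)
    for record in records:
        # the path is the last field, so cap the split in case it contains the separator
        host, md5sum, dev, ino, nlink, size, path = record.split(field_separator, 6)
        groups[(md5sum, size)].append(((host, dev, ino), path))

    duplicates = {}
    for key, members in groups.items():
        if len({file_id for file_id, _ in members}) > 1:
            duplicates[key] = sorted(path for _, path in members)
    return duplicates

# hypothetical records: two copies of one file, plus two hard links to another
sample = [
    "\0".join(["host1", "aaaa", "2049", "11", "1", "100", "./a/big.iso"]),
    "\0".join(["host1", "aaaa", "2049", "22", "1", "100", "./b/copy.iso"]),
    "\0".join(["host1", "bbbb", "2049", "33", "2", "7", "./c/link1"]),
    "\0".join(["host1", "bbbb", "2049", "33", "2", "7", "./c/link2"]),
]
dups = find_duplicates(sample)
print(dups)
```

here only the two .iso files are reported; the two hard links share an inode, so they count as one file.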

-- 
Live in a world of your own, but always welcome visitors.