program to generate data helpful in finding duplicate large files
bizcor at gmail.com
Thu Sep 18 18:58:13 EDT 2014
thanks for the responses. i'm having quite a good time learning python.
On Thu, Sep 18, 2014 at 11:45 AM, Chris Kaynor <ckaynor at zindagigames.com>
wrote:
>
> Additionally, you may want to specify binary mode by using open(file_path,
> 'rb') to ensure platform-independence ('r' uses Universal newlines, which
> means on Windows, Python will convert "\r\n" to "\n" while reading the
> file). Additionally, some platforms will treat binary files differently.
>
would it be good to use 'rb' all the time?
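For a checksum tool the difference is easy to see in a small experiment. The following is a minimal sketch, assuming Python 3 (where text mode translates "\r\n" to "\n" on every platform by default); the sample bytes and variable names are made up for illustration and are not part of dupscan.

```python
import hashlib
import os
import tempfile

# Bytes as they might sit on disk in a file with Windows line endings.
raw = b"line one\r\nline two\r\n"

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(raw)

    # Binary mode: the bytes come back exactly as stored.
    with open(path, 'rb') as f:
        as_bytes = f.read()

    # Text mode with default newline handling: "\r\n" is read as "\n".
    with open(path, 'r', encoding='ascii') as f:
        as_text = f.read()
finally:
    os.remove(path)

md5_binary = hashlib.md5(as_bytes).hexdigest()
md5_text = hashlib.md5(as_text.encode('ascii')).hexdigest()

# The digests differ because the text-mode read dropped the "\r" bytes.
print(md5_binary == md5_text)
```

Since a duplicate finder compares file contents byte for byte, 'rb' looks like the safer choice whenever the data is going to be hashed rather than read as text.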
On Thu, Sep 18, 2014 at 11:48 AM, Chris Angelico <rosuav at gmail.com> wrote:
> On Fri, Sep 19, 2014 at 4:11 AM, David Alban <extasia at extasia.org> wrote:
> > exit( 0 )
>
> Unnecessary - if you omit this, you'll exit 0 implicitly at the end of
> the script.
>
aha. i've been doing this for years even with perl, and apparently it's
not necessary in perl either. i was influenced by shell.
this shell code:
    if [[ -n $report_mode ]] ; then
      do_report
    fi
    exit 0
is an example of why you want the last normally executed shell statement to
be "exit 0". in shell the last command executed determines the return code
to the os, so a script that ends on a failing test appears to have
terminated with an error. (strictly, an if block whose condition is false
returns zero; it is the short form, [[ -n $report_mode ]] && do_report, that
ends the script with a non-zero return code when $report_mode is not set.)
ok, i don't need to do this in python.
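A quick way to check this is to run two tiny scripts in a child interpreter and look at their exit statuses. This is only a sketch, assuming Python 3; the helper name exit_status is made up for the example.

```python
import subprocess
import sys

def exit_status(source):
    """Run a snippet in a fresh interpreter and return its exit status."""
    return subprocess.call([sys.executable, '-c', source])

# A script that simply runs off the end, with no exit call at all.
implicit = exit_status("x = 1")

# A script that asks for a specific non-zero status.
explicit = exit_status("import sys; sys.exit(3)")

print(implicit, explicit)
```

The first status is 0 even though the snippet never calls exit; a non-zero status only appears when the script asks for one or dies with an uncaught exception.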
On Thu, Sep 18, 2014 at 1:23 PM, Peter Otten <__peter__ at web.de> wrote:
>
> file_path may contain newlines, therefore you should probably use "\0" to
> separate the records.
i chose to stick with ascii nul as the default field separator, but i added
a --field-separator option in case someone wants human readable output.
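One consequence of a configurable separator is that a human-readable choice such as ',' can also occur inside a file name. A reader of the output can cope with that because the path is the last field. The sketch below assumes Python 3 and the seven-field order produced by file_info (host, md5sum, dev, ino, nlink, size, path); parse_record and FIELD_NAMES are hypothetical names, not part of dupscan. It does not address newlines in paths, which still need a record delimiter that cannot appear in a file name.

```python
FIELD_NAMES = ('host', 'md5sum', 'dev', 'ino', 'nlink', 'size', 'path')

def parse_record(record, field_separator='\0'):
    """Split one output record into a dict keyed by field name."""
    # maxsplit keeps everything after the sixth separator as the path,
    # so a path that happens to contain the separator is not cut apart.
    fields = record.split(field_separator, len(FIELD_NAMES) - 1)
    if len(fields) != len(FIELD_NAMES):
        raise ValueError('expected %d fields, got %d'
                         % (len(FIELD_NAMES), len(fields)))
    return dict(zip(FIELD_NAMES, fields))

# A made-up record using ',' as the separator and a path containing ','.
record = ','.join(['myhost', 'd41d8cd98f00b204e9800998ecf8427e',
                   '2049', '1234', '1', '0', './a,b.txt'])
parsed = parse_record(record, ',')
print(parsed['path'])
```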
style question: if there is only one, possibly short statement in a block,
do folks usually move it up to the line starting the block?
    if not S_ISREG( mode ) or S_ISLNK( mode ):
        return
vs.
    if not S_ISREG( mode ) or S_ISLNK( mode ): return
or even:
    with open( file_path, 'rb' ) as f: md5sum = md5_for_file( file_path )
fyi, here are my changes:
usage: dupscan [-h] [--start-directory START_DIRECTORY]
               [--field-separator FIELD_SEPARATOR]

scan files in a tree and print a line of information about each regular file

optional arguments:
  -h, --help            show this help message and exit
  --start-directory START_DIRECTORY, -d START_DIRECTORY
                        Specify the root of the filesystem tree to be
                        processed. The default is '.'
  --field-separator FIELD_SEPARATOR, -s FIELD_SEPARATOR
                        Specify the string to use as a field separator in
                        output. The default is the ascii nul character.

#!/usr/bin/python

import argparse
import hashlib
import os
from platform import node
from stat import S_ISREG, S_ISLNK

ASCII_NUL = chr(0)

# from:
# http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
# except that i use hexdigest() rather than digest()
def md5_for_file( path, block_size=2**20 ):
    md5 = hashlib.md5()
    with open( path, 'rb' ) as f:
        # read in chunks so large files are never held in memory whole
        while True:
            data = f.read(block_size)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()

# returns one output record, or None if the path is not a regular file.
# relies on the global thishost, which is set in the main block below.
def file_info( directory, basename, field_separator=ASCII_NUL ):
    file_path = os.path.join( directory, basename )
    st = os.lstat( file_path )
    mode = st.st_mode
    if not S_ISREG( mode ) or S_ISLNK( mode ):
        return
    # md5_for_file opens the file itself, so no open() is needed here
    md5sum = md5_for_file( file_path )
    return field_separator.join( [ thishost, md5sum, str( st.st_dev ),
                                   str( st.st_ino ), str( st.st_nlink ),
                                   str( st.st_size ), file_path ] )

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description='scan files in a tree and print a line of information '
                    'about each regular file')
    parser.add_argument('--start-directory', '-d', default='.',
        help='''Specify the root of the filesystem tree to be processed. The
        default is '.' ''')
    parser.add_argument('--field-separator', '-s', default=ASCII_NUL,
        help='Specify the string to use as a field separator in output. The '
             'default is the ascii nul character.')
    args = parser.parse_args()

    # strip trailing slashes, but keep '/' itself usable as a start directory
    start_directory = args.start_directory.rstrip('/') or '/'
    field_separator = args.field_separator

    thishost = node()
    if thishost == '':
        thishost = '[UNKNOWN]'

    for directory_path, directory_names, file_names in os.walk( start_directory ):
        for file_name in file_names:
            info = file_info( directory_path, file_name, field_separator )
            # skip symlinks, sockets, etc. rather than printing "None"
            if info is not None:
                print( info )
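
As a rough idea of how the output might be consumed, here is a minimal sketch (assuming Python 3) that groups records by size and md5sum and treats hard links to the same inode as a single file. The function name find_duplicates and the sample records are hypothetical; the field order is the one file_info produces.

```python
from collections import defaultdict

def find_duplicates(records, field_separator='\0'):
    """Map (size, md5sum) to the sorted paths of distinct files sharing it."""
    groups = defaultdict(dict)
    for record in records:
        host, md5sum, dev, ino, nlink, size, path = \
            record.split(field_separator, 6)
        # Same host, device and inode means the same file under another
        # name (a hard link), so only the first path seen is kept.
        groups[(int(size), md5sum)].setdefault((host, dev, ino), path)
    return dict(
        (key, sorted(inodes.values()))
        for key, inodes in groups.items()
        if len(inodes) > 1
    )

# Made-up records: ./x and ./y share content; ./z and ./z-link share an inode.
sample = [
    '\0'.join(['h', 'aaa', '1', '10', '1', '5', './x']),
    '\0'.join(['h', 'aaa', '1', '11', '1', '5', './y']),
    '\0'.join(['h', 'bbb', '1', '12', '2', '7', './z']),
    '\0'.join(['h', 'bbb', '1', '12', '2', '7', './z-link']),
]
duplicates = find_duplicates(sample)
print(duplicates)
```

Only ./x and ./y are reported here; the two names for inode 12 are one file on disk, so removing either would free no space.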
--
Live in a world of your own, but always welcome visitors.