program to generate data helpful in finding duplicate large files

Thu Sep 18 14:48:51 EDT 2014

On Fri, Sep 19, 2014 at 4:11 AM, David Alban <extasia at extasia.org> wrote:
> i'm a long time perl programmer who is learning python.  i'd be interested
> in any comments you might have on my code below.  feel free to respond
> privately if you prefer.  i'd like to know if i'm on the right track.

Sure! Happy to help out. But first, a comment about your English, as
shown above: It's conventional to capitalize in certain places, and it
does make your prose more readable. Just as PEP 8 does for Python
code, tidy spelling and linguistic conventions make it easier for
other "English programmers" (wordsmiths?) to follow what you're doing.

I've trimmed out any parts of your code that I don't have comments on.

> ascii_nul = chr(0)

You use this in exactly one place, to initialize sep. I'd either set
sep up here, or use ascii_nul down below (probably the former). But if
you're going to have a named constant, the Python convention is to
upper-case it: ASCII_NUL = chr(0).

> def md5_for_file(f, block_size=2**20):
>   md5 = hashlib.md5()
>   while True:
>     data = f.read(block_size)
>     if not data:
>       break
>     md5.update(data)
>   return md5.hexdigest()

This is used by opening a file and then passing the open file to this
function. Recommendation: Pass a file name, and have this function
open and close the file itself. Then it'll also be a candidate for the
obvious tidyup of using the file as a context manager - the 'with'
statement guarantees that the file will be promptly closed.

> thishost = socket.gethostname()

(Be aware that this won't always give you a useful value. You may want
to provide a default.)

> start_directory = re.sub( '/+$', '', args.start_directory )

Hello, Perl programmer :) When there's text to manipulate, you first
reach for a regular expression. What's this looking to do, exactly?
Trim off any trailing slashes? Python has a clearer way to write that:

start_directory = args.start_directory.rstrip("/")

Lots of text manipulation in Python is done with methods on the string
itself. Have an explore - there's all sorts of powerful stuff there.

> for directory_path, directory_names, file_names in os.walk( start_directory
> ):

Personally, I believe in Huffman coding my names. For locals that
exist for the sole purpose of loop iteration (particularly the third
one, which you use in one place), I'd use shorter names:

for path, dirs, files in os.walk(start_directory):
    for fn in files:

>     lstat_info = os.lstat( file_path )
>     dev   = lstat_info.st_dev
>     ino   = lstat_info.st_ino
>     nlink = lstat_info.st_nlink
>     size  = lstat_info.st_size
>
>     sep = ascii_nul
>
>     print "%s%c%s%c%d%c%d%c%d%c%d%c%s" % ( thishost, sep, md5sum, sep, dev,
> sep, ino, sep, nlink, sep, size, sep, file_path )

Python 3 turns print into a function, which is able to do this rather
more cleanly. You can call on the new behaviour in a Python 2 program
by putting this immediately under your shebang:

from __future__ import print_function

And then you can write the print call like this:

print(thishost, md5sum, dev, ino, nlink, size, file_path, sep=chr(0))

Or, eliminating all the assignment to locals:

    st = os.lstat( file_path )
    print(thishost, md5sum, st.st_dev, st.st_ino, st.st_nlink,
st.st_size, file_path, sep=chr(0))

That's shorter than your original line, *and* it doesn't have all the
assignments :)

> exit( 0 )

Unnecessary - if you omit this, you'll exit 0 implicitly at the end of
the script.

Standard rule of programming: There's always another way to do it.
Standard rule of asking for advice: There's always someone who will
advocate another way of doing it. It's up to you to decide which
advice is worth following, which is worth taking note of but not
actually following, and which is to be thrown out as rubbish :)

All the best!

ChrisA