[Tutor] improvements on a renaming script

Mon Mar 10 10:01:29 CET 2014

On Sun, Mar 09, 2014 at 03:22:35PM -0400, street.sweeper at mailworks.org wrote:
> - In the get_long_names() function, the for/if thing is reading
> the whole fileNames.tab file every time, isn't it?  In reality,
> the file was only a few dozen lines long, so I suppose it doesn't
> matter, but is there a better way to do this?
Yes, read the file once and build a dictionary mapping the first column to the
second column.

> - Really, I wanted to create a new sequence number at the end of
> each file name, but I thought this would be difficult.  In order
> for it to count from 01 to whatever the last file is per set p01,
> p02, etc, it would have to be aware of the set name and how many
> files are in it.  So I settled for getting the last 3 digits of
> the original file name using splitext().  The strings were unique,
> so it worked out.  However, I can see this being useful in other
> places, so I was wondering if there is a good way to do this.
> Is there a term or phrase I can search on?
Buckets? You'll need a function to differentiate the files into the desired
groups(buckets). You can use that function to generate a dictionary mapping a
bucket name to a list of files(the bucket). Iterate over all the files, sorting
them into buckets. Then iterate over the buckets, increasing the counter every
file, and reseting the counter in between buckets.

Here's how I'd do it for your script:

def get_bucket_key(filename):
    prefix = filename.split('_')[0]
    return prefix

def sort_files_into_buckets(file_list):
    buckets_to_files = {}
    for filename in file_list:
        key = get_bucket_key(filename)
        if key in buckets_to_files:
            buckets_to_files[key].append(filename)
        else:
            buckets_to_files[key] = [filename]
    return buckets_to_files

file_list = os.listdir(...)
buckets_to_files = sort_files_into_buckets(file_list)

for (bucket_key, bucket_file_list) in buckets_to_files.items():
    counter = 1
    for file in bucket_file_list:
        print "{0}_{1}.jpg".format(bucket_key, counter)
        counter += 1

> - I'd be interested to read any other comments on the code.
> I'm new to python and I have only a bit of computer science study,
> quite some time ago.
You should read PEP8[http://legacy.python.org/dev/peps/pep-0008/] for the basic
Python style guide. If you'll be writing things to share or come back to after
a while, it would be good to learn about documentation, start by reading
PEP257[http://legacy.python.org/dev/peps/pep-0257/].

Note that I've written my own versions alongside my comments. I'm not sure if
that's OK on this mailing list, and if you want a spoiler-free version with
just the commentary and no code let me know and I'll resend it:

> # get longnames from fileNames.tab
> def get_long_name(glnAbbrev):
Don't be afraid to use longer function/argument names and skip the
comment, your code will end up much more readable.

Does the "gln" in "glnAbbrev" stand for "get_long_name"? It seems
repetitive to include the name of the function in the name of it's
parameters.

And I think we're passing the function a prefix, not an abbreviation:

    def get_long_names_from_file(prefix):

>     with open(
>           os.path.join(os.path.expanduser('~'),'temp2','fileNames.tab')
>           ) as filenames:
Good, pythonic usage of context managers but it took a second longer to
click than splitting it out into 2 lines. Also you should add spaces
after each parameter in a function call:

    filenames_path = os.path.join(os.path.expanduser('~'), 'temp2', 'fileNames.tab')
    with open(filenames_path) as input_file:

>         filenamesdata = csv.reader(filenames, delimiter='\t')
>         for row in filenamesdata:
>             if row[0] == glnAbbrev:
>                 return row[1]
This is where you would build a dictionary.

You could skip the csv module and just iterate through the file,
splitting each line to get the prefix and long_name. The split()
function splits a string into a list of strings. If every line in this
file has only 2 columns, we can unpack the list directly into 2 variables.

I usually name my dictionary variables "key_to_value":

    prefix_to_long_name = {}
    for line in input_file:
        prefix, long_name = line.split()
        prefix_to_long_name[prefix] = long_name
    return prefix_to_long_name

This could also be done by dictionary comprehensions:

    return {prefix: long_name for (prefix, long_name) in
            (line.split() for line in input_file)}

Then just return your finished dictionary. Note that since we return a
dictionary instead of searching for a specific prefix, we do not need our
"prefix" parameter:

    def get_long_names_from_file():

But this sounds like we are returning a list of long_names, not a dictionary so
I would rename it again to:

    def get_prefix_to_long_name_from_file()

> # find shortname from slice in picture filename
> def get_slice(fn):
>     threeColSlice = fn[1:4]
>     return threeColSlice
I agree that this is overkill but with the suggestions above we can at least
make it nicer. Variables should be lower_case_with_underscores:

    def get_prefix_from_filename(filename):
        prefix = filename[1:4]
        return prefix

I try to avoid hardcoded splices, so I would opt to split by underscores and
keep the leading "p" (at least document the expected filenames somewhere in the
script so people understand what those indexes represent):

    def get_prefix_from_filename(filename):
        prefix = filename.split('_')[0]
        return prefix

> # get 3-digit sequence number from basename
> def get_bn_seq(fn):
>     seq = os.path.splitext(fn)[0][-3:]
>     return seq
Again you could use better names and a split instead of a splice.

The fact that the function uses the basename instead of the full filename is
not something that needs to be exposed in the function name:

    def get_number_from_filename(filename):
        base_filename = os.path.splitext(filename)[0]
        number = filename.split('_')[-1]
        return number

I would also write 2 more functions. The first to get paths to the temp2/4/5
folders and prevent code repetition(Don't Repeat Yourself):

    def get_folder_in_home(folder):
        folder_path = os.path.join(os.path.expanduser('~'), folder)

Then we can change the relevant line in get_long_names_from_file():

        folder_path = get_folder_in_home('temp2')
        filenames_path = os.path.join(folder_path, 'fileNames.tab')

The second function would return the new name of a file:

    def get_new_filename(filename, prefix_to_long_name):
        prefix = get_prefix_from_filename(filename)
        long_name = prefix_to_long_name[prefix]
        number = get_number_from_filename(filename)
        new_filename = long_name + "_-_" + number + ".jpeg"
        return new_filename

Using String Formatting is safer than concatenation(which would fail if number
was an int):

    new_filename = "{0}_-_{1}.jpeg".format(long_name, number)

Now that we're past the definitions we should build our prefix->long_name
dictionary:

    prefix_to_long_name = get_prefix_to_long_name_from_file()

> # directory locations
> indir = os.path.join(os.path.expanduser('~'),'temp4')
> outdir = os.path.join(os.path.expanduser('~'),'temp5')
The input and output directories are set and then never change. We can indicate
this to the reader by writing the variable in ALL_CAPS:

    INPUT_DIRECTORY = get_folder_in_home('temp4')
    OUTPUT_DIRECTORY = get_folder_in_home('temp5')

> # rename
> for f in os.listdir(indir):
>     if f.endswith(".jpg"):
>         os.rename(
>             os.path.join(indir,f),os.path.join(
>                 outdir,
>                 get_long_name(get_slice(f))+"_-_"+get_bn_seq(f)+".jpeg")
>                 )
I would at least rename "f" to "input_file". We could also use a list
comprehension to generate the list and rename in one step:

    [os.rename(os.path.join(INPUT_DIRECTORY,
                            image_file),
               os.path.join(OUTPUT_DIRECTORY,
                            get_new_filename(image_file, prefix_to_long_name))
               )
     for image_file in os.listdir(INPUT_DIRECTORY)
     if image_file.endswith(".jpg")]

Although that's a bit much:

    for input_file in os.listdir(INPUT_DIRECTORY):
        if input_file.endswith(".jpg"):
            input_path = os.path.join(INPUT_DIRECTORY, input_file)
            new_filename= get_new_filename(input_file, prefix_to_long_name)
            output_path = os.path.join(OUTPUT_DIRECTORY, new_filename)
            os.rename(input_path, output_path)

Some would also stick all of the top-level scripting into a function called
main. You set the main() function to run automatically only if the script is
called as a file. It won't execute if imported as a module in another script:

    if __name__ == '__main__':
        main()

Here's the fully revised script:

import os
from os.path import join

def main():
    INPUT_DIRECTORY = get_folder_in_home('temp4')
    OUTPUT_DIRECTORY = get_folder_in_home('temp5')

    prefix_to_long_name = get_prefix_to_long_name_from_file()

    for input_file in os.listdir(INPUT_DIRECTORY):
        if input_file.endswith(".jpg"):
            new_filename = get_new_filename(input_file, prefix_to_long_name)

            input_path = join(INPUT_DIRECTORY, input_file)
            output_path = join(OUTPUT_DIRECTORY, new_filename)

            os.rename(input_path, output_path)

def get_folder_in_home(folder):
    folder_path = join(os.path.expanduser('~'), folder)
    return folder_path

def get_prefix_to_long_name_from_file():
    prefix_to_long_name = {}
    filenames_path = join(get_folder_in_home('temp2'), 'fileNames.tab')

    with open(filenames_path) as input_file:
        for line in input_file:
            prefix, long_name = line.split()
            prefix_to_long_name[prefix] = long_name

    return prefix_to_long_name

def get_prefix_from_filename(filename):
    prefix = filename.split('_')[0]
    return prefix

def get_number_from_filename(filename):
    base_filename = os.path.splitext(filename)[0]
    number = base_filename.split('_')[-1]
    return number

def get_new_filename(filename, prefix_to_long_name):
    prefix = get_prefix_from_filename(filename)
    long_name = prefix_to_long_name[prefix]
    number = get_number_from_filename(filename)
    new_filename = "{0}_-_{1}.jpeg".format(long_name, number)
    return new_filename

if __name__ == '__main__':
    main()
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 966 bytes
Desc: GnuPG Digital Signature
URL: <http://mail.python.org/pipermail/tutor/attachments/20140310/ce792912/attachment.sig>