Organizing function calls once files have been moved to a directory

Wed Jun 24 07:59:26 EDT 2015

On Tuesday, June 23, 2015 at 10:18:43 PM UTC-4, Steven D'Aprano wrote:
> On Wed, 24 Jun 2015 06:16 am, kbtyo wrote:
> 
> > I am working on a workflow module that will allow one to recursively check
> > for file extensions and if there is a match move them to a folder for
> > processing (parsing, data wrangling etc).
> > 
> > I have a simple search process, and log for the files that are present
> > (see below). However, I am puzzled by what the most efficient
> > method/syntax is to call functions once the selected files have been
> > moved? 
> 
> The most efficient syntax is the regular syntax that you always use when
> calling a function:
> 
>     function(arg, another_arg)
> 
> 
> What else would you use?
> 
> 
> > I have the functions and classes written in another file. Should I 
> > import them or should I include them in the same file as the following
> > mini-script?
> 
> That's entirely up to you. Some factors you might consider:
> 
> - Are these functions and classes reusable by other code? Then you might
> want to keep them separate in another file, treated as a library, and
> import the library into your application.

I think I will do just that.  
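
For what it is worth, here is a rough sketch of how the "move, then call" step could look with plain function calls. It assumes Python 3; `move_and_process` and `file_size` are hypothetical names, and `file_size` only stands in for a real parsing function that would be imported from the separate library file.

```python
import os
import shutil


def move_and_process(paths, dest, process):
    """Move each path into dest, then call process() on the moved file."""
    os.makedirs(dest, exist_ok=True)
    results = {}
    for path in paths:
        # shutil.move returns the path of the moved file (Python 3.3+)
        moved = shutil.move(path, os.path.join(dest, os.path.basename(path)))
        # An ordinary function call, made once the file is in place
        results[moved] = process(moved)
    return results


def file_size(path):
    # Hypothetical stand-in for a real parsing function
    return os.path.getsize(path)
```

With the real functions kept in their own module, only the `process` argument would change; the calling code stays the same.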
> 
> - If you merge the two files together, will it be so big that it is
> difficult to work with? Then don't merge them together. My opinion is that
> the decimal module from the standard library is about as big as a single
> module should ever be, and it is almost 6,500 lines. So if your
> application is bigger than that, you might want to split it.
> 
 Yes, I think that is the best course of action. 
> 
> > Moreover, should I create another log file for processing? If so, what is
> > an idiomatically correct method to do so?
> 
> I don't know. Do you want a second log file? How will it be different from
> the first?

The second log file would provide information to the user on the files that:

1. Have been processed (specifically, parsed from XML to CSV; three functions do this)

2. Did not pass one of those functions

3. Contain malformed XML (this is particularly essential)

> As for creating another log file, I guess the most correct way to do so
> would be the same way you created the first log file.
> 

This is a little difficult for me to conceptualize. I am wondering whether items 1, 2 and 3 above should each get a separate logging function tied to the specific functions that parse the XML and write the CSV, or whether everything should go into a single second error log.

I am wondering how I should organize this such that I can "automate" this process.
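
One possible layout, sketched here with the standard `logging` module rather than hand-written file output, is a single processing log in which the three cases are told apart by log level. The names `make_logger` and `process_files` are hypothetical, and the sketch assumes each parsing function accepts the parsed `ElementTree` and raises an exception when a file does not pass.

```python
import logging
import xml.etree.ElementTree as ET


def make_logger(name, logpath):
    """Return a logger that writes to its own file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers
        handler = logging.FileHandler(logpath, mode='w')
        handler.setFormatter(logging.Formatter('%(levelname)s %(message)s'))
        logger.addHandler(handler)
    return logger


def process_files(paths, parse_funcs, logger):
    """Run every function in parse_funcs on each file and log the outcome."""
    summary = {'processed': [], 'failed': [], 'malformed': []}
    for path in paths:
        try:
            tree = ET.parse(path)
        except ET.ParseError as err:
            # Case 3: the XML itself is malformed
            logger.error('malformed XML: %s (%s)', path, err)
            summary['malformed'].append(path)
            continue
        try:
            for func in parse_funcs:
                func(tree)
        except Exception as err:
            # Case 2: one of the parsing functions rejected the file
            logger.warning('failed in %s: %s (%s)', func.__name__, path, err)
            summary['failed'].append(path)
        else:
            # Case 1: every function ran without error
            logger.info('processed: %s', path)
            summary['processed'].append(path)
    return summary
```

The search log and the processing log can then be created the same way, by calling `make_logger` twice with different names and file paths.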

FYI The first error log is simply this:

import os

def log_results(logpath, results):
    # The header in our logfile (topdir is the module-level search root)
    loghead = 'Search log from filefind for files in {}\n\n'.format(
                  os.path.realpath(topdir)
              )
    # The body of our log file
    logbody = ''

    # Loop through results, a dict mapping each extension to its found files
    for searchResult in results:
        # Concatenate the file names found for this extension
        logbody += "<< Results with the extension '%s' >>" % searchResult
        logbody += '\n\n%s\n\n' % '\n'.join(results[searchResult])

    # Write results to the logfile
    with open(logpath, 'w') as logfile:
        logfile.write(loghead)
        logfile.write(logbody)
It can be called from the code below (which you have already commented on; thank you for the feedback).

> I'm not sure I actually understand your questions so far.
> 
> Some further comments on your code:
> 
> > if __name__ == '__main__':
> > 
> > # The top argument for name in files
> >     topdir = '.'
> >     dest = 'C:\\Users\\wynsa2\\Desktop\\'
> 
> Rather than escaping backslashes, you can use regular forward slashes:
> 
> dest = 'C:/Users/wynsa2/Desktop/'
> 
> 
> Windows will accept either.

Ah - good to know!
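
A quick way to check that the spellings are equivalent, sketched with `ntpath` (the Windows implementation of `os.path`, which can be imported on any platform). The variable names are only for illustration.

```python
import ntpath  # the Windows flavour of os.path, importable on any platform

forward = 'C:/Users/wynsa2/Desktop/'
escaped = 'C:\\Users\\wynsa2\\Desktop\\'
# A raw string avoids the escaping, but it cannot end in a single backslash
raw = r'C:\Users\wynsa2\Desktop'

# normpath converts forward slashes to backslashes and drops the trailing one
same = (ntpath.normpath(forward)
        == ntpath.normpath(escaped)
        == ntpath.normpath(raw))
```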
> 
> 
> >     extens = ['docs', 'docx', 'pdf'] # the extensions to search for
> >     found = {x: [] for x in extens} # lists of found files
> >  
> >     # Directories to ignore
> >     ignore = ['docs', 'doc', 'py', 'pdf']
> >     logname = "file_search.log"
> >     print('Beginning search for files in %s' % os.path.realpath(topdir))
> >   
> >     # Walk the tree
> >     for dirpath, dirnames, files in os.walk(topdir):
> >         # Remove directories in ignore
> >         # directory names must match exactly!
> >         for idir in ignore:
> >             if idir in dirnames:
> >                 dirnames.remove(idir)
> >      
> >         # Loop through the file names for the current step
> >         for name in files:
> >             # Calling str.rsplit on name splits the string into a list
> >             # (from the right), with the first argument "." delimiting it
> >             # and making at most as many splits as the second argument (1).
> >             # The third part ([-1]) retrieves the last element of the list.
> >             # We use this instead of an index of 1 so that no IndexError is
> >             # raised if no splits are made (if there is no "." in name).
> > 
> >             ext = name.lower().rsplit('.', 1)[-1]
> 
> The better way to split the extension from the file name is to use
> os.path.splitext(name):
> 
> 
> py> import os
> py> os.path.splitext("this/file.txt")
> ('this/file', '.txt')
> py> os.path.splitext("this/file")  # no extension
> ('this/file', '')
> py> os.path.splitext("this/file.tar.gz")
> ('this/file.tar', '.gz')
> 
> 
> -- 
> Steven

Thank you for your feedback on utilizing the OS module more efficiently.
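
For reference, a minimal sketch of the search loop rewritten around `os.path.splitext`, assuming Python 3. The function name `find_by_extension` is hypothetical; the rest follows the script quoted above. One difference from the `rsplit` version: a file literally named `pdf` (no dot) is no longer treated as having the extension `pdf`.

```python
import os


def find_by_extension(topdir, extens, ignore=()):
    """Return {extension: [paths]} for files under topdir."""
    found = {ext: [] for ext in extens}
    for dirpath, dirnames, files in os.walk(topdir):
        # Prune ignored directories in place so os.walk skips them
        dirnames[:] = [d for d in dirnames if d not in ignore]
        for name in files:
            # splitext returns e.g. ('report', '.pdf'); drop the leading dot
            ext = os.path.splitext(name)[1].lower().lstrip('.')
            if ext in found:
                found[ext].append(os.path.join(dirpath, name))
    return found
```

Something like `find_by_extension('.', ['docx', 'pdf'], ignore=['docs'])` would then return the same kind of `found` dict that `log_results` expects.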