Best way to parse file into db-type layout?

Sat Apr 30 19:16:22 EDT 2005

On Sat, 30 Apr 2005 09:23:16 -0400, Steve Holden <steve at holdenweb.com>
wrote:

>John Machin wrote:
>[...]
>> 
>> I wouldn't use fileinput for a "commercial data processing" exercise,
>> because it's slow, and (if it involved using the Python csv module) it
>> opens the files in text mode, and because in such exercises I don't
>> often need to process multiple files as though they were one file.
>> 
>If the process runs once a month, and takes ten minutes to process the
>required data, isn't that fast enough?

Depends: (1) criticality: could it have been made to run in 5 minutes,
avoiding the accountant missing the deadline to EFT the taxes to the
government (or, worse, missing the last train home)?

(2) "Many a mickle makes a muckle": the total of all run times could
be such that overnight processing doesn't complete before the day
shift turns up ...

> It's unwise to act as though 
>"slow" is an absolute term.

>
>> When I am interested in multiple files -- more likely a script that
>> scans source files -- even though I wouldn't care about the speed nor
>> the binary mode, I usually do something like:
>> 
>> import glob
>> for pattern in args: # args from an optparse parser
>>     for filename in glob.glob(pattern):
>>         for line in open(filename):
>>             pass # process the line here
>> 
>> There is an "on principle" element to it as well -- with
>> fileinput one has to use the awkish methods like filelineno() and
>> nextfile(); strikes me as a tricksy and inverted way of doing things.
>> 
>But if it happens to be convenient for the task at hand why deny the OP 
>the use of a tool that can solve a problem? We shouldn't be so purist 
>that we create extra (and unnecessary) work :-), and principles should 
>be tempered with pragmatism in the real world.

If the job at hand is simulating awk's file-reading habits, then yes,
fileinput is convenient. However, if the job at hand involves anything
like real-world commercial data processing requirements, then fileinput
is NOT convenient.

Example 1: The requirement is, for each input file, to display the name
of the file, the number of records, and some data totals.

Example 2: The requirement is, if end of file occurs when not expected
(including, but not restricted to, the case of zero records), to display
an error message and terminate abnormally.
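
To make Example 2 concrete, here is a minimal sketch of the explicit
open/for-line way of doing it. The file layout is an assumption made
purely for illustration: each file is taken to end with a trailer line
reading "TRAILER", and an empty file is treated as the zero-records
case. The names UnexpectedEOF, count_records and main are hypothetical.

```python
import sys


class UnexpectedEOF(Exception):
    """File ended before its trailer record (hypothetical layout)."""


def count_records(filename, trailer="TRAILER"):
    """Return the number of data records that precede the trailer line."""
    count = 0
    with open(filename) as f:
        for line in f:
            if line.rstrip("\n") == trailer:
                return count
            count += 1
    # Falling out of the loop means EOF arrived before the trailer.
    # An empty file lands here with count == 0.
    raise UnexpectedEOF(
        "%s: end of file after %d record(s), no trailer" % (filename, count)
    )


def main(filenames):
    for filename in filenames:
        try:
            n = count_records(filename)
        except UnexpectedEOF as exc:
            # sys.exit with a string prints it to stderr and exits with status 1.
            sys.exit("error: %s" % exc)
        print("%s: %d records" % (filename, n))
```

Because each file has its own loop, "the loop ended" and "this file
ended" are the same event, so the check sits naturally right after the
loop.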

I'd like to see some code for example 1 that used fileinput (on a list
of filenames) and didn't involve "extra (and unnecessary) work"
compared to the "for filename in alist / f = open(filename) / for line
in f" way of doing it.
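
For comparison, here is a rough sketch of Example 1 done both ways. It
is only an illustration under stated assumptions: the record layout
(one record per line, with a numeric amount in the last comma-separated
field) is made up, and report_plain and report_fileinput are
hypothetical names. It uses the context-manager form of
fileinput.input(), which is available in current Python versions.

```python
import fileinput


def amount(line):
    # Assumed layout: the last comma-separated field is a number.
    return float(line.rsplit(",", 1)[-1].strip())


def report_plain(filenames):
    """One open() per file; per-file state is just local variables."""
    results = []
    for filename in filenames:
        count = 0
        total = 0.0
        with open(filename) as f:
            for line in f:
                count += 1
                total += amount(line)
        results.append((filename, count, total))
    return results


def report_fileinput(filenames):
    """Same report via fileinput; file boundaries are detected by hand."""
    results = []
    current = None
    count = 0
    total = 0.0
    with fileinput.input(files=filenames) as fi:
        for line in fi:
            if fi.isfirstline():
                # A new file has started: flush the previous file's totals.
                if current is not None:
                    results.append((current, count, total))
                current = fi.filename()
                count = 0
                total = 0.0
            count += 1
            total += amount(line)
    # The last file's totals still have to be flushed after the loop.
    if current is not None:
        results.append((current, count, total))
    return results
```

Note that in this sketch an empty input file gets a (name, 0, 0.0) entry
from report_plain but never shows up at all in report_fileinput, since
it contributes no lines and so no "first line" is ever seen for it.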

If fileinput didn't exist, what do you think the reaction would be if
you raised a PEP to include it in the core?