Best way to parse file into db-type layout?

John Machin sjmachin at lexicon.net
Sat Apr 30 08:26:43 EDT 2005


On Sat, 30 Apr 2005 11:35:05 +0100, Michael Hoffman
<cam.ac.uk at mh391.invalid> wrote:

>John Machin wrote:

>> Real-world data is not "text".
>
>A lot of real-world data is. For example, almost all of the data I deal with
>is text.

OK, that depends on one's definitions of "data" and "text". In the
domain of commercial database applications there is what's loosely
called "text": entity names, addresses, product descriptions, and the
dreaded free-text "note" columns -- all of which (not just the
"notes") one can end up parsing, trying to extract extraneous data
that's been dumped in there ... sigh ...
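For example (a rough, untested sketch -- the column contents and the
pattern are invented), fishing a phone number out of a free-text
"note" column might look like:

import re

# Hypothetical: extract a phone number someone has dumped into a
# free-text "note" column.  The pattern is only illustrative.
PHONE = re.compile(r'\b(\d{4}[ -]?\d{3}[ -]?\d{3})\b')

def extract_phone(note):
    match = PHONE.search(note)
    if match:
        return match.group(1)
    return None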

>
>>>That's nice. Well I agree with you, if the OP is concerned about embedded
>>>CRs, LFs and ^Zs in his data (and he is using Windows in the latter case),
>>>then he *definitely* shouldn't use fileinput.
>> 
>> And if the OP is naive enough not to be concerned, then it's OK, is
>> it?
>
>It simply isn't a problem in some real-world problem domains. And if there
>are control characters the OP didn't expect in the input, and csv loads it
>without complaint, I would say that he is likely to have other problems once
>he's processing it.

Presuming for the moment that the reason csv doesn't complain is that
the data meets the csv non-spec and that the csv module actually
checks that: then at least he's got his data in the structural format
he's expecting. If he doesn't do any (or enough) validation on the
data itself, we can't save him from that.
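To make that concrete (sketch only -- the field layout and the checks
are invented), csv hands back rows already split into fields, but
checking what's in those fields is still the programmer's job:

import csv

# Sketch: csv does the structural splitting; field-level validation is
# still up to the programmer.  The three-field layout is invented.
lines = ['A123,5,ok', 'B456,oops,note with junk dumped in it']
for row in csv.reader(lines):
    if len(row) != 3:
        raise ValueError('expected 3 fields, got %r' % (row,))
    code, quantity, note = row
    if not quantity.isdigit():
        raise ValueError('bad quantity in row %r' % (row,))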

>
>> Except, perhaps, the reason stated in fileinput.py itself: 
>> 
>> """
>> Performance: this module is unfortunately one of the slower ways of
>> processing large numbers of input lines.
>> """
>
>Fair enough, although Python is full of useful things that save the
>programmer's time at the expense of that of the CPU, and this is
>frequently considered a Good Thing.
>
>Let me ask you this, are you simply opposed to something like fileinput
>in principle or is it only because of (1) no binary mode, and (2) poor
>performance? Because those are both things that could be fixed. I think
>fileinput is so useful that I'm willing to spend some time working on it
>when I have some.

I wouldn't use fileinput for a "commercial data processing" exercise:
it's slow, it opens the files in text mode (a problem if the Python
csv module is involved), and in such exercises I don't often need to
process multiple files as though they were one file.
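A sketch of what I'd do instead (the filename list and the row handler
are placeholders, and 'rb' is the Python 2 idiom the csv module
wants): open each file in binary mode yourself, which fileinput gives
you no way to ask for:

import csv

def handle_row(row):
    pass  # placeholder for whatever the real processing is

filenames = ['a.csv', 'b.csv']   # placeholder list of input files
for filename in filenames:
    # Binary mode so embedded CRs/LFs inside quoted fields survive
    # intact (Python 2 era csv usage); fileinput opens in text mode.
    f = open(filename, 'rb')
    for row in csv.reader(f):
        handle_row(row)
    f.close()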

When I am interested in multiple files -- more likely in a script that
scans source files -- even though I wouldn't care about speed or
binary mode, I usually do something like:

import glob

for pattern in args:  # args: filename patterns from an optparse parser
    for filename in glob.glob(pattern):
        for line in open(filename):
            pass  # process the line here

There is also an "on principle" element to it as well -- with
fileinput one has to use the awkish methods like filelineno() and
nextfile(); strikes me as a tricksy and inverted way of doing things.
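For comparison (an untested sketch -- the filenames and the line limit
are made up), the fileinput style I'm objecting to looks like this:

import fileinput
import sys

# Awk-ish, module-level state queries: which file and line you are on
# is asked of the fileinput module itself, not of an explicit file object.
for line in fileinput.input(['a.txt', 'b.txt']):   # placeholder filenames
    if fileinput.isfirstline():
        sys.stderr.write('starting %s\n' % fileinput.filename())
    if fileinput.filelineno() > 1000:
        fileinput.nextfile()   # skip the rest of the current file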

Cheers,
John



