Best way to parse file into db-type layout?

Sat Apr 30 20:29:03 EDT 2005

On Sat, 30 Apr 2005 14:31:08 +0100, Michael Hoffman
<cam.ac.uk at mh391.invalid> wrote:

>John Machin wrote:
>
>>>>>That's nice. Well I agree with you, if the OP is concerned about embedded
>>>>>CRs, LFs and ^Zs in his data (and he is using Windows in the latter case),
>>>>>then he *definitely* shouldn't use fileinput.
>>>>
>>>>And if the OP is naive enough not to be concerned, then it's OK, is
>>>>it?
>>>
>>>It simply isn't a problem in some real-world problem domains. And if there
>>>are control characters the OP didn't expect in the input, and csv loads it
>>>without complaint, I would say that he is likely to have other problems once
>>>he's processing it.
>> 
>> Presuming for the moment that the reason for csv not complaining is
>> that the data meets the csv non-spec and that the csv module is
>> checking that: then at least he's got his data in the structural
>> format he's expecting; if he doesn't do any/enough validation on the
>> data, we can't save him from that.
>
>What if the input is UTF-16? Your solution won't work for that. And there
>are certainly UTF-16 CSV files out in the wild.

The csv module docs do say that Unicode is not supported.

This does appear to work, however, at least for data that could in
fact be encoded as ASCII:

>>> import codecs
>>> import csv
>>> j = codecs.open('utf16junk.txt', 'rb', 'utf-16')
>>> rdr = csv.reader(j, delimiter='\t')
>>> rows = list(rdr)
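
For anyone reading this with a later Python (3.x), where the csv module
works on text rather than bytes, roughly the same thing can be sketched
as below. The function name read_utf16_tsv and the throwaway demo file
are my own inventions for illustration, not anything from the OP's setup.

```python
import csv
import os
import tempfile

def read_utf16_tsv(path):
    """Read a tab-delimited UTF-16 text file into a list of rows."""
    # newline='' leaves line-ending handling to the csv module.
    with open(path, encoding='utf-16', newline='') as f:
        return list(csv.reader(f, delimiter='\t'))

# Demo: write a small UTF-16 file (hypothetical data), then read it back.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w', encoding='utf-16', newline='') as out:
    out.write('name\tcity\r\nZo\u00eb\tK\u00f6ln\r\n')
demo_rows = read_utf16_tsv(demo_path)
os.remove(demo_path)
```

Note that this version is not limited to data that could be encoded as
ASCII.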

The usual trick to smuggle righteous data past the heathen (recode as
UTF-8, cross the border, decode) should work. However, the OP's data is
coming from a mainframe (MF), not from Excel's "save as Unicode text"
(which produces a tab-delimited .txt file -- how do you get a UTF-16 CSV
file?), and if it's not in ASCII it is rather more likely to be in
EBCDIC than in UTF-16 -- unless MFs have come a long way since I last
had anything to do with them :-)
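
If the data did turn out to be EBCDIC, decoding it before parsing might
look something like the sketch below. The code page (cp037) is purely an
assumption; the real one would have to come from whoever produces the
file. The function name rows_from_ebcdic and the sample record are
hypothetical.

```python
import csv
import io

def rows_from_ebcdic(raw, codepage='cp037'):
    """Decode EBCDIC bytes to text, then parse the text as CSV.

    The default code page is an assumption, not a known fact about
    the OP's data.
    """
    text = raw.decode(codepage)
    # newline='' leaves line-ending handling to the csv module.
    return list(csv.reader(io.StringIO(text, newline='')))

# Hypothetical sample, encoded here only so the sketch can run.
sample = 'ID,NAME\n001,SMITH\n'.encode('cp037')
ebcdic_rows = rows_from_ebcdic(sample)
```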

In any case, my "solution" was a sketch, and stated to be such. We
don't know, and I suspect the OP doesn't know, exactly (1) what
encoding is being used and (2) what the rules are for quoting the
delimiter and for quoting the quote character. Even if the data is
encoded in ASCII and the delimiter is a comma, it's quite possible that
the quoting system being used is not the expected Excel-like method but
something else, in which case the csv module may not be usable.
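
Some non-Excel conventions can still be described to the csv module
through its dialect parameters. A minimal sketch, assuming (and it is
only an assumption) that the source escapes an embedded quote with a
backslash instead of doubling it; the sample line is made up:

```python
import csv

# One record in which the quote character is escaped as \" rather
# than doubled as "" (hypothetical data).
lines = ['1,"He said \\"hi\\", then left",x']

rdr = csv.reader(lines, doublequote=False, escapechar='\\')
escaped_rows = list(rdr)
```

If the real rules are anything odder than that, a hand-written parser
is probably the only option.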

>
>I think at some point you have to decide that certain kinds of data
>are not sensible input to your program, and that the extra hassle in
>programming around them is not worth the benefit.

I prefer to decide at a very early point what is sensible input to a
program, and then try to ensure that nonsensible input neither goes
unnoticed nor crashes with an unhelpful message.
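
A sketch of what I mean, assuming the only structural rule is a fixed
number of fields per record. BadRecord and read_checked are names made
up for this example:

```python
import csv

class BadRecord(ValueError):
    """Raised when a record does not have the expected shape."""

def read_checked(lines, expected_fields):
    """Return all rows, or fail with a message naming the bad line."""
    rdr = csv.reader(lines)
    rows = []
    for row in rdr:
        if len(row) != expected_fields:
            # line_num is the number of source lines read so far.
            raise BadRecord(
                'line %d: expected %d fields, got %d: %r'
                % (rdr.line_num, expected_fields, len(row), row))
        rows.append(row)
    return rows
```

As written, a blank line counts as a record with zero fields and is
rejected; whether that is the right call depends on the data.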

>
>> There is also an "on principle" element to it as well -- with
>> fileinput one has to use the awkish methods like filelineno() and
>> nextfile(); strikes me as a tricksy and inverted way of doing things.
>
>Yes, indeed. I never use those, and would probably do something akin to what
>you are suggesting rather than doing so. I simply enjoy the no-hassle
>simplicity of fileinput.input() rather than worrying about whether my data
>will be piped in, or in file(s) specified on the command line.

Good, now we're singing from the same hymnbook :-)
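
For the record, the "something akin" can be sketched as a small
generator that keeps the file name and line number alongside each line,
so there is no need for filelineno() or nextfile(). The name
numbered_lines and the '<stdin>' label are my own choices for this
sketch:

```python
import io
import sys

def numbered_lines(paths, stdin=sys.stdin):
    """Yield (name, lineno, line) for each line of each input.

    Falls back to stdin when no paths are given, much as
    fileinput.input() does.
    """
    if not paths:
        for n, line in enumerate(stdin, 1):
            yield '<stdin>', n, line
        return
    for path in paths:
        with open(path) as f:
            # Line numbers restart at 1 for every file.
            for n, line in enumerate(f, 1):
                yield path, n, line

# Demo with a stand-in for stdin; a script would normally call
# numbered_lines(sys.argv[1:]).
demo = list(numbered_lines([], io.StringIO('a\nb\n')))
```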