CSV(???)

David C. Ullrich ullrich at math.okstate.edu
Sat Feb 24 07:26:31 EST 2007


On 23 Feb 2007 07:31:35 -0800, "John Machin" <sjmachin at lexicon.net>
wrote:

>On Feb 23, 10:11 pm, David C. Ullrich <ullr... at math.okstate.edu>
>wrote:
>> Is there a csvlib out there somewhere?
>
>I can make available the following which should be capable of running
>on 1.5.2 -- unless they've suffered bitrot :-)
>
>(a) a csv.py which does simple line-at-a-time hard-coded-delimiter-etc
>pack and unpack i.e. very similar to your functionality *except* that
>it doesn't handle newline embedded in a field. You may in any case be
>interested to see a different way of writing this sort of thing: my
>unpack does extensive error checking; it uses a finite state machine
>so unexpected input in any state is automatically an error.

Actually a finite-state machine was the first thing I thought of.
Then while I was thinking about what states would be needed, etc,
it occurred to me that I could get something working _now_ by
just noticing that (assuming valid input) a quoted field
would be terminated by '",' or '"[eos]'.

A finite-state machine seems like the "right" way to do it,
but there are plenty of other parts of the project where
doing it right is much more important - yes, in my experience
doing it "right" saves time in the long run, but that
finite-state machine would have taken more time 
_yesterday_.
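
A minimal sketch of the pack/unpack pair discussed here, written in
modern Python rather than 1.5.x. It is not the code from the original
post (that was snipped); the names csvline and parsecsvline come from
the thread, everything else is an assumption. One detail differs from
a plain search for '",': a field that itself contains '",' is encoded
with a doubled quote before the comma, so the scan below treats a
doubled quote as a literal quote and only a lone quote as the end of
the field. Like the original, it assumes valid input.

```python
def csvline(fields):
    """Join a list of strings into one CSV line."""
    out = []
    for field in fields:
        # Quote when the field holds a comma, a quote, or a line break.
        if any(ch in field for ch in ',"\r\n'):
            field = '"' + field.replace('"', '""') + '"'
        out.append(field)
    return ",".join(out)


def parsecsvline(line):
    """Invert csvline(); assumes `line` was produced by csvline()."""
    fields = []
    i, n = 0, len(line)
    while True:
        if i < n and line[i] == '"':
            # Quoted field: copy characters up to the closing quote.
            i += 1
            chars = []
            while True:
                if line[i] != '"':
                    chars.append(line[i])
                    i += 1
                elif i + 1 < n and line[i + 1] == '"':
                    chars.append('"')   # doubled quote -> literal quote
                    i += 2
                else:
                    i += 1              # lone quote closes the field
                    break
            fields.append("".join(chars))
        else:
            # Unquoted field: runs to the next comma or end of string.
            j = line.find(",", i)
            if j == -1:
                j = n
            fields.append(line[i:j])
            i = j
        if i >= n:
            return fields
        i += 1  # skip the separating comma
```

Note that an empty list and a list holding one empty string both
encode to the empty string, so the round trip cannot tell them apart.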

>(b) an extension module (i.e. written in C) with the same API. The
>python version (a) imports and uses (b) if it exists.
>
>(c) an extension module which parameterises everything including the
>ability to handle embedded newlines.
>
>The two extension modules have never been compiled & tested on other
>than Windows but they both should IIRC be compilable with both gcc
>(MinGW) and the free Borland 5.5 compiler -- in other words vanilla C
>which should compile OK on Linux etc.
>
>If you are interested in any of the above, just e-mail me.

Keen.

>>
>> And/or does anyone see any problems with
>> the code below?
>>
>> What csvline does is straightforward: fields
>> is a list of strings. csvline(fields) returns
>> the strings concatenated into one string
>> separated by commas. Except that if a field
>> contains a comma or a double quote then the
>> double quote is escaped to a pair of double
>> quotes and the field is enclosed in double
>> quotes.
>>
>> The part that seems somewhat hideous is
>> parsecsvline. The intention is that
>> parsecsvline(csvline(fields)) should be
>> the same as fields. Haven't attempted
>> to deal with parsecsvline(data) where
>> data is in an invalid format - in the
>> intended application data will always
>> be something that was returned by
>> csvline.
>
>"Always"? Famous last words :-)

Heh. 

Otoh, having read about all the existing variations
in csv files, I don't think I'd attempt to write
something that parses csv provided from an
external source.

>> It seems right after some
>> testing... also seems blechitudinous.
>
>I agree that it's bletchworthy, but only mildly so. If it'll make you
>feel better, I can send you as a yardstick csv pack and unpack written
>in awk -- that's definitely *not* a thing of beauty and a joy
>forever :-)
>
>I presume that you don't write csvline() output to a file, using
>newline as a record terminator and then try to read them back and pull
>them apart with parsecsvline() -- such a tactic would of course blow
>up on the first embedded newline. 

Indeed. Thanks - this is exactly the sort of problem I was hoping
people would point out (although in fact this one is irrelevant,
since I already realized this). In fact the fields will not
contain linefeeds (the data is coming from <INPUT type="text">
on an html form, which means that unless someone's _trying_
to cause trouble a linefeed is impossible, right? Regardless,
incoming data is filtered. Fields containing newlines are
quoted just to make the thing usable in other situations - I
wouldn't use parsecsvline without being very careful, but there's
no reason csvline shouldn't have general applicability.) And in 
any case, no, I don't intend to be parsing multi-record csv files.

Although come to think of it one could modify the above
to do that without too much trouble, at least assuming
valid input - end-of-field followed by linefeed must
be end-of-record, right?
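
One way to sketch that idea, assuming valid input in which double
quotes appear only around or inside quoted fields (as csvline output
would have it): a linefeed seen outside a quoted field ends the
record. The function name split_records is made up for illustration.

```python
def split_records(text):
    """Split text made of csvline()-style lines into one string per record."""
    records = []
    start = 0
    in_quotes = False
    for i, ch in enumerate(text):
        if ch == '"':
            # Each quote toggles the state; a doubled quote toggles twice,
            # so it leaves us inside the quoted field.
            in_quotes = not in_quotes
        elif ch == "\n" and not in_quotes:
            records.append(text[start:i])
            start = i + 1
    if start < len(text):
        records.append(text[start:])  # last record without trailing newline
    return records
```

Each returned string can then be handed to a single-line parser.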

>So as a matter of curiosity, where/
>how are you storing multiple csvline() outputs?

Since you ask: the project is to allow alumni to
store contact information on a web site, and then
let office staff access the information for various
purposes. So each alumnus' data is stored as a
csvline in an anydbm "database" - when someone in
the office requests the information it's dumped
into a csv file, the idea being that the office
staff opens that in Excel or whatever.

(Why not simply provide a suitable interface
to the data instead of just giving them the
csv file? So they can use the data in ways I
haven't anticipated. Why not give them access
to a real database? They know how to use Excel.

I do think I'll provide a few access thingies
in addition to the csv file, for example an
automatic mass mailer...)

So why put csv data into an anydbm thing instead
of using shelve or something? Laughably or not,
the reason is to speed up what seems like the
main bottleneck:

If I use my parsecsvline() that will be very slow.
But that doesn't matter, since that only happens
once or twice a day on one record, when an alumnus
logs in and edits his contact information.

But when the office requests the data we run through
the entire database - if we store the data as csv
then we don't have any conversion to do at that
point, we just write the raw data in the database
to a file. Should be much quicker than converting
something else to csv at that point.

(So why not just store the data in a csv file?
Random access.)
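
A rough sketch of that storage plan, using the dbm module (the
modern successor of anydbm). The function names, the str keys and the
"\n" record terminator are assumptions for illustration; the point is
only that the dump step copies stored bytes without any conversion.

```python
import dbm


def save_record(db, key, line):
    """Store one alumnus' CSV line under `key` (both given as str)."""
    db[key.encode("utf-8")] = line.encode("utf-8")


def dump_csv(db, out):
    """Write every stored line, unchanged, to the binary file `out`."""
    for key in db.keys():
        out.write(db[key] + b"\n")
```

Here db would come from something like dbm.open("alumni", "c") and
out from open("alumni.csv", "wb"). The order in which keys come back
is up to the dbm backend, so the rows in the dump are not sorted.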

Since you asked, if you had any comments on
what's silly about the general plan there by
all means say so.

Hmm. Why not use one of the many Python
web tools out there? 

(i) Doing it myself is more interesting. I'm
not getting paid for this.

(ii) If I do it myself it's going to be easier
for me to be certain I know exactly where user
input is at all times.

The boss wanted me to use php because Python
was going to be too hard for someone else to
read. That's nonsense, of course. Anyway, he
gave me a book on php security. The book
raised a lot of issues that I wouldn't have
thought of, but it also convinced me I 
wouldn't want to use php - all through
the book we're warned that php will do this
or that bad thing if you're not careful.
Don't want to have to learn all the things
you need not to do with whatever tool I
use.

Here, the only write access to the database is
through an Alum object; Alum objects filter their
data on creation, and they're read-only (via
the magic of __setattr__), so a maintainer
would have to _try_ if he wanted to insert unfiltered
data - wouldn't be hard to do, but he can't do it
by accident.
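
A sketch of how such a filter-once, read-only object might look in
modern Python. Only the name Alum and the use of __setattr__ come
from the description above; the field names and the clean() filter
are hypothetical stand-ins.

```python
def clean(value):
    """Hypothetical filter: drop non-printable characters (incl. newlines)."""
    return "".join(ch for ch in str(value) if ch.isprintable()).strip()


class Alum:
    """Read-only record whose fields are filtered once, on creation."""

    FIELDS = ("name", "email")  # hypothetical field names

    def __init__(self, **raw):
        for field in self.FIELDS:
            # Bypass our own __setattr__ for the one legitimate write.
            self.__dict__[field] = clean(raw.get(field, ""))

    def __setattr__(self, name, value):
        raise AttributeError("Alum objects are read-only")
```

Writing to alum.__dict__ directly still works, which matches the
point above: unfiltered data can get in on purpose, not by accident.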

And the only html output is through PostHTML, which
filters everything through cgi.escape(). In particular
print statements raise exceptions (via 
sys.stdout = PrintExploder().) Again, a maintainer
could easily write to sys.__stdout__ to get around
this, but that's not going to happen by accident.
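
Roughly how that output side could be sketched today. cgi.escape()
was removed in Python 3.8, so this uses html.escape() instead. The
name PrintExploder is from the text; the post_html() function, its
signature and the %-template convention are assumptions standing in
for PostHTML.

```python
import html
import sys


class PrintExploder:
    """Stand-in for sys.stdout that refuses all writes."""

    def write(self, text):
        raise RuntimeError("direct print is disabled; use post_html()")

    def flush(self):
        pass


def post_html(template, *values, out=None):
    """Escape every value, fill the trusted template, write the result."""
    if out is None:
        out = sys.__stdout__  # the real stream, untouched by the swap
    safe = tuple(html.escape(str(v), quote=True) for v in values)
    out.write(template % safe)
```

After sys.stdout = PrintExploder(), a stray print raises, while
post_html("<p>%s</p>", user_text) still reaches the browser with the
user text escaped.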

Altogether seems much cleaner than the php stuff
I saw in that book - the way he does things you need
to be careful every time you do something, with
the current setup I only need to be careful twice,
in Alum.__init__ and in PostHTML.

Could be I'm being arrogant putting more trust in
a setup like that instead of some well-known
Python web thingie. But I don't see anyplace
things can leak out, and using someone else's
thing I'd either have to just believe them
or read a lot of code.

That'll teach you to express curiosity
about something I'm doing. Been thinking
about all this for a few weeks, you asked
a question and the fingers started typing.

>>
>> (Um: Believe it or not I'm _still_ using
>> python 1.5.7. So comments about iterators,
>> list comprehensions, string methods, etc
>> are irrelevant. Comments about errors in
>> the algorithm would be great. Thanks.)
>
>1.5.7 ?

Well I _said_ you wouldn't believe it...

>[big snip]
>
>Cheers,
>John


************************

David C. Ullrich


