DSVWizard.py

Tue Jan 28 00:22:28 CET 2003

>>>>> "Skip" == Skip Montanaro <skip at pobox.com> writes:

Skip> (Dave, should we continue to use the csv at object-craft address
Skip> for you or your djc email?)

Use the csv at object-craft.com.au address as it will ensure that Andrew
gets messages as well.  Andrew has spent considerable effort making
the CSV module conform to Excel behaviour.

Skip> I think we should aim for Excel2000 compatibility as a bare
Skip> minimum, and at least document any supported extensions and try
Skip> to tie them to specific other applications.  It is indeed
Skip> unfortunate that the CSV file format is only operationally
Skip> defined.

Skip> Wild-ass idea: Maybe the API should include a query function or
Skip> a data attribute which lists (as strings) the variants of CSV
Skip> supported by a module (which should be supported by test cases)?
Skip> The default variant would be listed first, and the constructor
Skip> would take any of the listed variants as an optional argument.
Skip> Something like:

Skip>     variants = csv.get_variants()

Skip>     csvl = csv.parser(variant="lotus123")
Skip>     csve = csv.parser(variant="excel2000")

What I think we should do is implement two layers; a Python layer and
an extension module.  The extension module should contain only the
functions which are necessary to implement a fast parser.

The Python layer would be the registry of variants and would configure
and tweak the parser.  This would allow all tweaking intelligence to
be hidden from the user while keeping implementation details out of
the parser.
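
A minimal sketch of the split, to make the idea concrete.  Every name
here (get_variants, register_variant, LowLevelParser, the setting
names, the "tabbed" variant) is an assumption for illustration only,
and LowLevelParser merely stands in for the extension module:

```python
# Hypothetical Python layer: a registry mapping variant names to the
# low-level settings that the fast parser understands.
_VARIANTS = {
    "excel2000": {"delimiter": ",", "quotechar": '"', "doublequote": True},
    "tabbed": {"delimiter": "\t", "quotechar": '"', "doublequote": True},
}
DEFAULT_VARIANT = "excel2000"


def get_variants():
    """Return the registered variant names, default variant first."""
    others = sorted(name for name in _VARIANTS if name != DEFAULT_VARIANT)
    return [DEFAULT_VARIANT] + others


def register_variant(name, **settings):
    """Add (or replace) a variant in the registry."""
    _VARIANTS[name] = dict(settings)


class LowLevelParser:
    """Stand-in for the extension module: it knows settings, not variants."""

    def __init__(self, delimiter, quotechar, doublequote):
        self.delimiter = delimiter
        self.quotechar = quotechar
        self.doublequote = doublequote


def parser(variant=DEFAULT_VARIANT, **overrides):
    """Look up a variant, apply per-call tweaks, configure the parser."""
    settings = dict(_VARIANTS[variant])
    settings.update(overrides)
    return LowLevelParser(**settings)
```

The overrides argument is one way to handle applications that let the
user change part of a style, such as the delimiter.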

Skip> We could create an informal "registry" of valid variant names.
Skip> If support for an existing variant is added, you use that name.
Skip> If support for an unknown variant is added, you register a
Skip> string.

I suppose a torture test is the first step in defining the variants.
Instead of trying to formally specify the variants up front we could
define them by the way they process the torture test.
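
As a rough sketch of what "defined by the torture test" could mean:
two parsers belong to the same variant exactly when they produce the
same output for every torture line.  The torture lines and the two toy
parsers below are made up for illustration and do not describe any
real application:

```python
# Illustrative torture input; a real test would be far nastier.
TORTURE_LINES = [
    "a,b,c",
    "a, b ,c",
    "a,,c",
]


def naive_split(line):
    """Toy behaviour 1: split on commas, keep whitespace."""
    return line.split(",")


def stripping_split(line):
    """Toy behaviour 2: split on commas, strip each field."""
    return [field.strip() for field in line.split(",")]


def signature(parse, lines=TORTURE_LINES):
    """The parser's observable behaviour over the whole torture test."""
    return tuple(tuple(parse(line)) for line in lines)


def same_variant(parse_a, parse_b):
    """Two parsers are the same variant if their signatures match."""
    return signature(parse_a) == signature(parse_b)
```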

Skip> That's true.  Perhaps selecting by variant name would do nothing
Skip> more than set those specific values behind the scenes, much the
Skip> same way that when you choose a particular C coding style in
Skip> Emacs a number of low-level variable values are set.

My thoughts exactly.

Cliff> Another problem with specifying styles by application name is
Cliff> that many apps allow the user to specify portions of the style
Cliff> (usually the delimiter), so that's not set in stone either.

In the first instance we have to assume that people are going to
choose styles which are not ambiguous.  This is a big assumption - I
have seen applications (database bulkcopy tools) which happily allow
you to export data which cannot be unambiguously parsed back into the
original fields/columns.
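
A small illustration of the problem, using a made-up exporter that
joins fields without any quoting (the function name and the data are
hypothetical):

```python
def export_unquoted(fields, delimiter=","):
    """Join fields with the delimiter and no quoting at all."""
    return delimiter.join(fields)


original = ["Smith, John", "42"]
line = export_unquoted(original)

# Splitting the exported line cannot tell the embedded comma from a
# field separator, so the original two fields are not recovered.
recovered = line.split(",")
```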

Cliff> I think what I'm leaning towards at this time, if everyone is
Cliff> in agreement, is for Dave or myself to reimplement Dave's code
Cliff> (and API) in Python so that there is a pure Python
Cliff> implementation, and then provide Dave's C module as a faster
Cliff> alternative (much like Pickle and cPickle).  The heuristics of
Cliff> DSV would be an optional feature, along with the GUI.

Shouldn't we first come up with a project plan?  If the eventual goal
is to get this into Python, we are going to have to write a PEP.

Rather than trying to do everything ourselves we should try to think
of a method whereby we will get people to run a torture test against
the applications they need to interact with.

The steps would include (not sure about the order):

* Develop CSV torture test.

* Develop a format in which people can submit torture test results,
  so that we can eventually regression test the parser against those
  results.

* Define Python API for CSV parser.

* Define extension module API.

* Write PEP.

* Develop CSV module.
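
One possible shape for the submission format and the regression check,
purely as a sketch.  The JSON layout, the field names and the
"ExampleApp" entry are all assumptions; nothing here has been agreed:

```python
import json


def check_submission(parse, submission_text):
    """Run parse over a submission; return (input, expected, got) failures."""
    submission = json.loads(submission_text)
    failures = []
    for case in submission["results"]:
        got = parse(case["input"])
        if got != case["fields"]:
            failures.append((case["input"], case["fields"], got))
    return failures


# Placeholder submission: one torture line per entry, together with the
# fields the application under test is said to have produced.
EXAMPLE_SUBMISSION = json.dumps({
    "application": "ExampleApp 1.0",  # hypothetical application name
    "results": [
        {"input": "a,b,c", "fields": ["a", "b", "c"]},
        {"input": "a,,c", "fields": ["a", "", "c"]},
    ],
})


def naive_parse(line):
    """Toy parser used only to exercise the check."""
    return line.split(",")
```

Calling check_submission(naive_parse, EXAMPLE_SUBMISSION) returns an
empty list when the parser agrees with every submitted result.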

Skip> This sounds like a reasonable idea.  I also agree the GUI stuff
Skip> will probably not make it into the core.

I agree.

Cliff> As far as DSV's current API, I'm not too attached to it, and I
Cliff> think that it could be mimicked sufficiently by adding a
Cliff> parser.parseall() method to Dave's API so the programmer would
Cliff> have the option of getting the entire file as a list without
Cliff> having to write a loop.
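
For reference, a parseall() along those lines could be very small.
This is only a sketch: LineParser is a toy stand-in, not the existing
module, and its parse() method is assumed to take one line and return
a list of fields:

```python
class LineParser:
    """Toy line-at-a-time parser used to illustrate parseall()."""

    def parse(self, line):
        """Split one line into fields (no quoting support here)."""
        return line.rstrip("\r\n").split(",")

    def parseall(self, lines):
        """Return every record as a list, so the caller needs no loop."""
        return [self.parse(line) for line in lines]
```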

I think that we should be prepared to go back to the drawing board on
the API if necessary.  Once we have enough variants registered we will
be in a better position to come up with the "right" API.

- Dave

-- 
http://www.object-craft.com.au