[Csv] csv.utils.Sniffer notes

Thu Apr 24 23:13:29 CEST 2003

Sorry for the late notice on this.  The 2.3b1 release snuck up on me.

I sent this back on the 12th.  It's in my outgoing mail archive, but I
didn't see it in the mailing list archives and never received any
responses.  Maybe my mailman installation is broken.  The last message
archived appears on the 11th.

Note also that I just checked in a change recommended by the PythonLabs
folks - it's once again a csv module (no longer a package).  Cliff's sniffer
class is now csv.Sniffer.  2.3b1 is scheduled to be frozen tomorrow at noon.
After that, the API can't change.  If I don't hear from anyone about this
real soon I'll go ahead and implement the change.

Skip

----------------------------------------------------------------------
I guess this is mostly for Cliff, but everyone should feel free to chime in.
I went to write a subsection describing the Sniffer class and began to
wonder about a few things.

  * It's not clear to me that passing a file object to Sniffer.sniff() is
    the correct way to give it data to operate on.  First, because you can
    perform multiple operations (sniff, hasHeaders), it requires the file
    object to be rewindable.  Second, it doesn't seem to me that setting
    self.fileobj in sniff() is the right thing.  What if all the user is
    interested in is whether the CSV file has headers?  I think it makes
    more sense to simply pass in a chunk of data to the constructor to use
    as the sample.  The caller can then worry about rewindability in his own
    code.

  * The mixture of camelCase and underscore separators in the method names.
    I believe it's more usual (especially in the Python core) to use an
    underscore to separate words in attribute names.

  * The use of eval().  I think the only things we can reasonably have in
    CSV files are strings, ints and floats, so code to determine types can
    look like:

        try:
            thisType = type(int(row[col]))
        except ValueError:
            try:
                thisType = type(float(row[col]))
            except ValueError:
                thisType = str

    OverflowError doesn't need to be considered in 2.3 because int()
    silently coerces to longs:

        >>> int(6e23)
        600000000000000016777216L

    2.2 and earlier probably still require the OverflowError check.

  * I don't think the sniffer needs to offer a register_dialect() method.
    The sniff() method returns a dialect.  The programmer can then call the
    normal dialect registration function if need be.
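
To make the eval() point concrete, here is a minimal sketch of the
int/float/str fallback wrapped up as a function.  The name guess_type is
hypothetical (it is not in the attached diff), and the sketch is written for
current Python, where int() never raises OverflowError on string input:

```python
def guess_type(value):
    """Guess the type of one CSV field without eval().

    Hypothetical helper for illustration: tries int, then float,
    and falls back to str when neither conversion succeeds.
    """
    try:
        int(value)
        return int
    except ValueError:
        pass
    try:
        float(value)
        return float
    except ValueError:
        return str


# One guessed type per field of a made-up row.
row = ["42", "3.14", "6e23", "spam"]
types = [guess_type(field) for field in row]
```

Unlike eval(), this never executes field contents as code, so a hostile or
merely odd-looking CSV file can at worst be classified as strings.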

Attached is a context diff against the current CVS version of Lib/csv.py and
Lib/test/test_csv.py which implements the various changes except for the
eval() stuff and adds a couple of simple sniffer tests.  The logic for the
eval() stuff was complex enough that I didn't want to risk screwing it up at
this point.
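
For anyone who wants to try the sample-based approach, a rough sketch of such
a check follows.  It assumes the interface as the released csv module spells
it, where the sample text is passed directly to sniff() and has_header()
rather than to the constructor as suggested above.  The sample data and the
dialect name "sniffed_example" are made up for illustration:

```python
import csv

# A chunk of text read by the caller; no file object is handed to the sniffer.
sample = "name,age,city\nAlice,30,Paris\nBob,25,Berlin\nCarol,41,Madrid\n"

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)          # returns a Dialect subclass
has_header = sniffer.has_header(sample)  # independent of the sniff() call

# Registration is left to the caller, using the normal module-level function.
csv.register_dialect("sniffed_example", dialect)
rows = list(csv.reader(sample.splitlines(), "sniffed_example"))
```

Because both methods take the sample as an argument, the caller can ask only
whether there is a header row without sniffing the dialect first, and
rewinding the underlying file stays the caller's business.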

Skip