[Csv] Re: [PEP305] Python 2.3: a small change request in CSV module

Cliff Wells LogiplexSoftware at earthlink.net
Fri May 16 22:13:12 CEST 2003


On Thu, 2003-05-15 at 13:15, Skip Montanaro wrote:

>     Bernard> The CSV module only allows a single character as delimiter,
>     Bernard> which does not (easily) allow one to write generic code that
>     Bernard> would not be at the mercy of whatever the current locale is of
>     Bernard> the user who sends you a csv file. Fortunately the Sniffer
>     Bernard> class is provided for guessing the most likely delimiter, and
>     Bernard> seems to work fine, from my limited tests.

As Skip mentioned, the sniffer isn't guaranteed to determine the
dialect.  Given reasonably sane CSV files, my confidence is good that it
will do the right thing.  Feed it something bizarre and you might get
bit.  There are even a couple of reasonable cases that might toss it. 
Feed "01/01/2003?10:10:56?10:15:02?hello, dolly" to it and see what you
get <wink>.  As you can see, it isn't certain what the delimiter might
be, even though the data is well-formed.
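
To make the ambiguity concrete, here is a rough sketch (written for a
current Python 3, not the 2.3 module under discussion) that simply counts
a few candidate delimiters in that line.  The shortlist of candidates and
the name candidate_counts are assumptions made for illustration; this is
not how the Sniffer itself works:

```python
from collections import Counter

line = "01/01/2003?10:10:56?10:15:02?hello, dolly"
candidates = "/?:, "  # hypothetical shortlist, chosen for this example


def candidate_counts(text, chars=candidates):
    """Count how often each candidate delimiter occurs in the text."""
    counts = Counter(text)
    return {c: counts[c] for c in chars if counts[c]}


print(candidate_counts(line))
```

Every one of those characters splits the line into a plausible set of
fields, so a single line offers little basis for preferring one of them.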

That bit of doubt, no matter how small, is enough to warrant human
intervention/confirmation prior to parsing and importing a couple of MB
of garbage into your SQL server.  You might feel confident in *your*
data, but we don't want to encourage other people to blindly trust the
sniffer.

Come to think of it, perhaps the sniffer should be raising an exception
rather than returning None when it fails...
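
For comparison, a minimal sketch assuming a current Python 3, where
sniff() raises csv.Error when it cannot settle on a delimiter (not the
2.3 behaviour described above).  sniff_strict is a made-up helper name;
the optional delimiters argument restricts which characters are
considered:

```python
import csv


def sniff_strict(sample, delimiters=None):
    """Return a sniffed dialect, or raise ValueError with some context."""
    try:
        return csv.Sniffer().sniff(sample, delimiters)
    except csv.Error as exc:
        # Fail loudly instead of handing the caller a None to forget about.
        raise ValueError("cannot determine CSV dialect: %s" % exc) from exc
```

A caller then has to deal with the failure explicitly rather than pass a
missing dialect on to the reader by accident.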


>     Bernard> I feel though, that this unfortunately forces one to write more
>     Bernard> code than is really needed, typically in the following form:
> 
>     Bernard>     sample = file( 'data.csv' ).read( 8192 )
>     Bernard>     dialect = csv.Sniffer().sniff( sample )
>     Bernard>     infile = file( 'data.csv' )
>     Bernard>     for fields in csv.reader( infile, dialect ):
>     Bernard>         # do something with fields
> 
>     Bernard> That's a tad ugly, having to open the same file twice in
>     Bernard> particular.
> 
> I recognize the issue you raise.  As originally written, the Sniffer class
> also took a file-like object; however, it relied on being able to rewind the
> stream.  This would, for example, prevent you from feeding sys.stdin to the
> sniffer.  I also felt the decision of rewinding the stream belonged with the
> caller.  I decided to change it to accepting a small data sample instead.
> You can avoid multiple opens by rewinding the stream yourself (in the common
> case where the stream can be rewound):
> 
>     infile = file('data.csv')
>     sample = infile.read(8192)
>     infile.seek(0)
>     dialect = csv.Sniffer().sniff( sample )
>     for fields in csv.reader( infile, dialect ):
>         # do something with fields


Or even:

    infile = file('data.csv')
    dialect = csv.Sniffer().sniff( infile.read(8192) )
    if dialect:
        infile.seek(0)
        for fields in csv.reader( infile, dialect ):
            # do something with fields

Doesn't seem too bad.  There really doesn't seem to be a universal
solution to this.  If you use the sniffer you're forced to rewind.
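
The same single-open pattern, sketched for a current Python 3 under the
assumption that the stream is seekable.  read_rows and its sample_size
parameter are hypothetical names, and io.StringIO stands in for a real
file here:

```python
import csv
import io


def read_rows(stream, sample_size=8192):
    """Sniff a seekable text stream, rewind it, and return all rows."""
    sample = stream.read(sample_size)
    dialect = csv.Sniffer().sniff(sample)  # raises csv.Error on failure
    stream.seek(0)  # the caller rewinds; the sniffer only sees the sample
    return list(csv.reader(stream, dialect))


rows = read_rows(io.StringIO("a;b;c\n1;2;3\n4;5;6\n"))
print(rows)
```

For a file on disk you would pass something like open(path, newline="")
instead.  A stream that can't be rewound (sys.stdin, say) still needs the
sample kept around and replayed by the caller.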

>     Bernard> (2)
>     Bernard>     for fields in csv.reader( infile, dialect='sniff' ):
>     Bernard>         # do something with fields
> 
> Do you mean to imply that the csv.reader object should call the sniffer
> implicitly and use the values it returns?  That's an interesting idea but
> the sniffer isn't guaranteed to always guess right.

Yes.  It looks elegant but it's far too dangerous.  Especially just to
save a couple of lines of code.

You might also take a look at http://python-dsv.sf.net.  The code from
the sniffer was derived to a great extent from that code.  I'm planning
(some dreamy day) to rewrite DSV to take advantage of the Python CSV
module.  The point is that this is the sort of thing the sniffer was
meant to help with:  giving users a preview of the data that they can
*confirm* is correct before the actual import (and destruction of their
existing data) begins <wink>.
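
A minimal sketch of that preview-then-confirm flow, assuming a current
Python 3.  The names preview and import_with_confirmation, and the
confirm and load callbacks, are hypothetical and not part of the csv
module or of DSV:

```python
import csv
import io


def preview(text, limit=5):
    """Sniff the text and return the dialect plus the first few rows."""
    dialect = csv.Sniffer().sniff(text)  # raises csv.Error on failure
    rows = list(csv.reader(io.StringIO(text), dialect))[:limit]
    return dialect, rows


def import_with_confirmation(text, confirm, load):
    """Call load(fields) per row, but only if confirm() accepts the preview."""
    dialect, rows = preview(text)
    if not confirm(dialect.delimiter, rows):
        return 0  # nothing imported; existing data is left alone
    count = 0
    for fields in csv.reader(io.StringIO(text), dialect):
        load(fields)
        count += 1
    return count
```

In a real tool confirm would show the rows to a person and wait for an
answer, and load would do the actual insert.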

Regards,

-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308


