[Csv] What's our status?

Cliff Wells LogiplexSoftware at earthlink.net
Thu Feb 27 02:10:24 CET 2003


On Wed, 2003-02-26 at 14:10, Cliff Wells wrote:
> On Wed, 2003-02-26 at 09:32, Cliff Wells wrote:
> 
> > I'm working on csvutils.py right now.  The guessDelimiter() function
> > from DSV isn't really the best for our purposes as it expects a fairly
> > fixed number of columns and we're allowing for variable columns per row.
> > Also, allowing spaces around delimiters is going to throw
> > guessQuoteChar().  I've got some ideas for fixing guessQuoteChar() but
> > guessDelimiter is going to need an entirely new approach (which I think
> > I have an idea for =)
> 
> Okay, here's my status:
> 
> 1) I can sniff the quotechar.
> 2) I can sniff the delimiter IF:
>     a) there is a quotechar [determine delimiter based on relation to 
>        quotechar].
>        or
>     b) the data is regular, that is, the number of columns doesn't vary
>        a lot from record to record [based upon number of occurrences of 
>        delimiter in each record, to grossly simplify things].  This is  
>        the method DSV uses.
> 
> However, for the following I am so far unable to come up with a way to
> determine the delimiter:
> 
> all,work,and,no,play,makes,jack,a,dull,boy
> all,work,and,no,play,makes,jack,a,dull
> boy
> all,work,and,no,play,makes,jack,a
> dull,boy
> all,work,and,no,play,makes,jack
> a,dull,boy
> all,work,and,no,play,makes
> jack,a,dull,boy
> all,work,and,no,play
> makes,jack,a,dull,boy
> all,work,and,no
> play,makes,jack,a,dull,boy
> all,work,and
> no,play,makes,jack,a,dull,boy

Okay, banging my head against a wall here.  Consider this "CSV" file:

all
work
and
no
play
makes
jack
a
dull
boy

I don't see why this wouldn't be considered valid CSV, yet there is
clearly no delimiter (assuming there would have been one had each row
contained more than one column).  It seems we could just pass ',' as the
delimiter since it won't be used anyway until we encounter:

redrum
redrum
redrum
re,drum

Where "," is actually part of the data (assume for a moment that \t was
the delimiter).

Further, consider that any of the characters ('r', 'e', 'd', 'u', 'm')
could possibly be considered a delimiter (not likely though, and I'd be
willing to limit possibilities to string.punctuation + string.whitespace
for these situations if I thought it would really help).
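
A minimal sketch of that restriction, to make it concrete.  The helper
name candidate_delimiters() is hypothetical (it is not in DSV or
csvutils.py), and the sample rows are assumed to have had their line
endings stripped already:

```python
import string

# Hypothetical pool of characters we are willing to treat as delimiters.
CANDIDATES = set(string.punctuation + string.whitespace)


def candidate_delimiters(lines):
    """Return every candidate character that occurs somewhere in the sample."""
    found = set()
    for line in lines:
        found.update(ch for ch in line if ch in CANDIDATES)
    return found


sample = ["redrum", "redrum", "redrum", "re,drum"]
print(candidate_delimiters(sample))
```

This rules out 'r', 'e', 'd', 'u' and 'm', but ',' is still the only
candidate left, which is the wrong answer if \t was the real delimiter.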

It's becoming clear to me that without the constraints I mentioned
earlier (a valid quotechar, or a mostly fixed number of columns per
record) there is no good way to sniff the format. This seems unfortunate
because the formats that are unsniffable are the simplest possible cases.
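
To illustrate, here is a grossly simplified sketch of the "count the
delimiter in each record" idea.  It is not DSV's actual code: the
function name guess_delimiter_by_regularity() and the 0.9 threshold are
assumptions made up for this example.

```python
import string
from collections import Counter


def guess_delimiter_by_regularity(lines, threshold=0.9):
    """Pick the candidate whose per-row count is most consistent.

    Returns None when no candidate appears the same nonzero number of
    times in at least `threshold` of the rows.
    """
    if not lines:
        return None
    best, best_score = None, 0.0
    for ch in string.punctuation + string.whitespace:
        # How many rows contain 0, 1, 2, ... occurrences of this character?
        counts = Counter(line.count(ch) for line in lines)
        count, rows = counts.most_common(1)[0]
        if count == 0:
            continue  # the typical row doesn't contain it at all
        score = rows / len(lines)
        if score >= threshold and score > best_score:
            best, best_score = ch, score
    return best


print(guess_delimiter_by_regularity(["a,b,c", "d,e,f", "g,h,i"]))
print(guess_delimiter_by_regularity(["all", "work", "and", "no", "play"]))
print(guess_delimiter_by_regularity(["redrum", "redrum", "redrum", "re,drum"]))
```

The first call finds ',' because every row has exactly two of them.  The
other two return None: a single-column file gives the counter nothing to
work with, and one stray ',' in four rows is not regular enough to
count as evidence.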

Sigh.  Will think about it more but I'm becoming more pessimistic the
longer I look at it.

OTOH, I personally don't have a big problem with the constraints [just a
small one].  The DSV sniffers have been used by a lot of people without
complaint, and they required a fairly fixed number of columns regardless
of whether there was a quotechar or not, so we're actually doing a bit
better than that right now.


-- 
Cliff Wells, Software Engineer
Logiplex Corporation (www.logiplex.net)
(503) 978-6726 x308  (800) 735-0555 x308


