[Csv] Re: PEP 305 - Comments (really long post)

Skip Montanaro skip at pobox.com
Thu Feb 6 08:06:02 CET 2003


    Carlos> I was discussing the CSV implementation with Skip 'et alii' over
    Carlos> the Python list, but i decided to wait a little bit to put my
    Carlos> ideas in a better form and then contribute to the PEP. Please
    Carlos> bear with me as this message is rather long and may be confusing
    Carlos> at times, but I sincerely hope it helps.

Don't worry.  Most of us read the list, and I've been forwarding all
messages which were sent to c.l.py but not cc'd to the csv list so we at
least have them archived.  Accordingly, we're already familiar with your
plight. ;-)

    Carlos> For example, look at these lines:

    Carlos> "row 1";10
    Carlos> "row 2";3,1416

    Carlos> I assume that the csv library will parse the first line as ("row
    Carlos> 1",10); the number 10 will be probably returned as a integer
    Carlos> (which is not the correct interpretation for this particular
    Carlos> file - more on this item [2]).

You'd be wrong to assume that.  The csv reader will return a list of two
strings, "row 1" and "10".  How to interpret the contents of the strings is
completely up to you.

    Carlos> The second line will probably be parsed as ("row 2","3,1416");
    Carlos> it may even raise an exception, depending on the implementation
    Carlos> details! What do you intend to do in this case?

No exception will be raised.  Assuming you have the quotechar set to '"' and
the delimiter set to ';', you will, as you surmised, get the pair of strings
you indicated.  You are completely free to convert "3,1416" to a float
yourself, for example with locale.atof().  As long as your locale is set
correctly, that will work.
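For instance, an application-level helper along these lines handles the
decimal comma; the pt_BR locale name is an assumption, and the fallback
covers platforms where that locale is not installed:

```python
import locale

def parse_decimal(s):
    """Convert a decimal-comma string such as "3,1416" to a float."""
    try:
        # Hypothetical locale name; availability varies by platform.
        locale.setlocale(locale.LC_NUMERIC, "pt_BR.UTF-8")
        return locale.atof(s)
    except (locale.Error, ValueError):
        # Fall back to a plain substitution when the locale is missing.
        return float(s.replace(",", "."))

print(parse_decimal("3,1416"))  # 3.1416
```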

    Carlos> Another point that you should bear in mind: even here in Brazil,
    Carlos> some programs will use the standard (US) delimiters and number
    Carlos> formats, while others will use the localized ones. So we end up
    Carlos> needing to read/write both formats - for example, when reading
    Carlos> data from Excel, and then exporting the same data to some
    Carlos> scientific package that is not locale-aware. So any
    Carlos> localization-related parameters have to be flexible and easily
    Carlos> customizable.

I understand this is going to be a problem; however, I have no way of
solving it for you in a way that will make everybody happy, so I'm not going
to even try.  The csv module is about abstracting away all the little
weirdnesses which crop up in different dialects of delimited files.

You, as the application programmer, have to be sensitive to the locales in
which your data will be interpreted.  If you expect to dump an Excel
spreadsheet to a CSV file for analysis by a colleague in the US, everyone's
going to be a lot happier if you send the data encoded for either the en_US
or C locales.  If that's not possible, you need to transmit locale
information along with the data.

When writing numeric data, the csv module simply calls str() on it.  Note
that str() is not locale-sensitive; floats are always rendered with a
decimal point.  If you need localized output, format the numbers yourself
(for example with locale.str()) before handing them to the writer.
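For example, an application can localize the numbers itself before they
reach the writer.  A sketch using the csv module's current API; pt_BR is
again just an assumed locale name, with a crude fallback:

```python
import csv
import io
import locale

def localize(x):
    """Render a float with the locale's decimal separator."""
    try:
        # Hypothetical locale name; availability varies by platform.
        locale.setlocale(locale.LC_NUMERIC, "pt_BR.UTF-8")
        return locale.str(x)                 # e.g. "3,1416"
    except locale.Error:
        return str(x).replace(".", ",")      # crude fallback

buf = io.StringIO()
writer = csv.writer(buf, delimiter=";")
writer.writerow(["row 2", localize(3.1416)])
print(buf.getvalue())  # row 2;3,1416
```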

    Carlos> [2] I assume that the csv library will convert any numbers read
    Carlos> from the csv file to some of numeric types available in Python
    Carlos> upon reading. There are some issues here.

Nope.  You get strings.

    Carlos> "row 1";10;1
    Carlos> "row 2";3,1416;2
    Carlos> "row 3";-1;3

    Carlos> The obvious choice is to parse the column 1 as strings; column 2
    Carlos> as floats; and column 3 as integers. But the problem is, how is
    Carlos> the csv library supposed to know that the second column hold
    Carlos> float values, and not integers? Look ahead is out of question -
    Carlos> after all, the only line containing a decimal point may be the
    Carlos> last on a 10 GB file.

    Carlos> For this problem, I propose the following semantics:

    ...

Just apply the necessary semantics yourself.  Here's a suggestion.  Suppose
you know you want the first column to be strings, the second floats and the
third ints.  Code your read loop something like so:

    # One converter per column: str, float, int.
    types = (str, float, int)
    reader = csv.reader(myfile)
    for row in reader:
        # Apply each column's converter to the corresponding string field.
        row = [t(v) for (t, v) in zip(types, row)]
        process(row)

That way you have complete control over the interpretation of the data.
Nobody guesses.  No decisions have to be made at the csv level when a piece
of data doesn't fit the mold.

    Carlos> b) assuming that the default column types were not supplied, the
    Carlos>    csv library will try to detect the correct values from the
    Carlos>    ones read from the first line of the file, but respecting the
    Carlos>    parameters mentioned above. If the first line contains column
    Carlos>    headers, then it will use the second line for this purpose.

This is bound to fail.  You showed an example where floats and ints were
mixed up.  What if I had a column containing hex digits?  Heck, make it more
likely to guess wrong and make them base 9 or base 11 digits.  Most of the
time base 11 numbers will consist only of the digits 0 through 9; no 'a'
will appear.  With base 9 numbers it's even worse.  They can
always be interpreted as decimal numbers, but that interpretation will
always be incorrect.
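The ambiguity is easy to see in the interpreter; the same digit string is a
different number in each base:

```python
# "123" is a valid numeral in all three bases, with three different values.
print(int("123"))       # 123  (decimal)
print(int("123", 9))    # 102  (1*81 + 2*9 + 3)
print(int("123", 11))   # 146  (1*121 + 2*11 + 3)
```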

    Carlos> c) from the second line onwards, the csv library will keep the
    Carlos>    same conversion used for the first line. In case of error
    Carlos>    (for example, a float is found in a integer-only column), the
    Carlos>    library may take one of these actions:

    Carlos> - raise an exception
    Carlos> - coerce the value to the standard type for that particular column
    Carlos> - return the value as read, even if using a different type

You're asking us to do way too much.  It is just not going to work in the
general case, and you can do a much better job much more simply at the
application level, because you know the properties of your data.  If we
attempted to do something very elaborate, we'd probably get it wrong.  Even
if we managed to get it right, it would probably be slow.

    Carlos> [4] That said, I have one concern: setting the line terminator
    Carlos> in the CSV library (using the dialect class) does not seem
    Carlos> right.

One thing (among many) that's still missing from the PEP is the admonition
that you have to pass in files opened in binary mode.  That lets the csv
module have complete control over line endings using the lineterminator
attribute. 
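A sketch of what that buys you, using the csv module as it ships in current
Python, where opening the file with newline="" plays the role that binary
mode plays in the 2003 API (io.StringIO stands in for the file here):

```python
import csv
import io

# newline="" hands line-ending control to the writer, so the
# lineterminator setting is honored exactly as written.
buf = io.StringIO(newline="")   # stand-in for open(path, "w", newline="")
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["a", "b"])
print(repr(buf.getvalue()))  # 'a,b\n'
```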

    Carlos> My point here is that the line terminator in the CSV library
    Carlos> will end up being useless, as it depends ultimately on the
    Carlos> ability of the csvwriter.write() method to convince the file
    Carlos> object to use the 'correct' line terminator. 

That's why we expect you to open files in binary mode.  I plan to make
another pass through the PEP tomorrow.  I will make sure I add this.

    Carlos> ------------
    Carlos> [5] It is not clear to me what is returned as a row in the
    Carlos> example given: 

    Carlos> csvreader = csv.reader(file("some.csv"))
    Carlos> for row in csvreader:
    Carlos>     process(row)

    Carlos> It is obvious to assume that 'row' is a sequence, probably a
    Carlos> tuple. Anyway, it should be clearly stated in the PEP.

Thanks, will do.  I'm trying to twist my colleagues' arms into letting the
reader return dicts and the writer accept dicts under the proper
circumstances, but the default case is that the reader will return lists and
the writer will accept sequences (lists, tuples, strings, unicode objects
and arrays from the standard library, though any other sequence should do as
well).
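In round-trip terms (sketched with the csv module's current API):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(("a", "b", "c"))   # a tuple works
writer.writerow(["1", "2", "3"])   # so does a list

buf.seek(0)
rows = [row for row in csv.reader(buf)]   # each row comes back as a list
print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```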


    Carlos> [6] Empty strings can be mistaken for NULL fields (or None
    Carlos> values in Python).  How do you think to manage this case, both
    Carlos> when reading and writing? Please note that, depending on the
    Carlos> selection of the quote behavior and due to some side effects, it
    Carlos> may be impossible for the reader to discern the two cases; so
    Carlos> the library will need to be informed about the default choice.

I don't like writing None out at all, but my colleagues assure me the SQL
people want SQL's NULL to map to None and that the most reasonable text
representation of None is the empty string.  Quoting doesn't count.  We have
no intention to imply semantics using quotes.  I believe we still have some
thinking to do about whether to allow the user to specify the actual string
representation of None.
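With the module as it stands, None is written as the empty string, so the
round trip is lossy:

```python
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow(["x", None, 0])   # None becomes an empty field

buf.seek(0)
row = next(csv.reader(buf))
print(row)  # ['x', '', '0']: None and "" are indistinguishable on read
```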

    ...

    Carlos> BTW, the same reasoning may be applied to the decision between
    Carlos> returning 'None' or 'zero' when reading an empty numeric field.

Again, don't forget that the csv module does not infer types.  When you read
a row you get a list of strings.  It's up to the application to decide how
to interpret it.

    Carlos> [7] A minor suggestion, why not number the items in the "Issues"
    Carlos> section? It would make easier to reference comments... For
    Carlos> example, 'issue #1', etc...

Thanks, that's a good idea.

    Carlos> [8] My comments on the last issue of the list - rows of
    Carlos> different lengths:

Rows can be returned of different lengths.  How to deal with short or long
rows is the job of the application.
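If an application needs uniform widths, padding or truncating is a
one-liner; normalize() and the width of 3 are illustrative only:

```python
def normalize(row, width, fill=""):
    """Pad short rows with fill and truncate long ones to width fields."""
    return (row + [fill] * width)[:width]

print(normalize(["a"], 3))                 # ['a', '', '']
print(normalize(["a", "b", "c", "d"], 3))  # ['a', 'b', 'c']
```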

    Carlos> [9] A very similar architecture can be used to handle
    Carlos> fixed-width text files.  It can be done in a separate library,
    Carlos> but using a similar interface; or it could be part of the csv
    Carlos> library, either as another class, or by means of a proper
    Carlos> selection of the parameters passed to the constructor. It would
    Carlos> be useful as some applications may like best the fixed-width
    Carlos> files instead of the delimited ones (old COBOL programs are
    Carlos> likely to behave this way; this format is still common when
    Carlos> passing data to/from mainframes).

We thought about this briefly, but fixed-width data is not what CSV files
are all about.  The csv module is about parsing tabular data which uses
various delimiters, quoting and escaping techniques.  In addition,
fixed-width data is pretty trivial to read anyway, and probably doesn't
deserve a module of its own.  There are no issues of quoting or delimiters.
You just need to read the file in chunks of the row size and split each row
along chunks of the element size.
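A sketch of that approach; the field widths and sample record are made up:

```python
def split_fixed(line, widths):
    """Slice one fixed-width record into stripped fields."""
    fields, pos = [], 0
    for w in widths:
        fields.append(line[pos:pos + w].strip())
        pos += w
    return fields

# Hypothetical layout: 10-char label, 6-char number, 4-char id.
print(split_fixed("row 1     3.1416  42", (10, 6, 4)))
# ['row 1', '3.1416', '42']
```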

Thanks for your comments,

Skip

