[Python-Dev] CSV, bytes and encodings

skip at pobox.com
Wed Apr 1 12:37:38 CEST 2009


    >> Having read through the ticket, it seems that a CSV file must be (and
    >> 2.6 was) treated as a binary file, and part of the CSV module's job
    >> is to convert that binary data to and from strings.

    Antoine> IMO this interpretation is flawed.  In 2.6 there is no tangible
    Antoine> difference between "binary" and "text" files, except for
    Antoine> newline handling. Also, as a matter of fact, if you want the
    Antoine> 2.x CSV module to read a file with Windows line endings, you
    Antoine> have to open the file in "rU" mode (that is, the closest we
    Antoine> have to a moral equivalent of the 3.x text files).

The problem is that fields in CSV files, at least those produced by Excel,
can contain embedded newlines.  You are welcome to decide that *all* CRLF
pairs should be translated to LF, but that is not the decision the original
authors (mostly Andrew McNamara) made.  The contents of the fields were
deemed to be separate from the newline convention, so the csv module needed
to do its own newline processing, and thus required files to be opened in
binary mode.
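
To make the embedded-newline case concrete, here is a minimal sketch.  It
assumes a current Python 3 csv module, where passing newline="" to the
stream plays the role that binary mode played in 2.x; the sample row is
made up for illustration.

```python
import csv
import io

# Hypothetical sample row; the second field contains an embedded CRLF.
row = ["1", "line one\r\nline two"]

# newline="" turns off newline translation in both directions, so the
# csv module is the only code deciding what a line ending means.
buf = io.StringIO(newline="")
csv.writer(buf).writerow(row)
written = buf.getvalue()  # the field is quoted, the row ends in CRLF

# Read the same text back through csv.reader.
buf.seek(0)
rows = list(csv.reader(buf))  # the embedded CRLF is still in the field
```

The CRLF inside the quotes is field content and the CRLF after the closing
quote is the row terminator; only the csv module can tell the two apart.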

This case arises rarely, but it does turn up every now and again.  If you
are comfortable with translating all CRLF pairs into LF, whether they are
true end-of-line markers or embedded content, that's fine.  (It certainly
simplifies the implementation.)  However, I would a) run it past the folks
on csv at python.org first, and b) put a big fat note in the module docs
about the transformation.

    Antoine> Therefore, I don't think 2.x is of any guidance to us for what
    Antoine> 3.x should do.

I suspect we will disagree on this.  I believe the behavior of the 2.x
version of the module is easily defensible and should be a useful guide to
how the 3.x version of the module behaves.

    >> The documentation says "If csvfile is a file object, it must be
    >> opened with the 'b' flag on platforms where that makes a difference."

    Antoine> The documentation is, IMO, wrong even in 2.x. Just yesterday I
    Antoine> had to open a CSV file in 'rU' mode because it had Windows line
    Antoine> endings and I'm under Linux....

See above.  Almost certainly, either your file had no fields containing
CRLF pairs or you didn't care that your data values were silently altered
while the file was being read.
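
A small sketch of that silent alteration, again assuming a current
Python 3 and made-up sample data.  The read_rows helper is hypothetical;
newline=None stands in for universal-newline ("rU"-style) reading.

```python
import csv
import io

# Hypothetical file contents: CRLF row terminators plus one quoted field
# that itself contains a CRLF pair.
raw = b'id,note\r\n1,"line one\r\nline two"\r\n'

def read_rows(newline):
    """Parse raw as CSV through a text layer with the given newline mode."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=newline)
    return list(csv.reader(text))

# newline=None enables universal newlines: every CRLF becomes LF before
# the csv module ever sees it, including the one inside the quoted field.
translated = read_rows(None)

# newline="" passes line endings through untouched.
untouched = read_rows("")
```

Both calls succeed without any warning; the only difference is that the
field value comes back with LF in the first case and CRLF in the second.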

Skip

