[Python-checkins] python/nondist/peps pep-0305.txt,1.13,1.14

Sun, 09 Feb 2003 20:03:51 -0800

Update of /cvsroot/python/python/nondist/peps
In directory sc8-pr-cvs1:/tmp/cvs-serv15762

Modified Files:
	pep-0305.txt 
Log Message:
a number of changes to keep more-or-less up-to-date.

Index: pep-0305.txt
===================================================================
RCS file: /cvsroot/python/python/nondist/peps/pep-0305.txt,v
retrieving revision 1.13
retrieving revision 1.14
diff -C2 -d -r1.13 -r1.14
*** pep-0305.txt	3 Feb 2003 03:01:48 -0000	1.13
--- pep-0305.txt	10 Feb 2003 04:03:49 -0000	1.14
***************
*** 19,27 ****
  files are simple to parse, the format is not formally defined by a
  stable specification and is subtle enough that parsing lines of a CSV
! file with something like ``line.split(",")`` is bound to fail.  This
! PEP defines an API for reading and writing CSV files which should make
! it possible for programmers to select a CSV module which meets their
! requirements.  It is accompanied by a corresponding module which
! implements the API.

--- 19,25 ----
  files are simple to parse, the format is not formally defined by a
  stable specification and is subtle enough that parsing lines of a CSV
! file with something like ``line.split(",")`` is eventually bound to
! fail.  This PEP defines an API for reading and writing CSV files.  It
! is accompanied by a corresponding module which implements the API.

***************
*** 29,59 ****
  ==============================================

- - Need to better explain the advantages of a purpose-built csv module
-   over the simple ",".join() and [].split() approach.
- 
- - Need to complete initial list of formatting parameters and settle on
-   names.
- 
  - Better motivation for the choice of passing a file object to the
    constructors.  See http://manatee.mojam.com/pipermail/csv/2003-January/000179.html

- Existing Modules
- ================

! Three widely available modules enable programmers to read and write
! CSV files:

! - Object Craft's CSV module [2]_

! - Cliff Wells' Python-DSV module [3]_

! - Laurence Tratt's ASV module [4]_

! Each has a different API, making it somewhat difficult for programmers
! to switch between them.  More of a problem may be that they interpret
! some of the CSV corner cases differently, so even after surmounting
! the differences in the module APIs, the programmer has to also deal
! with semantic differences between the packages.

--- 27,55 ----
  ==============================================

  - Better motivation for the choice of passing a file object to the
    constructors.  See http://manatee.mojam.com/pipermail/csv/2003-January/000179.html

+ - Unicode.  ugh.

! Application Domain
! ==================

! This PEP is about doing one thing well: parsing tabular data which may
! use a variety of field separators, quoting characters, quote escape
! mechanisms and line endings.  The authors intend the proposed module
! to solve this one parsing problem efficiently.  The authors do not
! intend to address any of these related topics::

!     - data interpretation (is a field containing the string "10"
!       supposed to be a string, a float or an int? is it a number in
!       base 10, base 16 or base 2? is a number in quotes a number or a
!       string?)

!     - locale-specific data representation (should the number 1.23 be
!       written as "1.23" or "1,23" or "1 23"?) -- this may eventually
!       be addressed.

!     - fixed width tabular data - can already be parsed reliably.

***************
*** 61,70 ****
  =========

! By defining common APIs for reading and writing CSV files, we make it
! easier for programmers to choose an appropriate module to suit their
! needs, and make it easier to switch between modules if their needs
! change.  This PEP also forms a set of requirements for creation of a
! module which will hopefully be incorporated into the Python
! distribution.

  CSV formats are not well-defined and different implementations have a
--- 57,70 ----
  =========

! Often, CSV files are formatted simply enough that you can get by
! reading them line-by-line and splitting on the commas which delimit
! the fields.  This is especially true if all the data being read is
! numeric.  This approach may work for awhile, then come back to bite
! you in the butt when somebody puts something unexpected in the data
! like a comma.  As you dig into the problem you may eventually come to
! the conclusion that you can solve the problem using regular
! expressions.  This will work for awhile, then break mysteriously one
! day.  The problem grows, so you dig deeper and eventually realize that
! you need a purpose-built parser for the format.

  CSV formats are not well-defined and different implementations have a
***************
*** 72,85 ****
  the acronym stands for "Vague" instead of "Values".  Different
  delimiters and quoting characters are just the start.  Some programs
! generate whitespace after the delimiter.  Others quote embedded
! quoting characters by doubling them or prefixing them with an escape
! character.  The list of weird ways to do things seems nearly endless.

! Unfortunately, all this variability and subtlety means it is difficult
! for programmers to reliably parse CSV files from many sources or
! generate CSV files designed to be fed to specific external programs
! without deep knowledge of those sources and programs.  This PEP and
! the software which accompany it attempt to make the process less
! fragile.

--- 72,105 ----
  the acronym stands for "Vague" instead of "Values".  Different
  delimiters and quoting characters are just the start.  Some programs
! generate whitespace after each delimiter which is not part of the
! following field.  Others quote embedded quoting characters by doubling
! them, others by prefixing them with an escape character.  The list of
! weird ways to do things can seem endless.

! All this variability means it is difficult for programmers to reliably
! parse CSV files from many sources or generate CSV files designed to be
! fed to specific external programs without a thorough understanding of
! those sources and programs.  This PEP and the software which accompany
! it attempt to make the process less fragile.
! 
! 
! Existing Modules
! ================
! 
! This problem has been tackled before.  At least three modules
! currently available in the Python community enable programmers to read
! and write CSV files:
! 
! - Object Craft's CSV module [2]_
! 
! - Cliff Wells' Python-DSV module [3]_
! 
! - Laurence Tratt's ASV module [4]_
! 
! Each has a different API, making it somewhat difficult for programmers
! to switch between them.  More of a problem may be that they interpret
! some of the CSV corner cases differently, so even after surmounting
! the differences between the different module APIs, the programmer has
! to also deal with semantic differences between the packages.

***************
*** 87,102 ****
  ================

! The module supports two basic APIs, one for reading and one for
! writing.  The basic reading interface is::

      obj = reader(iterable [, dialect='excel']
                   [optional keyword args])

! A reader object is an iterable which takes an interable object which
! returns lines as the sole required parameter.  The optional dialect
! parameter is discussed below.  It also accepts several optional
! keyword arguments which define specific format settings for the parser
! (see the section "Formatting Parameters").  Readers are typically used
! as follows::

      csvreader = csv.reader(file("some.csv"))
--- 107,132 ----
  ================

! This PEP supports three basic APIs, one to read and parse CSV files,
! one to write them, and one to identify different CSV dialects to the
! readers and writers.
! 
! 
! Reading CSV Files
! -----------------
! 
! CSV readers are created with the reader factory function::

      obj = reader(iterable [, dialect='excel']
                   [optional keyword args])

! A reader object is an iterator which takes an iterable object
! returning lines as the sole required parameter.  If it supports a
! binary mode (file objects do), the iterable argument to the reader
! function must have been opened in binary mode.  This gives the reader
! object full control over the interpretation of the file's contents.
! The optional dialect parameter is discussed below.  The reader
! function also accepts several optional keyword arguments which define
! specific format settings for the parser (see the section "Formatting
! Parameters").  Readers are typically used as follows::

      csvreader = csv.reader(file("some.csv"))
***************
*** 104,118 ****
          process(row)

! The writing interface is similar::

!     obj = writer(fileobj [, dialect='excel'], [, fieldnames=seq]
                   [optional keyword args])

  A writer object is a wrapper around a file-like object opened for
! writing.  It accepts the same optional keyword parameters as the
! reader constructor.  In addition, it accepts an optional fieldnames
! argument.  This is a sequence that defines the order of fields in the
! output file.  It allows the write() method to accept mapping objects
! as well as sequence objects.

  Writers are typically used as follows::
--- 134,156 ----
          process(row)

! Each row returned by a reader object is a list of strings or Unicode
! objects.

! When both a dialect parameter and individual formatting parameters are
! passed to the constructor, first the dialect is queried for formatting
! parameters, then individual formatting parameters are examined.
! 
! 
! Writing CSV Files
! -----------------
! 
! Creating writers is similar::
! 
!     obj = writer(fileobj [, dialect='excel'],
                   [optional keyword args])

  A writer object is a wrapper around a file-like object opened for
! writing in binary mode (if such a distinction is made).  It accepts
! the same optional keyword parameters as the reader constructor.

  Writers are typically used as follows::
***************
*** 120,124 ****
      csvwriter = csv.writer(file("some.csv", "w"))
      for row in someiterable:
!         csvwriter.write(row)

  To generate a set of field names as the first row of the CSV file, the
--- 158,162 ----
      csvwriter = csv.writer(file("some.csv", "w"))
      for row in someiterable:
!         csvwriter.writerow(row)

  To generate a set of field names as the first row of the CSV file, the
***************
*** 133,160 ****

! Dialects
! --------

! Readers and writers support a dialect argument which is just a
! convenient handle on a group of lower level parameters.

! When dialect is a string it identifies one of the dialects which is
! known to the module, otherwise it is processed as a dialect class as
! described below.

  Dialects will generally be named after applications or organizations
! which define specific sets of format constraints.  The initial dialect
! is "excel", which describes the format constraints of Excel 97 and
! Excel 2000 regarding CSV input and output.  Another possible dialect
! (used here only as an example) might be "gnumeric".

  Dialects are implemented as attribute only classes to enable users to
! construct variant dialects by subclassing.  The "excel" dialect is
! implemented as follows::

!     class excel:
          delimiter = ','
          quotechar = '"'
-         escapechar = None
          doublequote = True
          skipinitialspace = False
--- 171,217 ----

! Managing Different Dialects
! ---------------------------

! Because CSV is a somewhat ill-defined format, there are plenty of ways
! one CSV file can differ from another, yet contain exactly the same
! data.  Many tools which can import or export tabular data allow the
! user to indicate the field delimiter, quote character, line
! terminator, and other characteristics of the file.  These can be
! fairly easily determined, but are still mildly annoying to figure out,
! and make for fairly long function calls when specified individually.

! To try and minimize the difficulty of figuring out and specifying a
! bunch of formatting parameters, reader and writer objects support a
! dialect argument which is just a convenient handle on a group of these
! lower level parameters.  When a dialect is given as a string it
! identifies one of the dialects known to the module via its
! registration functions, otherwise it must be an instance of the
! Dialect class as described below.

  Dialects will generally be named after applications or organizations
! which define specific sets of format constraints.  Two dialects are
! defined in the module as of this writing, "excel", which describes the
! default format constraints for CSV file export by Excel 97 and Excel
! 2000, and "excel-tab", which is the same as "excel" but specifies an
! ASCII TAB character as the field delimiter.

  Dialects are implemented as attribute only classes to enable users to
! construct variant dialects by subclassing.  The "excel" dialect is a
! subclass of Dialect and is defined as follows::

!     class Dialect:
!         # placeholders
!         delimiter = None
!         quotechar = None
!         escapechar = None
!         doublequote = None
!         skipinitialspace = None
!         lineterminator = None
!         quoting = None
! 
!     class excel(Dialect):
          delimiter = ','
          quotechar = '"'
          doublequote = True
          skipinitialspace = False
***************
*** 162,182 ****
          quoting = QUOTE_MINIMAL

! An excel tab separated dialect can then be defined in user code as
! follows::

!     class exceltsv(csv.excel):
          delimiter = '\t'

! Three functions are defined in the API to set, get and list dialects::

-     register_dialect(name, dialect)
      dialect = get_dialect(name)
!     known_dialects = list_dialects()

! The dialect parameter is a class or instance whose attributes are the
! formatting parameters defined in the next section.  The
! list_dialects() function returns all the registered dialect names as
! given in previous register_dialect() calls (both predefined and
! user-defined).

--- 219,243 ----
          quoting = QUOTE_MINIMAL

! The "excel-tab" dialect is defined as::

!     class exceltsv(excel):
          delimiter = '\t'

! (For a description of the individual formatting parameters see the
! section "Formatting Parameters".)
! 
! To enable string references to specific dialects, the module defines
! several functions::

      dialect = get_dialect(name)
!     names = list_dialects()
!     register_dialect(name, dialect)
!     unregister_dialect(name)

! ``get_dialect()`` returns the dialect instance associated with the
! given name.  ``list_dialects()`` returns a list of all registered
! dialect names.  ``register_dialects()`` associates a string name with
! a dialect class.  ``unregister_dialect()`` deletes a name/dialect
! association. 

***************
*** 184,241 ****
  ---------------------

! Both the reader and writer constructors take several specific
! formatting parameters, specified as keyword parameters.
! 
! - ``quotechar`` specifies a one-character string to use as the quoting
!   character.  It defaults to '"'.  Setting this to None has the same
!   effect as setting quoting to csv.QUOTE_NONE.

! - ``delimiter`` specifies a one-character string to use as the field
!   separator.  It defaults to ','.

! - ``escapechar`` specifies a one-character string used to escape the
!   delimiter when quotechar is set to None.

! - ``skipinitialspace`` specifies how to interpret whitespace which
!   immediately follows a delimiter.  It defaults to False, which means
!   that whitespace immediately following a delimiter is part of the
!   following field.

! - ``lineterminator`` specifies the character sequence which should
!   terminate rows.

! - ``quoting`` controls when quotes should be generated by the
!   writer.  It can take on any of the following module constants::

!     csv.QUOTE_MINIMAL means only when required, for example, when a
!     field contains either the quotechar or the delimiter

!     csv.QUOTE_ALL means that quotes are always placed around fields.

!     csv.QUOTE_NONNUMERIC means that quotes are always placed around
!     fields which contain characters other than [+-0-9.].

!     csv.QUOTE_NONE means that quotes are never placed around
!     fields.

! - ``doublequote`` controls the handling of quotes inside fields.  When
!   True two consecutive quotes are interpreted as one during read, and
!   when writing, each quote is written as two quotes.

! - are there more to come?

  When processing a dialect setting and one or more of the other
! optional parameters, the dialect parameter is processed first, then
! the others are processed.  This makes it easy to choose a dialect,
! then override one or more of the settings without defining a new
! dialect class.  For example, if a CSV file was generated by Excel 2000
! using single quotes as the quote character and TAB as the delimiter,
! you could create a reader like::

      csvreader = csv.reader(file("some.csv"), dialect="excel",
!                            quotechar="'", delimiter='\t')

  Other details of how Excel generates CSV files would be handled
! automatically.

--- 245,318 ----
  ---------------------

! In addition to the dialect argument, both the reader and writer
! constructors take several specific formatting parameters, specified as
! keyword parameters.  The formatting parameters understood are::

!     - ``quotechar`` specifies a one-character string to use as the
!       quoting character.  It defaults to '"'.  Setting this to None
!       has the same effect as setting quoting to csv.QUOTE_NONE.

!     - ``delimiter`` specifies a one-character string to use as the
!       field separator.  It defaults to ','.

!     - ``escapechar`` specifies a one-character string used to escape
!       the delimiter when quotechar is set to None.

!     - ``skipinitialspace`` specifies how to interpret whitespace which
!       immediately follows a delimiter.  It defaults to False, which
!       means that whitespace immediately following a delimiter is part
!       of the following field.

!     - ``lineterminator`` specifies the character sequence which should
!       terminate rows.

!     - ``quoting`` controls when quotes should be generated by the
!       writer.  It can take on any of the following module constants::

!         * csv.QUOTE_MINIMAL means only when required, for example,
!           when a field contains either the quotechar or the delimiter

!         * csv.QUOTE_ALL means that quotes are always placed around
!           fields.

!         * csv.QUOTE_NONNUMERIC means that quotes are always placed
!           around nonnumeric fields.

!         * csv.QUOTE_NONE means that quotes are never placed around
!           fields.

!     - ``doublequote`` controls the handling of quotes inside fields.
!       When True two consecutive quotes are interpreted as one during
!       read, and when writing, each quote is written as two quotes.

  When processing a dialect setting and one or more of the other
! optional parameters, the dialect parameter is processed before the
! individual formatting parameters.  This makes it easy to choose a
! dialect, then override one or more of the settings without defining a
! new dialect class.  For example, if a CSV file was generated by Excel
! 2000 using single quotes as the quote character and a colon as the
! delimiter, you could create a reader like::

      csvreader = csv.reader(file("some.csv"), dialect="excel",
!                            quotechar="'", delimiter=':')

  Other details of how Excel generates CSV files would be handled
! automatically because of the reference to the "excel" dialect.
! 
! 
! Reader Objects
! --------------
! 
! Reader objects are iterables whose next() method returns a sequence of
! strings, one string per field in the row.
! 
! 
! Writer Objects
! --------------
! 
! Writer objects have two methods, writerow() and writerows().  The
! former accepts an iterable (typically a list) of fields which are to
! be written to the output.  The latter accepts a list of iterables and
! calls writerow() for each.

***************
*** 258,318 ****
  ======

! - Should a parameter control how consecutive delimiters are
!   interpreted?  Our thought is "no".  Consecutive delimiters should
!   always denote an empty field.
! 
! - What about Unicode?  Is it sufficient to pass a file object gotten
!   from codecs.open()?  For example::
! 
!     csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))
! 
!     csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))

!   In the first example, text would be assumed to be encoded as cp1252.
!   Should the system be aggressive in converting to Unicode or should
!   Unicode strings only be returned if necessary?

!   In the second example, the file will take care of automatically
!   encoding Unicode strings as utf-8 before writing to disk.

! - What about alternate escape conventions?  When Excel exports a file,
!   it appears only the field delimiter needs to be escaped.  It
!   accomplishes this by quoting the entire field, then doubling any
!   quote characters which appear in the field.  It also quotes a field
!   if the first character is a quote character.  It would seem we need
!   to support two modes: escape-by-quoting and escape-by-prefix.  In
!   addition, for the second mode, we'd have to specify the escape
!   character (presumably defaulting to a backslash character).

! - Should there be a "fully quoted" mode for writing?  What about
!   "fully quoted except for numeric values"?

! - What about end-of-line?  If I generate a CSV file on a Unix system,
!   will Excel properly recognize the LF-only line terminators?

! - What about conversion to other file formats?  Is the list-of-lists
!   output from the csvreader sufficient to feed into other writers?

! - What about an option to generate list-of-dict output from the reader
!   and accept list-of-dicts by the writer?  This makes manipulating
!   individual rows easier since each one is independent, but you lose
!   field order when writing and have to tell the writer object the
!   order the fields should appear in the file.

! - Are quote character and delimiters limited to single characters?  I
!   had a client not that long ago who wrote their own flat file format
!   with a delimiter of ":::".

! - How should rows of different lengths be handled?  The options seem
!   to be:

!   * raise an exception when a row is encountered whose length differs
!     from the previous row

!   * silently return short rows

!   * allow the caller to specify the desired row length and what to do
!     when rows of a different length are encountered: ignore, truncate,
!     pad, raise exception, etc.

--- 335,386 ----
  ======

! 1. Should a parameter control how consecutive delimiters are
!    interpreted?  Our thought is "no".  Consecutive delimiters should
!    always denote an empty field.

! 2. What about Unicode?  Is it sufficient to pass a file object gotten
!    from codecs.open()?  For example::

!      csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))

!      csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))

!    In the first example, text would be assumed to be encoded as cp1252.
!    Should the system be aggressive in converting to Unicode or should
!    Unicode strings only be returned if necessary?

!    In the second example, the file will take care of automatically
!    encoding Unicode strings as utf-8 before writing to disk.

!    Note: As of this writing, the csv module doesn't handle Unicode
!    data.

! 3. What about alternate escape conventions?  If the dialect in use
!    includes an ``escapechar`` parameter which is not None and the
!    ``quoting`` parameter is set to QUOTE_NONE, delimiters appearing
!    within fields will be prefixed by the escape character when writing
!    and are expected to be prefixed by the escape character when
!    reading.

! 4. Should there be a "fully quoted" mode for writing?  What about
!    "fully quoted except for numeric values"?  Both are implemented
!    (QUOTE_ALL and QUOTE_NONNUMERIC, respectively).

! 5. What about end-of-line?  If I generate a CSV file on a Unix system,
!    will Excel properly recognize the LF-only line terminators?  Files
!    must be opened for reading or writing as appropriate using binary
!    mode.  Specify the ``lineterminator`` sequence as '\r\n'.  The
!    resulting file will be written correctly.

! 6. What about an option to generate dicts from the reader and accept
!    dicts by the writer?  See the DictReader and DictWriter classes in
!    csv.py.

! 8. Are quote character and delimiters limited to single characters?
!    For the time being, yes.

! 9. How should rows of different lengths be handled?  Interpretation of
!    the data is the application's job.  There is no such thing as a
!    "short row" or a "long row" at this level.