[Python-3000] Pre-PEP: Easy Text File Decoding

Sun Sep 10 05:29:05 CEST 2006

PEP: XXX
Title: Easy Text File Decoding
Version: $Revision$
Last-Modified: $Date$
Author: Paul Prescod <paul at prescod.net>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 09-Sep-2006
Post-History: 09-Sep-2006
Python-Version: 3.0

Abstract
========

Python 3000 will use Unicode as the standard string type. This means that
text files read from disk will be "decoded" into Unicode code points just as
binary files might be decoded into integers and structures. This change
brings a few issues to the fore that were
previously ignorable.

For example, in Python 2.x, it was possible to open a text file, read the
data into a Python string, filter some lines and print the remaining lines
to the console without ever considering what "encoding" the text was in. In
Python 3000, the programmer will only get access to
Python's powerful string manipulation functions after decoding the data to
Unicode code points. This means that either the programmer or the Python
runtime must select a decoding algorithm (by naming the encoding algorithm
that was used to encode the data in the first place).

Often the programmer can do so based upon out-of-band knowledge ("this file
format is always UCS-2" or "the protocol header says that this data is
latin-1"). In other cases, the programmer may be more naive or simply wish
to avoid thinking about it and would rather defer the issue to Python.

This document presents a proposal for algorithms and APIs that Python can
use to simplify the programmer's life.

Issues outside the scope of this PEP
=====================================

Any programmer who wishes to take direct control of the encoding selection
may of course ignore the features described in this PEP and choose a
decoding explicitly. The PEP is not intended to constrain them in any way.

Bytes received through means other than the file system are not addressed by
this PEP. For example, the PEP does not address data directly read from a
socket or returned from marshal functions.

Rationale
==========

The simplest possible use case for Python text processing involves a user
maintaining some form of simple database (e.g. an address book) as a text
file and processing it with Python. Unfortunately, this use case is not as
simple as it should be because of the variety of encodings in the universe.
For example, the file might be UTF-8, ISO-8859-1 or ISO-8859-2.

Professional programmers making widely distributed programs probably have no
alternative but to deal with this variability head-on. But programmers
working with data that originates and resides primarily on their own
computer might wish to avoid dealing with it. They would like Python to just
"try to do the right thing" with respect to the file. They would like to
think about encodings if and only if Python failed to guess appropriately.

Proposal
========

The function to open a text file will tentatively be called textfile(),
though the function name is not an integral part of this PEP. The function
takes three arguments: the filename, the mode ("r", "w", "r+", etc.) and the
type.

The type could be a true encoding or one of a small set of additional
symbolic values. The main symbolic values are:

* "site" -- the default value, which invokes a site-specific algorithm. For
  example, a Japanese school teacher using Windows might default "site" to
  Shift-JIS. An organization dealing with a small number of encodings might
  default "site" to be equivalent to "guess". An organization with a strict
  internationalization policy might default "site" to "UTF-8". An important
  open issue is what Python's out-of-box interpretation of "site" should be.
  This is key because "site" is the default value so Python's out-of-box
  behaviour is the "default default".

* "guess" -- the value to be used by encoding-inexpert programmers and
  experts who feel confident that Python's guessing algorithm will produce
  sufficient results for their purposes. The guessing algorithm will
  necessarily be complicated and may change over time. It will take into
  account the following factors:

  - the conventions dominant on the operating system in use

  - any localization-relevant settings available

  - a certain number of bytes at the start of the file (perhaps start and
    end?). This sample will likely be on the order of thousands of bytes.

  - filesystem metadata attached to the file (in strong preference to the
    above).

* "locale" -- the encoding suggested by the operating system's locale
  concept.
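
As an illustration only, the byte-sample factor of "guess" could be
approximated as below. The name ``guess_encoding``, its ``fallback``
parameter and the order of the checks are assumptions made for this sketch,
not part of the proposal. The sketch ignores filesystem metadata and any
operating system convention other than the locale.

```python
import codecs
import locale

# Byte order marks and the codec that consumes each one. The UTF-32 marks
# come before the UTF-16 ones because BOM_UTF32_LE starts with BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(sample, fallback=None):
    """Guess a codec name from the first bytes of a file (hypothetical helper)."""
    # 1. An explicit byte order mark is the strongest in-band hint.
    for bom, name in _BOMS:
        if sample.startswith(bom):
            return name
    # 2. Strict UTF-8 rarely accepts text written in another encoding.
    #    final=False tolerates a multi-byte character cut off at the end
    #    of the sample.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Otherwise fall back to the caller's choice or the locale's encoding.
    return fallback or locale.getpreferredencoding(False)
```

A pure-ASCII sample is reported as "utf-8" here, which is harmless because
ASCII is a subset of UTF-8. A real implementation would need many more
heuristics than this.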

Other symbolic values might allow the programmer to suggest specific
encoding detection algorithms like XML [#XML-encoding-detection]_, HTML
[#HTML-encoding-detection]_ and the "coding:" comment convention. These
would be specified in separate PEPs.
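
To make the shape of the API concrete, here is a minimal sketch of how the
third argument might be resolved. It is written against the ``open()`` that
eventually shipped in Python 3, and the helper ``resolve_type`` and the
module variable ``_site_default`` are hypothetical names introduced for the
sketch. The "guess" value is deliberately left unimplemented, and the
missing-default behaviour follows the "no out-of-box default" position
discussed under the open issues.

```python
import codecs
import locale

# Hypothetical stand-in for the site-wide setting; None means "not configured".
_site_default = None


def resolve_type(type="site"):
    """Turn the `type` argument into a concrete codec name."""
    if type == "site":
        if _site_default in (None, "site"):
            raise LookupError("no encoding given and no site default configured")
        return resolve_type(_site_default)
    if type == "locale":
        # Encoding suggested by the operating system's locale settings.
        return locale.getpreferredencoding(False)
    if type == "guess":
        raise NotImplementedError("content-based guessing is not sketched here")
    # Anything else must be a true encoding; unknown names raise LookupError.
    return codecs.lookup(type).name


def textfile(filename, mode="r", type="site"):
    """Open `filename` as text, decoding or encoding according to `type`."""
    return open(filename, mode, encoding=resolve_type(type))
```

With this sketch, ``textfile("addresses.txt", "r", "utf-8")`` decodes
strictly as UTF-8, while ``textfile("addresses.txt")`` fails until a site
default has been configured.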

The Site Decoding Hook
========================

The "sys" module could have a function called "setdefaultfileencoding". The
encoding specified could be a true encoding name or one of the encoding
detection scheme names (e.g. "guess" or "XML").

In addition, it should be possible to register new encoding detection
schemes using a method like "sys.registerencodingdetector". This function
would take two arguments, a string and a callable. The callable would accept
a byte stream argument and return a text stream. The contract for these
detection scheme implementations must allow them to peek ahead some bytes to
use the content as a hint to the encoding.
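
The registration contract can be sketched as follows. The two function
names come from the text above, but they are defined here as plain
module-level functions rather than in "sys", and everything else
(``bom_detector``, ``open_with_default``, the private variables) is a
hypothetical illustration. The detector relies on
``io.BufferedReader.peek`` to look ahead without consuming bytes.

```python
import io

_detectors = {}                 # detection scheme name -> callable
_default_file_encoding = None   # encoding name or detection scheme name


def registerencodingdetector(name, detector):
    """Register a callable that maps a byte stream to a text stream."""
    _detectors[name] = detector


def setdefaultfileencoding(name):
    """Set the site default to an encoding or a registered scheme name."""
    global _default_file_encoding
    _default_file_encoding = name


def bom_detector(byte_stream):
    """Toy detector: UTF-16 if a UTF-16 byte order mark is present, else UTF-8."""
    head = byte_stream.peek(2)[:2]   # peek() does not advance the stream
    encoding = "utf-16" if head in (b"\xff\xfe", b"\xfe\xff") else "utf-8"
    return io.TextIOWrapper(byte_stream, encoding=encoding)


def open_with_default(byte_stream):
    """Wrap a buffered byte stream according to the current site default."""
    name = _default_file_encoding
    if name is None:
        raise LookupError("no site default configured")
    if name in _detectors:
        return _detectors[name](byte_stream)
    return io.TextIOWrapper(byte_stream, encoding=name)
```

For example, after ``registerencodingdetector("bom", bom_detector)`` and
``setdefaultfileencoding("bom")``, a buffered stream that starts with a
UTF-16 byte order mark is decoded as UTF-16 and any other stream as UTF-8.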

Alternatives and Open Issues
==============================

1. Guido proposes that the function be called merely "open". His proposal is
that the binary open should be the alternative and should be invoked
explicitly with a "b" mode switch. The PEP author feels, first, that changing
the behaviour of an existing function is more confusing and disruptive than
creating another. Backporting a change to the "open" function would be
difficult and therefore it would be unnecessarily difficult to create
file-manipulating libraries that work both on Python 2.x and 3.x.

Second, the author feels that "open" is an unnecessarily cryptic name
rooted only in Unix/C history. For a programmer coming from (for example)
JavaScript, open() would tend to imply "open window". The PEP author
believes that factory functions should say what they are creating.

2. There is substantial disagreement on the behaviour of the function when
there is no encoding argument passed and no site override (i.e. the
out-of-box default). Current proposals include ASCII (on the basis that it
is a nearly universal subset of popular encodings), UTF-8 (on the basis that
it is the dominant global standard encompassing all of Unicode), a
locale-derived encoding (on the basis that this is what a naive user will
generate in a text editor) or the guessing algorithm (on the basis that it
is by definition designed to guess right more often than any more specific
encoding name).

The PEP author strongly advocates a strict encoding like ASCII, UTF-8 or no
default at all (in which case the lack of an encoding would raise an
exception). A default like iso-8859-1 (even inferred from the environment)
will result in UTF-8 files, UCS-2 files and even binary files being
"interpreted" as gibberish strings. This could result in document or
database corruption. A "guess" default will encourage the
widespread creation of very unreliable code.

The current proposal is to have no out-of-box default until some point in
the future when a small set of auto-detectable encodings is globally
dominant. UTF-8 has gradually been gaining popularity through W3C and other
standards so it is possible that five years from now it will be the
"no-brainer" default. Until we can guess with substantial confidence,
absence of both an encoding declaration and a site override should result in
an exception being raised.
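
The corruption risk raised under issue 2 can be demonstrated with the
codecs as they exist today. This is a small illustration rather than part
of the proposal; ``strict_decode_fails`` is a hypothetical helper and the
sample text is arbitrary.

```python
original = "naïve café".encode("utf-8")

# iso-8859-1 assigns a code point to every byte value, so this never fails,
# even though the bytes are really UTF-8.
misread = original.decode("iso-8859-1")

# Writing the misread text back out as UTF-8 does not reproduce the
# original bytes: the data has been silently corrupted.
resaved = misread.encode("utf-8")
corrupted = resaved != original


def strict_decode_fails(data, encoding):
    """Return True if `data` is rejected by a strict decode with `encoding`."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False
```

Here ``strict_decode_fails(original, "ascii")`` is true, as is
``strict_decode_fails(b"caf\xe9 au lait", "utf-8")`` for Latin-1 input. The
strict encodings report the mismatch immediately, whereas the iso-8859-1
decode succeeds and yields gibberish.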

References
==========

.. [#XML-encoding-detection] XML Encoding Detection algorithm:
   http://www.w3.org/TR/REC-xml/#sec-guessing
.. [#HTML-encoding-detection] HTML Encoding Detection algorithm:
   http://www.w3.org/TR/REC-xml/#sec-guessing

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: