[Python-Dev] Some thoughts on the codecs...

Da Silva, Mike Mike.Da.Silva@uk.fid-intl.com
Mon, 15 Nov 1999 16:01:59 -0000


Andy Robinson wrote:
1.	Stream interface
At the moment a codec has dump and load methods which read a (slice of a)
stream into a string in memory and vice versa.  As the proposal notes, this
could lead to errors if you take a slice out of a stream.   This is not just
due to character truncation; some Asian encodings are modal and have
shift-in and shift-out sequences as they move from Western single-byte
characters to double-byte ones.   It also seems a bit pointless to me as the
source (or target) is still a Unicode string in memory.
This is a real problem - a filter to convert big files between two encodings
should be possible without knowledge of the particular encoding, as should
one on the input/output of some server.  We can still give a default
implementation for single-byte encodings.
What's a good API for real stream conversion?  Just
Codec.encodeStream(infile, outfile)?  Or is it more useful to feed the
codec data a chunk at a time?

A user-defined chunking factor (suitably defaulted) would be useful for
processing large files.
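
As a sketch of what the chunked variant could look like (the name
convertStream and the (text, leftover) decode convention are assumptions
here, not proposed API):

    def convertStream(infile, outfile, incodec, outcodec, chunksize=16 * 1024):
        # Read raw bytes in user-sized chunks, decode to Unicode, re-encode.
        # Assumes decode() returns (text, leftover) so that a multi-byte or
        # modal sequence split across a chunk boundary is carried over to
        # the next read instead of being mis-decoded.
        pending = b''
        while 1:
            data = infile.read(chunksize)
            if not data:
                break
            text, pending = incodec.decode(pending + data)
            outfile.write(outcodec.encode(text))
        if pending:
            raise ValueError('stream ends inside a multi-byte sequence')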

2.	Data driven codecs
I really like codecs being objects, and believe we could build support for a
lot more encodings, a lot sooner than is otherwise possible, by making them
data-driven rather than making each one compiled C code with static mapping
tables.  What do people think about the approach below?
First of all, the ISO 8859 series are straight single-byte mappings to
Unicode code points.  So one Python script could parse the published mapping
files and build the mapping tables, and a very small data file could hold
these encodings.  A compiled helper function analogous to string.translate()
could deal with most of them.
Secondly, the double-byte ones involve a mixture of algorithms and data.
The worst cases I know are modal encodings which need a single-byte lookup
table, a double-byte lookup table, and have some very simple rules about
escape sequences in between them.  A simple state machine could still handle
these (and the single-byte mappings above become extra-simple special
cases); I could imagine feeding it a totally data-driven set of rules.  
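
Something like the following, fed with nothing but two tables and the escape
bytes (all of which are invented here purely for illustration), would cover
that family:

    SHIFT_OUT = 0x0E    # example escape byte: switch to the double-byte table
    SHIFT_IN = 0x0F     # example escape byte: switch back to single-byte

    def decode_modal(data, single_table, double_table):
        # Two states: single-byte lookup or double-byte lookup.  The shift
        # bytes flip between them; everything else is a table lookup.  A
        # plain single-byte encoding is the degenerate one-state case.
        out = []
        double = 0
        i = 0
        while i < len(data):
            byte = data[i]
            if byte == SHIFT_OUT:
                double, i = 1, i + 1
            elif byte == SHIFT_IN:
                double, i = 0, i + 1
            elif double:
                out.append(double_table[(byte, data[i + 1])])
                i = i + 2
            else:
                out.append(single_table[byte])
                i = i + 1
        return ''.join(map(chr, out))
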
Third, we can massively compress the mapping tables using a notation which
just lists contiguous ranges; and very often there are relationships between
encodings.  For example, "cpXYZ is just like cpXYY but with an extra
'smiley' at 0xFE32".  In these cases, a script can build a family of related
codecs in an auditable manner. 
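
Purely as an assumption about what such data files might contain, the
range notation and the derivation could look like:

    # Hypothetical compressed notation: (first_byte, last_byte, first_codepoint)
    # triples describing contiguous runs of the mapping.
    LATIN1_RANGES = [(0x00, 0xFF, 0x0000)]   # ISO 8859-1 is a single straight run

    def expand_ranges(ranges, overrides=None):
        # Expand the run-length notation into a full byte -> code point table.
        table = {}
        for lo, hi, start in ranges:
            for byte in range(lo, hi + 1):
                table[byte] = start + (byte - lo)
        # "cpXYZ is just like cpXYY": a derived codec is its parent's ranges
        # plus a small dictionary of per-byte overrides.
        if overrides:
            table.update(overrides)
        return table

    cpXYY = expand_ranges(LATIN1_RANGES)
    cpXYZ = expand_ranges(LATIN1_RANGES, {0xFE: 0x263A})   # the extra 'smiley'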

The problem here is that we need to decide whether we are Unicode-centric,
or whether Unicode is just another supported encoding.  If we are
Unicode-centric, then all code-page translations will require static mapping
tables between the appropriate Unicode characters and the relevant code
points in the other encoding.  This would involve (worst case) 64K-entry
static tables for each supported encoding.  Unfortunately this also precludes
the use of algorithmic conversions and/or sparse conversion tables, because
most of these transformations are relative to a source and target non-Unicode
encoding, e.g. JIS <----> EUCJIS.  If we are taking the IBM approach (see
CDRA), then we can mix and match approaches, and treat Unicode strings as
just Unicode, and normal strings as being in any arbitrary MBCS encoding.

To guarantee the utmost interoperability and Unicode 3.0 (and beyond)
compliance, we should probably assume that all core encodings are defined
relative to Unicode as the pivot encoding.  This should hopefully avoid any
gotchas with round trips between any two arbitrary native encodings.  The
downside is that this will probably be slower than an optimised direct
algorithmic transformation.
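
Concretely, every conversion becomes two hops through the pivot (the codec
method names here are assumptions):

    def convert(data, source_codec, target_codec):
        # Unicode as pivot: decode to a Unicode string, then encode into the
        # target encoding.  Round-trip behaviour depends only on the two
        # independently audited mapping tables, at the cost of building an
        # intermediate Unicode string instead of going JIS -> EUCJIS directly.
        return target_codec.encode(source_codec.decode(data))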

3.	What encodings to distribute?
The only clean answers to this are 'almost none', or 'everything that
Unicode 3.0 has a mapping for'.  The latter is going to add some weight to
the distribution.  What are people's feelings?  Do we ship any at all apart
from the Unicode ones?  Should new encodings be downloadable from
www.python.org?  Should there be an optional
package outside the main distribution?
Ship with the Unicode encodings in the core; the rest should be an add-on
package.

If we are truly Unicode-centric, this gives us the most value in terms of
accessing a Unicode character properties database, which will provide
language-neutral case folding, Hankaku <----> Zenkaku folding (Japan
specific), and composition / normalisation between composed characters and
their component nonspacing characters.
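
For illustration, this is the kind of property access meant; the unicodedata
module and casefold() are used below only as stand-ins for whatever API the
codec package would actually expose:

    import unicodedata

    # Hankaku -> Zenkaku: NFKC folds halfwidth katakana KA plus the halfwidth
    # voiced sound mark into the single fullwidth character GA.
    print(unicodedata.normalize('NFKC', '\uFF76\uFF9E'))    # 'ガ'

    # Composition: base letter plus combining acute becomes the composed form.
    print(unicodedata.normalize('NFC', 'e\u0301'))          # 'é'

    # Language-neutral case folding.
    print('Stra\u00DFe'.casefold())                         # 'strasse'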

Regards,
Mike da Silva