[Python-Dev] Some thoughts on the codecs...

M.-A. Lemburg mal@lemburg.com
Mon, 15 Nov 1999 20:26:16 +0100


"Da Silva, Mike" wrote:
> 
> Andy Robinson wrote:
> --
> 1.      Stream interface
> At the moment a codec has dump and load methods which read a (slice of a)
> stream into a string in memory and vice versa.  As the proposal notes, this
> could lead to errors if you take a slice out of a stream.   This is not just
> due to character truncation; some Asian encodings are modal and have
> shift-in and shift-out sequences as they move from Western single-byte
> characters to double-byte ones.   It also seems a bit pointless to me as the
> source (or target) is still a Unicode string in memory.
> This is a real problem - a filter to convert big files between two encodings
> should be possible without knowledge of the particular encoding, as should
> one on the input/output of some server.  We can still give a default
> implementation for single-byte encodings.
> What's a good API for real stream conversion?  Just
> Codec.encodeStream(infile, outfile)?  Or is it more useful to feed the
> codec data a chunk at a time?
> --
> A user defined chunking factor (suitably defaulted) would be useful for
> processing large files.
> --
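
Just to make the chunking question concrete, here is a rough sketch of
what a chunked stream converter could look like.  None of the names or
conventions below are part of the proposal -- recode_stream and the
(text, bytes_consumed) decoder convention are invented purely for
illustration:

    def recode_stream(infile, outfile, decode, encode, chunksize=4096):
        # infile/outfile are binary streams.
        # decode(data) returns (text, bytes_consumed); whatever it does
        # not consume (a truncated multi-byte sequence, a pending shift
        # sequence of a modal encoding) is carried over into the next
        # chunk, so slicing the stream never splits a character.
        # encode(text) returns the bytes in the target encoding.
        pending = b""
        while 1:
            chunk = infile.read(chunksize)
            if not chunk and not pending:
                break
            data = pending + chunk
            text, consumed = decode(data)
            pending = data[consumed:]
            outfile.write(encode(text))
            if not chunk:
                if pending:
                    # input ended in the middle of a sequence
                    raise ValueError("truncated input: %r" % pending)
                break

A default chunksize gives the "suitably defaulted" chunking factor, and
a single-byte codec simply always reports consumed == len(data).
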
> 2.      Data driven codecs
> I really like codecs being objects, and believe we could build support for a
> lot more encodings, a lot sooner than is otherwise possible, by making them
> data driven rather than making each one compiled C code with static mapping
> tables.  What do people think about the approach below?
> First of all, the ISO 8859 series are straight mappings to Unicode code
> points.  So one Python script could parse these files and build the mapping
> table, and a very small data file could hold these encodings.  A compiled
> helper function analogous to string.translate() could deal with most of
> them.
> Secondly, the double-byte ones involve a mixture of algorithms and data.
> The worst cases I know are modal encodings which need a single-byte lookup
> table, a double-byte lookup table, and have some very simple rules about
> escape sequences in between them.  A simple state machine could still handle
> these (and the single-byte mappings above become extra-simple special
> cases); I could imagine feeding it a totally data-driven set of rules.
> Third, we can massively compress the mapping tables using a notation which
> just lists contiguous ranges; and very often there are relationships between
> encodings.  For example, "cpXYZ is just like cpXYY but with an extra
> 'smiley' at 0XFE32".  In these cases, a script can build a family of related
> codecs in an auditable manner.
> --
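
Purely as an illustration of the state-machine idea above (the shift
bytes used here are the ASCII shift-out/shift-in controls 0x0E/0x0F;
the table layout and helper name are invented for this sketch), a
minimal modal decoder needs little more than the two lookup tables:

    SO, SI = 0x0E, 0x0F      # shift-out / shift-in control bytes

    def decode_modal(data, single_map, double_map):
        # single_map: byte -> code point
        # double_map: (byte, byte) -> code point
        # Assumes the complete sequence is in memory; chunked operation
        # needs the carry-over treatment sketched under point 1.
        result = []
        double = 0
        i = 0
        while i < len(data):
            byte = data[i]
            if byte == SO:
                double = 1
                i = i + 1
            elif byte == SI:
                double = 0
                i = i + 1
            elif double:
                result.append(chr(double_map[(byte, data[i + 1])]))
                i = i + 2
            else:
                result.append(chr(single_map[byte]))
                i = i + 1
        return "".join(result)

The single-byte encodings then really are the extra-simple special
case: no shift bytes, only single_map.
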
> The problem here is that we need to decide whether we are Unicode-centric,
> or whether Unicode is just another supported encoding. If we are
> Unicode-centric, then all code-page translations will require static mapping
> tables between the appropriate Unicode character and the relevant code
> points in the other encoding.  This would involve (worst case) a 64K-entry
> static table for each supported encoding.  Unfortunately this also precludes
> the use of algorithmic conversions and/or sparse conversion tables, because
> most of these transformations are relative to a source and target non-Unicode
> encoding, e.g. JIS <----> EUC-JIS.  If we are taking the IBM approach (see
> CDRA), then we can mix and match approaches, and treat Unicode strings as
> just Unicode, and normal strings as being any arbitrary MBCS encoding.
> 
> To guarantee the utmost interoperability and Unicode 3.0 (and beyond)
> compliance, we should probably assume that all core encodings are relative
> to Unicode as the pivot encoding.  This should hopefully avoid any gotchas
> with roundtrips between any two arbitrary native encodings.  The downside is
> this will probably be slower than an optimised algorithmic transformation.

Optimizations should go into separate packages for direct EncodingA
-> EncodingB conversions. I don't think we need them in the core.
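
Just to illustrate how far the data-driven idea can go with Unicode as
the pivot (the range notation and function names below are invented for
this sketch, not part of the proposal):

    # A single-byte encoding described as contiguous ranges of
    # (first_byte, last_byte, first_codepoint), plus per-byte overrides
    # for the "cpXYZ is just like cpXYY plus a smiley" case.
    LATIN1_RANGES = [(0x00, 0xFF, 0x0000)]   # ISO 8859-1 is the identity

    def build_maps(ranges, overrides=None):
        decode_map = {}
        for first, last, codepoint in ranges:
            for byte in range(first, last + 1):
                decode_map[byte] = codepoint + (byte - first)
        decode_map.update(overrides or {})
        encode_map = dict((cp, byte) for (byte, cp) in decode_map.items())
        return decode_map, encode_map

    def decode(data, decode_map):
        # encoded bytes -> Unicode string
        return "".join(chr(decode_map[byte]) for byte in data)

    def encode(text, encode_map):
        # Unicode string -> encoded bytes
        return bytes(encode_map[ord(c)] for c in text)

    # EncodingA -> EncodingB in the core is then just a pivot through
    # Unicode:
    #
    #     a_dec, a_enc = build_maps(A_RANGES)
    #     b_dec, b_enc = build_maps(B_RANGES)
    #     converted = encode(decode(data, a_dec), b_enc)
    #
    # while a hand-optimized direct A -> B routine can live in an add-on
    # package.
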

> --
> 3.      What encodings to distribute?
> The only clean answers to this are 'almost none', or 'everything that
> Unicode 3.0 has a mapping for'.  The latter is going to add some weight to
> the distribution.  What are people's feelings?  Do we ship any at all apart
> from the Unicode ones?  Should new encodings be downloadable from
> www.python.org <http://www.python.org> ?  Should there be an optional
> package outside the main distribution?
> --
> Ship the Unicode encodings in the core; the rest should be an add-on
> package.
> 
> If we are truly Unicode-centric, this gives us the most value in terms of
> accessing a Unicode character properties database, which will provide
> language-neutral case folding, Hankaku <----> Zenkaku folding
> (Japan-specific), and composition/normalisation between composed
> characters and their component nonspacing characters.

From the proposal:

"""
Unicode Character Properties:
-----------------------------

A separate module "unicodedata" should provide a compact interface to
all Unicode character properties defined in the standard's
UnicodeData.txt file.

Among other things, these properties provide ways to recognize
numbers, digits, spaces, whitespace, etc.

Since this module will have to provide access to all Unicode
characters, it will eventually have to contain the data from
UnicodeData.txt which takes up around 200kB. For this reason, the data
should be stored in static C data. This enables compilation as a shared
module which the underlying OS can share between processes (unlike
normal Python code modules).

XXX Define the interface...

"""

Special CJK packages can then access this data for the purposes
you mentioned above.
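
Just to make the data concrete, here is a rough pure-Python sketch of
pulling a few of those properties out of UnicodeData.txt (the shipped
module would of course use static C data as described above; the
function names here are made up for illustration):

    def load_unicodedata(path="UnicodeData.txt"):
        # Each line is a ';'-separated record: code point, character
        # name, general category, ..., decomposition mapping, etc.
        table = {}
        for line in open(path):
            fields = line.split(";")
            if len(fields) < 6:
                continue
            code = int(fields[0], 16)
            # fields[1]: name, fields[2]: general category ('Lu', 'Nd',
            # 'Zs', ...), fields[5]: decomposition mapping
            table[code] = (fields[1], fields[2], fields[5])
        return table

    def category(table, char):
        # 'Nd' marks decimal digits, 'Zs' space separators, etc. --
        # enough to recognize numbers, digits and whitespace.
        return table[ord(char)][1]
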

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/