[Python-Dev] Some thoughts on the codecs...

M.-A. Lemburg mal@lemburg.com
Mon, 15 Nov 1999 23:54:38 +0100


[I'll get back on this tomorrow, just some quick notes here...]

Guido van Rossum wrote:
> 
> > Andy Robinson wrote:
> > >
> > > Some thoughts on the codecs...
> > >
> > > 1. Stream interface
> > > At the moment a codec has dump and load methods which
> > > read a (slice of a) stream into a string in memory and
> > > vice versa.  As the proposal notes, this could lead to
> > > errors if you take a slice out of a stream.   This is
> > > not just due to character truncation; some Asian
> > > encodings are modal and have shift-in and shift-out
> > > sequences as they move from Western single-byte
> > > characters to double-byte ones.   It also seems a bit
> > > pointless to me as the source (or target) is still a
> > > Unicode string in memory.
> > >
> > > This is a real problem - a filter to convert big files
> > > between two encodings should be possible without
> > > knowledge of the particular encoding, as should one on
> > > the input/output of some server.  We can still give a
> > > default implementation for single-byte encodings.
> > >
> > > What's a good API for real stream conversion?   just
> > > Codec.encodeStream(infile, outfile)  ?  or is it more
> > > useful to feed the codec with data a chunk at a time?
> 
> M.-A. Lemburg responds:
> 
> > The idea was to use Unicode as intermediate for all
> > encoding conversions.
> >
> > What you envision here are stream recoders. They can
> > easily be implemented as a useful addition to the Codec
> > subclasses, but I don't think that these have to go
> > into the core.
> 
> What I wanted was a codec API that acts somewhat like a buffered file;
> the buffer makes it possible to efficiently handle shift states.  This
> is not exactly what Andy shows, but it's not what Marc's current spec
> has either.
> 
> I had thought something more like what Java does: an output stream
> codec's constructor takes a writable file object and the object
> returned by the constructor has a write() method, a flush() method and
> a close() method.  It acts like a buffering interface to the
> underlying file; this allows it to generate the minimal number of
> shift sequences.  Similar for input stream codecs.

The Codecs provide implementations for encoding and decoding;
they are not intended as complete wrappers for e.g. files or
sockets.

The unicodec module will define a generic stream wrapper
(which is yet to be defined) for dealing with files, sockets,
etc. It will use the codec registry to do the actual codec
work.
 
From the proposal:
"""
For explicit handling of Unicode using files, the unicodec module
could provide stream wrappers which provide transparent
encoding/decoding for any open stream (file-like object):

  import unicodec
  file = open('mytext.txt','rb')
  ufile = unicodec.stream(file,'utf-16')
  u = ufile.read()
  ...
  ufile.close()

XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as
    short-hand for unicodec.stream(open(<filename>,<mode>),<encname>) which
    also ensures that <mode> contains the 'b' character when needed.

XXX Specify the wrapper(s)...

    Open issues: what to do with Python strings
    fed to the .write() method (may need to know the encoding of the
    strings) and when/if to return Python strings through the .read()
    method.

    Perhaps we need more than one type of wrapper here.
"""

> Andy's file translation example could then be written as follows:
> 
> # assuming variables input_file, input_encoding, output_file,
> # output_encoding, and constant BUFFER_SIZE
> 
> f = open(input_file, "rb")
> f1 = unicodec.codecs[input_encoding].stream_reader(f)
> g = open(output_file, "wb")
> g1 = unicodec.codecs[output_encoding].stream_writer(g)
> 
> while 1:
>       buffer = f1.read(BUFFER_SIZE)
>       if not buffer:
>          break
>       g1.write(buffer)
> 
> g1.close()
> f1.close()

 
> Note that we could possibly make these the only API that a codec needs
> to provide; the string object <--> unicode object conversions can be
> done using this and the cStringIO module.  (On the other hand it seems
> a common case that would be quite useful.)

You wouldn't want to go via cStringIO for *every* encoding
translation.

The Codec interface defines two pairs of methods
on purpose: one which works internally (ie. directly between
strings and Unicode objects), and one which works externally
(directly between a stream and Unicode objects).
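
For illustration, the two pairs could look roughly like this (the
method names follow the dump/load naming used at the top of the
thread; the exact signatures are only a sketch, not the final API):

    class Codec:
        # internal pair: convert directly between strings and Unicode objects
        def encode(self, u):
            # Unicode object -> encoded Python string
            raise NotImplementedError
        def decode(self, s):
            # encoded Python string -> Unicode object
            raise NotImplementedError

        # external pair: convert directly between streams and Unicode objects
        def dump(self, u, stream):
            # write the Unicode object to the stream in encoded form
            stream.write(self.encode(u))
        def load(self, stream, length=-1):
            # read (a slice of) the stream and return a Unicode object
            return self.decode(stream.read(length))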

> > > 2. Data driven codecs
> > > I really like codecs being objects, and believe we
> > > could build support for a lot more encodings, a lot
> > > sooner than is otherwise possible, by making them data
> > > driven rather than making each one compiled C code with
> > > static mapping tables.  What do people think about the
> > > approach below?
> > >
> > > First of all, the ISO8859-1 series are straight
> > > mappings to Unicode code points.  So one Python script
> > > could parse these files and build the mapping table,
> > > and a very small data file could hold these encodings.
> > >   A compiled helper function analogous to
> > > string.translate() could deal with most of them.
> >
> > The problem with these large tables is that currently
> > Python modules are not shared among processes since
> > every process builds its own table.
> >
> > Static C data has the advantage of being shareable at
> > the OS level.
> 
> Don't worry about it.  128K is too small to care, I think...

Huh ? 128K for every process using Python ? That quickly
adds up to lots of megabytes lying around pretty much unused.
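
As an aside, the table-driven approach Andy sketches above for the
single-byte encodings is simple enough to spell out. A very rough
sketch (the table contents and names are invented here; a real codec
would load the mapping from a data file):

    # one 256-entry table drives the decoding direction ...
    decoding_table = {}
    for i in range(256):
        decoding_table[i] = chr(i)        # identity mapping, as in Latin-1

    # ... and its inverse drives the encoding direction
    encoding_table = {}
    for byte, char in decoding_table.items():
        encoding_table[char] = byte

    def decode_single_byte(data):
        # map each byte of the raw data through the table; a compiled
        # helper analogous to string.translate() could do this at C speed
        return ''.join([decoding_table[b] for b in bytearray(data)])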

> > You can of course implement Python-based lookup tables,
> > but these would be too large...
> >
> > > Secondly, the double-byte ones involve a mixture of
> > > algorithms and data.  The worst cases I know are modal
> > > encodings which need a single-byte lookup table, a
> > > double-byte lookup table, and have some very simple
> > > rules about escape sequences in between them.  A
> > > simple state machine could still handle these (and the
> > > single-byte mappings above become extra-simple special
> > > cases); I could imagine feeding it a totally
> > > data-driven set of rules.
> > >
> > > Third, we can massively compress the mapping tables
> > > using a notation which just lists contiguous ranges;
> > > and very often there are relationships between
> > > encodings.  For example, "cpXYZ is just like cpXYY but
> > > with an extra 'smiley' at 0XFE32".  In these cases, a
> > > script can build a family of related codecs in an
> > > auditable manner.
> >
> > These are all great ideas, but I think they unnecessarily
> > complicate the proposal.
> 
> Agreed, let's leave the *implementation* of codecs out of the current
> efforts.
> 
> However I want to make sure that the *interface* to codecs is defined
> right, because changing it will be expensive.  (This is Linus
> Torvalds' philosophy on drivers -- he doesn't care about bugs in
> drivers, as they will get fixed; however he greatly cares about
> defining the driver APIs correctly.)
> 
> > > 3. What encodings to distribute?
> > > The only clean answers to this are 'almost none', or
> > > 'everything that Unicode 3.0 has a mapping for'.  The
> > > latter is going to add some weight to the
> > > distribution.  What are people's feelings?  Do we ship
> > > any at all apart from the Unicode ones?  Should new
> > > encodings be downloadable from www.python.org?  Should
> > > there be an optional package outside the main
> > > distribution?
> >
> > Since Codecs can be registered at runtime, there is quite
> > some potential there for extension writers coding their
> > own fast codecs. E.g. one could use mxTextTools as a codec
> > engine working at C speeds.
> 
> (Do you think you'll be able to extort some money from HP for these? :-)

Don't know, it depends on what their specs look like. I use
mxTextTools for fast HTML file processing. It uses a small
Turing machine with some extra magic and is programmable via
Python tuples.
 
> > I would propose to only add some very basic encodings to
> > the standard distribution, e.g. the ones mentioned under
> > Standard Codecs in the proposal:
> >
> >   'utf-8':            8-bit variable length encoding
> >   'utf-16':           16-bit variable length encoding (little/big endian)
> >   'utf-16-le':        utf-16 but explicitly little endian
> >   'utf-16-be':        utf-16 but explicitly big endian
> >   'ascii':            7-bit ASCII codepage
> >   'latin-1':          Latin-1 codepage
> >   'html-entities':    Latin-1 + HTML entities;
> >                       see htmlentitydefs.py from the standard Python Lib
> >   'jis' (a popular version XXX):
> >                       Japanese character encoding
> >   'unicode-escape':   See Unicode Constructors for a definition
> >   'native':           Dump of the Internal Format used by Python
> >
> > Perhaps not even 'html-entities' (even though it would make
> > a cool replacement for cgi.escape()) and maybe we should
> > also place the JIS encoding into a separate Unicode package.
> 
> I'd drop html-entities, it seems too cutesie.  (And who uses these
> anyway, outside browsers?)

Ok.
 
> For JIS (shift-JIS?) I hope that Andy can help us with some pointers
> and validation.
> 
> And unicode-escape: now that you mention it, this is a section of
> the proposal that I don't understand.  I quote it here:
> 
> | Python should provide a built-in constructor for Unicode strings which
> | is available through __builtins__:
> |
> |   u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
>                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I meant this as an optional second argument defaulting to
whatever we define <default encoding> to mean, e.g. 'utf-8'.

u = unicode("string", "utf-8")   # same as u = unicode("string")

The <encoding name> argument must be a string identifying one
of the registered codecs.
 
> | With the 'unicode-escape' encoding being defined as:
> |
> |   u = u'<unicode-escape encoded Python string>'
> |
> | · for single characters (and this includes all \XXX sequences except \uXXXX),
> |   take the ordinal and interpret it as Unicode ordinal;
> |
> | · for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX
> |   instead, e.g. \u03C0 to represent the character Pi.
> 
> I've looked at this several times and I don't see the difference
> between the two bullets.  (Ironically, you are using a non-ASCII
> character here that doesn't always display, depending on where I look
> at your mail :-).

The first bullet covers the normal Python string characters
and escapes, e.g. \n and \267 (the center dot ;-), while the
second explains how \uXXXX is interpreted.
 
> Can you give some examples?
> 
> Is u'\u0020' different from u'\x20' (a space)?

No, they both map to the same Unicode ordinal.

> Does '\u0020' (no u prefix) have a meaning?

No, \uXXXX is only defined for u"" strings or strings that are
used to build Unicode objects with this encoding:

u = u'\u0020'   # same as u = unicode(r'\u0020', 'unicode-escape')

Note that writing \uXX is an error, e.g. u"\u12 " will cause
a syntax error.
 
Aside: I just noticed that '\x2010' doesn't give '\x20' + '10'
but instead '\x10' -- is this intended ?

> Also, I remember reading Tim Peters who suggested that a "raw unicode"
> notation (ur"...") might be necessary, to encode regular expressions.
> I tend to agree.

This can be had via unicode():

u = unicode(r'\a\b\c\u0020','unicode-escape')

If that's too long, define a ur() helper which wraps up the
above call in a function.
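
Roughly (a sketch only; this assumes the unicode() constructor from
the proposal and the registered 'unicode-escape' codec):

    def ur(s):
        # interpret the raw string s with the 'unicode-escape' codec
        return unicode(s, 'unicode-escape')

    u = ur(r'\a\b\c\u0020')    # same result as the line above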

> While I'm on the topic, I don't see in your proposal a description of
> the source file character encoding.  Currently, this is undefined, and
> in fact can be (ab)used to enter non-ASCII in string literals.  For
> example, a programmer named François might write a file containing
> this statement:
> 
>   print "Written by François." # (There's a cedilla in there!)
> 
> (He assumes his source character encoding is Latin-1, and he doesn't
> want to have to type \347 when he can type a cedilla on his keyboard.)
> 
> If his source file (or .pyc file!)  is executed by a Japanese user,
> this will probably print some garbage.
> 
> Using the new Unicode strings, François could change his program as
> follows:
> 
>   print unicode("Written by François.", "latin-1")
> 
> Assuming that François sets his sys.stdout to use Latin-1, while the
> Japanese user sets his to shift-JIS (or whatever his kanjiterm uses).
> 
> But when the Japanese user views François' source file, he will again
> see garbage.  If he uses a generic tool to translate latin-1 files to
> shift-JIS (assuming shift-JIS has a cedilla character) the program
> will no longer work correctly -- the string "latin-1" has to be
> changed to "shift-jis".
> 
> What should we do about this?  The safest and most radical solution is
> to disallow non-ASCII source characters; François will then have to
> type
> 
>   print u"Written by Fran\u00E7ois."
> 
> but, knowing François, he probably won't like this solution very much
> (since he didn't like the \347 version either).

I think it's best to leave it undefined... as with all files,
only the programmer knows what format and encoding it contains,
e.g. a Japanese programmer might want to use a shift-JIS editor
to enter strings directly in shift-JIS via

u = unicode("...shift-JIS encoded text...","shift-jis")

Of course, this is not readable using an ASCII editor, but
Python will continue to produce the intended string.
NLS strings don't belong in program text anyway: i18n usually
takes the gettext() approach to handle these issues.
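
For example, with a message catalog the program text stays plain
ASCII and the accented text lives in the translation files. A small
sketch using the gettext module (the 'myapp' domain and 'locale'
directory are placeholders):

    import gettext

    # fall back to the untranslated string when no catalog is installed
    t = gettext.translation('myapp', localedir='locale', fallback=True)
    _ = t.gettext

    print(_("Written by Francois."))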

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/