[I18n-sig] Codec API questions

Andy Robinson andy@reportlab.com
Mon, 10 Apr 2000 20:49:27 +0100


I'm beginning to wonder about some issues with the unicode implementation.
Bear in mind we have seven weeks left - if anyone else has issues or
opinions, we should raise them now.

1. Set Default Encoding at site level
----------------------------------------------------
The default encoding is defined as UTF-8, which will at least annoy all
nations equally :-).

It looks like you can hack this any way you want by creating your own
wrappers around stdin/stdout/stderr.  However, I wonder if Python should
make this customizable on a site basis - for example, site.py checks for
some option somewhere to say "I want to see Latin-1" or Shift-JIS or
whatever.  I often used to write scripts to parse files of names and
addresses, and use an interactive prompt to inspect the lists and tuples
directly; the convenience of typing 'print mydata' and seeing it properly
is nice.  What do people think?

(Or is this feature there already and I've missed it?)
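For illustration, here is a minimal sketch of the kind of site-level hook
described above, using the StreamWriter machinery to wrap a byte stream in
the user's chosen encoding.  (wrap_stream is a hypothetical name, and a
BytesIO stands in for the real stdout; a site.py hook would wrap the actual
standard streams instead.)

```python
import codecs
import io

# Hypothetical site-level hook: wrap a byte stream so that printed
# output comes out in the encoding the user asked for (Latin-1 here).
def wrap_stream(byte_stream, encoding):
    # codecs.getwriter returns the StreamWriter factory for the encoding
    return codecs.getwriter(encoding)(byte_stream)

raw = io.BytesIO()              # stands in for the raw stdout byte layer
out = wrap_stream(raw, 'latin-1')
out.write('caf\u00e9\n')        # e-acute becomes a single Latin-1 byte
```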


2. lookup returns Codec object rather than tuple?
---------------------------------------------------------------------
I should have thought of this when we were in the draft stage months back,
but couldn't really get my mind around it until I had something concrete to
play with.

Right now, codecs.lookup() returns a tuple of
    (encode_func,
    decode_func,
    stream_encoder_factory,
    stream_decoder_factory)
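To make the tuple concrete, a small sketch of driving the encode and decode
entries by hand (attribute access is used so the entries are named
explicitly; the (result, length consumed) pair-return is part of the codec
API):

```python
import codecs

# The encode and decode entries from lookup, used directly.  Each
# returns a (result, length consumed) pair.
info = codecs.lookup('utf-8')
text, consumed = info.decode(b'caf\xc3\xa9')   # the decode_func entry
data, written = info.encode(text)              # the encode_func entry
```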

But there is no easy way to look up the codec object itself - indeed, no
requirement that there be one.  I'd like to see lookup always return a
Codec object, guaranteed to have the four methods above but possibly more.
(Note that a Codec object would have the ability to create StreamEncoders
and StreamDecoders, but would not be one itself.)

A fifth method which is potentially very useful is validate(); a sixth might
be repair().  And for each language, there could be specific ones such as
expanding half-width to full-width katakana.

Furthermore, if we can get hold of the Codec objects, we can start to reason
about codecs - for example, ask whether encodings are compatible with each
other.
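A rough sketch of what such a Codec object might look like, built on top of
today's lookup.  The class name and validate() are illustrative only, not
an existing API; repair() and the language-specific methods would hang off
the same object.

```python
import codecs

# Hypothetical Codec object of the kind proposed above: it bundles the
# four entries the lookup tuple holds today, plus an optional validate().
class Codec:
    def __init__(self, name):
        info = codecs.lookup(name)   # today's lookup gives the four callables
        self.name = name
        self.encode = info.encode
        self.decode = info.decode
        self.streamwriter = info.streamwriter
        self.streamreader = info.streamreader

    def validate(self, data):
        # Proposed fifth method: can this byte string be decoded at all?
        try:
            self.decode(data, 'strict')
            return True
        except UnicodeError:
            return False

ascii_codec = Codec('ascii')
```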

3. Direct conversion lookups and short-circuiting Unicode
----------------------------------------------------------------------------
This is an extension rather than a change.  I know what I want to do, but
have only the vaguest ideas how to implement it.

As noted here before, you can get from shift-JIS to EUC and vice versa
without going through Unicode.  Because these algorithmic conversions work
on the full 94x94 'kuten space' and not just the 6879 code points in the
standard, they tend to work for any vendor-specific extensions and for
user-defined characters.  Most other Asian native encodings have used a
similar scheme.

I'd like to see an 'extended API' to go from one native character set to
another.  As before, this comes in two flavours, string and stream:
    convert(string, from_enc, to_enc)   returns a string.
We also need ways to get hold of StreamReader and StreamWriter versions.
One can trivially build these using Unicode in the middle.

codecs.lookup('from_enc', 'to_enc') would return a codec object able to
convert from one encoding to another.  By default, this would weld together
two Unicode codecs.  But if someone writes a codec to do the job directly,
there should be a way to register that.
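The extended API might be sketched like this.  convert(),
register_converter() and the dict registry are illustrative names only;
the Unicode-in-the-middle path is the default, and a registered direct
converter (such as a shift-JIS-to-EUC one) would override it.

```python
import codecs

# Sketch of the proposed extended API: convert() goes from one native
# encoding to another.  By default it welds two Unicode codecs together;
# a registry (a plain dict here) lets a direct converter short-circuit that.
_direct_converters = {}   # (from_enc, to_enc) -> callable, illustrative only

def register_converter(from_enc, to_enc, func):
    _direct_converters[(from_enc, to_enc)] = func

def convert(data, from_enc, to_enc):
    direct = _direct_converters.get((from_enc, to_enc))
    if direct is not None:
        return direct(data)          # registered direct conversion
    # Default path: decode to Unicode, then re-encode.
    return data.decode(from_enc).encode(to_enc)
```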