[I18n-sig] Codec API questions

Guido van Rossum guido@python.org
Mon, 10 Apr 2000 16:45:49 -0400


> 1. Set Default Encoding at site level
> ----------------------------------------------------
> The default encoding is defined as UTF-8, which will at least annoy all
> nations equally :-).
> 
> It looks like you can hack this any way you want by creating your own
> wrappers around stdin/stdout/stderr.  However, I wonder if Python should
> make this customizable on a site basis - for example, site.py checks for
> some option somewhere to say "I want to see Latin-1" or Shift-JIS or
> whatever.  I often used to write scripts to parse files of names and
> addresses, and use an interactive prompt to inspect the lists and tuples
> directly; the convenience of typing 'print mydata' and seeing it properly is
> nice.  What do people think?
> 
> (Or is this feature there already and I've missed it?)

Rather than doing this per site, I'd suggest doing it per user.

Surely each user (on a multi-user site) should be allowed to choose
their own apps and settings (cf. locale).

After trying to figure out how to do this, I am confused.  I can do
this:

import sys
from codecs import EncodedFile
# data written is taken to be UTF-8 and emitted as Latin-1
f = EncodedFile(sys.stdout, "utf-8", "latin-1")

And then I can write Unicode strings to file f, and they are written
to sys.stdout as Latin-1.  I can also write 8-bit strings to file f,
and they are assumed to be UTF-8 and are converted properly to
Latin-1.
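
For example (assuming the default encoding is still UTF-8; the byte
values in the comments are what actually reaches sys.stdout):

f.write(u"caf\u00e9\n")     # Unicode: comes out as Latin-1 'caf\xe9\n'
f.write("caf\xc3\xa9\n")    # 8-bit, assumed UTF-8: also 'caf\xe9\n'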

However, if I specify anything except UTF-8 as the input encoding to
EncodedFile, I can't write Unicode objects to it and have something
useful happen!  It seems the Unicode is always converted to UTF-8
first, and then interpreted according to the input encoding.

I think that a useful feature to have is a file-like object that
behaves as follows: if you write an 8-bit string to it, it applies a
given input encoding to turn it into Unicode; then it applies a given
output encoding to convert that to (usually multibyte) output
characters.  If you write a Unicode string to it, it skips the input
encoding (since it's already Unicode) and then applies the (same)
given output encoding.
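
A minimal sketch of such a wrapper (the class name and all details
here are mine, not an existing API):

import types

class RecodingFile:
    def __init__(self, stream, input_encoding, output_encoding):
        self.stream = stream
        self.input_encoding = input_encoding
        self.output_encoding = output_encoding
    def write(self, data):
        if type(data) is types.StringType:
            # 8-bit string: apply the input encoding to get Unicode
            data = unicode(data, self.input_encoding)
        # data is now Unicode: apply the output encoding
        self.stream.write(data.encode(self.output_encoding))
    def __getattr__(self, name):
        # delegate flush(), close(), etc. to the underlying stream
        return getattr(self.stream, name)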

Then I could write a program that mixes 8-bit strings and Unicode in
its output, which encodes all its 8-bit strings in (say) Latin-1.
This program must obviously be very careful when it mixes Unicode and
8-bit strings internally (always calling unicode(s, "latin-1")) to
avoid getting the default (UTF-8) encoding.  But I think this is
something you are asking for -- right?
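
With the sketch above, such a program would simply do:

import sys
f = RecodingFile(sys.stdout, "latin-1", "utf-8")
f.write("caf\xe9 ")        # 8-bit string, decoded as Latin-1
f.write(u"caf\u00e9\n")    # Unicode, no input decoding needed
# both reach sys.stdout encoded as UTF-8: 'caf\xc3\xa9'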


> 2. lookup returns Codec object rather than tuple?
> ---------------------------------------------------------------------
> I should have thought of this when we were in the draft stage months back,
> but couldn't really get my mind around it until I had something concrete to
> play with.
> 
> Right now, codecs.lookup() returns a tuple of
>     (encode_func,
>      decode_func,
>      stream_encoder_factory,
>      stream_decoder_factory)
> 
> But there is no easy way to look up the codec object itself - indeed, no
> requirement that there be one.  I'd like to see lookup always return a
> Codec object, which is guaranteed to have the four methods above but might
> have more.  (Note that a Codec object would have the ability to create
> StreamEncoders and StreamDecoders, but would not be one by itself).
> 
> A fifth method which is potentially very useful is validate(); a sixth might
> be repair().  And for each language, there could be specific ones such as
> expanding half-width to full-width katakana.
> 
> Furthermore, if we can get hold of the Codec objects, we can start to reason
> about codecs - for example, ask whether encodings are compatible with each
> other.

I have no opinion on this; I've forgotten the issues.
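
If I'm reading the proposal right, though, the object would look
something like this (a sketch only; nothing here is an existing API):

class Codec:
    # the four entries lookup() currently returns, as methods
    def encode(self, input, errors='strict'):
        raise NotImplementedError
    def decode(self, input, errors='strict'):
        raise NotImplementedError
    def streamwriter(self, stream, errors='strict'):
        raise NotImplementedError
    def streamreader(self, stream, errors='strict'):
        raise NotImplementedError
    # the proposed extras:
    def validate(self, input):
        # true if input is well-formed in this encoding
        raise NotImplementedError
    def repair(self, input):
        # best-effort cleanup of malformed input
        raise NotImplementedError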


> 3. direct conversion lookups and short-circuiting Unicode
> ----------------------------------------------------------------------------
> This is an extension rather than a change.  I know what I want to do, but
> have only the vaguest ideas how to implement it.
> 
> As noted here before, you can get from shift-JIS to EUC and vice versa
> without going through Unicode.  Because these algorithmic conversions work
> on the full 94x94 'kuten space' and not just the 6879 code points in the
> standard, they tend to work for any vendor-specific extensions and for
> user-defined characters.  Most other Asian native encodings have used a
> similar scheme.
> 
> I'd like to see an 'extended API' to go from one native character set to
> another.  As before, this comes in two flavours, string and stream:
>     convert(string, from_enc, to_enc)   returns a string.
> We also need ways to get hold of StreamReader and StreamWriter versions.
> Now one can trivially build these using Unicode in the middle.
> 
> codecs.lookup('from_enc', 'to_enc') would return a codec object able to
> convert from one encoding to another.  By default, this would weld together
> two Unicode codecs.  But if someone writes a codec to do the job directly,
> there should be a way to register that.

This could be a separate module, right?  I propose that you write a
separate module (extended_codecs?) that supports such an extended
lookup function.  What functionality would you need from the core?
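
For instance, a fallback that welds two Unicode codecs together needs
nothing special from the core; a sketch (all names hypothetical):

_direct = {}    # maps (from_enc, to_enc) to a direct converter

def register(from_enc, to_enc, func):
    _direct[(from_enc, to_enc)] = func

def convert(data, from_enc, to_enc):
    func = _direct.get((from_enc, to_enc))
    if func is not None:
        return func(data)    # e.g. a direct Shift-JIS <-> EUC converter
    # default: go through Unicode
    return unicode(data, from_enc).encode(to_enc)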

--Guido van Rossum (home page: http://www.python.org/~guido/)