[I18n-sig] XML and codecs

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 5 Jun 2001 20:58:51 +0200


> Should be no problem since the exception will sort of freeze
> the current state of the codec (provided it's a StreamWriter/Reader)
> and let you use this state to take appropriate actions.

What do you mean by "provided it's a StreamReader/Writer"? What if I
invoke the encode method found in codec lookup, and get an exception?

The exception does not carry the state. Suppose you encode into JIS X
0201.  That has four shift states:

CHARSETS = {
    "\033(B": US_ASCII,
    "\033(J": JISX0201_1976,
    "\033$@": JISX0208_1978,
    "\033$B": JISX0208_1983,
}

Depending on which of the escape codes you've emitted last, the
following bytes will have different meanings.
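
To make that concrete, here is a minimal sketch. It assumes a current
Python 3 and its iso2022_jp codec (neither is part of the discussion
above), which understands the escape sequences in the table. The same
two payload bytes decode differently depending on the escape code that
precedes them:

```python
# Escape sequences taken from the CHARSETS table above.
ESC_ASCII = b"\x1b(B"
ESC_JISX0208_1983 = b"\x1b$B"

payload = b'$"'  # the same two bytes in both cases

# After the ASCII escape, the bytes are two ASCII characters.
as_ascii = (ESC_ASCII + payload).decode("iso2022_jp")

# After the JIS X 0208-1983 escape, the bytes form one double-byte character.
as_jisx0208 = (ESC_JISX0208_1983 + payload).decode("iso2022_jp")

print(repr(as_ascii))
print(repr(as_jisx0208))
```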

Now, suppose we encode a string that cannot be translated to JIS
X 0201.  The codec will raise an exception, telling us how many bytes
it has encoded. Now, suppose we want to replace this character with
the string "&#9898;". If we are in the US_ASCII shift state, we can
immediately encode it. If we are in a different shift state, we must
issue the control sequence first.

When the codec does not preserve state, it cannot correctly encode the
entire string, since concatenating the results of encode() invocations
might be incorrect.
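
A rough illustration of the concatenation problem, again assuming
Python 3's iso2022_jp codec for decoding. The variable names and the
hard-coded partial output are mine, chosen for the example; they stand
in for whatever a codec had emitted before it hit the unencodable
character:

```python
ESC_ASCII = b"\x1b(B"
ESC_JISX0208_1983 = b"\x1b$B"

# Hypothetical output so far: one double-byte character (U+3042), which
# leaves the stream in the JIS X 0208 shift state.
partial = ESC_JISX0208_1983 + b'$"'

# The XML character reference we want to substitute, as plain ASCII bytes.
replacement = b"&#9898;"

# Naive concatenation ignores the shift state.
naive = partial + replacement

# Correct output switches back to ASCII before the replacement.
correct = partial + ESC_ASCII + replacement

expected = "\u3042&#9898;"
naive_ok = naive.decode("iso2022_jp", "replace") == expected
correct_ok = correct.decode("iso2022_jp") == expected

print("naive round-trips:", naive_ok)
print("correct round-trips:", correct_ok)
```

Only the caller that knows the current shift state can decide whether
the extra escape sequence is needed.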

If you don't believe me, tell me how I can use your proposed interface
to encode a Unicode string into JIS X 0201 + XML escapes, using the
encode/decode functions only.

> Not sure what you mean here, but the encoder and decoder
> returned by codecs.lookup() must not maintain state. This
> property is reserved for StreamWriters and Readers (see the
> Unicode docs).

You mean the sentence that says

# The functions/methods are expected to work in a stateless mode.

What is "expected to work"? Who expects they work in stateless mode,
and why? And what happens if they don't?

It also says

# These must be functions or methods which have the same interface as
# the encode()/decode() methods of Codec instances (see Codec
# Interface).

So surely, the result of codecs.lookup may be a method. If it is a
method, it surely must be a bound method (or else, where does the self
argument come from?). Since bound methods are allowed, the encode/decode
functions *may* preserve state: a bound method always references state
in the form of the object it is bound to.
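
A small sketch of that point. CountingCodec is a hypothetical class
invented for the example, not anything in the codecs module, and the
__self__ attribute is the current Python 3 spelling:

```python
class CountingCodec:
    """Hypothetical codec object whose encode() has the (input, errors)
    interface and returns an (output, length consumed) tuple."""

    def __init__(self):
        self.calls = 0  # state that survives between encode() calls

    def encode(self, text, errors="strict"):
        self.calls += 1
        return text.encode("ascii", errors), len(text)


codec = CountingCodec()
encoder = codec.encode  # a bound method, usable like a plain function

encoder("abc")
encoder("de")

# The bound method carries a reference to its object, and so to its state.
print(encoder.__self__ is codec)
print(codec.calls)
```

Nothing in the calling convention prevents such an encoder from
remembering a shift state between calls.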

So I think the sentence in the documentation saying "expected to work"
is an error.

Regards,
Martin