From martin@loewis.home.cs.tu-berlin.de Tue Jan 2 08:35:27 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 2 Jan 2001 09:35:27 +0100 Subject: [I18n-sig] naming codecs In-Reply-To: <200012070617.PAA22443@dhcp198.grad.sccs.chukyo-u.ac.jp> (message from Tamito KAJIYAMA on Thu, 7 Dec 2000 15:17:06 +0900) References: <3A2F1B701E3.FEEANODA@172.16.112.1> <200012070617.PAA22443@dhcp198.grad.sccs.chukyo-u.ac.jp> Message-ID: <200101020835.JAA01068@loewis.home.cs.tu-berlin.de> > | > I consider releasing a version of the JapaneseCodecs package > | > that will include a new codec for a variant of ISO-2022-JP. The > | > codec is almost the same as the ISO-2022-JP codec, but it can > | > encode and decode Halfwidth Katakana (U+FF61 to U+FF9F) which > | > can not be encoded with ISO-2022-JP as defined in RFC1468. > | > | So how exactly does it encode them? > | > | Is that your own invention, or is there some precedent for that > | encoding (e.g. in an operating system, or text processing system)? > > Halfwidth Katakana in Unicode corresponds to the character set > JIS X 0201 Katakana, and this character set can be designated by > the escape sequence "\033(I" in the framework of ISO 2022. I found some time to look into this, and it appears that your encoding deals with "JIS X 0201 Katakana", which I also found with the name "JIS X 0201 (GR)". I know you already found a name, but ... if your codec is indeed *only* JISX 0201 Katakana, then why not name it that way (e.g. "jisx-0201-katakana"). Regards, Martin From andy@reportlab.com Tue Jan 2 10:41:50 2001 From: andy@reportlab.com (Andy Robinson) Date: Tue, 2 Jan 2001 10:41:50 -0000 Subject: [I18n-sig] naming codecs In-Reply-To: <200101020835.JAA01068@loewis.home.cs.tu-berlin.de> Message-ID: > I found some time to look into this, and it appears that > your encoding > deals with "JIS X 0201 Katakana", which I also found with the name > "JIS X 0201 (GR)". > > I know you already found a name, but ... 
if your codec is indeed > *only* JISX 0201 Katakana, then why not name it that way > (e.g. "jisx-0201-katakana"). > JIS X 0201 Katakana is a character set, not an encoding. It defines the half-width katakana characters (about 60 of them). Japanese encodings contain multiple character sets. ISO-2022-JP is a 'way of making encodings' and within this there can be many variants; he is talking about a specific encoding which combines two character sets... (1) The JIS 0208 character set, 1st and 2nd levels (about 7000 characters including symbols, numeric characters, Latin, Cyrillic and Greek alphabets, Japanese HIRAGANA, KATAKANA, and KANJI), and (2) The JIS 0201 Katakana characters (which are about 60 half-width variants different from the Katakana listed in JIS0208) ...all encoded according to ISO-2022-JP. The half-width katakana are basically 'deprecated' - they predate the ability to use Kanji in computers - but won't go away in practice, so people in Japanese IT frequently need to extend codecs to deal with them. I hope this explains a little further. It is hard to understand this without knowing a little about Japanese writing systems; Ken Lunde's "CJKV" book does quite a good job of explaining it. Regards, Andy Robinson From walter@livinglogic.de Wed Jan 3 19:18:58 2001 From: walter@livinglogic.de ("Walter Dörwald") Date: Wed, 03 Jan 2001 20:18:58 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode In-Reply-To: <3A439A4A.B71F35DA@lemburg.com> References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> Message-ID: <200101032018580500.01F457F3@mail.livinglogic.de> On 22.12.00 at 19:15 M.-A. Lemburg wrote: > "Walter Dörwald" wrote: > > > > On 21.12.00 at 18:30 M.-A.
Lemburg wrote: > > > [about state in encoders and error handlers] > > But I don't see how this internal encoder state should influence > > what the error handler does. There are two layers involved: The > > character encoding layer and the "unencodable character escape > > mechanism". Both layers are completely independent, even in your > > "Unicode compression" example, where the "unencodable character > > escape mechanism" is XML character entities. > > This is true for your XML entity escape example, but error > resolving in general will likely need to know about the > current state of the encoder, e.g. to be able to write data > to the corresponding page in the Unicode compression example or to > force a switch of the current page to a different one. How does this "Unicode compression example" look like? > I know that error handling could be more generic, but passing > a callable object instead of the error parameter is not an > option since the internal APIs all use a const char parameter > for error. Changing this can be done in one or two hours for someone who knows the Python internals. (Unfortunately I don't; I first looked at unicodeobject.[hc] several days ago!) > Besides, I consider such an approach a hack and not > a solution. > > Instead of trying to tweak the implementation into providing > some kind of new error scheme, let's focus on finding a generic > framework which could provide a solution for the general case > while not breaking the existing applications. Are the existing codecs (JapaneseCodecs etc.) to be considered part of the existing applications? The problem might be how to handle callbacks to C functions and callbacks to Python functions in a consistent way. I.e.
is it extern DL_IMPORT(PyObject*) PyUnicode_Encode( const Py_UNICODE *s, /* Unicode char buffer */ int size, /* number of Py_UNICODE chars to encode */ const char *encoding, /* encoding */ PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */ ); or extern DL_IMPORT(PyObject*) PyUnicode_Encode( const Py_UNICODE *s, /* Unicode char buffer */ int size, /* number of Py_UNICODE chars to encode */ const char *encoding, /* encoding */ PyObject *errorHandler /* error handling via Python function */ ); > > > Writing your own function helpers which then apply all the necessary > > > magic is simple and doesn't warrant changing APIs in the core. > > > > It is not as simple as the error handler, but I could live with that. > > > > The big problem is that it effectively kills the speed of your > > application. Every XML application written in Python, no matter > > what it does internally, will in the end have to produce an output > > bytestring. Normally the output encoding should be one that produces > > no unencodable characters, but you have to be prepared to handle > > them. With the error handler the complete encoding will be done > > in C code (with very infrequent calls to the error handler), so > > this scheme gives the best speed possible. > > It would give even better performance if the codec would provide > this hook in some way at C level. extern DL_IMPORT(PyObject*) PyUnicode_Encode( const Py_UNICODE *s, /* Unicode char buffer */ int size, /* number of Py_UNICODE chars to encode */ const char *encoding, /* encoding */ PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */ ); would, but that's not the point. When you use an encoding where more than 20% of the characters have to be escaped (as XML entities or whatever) you're using the wrong encoding. > Note that almost all codecs > have their own error handlers written in C already.
> > > > Since the error handling is extensible by adding new options > > > such as 'callback', > > > > I would prefer a more object oriented way of extending the error > > handling. > > Sure, but we have to assure backward compatibility as well. > > > > the existing codecs could be extended to > > > provide this functionality as well. We'd only need a way to > > > pass the callback to the codecs in some way, e.g. by using > > > a keyword argument on the constructor or by subclassing it > > > and providing a new method for the error handling in question. > > > > There is no need for a string argument 'callback' and > > an additional callback function/method that is passed to the > > encoder. When the error argument is a string, the old mechanism > > can be used, when it is a callable object the new will be used. > > This is bad style and also gives problems in the core > implementation (have a look at unicodeobject.c). I did; what is the problem with changing "const char *error" to "PyObject *error"? Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From mal@lemburg.com Wed Jan 3 20:17:59 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 03 Jan 2001 21:17:59 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> Message-ID: <3A5388F7.FA6D49DA@lemburg.com> "Walter Dörwald" wrote: > > On 22.12.00 at 19:15 M.-A. Lemburg wrote: > > > "Walter Dörwald" wrote: > > > > > > On 21.12.00 at 18:30 M.-A. Lemburg wrote: > > > > [about state in encoders and error handlers] > > > But I don't see how this internal encoder state should influence > > > what the error handler does.
There are two layers involved: The > > > character encoding layer and the "unencodable character escape > > > mechanism". Both layers are completely independent, even in your > > > "Unicode compression" example, where the "unencodable character > > > escape mechanism" is XML character entities. > > > > This is true for your XML entity escape example, but error > > resolving in general will likely need to know about the > > current state of the encoder, e.g. to be able to write data > > corresponding page in the Unicode compression example or to > > force a switch of the current page to a different one. > > How does this "Unicode compression example" look like? Please see the Unicode.org site for a description of the Unicode compression algorithm. Other encoders will likely have similar problems, e.g. ones which compress data based on locality assumptions. > > I know that error handling could be more generic, but passing > > a callable object instead of the error parameter is not an > > option since the internal APIs all use a const char parameter > > for error. > > Changing this should can be done in one or two hours for someone > who knows the Python internals. (Unfortunately I don't, I first > looked at unicodeobject.[hc] several days ago!) Sure, but it would break code and alter the Python C API in unacceptable ways. Note that all builtin C codecs would also have to be changed. If we are going to extend the error handling mechanism then we'd better do it some b/w compatible way, e.g. by providing new APIs. > > Besides, I consider such an approach a hack and not > > a solution. > > > > Instead of trying to tweak the implementation into providing > > some kind of new error scheme, let's focus on finding a generic > > framework which could provide a solution for the general case > > while not breaking the existing applications. > > Are the existing codecs (JapaneseCodecs etc.) to be considered part > of the existing applications? 
All code out there which uses the existing codecs and APIs must be considered when thinking about altering published Python C APIs. > The problem might be how to handle callbacks to C functions and > callback to Python functions in a consistent way. I.e. is it > extern DL_IMPORT(PyObject*) PyUnicode_Encode( > const Py_UNICODE *s, /* Unicode char buffer */ > int size, /* number of Py_UNICODE chars to encode */ > const char *encoding, /* encoding */ > PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */ > ); > or > extern DL_IMPORT(PyObject*) PyUnicode_Encode( > const Py_UNICODE *s, /* Unicode char buffer */ > int size, /* number of Py_UNICODE chars to encode */ > const char *encoding, /* encoding */ > PyObject *errorHandler /* error handling via Python function */ > ); The latter would be the "right" solution. > > > > Writing your own function helpers which then apply all the necessary > > > > magic is simple and doesn't warrant changing APIs in the core. > > > > > > It is not as simple as the error handler, but I could live with that. > > > > > > The big problem is that it effectively kill the speed of your > > > application. Every XML application written in Python, no matter > > > what is does internally, will in the end have to produce an output > > > bytestring. Normally the output encoding should be one that produces > > > no unencodable characters, but you have to be prepared to handle > > > them. With the error handler the complete encoding will be done > > > in C code (with very infrequent calls to the error handler), so > > > this scheme gives the best speed possible. > > > > It would give even better performance if the codec would provide > > this hook in some way at C level. 
> > extern DL_IMPORT(PyObject*) PyUnicode_Encode( > const Py_UNICODE *s, /* Unicode char buffer */ > int size, /* number of Py_UNICODE chars to encode */ > const char *encoding, /* encoding */ > PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */ > ); > would, but that's not the point. When you use an encoding where more > than 20% of the characters have to be escaped (as XML entities or whatever) > you're using the wrong encoding. That's what I was talking about all along... if it's really only for escaping XML, then a special Latin-1 or ASCII XML escaping codec would go a long way (without the troubles of using callbacks and without having to add a new error callback mechanism). Writing such a codec doesn't take much time, since the code's already there. Even better: XML escaping could be added as a new error handling option, e.g. "xml-escape" instead of "replace". Since XML escaping is general enough, I do think that adding such an option to all builtin codecs would be an acceptable and workable solution. > > Note that almost all codecs > > have their own error handlers written in C already. > > > > > > Since the error handling is extensible by adding new options > > > > such as 'callback', > > > > > > I would prefer a more object oriented way of extending the error > > > handling. > > > > Sure, but we have to assure backward compatibility as well. > > > > > > the existing codecs could be extended to > > > > provide this functionality as well. We'd only need a way to > > > > pass the callback to the codecs in some way, e.g. by using > > > > a keyword argument on the constructor or by subclassing it > > > > and providing a new method for the error handling in question. > > > > > > There is no need for a string argument 'callback' and > > > an additional callback function/method that is passed to the > > > encoder.
When the error argument is a string, the old mechanism > > > can be used, when it is a callable object the new will be used. > > > > This is bad style and also gives problems in the core > > implementation (have a look at unicodeobject.c). > > I did, what is the problem with changing "const char *error" to > "PyObject *error"? Backward compatibility. We can't change C API signatures after they have been officially published. The Python way to apply these kind of changes would be to add new extended APIs. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Thu Jan 4 01:09:23 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 4 Jan 2001 02:09:23 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: <3A5388F7.FA6D49DA@lemburg.com> (mal@lemburg.com) References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> Message-ID: <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> > > How does this "Unicode compression example" look like? > > Please see the Unicode.org site for a description of the > Unicode compression algorithm. Specifically, http://www.unicode.org/unicode/reports/tr6/ > Other encoders will likely have similar problems, e.g. ones which > compress data based on locality assumptions. Of course, the TR 6 mechanism won't have the problem at all that we are talking about - in section 5, it says # The compression scheme is capable of compressing strings containing # any Unicode character. 
so the callback for unencodable characters would never be called. Even if it *had* to preserve state (e.g. when encoding into ISO-2022), Walter's proposal is that the callback returns a Unicode object that is encoded *instead* of the original character. I have yet to see an encoding scheme that would fail under this scheme: in the ISO-2022 case, with XML character entities, the codec would know what state it is in, so it would know whether it has to switch to single-byte mode to encode the &# or not. Looking again at the TR6 mechanism: Even if the error callback was called, and even if it had to return bytes instead of unicodes, it could still operate stateless: it would just output SQU as often as required. I believe that most stateful encodings have a "escape to known state" mechanism. So I still think your objection is theoretical, whereas the problem that Walter is trying to solve is real. Regards, Martin From mal@lemburg.com Thu Jan 4 10:00:10 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 04 Jan 2001 11:00:10 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> Message-ID: <3A5449AA.14A602E0@lemburg.com> "Martin v. Loewis" wrote: > > > > How does this "Unicode compression example" look like? > > > > Please see the Unicode.org site for a description of the > > Unicode compression algorithm. > > Specifically, http://www.unicode.org/unicode/reports/tr6/ > > > Other encoders will likely have similar problems, e.g. ones which > > compress data based on locality assumptions. 
> > Of course, the TR 6 mechanism won't have the problem at all that we > are talking about - in section 5, it says > > # The compression scheme is capable of compressing strings containing > # any Unicode character. > > so the callback for unencodable characters would never be called. I just used it as example for the existence of encoders which need to preserve state. > Even if it *had* to preserve state (e.g. when encoding into ISO-2022), > Walter's proposal is that the callback returns a Unicode object that > is encoded *instead* of the original character. I have yet to see an > encoding scheme that would fail under this scheme: in the ISO-2022 > case, with XML character entities, the codec would know what state it > is in, so it would know whether it has to switch to single-byte mode > to encode the &# or not. How would such a scheme allow passing back control information such as: continue with the next character in the stream or break with an exception ? > Looking again at the TR6 mechanism: Even if the error callback was > called, and even if it had to return bytes instead of unicodes, it > could still operate stateless: it would just output SQU as often as > required. I believe that most stateful encodings have a "escape to > known state" mechanism. Which is what I'm talking about all along: the codecs know best what to do, so better extend them than try to fiddle in some information using a callback. I don't object to adding callback support to the codec's error handlers, but we'll need a new set of APIs to allow this. > So I still think your objection is theoretical, whereas the problem > that Walter is trying to solve is real. I did propose a solution which would satisfy your needs: simply add a new error treatment 'xml-escape' to the builtin codecs which then does the needed XML escaping. XML is general enough to warrant such a step and the required changes are minor. 
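[Editor's sketch: the "xml-escape" error treatment proposed above can be expressed with the error-handler registry that later Python versions provide (codecs.register_error); the handler below and the name it registers are illustrative, not the API that existed when this was written.]

```python
import codecs

def xml_escape_errors(exc):
    # Replace each unencodable character with an XML numeric
    # character reference, then resume encoding after the run.
    if isinstance(exc, UnicodeEncodeError):
        replacement = "".join("&#%d;" % ord(ch)
                              for ch in exc.object[exc.start:exc.end])
        return replacement, exc.end
    raise exc

codecs.register_error("xml-escape", xml_escape_errors)

# Unencodable characters come out as character references:
"Gr\u00fc\u00dfe \u20ac".encode("ascii", "xml-escape")  # b'Gr&#252;&#223;e &#8364;'
```

The bulk of the encoding still runs in C; the Python handler is only invoked for the unencodable runs, which is exactly the performance argument Walter makes above.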
Another candidate for a new error treatment would be 'unicode-escape' which then replaces the character in question with '\uXXXX'. For the general case, I'd rather add new PyUnicode_EncodeEx() and PyUnicode_DecodeEx() APIs which then take a Python context object as extra argument. The error treatment string would then define how to use this context object, e.g. 'callback' could be made to apply processing similar to what Walter suggested. The xxxEx() APIs will have to take special precautions to also work with pre-2.1 codecs though, since the codec API definition does not include the extra context object. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Thu Jan 4 10:41:38 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 4 Jan 2001 11:41:38 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode In-Reply-To: <3A5449AA.14A602E0@lemburg.com> (mal@lemburg.com) References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> Message-ID: <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> > How would such a scheme allow passing back control information > such as: continue with the next character in the stream or > break with an exception ? If it wanted to break with an exception, it would raise one. So the function really has two acceptable results: an exception, and a Unicode object. Since most Python functions are allowed to raise exceptions, that went without saying.
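[Editor's sketch: the contract Martin describes, where the callback either raises or returns a Unicode replacement to encode instead, might look like the following pure-Python model. encode_with_handler and the handler signature are hypothetical, invented here to illustrate Walter's proposal; they are not an API that existed.]

```python
def encode_with_handler(text, encoding, handler):
    # On an unencodable character, call handler(text, pos); whatever
    # string it returns is encoded in place of the offending character.
    # If the handler raises, the exception simply propagates.
    out = []
    for pos, ch in enumerate(text):
        try:
            out.append(ch.encode(encoding))
        except UnicodeError:
            out.append(handler(text, pos).encode(encoding))
    return b"".join(out)

# "Skip the current character" is the degenerate handler:
skip = lambda text, pos: ""

encode_with_handler("a\u20acb", "ascii", skip)  # b'ab'
```

A real implementation would of course live inside the codec's C loop and call back only on failure, rather than encoding character by character as this model does.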
> Which is what I'm talking about all along: the codecs know best > what to do, so better extend them than try to fiddle in some > information using a callback. If that means to touch the source of all codecs, then that would be an unacceptable solution. Doing it in a generic way would be ok - except that I still can't see *how* this could possibly work. > I did propose a solution which would satisfy your needs: simply > add a new error treatment 'xml-escape' to the builtin codecs > which then does the needed XML escaping. XML is general enough > to warrant such a step and the required changes are minor. Sorry, I missed that. That would also solve the problem at hand. Since nobody has come up with a different use case for a more general solution, that might be the solution which we can reasonably implement for 2.1. > Another candidate for a new error treatment would be > 'unicode-escape' which then replaces the character in question with > '\uXXXX'. +0. While that falls into the same category, I haven't seen anybody saying "I need such a feature". > For the general case, I'd rather add new PyUnicode_EncodeEx() > and PyUnicode_DecodeEx() APIs which then take a Python > context object as extra argument. The error treatment string > would then define how to use this context object, e.g. 'callback' > could be made to apply processing similar to what Walter > suggested. What other acceptable values for the string would you foresee? Regards, Martin From mal@lemburg.com Fri Jan 5 08:40:52 2001 From: mal@lemburg.com (M.-A.
Lemburg) Date: Fri, 05 Jan 2001 09:40:52 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> Message-ID: <3A558894.F2BA89F0@lemburg.com> "Martin v. Loewis" wrote: > > > How would such a scheme allow passing back control information > > such as: continue with the next character in the stream or > > break with an exception ? > > If it wanted to break with an exception, it would raise one. So the > function really has two acceptable results: an exception, and a Unicode > object. Since most Python functions are allowed to raise exceptions, > that went without saying. Sure, exceptions are not much of a problem, but how would the callback tell the encoder/decoder to e.g. skip forward 2 bytes or perhaps backward 10 bytes ? What if the callback would have to scan the stream from the beginning to find out where to continue or look ahead a few hundred bytes to find the next valid encodable sequence ? Again, you should keep in mind that the scheme has to work for all encoding/decoding work, not only conversion from and to Unicode. > > Which is what I'm talking about all along: the codecs know best > > what to do, so better extend them than try to fiddle in some > > information using a callback. > > If that means to touch the source of all codecs, then that would be an > unacceptable solution. Doing it in a generic way would be ok - except > that I still can't see *how* this could possibly work.
If we were to provide a callback as optional method to StreamReaders/Writers, the task could be done either statically by subclassing the existing codec StreamReaders/Writers or dynamically by asking the codec registry to return the StreamReader/ Writer classes. But since there aren't all that many codec implementations around (only the few in unicodeobject.c), the proposed generic solution of adding new error treatment strings would go a long way... > > I did propose a solution which would satisfy your needs: simply > > add a new error treatment 'xml-escape' to the builtin codecs > > which then does the needed XML escaping. XML is general enough > > to warrant such a step and the required changes are minor. > > Sorry, I missed that. That would also solve the problem at hand. Since > nobody has come up with a different use case for a more general > solution, that might be the solution which we can reasonably implement > for 2.1. Right. > > Another candidate for a new error treatment would be > > 'unicode-escape' which then replaces the character in question with > > '\uXXXX'. > > +0. While that falls into the same category, I haven't seen anybody > saying "I need such a feature". This would be handy for the case where you don't want to have exceptions raised, but still require some form of retaining the original data. > > For the general case, I'd rather add new PyUnicode_EncodeEx() > > and PyUnicode_DecodeEx() APIs which then take a Python > > context object as extra argument. The error treatment string > > would then define how to use this context object, e.g. 'callback' > > could be made to apply processing similar to what Walter > > suggested. > > What other acceptable values for the string would you foresee? Another option would be 'copy' which tries to simply copy input to output in case this is reasonably possible given the encoding (e.g. Unicode -> 8-bit encoding would copy all 8-bit Unicode chars as is in case no mapping is defined). 
An option 'raise' could also be valuable in conjunction with an exception context object to have the codec raise customized exceptions. Provided the context object points to another encoder/decoder, an option 'fallback' could be used to tell the codec to pass the failing input data to the alternate encoder/decoder in order to have it converted. Etc. etc. There are many things one could do with the error string. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Fri Jan 5 09:08:09 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 5 Jan 2001 10:08:09 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: <3A558894.F2BA89F0@lemburg.com> (mal@lemburg.com) References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> <3A558894.F2BA89F0@lemburg.com> Message-ID: <200101050908.f05989x01342@mira.informatik.hu-berlin.de> > Sure, exceptions are not much of a problem, but how would the > callback tell the encoder/decoder to e.g. skip forward 2 bytes or > perhaps backward 10 bytes ? First, I'd like to point out that encoding and decoding is *not* symmetric with regards to error handling, so there is *no* need to make the interfaces appear symmetric; it is rather unfortunate that Python 2 gives this impression. 
The reason for the difference is that converting from some encoding to Unicode never fails for virtually all encodings because of missing characters in Unicode - Unicode is supposed to support almost everything, and code sets that cannot completely map into Unicode probably need special attention anyway (normally, by producing a non-reversible mapping). So the callback is not needed at all for decoding. For encoding, my claim is that error callbacks never want to skip forward 2 bytes. If anything, then go forward two characters - but I can't even imagine a scenario where that would be needed. Don't try to design an API that nobody will ever use. Walter has demonstrated how to implement the "skip the current character" case: by returning u"" from the callback. > What if the callback would have to scan the stream from the > beginning to find out where to continue or look ahead a few hundred > bytes to find the next valid encodable sequence ? What would be the specific encoding, and what would be the specific error handling algorithm that would require such a service? > Again, you should keep in mind that the scheme has to work > for all encoding/decoding work, not only conversion from and > to Unicode. Why is that? That sounds like gross overgeneralization to me. Specifically, do you know anybody using that framework for anything but Unicode conversion? If so, who is that, and what is the specific application? > If we were to provide a callback as optional method to > StreamReaders/Writers, the task could be done either statically > by subclassing the existing codec StreamReaders/Writers or > dynamically by asking the codec registry to return the StreamReader/ > Writer classes. So how would the implementation of charmap_encode invoke this method? It currently doesn't even get hold of the codec object. > Another option would be 'copy' which tries to simply copy input > to output in case this is reasonably possible given the encoding > (e.g. 
Unicode -> 8-bit encoding would copy all 8-bit Unicode chars as > is in case no mapping is defined). An option 'raise' could also > be valuable in conjunction with an exception context object to have > the codec raise customized exceptions. Provided the context > object points to another encoder/decoder, an option 'fallback' > could be used to tell the codec to pass the failing input data > to the alternate encoder/decoder in order to have it converted. > Etc. etc. > > There are many things one could do with the error string. I guess my question is different: Do you consider the error string to be of a well-defined finite enumerated set of possible values, or is it your view that it is up to the codec what error strings to accept? If so, why would they have to be strings? Regards, Martin From mal@lemburg.com Fri Jan 5 09:54:07 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 05 Jan 2001 10:54:07 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> <3A558894.F2BA89F0@lemburg.com> <200101050908.f05989x01342@mira.informatik.hu-berlin.de> Message-ID: <3A5599BE.2A6CBDE2@lemburg.com> "Martin v. Loewis" wrote: > > > Sure, exceptions are not much of a problem, but how would the > > callback tell the encoder/decoder to e.g. skip forward 2 bytes or > > perhaps backward 10 bytes ? 
> > First, I'd like to point out that encoding and decoding is *not* > symmetric with regards to error handling, so there is *no* need to > make the interfaces appear symmetric; it is rather unfortunate that > Python 2 gives this impression. > > The reason for the difference is that converting from some encoding to > Unicode never fails for virtually all encodings because of missing > characters in Unicode - Unicode is supposed to support almost > everything, and code sets that cannot completely map into Unicode > probably need special attention anyway (normally, by producing a > non-reversible mapping). So the callback is not needed at all for > decoding. > > For encoding, my claim is that error callbacks never want to skip > forward 2 bytes. If anything, then go forward two characters - but I > can't even imagine a scenario where that would be needed. Don't try to > design an API that nobody will ever use. > > Walter has demonstrated how to implement the "skip the current > character" case: by returning u"" from the callback. The codec design is supposed to cover the general case of encoding/decoding arbitrary data from and to arbitrary formats. Please don't try to break everything down to Unicode<->8-bit codecs. The design should be able to cover conversion between image formats, audio formats, compression schemes and other encodings just as well as between different text formats. I agree that the case for Unicode codecs allows some simplification to the codec API design, but restricting it to this range of application only would cause us much trouble in the years to come when other codec applications start to appear in the Python universe. Other applications do have a need to jump back and forth in the data stream, e.g. say you want to decode a corrupt image file or a truncated MP3 file. 
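The broader view argued for here did partly materialize: later Python versions ship bytes-to-bytes codecs (compression and transfer encodings) that go through the same codec registry, reachable via codecs.encode/decode. A minimal sketch:

```python
import codecs

# "zlib_codec" and "base64_codec" are bytes-to-bytes codecs served by
# the same registry as the text codecs.
data = b"spam and eggs " * 100
packed = codecs.encode(data, "zlib_codec")       # compress
assert codecs.decode(packed, "zlib_codec") == data
assert len(packed) < len(data)                   # repetitive input compresses well
```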
> > What if the callback would have to scan the stream from the > > beginning to find out where to continue or look ahead a few hundred > > bytes to find the next valid encodable sequence ? > > What would be the specific encoding, and what would be the specific > error handling algorithm that would require such a service? See above. > > Again, you should keep in mind that the scheme has to work > > for all encoding/decoding work, not only conversion from and > > to Unicode. > > Why is that? That sounds like gross overgeneralization to me. > Specifically, do you know anybody using that framework for anything > but Unicode conversion? If so, who is that, and what is the specific > application? I am planning to add compression codecs based on zlib and possibly cryptographic codecs which can then be used together with stackable streams to provide seamless compression and/or encryption to applications which otherwise do not provide this functionality. > > If we were to provide a callback as optional method to > > StreamReaders/Writers, the task could be done either statically > > by subclassing the existing codec StreamReaders/Writers or > > dynamically by asking the codec registry to return the StreamReader/ > > Writer classes. > > So how would the implementation of charmap_encode invoke this method? > It currently doesn't even get hold of the codec object. Through the extended API I proposed earlier on: the extra context object would allow providing a callback mechanism. Alternatively, the StreamReader/Writer classes could use their own specific C coding functions. > > Another option would be 'copy' which tries to simply copy input > > to output in case this is reasonably possible given the encoding > > (e.g.
Provided the context > > object points to another encoder/decoder, an option 'fallback' > > could be used to tell the codec to pass the failing input data > > to the alternate encoder/decoder in order to have it converted. > > Etc. etc. > > > > There are many things one could do with the error string. > > I guess my question is different: Do you consider the error string to > be of a well-defined finite enumerated set of possible values, or is > it your view that it is up to the codec what error strings to accept? Exactly. There is a set of error strings which the codec must accept, but it is free to also implement other schemes as well. > If so, why would they have to be strings? I chose strings to simplify the implementation. Back when the design was discussed, we figured that the codec should take care of the error handling. Python's codec design is one of the few which does allow setting error handling behaviour -- other implementations tend to simply raise an exception and leave the user in the dark. It's too late to *change* the design. We can only extend it. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Fri Jan 5 21:00:25 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Fri, 5 Jan 2001 22:00:25 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: <3A5599BE.2A6CBDE2@lemburg.com> (mal@lemburg.com) References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> <3A558894.F2BA89F0@lemburg.com> <200101050908.f05989x01342@mira.informatik.hu-berlin.de> <3A5599BE.2A6CBDE2@lemburg.com> Message-ID: <200101052100.f05L0Pt01067@mira.informatik.hu-berlin.de> > The codec design is supposed to cover the general case of > encoding/decoding arbitrary data from and to arbitrary formats. Where is it documented as such? I believe it is wishful thinking to assume they cover some general case, although I have to acknowledge that *your* wish is more relevant than other people's wishes. > Please don't try to break everything down to Unicode<->8-bit > codecs. The design should be able to cover conversion between > image formats, audio formats, compression schemes and other > encodings just as well as between different text formats. Is there any precedent that it is actually useful for anything else? > I agree that the case for Unicode codecs allows some simplification > to the codec API design, but restricting it to this range of > application only would cause us much trouble in the years to come > when other codec applications start to appear in the Python > universe. Well, there are a number of codec applications in the Python universe already (e.g. uuencode/base64, various graphics format converters, compression modules); none of which uses the codec module. 
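The kind of stream stacking Martin points to — a compression layer on top of an arbitrary file-like object, with no codec registry involved — works like this (the buffer and payload are arbitrary examples):

```python
import gzip
import io

# gzip.GzipFile wraps any file-like object via the fileobj argument.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as layer:
    layer.write(b"hello world " * 50)

buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as layer:
    restored = layer.read()

assert restored == b"hello world " * 50
```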
I firmly believe that they shouldn't - I'd rather have a good solution for each single problem than a mediocre solution that also solves unrelated problems. > Other applications do have a need to jump back and forth in > the data stream, e.g. say you want to decode a corrupt image > file or a truncated MP3 file. Then they also need special API for that; your codec framework will be useless. > I am planning to add compression codecs based on zlib and > possibly cryptographic codecs which can then be used together > with stackable streams to provide seamless compression and/or > encryption to applications which otherwise do not provide this > functionality. Which application do you want to enhance with that functionality? To support writing compressed files, you just use gzip.open; or gzip.GzipFile(fileobj=mystream) if you want to operate on a stream instead of a named file. > > > If we were to provide a callback as optional method to > > > StreamReaders/Writers, the task could be done either statically > > > by subclassing the existing codec StreamReaders/Writers or > > > dynamically by asking the codec registry to return the StreamReader/ > > > Writer classes. > > > > So how would the implementation of charmap_encode invoke this method? > > It currently doesn't even get hold of the codec object. > > Through the extended API I proposed earlier on: the extra context > object would allow providing a callback mechanism. Alternatively, > the StreamReader/Writer classes could use their own specific > C coding functions. Was there some detailed proposal of an API? I don't recall that; could you kindly point me to the message in the archives which elaborates that proposal? Specifically, as an author of an application that wants to extend existing codecs, could you post some Python code that shows how to create the context objects (including an implementation of the codec object's class), and how to pass it to Unicodeobject.encode? > Exactly.
There is a set of error strings which the codec > must accept, but it is free to also implement other schemes > as well. Ok, the guaranteed error strings being 'strict','ignore' and 'replace'. > I chose strings to simplify the implementation. Back when the > design was discussed, we figured that the codec should take > care of the error handling. Python's codec design is one of > the few which does allow setting error handling behaviour -- > other implementations tend to simply raise an exception and leave > the user in the dark. > > It's too late to *change* the design. We can only extend it. It's too late to change the *API*, the design of it can be changed as long as the current API still emerges as a special case. That's what Walter's proposal does: The API is extended to allow callable objects as the error parameter, and three well-known constants are provided (codecs.{STRICT|IGNORE|REPLACE}). Regards, Martin From mal@lemburg.com Sat Jan 6 15:32:10 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 06 Jan 2001 16:32:10 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> <3A558894.F2BA89F0@lemburg.com> <200101050908.f05989x01342@mira.informatik.hu-berlin.de> <3A5599BE.2A6CBDE2@lemburg.com> <200101052100.f05L0Pt01067@mira.informatik.hu-berlin.de> Message-ID: <3A573A7A.A596C068@lemburg.com> "Martin v. Loewis" wrote: > > > The codec design is supposed to cover the general case of > > encoding/decoding arbitrary data from and to arbitrary formats.
> > Where is it documented as such? I believe it is wishful thinking to > assume they cover some general case, although I have to acknowledge > that *your* wish is more relevant than other people's wishes. Please see Misc/unicode.txt for details. I tried to design the interface with a larger application range in mind and that's what I will continue to argue for, obviously ;-) > [ranting about the codec design being useless for other applications] I don't see the point in trying to argue for uselessness of an existing design. If you want your own design, then nobody will stop you from rolling your own. > > > > If we were to provide a callback as optional method to > > > > StreamReaders/Writers, the task could be done either statically > > > > by subclassing the existing codec StreamReaders/Writers or > > > > dynamically by asking the codec registry to return the StreamReader/ > > > > Writer classes. > > > > > > So how would the implementation of charmap_encode invoke this method? > > > It currently doesn't even get hold of the codec object. > > > > Through the extended API I proposed earlier on: the extra context > > object would allow providing a callback mechanism. Alternatively, > > the StreamReader/Writer classes could use their own specific > > C coding functions. > > Was there some detailed proposal of an API? I don't recall that; could > you kindly point me to the message in the archives which elaborates > that proposal? There wasn't a detailed proposal, only a design idea... """ For the general case, I'd rather add new PyUnicode_EncodeEx() and PyUnicode_DecodeEx() APIs which then take a Python context object as extra argument. The error treatment string would then define how to use this context object, e.g. 'callback' could be made to apply processing similar to what Walter suggested. The xxxEx() APIs will have to take special precautions to also work with pre-2.1 codecs though, since the codec API definition does not include the extra context object.
""" > Specifically, as an author of an application that wants to extend > existing codecs, could you post some Python code that shows how to > create the context objects (including an implementation of the codec > object's class), and how to pass it to Unicodeobject.encode? Sure, but only *after* the context object design has implemented.. otherwise there wouldn't be a point ;-) > > Exactly. There is a set of error strings which the codec > > must accept, but it is free to also implement other schemes > > as well. > > Ok, the guaranteed error strings being 'strict','ignore' and > 'replace'. Right. > > I chose strings to simplify the implementation. Back when the > > design was discussed, we figured that the codec should take > > care of the error handling. Python's codec design is one of > > the few which does allow setting error handling behaviour -- > > other implementations tend to simply raise an exception and leave > > the user in the dark. > > > > It's too late to *change* the design. We can only extend it. > > It's too late to change the *API*, the design of it can be changed as > long as the current API still emerges as a special case. That's what > Walter's proposal does: The API is extended to allow callable objects > as the eror parameter, and three well-known constants are > provided (codecs.{STRICT|IGNORE|REPLACE}). No, it does not: the error string parameter is defined as "const char*". You can't change that to PyObject* in the C API and for the Python API I wouldn't want to introduce "switch semantics on type" variables. Extending APIs is OK, changing them is not. I'll right a patch which implements the 'xml-escape' error treatment. 
Hopefully that will buy us some time to think of a design extension -- provided you play along :-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Sat Jan 6 18:48:02 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 6 Jan 2001 19:48:02 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: <3A573A7A.A596C068@lemburg.com> (mal@lemburg.com) References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> <3A558894.F2BA89F0@lemburg.com> <200101050908.f05989x01342@mira.informatik.hu-berlin.de> <3A5599BE.2A6CBDE2@lemburg.com> <200101052100.f05L0Pt01067@mira.informatik.hu-berlin.de> <3A573A7A.A596C068@lemburg.com> Message-ID: <200101061848.f06Im2v04223@mira.informatik.hu-berlin.de> > I don't see the point in trying to argue for uselessness of > an existing design. If you want your own design, then nobody > will stop you from rolling your own. The design exists only on paper. What really matters is the API and the implementation. I could not care less about the design, but you bring it up to argue why the implementation should not be changed. I don't want my own design, I want to enhance the API. > > > > So how would the implementation of charmap_encode invoke this method? > > > > It currently doesn't even get hold of the codec object. [...] > There wasn't a detailed proposal, only a design idea...
That's one of the major problems here, IMO. If there was a specific proposal, it would be possible to evaluate whether it meets the requirements. Instead, you use "design ideas" to claim that some other specific proposal which we already have is a bad thing, and that the design could be much more general. That is not very convincing, as apparently nobody can follow your design to really understand whether what you claim is true. > For the general case, I'd rather add new PyUnicode_EncodeEx() > and PyUnicode_DecodeEx() APIs which then take a Python > context object as extra argument. The error treatment string > would then define how to use this context object, e.g. 'callback' > could be made to apply processing similar to what Walter > suggested. Ok, PyUnicode_EncodeEx would then invoke PyCodec_EncodeEx, which would eventually end up in encodings.koi8_r.Codec.encode (or encoding.koi8_r.Codec.encode_ex?). Now, how would that be implemented? > The xxxEx() APIs will have to take special precautions to also > work with pre-2.1 codecs though, since the codec API definition > does not include the extra context object. In the specific case of KOI8-R, what would these precautions look like, specifically, using, say, Python as a notation? > > Specifically, as an author of an application that wants to extend > > existing codecs, could you post some Python code that shows how to > > create the context objects (including an implementation of the codec > > object's class), and how to pass it to Unicodeobject.encode? > > Sure, but only *after* the context object design has been implemented... > otherwise there wouldn't be a point ;-) So you want to implement it first, and discuss use cases later??? Or maybe you don't want to discuss the design at all? > No, it does not: the error string parameter is defined as "const char*". You mean, in PyUnicode_FromEncodedObject, PyUnicode_Decode, and other C functions?
So you would have to provide additional functions in the C API, but that is the same as your proposal with the *Ex functions, as I understand it. > You can't change that to PyObject* in the C API and for the Python API > I wouldn't want to introduce "switch semantics on type" variables. Ah, but it's 'switch semantics on value' :-) If you pass the string 'ignore', it has a different semantics than passing 'replace', which again has a different semantic than passing codecs.REPLACE_WITH_XML_CHARACTER_ENTITIES, which happens to be callable. > Extending APIs is OK, changing them is not. That just is an extension. For the C interface, it apparently means duplication; for the Python interface, we can keep the old signatures and extend the acceptable parameter values. > I'll right a patch which implements the 'xml-escape' error > treatment. Hopefully that will buy us some time to think of > a design extension -- provided you play along :-) Good. I'm willing to agree on any proposal once I can see that it does what it was designed for... Regards, Martin From andy@reportlab.com Sat Jan 6 23:26:45 2001 From: andy@reportlab.com (Andy Robinson) Date: Sat, 6 Jan 2001 23:26:45 -0000 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: <200101061848.f06Im2v04223@mira.informatik.hu-berlin.de> Message-ID: >> The codec design is supposed to cover the general case of >> encoding/decoding arbitrary data from and to arbitrary formats. > > Where is it documented as such? I believe it is wishful thinking to > assume they cover some general case, although I have to acknowledge > that *your* wish is more relevant than other people's wishes. > >> Please don't try to break everything down to Unicode<->8-bit >> codecs. The design should be able to cover conversion between >> image formats, audio formats, compression schemes and other >> encodings just as well as between different text formats. 
> Is there any precedent that it is actually useful for > anything else? I'm trying to catch up on this thread after a long absence. I have not been able to do any i18n work this year and cannot give any opinions on the error handling details, but I must comment on these paragraphs. There was a great deal of discussion about keeping the codec mechanism general-purpose on the python-dev list when the unicode proposal was first put together. This came from two directions: (1) I argued long and hard then that i18n is not just Unicode; there are many legacy problems where you want to be able to write codecs to go direct from one native encoding to another without going through Unicode. They are never needed in the case of perfectly encoded data, but this need is pressing if you have to deal with and clean up large amounts of misencoded data, user-defined characters etc. I spent a year of my life on a very complex i18n project, corresponded with Ken Lunde and many other developers in the field, and got the same feedback from the developers at Digital Garage in Tokyo, who deal with this every day. The key requirements I had were that (a) the API should not be limited to Unicode <--> 8-bit, and (b) you should be able to extend codec mappings and algorithms without needing a C compiler every time. I can provide lots of use cases if needed but they are hard to follow if you don't know a little Japanese. (2) there was much interest in the Java concept of 'stackable streams' and stream conversion tools. The general case is clearly a stream of bytes, and Unicode strings are one case of these. Several of us also felt that with the right little state machine in the codec package, you could do very powerful things in different spheres like compression, binary encodings like base 64/85/whatever. Guido played a large part in the discussions and, I believe, he fully understood and echoed the design goal you question at the top.
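The 'stackable streams' idea described above is visible in the codecs StreamWriter/StreamReader API; a minimal sketch stacking a UTF-8 writer on a plain byte buffer:

```python
import codecs
import io

# codecs.getwriter returns a StreamWriter class that stacks a text
# encoder on top of any byte stream.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.write(u"gr\xfc\xdfe")        # "grüße"

# The underlying stream received the UTF-8 encoded bytes.
assert raw.getvalue() == b"gr\xc3\xbc\xc3\x9fe"
```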
Since then, Marc-Andre has done a fantastic amount of largely unpaid work, but I have not been able to follow up with the work I wanted to do on Asian codecs. If I had, you'd have plenty of use cases for keeping things general purpose. I am however confident that whenever we get around to building the right codec package (which depends a lot on when ReportLab gets its first Asian customers), people in the field will see Python's i18n support is way ahead of that of Java. Regards, Andy Robinson (still flat out keeping a startup going and failing to do my duties as sig moderator, sadly) From martin@loewis.home.cs.tu-berlin.de Sun Jan 7 10:09:53 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sun, 7 Jan 2001 11:09:53 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: References: Message-ID: <200101071009.f07A9rB01152@mira.informatik.hu-berlin.de> [need for codecs to go direct from one native encoding to another] > I spent a year of my life on a very complex i18n project, > corresponded with Ken Lunde and many other developers in the field, > and got the same feedback from the developers at Digital Garage in > Tokyo, who deal with this every day. I then have to accept that this really happens in life, although I surely hope that the cases where it is necessary to have such cases become more and more rare. Can you elaborate a bit what the problem was in this complex project? I.e. which were the encodings A and B that you needed direct conversion for? Why couldn't you go through Unicode? If the reason was that you could not have "correctly" recoded a certain subset of the characters, then which characters would have suffered? > The key requirements I had were that (a) the API should not be > limited to Unicode <--> 8-bit, and I believe that requirement is not completely answered.
If you want to get from A to B, and both a and b are byte-oriented encodings, then the API offers b = a.encode("AtoB") First, you need a codec name that describes both source and target encoding; for the Unicode codecs, you only need one encoding in the codec name. However, that API does not work: The encode method of a byte string assumes that the string is in the system encoding. It first tries to decode the string into a Unicode object, then takes the codec name as one going from Unicode to the target. So instead, you have to write enc,dec,_,_ = codecs.lookup("AtoB") b,_ = enc(a) That assumes that you first had registered your codec: import AtoB,codecs codecs.register(AtoB.lookup) In this case, it would be easier *not* to use the framework: import AtoB b = AtoB.encode(a) > (b) you should be able to extend codec mappings and algorithms > without needing a C compiler every time. I don't know what you mean by "extend codec mappings". If you want to register codecs written in Python and use it from C, that works very well. If you want to enhance an existing codec to support additional characters, or to partially replace the output of an existing codec - well, that is surely not available, and the matter of the current debate: It is currently not possible to enhance an existing codec so that it would produce &#4567; if U+4567 is not supported in the target encoding. > I can provide lots of use cases if needed but they are hard to follow if you don't know a little Japanese. Please assume I know a little Japanese, and present a single use case. Since that would be mainly to satisfy my curiosity: don't if that would be a longer essay. > (2) there was much interest in the Java concept of 'stackable > streams' and stream conversion tools. The general case is > clearly a stream of bytes, and Unicode strings are one > case of these.
Several of us also felt that with the right > little state machine in the codec package, you could do very > powerful things in different spheres like compression, binary > encodings like base 64/85/whatever. > > Guido played a large part in the discussions and, I believe he > fully understood and echoed the design goal you question > at the top. Indeed, that's what I question. Stackable things always look like a good idea on paper, so people can be easily talked into approving them. I'm not quite clear why the file API doesn't already provide stackable streams, in fact, gzip.GzipFile is a demonstration that this is really possible. The question is whether anybody currently *has* written codecs that don't deal with strings, yet use the codec interfaces. My claim is that you never want to 'stack' more than one stream on top of another. People are then happy with whatever stacking API the codec offers. My concern is not so much the existence of the API, but that it is taken as a rationale for preventing improvements of the usability of the Unicode library. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Mon Jan 8 08:44:44 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Mon, 8 Jan 2001 09:44:44 +0100 Subject: [I18n-sig] iconv codec Message-ID: <200101080844.f088iii02150@mira.informatik.hu-berlin.de> I have checked-in an iconv codec into the practicecodecs/iconv directory on SF. It has been tested only on Linux so far; if you have any problems with it, or other comments, please let me know. Regards, Martin From mal@lemburg.com Mon Jan 8 15:52:14 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 08 Jan 2001 16:52:14 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode References: <200101071009.f07A9rB01152@mira.informatik.hu-berlin.de> Message-ID: <3A59E22E.349B0981@lemburg.com> Martin, what is the point of these endless discussions about use-cases (which you seem esp. fond of ;), design vs.
API, Walter's proposal and whether or not the codec design covers more general cases than just encoding and decoding from and to Unicode ? These discussions don't get us anywhere. To summarize: * the codec design was discussed at length early last year * the design was chosen after many useful suggestions from people who know what codecs have to deal with (e.g. Andy, Fredrik (from the PIL-perspective BTW)) and others * the design is written down in Misc/unicode.txt * extending the design is OK, breaking APIs is not * extending the design by adding parameters is OK, extending the design by switching on parameter type is not * I have no problem with extending the design * Walter's proposal breaks the Unicode C API in intolerable ways; I agree that the general idea is worth pursuing though and Walter's proposal has some good ideas in that direction So where are we heading ? * I will start to code a new error treatment option 'xml-escape' which can then also be used as basis for other escape techniques which might be of general use (e.g. 'unicode-escape') * we should start thinking of ways to extend the existing C API to allow providing a context object to the encoder/decoder.
I've already made a few suggestions in that direction; more are to come once I find more time to work on this; other suggestions are, of course, welcome too * the new error handler extensions will be a post-2.1 feature * a PEP is needed for the design (most people don't read endless threads like these to catch up) What the PEP should include: * a proposal for extending the Unicode C API to provide an extra context object to the encoder/decoder functions (which are otherwise stateless) * a hook for StreamWriters/Readers to use as standard error handler in case 'callback' is used as error handling option * the Python APIs .encode() and unicode() should be extended by a third optional argument: the context object * all builtin codecs should be extended to handle the new scheme * Codec.encode and .decode APIs should allow a context object as additional optional argument; default should be None * the changes must be 100% backward compatible, both at C and at Python level -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From walter@livinglogic.de Mon Jan 8 18:25:15 2001 From: walter@livinglogic.de (=?ISO-8859-1?Q?=22Walter_D=F6rwald=22?=) Date: Mon, 08 Jan 2001 19:25:15 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: <3A5388F7.FA6D49DA@lemburg.com> References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> Message-ID: <200101081925150671.00529F1F@mail.livinglogic.de> On 03.01.01 at 21:17 M.-A.
Lemburg wrote:

> [ Unicode compression example ]
>
> > > I know that error handling could be more generic, but passing
> > > a callable object instead of the error parameter is not an
> > > option since the internal APIs all use a const char parameter
> > > for error.
> >
> > Changing this could be done in one or two hours for someone
> > who knows the Python internals. (Unfortunately I don't, I first
> > looked at unicodeobject.[hc] several days ago!)
>
> Sure, but it would break code and alter the Python C API
> in unacceptable ways. Note that all builtin C codecs would
> also have to be changed.
>
> If we are going to extend the error handling mechanism then
> we'd better do it some b/w compatible way, e.g. by providing
> new APIs.

But I don't think that can be done in a completely backward compatible way. At least the codecs will have to be changed.

> [...]
>
> > extern DL_IMPORT(PyObject*) PyUnicode_Encode(
> >     const Py_UNICODE *s,     /* Unicode char buffer */
> >     int size,                /* number of Py_UNICODE chars to encode */
> >     const char *encoding,    /* encoding */
> >     PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */
> > );
> > would, but that's not the point. When you use an encoding where more
> > than 20% of the characters have to be escaped (as XML entities or whatever)
> > you're using the wrong encoding.
>
> That's what I was talking about all along... if it's really
> only for escaping XML, then a special Latin-1 or ASCII XML escaping
> codec would go a long way (without the troubles of using callbacks
> and without having to add a new error callback mechanism).

But I would like to have an escaping mechanism that can be used with any encoding, not just latin1 + xml-escape and ascii + xml-escape, but also shift-jis + xml-escape, euc + xml-escape, koi8 + xml-escape, ...

> Writing such a codec doesn't take much time, since the
> code's already there.
Even better: XML escaping could be added
> as a new error handling option, e.g. "xml-escape" instead of
> "replace".
> Since XML escaping is general enough, I do think that adding
> such an option to all builtin codecs would be an acceptable
> and workable solution.

But that method has two problems: handling "xml-escape" has to be implemented in every codec, and it only solves one problem: escaping via numeric (decimal) XML character entities. What if I want an output where "ß" is escaped as "&szlig;" and not "&#223;"? And maybe I define my own entities, so that "あ" will be written as "&hiraA;"? Another use case: when such a string is written to the terminal (encoded with sys.getdefaultencoding()), I want to highlight the character entities, so I have to put ANSI escape sequences around the escaped character. Implementing all of this in all the codecs would be a lot of work, and it is definitely nothing that should be part of the codecs, because it is too application specific.

> [...]

Bye, Walter Dörwald

-- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de

From walter@livinglogic.de Mon Jan 8 18:59:43 2001 From: walter@livinglogic.de (=?us-ascii?Q?=22Walter_D=F6rwald=22?=) Date: Mon, 08 Jan 2001 19:59:43 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode In-Reply-To: <3A5449AA.14A602E0@lemburg.com> References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> Message-ID: <200101081959430656.00722D2F@mail.livinglogic.de> On 04.01.01 at 11:00 M.-A. Lemburg wrote:

> [...]
> > Even if it *had* to preserve state (e.g.
when encoding into ISO-2022),
> > Walter's proposal is that the callback returns a Unicode object that
> > is encoded *instead* of the original character. I have yet to see an
> > encoding scheme that would fail under this scheme: in the ISO-2022
> > case, with XML character entities, the codec would know what state it
> > is in, so it would know whether it has to switch to single-byte mode
> > to encode the &# or not.
>
> How would such a scheme allow passing back control information
> such as: continue with the next character in the stream

def ignore(encoding, string, position):
    return u""

u"xxx".encode(encoding, 'callback', ignore)

> or break with an exception ?

def raiseAnException(encoding, string, position):
    raise FancyException("can't encode character %r at position %d in string %r with encoding %s"
                         % (string[position], position, string, encoding))

u"xxx".encode(encoding, 'callback', raiseAnException)

> > Looking again at the TR6 mechanism: Even if the error callback was
> > called, and even if it had to return bytes instead of unicodes, it
> > could still operate stateless: it would just output SQU as often as
> > required. I believe that most stateful encodings have an "escape to
> > known state" mechanism.
>
> Which is what I'm talking about all along: the codecs know best
> what to do, so better extend them than try to fiddle in some
> information using a callback.

The callback is only used in the situation when the codec does not know what to do, i.e. when it encounters an unencodable character. The callback is an *error handler* and not an "I don't know how to implement my own encoding algorithm, please help me"-handler. >;->

> I don't object to adding callback support to the codec's
> error handlers, but we'll need a new set of APIs to allow
> this.

I could live with a u"xxx".encode(encoding, 'callback', handler) on the Python side, but what does this mean for the C API?
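For readers coming to this thread from a later Python: the `'callback'` error option sketched above never shipped in exactly this form, but PEP 293 (implemented in Python 2.3) added an equivalent mechanism, `codecs.register_error()`, in which the handler receives the `UnicodeEncodeError` and returns a replacement string plus a resume position. A sketch of the same ignore-style handler in that later API; the handler name "demo-ignore" is invented for this example:

```python
import codecs

def demo_ignore(exc):
    # Equivalent of the ignore() callback above: only handle encode errors,
    # return an empty replacement, and resume after the failing character.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    return (u"", exc.end)

codecs.register_error("demo-ignore", demo_ignore)

# The unencodable Euro sign is silently dropped:
print(u"a\u20acb".encode("ascii", "demo-ignore"))  # b'ab'
```

The raise variant comes for free: a handler that raises its own exception simply propagates it out of .encode().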
> > So I still think your objection is theoretical, whereas the problem
> > that Walter is trying to solve is real.
>
> I did propose a solution which would satisfy your needs: simply
> add a new error treatment 'xml-escape' to the builtin codecs
> which then does the needed XML escaping. XML is general enough
> to warrant such a step and the required changes are minor.
>
> Another candidate for a new error treatment would be 'unicode-escape'
> which then replaces the character in question with '\uXXXX'.
>
> For the general case, I'd rather add new PyUnicode_EncodeEx()
> and PyUnicode_DecodeEx() APIs which then take a Python
> context object as extra argument.

What should this extra argument be for the decoder?

> The error treatment string
> would then define how to use this context object, e.g. 'callback'
> could be made to apply processing similar to what Walter
> suggested.

'callback' seems too generic to me. Maybe there will be other callbacks in the future that are used for different stuff. This is the "give me a replacement or die" error handler.

> The xxxEx() APIs will have to take special precautions to also
> work with pre-2.1 codecs though, since the codec API definition
> does not include the extra context object.

Bye, Walter Dörwald

-- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de

From martin@loewis.home.cs.tu-berlin.de Mon Jan 8 22:43:07 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Mon, 8 Jan 2001 23:43:07 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode In-Reply-To: <3A59E22E.349B0981@lemburg.com> (mal@lemburg.com) References: <200101071009.f07A9rB01152@mira.informatik.hu-berlin.de> <3A59E22E.349B0981@lemburg.com> Message-ID: <200101082243.f08Mh7l00855@mira.informatik.hu-berlin.de>

> These discussions don't get us anywhere.

I'd surely hoped they would, but I realize that this is not possible. I don't agree with your summary, but we can probably leave it at that.
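As historical context for the exchange above: the 'xml-escape' and 'unicode-escape' error treatments proposed here did eventually appear, under different names, as the built-in 'xmlcharrefreplace' and 'backslashreplace' error handlers added by PEP 293 in Python 2.3. An illustration of both:

```python
# The two error treatments discussed above, as they later shipped:
# 'xmlcharrefreplace' emits decimal XML character references, and
# 'backslashreplace' emits Python-style backslash escapes.
s = u"gr\u00fc\u00dfe"  # "grüße"

print(s.encode("ascii", "xmlcharrefreplace"))  # b'gr&#252;&#223;e'
print(s.encode("ascii", "backslashreplace"))   # b'gr\\xfc\\xdfe'
```

Unlike a per-codec 'xml-escape' option, these are ordinary error handlers, so they work with every codec that supports the error-callback protocol.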
Regards, Martin

From andy@reportlab.com Tue Jan 9 08:49:29 2001 From: andy@reportlab.com (Andy Robinson) Date: Tue, 9 Jan 2001 08:49:29 -0000 Subject: [I18n-sig] PEP needed In-Reply-To: <200101082243.f08Mh7l00855@mira.informatik.hu-berlin.de> Message-ID:

>
> > These discussions don't get us anywhere.
>
> I'd surely hoped they would, but I realize that this is not
> possible. I don't agree with your summary, but we can probably leave
> it at that.
>
> Regards,
> Martin

I think Marc-Andre's suggestion of a PEP is an excellent one. Martin, why not try to produce something like this which starts at the very beginning? Explain the problems you are trying to solve, in PEP format; give code snippets of what you have to do now, why it doesn't work, and how you would like it to work. Then we can all get involved, and even ask Guido if we need to. But we can't expect him or anyone else to give an opinion without a PEP.

I don't have time to trawl through the emails, and I certainly feel a need for a summary of this debate. Since only 2-3 people are involved, I guess no one else has found the time either.

For anyone not familiar with these, Python Enhancement Proposals (PEPs) are a standard form of document used to record Python design decisions. They were introduced to save Guido time and give everyone something to discuss without having to trawl through months of emails. They can all be found at http://cvs.sourceforge.net/cgi-bin/cvsweb.cgi/python/nondist/peps/?cvsroot=python

Thanks,

Andy the pointy-haired manager

p.s. I will finish off my 'use cases' in the next couple of days; I have a very big deadline today and have had no time.
From kajiyama@grad.sccs.chukyo-u.ac.jp Tue Jan 9 23:40:21 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Wed, 10 Jan 2001 08:40:21 +0900 Subject: [I18n-sig] iconv codec In-Reply-To: <200101080844.f088iii02150@mira.informatik.hu-berlin.de> (martin@loewis.home.cs.tu-berlin.de) Message-ID: <200101092340.IAA09353@dhcp234.grad.sccs.chukyo-u.ac.jp>

Martin v. Loewis wrote:
|
| I have checked-in an iconv codec into the practicecodecs/iconv
| directory on SF.

Cool.

| It has been tested only on Linux so far; if you have
| any problems with it, or other comments, please let me know.

I've tested the iconv codec (checked out last night) on two Linux boxes of mine, one with glibc-2.1.2 and the other with old libc5 plus libiconv-1.5.1 (http://clisp.cons.org/~haible/packages-libiconv.html). I have the following error messages on both platforms:

Python 2.0 (#1, Oct 27 2000, 00:27:59) [GCC 2.7.2.3] on linux2
>>> import iconvcodec
>>> unicode("test","euc-jp")
Traceback (most recent call last):
  File "", line 1, in ?
  File "iconvcodec.py", line 50, in decode
    return self.decoder.iconv(msg, return_unicode=1),len(msg)
SystemError: new style getargs format but argument is not a tuple
>>> u"test".encode("euc-jp")
Traceback (most recent call last):
  File "", line 1, in ?
  File "iconvcodec.py", line 19, in encode
    return self.encoder.iconv(msg),len(msg)
SystemError: new style getargs format but argument is not a tuple
>>>

What goes wrong?

-- KAJIYAMA, Tamito

From martin@loewis.home.cs.tu-berlin.de Wed Jan 10 07:37:13 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v.
Loewis) Date: Wed, 10 Jan 2001 08:37:13 +0100 Subject: [I18n-sig] iconv codec In-Reply-To: <200101092340.IAA09353@dhcp234.grad.sccs.chukyo-u.ac.jp> (message from Tamito KAJIYAMA on Wed, 10 Jan 2001 08:40:21 +0900) References: <200101092340.IAA09353@dhcp234.grad.sccs.chukyo-u.ac.jp> Message-ID: <200101100737.f0A7bDe00912@mira.informatik.hu-berlin.de>

> SystemError: new style getargs format but argument is not a tuple
> >>>
>
> What goes wrong?

Thanks for the report. Iconv_iconv should have used METH_VARARGS|METH_KEYWORDS, but was using only METH_KEYWORDS. Please update your tree and try again. I don't know why this was no problem with the CVS Python.

Regards, Martin

From kajiyama@grad.sccs.chukyo-u.ac.jp Wed Jan 10 08:32:44 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Wed, 10 Jan 2001 17:32:44 +0900 Subject: [I18n-sig] iconv codec In-Reply-To: <200101100737.f0A7bDe00912@mira.informatik.hu-berlin.de> (martin@loewis.home.cs.tu-berlin.de) References: <200101100737.f0A7bDe00912@mira.informatik.hu-berlin.de> Message-ID: <200101100832.RAA10608@dhcp234.grad.sccs.chukyo-u.ac.jp>

Martin v. Loewis wrote:
|
| > SystemError: new style getargs format but argument is not a tuple
| > >>>
| >
| > What goes wrong?
|
| Thanks for the report. Iconv_iconv should have used
| METH_VARARGS|METH_KEYWORDS, but was using only METH_KEYWORDS. Please
| update your tree and try again. I don't know why this was no problem
| with the CVS Python.

It works both with glibc-2.1.2 and with libiconv-1.5.1. Thanks.

FYI: I've modified setup.py in the following way to build the iconv codec with an old libc5 and libiconv. Two iconv libraries are statically linked so that iconvmodule.so can be imported without relying on the LD_LIBRARY_PATH environment variable. The prefix /opt/libiconv-1.5.1 should be changed appropriately. I could not figure out a way to achieve the same things without modifying the setup.py script (possible?).
--- setup.py.orig	Wed Jan 10 08:50:56 2001
+++ setup.py	Wed Jan 10 09:01:22 2001
@@ -14,6 +14,9 @@
     """,
     py_modules = ['iconvcodec'],
-    ext_modules = [Extension("iconv",sources=["iconvmodule.c"])]
+    ext_modules = [Extension("iconv",sources=["iconvmodule.c"],
+                             include_dirs=["/opt/libiconv-1.5.1/include"],
+                             extra_objects=["/opt/libiconv-1.5.1/lib/libiconv.a",
+                                            "/opt/libiconv-1.5.1/lib/libcharset.a"])]
 )

Regards,

-- KAJIYAMA, Tamito

From martin@loewis.home.cs.tu-berlin.de Wed Jan 10 21:45:16 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 10 Jan 2001 22:45:16 +0100 Subject: [I18n-sig] iconv codec In-Reply-To: <200101100832.RAA10608@dhcp234.grad.sccs.chukyo-u.ac.jp> (message from Tamito KAJIYAMA on Wed, 10 Jan 2001 17:32:44 +0900) References: <200101100737.f0A7bDe00912@mira.informatik.hu-berlin.de> <200101100832.RAA10608@dhcp234.grad.sccs.chukyo-u.ac.jp> Message-ID: <200101102145.f0ALjGH01465@mira.informatik.hu-berlin.de>

> FYI: I've modified setup.py in the following way to build the
> iconv codec with an old libc5 and libiconv. Two iconv libraries
> are statically linked so that iconvmodule.so can be imported
> without relying on the LD_LIBRARY_PATH environment variable.
> The prefix /opt/libiconv-1.5.1 should be changed appropriately.

Is that a standard location as provided by some Linux distributor? If so, we could check whether some specific files are there, and then automatically add them as extra objects. If you can find a patch (e.g. using os.path.exists) that detects your configuration (and perhaps the default /usr/local installation), feel free to check that into the CVS.

As for linking statically vs dynamically: If you give the extension a runtime_library_dirs attribute, the resulting extension will find its shared libraries in these directories; this is achieved through the -R linker option. Of course, if the shared library is in /usr/local/lib, it'll be found anyway.
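A sketch of the dynamic-linking alternative Martin describes, rather than statically linking via extra_objects: the runtime_library_dirs attribute is set directly on the Extension. The /opt prefix below is Kajiyama's personal install location, not a standard path, and on current Python (3.12+) the same Extension class lives in setuptools rather than distutils.

```python
from distutils.core import Extension  # on Python >= 3.12: from setuptools import Extension

PREFIX = "/opt/libiconv-1.5.1"  # non-standard, per-user install prefix

# Link dynamically, but record the library directory in the module itself
# (via the -R / -rpath linker option), so importing iconv.so does not
# depend on the LD_LIBRARY_PATH environment variable.
ext = Extension(
    "iconv",
    sources=["iconvmodule.c"],
    include_dirs=[PREFIX + "/include"],
    library_dirs=[PREFIX + "/lib"],
    runtime_library_dirs=[PREFIX + "/lib"],
    libraries=["iconv", "charset"],
)
# In setup.py: setup(..., ext_modules=[ext])
```

Whether -R actually reaches the linker depends on the toolchain; as the follow-up messages note, old GCC/linker combinations may not honor it.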
> I could not figure out a way to achieve the same things without
> modifying the setup.py script (possible?).

I believe using the build_ext command's options --link-objects, --libraries, --library-dirs, and --rpath might help, so

python setup.py build_ext -I/opt/libiconv-1.5.1/include -L/opt/libiconv-1.5.1/lib -R/opt/libiconv-1.5.1/lib -liconv -lcharset

should have worked. I get an exception that something is a string that shouldn't; if you run into the same problem, you may report it as a distutils bug.

Regards, Martin

From kajiyama@grad.sccs.chukyo-u.ac.jp Thu Jan 11 06:43:14 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Thu, 11 Jan 2001 15:43:14 +0900 Subject: [I18n-sig] iconv codec In-Reply-To: <200101102145.f0ALjGH01465@mira.informatik.hu-berlin.de> (martin@loewis.home.cs.tu-berlin.de) References: <200101110245.LAA01718@sam.hi-ho.ne.jp> Message-ID: <200101110643.PAA12420@dhcp234.grad.sccs.chukyo-u.ac.jp>

Martin v. Loewis wrote:
|
| > FYI: I've modified setup.py in the following way to build the
| > iconv codec with an old libc5 and libiconv. Two iconv libraries
| > are statically linked so that iconvmodule.so can be imported
| > without relying on the LD_LIBRARY_PATH environment variable.
| > The prefix /opt/libiconv-1.5.1 should be changed appropriately.
|
| Is that a standard location as provided by some Linux distributor?

No. That location is a personal preference of mine, not a standard one.

| As for linking statically vs dynamically: If you give the extension a
| runtime_library_dirs attribute, the resulting extension will find its
| shared libraries in these directories; this is achieved through the -R
| linker option.

I've used GCC 2.7.2.3, and it seems not to support the -R option... I tried to give the compiler two linker options -Wl,-rpath -Wl,/opt/libiconv-1.5.1/lib, but I could not get the desired effect either.
| python setup.py build_ext -I/opt/libiconv-1.5.1/include -L/opt/libiconv-1.5.1/lib -R/opt/libiconv-1.5.1/lib -liconv -lcharset
|
| should have worked. I get an exception that something is a string that
| shouldn't;

Me too :-<

| if you run into the same problem, you may report it as a
| distutils bug.

I see. Thanks.

-- KAJIYAMA, Tamito

From martin@loewis.home.cs.tu-berlin.de Thu Jan 11 13:17:11 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 11 Jan 2001 14:17:11 +0100 Subject: [I18n-sig] Distutils and iconv codec Message-ID: <200101111317.f0BDHBl09327@mira.informatik.hu-berlin.de>

It appears that there was a patch for processing -L options in distutils lately, see

http://sourceforge.net/patch/?func=detailpatch&patch_id=102971&group_id=5470

so

python setup.py build_ext -L/tmp -lbla

works now for me. Unfortunately, passing -R is still broken;

python setup.py build_ext -L/tmp -R/tmp -lbla

gives

...
  File "/usr/local/lib/python2.0/distutils/unixccompiler.py", line 208, in link
    (libraries, library_dirs, runtime_library_dirs) = \
  File "/usr/local/lib/python2.0/distutils/ccompiler.py", line 438, in _fix_lib_args
    runtime_library_dirs = (list (runtime_library_dirs) +
TypeError: can only concatenate list (not "string") to list

Also, I wonder what the rationale is for supporting -L/tmp:/var/tmp, while not supporting the Unixish -L/tmp -L/var/tmp.

Regards, Martin

From mal@lemburg.com Mon Jan 22 19:34:15 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 22 Jan 2001 20:34:15 +0100 Subject: [I18n-sig] Codec licenses Message-ID: <3A6C8B37.EDEB795D@lemburg.com>

Hi everybody,

scanning through the CVS archive of the SourceForge python-codecs project I found that most codec packages were placed under the GPL for some reason. This makes the codecs unusable for software which isn't GPL compatible and limits its usefulness considerably.
Please consider either moving to the LGPL which does not have the GPL problems (other software relying on it will need to be shipped under the GPL too), but still assures that your code remains freely available or one of the Python licenses (preferrably the old CWI one). Thanks, -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Jan 26 09:48:24 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 26 Jan 2001 10:48:24 +0100 Subject: [I18n-sig] Codec licenses References: <3A6C8B37.EDEB795D@lemburg.com> Message-ID: <3A7147E8.99A2BDC4@lemburg.com> "M.-A. Lemburg" wrote: > > Hi everybody, > > scanning through the CVS archive of the SourceForge python-codecs > project I found that most codec packages were placed under the GPL > for some reason. This makes the codecs unusable for software which > isn't GPL compatible and limits its usefulness considerably. > > Please consider either moving to the LGPL which does not have the > GPL problems (other software relying on it will need to be shipped > under the GPL too), but still assures that your code remains freely > available or one of the Python licenses (preferrably the > old CWI one). I haven't received any comment on the above so far. Should I take this as rejection of the proposal ? This would be sad and probably cause rewrites for most of the codecs in order to make them useful in closed-source software projects too. 
-- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido@digicool.com Fri Jan 26 16:06:57 2001 From: guido@digicool.com (Guido van Rossum) Date: Fri, 26 Jan 2001 11:06:57 -0500 Subject: [I18n-sig] Codec licenses In-Reply-To: Your message of "Fri, 26 Jan 2001 10:48:24 +0100." <3A7147E8.99A2BDC4@lemburg.com> References: <3A6C8B37.EDEB795D@lemburg.com> <3A7147E8.99A2BDC4@lemburg.com> Message-ID: <200101261606.LAA23895@cj20424-a.reston1.va.home.com> > > Hi everybody, > > > > scanning through the CVS archive of the SourceForge python-codecs > > project I found that most codec packages were placed under the GPL > > for some reason. This makes the codecs unusable for software which > > isn't GPL compatible and limits its usefulness considerably. > > > > Please consider either moving to the LGPL which does not have the > > GPL problems (other software relying on it will need to be shipped > > under the GPL too), but still assures that your code remains freely > > available or one of the Python licenses (preferrably the > > old CWI one). > > I haven't received any comment on the above so far. > > Should I take this as rejection of the proposal ? This would be sad > and probably cause rewrites for most of the codecs in order to make > them useful in closed-source software projects too. If it helps, I'd certainly prefer the LGPL over the GPL. Of course my favorite license is the *old* Python license: http://www.python.org/doc/Copyright.html Another good one is the (current) BSD license: http://www.opensource.org/licenses/bsd-license.html But maybe you could approach those people who have chosen the GPL directly, and explain to them why you prefer something other than the GPL, as long as it's Open Source? 
--Guido van Rossum (home page: http://www.python.org/~guido/) From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Jan 26 16:32:25 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Sat, 27 Jan 2001 01:32:25 +0900 Subject: [I18n-sig] Codec licenses In-Reply-To: <3A7147E8.99A2BDC4@lemburg.com> (mal@lemburg.com) References: <3A7147E8.99A2BDC4@lemburg.com> Message-ID: <200101261632.BAA01375@dhcp234.grad.sccs.chukyo-u.ac.jp> M.-A. Lemburg wrote: | | > scanning through the CVS archive of the SourceForge python-codecs | > project I found that most codec packages were placed under the GPL | > for some reason. This makes the codecs unusable for software which | > isn't GPL compatible and limits its usefulness considerably. | > | > Please consider either moving to the LGPL which does not have the | > GPL problems (other software relying on it will need to be shipped | > under the GPL too), but still assures that your code remains freely | > available or one of the Python licenses (preferrably the | > old CWI one). Well, I have two (opposite?) thoughts regarding to the licensing of the JapaneseCodecs package. First, I've released the package under the terms of GNU GPL, because that license is comfortable for me. I want users to "use" the package in the GNU GPL sense. On the other hand, I hope that many people use my software. If needed, I release JapaneseCodecs or its part under different licensing terms. It is not a problem for me that a package that includes JapaneseCodecs as its part is released under an open source license (like the PyXML package). To tell the truth, JapaneseCodecs is the first free software package that I've released, and when I released it I was not sure what was the best licensing terms for the package. I've chosen the GNU GPL, but the situation seems complex... 
If possible, I'd like to utilize two different licenses: the GNU GPL for JapaneseCodecs as a separate package, and another license for the composite package that includes JapaneseCodecs as its part. Hmm... Does this reply make sense? I'm confused... -- KAJIYAMA, Tamito From guido@digicool.com Fri Jan 26 16:35:29 2001 From: guido@digicool.com (Guido van Rossum) Date: Fri, 26 Jan 2001 11:35:29 -0500 Subject: [I18n-sig] Codec licenses In-Reply-To: Your message of "Sat, 27 Jan 2001 01:32:25 +0900." <200101261632.BAA01375@dhcp234.grad.sccs.chukyo-u.ac.jp> References: <3A7147E8.99A2BDC4@lemburg.com> <200101261632.BAA01375@dhcp234.grad.sccs.chukyo-u.ac.jp> Message-ID: <200101261635.LAA24205@cj20424-a.reston1.va.home.com> > M.-A. Lemburg wrote: > | > | > scanning through the CVS archive of the SourceForge python-codecs > | > project I found that most codec packages were placed under the GPL > | > for some reason. This makes the codecs unusable for software which > | > isn't GPL compatible and limits its usefulness considerably. > | > > | > Please consider either moving to the LGPL which does not have the > | > GPL problems (other software relying on it will need to be shipped > | > under the GPL too), but still assures that your code remains freely > | > available or one of the Python licenses (preferrably the > | > old CWI one). > > Well, I have two (opposite?) thoughts regarding to the licensing > of the JapaneseCodecs package. > > First, I've released the package under the terms of GNU GPL, > because that license is comfortable for me. I want users to > "use" the package in the GNU GPL sense. > > On the other hand, I hope that many people use my software. If > needed, I release JapaneseCodecs or its part under different > licensing terms. It is not a problem for me that a package that > includes JapaneseCodecs as its part is released under an open > source license (like the PyXML package). 
> > To tell the truth, JapaneseCodecs is the first free software > package that I've released, and when I released it I was not > sure what was the best licensing terms for the package. I've > chosen the GNU GPL, but the situation seems complex... > > If possible, I'd like to utilize two different licenses: the > GNU GPL for JapaneseCodecs as a separate package, and another > license for the composite package that includes JapaneseCodecs > as its part. > > Hmm... Does this reply make sense? I'm confused... Makes sense to me -- you as the author can issue as many different licenses as you want to. E.g. Perl does this. I don't know the composite package -- is that also yours? If not, you will have to give the author or distributor of that package explicit permission to include JapaneseCodecs with a different license. --Guido van Rossum (home page: http://www.python.org/~guido/) From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Jan 26 17:14:15 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Sat, 27 Jan 2001 02:14:15 +0900 Subject: [I18n-sig] Codec licenses In-Reply-To: <200101261635.LAA24205@cj20424-a.reston1.va.home.com> (message from Guido van Rossum on Fri, 26 Jan 2001 11:35:29 -0500) References: <200101261635.LAA24205@cj20424-a.reston1.va.home.com> Message-ID: <200101261714.CAA01428@dhcp234.grad.sccs.chukyo-u.ac.jp> Guido van Rossum wrote: | | > If possible, I'd like to utilize two different licenses: the | > GNU GPL for JapaneseCodecs as a separate package, and another | > license for the composite package that includes JapaneseCodecs | > as its part. | | I don't know the composite package -- is that also yours? No, there is no such a package (yet). Once in this list, someone gave an idea of releasing a composite codecs package as a product of the i18n SIG. That is what I called "the composite package". 
| If not, you | will have to give the author or distributor of that package explicit | permission to include JapaneseCodecs with a different license. Yes. I'm quite sure that I will give the permission if required. -- KAJIYAMA, Tamito From mal@lemburg.com Fri Jan 26 17:19:08 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 26 Jan 2001 18:19:08 +0100 Subject: [I18n-sig] Codec licenses References: <3A7147E8.99A2BDC4@lemburg.com> <200101261632.BAA01375@dhcp234.grad.sccs.chukyo-u.ac.jp> Message-ID: <3A71B18C.BDAF205E@lemburg.com> Tamito KAJIYAMA wrote: > > M.-A. Lemburg wrote: > | > | > scanning through the CVS archive of the SourceForge python-codecs > | > project I found that most codec packages were placed under the GPL > | > for some reason. This makes the codecs unusable for software which > | > isn't GPL compatible and limits its usefulness considerably. > | > > | > Please consider either moving to the LGPL which does not have the > | > GPL problems (other software relying on it will need to be shipped > | > under the GPL too), but still assures that your code remains freely > | > available or one of the Python licenses (preferrably the > | > old CWI one). > > Well, I have two (opposite?) thoughts regarding to the licensing > of the JapaneseCodecs package. > > First, I've released the package under the terms of GNU GPL, > because that license is comfortable for me. I want users to > "use" the package in the GNU GPL sense. > > On the other hand, I hope that many people use my software. If > needed, I release JapaneseCodecs or its part under different > licensing terms. It is not a problem for me that a package that > includes JapaneseCodecs as its part is released under an open > source license (like the PyXML package). > > To tell the truth, JapaneseCodecs is the first free software > package that I've released, and when I released it I was not > sure what was the best licensing terms for the package. 
I've
> chosen the GNU GPL, but the situation seems complex...
>
> If possible, I'd like to utilize two different licenses: the
> GNU GPL for JapaneseCodecs as a separate package, and another
> license for the composite package that includes JapaneseCodecs
> as its part.
>
> Hmm... Does this reply make sense? I'm confused...

I know it's confusing, and I am pretty sure that many programmers out there who put their software under the GPL don't know about the consequences of this step. To make it simple:

* the GPL allows your software to be used stand-alone or as part of another package which then has to have a license compatible with the GPL (many popular licenses out there are *not* compatible with the GPL, so this causes problems; e.g. Zope's license is not GPL compatible, so GPLed modules cannot be shipped together with Zope-licensed packages)

* the LGPL (Library GPL) does not impose any restriction with respect to including it in some package, except that the packager will have to make the source code of the LGPLed software available (possibly as a separate package); as a result there are no problems with non-GPL-compatible products, and your software gets used by many more people out there

Both versions make sure that your software and any modifications applied to it are again published under the same terms, meaning that the source code (including any modification) must be made available without fee.

The GPL is fine for stand-alone products. The LGPL should be used for everything which smells like a library ;) Even better are the new BSD licenses, since they give your users all the freedom in the world.

Hope this clarifies things a bit.
-- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Jan 26 18:11:07 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Sat, 27 Jan 2001 03:11:07 +0900 Subject: [I18n-sig] Codec licenses In-Reply-To: <3A71B18C.BDAF205E@lemburg.com> (mal@lemburg.com) References: <3A71B18C.BDAF205E@lemburg.com> Message-ID: <200101261811.DAA01489@dhcp234.grad.sccs.chukyo-u.ac.jp> M.-A. Lemburg wrote: | [The excellent summaries of the GNU GPL/LGPL snipped.] | | Both versions make sure that your software and any modifications | applied to it are again published under the same terms, meaning | that the source code (including any modification) must be made | available without fee. Exactly this effect of the GNU licenses was the reason why I chose the GNU GPL for JapaneseCodecs. I wanted my software to be shared by people forever. | Even better are the | new BSD licenses, since they give your users all the freedom in the | world. To the best of my knowledge BSD licenses allow someone to make that software proprietary and closed-source. This aspect is a contrast to the aforementioned effect of the GNU GPL/LGPL. That's why I prefer the latter licenses. | Hope this clarifies things a bit. Thank you for the clear explanations. 
-- KAJIYAMA, Tamito From walter@livinglogic.de Fri Jan 26 18:55:49 2001 From: walter@livinglogic.de ("Walter Dörwald") Date: Fri, 26 Jan 2001 19:55:49 +0100 Subject: [I18n-sig] Extended error handling for codecs In-Reply-To: <3A5ACB61.E4BEAC6C@lemburg.com> References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101081958290687.00710C3F@mail.livinglogic.de> <3A5ACB61.E4BEAC6C@lemburg.com> Message-ID: <200101261955490531.00D88BBA@mail.livinglogic.de> On 09.01.01 at 09:27 M.-A. Lemburg wrote: [ I think this was supposed to go to the list ] > "Walter Dörwald" wrote: > > > > On 04.01.01 at 11:00 M.-A. Lemburg wrote: > > > > > [...] > > > > > > How would such a scheme allow passing back control information > > > such as: continue with the next character in the stream > > > > def ignore(encoding, string, position): > > return u"" > > > > u"xxx".encode(encoding, 'callback', ignore) > > > > > or break with an exception ? > > > > def raiseAnException(encoding, string, position): > > raise FancyException("can't encode character %r at position %d > in string %r with encoding %s" > > % (string[position], position, string, encoding)) > > > > u"xxx".encode(encoding, 'callback', raiseAnException) > > Ok. I still think that we need to pass more information from > and to the callback. How about this scheme (the internal error > handlers work using a similar scheme): > > def callback(encoding, inputdata, inputposition, > outputdata, outputposition, errors): > ...
> return (inputdata, inputposition, outputdata, outputposition) > > This would give the callback enough information to do just > about everything with the data in question. After having called > the callback(), the encoder or decoder would then reinitialize > itself using the returned data and positions. Does that mean that the callback can feed replacement input data back to the encoder? How does the callback tell the encoder to switch back to the original input after the replacement input is exhausted? Or does the callback have to construct a complete replacement input string? As I see it, the callback can't modify the outputdata, because the output data is already encoded, and the callback knows nothing about the encoding. How could an "xml-escape" be implemented with that? > > > > Looking again at the TR6 mechanism: Even if the error callback was > > > > called, and even if it had to return bytes instead of unicodes, it > > > > could still operate stateless: it would just output SQU as often as > > > > required. I believe that most stateful encodings have an "escape to > > > > known state" mechanism. > > > > > > Which is what I'm talking about all along: the codecs know best > > > what to do, so better extend them than try to fiddle in some > > > information using a callback. > > > > The callback is only used in the situation when the codec does > > not know what to do, i.e. when it encounters an unencodable > > character. The callback is an *error handler* and not a > > "I don't know how to implement my own encoding algorithm, > > please help me"-handler. >;-> > > Let's put it this way: the error handler should have at least > the same possibilities as the current builtin error handlers > have. There is a big difference: the generic callback should be able to work without knowing the encoding. All current builtin error handlers know the encoding because there's a specific error handler for every encoding.
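The callback mechanism being debated in this thread was later standardized (PEP 293, Python 2.3) as codecs.register_error(). As a minimal sketch of how the "xml-escape" treatment asked about above looks in that later API, written in today's Python 3 spelling; the handler name "xml-escape" is registered here purely for illustration and is not a builtin error treatment:

```python
import codecs

def xml_escape_errors(exc):
    # Encode-side error handler: replace each unencodable character
    # with an XML numeric character reference and resume after it.
    if isinstance(exc, UnicodeEncodeError):
        replacement = u"".join(u"&#%d;" % ord(c)
                               for c in exc.object[exc.start:exc.end])
        return replacement, exc.end
    raise exc

# Illustrative registration; "xml-escape" is not a builtin treatment.
codecs.register_error("xml-escape", xml_escape_errors)

print(u"abc\u20ac".encode("ascii", "xml-escape"))  # b'abc&#8364;'
```

Each unencodable character reaches the handler as a slice of a UnicodeEncodeError, and the returned replacement string is then encoded by the codec itself, so the handler needs no knowledge of the target encoding.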
> If a codec needs more information to process an error > condition, e.g. in case it holds internal state (encoder and > decoder functions may not use external state per design), > then it's the codec which has to be extended -- the error handler > won't be able to help. But the codec knows everything about its own internal state; what it does not know is what kind of error handling is wanted. This additional information can't be provided by the codec, but is provided by the user, who doesn't know anything about the encoding (e.g. if it's a list of acceptable encodings from an HTTP Accept-Charset header). > Would this be a good compromise ? > > > > I don't object to adding callback support to the codec's > > > error handlers, but we'll need a new set of APIs to allow > > > this. > > > > I could live with a > > u"xxx".encode(encoding, 'callback', handler) > > on the Python side, but what does this mean for the C API? > > Pretty much the same thing: we'll be adding PyUnicode_EncodeEx() > and PyUnicode_DecodeEx() APIs which have the additional > context object as PyObject*. OK, but what are those objects supposed to know and do? > > > > So I still think your objection is theoretical, whereas the problem > > > > that Walter is trying to solve is real. > > > > > > I did propose a solution which would satisfy your needs: simply > > > add a new error treatment 'xml-escape' to the builtin codecs > > > which then does the needed XML escaping. XML is general enough > > > to warrant such a step and the required changes are minor. > > > > > > Another candidate for a new error treatment would be 'unicode-escape' > > > which then replaces the character in question with '\uXXXX'. > > > > > > For the general case, I'd rather add new PyUnicode_EncodeEx() > > > and PyUnicode_DecodeEx() APIs which then take a Python > > > context object as extra argument. > > > > What should this extra argument be for the decoder? > > A PyObject* just like for the encoder.
The codec design is kept > symmetric to simplify support for stackable streams and also > to simplify the APIs (there aren't all that many API signatures > to remember). But the APIs are not really symmetric: There is no easy inverse of u"xxx".encode(encoding, "callback", xmlReplacementHandler) that automatically generates characters from XML character entities. How would the decoder know when a character entity is encountered? Encoding errors simply mean that the encoding is not capable of handling the data to be encoded. The error handling then has to provide a way of making the unencodable part of the data encodable. Ideally this should be independent of the encoding. Decoding errors mean something completely different: The encoded data does not conform to the format it claims to be in. Fixing this kind of error requires an intimate knowledge of the encoding and therefore cannot be encoding independent. > > > The error treatment string > > > would then define how to use this context object, e.g. 'callback' > > > could be made to apply processing similar to what Walter > > > suggested. > > > > 'callback' seems too generic to me. Maybe there will be other callbacks > > in the future that are used for different stuff. This is the > > "give me a replacement or die" error handler. > > The error handling string should provide enough room for > extensions... what other short string would you recommend ? > 'handler' or 'callcontext' ? In theory "replace" would be the correct name, as the error handler returns a replacement string to be encoded instead of the offending character. But we could use "replacementhandler" or something like that. > [...]
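The encode/decode asymmetry described here carried over into the API that was eventually standardized (PEP 293): a decode-side handler receives a UnicodeDecodeError whose .object attribute is the raw byte string, so a generic handler can only operate on bytes, never on characters of the source encoding. A minimal Python 3 sketch; the name "hex-escape" is an illustrative registration, not a builtin treatment:

```python
import codecs

def hex_escape_errors(exc):
    # Decode-side error handler: render undecodable bytes as literal
    # \xNN escapes instead of raising. exc.object is a bytes object.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return u"".join(u"\\x%02x" % b for b in bad), exc.end
    raise exc

# Illustrative registration; "hex-escape" is not a builtin treatment.
codecs.register_error("hex-escape", hex_escape_errors)

print(b"ok\xff".decode("ascii", "hex-escape"))  # ok\xff
```

Note that all this handler can do is describe the offending bytes; actually repairing malformed input would, as Walter argues, require knowledge of the specific encoding.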
Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From tim.one@home.com Fri Jan 26 21:01:17 2001 From: tim.one@home.com (Tim Peters) Date: Fri, 26 Jan 2001 16:01:17 -0500 Subject: [I18n-sig] Codec licenses In-Reply-To: <200101261811.DAA01489@dhcp234.grad.sccs.chukyo-u.ac.jp> Message-ID: [Tamito KAJIYAMA] > Exactly this effect of the GNU licenses was the reason why I > chose the GNU GPL for JapaneseCodecs. I wanted my software to > be shared by people forever. Guido does too. The GPL forces everyone who uses your code to make *their* code fall under the GPL too. So by using it, you're also telling other people how they have to license their own software (provided they want to use yours). That's part of the GNU philosophy, of course. You should read Stallman's "Why you shouldn't use the Library GPL for your next library": http://www.fsf.org/philosophy/why-not-lgpl.html Unless your code is impossible to duplicate by other means, people who *don't* want to put their own software under the GPL have a choice: they can implement their own library, and sooner or later someone will, and release it under a less drastic license than the GPL, and then the GPL'ed version will get used less and less. That's why the LGPL was invented. > To the best of my knowledge BSD licenses allow someone to make > that software proprietary and closed-source. Absolutely. That has no effect on your code, though: people can still come to you to get your code. You're the only one who can change your licensing terms. For example, Python is used in some closed-source projects and we couldn't care less. Well, actually, we're happy they're using Python! It doesn't stop you from getting Python from us, and doing whatever *you* want to do with it, so it's hard to see how anyone could feel injured (we don't feel injured, you're happy, and the closed-source people are happy too).
> This aspect is a contrast to the aforementioned effect of the GNU > GPL/LGPL. That's why I prefer the latter licenses. The GPL and the LGPL shouldn't be lumped together: they're very different. Stallman's essay (above) should make that clearer. From martin@mira.cs.tu-berlin.de Fri Jan 26 20:43:31 2001 From: martin@mira.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 26 Jan 2001 21:43:31 +0100 Subject: [I18n-sig] Codec licenses In-Reply-To: <3A7147E8.99A2BDC4@lemburg.com> (mal@lemburg.com) References: <3A6C8B37.EDEB795D@lemburg.com> <3A7147E8.99A2BDC4@lemburg.com> Message-ID: <200101262043.f0QKhVN00904@mira.informatik.hu-berlin.de> > I haven't received any comment on the above so far. > > Should I take this as rejection of the proposal ? I wasn't going to go into a long discussion about that matter, but I feel quite comfortable with the iconv codec being GPL'ed. Your main rationale for requesting such a change was > This makes the codecs unusable for software which isn't GPL > compatible and limits its usefulness considerably. I firmly believe that free software should be useful on its own technical merits, and that the LGPL is called the "lesser" GPL for a reason; the FSF actively encourages authors *not* to license software under its terms. I could be talked into changing the license if some project that I support would want to use it, and couldn't because of the GPL (e.g. if it was candidate for inclusion into Python). I won't change in advance. Regards, Martin From mal@lemburg.com Fri Jan 26 21:47:10 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 26 Jan 2001 22:47:10 +0100 Subject: [I18n-sig] Codec licenses References: <3A6C8B37.EDEB795D@lemburg.com> <3A7147E8.99A2BDC4@lemburg.com> <200101262043.f0QKhVN00904@mira.informatik.hu-berlin.de> Message-ID: <3A71F05E.1B1FC74E@lemburg.com> "Martin v. Loewis" wrote: > > > I haven't received any comment on the above so far. > > > > Should I take this as rejection of the proposal ? 
> > I wasn't going to go into a long discussion about that matter, but I > feel quite comfortable with the iconv codec being GPL'ed. Your main > rationale for requesting such a change was > > > This makes the codecs unusable for software which isn't GPL > > compatible and limits its usefulness considerably. > > I firmly believe that free software should be useful on its own > technical merits, and that the LGPL is called the "lesser" GPL for a > reason; the FSF actively encourages authors *not* to license software > under its terms. > > I could be talked into changing the license if some project that I > support would want to use it, and couldn't because of the GPL (e.g. if > it was candidate for inclusion into Python). I won't change in > advance. Writing an iconv package has been on my list of "nice projects" for a while. Unfortunately, I haven't found time to look into this. After having seen you code up something along those lines, I dropped the idea... I guess I'll have to revive it again :-/ Note that iconv itself is distributed under the LGPL, so nothing would prevent me from writing a codec package under a Python style license. The same applies to all other codecs. I still think that such a needless effort could be avoided if people were to play nice. We could then wrap a nice codec extension package for everyone to use at their will. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Fri Jan 26 23:19:12 2001 From: andy@reportlab.com (Andy Robinson) Date: Fri, 26 Jan 2001 23:19:12 -0000 Subject: [I18n-sig] Codec licenses In-Reply-To: <200101262043.f0QKhVN00904@mira.informatik.hu-berlin.de> Message-ID: > > I could be talked into changing the license if some project that I > support would want to use it, and couldn't because of the > GPL (e.g. 
if > it was candidate for inclusion into Python). I won't change in > advance. This discussion has surprised me considerably. Python has always had a non-restrictive licence, as do almost all the packages available for it, and that is one reason why it is successful. If we want to create an "official" Python codec package, we should be prepared to do it under a Python-style license. My own company (www.reportlab.com) makes free and unrestricted reporting libraries, but we are preparing commercial products which will sit on top of those. We need to start selling these products for high prices per server license in order to stay alive and keep coding and keep contributing to open source. One feature we will need within six months is encoding conversions. We would not be able to use any GPL'ed code. So, if we get a customer for Report Markup Language in Japan and we need to do encoding conversions, we will be forced to write a clean implementation. And I promise that we'll release it to the world under a Python compatible licence, as we have no interest in trying to sell such a general-purpose utility. Furthermore, I have done a lot of consulting projects for big corporate customers where we solved problems by integrating open source code. They don't want to take any GPL'ed code, as the cost of ripping it out in future if they ever do want to sell some software would be huge. The Python license has never caused a question. It's always the author's choice, but if you prevent any software house from developing commercial packages which use your code, you limit its exposure and acceptance. Just my 2p worth, Andy Robinson From tree@basistech.com Fri Jan 26 23:33:24 2001 From: tree@basistech.com (Tom Emerson) Date: Fri, 26 Jan 2001 18:33:24 -0500 Subject: [I18n-sig] Codec licenses In-Reply-To: References: <200101262043.f0QKhVN00904@mira.informatik.hu-berlin.de> Message-ID: <14962.2372.636324.340540@cymru.basistech.com> I agree with Andy.
I will also add that, for most of the encodings we're looking at, there is no magic going on: EUC-KR or EUC-CN to Unicode, and back, is a simple table lookup. Doing the ISO-2022 encodings is a bit more work, but it isn't rocket science. As far as I'm concerned, a codec that merely wraps the Unicode Consortium's mapping tables is hardly deserving of any license at all. Using the existing codecs (or an Asian codec package) is an issue of convenience more than anything. This is not meant to belittle those who have written these codecs... my point is merely that placing a highly restrictive license such as the GPL on a codec is considerable overkill. Please direct nasty grams to /dev/null. -tree -- Tom Emerson Basis Technology Corp. Zenkaku Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@mira.cs.tu-berlin.de Fri Jan 26 23:36:44 2001 From: martin@mira.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 27 Jan 2001 00:36:44 +0100 Subject: [I18n-sig] Codec licenses In-Reply-To: <3A71F05E.1B1FC74E@lemburg.com> (mal@lemburg.com) References: <3A6C8B37.EDEB795D@lemburg.com> <3A7147E8.99A2BDC4@lemburg.com> <200101262043.f0QKhVN00904@mira.informatik.hu-berlin.de> <3A71F05E.1B1FC74E@lemburg.com> Message-ID: <200101262336.f0QNaie01717@mira.informatik.hu-berlin.de> > Note that iconv itself is distributed under the LGPL, so nothing > would prevent me from writing a codec package under a Python > style license. The same applies to all other codecs. > > I still think that such a needless effort could be avoided if > people were to play nice. We could then wrap a nice codec extension > package for everyone to use at their will. I don't see your point (but that is probably a starting point to a long and needless discussion on free software and licensing). You are certainly free to write an iconv codec. I can't see *why* you would want to do so - unless you have an actual need for it. If so, what is that need? 
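The "simple table lookup" Tom describes can be sketched in a few lines. The single-entry table below is a stand-in for the thousands of pairs a real Unicode Consortium mapping file supplies; the one entry shown (EUC-JP bytes A4 A2 for U+3042 HIRAGANA LETTER A) is only an example, and the function is an illustration, not any existing codec:

```python
# Illustrative one-entry mapping table; a real codec would load the
# full mapping file for the character set in question.
TABLE = {0xA4A2: u"\u3042"}  # EUC-JP bytes A4 A2 -> HIRAGANA LETTER A

def decode_euc_like(data, table):
    # Bytes below 0x80 are plain ASCII; anything above starts a
    # two-byte code that is looked up in the mapping table.
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        else:
            out.append(table[(b << 8) | data[i + 1]])
            i += 2
    return u"".join(out)

print(decode_euc_like(b"a\xa4\xa2", TABLE))  # 'a' followed by U+3042
```

A real codec would also need the encode direction, error handling for unmapped codes, and checks for truncated input, but as Tom says, none of that is rocket science.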
I'm curious. Talking about talking other people into changing the license of their software: Could you please change the license of mxODBC so that it is free software? A BSD-style license would be nice; restrictions on commercial use are not. Regards, Martin From mal@lemburg.com Sat Jan 27 00:05:53 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 27 Jan 2001 01:05:53 +0100 Subject: [I18n-sig] Codec licenses References: <3A6C8B37.EDEB795D@lemburg.com> <3A7147E8.99A2BDC4@lemburg.com> <200101262043.f0QKhVN00904@mira.informatik.hu-berlin.de> <3A71F05E.1B1FC74E@lemburg.com> <200101262336.f0QNaie01717@mira.informatik.hu-berlin.de> Message-ID: <3A7210E1.F1867092@lemburg.com> "Martin v. Loewis" wrote: > > > Note that iconv itself is distributed under the LGPL, so nothing > > would prevent me from writing a codec package under a Python > > style license. The same applies to all other codecs. > > > > I still think that such a needless effort could be avoided if > > people were to play nice. We could then wrap a nice codec extension > > package for everyone to use at their will. > > I don't see your point (but that is probably a starting point to a > long and needless discussion on free software and licensing). > > You are certainly free to write an iconv codec. I can't see *why* you > would want to do so - unless you have an actual need for it. If so, > what is that need? I'm curious. Very simple: I make a living out of selling closed-source software. As it happens much of the closed-source software uses basic building blocks which are open source, such as Python and many of my mx tools. GPLed code is useless in such a setup though, so I'd need to rewrite the code using either a closed source license (doesn't buy me anything) or a liberal Python style license (buys me free debugging and saves lots of others the effort of writing their own version -- with the result of making everyone happy).
> Talking about talking other people into changing the license of their > software: Could you please change the license of mxODBC so that it is > free software? A BSD-style license would be nice; restrictions on > commercial use are not. I'm not talking anyone into changing their mind on what license to put on their software. I just want people to be aware of what they are doing when they use the GPL for licensing software. As for mxODBC: that will turn into a commercial product starting with the next release. I have to take this step in order to fund development of the other mx open source tools and to be able to actively maintain the package (which is a can of worms...). Anyway, let's *not* head down this road. The codec authors are free to do whatever they like. I just wanted to clarify the problems which using the GPL for library-style code has for the code and its users -- nothing more. I don't want to talk anyone into changing licenses. It would be nice though, if I could convince some of the authors to rethink their decision. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ PS: We seem to be on different wavelengths on a lot of subjects, Martin. Let's simply agree to differ :-) From frank63@ms5.hinet.net Sat Jan 27 11:49:52 2001 From: frank63@ms5.hinet.net (Frank Chen) Date: Sat, 27 Jan 2001 11:49:52 -0000 Subject: [I18n-sig] Re: Codec licenses Message-ID: <200101270347.LAA12815@ms5.hinet.net> Hi: Then, if I say: Conform to GPL or LGPL. Is this logical? Frank Chen From andy@reportlab.com Sat Jan 27 08:07:49 2001 From: andy@reportlab.com (Andy Robinson) Date: Sat, 27 Jan 2001 08:07:49 -0000 Subject: [I18n-sig] Re: Codec licenses In-Reply-To: <200101270347.LAA12815@ms5.hinet.net> Message-ID: > Then, if I say: > > Conform to GPL or LGPL. > > Is this logical?
If you give people the choice of licenses, yes, that totally solves the problem. - Andy Robinson