From martin@loewis.home.cs.tu-berlin.de Tue Jan 2 08:35:27 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 2 Jan 2001 09:35:27 +0100 Subject: [I18n-sig] naming codecs In-Reply-To: <200012070617.PAA22443@dhcp198.grad.sccs.chukyo-u.ac.jp> (message from Tamito KAJIYAMA on Thu, 7 Dec 2000 15:17:06 +0900) References: <3A2F1B701E3.FEEANODA@172.16.112.1> <200012070617.PAA22443@dhcp198.grad.sccs.chukyo-u.ac.jp> Message-ID: <200101020835.JAA01068@loewis.home.cs.tu-berlin.de> > | > I consider releasing a version of the JapaneseCodecs package > | > that will include a new codec for a variant of ISO-2022-JP. The > | > codec is almost the same as the ISO-2022-JP codec, but it can > | > encode and decode Halfwidth Katakana (U+FF61 to U+FF9F) which > | > can not be encoded with ISO-2022-JP as defined in RFC1468. > | > | So how exactly does it encode them? > | > | Is that your own invention, or is there some precedent for that > | encoding (e.g. in an operating system, or text processing system)? > > Halfwidth Katakana in Unicode corresponds to the character set > JIS X 0201 Katakana, and this character set can be designated by > the escape sequence "\033(I" in the framework of ISO 2022. I found some time to look into this, and it appears that your encoding deals with "JIS X 0201 Katakana", which I also found with the name "JIS X 0201 (GR)". I know you already found a name, but ... if your codec is indeed *only* JISX 0201 Katakana, then why not name it that way (e.g. "jisx-0201-katakana"). Regards, Martin From andy@reportlab.com Tue Jan 2 10:41:50 2001 From: andy@reportlab.com (Andy Robinson) Date: Tue, 2 Jan 2001 10:41:50 -0000 Subject: [I18n-sig] naming codecs In-Reply-To: <200101020835.JAA01068@loewis.home.cs.tu-berlin.de> Message-ID: > I found some time to look into this, and it appears that > your encoding > deals with "JIS X 0201 Katakana", which I also found with the name > "JIS X 0201 (GR)". > > I know you already found a name, but ... 
if your codec is indeed > *only* JISX 0201 Katakana, then why not name it that way > (e.g. "jisx-0201-katakana"). > JIS X 0201 Katakana is a character set, not an encoding. It defines the half-width katakana characters (about 60 of them). Japanese encodings contain multiple character sets. ISO-2022-JP is a 'way of making encodings' and within this there can be many variants; he is talking about a specific encoding which combines two character sets... (1) The JIS 0208 character set, 1st and 2nd levels (about 7000 characters including symbols, numeric characters, Latin, Cyrillic and Greek alphabets, Japanese HIRAGANA, KATAKANA, and KANJI), and (2) The JIS 0201 Katakana characters (which are about 60 half-width variants different from the Katakana listed in JIS0208) ...all encoded according to ISO-2022-JP. The half-width katakana are basically 'deprecated' - they predate the ability to use Kanji in computers - but won't go away in practice, so people in Japanese IT frequently need to extend codecs to deal with them. I hope this explains a little further. It is hard to understand this without knowing a little about Japanese writing systems; Ken Lunde's "CJKV" book does quite a good job of explaining it. Regards, Andy Robinson From walter@livinglogic.de Wed Jan 3 19:18:58 2001 From: walter@livinglogic.de ("Walter Dörwald") Date: Wed, 03 Jan 2001 20:18:58 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode In-Reply-To: <3A439A4A.B71F35DA@lemburg.com> References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> Message-ID: <200101032018580500.01F457F3@mail.livinglogic.de> On 22.12.00 at 19:15 M.-A. Lemburg wrote: > "Walter Dörwald" wrote: > > > > On 21.12.00 at 18:30 M.-A.
Lemburg wrote: > > > [about state in encoders and error handlers] > > But I don't see how this internal encoder state should influence > > what the error handler does. There are two layers involved: The > > character encoding layer and the "unencodable character escape > > mechanism". Both layers are completely independent, even in your > > "Unicode compression" example, where the "unencodable character > > escape mechanism" is XML character entities. > > This is true for your XML entity escape example, but error > resolving in general will likely need to know about the > current state of the encoder, e.g. to be able to write data > to the corresponding page in the Unicode compression example or to > force a switch of the current page to a different one. How does this "Unicode compression example" look like? > I know that error handling could be more generic, but passing > a callable object instead of the error parameter is not an > option since the internal APIs all use a const char parameter > for error. Changing this can be done in one or two hours for someone who knows the Python internals. (Unfortunately I don't; I first looked at unicodeobject.[hc] several days ago!) > Besides, I consider such an approach a hack and not > a solution. > > Instead of trying to tweak the implementation into providing > some kind of new error scheme, let's focus on finding a generic > framework which could provide a solution for the general case > while not breaking the existing applications. Are the existing codecs (JapaneseCodecs etc.) to be considered part of the existing applications? The problem might be how to handle callbacks to C functions and callbacks to Python functions in a consistent way. I.e.
is it extern DL_IMPORT(PyObject*) PyUnicode_Encode( const Py_UNICODE *s, /* Unicode char buffer */ int size, /* number of Py_UNICODE chars to encode */ const char *encoding, /* encoding */ PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */ ); or extern DL_IMPORT(PyObject*) PyUnicode_Encode( const Py_UNICODE *s, /* Unicode char buffer */ int size, /* number of Py_UNICODE chars to encode */ const char *encoding, /* encoding */ PyObject *errorHandler /* error handling via Python function */ ); > > > Writing your own function helpers which then apply all the necessary > > > magic is simple and doesn't warrant changing APIs in the core. > > > > It is not as simple as the error handler, but I could live with that. > > > > The big problem is that it effectively kills the speed of your > > application. Every XML application written in Python, no matter > > what it does internally, will in the end have to produce an output > > bytestring. Normally the output encoding should be one that produces > > no unencodable characters, but you have to be prepared to handle > > them. With the error handler the complete encoding will be done > > in C code (with very infrequent calls to the error handler), so > > this scheme gives the best speed possible. > > It would give even better performance if the codec would provide > this hook in some way at C level. extern DL_IMPORT(PyObject*) PyUnicode_Encode( const Py_UNICODE *s, /* Unicode char buffer */ int size, /* number of Py_UNICODE chars to encode */ const char *encoding, /* encoding */ PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */ ); would, but that's not the point. When you use an encoding where more than 20% of the characters have to be escaped (as XML entities or whatever) you're using the wrong encoding. > Note that almost all codecs > have their own error handlers written in C already.
> > > > Since the error handling is extensible by adding new options > > > such as 'callback', > > > > I would prefer a more object oriented way of extending the error > > handling. > > Sure, but we have to assure backward compatibility as well. > > > > the existing codecs could be extended to > > > provide this functionality as well. We'd only need a way to > > > pass the callback to the codecs in some way, e.g. by using > > > a keyword argument on the constructor or by subclassing it > > > and providing a new method for the error handling in question. > > > > There is no need for a string argument 'callback' and > > an additional callback function/method that is passed to the > > encoder. When the error argument is a string, the old mechanism > > can be used, when it is a callable object the new will be used. > > This is bad style and also gives problems in the core > implementation (have a look at unicodeobject.c). I did; what is the problem with changing "const char *error" to "PyObject *error"? Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From mal@lemburg.com Wed Jan 3 20:17:59 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 03 Jan 2001 21:17:59 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> Message-ID: <3A5388F7.FA6D49DA@lemburg.com> "Walter Dörwald" wrote: > > On 22.12.00 at 19:15 M.-A. Lemburg wrote: > > > "Walter Dörwald" wrote: > > > > > > On 21.12.00 at 18:30 M.-A. Lemburg wrote: > > > > [about state in encoders and error handlers] > > > But I don't see how this internal encoder state should influence > > > what the error handler does.
There are two layers involved: The > > > character encoding layer and the "unencodable character escape > > > mechanism". Both layers are completely independent, even in your > > > "Unicode compression" example, where the "unencodable character > > > escape mechanism" is XML character entities. > > > > This is true for your XML entity escape example, but error > > resolving in general will likely need to know about the > > current state of the encoder, e.g. to be able to write data > > corresponding page in the Unicode compression example or to > > force a switch of the current page to a different one. > > How does this "Unicode compression example" look like? Please see the Unicode.org site for a description of the Unicode compression algorithm. Other encoders will likely have similar problems, e.g. ones which compress data based on locality assumptions. > > I know that error handling could be more generic, but passing > > a callable object instead of the error parameter is not an > > option since the internal APIs all use a const char parameter > > for error. > > Changing this should can be done in one or two hours for someone > who knows the Python internals. (Unfortunately I don't, I first > looked at unicodeobject.[hc] several days ago!) Sure, but it would break code and alter the Python C API in unacceptable ways. Note that all builtin C codecs would also have to be changed. If we are going to extend the error handling mechanism then we'd better do it some b/w compatible way, e.g. by providing new APIs. > > Besides, I consider such an approach a hack and not > > a solution. > > > > Instead of trying to tweak the implementation into providing > > some kind of new error scheme, let's focus on finding a generic > > framework which could provide a solution for the general case > > while not breaking the existing applications. > > Are the existing codecs (JapaneseCodecs etc.) to be considered part > of the existing applications? 
All code out there which uses the existing codecs and APIs must be considered when thinking about altering published Python C APIs. > The problem might be how to handle callbacks to C functions and > callback to Python functions in a consistent way. I.e. is it > extern DL_IMPORT(PyObject*) PyUnicode_Encode( > const Py_UNICODE *s, /* Unicode char buffer */ > int size, /* number of Py_UNICODE chars to encode */ > const char *encoding, /* encoding */ > PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */ > ); > or > extern DL_IMPORT(PyObject*) PyUnicode_Encode( > const Py_UNICODE *s, /* Unicode char buffer */ > int size, /* number of Py_UNICODE chars to encode */ > const char *encoding, /* encoding */ > PyObject *errorHandler /* error handling via Python function */ > ); The latter would be the "right" solution. > > > > Writing your own function helpers which then apply all the necessary > > > > magic is simple and doesn't warrant changing APIs in the core. > > > > > > It is not as simple as the error handler, but I could live with that. > > > > > > The big problem is that it effectively kill the speed of your > > > application. Every XML application written in Python, no matter > > > what is does internally, will in the end have to produce an output > > > bytestring. Normally the output encoding should be one that produces > > > no unencodable characters, but you have to be prepared to handle > > > them. With the error handler the complete encoding will be done > > > in C code (with very infrequent calls to the error handler), so > > > this scheme gives the best speed possible. > > > > It would give even better performance if the codec would provide > > this hook in some way at C level. 
> > extern DL_IMPORT(PyObject*) PyUnicode_Encode( > const Py_UNICODE *s, /* Unicode char buffer */ > int size, /* number of Py_UNICODE chars to encode */ > const char *encoding, /* encoding */ > PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */ > ); > would, but that's not the point. When you use an encoding where more > than 20% of the characters have to be escaped (as XML entities or whatever) > you're using the wrong encoding. That's what I was talking about all along... if it's really only for escaping XML, then a special Latin-1 or ASCII XML escaping codec would go a long way (without the troubles of using callbacks and without having to add a new error callback mechanism). Writing such a codec doesn't take much time, since the code's already there. Even better: XML escaping could be added as a new error handling option, e.g. "xml-escape" instead of "replace". Since XML escaping is general enough, I do think that adding such an option to all builtin codecs would be an acceptable and workable solution. > > Note that almost all codecs > > have their own error handlers written in C already. > > > > > > Since the error handling is extensible by adding new options > > > > such as 'callback', > > > > > > I would prefer a more object oriented way of extending the error > > > handling. > > > > Sure, but we have to assure backward compatibility as well. > > > > > > the existing codecs could be extended to > > > > provide this functionality as well. We'd only need a way to > > > > pass the callback to the codecs in some way, e.g. by using > > > > a keyword argument on the constructor or by subclassing it > > > > and providing a new method for the error handling in question. > > > > > > There is no need for a string argument 'callback' and > > > an additional callback function/method that is passed to the > > > encoder.
When the error argument is a string, the old mechanism > > > can be used, when it is a callable object the new will be used. > > > > This is bad style and also gives problems in the core > > implementation (have a look at unicodeobject.c). > > I did, what is the problem with changing "const char *error" to > "PyObject *error"? Backward compatibility. We can't change C API signatures after they have been officially published. The Python way to apply these kind of changes would be to add new extended APIs. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Thu Jan 4 01:09:23 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 4 Jan 2001 02:09:23 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: <3A5388F7.FA6D49DA@lemburg.com> (mal@lemburg.com) References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> Message-ID: <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> > > How does this "Unicode compression example" look like? > > Please see the Unicode.org site for a description of the > Unicode compression algorithm. Specifically, http://www.unicode.org/unicode/reports/tr6/ > Other encoders will likely have similar problems, e.g. ones which > compress data based on locality assumptions. Of course, the TR 6 mechanism won't have the problem at all that we are talking about - in section 5, it says # The compression scheme is capable of compressing strings containing # any Unicode character. 
so the callback for unencodable characters would never be called. Even if it *had* to preserve state (e.g. when encoding into ISO-2022), Walter's proposal is that the callback returns a Unicode object that is encoded *instead* of the original character. I have yet to see an encoding scheme that would fail under this scheme: in the ISO-2022 case, with XML character entities, the codec would know what state it is in, so it would know whether it has to switch to single-byte mode to encode the &# or not. Looking again at the TR6 mechanism: Even if the error callback was called, and even if it had to return bytes instead of unicodes, it could still operate stateless: it would just output SQU as often as required. I believe that most stateful encodings have a "escape to known state" mechanism. So I still think your objection is theoretical, whereas the problem that Walter is trying to solve is real. Regards, Martin From mal@lemburg.com Thu Jan 4 10:00:10 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 04 Jan 2001 11:00:10 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> Message-ID: <3A5449AA.14A602E0@lemburg.com> "Martin v. Loewis" wrote: > > > > How does this "Unicode compression example" look like? > > > > Please see the Unicode.org site for a description of the > > Unicode compression algorithm. > > Specifically, http://www.unicode.org/unicode/reports/tr6/ > > > Other encoders will likely have similar problems, e.g. ones which > > compress data based on locality assumptions. 
> > Of course, the TR 6 mechanism won't have the problem at all that we > are talking about - in section 5, it says > > # The compression scheme is capable of compressing strings containing > # any Unicode character. > > so the callback for unencodable characters would never be called. I just used it as example for the existence of encoders which need to preserve state. > Even if it *had* to preserve state (e.g. when encoding into ISO-2022), > Walter's proposal is that the callback returns a Unicode object that > is encoded *instead* of the original character. I have yet to see an > encoding scheme that would fail under this scheme: in the ISO-2022 > case, with XML character entities, the codec would know what state it > is in, so it would know whether it has to switch to single-byte mode > to encode the &# or not. How would such a scheme allow passing back control information such as: continue with the next character in the stream or break with an exception ? > Looking again at the TR6 mechanism: Even if the error callback was > called, and even if it had to return bytes instead of unicodes, it > could still operate stateless: it would just output SQU as often as > required. I believe that most stateful encodings have a "escape to > known state" mechanism. Which is what I'm talking about all along: the codecs know best what to do, so better extend them than try to fiddle in some information using a callback. I don't object to adding callback support to the codec's error handlers, but we'll need a new set of APIs to allow this. > So I still think your objection is theoretical, whereas the problem > that Walter is trying to solve is real. I did propose a solution which would satisfy your needs: simply add a new error treatment 'xml-escape' to the builtin codecs which then does the needed XML escaping. XML is general enough to warrant such a step and the required changes are minor. 
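[Editor's sketch: the "xml-escape" error treatment proposed above can be expressed with the error-handler registry that later Python versions provide (codecs.register_error); the handler below and the name it registers are illustrative, not the API that existed when this was written.]

```python
import codecs

def xml_escape_errors(exc):
    # Replace each unencodable character with an XML numeric
    # character reference, then resume encoding after the run.
    if isinstance(exc, UnicodeEncodeError):
        replacement = "".join("&#%d;" % ord(ch)
                              for ch in exc.object[exc.start:exc.end])
        return replacement, exc.end
    raise exc

codecs.register_error("xml-escape", xml_escape_errors)

# Unencodable characters come out as character references:
"Gr\u00fc\u00dfe \u20ac".encode("ascii", "xml-escape")  # b'Gr&#252;&#223;e &#8364;'
```

The bulk of the encoding still runs in C; the Python handler is only invoked for the unencodable runs, which is exactly the performance argument Walter makes above.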
Another candidate for a new error treatment would be 'unicode-escape' which then replaces the character in question with '\uXXXX'. For the general case, I'd rather add new PyUnicode_EncodeEx() and PyUnicode_DecodeEx() APIs which then take a Python context object as extra argument. The error treatment string would then define how to use this context object, e.g. 'callback' could be made to apply processing similar to what Walter suggested. The xxxEx() APIs will have to take special precautions to also work with pre-2.1 codecs though, since the codec API definition does not include the extra context object. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Thu Jan 4 10:41:38 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 4 Jan 2001 11:41:38 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode In-Reply-To: <3A5449AA.14A602E0@lemburg.com> (mal@lemburg.com) References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> Message-ID: <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> > How would such a scheme allow passing back control information > such as: continue with the next character in the stream or > break with an exception ? If it wanted to break with an exception, it would raise one. So the function really has two acceptable results: an exception, and a Unicode object. Since most Python functions are allowed to raise exceptions, that went without saying.
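[Editor's sketch: the contract Martin describes, where the callback either raises or returns a Unicode replacement to encode instead, might look like the following pure-Python model. encode_with_handler and the handler signature are hypothetical, invented here to illustrate Walter's proposal; they are not an API that existed.]

```python
def encode_with_handler(text, encoding, handler):
    # On an unencodable character, call handler(text, pos); whatever
    # string it returns is encoded in place of the offending character.
    # If the handler raises, the exception simply propagates.
    out = []
    for pos, ch in enumerate(text):
        try:
            out.append(ch.encode(encoding))
        except UnicodeError:
            out.append(handler(text, pos).encode(encoding))
    return b"".join(out)

# "Skip the current character" is the degenerate handler:
skip = lambda text, pos: ""

encode_with_handler("a\u20acb", "ascii", skip)  # b'ab'
```

A real implementation would of course live inside the codec's C loop and call back only on failure, rather than encoding character by character as this model does.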
> Which is what I'm talking about all along: the codecs know best > what to do, so better extend them than try to fiddle in some > information using a callback. If that means to touch the source of all codecs, then that would be an unacceptable solution. Doing it in a generic way would be ok - except that I still can't see *how* this could possibly work. > I did propose a solution which would satisfy your needs: simply > add a new error treatment 'xml-escape' to the builtin codecs > which then does the needed XML escaping. XML is general enough > to warrant such a step and the required changes are minor. Sorry, I missed that. That would also solve the problem at hand. Since nobody has come up with a different use case for a more general solution, that might be the solution which we can reasonably implement for 2.1. > Another candidate for a new error treatment would be > 'unicode-escape' which then replaces the character in question with > '\uXXXX'. +0. While that falls into the same category, I haven't seen anybody saying "I need such a feature". > For the general case, I'd rather add new PyUnicode_EncodeEx() > and PyUnicode_DecodeEx() APIs which then take a Python > context object as extra argument. The error treatment string > would then define how to use this context object, e.g. 'callback' > could be made to apply processing similar to what Walter > suggested. What other acceptable values for the string would you foresee? Regards, Martin From mal@lemburg.com Fri Jan 5 08:40:52 2001 From: mal@lemburg.com (M.-A.
Lemburg) Date: Fri, 05 Jan 2001 09:40:52 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> Message-ID: <3A558894.F2BA89F0@lemburg.com> "Martin v. Loewis" wrote: > > > How would such a scheme allow passing back control information > > such as: continue with the next character in the stream or > > break with an exception ? > > If it wanted to break with an exception, it would raise one. So the > function really has two acceptable results: an exception, and a Unicode > object. Since most Python functions are allowed to raise exceptions, > that went without saying. Sure, exceptions are not much of a problem, but how would the callback tell the encoder/decoder to e.g. skip forward 2 bytes or perhaps backward 10 bytes ? What if the callback would have to scan the stream from the beginning to find out where to continue or look ahead a few hundred bytes to find the next valid encodable sequence ? Again, you should keep in mind that the scheme has to work for all encoding/decoding work, not only conversion from and to Unicode. > > Which is what I'm talking about all along: the codecs know best > > what to do, so better extend them than try to fiddle in some > > information using a callback. > > If that means to touch the source of all codecs, then that would be an > unacceptable solution. Doing it in a generic way would be ok - except > that I still can't see *how* this could possibly work.
If we were to provide a callback as optional method to StreamReaders/Writers, the task could be done either statically by subclassing the existing codec StreamReaders/Writers or dynamically by asking the codec registry to return the StreamReader/ Writer classes. But since there aren't all that many codec implementations around (only the few in unicodeobject.c), the proposed generic solution of adding new error treatment strings would go a long way... > > I did propose a solution which would satisfy your needs: simply > > add a new error treatment 'xml-escape' to the builtin codecs > > which then does the needed XML escaping. XML is general enough > > to warrant such a step and the required changes are minor. > > Sorry, I missed that. That would also solve the problem at hand. Since > nobody has come up with a different use case for a more general > solution, that might be the solution which we can reasonably implement > for 2.1. Right. > > Another candidate for a new error treatment would be > > 'unicode-escape' which then replaces the character in question with > > '\uXXXX'. > > +0. While that falls into the same category, I haven't seen anybody > saying "I need such a feature". This would be handy for the case where you don't want to have exceptions raised, but still require some form of retaining the original data. > > For the general case, I'd rather add new PyUnicode_EncodeEx() > > and PyUnicode_DecodeEx() APIs which then take a Python > > context object as extra argument. The error treatment string > > would then define how to use this context object, e.g. 'callback' > > could be made to apply processing similar to what Walter > > suggested. > > What other acceptable values for the string would you foresee? Another option would be 'copy' which tries to simply copy input to output in case this is reasonably possible given the encoding (e.g. Unicode -> 8-bit encoding would copy all 8-bit Unicode chars as is in case no mapping is defined). 
An option 'raise' could also be valuable in conjunction with an exception context object to have the codec raise customized exceptions. Provided the context object points to another encoder/decoder, an option 'fallback' could be used to tell the codec to pass the failing input data to the alternate encoder/decoder in order to have it converted. Etc. etc. There are many things one could do with the error string. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Fri Jan 5 09:08:09 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 5 Jan 2001 10:08:09 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: <3A558894.F2BA89F0@lemburg.com> (mal@lemburg.com) References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> <3A558894.F2BA89F0@lemburg.com> Message-ID: <200101050908.f05989x01342@mira.informatik.hu-berlin.de> > Sure, exceptions are not much of a problem, but how would the > callback tell the encoder/decoder to e.g. skip forward 2 bytes or > perhaps backward 10 bytes ? First, I'd like to point out that encoding and decoding is *not* symmetric with regards to error handling, so there is *no* need to make the interfaces appear symmetric; it is rather unfortunate that Python 2 gives this impression. 
The reason for the difference is that converting from some encoding to Unicode never fails for virtually all encodings because of missing characters in Unicode - Unicode is supposed to support almost everything, and code sets that cannot completely map into Unicode probably need special attention anyway (normally, by producing a non-reversible mapping). So the callback is not needed at all for decoding. For encoding, my claim is that error callbacks never want to skip forward 2 bytes. If anything, then go forward two characters - but I can't even imagine a scenario where that would be needed. Don't try to design an API that nobody will ever use. Walter has demonstrated how to implement the "skip the current character" case: by returning u"" from the callback. > What if the callback would have to scan the stream from the > beginning to find out where to continue or look ahead a few hundred > bytes to find the next valid encodable sequence ? What would be the specific encoding, and what would be the specific error handling algorithm that would require such a service? > Again, you should keep in mind that the scheme has to work > for all encoding/decoding work, not only conversion from and > to Unicode. Why is that? That sounds like gross overgeneralization to me. Specifically, do you know anybody using that framework for anything but Unicode conversion? If so, who is that, and what is the specific application? > If we were to provide a callback as optional method to > StreamReaders/Writers, the task could be done either statically > by subclassing the existing codec StreamReaders/Writers or > dynamically by asking the codec registry to return the StreamReader/ > Writer classes. So how would the implementation of charmap_encode invoke this method? It currently doesn't even get hold of the codec object. > Another option would be 'copy' which tries to simply copy input > to output in case this is reasonably possible given the encoding > (e.g. 
Unicode -> 8-bit encoding would copy all 8-bit Unicode chars as > is in case no mapping is defined). An option 'raise' could also > be valuable in conjunction with an exception context object to have > the codec raise customized exceptions. Provided the context > object points to another encoder/decoder, an option 'fallback' > could be used to tell the codec to pass the failing input data > to the alternate encoder/decoder in order to have it converted. > Etc. etc. > > There are many things one could do with the error string. I guess my question is different: Do you consider the error string to be of a well-defined finite enumerated set of possible values, or is it your view that it is up to the codec what error strings to accept? If so, why would they have to be strings? Regards, Martin From mal@lemburg.com Fri Jan 5 09:54:07 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 05 Jan 2001 10:54:07 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> <3A558894.F2BA89F0@lemburg.com> <200101050908.f05989x01342@mira.informatik.hu-berlin.de> Message-ID: <3A5599BE.2A6CBDE2@lemburg.com> "Martin v. Loewis" wrote: > > > Sure, exceptions are not much of a problem, but how would the > > callback tell the encoder/decoder to e.g. skip forward 2 bytes or > > perhaps backward 10 bytes ? 
> > First, I'd like to point out that encoding and decoding is *not* > symmetric with regards to error handling, so there is *no* need to > make the interfaces appear symmetric; it is rather unfortunate that > Python 2 gives this impression. > > The reason for the difference is that converting from some encoding to > Unicode never fails for virtually all encodings because of missing > characters in Unicode - Unicode is supposed to support almost > everything, and code sets that cannot completely map into Unicode > probably need special attention anyway (normally, by producing a > non-reversible mapping). So the callback is not needed at all for > decoding. > > For encoding, my claim is that error callbacks never want to skip > forward 2 bytes. If anything, then go forward two characters - but I > can't even imagine a scenario where that would be needed. Don't try to > design an API that nobody will ever use. > > Walter has demonstrated how to implement the "skip the current > character" case: by returning u"" from the callback. The codec design is supposed to cover the general case of encoding/decoding arbitrary data from and to arbitrary formats. Please don't try to break everything down to Unicode<->8-bit codecs. The design should be able to cover conversion between image formats, audio formats, compression schemes and other encodings just as well as between different text formats. I agree that the case for Unicode codecs allows some simplification to the codec API design, but restricting it to this range of application only would cause us much trouble in the years to come when other codec applications start to appear in the Python universe. Other applications do have a need to jump back and forth in the data stream, e.g. say you want to decode a corrupt image file or a truncated MP3 file. 
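The broader view argued for here did partly materialize: later Python versions ship bytes-to-bytes codecs (compression and transfer encodings) that go through the same codec registry, reachable via codecs.encode/decode. A minimal sketch:

```python
import codecs

# "zlib_codec" and "base64_codec" are bytes-to-bytes codecs served by
# the same registry as the text codecs.
data = b"spam and eggs " * 100
packed = codecs.encode(data, "zlib_codec")       # compress
assert codecs.decode(packed, "zlib_codec") == data
assert len(packed) < len(data)                   # repetitive input compresses well
```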
> > What if the callback would have to scan the stream from the > > beginning to find out where to continue or look ahead a few hundred > > bytes to find the next valid encodable sequence ? > > What would be the specific encoding, and what would be the specific > error handling algorithm that would require such a service? See above. > > Again, you should keep in mind that the scheme has to work > > for all encoding/decoding work, not only conversion from and > > to Unicode. > > Why is that? That sounds like gross overgeneralization to me. > Specifically, do you know anybody using that framework for anything > but Unicode conversion? If so, who is that, and what is the specific > application? I am planning to add compression codecs based on zlib and possibly cryptographic codecs which can then be used together with stackable streams to provide seamless compression and/or encryption to applications which otherwise do not provide this functionality. > > If we were to provide a callback as optional method to > > StreamReaders/Writers, the task could be done either statically > > by subclassing the existing codec StreamReaders/Writers or > > dynamically by asking the codec registry to return the StreamReader/ > > Writer classes. > > So how would the implementation of charmap_encode invoke this method? > It currently doesn't even get hold of the codec object. Through the extended API I proposed earlier on: the extra context object would allow providing a callback mechanism. Alternatively, the StreamReader/Writer classes could use their own specific C coding functions. > > Another option would be 'copy' which tries to simply copy input > > to output in case this is reasonably possible given the encoding > > (e.g.
Provided the context > > object points to another encoder/decoder, an option 'fallback' > > could be used to tell the codec to pass the failing input data > > to the alternate encoder/decoder in order to have it converted. > > Etc. etc. > > > > There are many things one could do with the error string. > > I guess my question is different: Do you consider the error string to > be of a well-defined finite enumerated set of possible values, or is > it your view that it is up to the codec what error strings to accept? Exactly. There is a set of error strings which the codec must accept, but it is free to also implement other schemes as well. > If so, why would they have to be strings? I chose strings to simplify the implementation. Back when the design was discussed, we figured that the codec should take care of the error handling. Python's codec design is one of the few which does allow setting error handling behaviour -- other implementations tend to simply raise an exception and leave the user in the dark. It's too late to *change* the design. We can only extend it. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Fri Jan 5 21:00:25 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Fri, 5 Jan 2001 22:00:25 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: <3A5599BE.2A6CBDE2@lemburg.com> (mal@lemburg.com) References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> <3A558894.F2BA89F0@lemburg.com> <200101050908.f05989x01342@mira.informatik.hu-berlin.de> <3A5599BE.2A6CBDE2@lemburg.com> Message-ID: <200101052100.f05L0Pt01067@mira.informatik.hu-berlin.de> > The codec design is supposed to cover the general case of > encoding/decoding arbitrary data from and to arbitrary formats. Where is it documented as such? I believe it is wishful thinking to assume they cover some general case, although I have to acknowledge that *your* wish is more relevant than other people's wishes. > Please don't try to break everything down to Unicode<->8-bit > codecs. The design should be able to cover conversion between > image formats, audio formats, compression schemes and other > encodings just as well as between different text formats. Is there any precedent that it is actually useful for anything else? > I agree that the case for Unicode codecs allows some simplification > to the codec API design, but restricting it to this range of > application only would cause us much trouble in the years to come > when other codec applications start to appear in the Python > universe. Well, there are a number of codec applications in the Python universe already (e.g. uuencode/base64, various graphics format converters, compression modules); none of which uses the codec module. 
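The kind of stream stacking Martin points to — a compression layer on top of an arbitrary file-like object, with no codec registry involved — works like this (the buffer and payload are arbitrary examples):

```python
import gzip
import io

# gzip.GzipFile wraps any file-like object via the fileobj argument.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as layer:
    layer.write(b"hello world " * 50)

buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode="rb") as layer:
    restored = layer.read()

assert restored == b"hello world " * 50
```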
I firmly believe that they shouldn't - I'd rather have a good solution for each single problem than a mediocre solution that also solves unrelated problems. > Other applications do have a need to jump back and forth in > the data stream, e.g. say you want to decode a corrupt image > file or a truncated MP3 file. Then they also need special API for that; your codec framework will be useless. > I am planning to add compression codecs based on zlib and > possibly cryptographic codecs which can then be used together > with stackable streams to provide seamless compression and/or > encryption to applications which otherwise do not provide this > functionality. Which application do you want to enhance with that functionality? To support writing compressed files, you just use gzip.open; or gzip.GzipFile(fileobj=mystream) if you want to operate on a stream instead of a named file. > > > If we were to provide a callback as optional method to > > > StreamReaders/Writers, the task could be done either statically > > > by subclassing the existing codec StreamReaders/Writers or > > > dynamically by asking the codec registry to return the StreamReader/ > > > Writer classes. > > > > So how would the implementation of charmap_encode invoke this method? > > It currently doesn't even get hold of the codec object. > > Through the extended API I proposed earlier on: the extra context > object would allow providing a callback mechanism. Alternatively, > the StreamReader/Writer classes could use their own specific > C coding functions. Was there some detailed proposal of an API? I don't recall that; could you kindly point me to the message in the archives which elaborates that proposal? Specifically, as an author of an application that wants to extend existing codecs, could you post some Python code that shows how to create the context objects (including an implementation of the codec object's class), and how to pass it to Unicodeobject.encode? > Exactly.
There is a set of error strings which the codec > must accept, but it is free to also implement other schemes > as well. Ok, the guaranteed error strings being 'strict','ignore' and 'replace'. > I chose strings to simplify the implementation. Back when the > design was discussed, we figured that the codec should take > care of the error handling. Python's codec design is one of > the few which does allow setting error handling behaviour -- > other implementations tend to simply raise an exception and leave > the user in the dark. > > It's too late to *change* the design. We can only extend it. It's too late to change the *API*, the design of it can be changed as long as the current API still emerges as a special case. That's what Walter's proposal does: The API is extended to allow callable objects as the error parameter, and three well-known constants are provided (codecs.{STRICT|IGNORE|REPLACE}). Regards, Martin From mal@lemburg.com Sat Jan 6 15:32:10 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 06 Jan 2001 16:32:10 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> <3A558894.F2BA89F0@lemburg.com> <200101050908.f05989x01342@mira.informatik.hu-berlin.de> <3A5599BE.2A6CBDE2@lemburg.com> <200101052100.f05L0Pt01067@mira.informatik.hu-berlin.de> Message-ID: <3A573A7A.A596C068@lemburg.com> "Martin v. Loewis" wrote: > > > The codec design is supposed to cover the general case of > > encoding/decoding arbitrary data from and to arbitrary formats.
> > Where is it documented as such? I believe it is wishful thinking to > assume they cover some general case, although I have to acknowledge > that *your* wish is more relevant than other people's wishes. Please see Misc/unicode.txt for details. I tried to design the interface with a larger application range in mind and that's what I will continue to argue for, obviously ;-) > [ranting about the codec design being useless for other applications] I don't see the point in trying to argue for uselessness of an existing design. If you want your own design, then nobody will stop you from rolling your own. > > > > If we were to provide a callback as optional method to > > > > StreamReaders/Writers, the task could be done either statically > > > > by subclassing the existing codec StreamReaders/Writers or > > > > dynamically by asking the codec registry to return the StreamReader/ > > > > Writer classes. > > > > > > So how would the implementation of charmap_encode invoke this method? > > > It currently doesn't even get hold of the codec object. > > > > Through the extended API I proposed earlier on: the extra context > > object would allow providing a callback mechanism. Alternatively, > > the StreamReader/Writer classes could use their own specific > > C coding functions. > > Was there some detailed proposal of an API? I don't recall that; could > you kindly point me to the message in the archives which elaborates > that proposal? There wasn't a detailed proposal, only a design idea... """ For the general case, I'd rather add new PyUnicode_EncodeEx() and PyUnicode_DecodeEx() APIs which then take a Python context object as extra argument. The error treatment string would then define how to use this context object, e.g. 'callback' could be made to apply processing similar to what Walter suggested. The xxxEx() APIs will have to take special precautions to also work with pre-2.1 codecs though, since the codec API definition does not include the extra context object.
""" > Specifically, as an author of an application that wants to extend > existing codecs, could you post some Python code that shows how to > create the context objects (including an implementation of the codec > object's class), and how to pass it to Unicodeobject.encode? Sure, but only *after* the context object design has implemented.. otherwise there wouldn't be a point ;-) > > Exactly. There is a set of error strings which the codec > > must accept, but it is free to also implement other schemes > > as well. > > Ok, the guaranteed error strings being 'strict','ignore' and > 'replace'. Right. > > I chose strings to simplify the implementation. Back when the > > design was discussed, we figured that the codec should take > > care of the error handling. Python's codec design is one of > > the few which does allow setting error handling behaviour -- > > other implementations tend to simply raise an exception and leave > > the user in the dark. > > > > It's too late to *change* the design. We can only extend it. > > It's too late to change the *API*, the design of it can be changed as > long as the current API still emerges as a special case. That's what > Walter's proposal does: The API is extended to allow callable objects > as the eror parameter, and three well-known constants are > provided (codecs.{STRICT|IGNORE|REPLACE}). No, it does not: the error string parameter is defined as "const char*". You can't change that to PyObject* in the C API and for the Python API I wouldn't want to introduce "switch semantics on type" variables. Extending APIs is OK, changing them is not. I'll right a patch which implements the 'xml-escape' error treatment. 
Hopefully that will buy us some time to think of a design extension -- provided you play along :-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Sat Jan 6 18:48:02 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 6 Jan 2001 19:48:02 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: <3A573A7A.A596C068@lemburg.com> (mal@lemburg.com) References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101041041.f04AfcR01013@mira.informatik.hu-berlin.de> <3A558894.F2BA89F0@lemburg.com> <200101050908.f05989x01342@mira.informatik.hu-berlin.de> <3A5599BE.2A6CBDE2@lemburg.com> <200101052100.f05L0Pt01067@mira.informatik.hu-berlin.de> <3A573A7A.A596C068@lemburg.com> Message-ID: <200101061848.f06Im2v04223@mira.informatik.hu-berlin.de> > I don't see the point in trying to argue for uselessness of > an existing design. If you want your own design, then nobody > will stop you from rolling your own. The design exists only on paper. What really matters is the API and the implementation. I could not care less about the design, but you bring it up to argue why the implementation should not be changed. I don't want my own design, I want to enhance the API. > > > > So how would the implementation of charmap_encode invoke this method? > > > > It currently doesn't even get hold of the codec object. [...] > There wasn't a detailed proposal, only a design idea...
That's one of the major problems here, IMO. If there was a specific proposal, it would be possible to evaluate whether it meets the requirements. Instead, you use "design ideas" to claim that some other specific proposal which we already have is a bad thing, and that the design could be much more general. That is not very convincing, as apparently nobody can follow your design to really understand whether what you claim is true. > For the general case, I'd rather add new PyUnicode_EncodeEx() > and PyUnicode_DecodeEx() APIs which then take a Python > context object as extra argument. The error treatment string > would then define how to use this context object, e.g. 'callback' > could be made to apply processing similar to what Walter > suggested. Ok, PyUnicode_EncodeEx would then invoke PyCodec_EncodeEx, which would eventually end up in encodings.koi8_r.Codec.encode (or encoding.koi8_r.Codec.encode_ex?). Now, how would that be implemented? > The xxxEx() APIs will have to take special precautions to also > work with pre-2.1 codecs though, since the codec API definition > does not include the extra context object. In the specific case of KOI8-R, what would these precautions look like, specifically, using, say, Python as a notation? > > Specifically, as an author of an application that wants to extend > > existing codecs, could you post some Python code that shows how to > > create the context objects (including an implementation of the codec > > object's class), and how to pass it to Unicodeobject.encode? > > Sure, but only *after* the context object design has been implemented... > otherwise there wouldn't be a point ;-) So you want to implement it first, and discuss use cases later??? Or maybe you don't want to discuss the design at all? > No, it does not: the error string parameter is defined as "const char*". You mean, in PyUnicode_FromEncodedObject, PyUnicode_Decode, and other C functions?
So you would have to provide additional functions in the C API, but that is the same as your proposal with the *Ex functions, as I understand it. > You can't change that to PyObject* in the C API and for the Python API > I wouldn't want to introduce "switch semantics on type" variables. Ah, but it's 'switch semantics on value' :-) If you pass the string 'ignore', it has a different semantics than passing 'replace', which again has a different semantic than passing codecs.REPLACE_WITH_XML_CHARACTER_ENTITIES, which happens to be callable. > Extending APIs is OK, changing them is not. That just is an extension. For the C interface, it apparently means duplication; for the Python interface, we can keep the old signatures and extend the acceptable parameter values. > I'll right a patch which implements the 'xml-escape' error > treatment. Hopefully that will buy us some time to think of > a design extension -- provided you play along :-) Good. I'm willing to agree on any proposal once I can see that it does what it was designed for... Regards, Martin From andy@reportlab.com Sat Jan 6 23:26:45 2001 From: andy@reportlab.com (Andy Robinson) Date: Sat, 6 Jan 2001 23:26:45 -0000 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: <200101061848.f06Im2v04223@mira.informatik.hu-berlin.de> Message-ID: >> The codec design is supposed to cover the general case of >> encoding/decoding arbitrary data from and to arbitrary formats. > > Where is it documented as such? I believe it is wishful thinking to > assume they cover some general case, although I have to acknowledge > that *your* wish is more relevant than other people's wishes. > >> Please don't try to break everything down to Unicode<->8-bit >> codecs. The design should be able to cover conversion between >> image formats, audio formats, compression schemes and other >> encodings just as well as between different text formats. 
> Is there any precedent that it is actually useful for > anything else? I'm trying to catch up on this thread after a long absence. I have not been able to do any i18n work this year and cannot give any opinions on the error handling details, but I must comment on these paragraphs. There was a great deal of discussion about keeping the codec mechanism general-purpose on the python-dev list when the unicode proposal was first put together. This came from two directions: (1) I argued long and hard then that i18n is not just Unicode; there are many legacy problems where you want to be able to write codecs to go direct from one native encoding to another without going through Unicode. They are never needed in the case of perfectly encoded data, but this need is pressing if you have to deal with and clean up large amounts of misencoded data, user-defined characters etc. I spent a year of my life on a very complex i18n project, corresponded with Ken Lunde and many other developers in the field, and got the same feedback from the developers at Digital Garage in Tokyo, who deal with this every day. The key requirements I had were that (a) the API should not be limited to Unicode <--> 8-bit, and (b) you should be able to extend codec mappings and algorithms without needing a C compiler every time. I can provide lots of use cases if needed but they are hard to follow if you don't know a little Japanese. (2) there was much interest in the Java concept of 'stackable streams' and stream conversion tools. The general case is clearly a stream of bytes, and Unicode strings are one case of these. Several of us also felt that with the right little state machine in the codec package, you could do very powerful things in different spheres like compression, binary encodings like base 64/85/whatever. Guido played a large part in the discussions and, I believe, he fully understood and echoed the design goal you question at the top.
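The 'stackable streams' idea described above is visible in the codecs StreamWriter/StreamReader API; a minimal sketch stacking a UTF-8 writer on a plain byte buffer:

```python
import codecs
import io

# codecs.getwriter returns a StreamWriter class that stacks a text
# encoder on top of any byte stream.
raw = io.BytesIO()
writer = codecs.getwriter("utf-8")(raw)
writer.write(u"gr\xfc\xdfe")        # "grüße"

# The underlying stream received the UTF-8 encoded bytes.
assert raw.getvalue() == b"gr\xc3\xbc\xc3\x9fe"
```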
Since then, Marc-Andre has done a fantastic amount of largely unpaid work, but I have not been able to follow up with the work I wanted to do on Asian codecs. If I had, you'd have plenty of use cases for keeping things general purpose. I am however confident that whenever we get around to building the right codec package (which depends a lot on when ReportLab gets its first Asian customers), people in the field will see Python's i18n support is way ahead of that of Java. Regards, Andy Robinson (still flat out keeping a startup going and failing to do my duties as sig moderator, sadly) From martin@loewis.home.cs.tu-berlin.de Sun Jan 7 10:09:53 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sun, 7 Jan 2001 11:09:53 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: References: Message-ID: <200101071009.f07A9rB01152@mira.informatik.hu-berlin.de> [need for codecs to go direct from one native encoding to another] > I spent a year of my life on a very complex i18n project, > corresponded with Ken Lunde and many other developers in the field, > and got the same feedback from the developers at Digital Garage in > Tokyo, who deal with this every day. I then have to accept that this really happens in life, although I surely hope that the cases where it is necessary to have such cases become more and more rare. Can you elaborate a bit what the problem was in this complex project? I.e. which were the encodings A and B that you needed direct conversion for? Why couldn't you go through Unicode? If the reason was that you could not have "correctly" recoded a certain subset of the characters, then which characters would have suffered? > The key requirements I had were that (a) the API should not be > limited to Unicode <--> 8-bit, and I believe that requirement is not completely answered.
If you want to get from A to B, and both a and b are byte-oriented encodings, then the API offers b = a.encode("AtoB") First, you need a codec name that describes both source and target encoding; for the Unicode codecs, you only need one encoding in the codec name. However, that API does not work: The encode method of a byte string assumes that the string is in the system encoding. It first tries to decode the string into a Unicode object, then takes the codec name as one going from Unicode to the target. So instead, you have to write enc,dec,_,_ = codecs.lookup("AtoB") b,_ = enc(a) That assumes that you first had registered your codec: import AtoB,codecs codecs.register(AtoB.lookup) In this case, it would be easier *not* to use the framework: import AtoB b = AtoB.encode(a) > (b) you should be able to extend codec mappings and algorithms > without needing a C compiler every time. I don't know what you mean by "extend codec mappings". If you want to register codecs written in Python and use it from C, that works very well. If you want to enhance an existing codec to support additional characters, or to partially replace the output of an existing codec - well, that is surely not available, and the matter of the current debate: It is currently not possible to enhance an existing codec so that it would produce &#4567; if U+4567 is not supported in the target encoding. > I can provide lots of use cases if needed but they are hard to follow if you don't know a little Japanese. Please assume I know a little Japanese, and present a single use case. Since that would be mainly to satisfy my curiosity: don't if that would be a longer essay. > (2) there was much interest in the Java concept of 'stackable > streams' and stream conversion tools. The general case is > clearly a stream of bytes, and Unicode strings are one > case of these.
Several of us also felt that with the right > little state machine in the codec package, you could do very > powerful things in different spheres like compression, binary > encodings like base 64/85/whatever. > > Guido played a large part in the discussions and, I believe he > fully understood and echoed the design goal you question > at the top. Indeed, that's what I question. Stackable things always look like a good idea on paper, so people can be easily talked into approving them. I'm not quite clear why the file API doesn't already provide stackable streams, in fact, gzip.GzipFile is a demonstration that this is really possible. The question is whether anybody currently *has* written codecs that don't deal with strings, yet use the codec interfaces. My claim is that you never want to 'stack' more than one stream on top of another. People are then happy with whatever stacking API the codec offers. My concern is not so much the existence of the API, but that it is taken as a rationale for preventing improvements of the usability of the Unicode library. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Mon Jan 8 08:44:44 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Mon, 8 Jan 2001 09:44:44 +0100 Subject: [I18n-sig] iconv codec Message-ID: <200101080844.f088iii02150@mira.informatik.hu-berlin.de> I have checked-in an iconv codec into the practicecodecs/iconv directory on SF. It has been tested only on Linux so far; if you have any problems with it, or other comments, please let me know. Regards, Martin From mal@lemburg.com Mon Jan 8 15:52:14 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 08 Jan 2001 16:52:14 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode References: <200101071009.f07A9rB01152@mira.informatik.hu-berlin.de> Message-ID: <3A59E22E.349B0981@lemburg.com> Martin, what is the point of these endless discussions about use-cases (which you seem esp. fond of ;), design vs.
API, Walter's proposal and whether or not the codec design covers more general cases than just encoding and decoding from and to Unicode ? These discussions don't get us anywhere. To summarize: * the codec design was discussed at length early last year * the design was chosen after many useful suggestions from people who know what codecs have to deal with (e.g. Andy, Fredrik (from the PIL-perspective BTW)) and others * the design is written down in Misc/unicode.txt * extending the design is OK, breaking APIs is not * extending the design by adding parameters is OK, extending the design by switching on parameter type is not * I have no problem with extending the design * Walter's proposal breaks the Unicode C API in intolerable ways; I agree that the general idea is worth pursuing though and Walter's proposal has some good ideas in that direction So where are we heading ? * I will start to code a new error treatment option 'xml-escape' which can then also be used as basis for other escape techniques which might be of general use (e.g. 'unicode-escape') * we should start thinking of ways to extend the existing C API to allow providing a context object to the encoder/decoder.
I've already made a few suggestions in that direction; more are to come once I find more time to work on this; other suggestions are, of course, welcome too * the new error handler extensions will be a post-2.1 feature * a PEP is needed for the design (most people don't read endless threads like these to catch up) What the PEP should include: * a proposal for extending the Unicode C API to provide an extra context object to the encoder/decoder functions (which are otherwise stateless) * a hook for StreamWriters/Readers to use as standard error handler in case 'callback' is used as error handling option * the Python APIs .encode() and unicode() should be extended by a third optional argument: the context object * all builtin codecs should be extended to handle the new scheme * Codec.encode and .decode APIs should allow a context object as additional optional argument; default should be None * the changes must be 100% backward compatible, both at C and at Python level -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From walter@livinglogic.de Mon Jan 8 18:25:15 2001 From: walter@livinglogic.de (=?ISO-8859-1?Q?=22Walter_D=F6rwald=22?=) Date: Mon, 08 Jan 2001 19:25:15 +0100 Subject: [I18n-sig] Proposal: Extended error handlingforunicode.encode In-Reply-To: <3A5388F7.FA6D49DA@lemburg.com> References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> Message-ID: <200101081925150671.00529F1F@mail.livinglogic.de> On 03.01.01 at 21:17 M.-A.
Lemburg wrote:

> [ Unicode compression example ]
>
> > > I know that error handling could be more generic, but passing
> > > a callable object instead of the error parameter is not an
> > > option since the internal APIs all use a const char parameter
> > > for error.
> >
> > Changing this could be done in one or two hours for someone
> > who knows the Python internals. (Unfortunately I don't, I first
> > looked at unicodeobject.[hc] several days ago!)
>
> Sure, but it would break code and alter the Python C API
> in unacceptable ways. Note that all builtin C codecs would
> also have to be changed.
>
> If we are going to extend the error handling mechanism then
> we'd better do it some b/w compatible way, e.g. by providing
> new APIs.

But I don't think that can be done in a completely backward compatible way. At least the codecs will have to be changed.

> [...]
>
> > extern DL_IMPORT(PyObject*) PyUnicode_Encode(
> >     const Py_UNICODE *s,     /* Unicode char buffer */
> >     int size,                /* number of Py_UNICODE chars to encode */
> >     const char *encoding,    /* encoding */
> >     PyUnicodeObject *errorHandler(PyUnicodeObject *string, int position) /* error handling via C function */
> > );
> > would, but that's not the point. When you use an encoding where more
> > than 20% of the characters have to be escaped (as XML entities or whatever)
> > you're using the wrong encoding.
>
> That's what I was talking about all along... if it's really
> only for escaping XML, then a special Latin-1 or ASCII XML escaping
> codec would go a long way (without the troubles of using callbacks
> and without having to add a new error callback mechanism).

But I would like to have an escaping mechanism that can be used with any encoding, not just latin1 + xml-escape and ascii + xml-escape, but also shift-jis + xml-escape, euc + xml-escape, koi8 + xml-escape, ...

> Writing such a codec doesn't take much time, since the
> code's already there.
Even better: XML escaping could be added
> as a new error handling option, e.g. "xml-escape" instead of
> "replace".
> Since XML escaping is general enough, I do think that adding
> such an option to all builtin codecs would be an acceptable
> and workable solution.

But that method has two problems: handling "xml-escape" has to be implemented in every codec, and it only solves one problem: escaping via numeric (decimal) XML character entities. What if I want an output where "ß" is escaped as "&szlig;" and not "&#223;"? And maybe I define my own entities, so that "あ" will be written as "&hiraA;"? Another use case: when such a string is written to the terminal (encoded with sys.getdefaultencoding()), I want to highlight the character entities, so I have to put ANSI escape sequences around the escaped character. Implementing all of this in all the codecs would be a lot of work, and it is definitely nothing that should be part of the codecs, because it is too application specific.

> [...]

Bye, Walter Dörwald

-- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de

From walter@livinglogic.de Mon Jan 8 18:59:43 2001 From: walter@livinglogic.de (=?us-ascii?Q?=22Walter_D=F6rwald=22?=) Date: Mon, 08 Jan 2001 19:59:43 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode In-Reply-To: <3A5449AA.14A602E0@lemburg.com> References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> Message-ID: <200101081959430656.00722D2F@mail.livinglogic.de> On 04.01.01 at 11:00 M.-A. Lemburg wrote:

> [...]
> > Even if it *had* to preserve state (e.g.
when encoding into ISO-2022),
> > Walter's proposal is that the callback returns a Unicode object that
> > is encoded *instead* of the original character. I have yet to see an
> > encoding scheme that would fail under this scheme: in the ISO-2022
> > case, with XML character entities, the codec would know what state it
> > is in, so it would know whether it has to switch to single-byte mode
> > to encode the &# or not.
>
> How would such a scheme allow passing back control information
> such as: continue with the next character in the stream

def ignore(encoding, string, position):
    return u""

u"xxx".encode(encoding, 'callback', ignore)

> or break with an exception ?

def raiseAnException(encoding, string, position):
    raise FancyException("can't encode character %r at position %d in string %r with encoding %s"
                         % (string[position], position, string, encoding))

u"xxx".encode(encoding, 'callback', raiseAnException)

> > Looking again at the TR6 mechanism: Even if the error callback was
> > called, and even if it had to return bytes instead of unicodes, it
> > could still operate stateless: it would just output SQU as often as
> > required. I believe that most stateful encodings have an "escape to
> > known state" mechanism.
>
> Which is what I'm talking about all along: the codecs know best
> what to do, so better extend them than try to fiddle in some
> information using a callback.

The callback is only used in the situation when the codec does not know what to do, i.e. when it encounters an unencodable character. The callback is an *error handler* and not an "I don't know how to implement my own encoding algorithm, please help me"-handler. >;->

> I don't object to adding callback support to the codec's
> error handlers, but we'll need a new set of APIs to allow
> this.

I could live with a u"xxx".encode(encoding, 'callback', handler) on the Python side, but what does this mean for the C API?
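For readers coming to this thread from a later Python: the `'callback'` error option sketched above never shipped in exactly this form, but PEP 293 (implemented in Python 2.3) added an equivalent mechanism, `codecs.register_error()`, in which the handler receives the `UnicodeEncodeError` and returns a replacement string plus a resume position. A sketch of the same ignore-style handler in that later API; the handler name "demo-ignore" is invented for this example:

```python
import codecs

def demo_ignore(exc):
    # Equivalent of the ignore() callback above: only handle encode errors,
    # return an empty replacement, and resume after the failing character.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    return (u"", exc.end)

codecs.register_error("demo-ignore", demo_ignore)

# The unencodable Euro sign is silently dropped:
print(u"a\u20acb".encode("ascii", "demo-ignore"))  # b'ab'
```

The raise variant comes for free: a handler that raises its own exception simply propagates it out of .encode().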
> > So I still think your objection is theoretical, whereas the problem
> > that Walter is trying to solve is real.
>
> I did propose a solution which would satisfy your needs: simply
> add a new error treatment 'xml-escape' to the builtin codecs
> which then does the needed XML escaping. XML is general enough
> to warrant such a step and the required changes are minor.
>
> Another candidate for a new error treatment would be 'unicode-escape'
> which then replaces the character in question with '\uXXXX'.
>
> For the general case, I'd rather add new PyUnicode_EncodeEx()
> and PyUnicode_DecodeEx() APIs which then take a Python
> context object as extra argument.

What should this extra argument be for the decoder?

> The error treatment string
> would then define how to use this context object, e.g. 'callback'
> could be made to apply processing similar to what Walter
> suggested.

'callback' seems too generic to me. Maybe there will be other callbacks in the future that are used for different stuff. This is the "give me a replacement or die" error handler.

> The xxxEx() APIs will have to take special precautions to also
> work with pre-2.1 codecs though, since the codec API definition
> does not include the extra context object.

Bye, Walter Dörwald

-- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de

From martin@loewis.home.cs.tu-berlin.de Mon Jan 8 22:43:07 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Mon, 8 Jan 2001 23:43:07 +0100 Subject: [I18n-sig] Proposal: Extended error handling for unicode.encode In-Reply-To: <3A59E22E.349B0981@lemburg.com> (mal@lemburg.com) References: <200101071009.f07A9rB01152@mira.informatik.hu-berlin.de> <3A59E22E.349B0981@lemburg.com> Message-ID: <200101082243.f08Mh7l00855@mira.informatik.hu-berlin.de>

> These discussions don't get us anywhere.

I'd surely hoped they would, but I realize that this is not possible. I don't agree with your summary, but we can probably leave it at that.
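As historical context for the exchange above: the 'xml-escape' and 'unicode-escape' error treatments proposed here did eventually appear, under different names, as the built-in 'xmlcharrefreplace' and 'backslashreplace' error handlers added by PEP 293 in Python 2.3. An illustration of both:

```python
# The two error treatments discussed above, as they later shipped:
# 'xmlcharrefreplace' emits decimal XML character references, and
# 'backslashreplace' emits Python-style backslash escapes.
s = u"gr\u00fc\u00dfe"  # "grüße"

print(s.encode("ascii", "xmlcharrefreplace"))  # b'gr&#252;&#223;e'
print(s.encode("ascii", "backslashreplace"))   # b'gr\\xfc\\xdfe'
```

Unlike a per-codec 'xml-escape' option, these are ordinary error handlers, so they work with every codec that supports the error-callback protocol.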
Regards, Martin

From andy@reportlab.com Tue Jan 9 08:49:29 2001 From: andy@reportlab.com (Andy Robinson) Date: Tue, 9 Jan 2001 08:49:29 -0000 Subject: [I18n-sig] PEP needed In-Reply-To: <200101082243.f08Mh7l00855@mira.informatik.hu-berlin.de> Message-ID:

>
> > These discussions don't get us anywhere.
>
> I'd surely hoped they would, but I realize that this is not
> possible. I don't agree with your summary, but we can probably leave
> it at that.
>
> Regards,
> Martin

I think Marc-Andre's suggestion of a PEP is an excellent one. Martin, why not try to produce something like this which starts at the very beginning? Explain the problems you are trying to solve, in PEP format; give code snippets of what you have to do now, why it doesn't work, and how you would like it to work. Then we can all get involved, and even ask Guido if we need to. But we can't expect him or anyone else to give an opinion without a PEP.

I don't have time to trawl through the emails, and I certainly feel a need for a summary of this debate. Since only 2-3 people are involved, I guess no one else has found the time either.

For anyone not familiar with these, Python Enhancement Proposals (PEPs) are a standard form of document used to record Python design decisions. They were introduced to save Guido time and give everyone something to discuss without having to trawl through months of emails. They can all be found at http://cvs.sourceforge.net/cgi-bin/cvsweb.cgi/python/nondist/peps/?cvsroot=python

Thanks,

Andy the pointy-haired manager

p.s. I will finish off my 'use cases' in the next couple of days; I have a very big deadline today and have had no time.
From kajiyama@grad.sccs.chukyo-u.ac.jp Tue Jan 9 23:40:21 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Wed, 10 Jan 2001 08:40:21 +0900 Subject: [I18n-sig] iconv codec In-Reply-To: <200101080844.f088iii02150@mira.informatik.hu-berlin.de> (martin@loewis.home.cs.tu-berlin.de) Message-ID: <200101092340.IAA09353@dhcp234.grad.sccs.chukyo-u.ac.jp>

Martin v. Loewis wrote:
|
| I have checked-in an iconv codec into the practicecodecs/iconv
| directory on SF.

Cool.

| It has been tested only on Linux so far; if you have
| any problems with it, or other comments, please let me know.

I've tested the iconv codec (checked out last night) on two Linux boxes of mine, one with glibc-2.1.2 and the other with old libc5 plus libiconv-1.5.1 (http://clisp.cons.org/~haible/packages-libiconv.html). I have the following error messages on both platforms:

Python 2.0 (#1, Oct 27 2000, 00:27:59) [GCC 2.7.2.3] on linux2
>>> import iconvcodec
>>> unicode("test","euc-jp")
Traceback (most recent call last):
  File "", line 1, in ?
  File "iconvcodec.py", line 50, in decode
    return self.decoder.iconv(msg, return_unicode=1),len(msg)
SystemError: new style getargs format but argument is not a tuple
>>> u"test".encode("euc-jp")
Traceback (most recent call last):
  File "", line 1, in ?
  File "iconvcodec.py", line 19, in encode
    return self.encoder.iconv(msg),len(msg)
SystemError: new style getargs format but argument is not a tuple
>>>

What goes wrong?

-- KAJIYAMA, Tamito

From martin@loewis.home.cs.tu-berlin.de Wed Jan 10 07:37:13 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v.
Loewis) Date: Wed, 10 Jan 2001 08:37:13 +0100 Subject: [I18n-sig] iconv codec In-Reply-To: <200101092340.IAA09353@dhcp234.grad.sccs.chukyo-u.ac.jp> (message from Tamito KAJIYAMA on Wed, 10 Jan 2001 08:40:21 +0900) References: <200101092340.IAA09353@dhcp234.grad.sccs.chukyo-u.ac.jp> Message-ID: <200101100737.f0A7bDe00912@mira.informatik.hu-berlin.de>

> SystemError: new style getargs format but argument is not a tuple
> >>>
>
> What goes wrong?

Thanks for the report. Iconv_iconv should have used METH_VARARGS|METH_KEYWORDS, but was using only METH_KEYWORDS. Please update your tree and try again. I don't know why this was no problem with the CVS Python.

Regards, Martin

From kajiyama@grad.sccs.chukyo-u.ac.jp Wed Jan 10 08:32:44 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Wed, 10 Jan 2001 17:32:44 +0900 Subject: [I18n-sig] iconv codec In-Reply-To: <200101100737.f0A7bDe00912@mira.informatik.hu-berlin.de> (martin@loewis.home.cs.tu-berlin.de) References: <200101100737.f0A7bDe00912@mira.informatik.hu-berlin.de> Message-ID: <200101100832.RAA10608@dhcp234.grad.sccs.chukyo-u.ac.jp>

Martin v. Loewis wrote:
|
| > SystemError: new style getargs format but argument is not a tuple
| > >>>
| >
| > What goes wrong?
|
| Thanks for the report. Iconv_iconv should have used
| METH_VARARGS|METH_KEYWORDS, but was using only METH_KEYWORDS. Please
| update your tree and try again. I don't know why this was no problem
| with the CVS Python.

It works both with glibc-2.1.2 and with libiconv-1.5.1. Thanks.

FYI: I've modified setup.py in the following way to build the iconv codec with an old libc5 and libiconv. Two iconv libraries are statically linked so that iconvmodule.so can be imported without relying on the LD_LIBRARY_PATH environment variable. The prefix /opt/libiconv-1.5.1 should be changed appropriately. I could not figure out a way to achieve the same things without modifying the setup.py script (possible?).
--- setup.py.orig	Wed Jan 10 08:50:56 2001
+++ setup.py	Wed Jan 10 09:01:22 2001
@@ -14,6 +14,9 @@
     """,
     py_modules = ['iconvcodec'],
-    ext_modules = [Extension("iconv",sources=["iconvmodule.c"])]
+    ext_modules = [Extension("iconv",sources=["iconvmodule.c"],
+                             include_dirs=["/opt/libiconv-1.5.1/include"],
+                             extra_objects=["/opt/libiconv-1.5.1/lib/libiconv.a",
+                                            "/opt/libiconv-1.5.1/lib/libcharset.a"])]
 )

Regards,

-- KAJIYAMA, Tamito

From martin@loewis.home.cs.tu-berlin.de Wed Jan 10 21:45:16 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 10 Jan 2001 22:45:16 +0100 Subject: [I18n-sig] iconv codec In-Reply-To: <200101100832.RAA10608@dhcp234.grad.sccs.chukyo-u.ac.jp> (message from Tamito KAJIYAMA on Wed, 10 Jan 2001 17:32:44 +0900) References: <200101100737.f0A7bDe00912@mira.informatik.hu-berlin.de> <200101100832.RAA10608@dhcp234.grad.sccs.chukyo-u.ac.jp> Message-ID: <200101102145.f0ALjGH01465@mira.informatik.hu-berlin.de>

> FYI: I've modified setup.py in the following way to build the
> iconv codec with an old libc5 and libiconv. Two iconv libraries
> are statically linked so that iconvmodule.so can be imported
> without relying on the LD_LIBRARY_PATH environment variable.
> The prefix /opt/libiconv-1.5.1 should be changed appropriately.

Is that a standard location as provided by some Linux distributor? If so, we could check whether some specific files are there, and then automatically add them as extra objects. If you can find a patch (e.g. using os.path.exists) that detects your configuration (and perhaps the default /usr/local installation), feel free to check that into the CVS.

As for linking statically vs dynamically: If you give the extension a runtime_library_dirs attribute, the resulting extension will find its shared libraries in these directories; this is achieved through the -R linker option. Of course, if the shared library is in /usr/local/lib, it'll be found anyway.
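A sketch of the dynamic-linking alternative Martin describes, rather than statically linking via extra_objects: the runtime_library_dirs attribute is set directly on the Extension. The /opt prefix below is Kajiyama's personal install location, not a standard path, and on current Python (3.12+) the same Extension class lives in setuptools rather than distutils.

```python
from distutils.core import Extension  # on Python >= 3.12: from setuptools import Extension

PREFIX = "/opt/libiconv-1.5.1"  # non-standard, per-user install prefix

# Link dynamically, but record the library directory in the module itself
# (via the -R / -rpath linker option), so importing iconv.so does not
# depend on the LD_LIBRARY_PATH environment variable.
ext = Extension(
    "iconv",
    sources=["iconvmodule.c"],
    include_dirs=[PREFIX + "/include"],
    library_dirs=[PREFIX + "/lib"],
    runtime_library_dirs=[PREFIX + "/lib"],
    libraries=["iconv", "charset"],
)
# In setup.py: setup(..., ext_modules=[ext])
```

Whether -R actually reaches the linker depends on the toolchain; as the follow-up messages note, old GCC/linker combinations may not honor it.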
> I could not figure out a way to achieve the same things without
> modifying the setup.py script (possible?).

I believe using the build_ext command's options --link-objects, --libraries, --library-dirs, and --rpath might help, so

python setup.py build_ext -I/opt/libiconv-1.5.1/include -L/opt/libiconv-1.5.1/lib -R/opt/libiconv-1.5.1/lib -liconv -lcharset

should have worked. I get an exception that something is a string that shouldn't; if you run into the same problem, you may report it as a distutils bug.

Regards, Martin

From kajiyama@grad.sccs.chukyo-u.ac.jp Thu Jan 11 06:43:14 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Thu, 11 Jan 2001 15:43:14 +0900 Subject: [I18n-sig] iconv codec In-Reply-To: <200101102145.f0ALjGH01465@mira.informatik.hu-berlin.de> (martin@loewis.home.cs.tu-berlin.de) References: <200101110245.LAA01718@sam.hi-ho.ne.jp> Message-ID: <200101110643.PAA12420@dhcp234.grad.sccs.chukyo-u.ac.jp>

Martin v. Loewis wrote:
|
| > FYI: I've modified setup.py in the following way to build the
| > iconv codec with an old libc5 and libiconv. Two iconv libraries
| > are statically linked so that iconvmodule.so can be imported
| > without relying on the LD_LIBRARY_PATH environment variable.
| > The prefix /opt/libiconv-1.5.1 should be changed appropriately.
|
| Is that a standard location as provided by some Linux distributor?

No. That location is a personal preference of mine, not a standard one.

| As for linking statically vs dynamically: If you give the extension a
| runtime_library_dirs attribute, the resulting extension will find its
| shared libraries in these directories; this is achieved through the -R
| linker option.

I've used GCC 2.7.2.3, and it seems not to support the -R option... I tried to give the compiler two linker options -Wl,-rpath -Wl,/opt/libiconv-1.5.1/lib, but I could not get the desired effect either.
| python setup.py build_ext -I/opt/libiconv-1.5.1/include -L/opt/libiconv-1.5.1/lib -R/opt/libiconv-1.5.1/lib -liconv -lcharset
|
| should have worked. I get an exception that something is a string that
| shouldn't;

Me too :-<

| if you run into the same problem, you may report it as a
| distutils bug.

I see. Thanks.

-- KAJIYAMA, Tamito

From martin@loewis.home.cs.tu-berlin.de Thu Jan 11 13:17:11 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 11 Jan 2001 14:17:11 +0100 Subject: [I18n-sig] Distutils and iconv codec Message-ID: <200101111317.f0BDHBl09327@mira.informatik.hu-berlin.de>

It appears that there was a patch for processing -L options in distutils lately, see

http://sourceforge.net/patch/?func=detailpatch&patch_id=102971&group_id=5470

so

python setup.py build_ext -L/tmp -lbla

works now for me. Unfortunately, passing -R is still broken;

python setup.py build_ext -L/tmp -R/tmp -lbla

gives

...
  File "/usr/local/lib/python2.0/distutils/unixccompiler.py", line 208, in link
    (libraries, library_dirs, runtime_library_dirs) = \
  File "/usr/local/lib/python2.0/distutils/ccompiler.py", line 438, in _fix_lib_args
    runtime_library_dirs = (list (runtime_library_dirs) +
TypeError: can only concatenate list (not "string") to list

Also, I wonder what the rationale is for supporting -L/tmp:/var/tmp, while not supporting the Unixish -L/tmp -L/var/tmp.

Regards, Martin

From mal@lemburg.com Mon Jan 22 19:34:15 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 22 Jan 2001 20:34:15 +0100 Subject: [I18n-sig] Codec licenses Message-ID: <3A6C8B37.EDEB795D@lemburg.com>

Hi everybody,

scanning through the CVS archive of the SourceForge python-codecs project I found that most codec packages were placed under the GPL for some reason. This makes the codecs unusable for software which isn't GPL compatible and limits its usefulness considerably.
Please consider either moving to the LGPL which does not have the GPL problems (other software relying on it will need to be shipped under the GPL too), but still assures that your code remains freely available or one of the Python licenses (preferrably the old CWI one). Thanks, -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Jan 26 09:48:24 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 26 Jan 2001 10:48:24 +0100 Subject: [I18n-sig] Codec licenses References: <3A6C8B37.EDEB795D@lemburg.com> Message-ID: <3A7147E8.99A2BDC4@lemburg.com> "M.-A. Lemburg" wrote: > > Hi everybody, > > scanning through the CVS archive of the SourceForge python-codecs > project I found that most codec packages were placed under the GPL > for some reason. This makes the codecs unusable for software which > isn't GPL compatible and limits its usefulness considerably. > > Please consider either moving to the LGPL which does not have the > GPL problems (other software relying on it will need to be shipped > under the GPL too), but still assures that your code remains freely > available or one of the Python licenses (preferrably the > old CWI one). I haven't received any comment on the above so far. Should I take this as rejection of the proposal ? This would be sad and probably cause rewrites for most of the codecs in order to make them useful in closed-source software projects too. 
-- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido@digicool.com Fri Jan 26 16:06:57 2001 From: guido@digicool.com (Guido van Rossum) Date: Fri, 26 Jan 2001 11:06:57 -0500 Subject: [I18n-sig] Codec licenses In-Reply-To: Your message of "Fri, 26 Jan 2001 10:48:24 +0100." <3A7147E8.99A2BDC4@lemburg.com> References: <3A6C8B37.EDEB795D@lemburg.com> <3A7147E8.99A2BDC4@lemburg.com> Message-ID: <200101261606.LAA23895@cj20424-a.reston1.va.home.com> > > Hi everybody, > > > > scanning through the CVS archive of the SourceForge python-codecs > > project I found that most codec packages were placed under the GPL > > for some reason. This makes the codecs unusable for software which > > isn't GPL compatible and limits its usefulness considerably. > > > > Please consider either moving to the LGPL which does not have the > > GPL problems (other software relying on it will need to be shipped > > under the GPL too), but still assures that your code remains freely > > available or one of the Python licenses (preferrably the > > old CWI one). > > I haven't received any comment on the above so far. > > Should I take this as rejection of the proposal ? This would be sad > and probably cause rewrites for most of the codecs in order to make > them useful in closed-source software projects too. If it helps, I'd certainly prefer the LGPL over the GPL. Of course my favorite license is the *old* Python license: http://www.python.org/doc/Copyright.html Another good one is the (current) BSD license: http://www.opensource.org/licenses/bsd-license.html But maybe you could approach those people who have chosen the GPL directly, and explain to them why you prefer something other than the GPL, as long as it's Open Source? 
--Guido van Rossum (home page: http://www.python.org/~guido/) From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Jan 26 16:32:25 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Sat, 27 Jan 2001 01:32:25 +0900 Subject: [I18n-sig] Codec licenses In-Reply-To: <3A7147E8.99A2BDC4@lemburg.com> (mal@lemburg.com) References: <3A7147E8.99A2BDC4@lemburg.com> Message-ID: <200101261632.BAA01375@dhcp234.grad.sccs.chukyo-u.ac.jp> M.-A. Lemburg wrote: | | > scanning through the CVS archive of the SourceForge python-codecs | > project I found that most codec packages were placed under the GPL | > for some reason. This makes the codecs unusable for software which | > isn't GPL compatible and limits its usefulness considerably. | > | > Please consider either moving to the LGPL which does not have the | > GPL problems (other software relying on it will need to be shipped | > under the GPL too), but still assures that your code remains freely | > available or one of the Python licenses (preferrably the | > old CWI one). Well, I have two (opposite?) thoughts regarding to the licensing of the JapaneseCodecs package. First, I've released the package under the terms of GNU GPL, because that license is comfortable for me. I want users to "use" the package in the GNU GPL sense. On the other hand, I hope that many people use my software. If needed, I release JapaneseCodecs or its part under different licensing terms. It is not a problem for me that a package that includes JapaneseCodecs as its part is released under an open source license (like the PyXML package). To tell the truth, JapaneseCodecs is the first free software package that I've released, and when I released it I was not sure what was the best licensing terms for the package. I've chosen the GNU GPL, but the situation seems complex... 
If possible, I'd like to utilize two different licenses: the GNU GPL for JapaneseCodecs as a separate package, and another license for the composite package that includes JapaneseCodecs as its part. Hmm... Does this reply make sense? I'm confused... -- KAJIYAMA, Tamito From guido@digicool.com Fri Jan 26 16:35:29 2001 From: guido@digicool.com (Guido van Rossum) Date: Fri, 26 Jan 2001 11:35:29 -0500 Subject: [I18n-sig] Codec licenses In-Reply-To: Your message of "Sat, 27 Jan 2001 01:32:25 +0900." <200101261632.BAA01375@dhcp234.grad.sccs.chukyo-u.ac.jp> References: <3A7147E8.99A2BDC4@lemburg.com> <200101261632.BAA01375@dhcp234.grad.sccs.chukyo-u.ac.jp> Message-ID: <200101261635.LAA24205@cj20424-a.reston1.va.home.com> > M.-A. Lemburg wrote: > | > | > scanning through the CVS archive of the SourceForge python-codecs > | > project I found that most codec packages were placed under the GPL > | > for some reason. This makes the codecs unusable for software which > | > isn't GPL compatible and limits its usefulness considerably. > | > > | > Please consider either moving to the LGPL which does not have the > | > GPL problems (other software relying on it will need to be shipped > | > under the GPL too), but still assures that your code remains freely > | > available or one of the Python licenses (preferrably the > | > old CWI one). > > Well, I have two (opposite?) thoughts regarding to the licensing > of the JapaneseCodecs package. > > First, I've released the package under the terms of GNU GPL, > because that license is comfortable for me. I want users to > "use" the package in the GNU GPL sense. > > On the other hand, I hope that many people use my software. If > needed, I release JapaneseCodecs or its part under different > licensing terms. It is not a problem for me that a package that > includes JapaneseCodecs as its part is released under an open > source license (like the PyXML package). 
> > To tell the truth, JapaneseCodecs is the first free software > package that I've released, and when I released it I was not > sure what was the best licensing terms for the package. I've > chosen the GNU GPL, but the situation seems complex... > > If possible, I'd like to utilize two different licenses: the > GNU GPL for JapaneseCodecs as a separate package, and another > license for the composite package that includes JapaneseCodecs > as its part. > > Hmm... Does this reply make sense? I'm confused... Makes sense to me -- you as the author can issue as many different licenses as you want to. E.g. Perl does this. I don't know the composite package -- is that also yours? If not, you will have to give the author or distributor of that package explicit permission to include JapaneseCodecs with a different license. --Guido van Rossum (home page: http://www.python.org/~guido/) From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Jan 26 17:14:15 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Sat, 27 Jan 2001 02:14:15 +0900 Subject: [I18n-sig] Codec licenses In-Reply-To: <200101261635.LAA24205@cj20424-a.reston1.va.home.com> (message from Guido van Rossum on Fri, 26 Jan 2001 11:35:29 -0500) References: <200101261635.LAA24205@cj20424-a.reston1.va.home.com> Message-ID: <200101261714.CAA01428@dhcp234.grad.sccs.chukyo-u.ac.jp> Guido van Rossum wrote: | | > If possible, I'd like to utilize two different licenses: the | > GNU GPL for JapaneseCodecs as a separate package, and another | > license for the composite package that includes JapaneseCodecs | > as its part. | | I don't know the composite package -- is that also yours? No, there is no such a package (yet). Once in this list, someone gave an idea of releasing a composite codecs package as a product of the i18n SIG. That is what I called "the composite package". 
| If not, you | will have to give the author or distributor of that package explicit | permission to include JapaneseCodecs with a different license. Yes. I'm quite sure that I will give the permission if required. -- KAJIYAMA, Tamito From mal@lemburg.com Fri Jan 26 17:19:08 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 26 Jan 2001 18:19:08 +0100 Subject: [I18n-sig] Codec licenses References: <3A7147E8.99A2BDC4@lemburg.com> <200101261632.BAA01375@dhcp234.grad.sccs.chukyo-u.ac.jp> Message-ID: <3A71B18C.BDAF205E@lemburg.com> Tamito KAJIYAMA wrote: > > M.-A. Lemburg wrote: > | > | > scanning through the CVS archive of the SourceForge python-codecs > | > project I found that most codec packages were placed under the GPL > | > for some reason. This makes the codecs unusable for software which > | > isn't GPL compatible and limits its usefulness considerably. > | > > | > Please consider either moving to the LGPL which does not have the > | > GPL problems (other software relying on it will need to be shipped > | > under the GPL too), but still assures that your code remains freely > | > available or one of the Python licenses (preferrably the > | > old CWI one). > > Well, I have two (opposite?) thoughts regarding to the licensing > of the JapaneseCodecs package. > > First, I've released the package under the terms of GNU GPL, > because that license is comfortable for me. I want users to > "use" the package in the GNU GPL sense. > > On the other hand, I hope that many people use my software. If > needed, I release JapaneseCodecs or its part under different > licensing terms. It is not a problem for me that a package that > includes JapaneseCodecs as its part is released under an open > source license (like the PyXML package). > > To tell the truth, JapaneseCodecs is the first free software > package that I've released, and when I released it I was not > sure what was the best licensing terms for the package. 
I've
> chosen the GNU GPL, but the situation seems complex...
>
> If possible, I'd like to utilize two different licenses: the
> GNU GPL for JapaneseCodecs as a separate package, and another
> license for the composite package that includes JapaneseCodecs
> as its part.
>
> Hmm... Does this reply make sense? I'm confused...

I know it's confusing, and I am pretty sure that many programmers out there who put their software under the GPL don't know about the consequences of this step. To make it simple:

* the GPL allows your software to be used stand-alone or as part of another package which then has to have a license compatible with the GPL (many popular licenses out there are *not* compatible with the GPL, so this causes problems; e.g. Zope's license is not GPL compatible, so GPLed modules cannot be shipped together with Zope-licensed packages)

* the LGPL (Library GPL) does not impose any restriction with respect to including it in some package, except that the packager will have to make the source code of the LGPLed software available (possibly as a separate package); as a result there are no problems with non-GPL-compatible products, and your software gets used by many more people out there

Both versions make sure that your software and any modifications applied to it are again published under the same terms, meaning that the source code (including any modification) must be made available without fee.

The GPL is fine for stand-alone products. The LGPL should be used for everything which smells like a library ;) Even better are the new BSD licenses, since they give your users all the freedom in the world.

Hope this clarifies things a bit.
-- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Jan 26 18:11:07 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Sat, 27 Jan 2001 03:11:07 +0900 Subject: [I18n-sig] Codec licenses In-Reply-To: <3A71B18C.BDAF205E@lemburg.com> (mal@lemburg.com) References: <3A71B18C.BDAF205E@lemburg.com> Message-ID: <200101261811.DAA01489@dhcp234.grad.sccs.chukyo-u.ac.jp> M.-A. Lemburg wrote: | [The excellent summaries of the GNU GPL/LGPL snipped.] | | Both versions make sure that your software and any modifications | applied to it are again published under the same terms, meaning | that the source code (including any modification) must be made | available without fee. Exactly this effect of the GNU licenses was the reason why I chose the GNU GPL for JapaneseCodecs. I wanted my software to be shared by people forever. | Even better are the | new BSD licenses, since they give your users all the freedom in the | world. To the best of my knowledge BSD licenses allow someone to make that software proprietary and closed-source. This aspect is a contrast to the aforementioned effect of the GNU GPL/LGPL. That's why I prefer the latter licenses. | Hope this clarifies things a bit. Thank you for the clear explanations. 
-- KAJIYAMA, Tamito From walter@livinglogic.de Fri Jan 26 18:55:49 2001 From: walter@livinglogic.de ("Walter Dörwald") Date: Fri, 26 Jan 2001 19:55:49 +0100 Subject: [I18n-sig] Extended error handling for codecs In-Reply-To: <3A5ACB61.E4BEAC6C@lemburg.com> References: <200012201506250171.00D313E3@mail.tmt.de> <3A40FFF5.882E0D82@lemburg.com> <200012202054.VAA01458@loewis.home.cs.tu-berlin.de> <3A423E4D.88C7639@lemburg.com> <200012221632310203.0105EF8A@mail.livinglogic.de> <3A439A4A.B71F35DA@lemburg.com> <200101032018580500.01F457F3@mail.livinglogic.de> <3A5388F7.FA6D49DA@lemburg.com> <200101040109.f0419NH01429@mira.informatik.hu-berlin.de> <3A5449AA.14A602E0@lemburg.com> <200101081958290687.00710C3F@mail.livinglogic.de> <3A5ACB61.E4BEAC6C@lemburg.com> Message-ID: <200101261955490531.00D88BBA@mail.livinglogic.de> On 09.01.01 at 09:27 M.-A. Lemburg wrote: [ I think this was supposed to go to the list ] > "Walter Dörwald" wrote: > > > > On 04.01.01 at 11:00 M.-A. Lemburg wrote: > > > > > [...] > > > > > > How would such a scheme allow passing back control information > > > such as: continue with the next character in the stream > > > > def ignore(encoding, string, position): > > return u"" > > > > u"xxx".encode(encoding, 'callback', ignore) > > > > > or break with an exception ? > > > > def raiseAnException(encoding, string, position): > > raise FancyException("can't encode character %r at position %d > in string %r with encoding %s" > > % (string[position], position, string, encoding)) > > > > u"xxx".encode(encoding, 'callback', raiseAnException) > > Ok. I still think that we need to pass more information from > and to the callback. How about this scheme (the internal error > handlers work using a similar scheme): > > def callback(encoding, inputdata, inputposition, > outputdata, outputposition, errors): > ...
> return (inputdata, inputposition, outputdata, outputposition) > > This would give the callback enough information to do just > about everything with the data in question. After having called > the callback(), the encoder or decoder would then reinitialize > itself using the returned data and positions. Does that mean that the callback can feed replacement input data back to the encoder? How does the callback tell the encoder to switch back to the original input after the replacement input is exhausted? Or does the callback have to construct a complete replacement input string? As I see it, the callback can't modify the outputdata, because the output data is already encoded, and the callback knows nothing about the encoding. How could an "xml-escape" be implemented with that? > > > > Looking again at the TR6 mechanism: Even if the error callback was > > > > called, and even if it had to return bytes instead of unicodes, it > > > > could still operate stateless: it would just output SQU as often as > > > > required. I believe that most stateful encodings have an "escape to > > > > known state" mechanism. > > > > > > Which is what I'm talking about all along: the codecs know best > > > what to do, so better extend them than try to fiddle in some > > > information using a callback. > > > > The callback is only used in the situation when the codec does > > not know what to do, i.e. when it encounters an unencodable > > character. The callback is an *error handler* and not a > > "I don't know how to implement my own encoding algorithm, > > please help me"-handler. >;-> > > Let's put it this way: the error handler should have at least > the same possibilities as the current builtin error handlers > have. There is a big difference: the generic callback should be able to work without knowing the encoding. All current builtin error handlers know the encoding because there's a specific error handler for every encoding.
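The callback mechanism being debated in this thread was later standardized (PEP 293, Python 2.3) as codecs.register_error(). As a minimal sketch of how the "xml-escape" treatment asked about above looks in that later API, written in today's Python 3 spelling; the handler name "xml-escape" is registered here purely for illustration and is not a builtin error treatment:

```python
import codecs

def xml_escape_errors(exc):
    # Encode-side error handler: replace each unencodable character
    # with an XML numeric character reference and resume after it.
    if isinstance(exc, UnicodeEncodeError):
        replacement = u"".join(u"&#%d;" % ord(c)
                               for c in exc.object[exc.start:exc.end])
        return replacement, exc.end
    raise exc

# Illustrative registration; "xml-escape" is not a builtin treatment.
codecs.register_error("xml-escape", xml_escape_errors)

print(u"abc\u20ac".encode("ascii", "xml-escape"))  # b'abc&#8364;'
```

Each unencodable character reaches the handler as a slice of a UnicodeEncodeError, and the returned replacement string is then encoded by the codec itself, so the handler needs no knowledge of the target encoding.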
> If a codec needs more information to process an error > condition, e.g. in case it holds internal state (encoder and > decoder functions may not use external state per design), > then it's the codec which has to be extended -- the error handler > won't be able to help. But the codec knows everything about its own internal state; what it does not know is what kind of error handling is wanted. This additional information can't be provided by the codec, but is provided by the user, who doesn't know anything about the encoding (e.g. if it's a list of acceptable encodings from an HTTP Accept-Charset header). > Would this be a good compromise ? > > > > I don't object to adding callback support to the codec's > > > error handlers, but we'll need a new set of APIs to allow > > > this. > > > > I could live with a > > u"xxx".encode(encoding, 'callback', handler) > > on the Python side, but what does this mean for the C API? > > Pretty much the same thing: we'll be adding PyUnicode_EncodeEx() > and PyUnicode_DecodeEx() APIs which have the additional > context object as PyObject*. OK, but what are those objects supposed to know and do? > > > > So I still think your objection is theoretical, whereas the problem > > > > that Walter is trying to solve is real. > > > > > > I did propose a solution which would satisfy your needs: simply > > > add a new error treatment 'xml-escape' to the builtin codecs > > > which then does the needed XML escaping. XML is general enough > > > to warrant such a step and the required changes are minor. > > > > > > Another candidate for a new error treatment would be 'unicode-escape' > > > which then replaces the character in question with '\uXXXX'. > > > > > > For the general case, I'd rather add new PyUnicode_EncodeEx() > > > and PyUnicode_DecodeEx() APIs which then take a Python > > > context object as extra argument. > > > > What should this extra argument be for the decoder? > > A PyObject* just like for the encoder.
The codec design is kept > symmetric to simplify support for stackable streams and also > to simplify the APIs (there aren't all that many API signatures > to remember). But the APIs are not really symmetric: There is no easy inverse of u"xxx".encode(encoding, "callback", xmlReplacementHandler) that automatically generates characters from XML character entities. How would the decoder know when a character entity is encountered? Encoding errors simply mean that the encoding is not capable of handling the data to be encoded. The error handling then has to provide a way of making the unencodable part of the data encodable. Ideally this should be independent of the encoding. Decoding errors mean something completely different: The encoded data does not conform to the format it claims to be in. Fixing this kind of error requires an intimate knowledge of the encoding and therefore cannot be encoding independent. > > > The error treatment string > > > would then define how to use this context object, e.g. 'callback' > > > could be made to apply processing similar to what Walter > > > suggested. > > > > 'callback' seems too generic to me. Maybe there will be other callbacks > > in the future that are used for different stuff. This is the > > "give me a replacement or die" error handler. > > The error handling string should provide enough room for > extensions... what other short string would you recommend ? > 'handler' or 'callcontext' ? In theory "replace" would be the correct name, as the error handler returns a replacement string to be encoded instead of the offending character. But we could use "replacementhandler" or something like that. > [...]
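The encode/decode asymmetry described here carried over into the API that was eventually standardized (PEP 293): a decode-side handler receives a UnicodeDecodeError whose .object attribute is the raw byte string, so a generic handler can only operate on bytes, never on characters of the source encoding. A minimal Python 3 sketch; the name "hex-escape" is an illustrative registration, not a builtin treatment:

```python
import codecs

def hex_escape_errors(exc):
    # Decode-side error handler: render undecodable bytes as literal
    # \xNN escapes instead of raising. exc.object is a bytes object.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return u"".join(u"\\x%02x" % b for b in bad), exc.end
    raise exc

# Illustrative registration; "hex-escape" is not a builtin treatment.
codecs.register_error("hex-escape", hex_escape_errors)

print(b"ok\xff".decode("ascii", "hex-escape"))  # ok\xff
```

Note that all this handler can do is describe the offending bytes; actually repairing malformed input would, as Walter argues, require knowledge of the specific encoding.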
Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From tim.one@home.com Fri Jan 26 21:01:17 2001 From: tim.one@home.com (Tim Peters) Date: Fri, 26 Jan 2001 16:01:17 -0500 Subject: [I18n-sig] Codec licenses In-Reply-To: <200101261811.DAA01489@dhcp234.grad.sccs.chukyo-u.ac.jp> Message-ID: [Tamito KAJIYAMA] > Exactly this effect of the GNU licenses was the reason why I > chose the GNU GPL for JapaneseCodecs. I wanted my software to > be shared by people forever. Guido does too. The GPL forces everyone who uses your code to make *their* code fall under the GPL too. So by using it, you're also telling other people how they have to license their own software (provided they want to use yours). That's part of the GNU philosophy, of course. You should read Stallman's "Why you shouldn't use the Library GPL for your next library": http://www.fsf.org/philosophy/why-not-lgpl.html Unless your code is impossible to duplicate by other means, people who *don't* want to put their own software under the GPL have a choice: they can implement their own library, and sooner or later someone will, and release it under a less drastic license than the GPL, and then the GPL'ed version will get used less and less. That's why the LGPL was invented. > To the best of my knowledge BSD licenses allow someone to make > that software proprietary and closed-source. Absolutely. That has no effect on your code, though: people can still come to you to get your code. You're the only one who can change your licensing terms. For example, Python is used in some closed-source projects and we couldn't care less. Well, actually, we're happy they're using Python! It doesn't stop you from getting Python from us, and doing whatever *you* want to do with it, so it's hard to see how anyone could feel injured (we don't feel injured, you're happy, and the closed-source people are happy too).
> This aspect is a contrast to the aforementioned effect of the GNU > GPL/LGPL. That's why I prefer the latter licenses. The GPL and the LGPL shouldn't be lumped together: they're very different. Stallman's essay (above) should make that clearer. From martin@mira.cs.tu-berlin.de Fri Jan 26 20:43:31 2001 From: martin@mira.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 26 Jan 2001 21:43:31 +0100 Subject: [I18n-sig] Codec licenses In-Reply-To: <3A7147E8.99A2BDC4@lemburg.com> (mal@lemburg.com) References: <3A6C8B37.EDEB795D@lemburg.com> <3A7147E8.99A2BDC4@lemburg.com> Message-ID: <200101262043.f0QKhVN00904@mira.informatik.hu-berlin.de> > I haven't received any comment on the above so far. > > Should I take this as rejection of the proposal ? I wasn't going to go into a long discussion about that matter, but I feel quite comfortable with the iconv codec being GPL'ed. Your main rationale for requesting such a change was > This makes the codecs unusable for software which isn't GPL > compatible and limits its usefulness considerably. I firmly believe that free software should be useful on its own technical merits, and that the LGPL is called the "lesser" GPL for a reason; the FSF actively encourages authors *not* to license software under its terms. I could be talked into changing the license if some project that I support would want to use it, and couldn't because of the GPL (e.g. if it was candidate for inclusion into Python). I won't change in advance. Regards, Martin From mal@lemburg.com Fri Jan 26 21:47:10 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 26 Jan 2001 22:47:10 +0100 Subject: [I18n-sig] Codec licenses References: <3A6C8B37.EDEB795D@lemburg.com> <3A7147E8.99A2BDC4@lemburg.com> <200101262043.f0QKhVN00904@mira.informatik.hu-berlin.de> Message-ID: <3A71F05E.1B1FC74E@lemburg.com> "Martin v. Loewis" wrote: > > > I haven't received any comment on the above so far. > > > > Should I take this as rejection of the proposal ? 
> > I wasn't going to go into a long discussion about that matter, but I > feel quite comfortable with the iconv codec being GPL'ed. Your main > rationale for requesting such a change was > > > This makes the codecs unusable for software which isn't GPL > > compatible and limits its usefulness considerably. > > I firmly believe that free software should be useful on its own > technical merits, and that the LGPL is called the "lesser" GPL for a > reason; the FSF actively encourages authors *not* to license software > under its terms. > > I could be talked into changing the license if some project that I > support would want to use it, and couldn't because of the GPL (e.g. if > it was candidate for inclusion into Python). I won't change in > advance. Writing an iconv package has been on my list of "nice projects" for a while. Unfortunately, I haven't found time to look into this. After having seen you code up something along those lines, I dropped the idea... I guess I'll have to revive it again :-/ Note that iconv itself is distributed under the LGPL, so nothing would prevent me from writing a codec package under a Python style license. The same applies to all other codecs. I still think that such a needless effort could be avoided if people were to play nice. We could then wrap a nice codec extension package for everyone to use at their will. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Fri Jan 26 23:19:12 2001 From: andy@reportlab.com (Andy Robinson) Date: Fri, 26 Jan 2001 23:19:12 -0000 Subject: [I18n-sig] Codec licenses In-Reply-To: <200101262043.f0QKhVN00904@mira.informatik.hu-berlin.de> Message-ID: > > I could be talked into changing the license if some project that I > support would want to use it, and couldn't because of the > GPL (e.g. 
if > it was candidate for inclusion into Python). I won't change in > advance. This discussion has surprised me considerably. Python has always had a non-restrictive licence, as do almost all the packages available for it, and that is one reason why it is successful. If we want to create an "official" Python codec package, we should be prepared to do it under a Python-style license. My own company (www.reportlab.com) makes free and unrestricted reporting libraries, but we are preparing commercial products which will sit on top of those. We need to start selling these products for high prices per server license in order to stay alive and keep coding and keep contributing to open source. One feature we will need within six months is encoding conversions. We would not be able to use any GPL'ed code. So, if we get a customer for Report Markup Language in Japan and we need to do encoding conversions, we will be forced to write a clean implementation. And I promise that we'll release it to the world under a Python compatible licence, as we have no interest in trying to sell such a general-purpose utility. Furthermore, I have done a lot of consulting projects for big corporate customers where we solved problems by integrating open source code. They don't want to take any GPL'ed code, as the cost of ripping it out in future if they ever do want to sell some software would be huge. The Python license has never caused a question. It's always the author's choice, but if you prevent any software house from developing commercial packages which use your code, you limit its exposure and acceptance. Just my 2p worth, Andy Robinson From tree@basistech.com Fri Jan 26 23:33:24 2001 From: tree@basistech.com (Tom Emerson) Date: Fri, 26 Jan 2001 18:33:24 -0500 Subject: [I18n-sig] Codec licenses In-Reply-To: References: <200101262043.f0QKhVN00904@mira.informatik.hu-berlin.de> Message-ID: <14962.2372.636324.340540@cymru.basistech.com> I agree with Andy.
I will also add that, for most of the encodings we're looking at, there is no magic going on: EUC-KR or EUC-CN to Unicode, and back, is a simple table lookup. Doing the ISO-2022 encodings is a bit more work, but it isn't rocket science. As far as I'm concerned, a codec that merely wraps the Unicode Consortium's mapping tables is hardly deserving of any license at all. Using the existing codecs (or an Asian codec package) is an issue of convenience more than anything. This is not meant to belittle those who have written these codecs... my point is merely that placing a highly restrictive license such as the GPL on a codec is considerable overkill. Please direct nasty grams to /dev/null. -tree -- Tom Emerson Basis Technology Corp. Zenkaku Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@mira.cs.tu-berlin.de Fri Jan 26 23:36:44 2001 From: martin@mira.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 27 Jan 2001 00:36:44 +0100 Subject: [I18n-sig] Codec licenses In-Reply-To: <3A71F05E.1B1FC74E@lemburg.com> (mal@lemburg.com) References: <3A6C8B37.EDEB795D@lemburg.com> <3A7147E8.99A2BDC4@lemburg.com> <200101262043.f0QKhVN00904@mira.informatik.hu-berlin.de> <3A71F05E.1B1FC74E@lemburg.com> Message-ID: <200101262336.f0QNaie01717@mira.informatik.hu-berlin.de> > Note that iconv itself is distributed under the LGPL, so nothing > would prevent me from writing a codec package under a Python > style license. The same applies to all other codecs. > > I still think that such a needless effort could be avoided if > people were to play nice. We could then wrap a nice codec extension > package for everyone to use at their will. I don't see your point (but that is probably a starting point to a long and needless discussion on free software and licensing). You are certainly free to write an iconv codec. I can't see *why* you would want to do so - unless you have an actual need for it. If so, what is that need? 
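The "simple table lookup" Tom describes can be sketched in a few lines. The single-entry table below is a stand-in for the thousands of pairs a real Unicode Consortium mapping file supplies; the one entry shown (EUC-JP bytes A4 A2 for U+3042 HIRAGANA LETTER A) is only an example, and the function is an illustration, not any existing codec:

```python
# Illustrative one-entry mapping table; a real codec would load the
# full mapping file for the character set in question.
TABLE = {0xA4A2: u"\u3042"}  # EUC-JP bytes A4 A2 -> HIRAGANA LETTER A

def decode_euc_like(data, table):
    # Bytes below 0x80 are plain ASCII; anything above starts a
    # two-byte code that is looked up in the mapping table.
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        else:
            out.append(table[(b << 8) | data[i + 1]])
            i += 2
    return u"".join(out)

print(decode_euc_like(b"a\xa4\xa2", TABLE))  # 'a' followed by U+3042
```

A real codec would also need the encode direction, error handling for unmapped codes, and checks for truncated input, but as Tom says, none of that is rocket science.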
I'm curious. Talking about talking other people into changing the license of their software: Could you please change the license of mxODBC so that it is free software? A BSD-style license would be nice; restrictions on commercial use are not. Regards, Martin From mal@lemburg.com Sat Jan 27 00:05:53 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 27 Jan 2001 01:05:53 +0100 Subject: [I18n-sig] Codec licenses References: <3A6C8B37.EDEB795D@lemburg.com> <3A7147E8.99A2BDC4@lemburg.com> <200101262043.f0QKhVN00904@mira.informatik.hu-berlin.de> <3A71F05E.1B1FC74E@lemburg.com> <200101262336.f0QNaie01717@mira.informatik.hu-berlin.de> Message-ID: <3A7210E1.F1867092@lemburg.com> "Martin v. Loewis" wrote: > > > Note that iconv itself is distributed under the LGPL, so nothing > > would prevent me from writing a codec package under a Python > > style license. The same applies to all other codecs. > > > > I still think that such a needless effort could be avoided if > > people were to play nice. We could then wrap a nice codec extension > > package for everyone to use at their will. > > I don't see your point (but that is probably a starting point to a > long and needless discussion on free software and licensing). > > You are certainly free to write an iconv codec. I can't see *why* you > would want to do so - unless you have an actual need for it. If so, > what is that need? I'm curious. Very simple: I make a living out of selling closed-source software. As it happens much of the closed-source software uses basic building blocks which are open source, such as Python and many of my mx tools. GPLed code is useless in such a setup though, so I'd need to rewrite the code using either a closed source license (doesn't buy me anything) or a liberal Python style license (buys me free debugging and saves lots of others the effort of writing their own version -- with the result of making everyone happy).
> Talking about talking other people into changing the license of their > software: Could you please change the license of mxODBC so that it is > free software? A BSD-style license would be nice; restrictions on > commercial use are not. I'm not talking anyone into changing their mind on what license to put on their software. I just want people to be aware of what they are doing when they use the GPL for licensing software. As for mxODBC: that will turn into a commercial product starting with the next release. I have to take this step in order to fund development of the other mx open source tools and to be able to actively maintain the package (which is a can of worms...). Anyway, let's *not* head down this road. The codec authors are free to do whatever they like. I just wanted to clarify the problems which using the GPL for library-style code has for the code and its users -- nothing more. I don't want to talk anyone into changing licenses. It would be nice though, if I could convince some of the authors to rethink their decision. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ PS: We seem to be on different wavelengths on a lot of subjects, Martin. Let's simply agree to differ :-) From frank63@ms5.hinet.net Sat Jan 27 11:49:52 2001 From: frank63@ms5.hinet.net (Frank Chen) Date: Sat, 27 Jan 2001 11:49:52 -0000 Subject: [I18n-sig] Re: Codec licenses Message-ID: <200101270347.LAA12815@ms5.hinet.net> Hi: Then, if I say: Conform to GPL or LGPL. Is this logical? Frank Chen From andy@reportlab.com Sat Jan 27 08:07:49 2001 From: andy@reportlab.com (Andy Robinson) Date: Sat, 27 Jan 2001 08:07:49 -0000 Subject: [I18n-sig] Re: Codec licenses In-Reply-To: <200101270347.LAA12815@ms5.hinet.net> Message-ID: > Then, if I say: > > Conform to GPL or LGPL. > > Is this logical?
If you give people the choice of licenses, yes, that totally solves the problem. - Andy Robinson