From paulp@ActiveState.com  Sun Jul  1 20:57:09 2001
From: paulp@ActiveState.com (Paul Prescod)
Date: Sun, 01 Jul 2001 12:57:09 -0700
Subject: [I18n-sig] PEP 261, Rev 1.3 - Support for "wide" Unicode characters
Message-ID: <3B3F8095.8D58631D@ActiveState.com>

PEP: 261
Title: Support for "wide" Unicode characters
Version: $Revision: 1.3 $
Author: paulp@activestate.com (Paul Prescod)
Status: Draft
Type: Standards Track
Created: 27-Jun-2001
Python-Version: 2.2
Post-History: 27-Jun-2001


Abstract

    Python 2.1 unicode characters can have ordinals only up to
    2**16 - 1.  This range corresponds to a range in Unicode known as
    the Basic Multilingual Plane.  There are now characters in Unicode
    that live on other "planes".  The largest addressable character in
    Unicode has the ordinal 17 * 2**16 - 1 (0x10ffff).  For
    readability, we will call this TOPCHAR and call characters in this
    range "wide characters".


Glossary

    Character
        Used by itself, means the addressable units of a Python
        Unicode string.

    Code point
        A code point is an integer between 0 and TOPCHAR.  If you
        imagine Unicode as a mapping from integers to characters, each
        integer is a code point.  But the integers between 0 and
        TOPCHAR that do not map to characters are also code points.
        Some will someday be used for characters.  Some are guaranteed
        never to be used for characters.

    Codec
        A set of functions for translating between physical encodings
        (e.g. on disk or coming in from a network) and logical Python
        objects.

    Encoding
        Mechanism for representing abstract characters in terms of
        physical bits and bytes.  Encodings allow us to store Unicode
        characters on disk and transmit them over networks in a manner
        that is compatible with other Unicode software.

    Surrogate pair
        Two physical characters that represent a single logical
        character.  Part of a convention for representing 32-bit code
        points in terms of two 16-bit code points.

    Unicode string
        A Python type representing a sequence of code points with
        "string semantics" (e.g.
case conversions, regular expression compatibility, etc.)
        Constructed with the unicode() function.


Proposed Solution

    One solution would be to merely increase the maximum ordinal to a
    larger value.  Unfortunately the only straightforward
    implementation of this idea is to use 4 bytes per character.  This
    has the effect of doubling the size of most Unicode strings.  In
    order to avoid imposing this cost on every user, Python 2.2 will
    allow the 4-byte implementation as a build-time option.  Users can
    choose whether they care about wide characters or prefer to
    preserve memory.

    The 4-byte option is called "wide Py_UNICODE".  The 2-byte option
    is called "narrow Py_UNICODE".

    Most things will behave identically in the wide and narrow worlds.

    * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
      length-one string.

    * unichr(i) for 2**16 <= i <= TOPCHAR will return a length-one
      string on wide Python builds.  On narrow builds it will raise
      ValueError.

        ISSUE

            Python currently allows \U literals that cannot be
            represented as a single Python character.  It generates
            two Python characters known as a "surrogate pair".  Should
            this be disallowed on future narrow Python builds?

        Pro:

            Python already allows the construction of a surrogate pair
            for a large unicode literal character escape sequence.
            This is basically designed as a simple way to construct
            "wide characters" even in a narrow Python build.  It is
            also somewhat logical considering that the Unicode-literal
            syntax is basically a short-form way of invoking the
            unicode-escape codec.

        Con:

            Surrogates could be easily created this way but the user
            still needs to be careful about slicing, indexing,
            printing etc.  Therefore some have suggested that Unicode
            literals should not support surrogates.

        ISSUE

            Should Python allow the construction of characters that do
            not correspond to Unicode code points?  Unassigned Unicode
            code points should obviously be legal (because they could
            be assigned at any time).
But code points above TOPCHAR are guaranteed never to be used by
            Unicode.  Should we allow access to them anyhow?

        Pro:

            If a Python user thinks they know what they're doing why
            should we try to prevent them from violating the Unicode
            spec?  After all, we don't stop 8-bit strings from
            containing non-ASCII characters.

        Con:

            Codecs and other Unicode-consuming code will have to be
            careful of these characters which are disallowed by the
            Unicode specification.

    * ord() is always the inverse of unichr()

    * There is an integer value in the sys module that describes the
      largest ordinal for a character in a Unicode string on the
      current interpreter.  sys.maxunicode is 2**16-1 (0xffff) on
      narrow builds of Python and TOPCHAR on wide builds.

        ISSUE: Should there be distinct constants for accessing
               TOPCHAR and the real upper bound for the domain of
               unichr (if they differ)?  There has also been a
               suggestion of sys.unicodewidth which can take the
               values 'wide' and 'narrow'.

    * every Python Unicode character represents exactly one Unicode
      code point (i.e. Python Unicode Character = Abstract Unicode
      character).

    * codecs will be upgraded to support "wide characters"
      (represented directly in UCS-4, and as variable-length sequences
      in UTF-8 and UTF-16).  This is the main part of the
      implementation left to be done.

    * There is a convention in the Unicode world for encoding a 32-bit
      code point in terms of two 16-bit code points.  These are known
      as "surrogate pairs".  Python's codecs will adopt this
      convention and encode 32-bit code points as surrogate pairs on
      narrow Python builds.

        ISSUE

            Should there be a way to tell codecs not to generate
            surrogates and instead treat wide characters as errors?

        Pro:

            I might want to write code that works only with
            fixed-width characters and does not have to worry about
            surrogates.

        Con:

            No clear proposal of how to communicate this to codecs.

    * there are no restrictions on constructing strings that use code
      points "reserved for surrogates" improperly.
These are called "isolated surrogates".  The codecs should
      disallow reading these from files, but you could construct them
      using string literals or unichr().


Implementation

    There is a new (experimental) define:

        #define PY_UNICODE_SIZE 2

    There is a new configure option:

        --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
                              wchar_t if it fits
        --enable-unicode=ucs4 configures a wide Py_UNICODE, and uses
                              wchar_t if it fits
        --enable-unicode      same as "=ucs2"

    The intention is that --disable-unicode, or --enable-unicode=no
    removes the Unicode type altogether; this is not yet implemented.

    It is also proposed that one day --enable-unicode will just
    default to the width of your platform's wchar_t.

    Windows builds will be narrow for a while, based on the fact that
    there have been few requests for wide characters, those requests
    are mostly from hard-core programmers with the ability to buy
    their own Python, and Windows itself is strongly biased towards
    16-bit characters.


Notes

    This PEP does NOT imply that people using Unicode need to use a
    4-byte encoding for their files on disk or sent over the network.
    It only allows them to do so.  For example, ASCII is still a
    legitimate (7-bit) Unicode-encoding.

    It has been proposed that there should be a module that handles
    surrogates in narrow Python builds for programmers.  If someone
    wants to implement that, it will be another PEP.  It might also be
    combined with features that allow other kinds of character-,
    word- and line-based indexing.


Rejected Suggestions

    More or less the status-quo

        We could officially say that Python characters are 16-bit and
        require programmers to implement wide characters in their
        application logic by combining surrogate pairs.  This is a
        heavy burden because emulating 32-bit characters is likely to
        be very inefficient if it is coded entirely in Python.  Plus
        these abstracted pseudo-strings would not be legal as input to
        the regular expression engine.
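    For reference, the surrogate-pair arithmetic that such
    application logic would have to combine can be sketched as follows
    (a hedged illustration in later-Python syntax; the helper names
    split_surrogates and combine_surrogates are made up for this
    sketch, not part of any proposed API):

```python
def split_surrogates(cp):
    # Encode a code point above the BMP as a UTF-16 surrogate pair.
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def combine_surrogates(high, low):
    # Recover the original code point from a surrogate pair.
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(u) for u in split_surrogates(0x10FFFF)])  # ['0xdbff', '0xdfff']
```

    Doing this for every indexing or slicing operation is exactly the
    per-character overhead the paragraph above objects to.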
    "Space-efficient Unicode" type

        Another class of solution is to use some efficient storage
        internally but present an abstraction of wide characters to
        the programmer.  Any of these would require a much more
        complex implementation than the accepted solution.  For
        instance consider the impact on the regular expression engine.
        In theory, we could move to this implementation in the future
        without breaking Python code.  A future Python could "emulate"
        wide Python semantics on narrow Python.  Guido is not willing
        to undertake the implementation right now.

    Two types

        We could introduce a 32-bit Unicode type alongside the 16-bit
        type.  There is a lot of code that expects there to be only a
        single Unicode type.

    This PEP represents the least-effort solution.  Over the next
    several years, 32-bit Unicode characters will become more common
    and that may either convince us that we need a more sophisticated
    solution or (on the other hand) convince us that simply mandating
    wide Unicode characters is an appropriate solution.  Right now the
    two options on the table are do nothing or do this.


References

    Unicode Glossary: http://www.unicode.org/glossary/


Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
End:

-- 
Take a recipe.  Leave a recipe.
Python Cookbook!  http://www.ActiveState.com/pythoncookbook

From mal@lemburg.com  Mon Jul  2 11:13:59 2001
From: mal@lemburg.com (M.-A.
Lemburg) Date: Mon, 02 Jul 2001 12:13:59 +0200 Subject: [I18n-sig] Re: [Python-Dev] PEP 261, Rev 1.3 - Support for "wide" Unicode characters References: <3B3F8095.8D58631D@ActiveState.com> Message-ID: <3B404967.14FE180F@lemburg.com> Paul Prescod wrote: > > PEP: 261 > Title: Support for "wide" Unicode characters > Version: $Revision: 1.3 $ > Author: paulp@activestate.com (Paul Prescod) > Status: Draft > Type: Standards Track > Created: 27-Jun-2001 > Python-Version: 2.2 > Post-History: 27-Jun-2001 > > Abstract > > Python 2.1 unicode characters can have ordinals only up to 2**16 > -1. > This range corresponds to a range in Unicode known as the Basic > Multilingual Plane. There are now characters in Unicode that live > on other "planes". The largest addressable character in Unicode > has the ordinal 17 * 2**16 - 1 (0x10ffff). For readability, we > will call this TOPCHAR and call characters in this range "wide > characters". > > Glossary > > Character > > Used by itself, means the addressable units of a Python > Unicode string. Please add: also known as "code unit". > Code point > > A code point is an integer between 0 and TOPCHAR. > If you imagine Unicode as a mapping from integers to > characters, each integer is a code point. But the > integers between 0 and TOPCHAR that do not map to > characters are also code points. Some will someday > be used for characters. Some are guaranteed never > to be used for characters. > > Codec > > A set of functions for translating between physical > encodings (e.g. on disk or coming in from a network) > into logical Python objects. > > Encoding > > Mechanism for representing abstract characters in terms of > physical bits and bytes. Encodings allow us to store > Unicode characters on disk and transmit them over networks > in a manner that is compatible with other Unicode software. > > Surrogate pair > > Two physical characters that represent a single logical Eeek... 
two code units (or have you ever seen a physical character walking around ;-) > character. Part of a convention for representing 32-bit > code points in terms of two 16-bit code points. > > Unicode string > > A Python type representing a sequence of code points with > "string semantics" (e.g. case conversions, regular > expression compatibility, etc.) Constructed with the > unicode() function. > > Proposed Solution > > One solution would be to merely increase the maximum ordinal > to a larger value. Unfortunately the only straightforward > implementation of this idea is to use 4 bytes per character. > This has the effect of doubling the size of most Unicode > strings. In order to avoid imposing this cost on every > user, Python 2.2 will allow the 4-byte implementation as a > build-time option. Users can choose whether they care about > wide characters or prefer to preserve memory. > > The 4-byte option is called "wide Py_UNICODE". The 2-byte option > is called "narrow Py_UNICODE". > > Most things will behave identically in the wide and narrow worlds. > > * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a > length-one string. > > * unichr(i) for 2**16 <= i <= TOPCHAR will return a > length-one string on wide Python builds. On narrow builds it will > raise ValueError. > > ISSUE > > Python currently allows \U literals that cannot be > represented as a single Python character. It generates two > Python characters known as a "surrogate pair". Should this > be disallowed on future narrow Python builds? > > Pro: > > Python already the construction of a surrogate pair > for a large unicode literal character escape sequence. > This is basically designed as a simple way to construct > "wide characters" even in a narrow Python build. It is also > somewhat logical considering that the Unicode-literal syntax > is basically a short-form way of invoking the unicode-escape > codec. 
> > Con: > > Surrogates could be easily created this way but the user > still needs to be careful about slicing, indexing, printing > etc. Therefore some have suggested that Unicode > literals should not support surrogates. > > ISSUE > > Should Python allow the construction of characters that do > not correspond to Unicode code points? Unassigned Unicode > code points should obviously be legal (because they could > be assigned at any time). But code points above TOPCHAR are > guaranteed never to be used by Unicode. Should we allow > access > to them anyhow? > > Pro: > > If a Python user thinks they know what they're doing why > should we try to prevent them from violating the Unicode > spec? After all, we don't stop 8-bit strings from > containing non-ASCII characters. > > Con: > > Codecs and other Unicode-consuming code will have to be > careful of these characters which are disallowed by the > Unicode specification. > > * ord() is always the inverse of unichr() > > * There is an integer value in the sys module that describes the > largest ordinal for a character in a Unicode string on the current > interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds > of Python and TOPCHAR on wide builds. > > ISSUE: Should there be distinct constants for accessing > TOPCHAR and the real upper bound for the domain of > unichr (if they differ)? There has also been a > suggestion of sys.unicodewidth which can take the > values 'wide' and 'narrow'. > > * every Python Unicode character represents exactly one Unicode code > point (i.e. Python Unicode Character = Abstract Unicode > character). > > * codecs will be upgraded to support "wide characters" > (represented directly in UCS-4, and as variable-length sequences > in UTF-8 and UTF-16). This is the main part of the implementation > left to be done. > > * There is a convention in the Unicode world for encoding a 32-bit > code point in terms of two 16-bit code points. These are known > as "surrogate pairs". 
Python's codecs will adopt this convention > and encode 32-bit code points as surrogate pairs on narrow Python > builds. > > ISSUE > > Should there be a way to tell codecs not to generate > surrogates and instead treat wide characters as > errors? > > Pro: > > I might want to write code that works only with > fixed-width characters and does not have to worry about > surrogates. > > Con: > > No clear proposal of how to communicate this to codecs. No need to pass this information to the codec: simply write a new one and give it a clear name, e.g. "ucs-2" will generate errors while "utf-16-le" converts them to surrogates. > * there are no restrictions on constructing strings that use > code points "reserved for surrogates" improperly. These are > called "isolated surrogates". The codecs should disallow reading > these from files, but you could construct them using string > literals or unichr(). > > Implementation > > There is a new (experimental) define: > > #define PY_UNICODE_SIZE 2 > > There is a new configure option: > > --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses > wchar_t if it fits > --enable-unicode=ucs4 configures a wide Py_UNICODE, and uses > whchar_t if it fits > --enable-unicode same as "=ucs2" > > The intention is that --disable-unicode, or --enable-unicode=no > removes the Unicode type altogether; this is not yet implemented. > > It is also proposed that one day --enable-unicode will just > default to the width of your platforms wchar_t. > > Windows builds will be narrow for a while based on the fact that > there have been few requests for wide characters, those requests > are mostly from hard-core programmers with the ability to buy > their own Python and Windows itself is strongly biased towards > 16-bit characters. > > Notes > > This PEP does NOT imply that people using Unicode need to use a > 4-byte encoding for their files on disk or sent over the network. > It only allows them to do so. 
For example, ASCII is still a > legitimate (7-bit) Unicode-encoding. > > It has been proposed that there should be a module that handles > surrogates in narrow Python builds for programmers. If someone > wants to implement that, it will be another PEP. It might also be > combined with features that allow other kinds of character-, > word- and line- based indexing. > > Rejected Suggestions > > More or less the status-quo > > We could officially say that Python characters are 16-bit and > require programmers to implement wide characters in their > application logic by combining surrogate pairs. This is a heavy > burden because emulating 32-bit characters is likely to be > very inefficient if it is coded entirely in Python. Plus these > abstracted pseudo-strings would not be legal as input to the > regular expression engine. > > "Space-efficient Unicode" type > > Another class of solution is to use some efficient storage > internally but present an abstraction of wide characters to > the programmer. Any of these would require a much more complex > implementation than the accepted solution. For instance consider > the impact on the regular expression engine. In theory, we could > move to this implementation in the future without breaking > Python > code. A future Python could "emulate" wide Python semantics on > narrow Python. Guido is not willing to undertake the > implementation right now. > > Two types > > We could introduce a 32-bit Unicode type alongside the 16-bit > type. There is a lot of code that expects there to be only a > single Unicode type. > > This PEP represents the least-effort solution. Over the next > several years, 32-bit Unicode characters will become more common > and that may either convince us that we need a more sophisticated > solution or (on the other hand) convince us that simply > mandating wide Unicode characters is an appropriate solution. > Right now the two options on the table are do nothing or do > this. 
> 
> References
> 
>     Unicode Glossary: http://www.unicode.org/glossary/

Plus perhaps the Mark Davis paper at:

    http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/

> Copyright
> 
>     This document has been placed in the public domain.

Good work, Paul !

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/

From walter@livinglogic.de  Mon Jul  2 12:40:52 2001
From: walter@livinglogic.de (Walter Dörwald)
Date: Mon, 02 Jul 2001 13:40:52 +0200
Subject: [I18n-sig] Error handling (was: Re: validity of lone surrogates)
References: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> <4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> <200106271416.f5REGl519361@odiug.digicool.com> <3B3A1020.7154E4B6@livinglogic.de> <200106271753.f5RHrAB19753@odiug.digicool.com>
Message-ID: <3B405DC4.1050900@livinglogic.de>

> > How would this work together with the proposed encode error handling
> > callback feature (see patch #432401)?  Does this patch have any
> > chance of getting into Python (when it's finished)?
> 
> I don't know.  The patch looks awfully big, and the motivation seems
> thin, so I don't have high hopes.  I doubt that I would use it myself,
> and I fear that it would be pretty slow if called frequently.
Here are a few speed comparisons:

---
import time
s = u"a"*20000000
t1 = time.time()
s.encode("ascii")
t2 = time.time()
print t2-t1
---

The result with Python 2.1 is: 0.65726006031
With the patch the time is: 0.895708084106
(This is probably due to the memory reallocation tests, which could be
avoided for most encoders)

And a test script with an error handler:

---
import time
s = u"aä"*1000000
t1 = time.time()
s.encode("ascii", lambda enc,uni,pos: u"&#%d;" % ord(uni[pos]))
t2 = time.time()
print t2-t1
---

37.0272110701

There is a version of this error handler implemented in C, so replacing

   s.encode("ascii", lambda enc,uni,pos: u"&#%d;" % ord(uni[pos]))

with

   s.encode("ascii", codecs.xmlcharrefreplace_unicodeencode_errors)

gives a result of 4.77566099167

The equivalent Python code:

---
import time
s = u"aä"*1000000
t1 = time.time()
v = []
for c in s:
    try:
        v.append(c.encode("ascii"))
    except UnicodeError:
        v.append("&#%d;" % ord(c))
"".join(v)
t2 = time.time()
print t2-t1
---

345.193374991

(Note that this is not really equivalent, because it doesn't work with
stateful encoders (e.g. UTF16 generates multiple BOMs))

> An alternative way to get what you want would be to write your own
> codec.

This would have to be more like a meta codec, because this feature
should be available for every character encoding.

> Also, some standard codecs might be subclassable in a way that
> makes it easy to get the desired functionality through subclassing
> rather than through changing lots of C level APIs.

The patch changes the API in two places:

1. "PyObject *error" is used instead of "const char *error", because
error may be a callable object instead of a string.
There would be a possibility to have the error argument as
"const char *error": Define an error handling registry where error
handling functions can be registered by name:

   codec.registerError("xmlreplace",
      lambda enc,uni,pos: "&#%d;" % ord(uni[pos]))

and then the following call can be made:

   u"äöü".encode("ascii", "xmlreplace")

As soon as the first error is encountered, the encoder uses its
builtin error handling method if it recognizes the name ("strict",
"replace" or "ignore") or looks up the error handling function in the
registry if it doesn't.  In this way the speed for the backwards
compatible features is the same as before and "const char *error" can
be kept as the parameter to all encoding functions.  For speed, common
error handling names could even be implemented in the encoder itself.

2. The arguments "Py_UNICODE *str, int size" to the encoder functions
have been replaced with "PyObject *unicode".  This was done because
the original string is passed to the callback handler, which is just
an INCREF when the string is already available as "PyObject *unicode",
but a new string has to be created from str/size (though this has to
be done only once, for the first error).  So it's possible to change
this back to the original.

With this it would be possible to implement the functionality without
changing the API and without any loss of speed for already existing
functionality.  Old third party encoders will continue to work for the
old error options and would simply raise an "unknown error handling"
exception for the new ones.

Should I try this approach?  Does it have a better chance of getting
into Python?
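[For readers on a modern Python: the registry idea sketched above is
essentially what codecs.register_error later provided.  A minimal
illustration, assuming that mechanism; the handler name "xmlreplace"
is just the example name used above:]

```python
import codecs

def xmlreplace(exc):
    # Replace each unencodable character with an XML character
    # reference, then resume encoding after the failing span.
    if isinstance(exc, UnicodeEncodeError):
        refs = "".join("&#%d;" % ord(ch)
                       for ch in exc.object[exc.start:exc.end])
        return (refs, exc.end)
    raise exc

codecs.register_error("xmlreplace", xmlreplace)

print("a\xe4".encode("ascii", "xmlreplace"))  # b'a&#228;'
```

As proposed, unknown names raise a LookupError only when an error is
actually encountered, so the fast path for "strict"/"replace"/"ignore"
is unchanged.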
Bye,
   Walter Dörwald

From fredrik@pythonware.com  Mon Jul  2 17:02:31 2001
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Mon, 2 Jul 2001 18:02:31 +0200
Subject: [I18n-sig] UCS-4 configuration
References: 
Message-ID: <008a01c10310$671dc990$4ffa42d5@hagrid>

tim wrote:

> [discussion about PyUnicode_DecodeUTF16]
>
> It's nice that we got to chat about portability to Platforms from
> Mars, but is anyone actually going to work on that function?  It
> shouldn't be hard, I just don't want to see it fall thru the cracks.

isn't it about time you hacked on some unicode stuff? ;-)

Cheers /F

From pinard@iro.umontreal.ca  Mon Jul  2 19:19:05 2001
From: pinard@iro.umontreal.ca (François Pinard)
Date: 02 Jul 2001 14:19:05 -0400
Subject: [I18n-sig] Re: How does Python Unicode treat surrogates?
In-Reply-To: <87u216qluh.fsf@deneb.enyo.de>
References: <9F2D83017589D211BD1000805FA70CA703B139D6@ntxmel03.cmutual.com.au> <87u216qluh.fsf@deneb.enyo.de>
Message-ID: 

[Florian Weimer]

> ISO 10646 is the ISO standard with lowest money per page ratio ever

I heard that ISO lowered the price of 10646 indeed.  A few years ago,
we needed 10646, and the price was, euh, substantial. :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca  Mon Jul  2 20:05:35 2001
From: pinard@iro.umontreal.ca (François Pinard)
Date: 02 Jul 2001 15:05:35 -0400
Subject: [I18n-sig] Re: Unicode surrogates: just say no!
In-Reply-To: <200106271953.f5RJrPi19963@odiug.digicool.com>
References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3696.FFA7FCE@ActiveState.com> <200106271953.f5RJrPi19963@odiug.digicool.com>
Message-ID: 

[Guido van Rossum]

> When using UCS-4 mode, I was in favor of allowing unichr() and \U to
> specify any value in range(0x100000000L)

I did not check recently, but would think Unicode and 10646 are
defined on 31 bits, not 32.
If you represent a UCS-4 code within a 32-bit int, it will never be
negative.  It might be useful to rely on this.

P.S. - Would not 32 bits also require one more byte in UTF-8?

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From tim.one@home.com  Mon Jul  2 20:46:44 2001
From: tim.one@home.com (Tim Peters)
Date: Mon, 2 Jul 2001 15:46:44 -0400
Subject: [I18n-sig] UCS-4 configuration
In-Reply-To: <008a01c10310$671dc990$4ffa42d5@hagrid>
Message-ID: 

[/F]
> isn't it about time you hacked on some unicode stuff? ;-)

It's a good thing I'm out sick today, cuz they'd never pay me for
this:

http://sf.net/tracker/index.php?func=detail&aid=438013&group_id=5470&atid=305470

From deltab@osian.net  Mon Jul  2 22:21:49 2001
From: deltab@osian.net (Daniel Biddle)
Date: Mon, 2 Jul 2001 21:21:49 +0000
Subject: [I18n-sig] Re: Unicode surrogates: just say no!
In-Reply-To: ; from pinard@iro.umontreal.ca on Mon, Jul 02, 2001 at 03:05:13PM -0400
References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3696.FFA7FCE@ActiveState.com> <200106271953.f5RJrPi19963@odiug.digicool.com>
Message-ID: <20010702212149.D30109@mewtwo.espnow.com>

On Mon, Jul 02, 2001 at 03:05:13PM -0400, François Pinard wrote:
> [Guido van Rossum]
>
> > When using UCS-4 mode, I was in favor of allowing unichr() and \U to
> > specify any value in range(0x100000000L)
>
> I did not check recently, but would think Unicode and 10646 are
> defined on 31 bits, not 32.  If you represent a UCS-4 code within a
> 32-bit int, it will never be negative.  It might be useful to rely
> on this.

Certainly ISO 10646 is defined as 31-bit.  Unicode was 16-bit, but now
uses just under 20.09 bits.

> P.S. - Would not 32 bits also require one more byte in UTF-8?

Yes:

   bits     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   control  7        2        2        2        2        2        = 17
   data     1        6        6        6        6        6        = 31

UTF-8 allows at most 6 bytes, which can encode 31 bits.
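[The byte counts implied by that scheme can be sketched as follows; a
hedged illustration of the original, pre-restriction 6-byte UTF-8
ranges, where utf8_len is a made-up helper name:]

```python
def utf8_len(cp):
    # Number of bytes in the original (up to 6-byte) UTF-8 encoding
    # of a code point, covering the full 31-bit ISO 10646 range.
    if cp < 0x80:       return 1  # 7 data bits
    if cp < 0x800:      return 2  # 11 data bits
    if cp < 0x10000:    return 3  # 16 data bits
    if cp < 0x200000:   return 4  # 21 data bits
    if cp < 0x4000000:  return 5  # 26 data bits
    return 6                      # 31 data bits, up to 2**31 - 1

print(utf8_len(0x10FFFF), utf8_len(2**31 - 1))  # 4 6
```

Note that everything up to U+10FFFF, the UTF-16 limit, fits in 4
bytes; the 5- and 6-byte forms exist only for the extra 31-bit space.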
It's been proposed that UTF-8 and UTF-32 be limited to values up to
U+10FFFF, which is the limit of UTF-16.

-- 
Daniel Biddle

From Misha.Wolf@reuters.com  Fri Jul 13 14:11:32 2001
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 13 Jul 2001 14:11:32 +0100
Subject: [I18n-sig] Call for Papers - 20th Unicode Conference - Jan/Feb 2001 - Washington DC
Message-ID: 

   Twentieth International Unicode Conference (IUC20)
   Unicode and the Web: The Global Connection
   http://www.unicode.org/iuc/iuc20
   January 28 - February 1, 2002
   Washington, DC, USA

   > > > > > > >  C A L L   F O R   P A P E R S  < < < < < < <

   Submissions due:       September 21, 2001
   Notification date:     October 12, 2001
   Completed papers due:  November 2, 2001
                          (in electronic form and camera-ready paper form)

   * * * * *

The Internet and the World Wide Web continue to change the shape of
computing.  The goal of network computing and understandable text
access across wide, diverse groups of people has brought great
momentum to computing environments that build Unicode into their
foundation.
Whether it's Internet commerce, network access to data, or highly portable applications, Unicode makes a solid foundation for the network, global enterprises, and software users everywhere. The Twentieth International Unicode Conference (IUC20) will address topics ranging from Unicode use in the World Wide Web and in operating systems and databases, to the latest developments with Unicode 3.1, Java, Open Source, XML and Web protocols.

Conference attendees will include managers, software engineers, systems analysts, and product marketing personnel responsible for the development of software supporting Unicode, as well as those involved in all aspects of the globalization of software and the Internet.

THEME & TOPICS

Computing with Unicode is the overall theme of the Conference. Presentations should be geared towards a technical audience. Suggested topics of interest include, but are not limited to:

- Internationalization features of portable devices
- Implementing new features of Unicode Version 3.1
- Unicode normalization, collation
- Programming Languages and Libraries (Java, Perl, et al)
- The World Wide Web (WWW) and Unicode
- Character set issues
- Web search engines and Unicode
- Library and archival concerns
- Unicode in operating systems
- Unicode in databases
- Unicode in large scale networks
- Unicode in government applications
- The results of using Unicode applications (case studies, solutions)
- Language processing issues with Unicode data
- Migrating legacy applications to Unicode
- Cross platform issues
- Printing and imaging
- Optimizing performance of Unicode systems and applications
- Testing Unicode applications
- Usability evaluations of Unicode applications
- Internationalization and Localization

SESSIONS

The Conference Program will provide a wide range of sessions including:

- Keynote presentations
- Workshops/Tutorials
- Technical presentations
- Panel sessions

All sessions except the Workshops/Tutorials will be of 40 minute duration.
In some cases, two consecutive 40 minute program slots may be devoted to a single session. The Workshops/Tutorials will each last approximately three hours. They should be designed to stimulate discussion and participation, using slides and demonstrations.

PUBLICITY

If your paper is accepted, your details will be included in the Conference brochure and Web pages, and the paper itself will appear on a Conference CD, with an optional printed book of Conference Proceedings.

CONFERENCE LANGUAGE

The Conference language is English. All submissions, papers and presentations should be provided in English.

SUBMISSIONS

Submissions MUST contain:

1. An abstract of 150-250 words, consisting of a statement of purpose, paper description, and your conclusions or final summary.
2. A brief biography.
3. The details listed below:

SESSION TITLE:            _________________________________________
                          _________________________________________
TITLE (eg Dr/Mr/Mrs/Ms):  _________________________________________
NAME:                     _________________________________________
JOB TITLE:                _________________________________________
ORGANIZATION/AFFILIATION: _________________________________________
ORGANIZATION'S WWW URL:   _________________________________________
OWN WWW URL:              _________________________________________
ADDRESS FOR PAPER MAIL:   _________________________________________
                          _________________________________________
                          _________________________________________
TELEPHONE:                _________________________________________
FAX:                      _________________________________________
E-MAIL ADDRESS:           _________________________________________

TYPE OF SESSION:
[ ] Keynote presentation
[ ] Workshop/Tutorial
[ ] Technical presentation
[ ] Panel

PANELISTS (if Panel):     _________________________________________
                          _________________________________________
                          _________________________________________
                          _________________________________________
                          _________________________________________
                          _________________________________________
                          _________________________________________
                          _________________________________________

TARGET AUDIENCE (you may select more than one category):
[ ] Managers
[ ] Software Engineers
[ ] Systems Analysts
[ ] Marketers
[ ] Other: ______________________________

LEVEL OF SESSION (you may select more than one category):
[ ] Beginner
[ ] Intermediate
[ ] Advanced

Submissions should be sent by e-mail to either of the following addresses:

papers@unicode.org
info@global-conference.com

They should use ASCII, non-compressed text and the following subject line: Proposal for IUC 20

If desired, a copy of the submission may also be sent by post to:

Twentieth International Unicode Conference
c/o Global Meeting Services, Inc.
4360 Benhurst Avenue
San Diego, CA 92122 USA
Tel: +1 858 638 0206
Fax: +1 858 638 0504

CONFERENCE PROCEEDINGS

All Conference papers will be published on CD. Printed proceedings will be offered as an option.

EXHIBIT OPPORTUNITIES

The Conference will have an Exhibition area for corporations or individuals who wish to display and promote their products, technology and/or services. Every effort will be made to provide maximum exposure and advertising. Exhibit space is limited. For further information or to reserve a place, please contact Global Meeting Services at the above location.

CONFERENCE VENUE

Omni Shoreham Hotel
2500 Calvert Street, NW
Washington, DC 20008 USA
Tel: +1 202 234 0700
Fax: +1 202 265 7972

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991. It is dedicated to the development, maintenance and promotion of The Unicode Standard, a worldwide character encoding. The Unicode Standard encodes the characters of the world's principal scripts and languages, and is code-for-code identical to the international standard ISO/IEC 10646. In addition to cooperating with ISO on the future development of ISO/IEC 10646, the Consortium is responsible for providing character properties and algorithms for use in implementations.
Today the membership base of the Unicode Consortium includes major computer corporations, software producers, database vendors, research institutions, international agencies and various user groups. For further information on the Unicode Standard, visit the Unicode Web site at http://www.unicode.org or e-mail

* * * * *

Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

-----------------------------------------------------------------
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Reuters Ltd.

From guido@digicool.com Fri Jul 13 16:04:23 2001
From: guido@digicool.com (Guido van Rossum)
Date: Fri, 13 Jul 2001 11:04:23 -0400
Subject: [I18n-sig] Call for Papers - 20th Unicode Conference - Jan/Feb 2001 - Washington DC
In-Reply-To: Your message of "Fri, 13 Jul 2001 14:11:32 BST."
References:
Message-ID: <200107131504.f6DF4NK16532@odiug.digicool.com>

> Twentieth International Unicode Conference (IUC20)
> Unicode and the Web: The Global Connection
> http://www.unicode.org/iuc/iuc20
> January 28 - February 1, 2002
> Washington, DC, USA

If you go to this conference, you can combine it with the 10th Python conference, which will be the next week in Alexandria (a suburb of Washington). (The new conference date and location will be officially announced at the O'Reilly conference in San Diego later this month.)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From barry@zope.com Fri Jul 27 06:32:43 2001
From: barry@zope.com (Barry A. Warsaw)
Date: Fri, 27 Jul 2001 01:32:43 -0400
Subject: [I18n-sig] pygettext dilemma
Message-ID: <15200.64763.772001.53387@anthem.wooz.org>

I've got a bit of a dilemma about the right way to generate a pot file, specifically for Mailman. Because this involves docstrings, I don't think the normal gettext tools have to deal with this.

In Mailman, I've got a bunch of normal .py modules and a bunch of command line scripts. The modules have their translatable strings nicely marked with _() and only those strings should be extracted. The scripts however should have both _() and docstrings extracted, since the module docstrings include usage text. pygettext.py has a -D (--docstrings) flag that signals the program to extract docstrings even though they aren't _() marked. So far so good.
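Barry's distinction between the two classes of strings can be illustrated with a short sketch. This is hypothetical code using the stdlib ast module, not pygettext's actual tokenize-based extractor, and the SOURCE module is an invented example:

```python
import ast

# A toy module: one _()-marked string, plus a module and a function docstring.
SOURCE = '''
"""Usage: prog [options] -- a module docstring carrying usage text."""

def greet():
    """An unmarked docstring."""
    return _("Hello, world!")
'''

def extract(source, docstrings=False):
    """Collect translatable strings: _() calls, plus docstrings if requested."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        # _()-marked string literals are always extracted
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "_"
                and node.args
                and isinstance(node.args[0], ast.Constant)):
            found.append(node.args[0].value)
    if docstrings:
        # -D-style behaviour: module/class/function docstrings are added too
        targets = [tree] + [n for n in ast.walk(tree)
                            if isinstance(n, (ast.FunctionDef, ast.ClassDef))]
        for node in targets:
            doc = ast.get_docstring(node)
            if doc:
                found.append(doc)
    return found

print(extract(SOURCE))                   # just the _()-marked string
print(extract(SOURCE, docstrings=True))  # adds the two docstrings
```

The dilemma below is exactly that the second mode is all-or-nothing per run: there is no way to enable docstring extraction for some input files and not others.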
But the problem is that my translators definitely do not want the normal .py modules' docstrings extracted, because it is difficult for them to figure out which docstrings to translate and which to ignore.

I tried to extract the two classes of files in two separate pygettext.py steps, but had trouble merging the resulting files. You can't merge them with msgmerge because that program seems to just drop all the entries from the second file (I'm guessing since there's no overlap between the first and second files). So next I tried just cat'ing the two files together, but this generates fatal exceptions in msgmerge for duplicate entries. One of the duplicates is the pot header, so I was going to add a switch to suppress that, but then realized that there'd be other duplicates anyway.

What I /think/ I want now is to be able to tell pygettext.py exactly which files to extract docstrings from and which to only extract marked strings from, and then do the extraction in one fell swoop. I propose to include a -X flag like so:

    -X filename
    --no-docstrings=filename
        Specify a file that contains a list of files that should not have
        their docstrings extracted. This is only useful in conjunction with
        the -D option above.

So with this I'd hand pygettext.py the entire list of files that it should do extraction on, include the -D option, and then include the -X option with the normal module .py's listed in an exclude-file.

Does anybody have any suggestions or better ideas?
-Barry

From keichwa@gmx.net Fri Jul 27 17:38:24 2001
From: keichwa@gmx.net (Karl Eichwalder)
Date: 27 Jul 2001 18:38:24 +0200
Subject: [I18n-sig] pygettext dilemma
In-Reply-To: <15200.64763.772001.53387@anthem.wooz.org>
References: <15200.64763.772001.53387@anthem.wooz.org>
Message-ID:

barry@zope.com (Barry A.
Warsaw) writes:

> You can't merge them with msgmerge because that program seems to just
> drop all the entries from the second file (I'm guessing since there's
> no overlap between the first and second files).

Consider using msgcomm for this job ;) Beware, all versions up to 0.10.39 are "limited" (the option --unique is broken); it's best to go for the CVS version (HEAD). You can check it out from

    :pserver:anoncvs@sourceware.cygnus.com:/cvs/gettext

Password is "anoncvs" (IIRC). Info is available somewhere on the cygnus site.

There's also msgcat; the main difference: with msgcomm the first occurrence of a message wins; msgcat concatenates, and the user has to decide which translations to keep.

> What I /think/ I want now is to be able to tell pygettext.py exactly
> which files to extract docstrings from and which to only extract
> marked strings from, and then do the extraction in one fell swoop.
>
> I propose to include a -X flag like so:
>
> -X filename
> --no-docstrings=filename
> Specify a file that contains a list of files that should not have
> their docstrings extracted. This is only useful in conjunction with
> the -D option above.

Using the combo msggrep/msgcomm you can "throw away" unwanted messages quite easily; maybe this approach will help. msgcat, msggrep, msgconv and msgexec are new tools recently developed by Bruno Haible.

-- 
ke@suse.de (work) / keichwa@gmx.net (home):  | http://www.suse.de/~ke/
                                             |    ,__o
Free Translation Project:                    |  _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/ | (*)/'(*)

From barry@zope.com Fri Jul 27 17:52:30 2001
From: barry@zope.com (Barry A.
Warsaw)
Date: Fri, 27 Jul 2001 12:52:30 -0400
Subject: [I18n-sig] pygettext dilemma
References: <15200.64763.772001.53387@anthem.wooz.org>
Message-ID: <15201.40014.142429.215469@anthem.wooz.org>

>>>>> "KE" == Karl Eichwalder writes:

KE> Consider using msgcomm for this job ;) Beware, all versions up
KE> to 0.10.39 are "limited" (the option --unique is broken); it's
KE> best to go for the CVS version (HEAD). You can check it
KE> out from

Cool, thanks for the pointer, I'll definitely check it out. Looks like my system's got an old msgcomm, so I'll suck down the cvs and install that.

Turns out that the -X option on pygettext.py works well enough, even if it is a bit of a hack. I just committed it to Python's cvs. :)

Thanks,
-Barry

From haible@ilog.fr Fri Jul 27 18:38:52 2001
From: haible@ilog.fr (Bruno Haible)
Date: Fri, 27 Jul 2001 19:38:52 +0200 (CEST)
Subject: [I18n-sig] pygettext dilemma
Message-ID: <15201.42796.108513.382321@honolulu.ilog.fr>

Barry A. Warsaw wrote:

> I tried to extract the two classes of files in two separate
> pygettext.py steps

That's most reasonable. It allows you to use different xgettext/pygettext arguments for the two sets of files.

> but had trouble merging the resulting files. You
> can't merge them with msgmerge because that program seems to just drop
> all the entries from the second file (I'm guessing since there's no
> overlap between the first and second files).

msgcomm is not really made for this task. gettext-0.11 will contain an 'msgcat' command, which works well for these cases. In the meantime, I can recommend 'cat'ing the two pot files and running 'msguniq' on the result. 'msguniq' will also be in gettext-0.11, but here is an equivalent implementation in a Python-like language ().

Bruno

============================ msguniq =============================

#!/usr/local/bin/clisp -C
;;; Remove duplicates in message catalogs.
;;; Bruno Haible 28.3.1997

;; This could roughly be implemented as
;;   cp INPUT temp1
;;   cp INPUT temp2
;;   msgcomm --more-than=1 -w 1000 -o OUTPUT temp1 temp2
;; but this has the drawback that
;; - msgcomm doesn't seem to be made for this.
;; This could also be roughly implemented as
;;   xgettext -d - --omit-header -w 1000 INPUT > OUTPUT
;; but this has the drawbacks that
;; - it sometimes reverses the list of lines belonging to the hunk,
;; - it removes the header.
;; When gettext-0.11 is released, this could also be implemented as
;;   msguniq INPUT -w 1000 -o OUTPUT
;; without any drawbacks!
;; Additionally, message translations in OLD override the ones in INPUT.

(defstruct message
  lines  ; list of all lines belonging to the hunk
  msgid  ; nil or a string
  msgstr ; nil or a string
  occurs ; list of strings "file:nn" where the message occurs
)

(defun main (infilename outfilename &optional oldfilename)
  (declare (type string infilename outfilename))
  #+UNICODE (setq *default-file-encoding* charset:iso-8859-1)
  (let ((hunk-list nil) ; list of all hunks
        (hunk-table (make-hash-table :test #'equal))
          ; (gethash msgid hunk-table) is the hunk that has the given msgid
        (eof "EOF")
       )
    (flet ((read-hunk (istream) ; reads a hunk, returns nil on eof
             (let ((line nil) (lines nil) (occurs nil))
               (loop
                 (setq line (read-line istream nil eof))
                 (when (eql line eof) (return))
                 (if (equal line "")
                   (when lines (return))
                   (progn
                     (push line lines)
                     (when (and (>= (length line) 3)
                                (string= line "#: " :end1 3))
                       (push (subseq line 3) occurs) ) ) ) )
               (when lines
                 (setq lines (nreverse lines))
                 (setq occurs (nreverse occurs))
                 (flet ((line-group (id &aux (idlen (length id)))
                          (let ((l (member-if
                                     #'(lambda (line)
                                         (and (>= (length line) idlen)
                                              (string= line id :end1 idlen) ) )
                                     lines )) )
                            (when l
                              (setq l (cons (subseq (car l) idlen) (cdr l)))
                              (let ((i (position-if-not
                                         #'(lambda (line)
                                             (and (plusp (length line))
                                                  (eql (char line 0) #\") ) )
                                         l )) )
                                (subseq l 0 i) )) ) ) )
                   (let ((msgid (line-group "msgid "))
                         (msgstr (line-group "msgstr ")))
                     (make-message :lines lines :msgid msgid
                                   :msgstr msgstr :occurs occurs ) ) ) ) )) )
      (with-open-file (istream infilename :direction :input)
        (loop
          (let ((hunk (read-hunk istream)))
            (unless hunk (return))
            (if (null (message-msgid hunk))
              (push hunk hunk-list)
              (let ((other-hunk (gethash (message-msgid hunk) hunk-table)))
                (if (not other-hunk)
                  (progn
                    (push hunk hunk-list)
                    (setf (gethash (message-msgid hunk) hunk-table) hunk) )
                  (progn
                    (unless (equal (message-msgstr hunk)
                                   (message-msgstr other-hunk) )
                      (warn "Same message, different translations: ~A and ~A"
                            (message-occurs hunk) (message-occurs other-hunk) ) )
                    (setf (message-occurs other-hunk)
                          (append (message-occurs other-hunk)
                                  (message-occurs hunk) ) ) ) ) ) ) ) )
        (setq hunk-list (nreverse hunk-list)) )
      (when oldfilename
        (with-open-file (istream oldfilename :direction :input)
          (loop
            (let ((hunk (read-hunk istream)))
              (unless hunk (return))
              (unless (null (message-msgid hunk))
                (let ((other-hunk (gethash (message-msgid hunk) hunk-table)))
                  (when other-hunk
                    (setf (message-msgstr other-hunk) (message-msgstr hunk)) ) ) ) ) ) ) )
      (with-open-file (ostream outfilename :direction :output)
        (flet ((print-hunk (hunklistr)
                 (let* ((hunk (car hunklistr))
                        (lines (message-lines hunk))
                        (msgid (message-msgid hunk))
                        (msgstr (message-msgstr hunk))
                        (occurs (message-occurs hunk)))
                   (dolist (line lines)
                     (cond ((and (>= (length line) 3) (string= line "#: " :end1 3))
                            (when occurs
                              (format ostream "#: ~{~A~^ ~}~%" occurs)
                              (setq occurs nil) ))
                           ((and (>= (length line) 1) (string= line "#" :end1 1))
                            (format ostream "~A~%" line) )
                           ((and (>= (length line) 6) (string= line "msgid " :end1 6))
                            (format ostream "msgid ~{~A~%~}" msgid) )
                           ((and (>= (length line) 7) (string= line "msgstr " :end1 7))
                            (format ostream "msgstr ~{~A~%~}" msgstr) ) ) )
                   (when (cdr hunklistr) (format ostream "~%")) )) )
          (mapl #'print-hunk hunk-list) ) ) ) )

(main (first *args*) (second *args*) (third *args*))
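The cat-then-deduplicate idea behind Bruno's CLISP script can also be sketched in Python. This is a minimal sketch under simplifying assumptions (hunks separated by blank lines, single-line msgid/msgstr, no OLD-file override), not the real msguniq: it keeps the first hunk per msgid, merges the "#: " occurrence comments, and warns when two hunks carry different translations.

```python
import sys

def parse_hunks(text):
    """Split a .po/.pot file into hunks: blocks separated by blank lines."""
    hunks = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        if not lines:
            continue
        hunks.append({
            "lines": lines,
            "msgid": next((l for l in lines if l.startswith("msgid ")), None),
            "msgstr": next((l for l in lines if l.startswith("msgstr ")), None),
            "occurs": [l[3:] for l in lines if l.startswith("#: ")],
        })
    return hunks

def msguniq(text):
    """Keep the first hunk per msgid, merging '#: ' occurrence comments."""
    kept, by_id = [], {}
    for hunk in parse_hunks(text):
        first = by_id.get(hunk["msgid"]) if hunk["msgid"] else None
        if first is None:
            kept.append(hunk)
            if hunk["msgid"]:
                by_id[hunk["msgid"]] = hunk
        else:
            if hunk["msgstr"] != first["msgstr"]:
                print("warning: same msgid, different translations:",
                      first["occurs"], "and", hunk["occurs"], file=sys.stderr)
            first["occurs"].extend(hunk["occurs"])
    blocks = []
    for hunk in kept:
        lines, emitted = [], False
        for line in hunk["lines"]:
            if line.startswith("#: "):
                if not emitted:  # merged occurrences, printed once
                    lines.append("#: " + " ".join(hunk["occurs"]))
                    emitted = True
            else:
                lines.append(line)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"

# Simulate `cat a.pot b.pot | msguniq`: two extractions share one msgid.
a = '#: foo.py:1\nmsgid "hello"\nmsgstr ""\n'
b = '#: bar.py:9\nmsgid "hello"\nmsgstr ""\n'
print(msguniq(a + "\n" + b))
```

Like the Lisp version, this prints the merged occurrence comment in place of the first "#: " line of the surviving hunk and suppresses the rest; unlike the real tools, it does nothing special for the pot header or multi-line strings.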