From pf@artcom-gmbh.de  Thu Feb  3 22:01:03 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Thu, 3 Feb 2000 23:01:03 +0100 (MET)
Subject: [I18n-sig] Useful resources available now
Message-ID: 

This is a short list of some resources available now, which may help
people to i18n their Python applications:

* fintl.py -- A pure Python module for reading .mo files created by msgfmt
  URL    : 
  Author : Peter Funk  <-- That's me ;-)
  License: Pythonic
  DOC    : inline, no .tex available yet

* intl.so -- An interface to the (GNU) gettext C library.  Only useful
  for international applications if they are covered by the GPL and
  will run under Unix/Linux.
  URL    : 
  Author : Martin von Löwis
  License: GPL? (due to use of GNU gettext)
  DOC    : README available and an IPC article; no .tex yet

* GNU gettext -- Suite of utilities for i18n
  URL    : http://www.gnu.org/software/gettext/gettext.html
  License: GPL

* pygettext.py -- Barry Warsaw's reimplementation of gettext in pure Python
  URL    : ?  Oops, can't find it at the moment.  Look on Barry's home page.
  Author : Barry Warsaw
  License: Pythonic
  DOC    : inline doc strings

Further reading:
http://www.python.org/workshops/1997-10/proceedings/loewis.html

Regards from Germany, Peter
-- 
Peter Funk, Oldenburger Str.86, 27777 Ganderkesee, Tel: 04222 9502 70, Fax: -60

From guido@python.org  Thu Feb  3 22:31:51 2000
From: guido@python.org (Guido van Rossum)
Date: Thu, 03 Feb 2000 17:31:51 -0500
Subject: [I18n-sig] Useful resources available now
In-Reply-To: Your message of "Thu, 03 Feb 2000 23:01:03 +0100."
References: 
Message-ID: <200002032231.RAA00550@eric.cnri.reston.va.us>

Great list.  Could someone turn it into HTML for easy pasting into the
sig's home page?
--Guido van Rossum (home page: http://www.python.org/~guido/)

From pf@artcom-gmbh.de  Thu Feb  3 23:13:02 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Fri, 4 Feb 2000 00:13:02 +0100 (MET)
Subject: [I18n-sig] Useful resources available now
In-Reply-To: <200002032231.RAA00550@eric.cnri.reston.va.us> from Guido van Rossum at "Feb 3, 2000 5:31:51 pm"
Message-ID: 

[Guido]:
> Great list.  Could someone turn it into HTML for easy pasting into the
> sig's home page?

I think my list is still rather incomplete.  There was a thread on
comp.lang.python last year, where François Pinard (spelling?)
announced something.  Unfortunately I don't remember right now and
can't find it. :-(

Perhaps we should wait some days.  At least until the list maintainer
has had a chance to subscribe to this i18n-sig. ;-)

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, 27777 Ganderkesee, Tel: 04222 9502 70, Fax: -60

From andy@robanal.demon.co.uk  Mon Feb  7 21:25:39 2000
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Mon, 07 Feb 2000 21:25:39 GMT
Subject: [I18n-sig] SIG charter and goals
Message-ID: <38a03848.685217@post.demon.co.uk>

Apologies to everyone for taking so long to get started; I have been
on the road between IPC8 and today, and hampered by a broken-down
laptop.

I'd like to confirm with everyone what we discussed at IPC8, and try
to outline what I see as the SIG's charter.  If this is agreed, I will
put it up on a SIG home page this week.

I propose that the SIG's deliverables are:

1. Support addition of Unicode to the Python core for 1.6:
------------------------------------------------------------------
The key tasks are to add Unicode string support to the Python core
(MAL), and add a new Unicode regex engine (Fredrik).  These are both
well underway.  This group should assist with testing, and be the
primary forum for feedback on those features.

2.
Encodings API and library:
--------------------------------
We must deliver an encodings library which surpasses the features of
that in Java.  It should allow conversion between many common
encodings; access to Unicode character properties; and anything else
which makes encoding conversion more pleasant.  This should be
initially based on MAL's draft specification, although the spec may be
changed if we find good reason to.

There will be an inevitable initial focus on Japanese support due to
the key people involved.  However, if we can do that well, then other
encodings should be less of a problem.

3. Locales:
--------------
Implement a candidate module for the standard library offering support
for the world's date, time, money and number formats, and for time
zones.

4. Application Localization:
-----------------------------------
This group is the intended focal point for frameworks for localizing
both conventional applications and Python-powered web sites.  This
field is very large and varied, and we set no targets for delivering
'a solution'; however, we hope to generate discussion, how-tos and
references to examples of good and bad practice in this area.

5. Internationalizing Pythonwin and IDLE
-----------------------------------------------------
There are some current bugs/features in these environments which
seriously hamper use in double-byte environments.  We should try to
get these stamped out.

Opinions, anyone?  Have I missed any major topics?  Are there any best
left out of the SIG's charter?

- Andy

From alex@ank-sia.com  Tue Feb  8 13:47:07 2000
From: alex@ank-sia.com (alexander smishlajev)
Date: Tue, 08 Feb 2000 15:47:07 +0200
Subject: [I18n-sig] Useful resources available now
Message-ID: <38A01E5B.C8FFE9DE@turnhere.com>

hello Peter!

thanks for the list! i've seen all of this, but it is nice to have a
common summary. some additions:

* unicode.so -- A unicode string implementation for Python 1.5.
  URL    : http://www.pythonware.com/madscientist/
  Author : Fredrik Lundh
  License: Pythonic

* PyRecode -- wrapper around the librecode/Recode utility (a GPL tool
  for converting text from one character set to another).
  URL    : http://www.suxers.de/python/pyrecode.htm
  Author : Andreas Jung

* pynicode -- Unicode support and character set translation for Python.
  pure Python module suite started as a translation of Perl Unicode
  modules.
  URL    : http://sourceforge.net/project/?group_id=1825
  Author : alexander smishlajev
  License: MIT-style
  DOC    : inline

best wishes, alex.

From pf@artcom-gmbh.de  Tue Feb  8 13:52:51 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Tue, 8 Feb 2000 14:52:51 +0100 (MET)
Subject: [I18n-sig] Useful resources available now
In-Reply-To: <38A01E5B.C8FFE9DE@turnhere.com> from alexander smishlajev at "Feb 8, 2000 3:47: 7 pm"
Message-ID: 

Hi alex!

> thanks for the list! i've seen all of this, but it is nice to have
> common summary. some additions:
[...]

I will add them to my list and convert it into HTML later this week.
Thank you for your additions.  I was not aware of them.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, 27777 Ganderkesee, Tel: 04222 9502 70, Fax: -60

From mal@lemburg.com  Tue Feb  8 14:11:06 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 08 Feb 2000 15:11:06 +0100
Subject: [I18n-sig] Useful resources available now
References: <38A01E5B.C8FFE9DE@turnhere.com>
Message-ID: <38A023FA.F855BA17@lemburg.com>

alexander smishlajev wrote:
> 
> hello Peter!
> 
> thanks for the list! i've seen all of this, but it is nice to have
> common summary. some additions:
> 
> * unicode.so -- A unicode string implementation for Python 1.5.
>   URL    : http://www.pythonware.com/madscientist/
>   Author : Fredrik Lundh
>   License: Pythonic
> 
> * PyRecode -- wrapper around the librecode/Recode utility (a GPL tool
>   for converting text from one character set to another).
>   URL    : http://www.suxers.de/python/pyrecode.htm
>   Author : Andreas Jung
> 
> * pynicode -- Unicode support and character set translation for Python.
>   pure Python module suite started as a translation of Perl Unicode
>   modules.
>   URL    : http://sourceforge.net/project/?group_id=1825
>   Author : alexander smishlajev
>   License: MIT-style
>   DOC    : inline

FYI, Python 1.6 will have native Unicode support.  There's no need to
duplicate work in that area... better wait until the first versions
ship and then build on top of the existing implementation, IMHO
anyways ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:     http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com  Tue Feb  8 14:31:43 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 08 Feb 2000 15:31:43 +0100
Subject: [I18n-sig] SIG charter and goals
References: <38a03848.685217@post.demon.co.uk>
Message-ID: <38A028CF.8188427D@lemburg.com>

Andy Robinson wrote:
> 
> Apologies to everyone for taking so long to get started; I have been
> on the road between IPC8 and today, and hampered by a broken-down
> laptop.
> 
> I'd like to confirm with everyone what we discussed at IPC8, and try
> to outline what I see as the SIG's charter.  If this is agreed, I will
> put it up on a SIG home page this week.
> 
> I propose that the SIG's deliverables are:
> 
> 1. Support addition of Unicode to the Python core for 1.6:
> ------------------------------------------------------------------
> The key tasks are to add Unicode string support to the Python core
> (MAL), and add a new Unicode regex engine (Fredrik).  These are both
> well underway.  This group should assist with testing, and be the
> primary forum for feedback on those features.

FYI, the Unicode stuff will go into the public CVS version sometime in
March.

> 2.
> Encodings API and library:
> --------------------------------
> 
> We must deliver an encodings library which surpasses the features of
> that in Java.  It should allow conversion between many common
> encodings; access to Unicode character properties; and anything else
> which makes encoding conversion more pleasant.  This should be
> initially based on MAL's draft specification, although the spec may
> be changed if we find good reason to.

Note that Python will have built-in codec support.  The details are
described in the proposal paper (not the C API though -- that still
lives in the .h files of the Unicode implementation).

Note that I have had good experiences with the existing spec: it is
very flexible, extensible and versatile.  It also greatly reduces
coding effort by providing working base classes.

> There will be an inevitable initial focus on Japanese support due to
> the key people involved.  However, if we can do that well, then other
> encodings should be less of a problem.
> 
> 3. Locales:
> --------------
> Implement a candidate module for the standard library offering support
> for the world's date, time, money and number formats, and for time
> zones.

Hmm, I'd suggest leaving this out of the core and providing it through
third-party extensions which are then shipped by some Python
distribution party.

> 4. Application Localization:
> -----------------------------------
> This group is the intended focal point for frameworks for localizing
> both conventional applications and Python-powered web sites.  This
> field is very large and varied, and we set no targets for delivering
> 'a solution'; however, we hope to generate discussion, how-tos and
> references to examples of good and bad practice in this area.
> 
> 5. Internationalizing Pythonwin and IDLE
> -----------------------------------------------------
> There are some current bugs/features in these environments which
> seriously hamper use in double-byte environments.
> We should try to get these stamped out.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:     http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From alex@ank-sia.com  Tue Feb  8 16:45:20 2000
From: alex@ank-sia.com (alexander smishlajev)
Date: Tue, 08 Feb 2000 18:45:20 +0200
Subject: [I18n-sig] Useful resources available now
References: <38A01E5B.C8FFE9DE@turnhere.com> <38A023FA.F855BA17@lemburg.com>
Message-ID: <38A04820.A14681EC@turnhere.com>

"M.-A. Lemburg" wrote:
> 
> FYI, Python 1.6 will have native Unicode support.

yes.  unfortunately, i did not know about this at the time of
publishing pynicode.  now i see that i was reinventing the same things
that are listed in your proposal at
http://starship.skyport.net/~lemburg/unicode-proposal.txt
sorry for that.

by the way, don't you think that standard codecs should include _all_
iso8859 encodings?  MS Windows codepages?

> no need to duplicate work in that area... better wait until
> the first versions ship and then build on top of the
> existing implementation, IMHO anyways ;-)

i think that it would be nice to have a compatible (maybe less
functional) stand-alone module as a temporary solution until Python
1.6 is released.  as far as i remember, about half of that resource
list was published within the last half year.  today i have met
another one:
http://starship.python.net/crew/gherman/playground/calie/calie.py
IMHO such frequency of different modules appearing testifies that
charset conversion is badly needed, as soon as possible.

best wishes, alex.

From herzog@online.de  Tue Feb  8 18:00:58 2000
From: herzog@online.de (Bernhard Herzog)
Date: 08 Feb 2000 19:00:58 +0100
Subject: [I18n-sig] Useful resources available now
References: 
Message-ID: 

pf@artcom-gmbh.de (Peter Funk) writes:

> > thanks for the list! i've seen all of this, but it is nice to have
> > common summary. some additions:
> [...]
> 
> I will add them to my list and convert it into HTML later this
> week. Thank you for your additions. I was not aware of them.

François Pinard's po-utils haven't been mentioned yet, I think:
http://www.iro.umontreal.ca/contrib/po/po-utils/

They contain xpot, a replacement for xgettext that understands Python
syntax, and the po-mode for Emacs.

-- 
Bernhard Herzog   | Sketch, a drawing program for Unix
herzog@online.de  | http://sketch.sourceforge.net/

From mal@lemburg.com  Tue Feb  8 18:18:26 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 08 Feb 2000 19:18:26 +0100
Subject: [I18n-sig] Useful resources available now
References: <38A01E5B.C8FFE9DE@turnhere.com> <38A023FA.F855BA17@lemburg.com> <38A04820.A14681EC@turnhere.com>
Message-ID: <38A05DF2.1236DE1D@lemburg.com>

alexander smishlajev wrote:
> 
> "M.-A. Lemburg" wrote:
> > 
> > FYI, Python 1.6 will have native Unicode support.
> 
> yes. unfortunately, i did not know about this at the time of publishing
> pynicode. now i see that i was reinventing the same things that are
> listed in your proposal at
> http://starship.skyport.net/~lemburg/unicode-proposal.txt
> sorry for that.
> 
> by the way, don't you think that standard codecs should include _all_
> iso8859 encodings? MS Windows codepages?

Sure, but not in the core.  I have converted all mapping tables at
http://www.unicode.org to dictionary tables usable by Python.  Turns
out that this produces 4MB of static data... as a result I want to
include a generic mapping table codec which can use these tables and
then make the mapping tables downloadable separately.

> > no need to duplicate work in that area... better wait until
> > the first versions ship and then build on top of the
> > existing implementation, IMHO anyways ;-)
> 
> i think that it would be nice to have a compatible (maybe less
> functional) stand-alone module as a temporary solution until Python 1.6
> is released. as far as i remember, about a half of that resource list
> was published within last half of a year. today i have met another one:
> http://starship.python.net/crew/gherman/playground/calie/calie.py IMHO
> such frequency of different modules appearing testifies that charset
> conversion is badly needed, as soon as possible.

Hey, it's only a few more weeks until the CVS tree has the code
publicly available for everyone to download and test :-)

[If you can't wait, have your company join the Python Consortium to
get early access.
The more companies join, the faster Python will move towards full
business awareness.]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:     http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From andy@robanal.demon.co.uk  Wed Feb  9 02:10:42 2000
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Wed, 09 Feb 2000 02:10:42 GMT
Subject: [I18n-sig] Locales module
In-Reply-To: <38A028CF.8188427D@lemburg.com>
References: <38a03848.685217@post.demon.co.uk> <38A028CF.8188427D@lemburg.com>
Message-ID: <38a4c7cc.9705336@post.demon.co.uk>

On Tue, 08 Feb 2000 15:31:43 +0100, you wrote:

>> 3. Locales:
>> --------------
>> Implement a candidate module for the standard library offering support
>> for the world's date, time, money and number formats, and for time
>> zones.
>
>Hmm, I'd suggest to leave this out of the core and provide it
>through third party extensions which are then shipped by some
>Python distribution party.

This is definitely not an issue for the language core, and I agree it
should start out as something separate.  There was some discussion of
it going into the standard library in due course, and Guido did not
say 'no'!

Would anyone like to take this on?  It isn't really my field.  I guess
we should start by reviewing what other systems do and how well they
work.

- Andy

From andy@robanal.demon.co.uk  Wed Feb  9 02:10:34 2000
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Wed, 09 Feb 2000 02:10:34 GMT
Subject: [I18n-sig] SIG charter and goals
In-Reply-To: <38A028CF.8188427D@lemburg.com>
References: <38a03848.685217@post.demon.co.uk> <38A028CF.8188427D@lemburg.com>
Message-ID: <38a6c8e8.9989175@post.demon.co.uk>

On Tue, 08 Feb 2000 15:31:43 +0100, you wrote:

>> 2. Encodings API and library:
>> --------------------------------
>>
>> We must deliver an encodings library which surpasses the features of
>> that in Java.
>> It should allow conversion between many common
>> encodings; access to Unicode character properties; and anything else
>> which makes encoding conversion more pleasant.  This should be
>> initially based on MAL's draft specification, although the spec may
>> be changed if we find good reason to.
>
>Note that Python will have a builtin codec support.  The details
>are described in the proposal paper (not the C API though --
>that still lives in the .h files of the Unicode implementation).
>
>Note that I have made some good experience with the existing
>spec: it is very flexible, extendable and versatile.  It also
>greatly reduces coding efforts by providing working baseclasses.
>

I can't wait to try the code, and cannot foresee any problems at the
moment based on the spec.  However, it was only discussed on the
Python-dev list, and Marc-Andre was not at IPC8, so I should try to
explain some background for everyone (and what my agenda as SIG
moderator is too!)

1. HP joined the Python consortium and pushed for Unicode support last
year.  There was a detailed discussion on the Python-dev list (to
which I was invited because my day-job included some very messy
double-byte work in Python for a year).  Marc-Andre's proposal went
through about eight iterations, and he started to code it up under
contract to CNRI.  This is official work, and there is no question of
anybody else's Unicode modules being used - sorry!  Fredrik Lundh's
work on the Unicode regex engine is also under contract and
progressing rapidly.

2. MAL's document defines the API for 'codecs' - conversion filters -
but his task does not include delivering a package with all the
world's common encodings in it.  That is a necessity in the long run,
and both I (through ReportLab) and Digital Garage need to make at
least the Japanese encodings work quite soon.

(Marc-Andre, can you update us on what codecs you are providing, and
how they are implemented?  C or Python?)

3.
At IPC8 we discussed (among other things) the delivery of the codec
package - both in the i18n forum and in the corridors, as usual!  To
do what Java does, we eventually need codecs for 50+ common encodings,
all available and tested.  These will almost certainly not be in the
standard distribution, but there should eventually be a single,
certified, tested source for them, as this stuff has to be 100% right.
Quite a few of us urgently need good Japanese support.

The current spec does not say whether codecs should be in C or Python.
Guido expressed the hope that a few carefully chosen C routines could
allow us to write new filters in Python, but get most of the speed of
C - an idea I'd been drip-feeding to him for some time :-)  I think
that is a proper task for this group, and one I hope to put a lot of
work into.

I'm personally hoping that we can do a sort of mini-mxTextTools state
machine which has actions for lookups in single-byte mapping tables,
double-byte mapping tables and other things, so that new encodings can
be written and added easily, yet still run fast.  For example, all
single-byte encodings can be dealt with by a streaming version of
something like string.translate(), so adding a new one just becomes a
matter of adding a 256-element list to a file somewhere.  I believe
most of the double-byte ones can be reduced to a few kb with the right
functions as well.  I'll be ready to talk more about this shortly.

Guido also made it clear that while MAL's proposal is considered
pretty good, it is not set in stone yet.  In particular, if the
double-byte specialists find that some minor tweaks would make their
lives better, he would consider it; we need a real-world test-drive
before 1.6, and this group is the place to do it.

Now for my own opinions on how things should be run henceforth.  Feel
free to differ!  I should point out that the inner circle of Python
developers are NOT experts in multi-byte data.
I feel strongly that we should seek out the best expertise in the
world, starting now.  This discussion will not focus on Unicode string
implementation in the core, but on what our encoding library lets you
do at the application level.  Ken Lunde, author of "CJKV Information
Processing", is the acknowledged world leader in this field, and
agreed to take part in a discussion and review our proposals - I'll
try to bring him in shortly.  It would also be good to collar some
people involved in the Java i18n libraries and ask what they would do
differently next time around, and to talk to people who have worked
with commercial tools like Unilib and Rosette.  Then we won't just
hope that Python has the best i18n support, we'll know it.

Naturally this review needs to happen fairly promptly in March/April -
maybe best to wait until we can run the code.

I hope this helps a little.  If people have serious issues about where
things are heading, let's hear them now.

Best Regards,

Andy Robinson

p.s. one thing I would be very interested to hear is what people's
angles are - relevant experience, willingness to help out, needs for
solutions etc!

From andy@robanal.demon.co.uk  Wed Feb  9 02:10:40 2000
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Wed, 09 Feb 2000 02:10:40 GMT
Subject: [I18n-sig] Re: I18n
In-Reply-To: <38A07876.6EEA0655@equi4.com>
References: <38A07876.6EEA0655@equi4.com>
Message-ID: <38a8c913.10032615@post.demon.co.uk>

On Tue, 08 Feb 2000 21:11:37 +0100, Jean-Claude Wippler wrote:

>I have an unrelated question: on developer day, someone in your i18n
>session described why Unicode would not be acceptable in countries such
>as Japan.  I mentioned this to Cameron, who wants to know more.  But I
>lost the name/url of that person, can you help out?

Jean-Claude,

I have taken the liberty of forwarding this paragraph to the new
i18n-sig, which contains the people who made that remark!  Please join
in to hear more...
Here is a very naive oversimplification, without most of the real
world mess:

There is a standard Japanese character set (Japan Industrial Standard
0208, or JIS-0208 for short) with 6879 characters, which has been more
or less unchanged since 1978.  They are defined in a logical 94x94
space (the 'kuten table'), with some holes in it.  This character set
is commonly encoded in three different ways, all of which aim for
backward compatibility with ASCII:

1. Shift-JIS is the native encoding on Windows and the Mac, and for
about half of the Japanese HTML on the internet.  It basically says
'if the first byte you read is less than 128, it is ASCII; if it is
above 128 and between (various values), it is the first half of a
kanji'.  There is also a phonetic syllabary called "half-width
katakana" encoded in the top half of the code page.

2. EUC-JP (Extended Unix Code - Japan) is the encoding on Unix, and
the other half of the web pages on the Internet :-)  It does something
similar; less than 128 is ASCII, and higher values are usually the
first half of a kanji.

3. JIS is an older encoding designed for mail and news.  It uses shift
sequences to indicate switching from double-byte to single-byte mode
and vice versa.

All three do not contain null bytes or control characters, so most
8-bit-safe software works fine with data in these encodings - you
might not be able to see Japanese in your English word processor, but
it will be preserved intact.  All three are very widely used, and are
the de facto encodings we have to deal with.  (Those of us in the IBM
world also have to cope with the DBCS-Host encoding, which is a can of
worms I won't afflict you with.)

Because they all derive from the 'kuten table', there are neat
algorithmic conversions between them which run very fast and need no
lookup tables.  It is a very common requirement in Japanese IT to
convert between these - for example, to convert a directory of HTML
files from EUC to Shift-JIS.
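To make "neat algorithmic conversion" concrete, here is a minimal sketch in present-day Python (anachronistic for this 2000 thread, but runnable).  The function name is invented, and half-width katakana, JIS X 0212 and the vendor extensions discussed below are deliberately left out - this handles ASCII plus plain JIS-0208 kanji only:

```python
def euc_to_sjis(data: bytes) -> bytes:
    """Convert EUC-JP bytes to Shift-JIS by pure arithmetic on the
    kuten-derived code values (JIS X 0208 + ASCII only)."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                        # ASCII passes through unchanged
            out.append(b)
            i += 1
        elif 0xA1 <= b <= 0xFE:             # two-byte JIS X 0208 character
            j1 = b - 0x80                   # strip the high bit: raw JIS values
            j2 = data[i + 1] - 0x80
            # fold two JIS rows into one Shift-JIS lead byte
            s1 = (j1 + 1) // 2 + (0x70 if j1 <= 0x5E else 0xB0)
            if j1 % 2:                      # odd JIS row
                s2 = j2 + 0x1F + (1 if j2 >= 0x60 else 0)
            else:                           # even JIS row
                s2 = j2 + 0x7E
            out += bytes((s1, s2))
            i += 2
        else:                               # SS2/SS3 (katakana, JIS X 0212): not handled
            raise ValueError("unsupported EUC-JP byte 0x%02x" % b)
    return bytes(out)
```

The reverse direction is the same arithmetic inverted, which is why these conversions need no lookup tables at all.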
If such a neat routine exists to go directly, we don't want to have
the overhead of going through Unicode.

Imagine we had a few higher-level functions on top of our encodings
API, such as convertString(data, input_encoding, output_encoding).
The default behaviour of such a function would be to go through
Unicode as a central point.  All we need for Japan is to say that if a
filter exists on your system which can go direct from EUC-JP to
Shift-JIS, use it rather than going through Unicode.  I am sure we can
accommodate this; MAL's spec defines a good API, and I think what we
need is a higher level on top of it.

The real world is messier than I have indicated, and there are
actually many corporate variations on the JIS-0208 character set - IBM
and Microsoft add an extra 360 characters, NEC adds about 94, and
companies always define their own 'user-defined characters'.  This is
where Unicode breaks down badly.  These additions are in well-known
locations in the 'kuten table', but the mappings to Unicode are not
standard.  So if you need to go outside the strict JIS-0208 character
set, you cannot trust Unicode to work as a 'central point'.  That's
when the direct filters are needed.

As an example of this, I worked all last year on a project where we
used the Microsoft character set (360 characters bigger than JIS-0208)
plus a small set of user-defined characters, but it all broke when we
had to serve web pages through Java's encoding libraries, which will
not handle the extras.

As a more general point, the business requirements of someone working
in this field are usually to "move data from A to B", where A and B
are not Unicode.  Unicode is a very useful tool which can sit in the
middle most of the time, and Unicode character properties solve many
problems in the CJKV world - but not all of them.
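The convertString idea above fits in a few lines (again in present-day Python for runnability; the DIRECT_FILTERS registry and its key convention are invented for illustration and are not part of MAL's spec):

```python
# Hypothetical registry of direct converters, keyed by
# (input_encoding, output_encoding) name pairs.
DIRECT_FILTERS = {}

def convertString(data: bytes, input_encoding: str, output_encoding: str) -> bytes:
    """Re-encode `data`, preferring a registered direct filter and
    falling back to a round trip through Unicode."""
    direct = DIRECT_FILTERS.get((input_encoding, output_encoding))
    if direct is not None:
        return direct(data)     # e.g. a fast kuten-arithmetic routine
    # default path: Unicode as the central point
    return data.decode(input_encoding).encode(output_encoding)
```

A Japanese installation would register its fast EUC-JP-to-Shift-JIS routine under ("euc_jp", "shift_jis"); every other pair would silently take the Unicode route.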
There are also some common cleanup operations one can perform on
Japanese - equivalent to capitalisation, but messier - which can be
done either in Unicode with character properties, or directly.
Sometimes they have to be done directly.  That is why we poor
double-byte people want to be able to take a look at the API when it
comes out, and maybe add a tweak or two - hopefully in a separate
layer over the top - and the right convenience functions to make life
easier.

Confused yet?  I could go on...  I will try to write up some decent
background documents over the course of this month.

By the way, if anyone has similar issues with other locales, let's
hear them!

- Andy

From pf@artcom-gmbh.de  Wed Feb  9 07:29:42 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Wed, 9 Feb 2000 08:29:42 +0100 (MET)
Subject: [I18n-sig] Locales module
In-Reply-To: <38a4c7cc.9705336@post.demon.co.uk> from Andy Robinson at "Feb 9, 2000 2:10:42 am"
Message-ID: 

Hi!

Andy Robinson:
> >> 3. Locales:
> >> --------------
> >> Implement a candidate module for the standard library offering support
> >> for the world's date, time, money and number formats, and for time
> >> zones.
> >
> >Hmm, I'd suggest to leave this out of the core and provide it
> >through third party extensions which are then shipped by some
> >Python distribution party.
> 
> This is definitely not an issue for the language core, and I agree it
> should start out as something separate.  There was some discussion of
> it going into the standard library in due course, and Guido did not
> say 'no'!
> 
> Would anyone like to take this on?  It isn't really my field.  I guess
> we should start by reviewing what other systems do and how well they
> work.

Please excuse my ignorance if I've missed something.  But you surely
know about 'locale.py', which already comes included with Python 1.5.2
and works very well for me together with time.strftime.
(At least under several flavours of Unix/Linux; it's currently missing
from Jack's ready-to-run MacPython package.)  What additional
functionality should the upcoming module provide?

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, 27777 Ganderkesee, Tel: 04222 9502 70, Fax: -60

From pf@artcom-gmbh.de  Wed Feb  9 08:03:22 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Wed, 9 Feb 2000 09:03:22 +0100 (MET)
Subject: more questions about japanese (was Re: [I18n-sig] Re: I18n)
In-Reply-To: <38a8c913.10032615@post.demon.co.uk> from Andy Robinson at "Feb 9, 2000 2:10:40 am"
Message-ID: 

Hi!

[Andy Robinson]:
> There is a standard Japanese character set (Japan Industrial Standard
> 0208, or JIS-0208 for short) with 6879 characters, which has been
> more or less unchanged since 1978.  They are defined in a logical
> 94x94 space (the 'kuten table'), with some holes in it.  This
> character set is commonly encoded in three different ways, all of
> which aim for backward-compatibility with ASCII:
[...]
> All three do not contain null bytes or control characters, so most
> 8-bit-safe software works fine with data in these encodings - you
> might not be able to see Japanese in your English word processor, but
> it will be preserved intact.  All three are very widely used, and are
> the de facto encodings we have to deal with.
[...]
> Confused yet?  I could go on...  I will try to write up some decent
> background documents over the course of this month.

First, let me thank you for your insightful elaboration!  And please
excuse my ignorance again, but the i18n work I was involved with in
the past was easily handled within an 8-bit clean ISO-8859-1 character
space (German, English, French, Italian, Spanish).  So I have some
more questions about the above:

1. I guess word processors exist which are able to deal with text
files containing strings in one of the encodings described above,
right?
So, is it possible to submit a .pot file as produced by GNU xgettext to a Japanese translator, and would he/she be able to fill in the empty msgstr "" lines with Japanese messages? 2. If the resulting .mo files from 'msgfmt' are used with an i18n'ed Python application, these strings will go unchanged through several layers of software, just like normal ASCII strings. There are some Japanese fonts coming with XFree86 and Linux, but I've never had a look at them. Would it be possible to choose such a font, and will this show the desired output on an X server running on Unix/Linux? 3. What about MacOS and WinXX? I guess these systems will automatically show the right characters if, in step 1, the translator has used a word processor on the same platform? Regards, Peter -- Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260 office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen) From mal@lemburg.com Wed Feb 9 09:34:44 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 09 Feb 2000 10:34:44 +0100 Subject: [I18n-sig] SIG charter and goals References: <38a03848.685217@post.demon.co.uk> <38A028CF.8188427D@lemburg.com> <38a6c8e8.9989175@post.demon.co.uk> Message-ID: <38A134B4.674C919A@lemburg.com> Andy Robinson wrote: > > On Tue, 08 Feb 2000 15:31:43 +0100, you wrote: > > >> 2. Encodings API and library: > >> -------------------------------- > >> > >> We must deliver an encodings library which surpasses the features of > >> that in Java. It should allow conversion between many common > >> encodings; access to Unicode character properties; and anything else > >> which makes encoding conversion more pleasant. This should be > >> initially based on MAL's draft specification, although the spec may > >> be changed if we find good reason to. > > > >Note that Python will have builtin codec support.
The details > >are described in the proposal paper (not the C API though -- > >that still lives in the .h files of the Unicode implementation). > > > >Note that I have made some good experience with the existing > >spec: it is very flexible, extendable and versatile. It also > >greatly reduces coding efforts by providing working baseclasses. > > > I can't wait to try the code, and cannot foresee any problems at the > moment based on the spec. However, it was only discussed on the > Python-dev list, and Marc-Andree was not at IPC8, so I should try to > explain some background for everyone, (and what my agenda as SIG > moderator is too!) > > 1. HP joined the Python consortium and pushed for Unicode support last > year. There was a detailed discussion on the Python-dev list (to > which I was invited because my day-job included some very messy > double-byte work in Python for a year). Marc-Andre's proposal went > through about eight iterations, and he started to code it up under > contract to CNRI. This is official work, and there is no question of > anybody else's Unicode modules being used - sorry! Fredrik Lundh's > work on the Unicode regex engine is also under contract and > progressing rapidly. > > 2. MAL's document defines the API for 'codecs' - conversion filters - > but his taks does not include delivering a package with all the > world's common encodings in it. That is a necessity in the long run, > and both I (through ReportLab) and Digital Garage need to make at > least the Japanese encodings work quite soon. > > (Marc-Andre, can you update us on what codecs you are providing, and > how they are implemented? C or Python? ) These codecs are currently included: raw_unicode_escape.py utf_16_be.py unicode_escape.py utf_16_le.py ascii.py unicode_internal.py utf_8.py latin_1.py utf_16.py If time permits there will also be a generic mapping codec API which knows what to do with Python mapping tables. I'm not sure how this will be done though... 
perhaps via a subpackage of encodings which holds any number of tablename.py modules which a special search function then finds and uses. You'd then write something like u = unicode(rawdata, 'mapping-pc850') and the search function would then scan the encodings.mapping package for a module pc850 and use its mapping table for the conversion. > 3. At IPC8 we discussed (among other things) the delivery of the codec > package - both in the i18n forum and in the corridors as usual! To do > what Java does, we eventually need codecs for 50+ common encodings, > all available and tested. These will almost certainly not be in the > standard distribution, but there should eventually be a single, > certified, tested source for them, as this stuff has to be 100% right. > Quite a few of us urgently need good Japanese support. > > The current spec does not say whether codecs should be in C or Python. It is designed to make both possible. I currently code the converters in C and the rest in Python, which works very well and reduces coding efforts to a minimum (the codec base classes are designed to provide everything needed to get the most out of a simple setup). > Guido expressed the hope that a few carefully chosen C routines could > allow us to write new filters in Python, but get most of the speed of > C - an idea I'd been drip-feeding to him for some time :-) I think > that is a proper task for this group, and one I hope to put a lot of > work in to. I'm personally hoping that we can do a sort of > mini-mxTextTools state machine which has actions for lookups in > single-byte mapping tables, double-byte mapping tables and other > things, so that new encodings can be written and added easily, yet > still run fast. For example, all single-byte encodings can be dealt > with by a streaming version of something like string.translate(), so > adding a new one just becomes a matter of adding a 256-element list to > a file somewhere. 
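The single-byte scheme Andy sketches above - a decoder driven by nothing more than a 256-element table - can be illustrated like this (in present-day Python; the names DECODING_TABLE and decode_single_byte are made up for illustration, and the table shown is simply Latin-1, where byte i maps to code point i):

```python
# A 256-entry decoding table: index = byte value, entry = Unicode character.
# This particular table is just Latin-1 (byte i -> code point i); a real
# codec would ship a different 256-element list per single-byte encoding.
DECODING_TABLE = [chr(i) for i in range(256)]

def decode_single_byte(data, table):
    """Decode a byte string via a 256-element lookup table."""
    return "".join(table[b] for b in data)

print(decode_single_byte(b"caf\xe9", DECODING_TABLE))  # café
```

A codec for, say, CP850 or KOI8-R would differ only in the 256 entries of the table - which is exactly why adding a new single-byte encoding can reduce to adding one list to a file somewhere.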
I believe most of the double-byte ones can be > reduced to a few kb with the right functions as well. I'll be ready > to talk more about this shortly. There will be a mapping-based translate function or method in the final release which you should be able to build upon. > Guido also made it clear that while MAL's proposal is considered > pretty good, it is not set in stone yet. In particular, if the > double-byte specialists find that some minor tweaks would make their > lives better, he would consider it; we need a real-world test-drive > before 1.6, and this group is the place to do it. Right :-) > Now for my own opinions on how things should be run henceforth. Feel > free to differ! > > I should point out that the inner circle of Python developers are NOT > experts in multi-byte data. I feel strongly that we should seek out > the best expertise in the world, starting now. This discussion will > not focus on Unicode string implementation in the core, but on what > our encoding library lets you do at the application level. Ken > Lunde, author of "CJKV Information Processing", is the acknowledged (what does the V stand for ?) > world leader in this field, and agreed to take part in a discussion > and review our proposals - I'll try to bring him in shortly. It would > also be good to collar some people involved in the Java i18n libraries > and ask what they would do differently next time around, and to talk > to people who have worked with commercial tools like Unilib and > Rosette. Then, we won't just hope that Python has the best i18n > support, we'll know it. Naturally this review needs to happen fairly > promptly in March/April - maybe best to wait until we can run the > code. > > I hope this helps a little. If people have serious issues about where > things are heading, let's hear them now. > > Best Regards, > > Andy Robinson > > p.s.
one thing I would be very interested to hear is what people's > angles are - relevant experience, willingness to help out, needs for > solutions etc! -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Feb 9 09:42:16 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 09 Feb 2000 10:42:16 +0100 Subject: [I18n-sig] Re: I18n References: <38A07876.6EEA0655@equi4.com> <38a8c913.10032615@post.demon.co.uk> Message-ID: <38A13678.CC917B1A@lemburg.com> Andy Robinson wrote: > > As a more general point, the business requirements of someone working > in this field are usually to "move data from A to B", where A and B > are not Unicode. Unicode is a very useful tool which can sit in the > middle most of the time, and Unicode character properties solve many > problems in the CJKV world - but not all of them. The converters could make use of the Unicode private code point areas. The Python implementation leaves these untouched. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@robanal.demon.co.uk Mon Feb 14 09:47:53 2000 From: andy@robanal.demon.co.uk (Andy Robinson) Date: Mon, 14 Feb 2000 09:47:53 GMT Subject: [I18n-sig] Locales module In-Reply-To: References: Message-ID: <38aace0a.2761260@post.demon.co.uk> On Wed, 9 Feb 2000 08:29:42 +0100 (MET), you wrote: >Please excuse my ignorance if I've missed something, but surely you know >about 'locale.py', which already comes included with Python 1.5.2 >and works very well for me together with time.strftime. (at least under >several flavours of Unix/Linux; it's currently missing from Jack's >ready-to-run MacPython package). > >What additional functionality should the upcoming module provide? > >Regards, Peter Wow! No, I did not know about this at all.
I tested it and it works fine on Windows. I don't know about the POSIX API, but this is essentially a static database, so I presume one could dump the contents into a data structure which the Mac etc. could use if "from _locale import *" fails. (I suspect it could do with some more convenience stuff layered on top to do number and string formatting, and a bit more documentation - but non-critical). This was mentioned as a deficiency at IPC8, so we need to find out who said that and what they think is missing. Another attendee had serious questions about time zone handling, and I don't know what the issues are there either. Anyone have any feelings on this? - Andy From pf@artcom-gmbh.de Tue Feb 15 01:18:35 2000 From: pf@artcom-gmbh.de (Peter Funk) Date: Tue, 15 Feb 2000 02:18:35 +0100 (MET) Subject: [I18n-sig] Locales module In-Reply-To: <38aace0a.2761260@post.demon.co.uk> from Andy Robinson at "Feb 14, 2000 9:47:53 am" Message-ID: Hi! I wrote: [...] > >about 'locale.py', which already comes included with Python 1.5.2 > >and works very well for me together with time.strftime. Andy Robinson: > Wow! No, I did not know about this at all. I tested it and it works > fine on Windows. I don't know about the POSIX API, but this is > essentially a static database so I presume one could dump the contents > into a data structure which the Mac etc. used if "from _locale import > *" fails. (I suspect it could do with some more convenience stuff > layered on top to do number and string formatting, and a bit more > documentation - but non-critical). I don't know if this will work, since this stuff depends on some ANSI-C library features. I took a deeper look into the sources by Martin von Loewis, who has contributed locale.py and _localemodule.c. There is already some #ifdef macintosh in 'Modules/_localemodule.c'. So I really don't know why Jack Jansen's Python 1.5.2c1 binary distribution for the Mac doesn't contain the _locale module.
Maybe it was simply forgotten during the build, since it is disabled in Modules/Setup by default? I wonder whether it would make sense to fill 'locale.py' with dummy stubs that are put in place if an ImportError exception occurs due to a missing _locale builtin module? The following patch against a recent CVS version does that. But I am very unsure whether this behaviour is desired. Better i18n applications shouldn't depend on the availability of 'locale' and should contain their own fallback if importing locale fails. Regards, Peter -- Peter Funk, Oldenburger Str.86, 27777 Ganderkesee, Tel: 04222 9502 70, Fax: -60

*** ../../../Python-CVS_10_02_00-orig/dist/src/Lib/locale.py	Sat Feb  5 10:45:31 2000
--- Lib/locale.py	Tue Feb 15 01:55:35 2000
***************
*** 1,9 ****
  """Support for number formatting using the current locale settings."""
  
  # Author: Martin von Loewis
  
- from _locale import *
  import string
  
  #perform the grouping from right to left
  def _group(s):
--- 1,36 ----
  """Support for number formatting using the current locale settings."""
  
  # Author: Martin von Loewis
+ # Fallback stubs added by Peter Funk
  
  import string
  
+ try:
+     from _locale import *
+ except ImportError:
+     # this may happen on MacOS or on Unices where the locale support
+     # in Modules/Setup wasn't uncommented during the build of python
+     # we add some dummy stubs here in order not to break any apps:
+     CHAR_MAX = 127
+     LC_CTYPE, LC_NUMERIC, LC_TIME, LC_COLLATE, \
+         LC_MONETARY, LC_MESSAGES, LC_ALL = tuple(range(7))
+     def localeconv():
+         return {'grouping': [127], 'currency_symbol': '', 'n_sign_posn': 127,
+                 'p_cs_precedes': 127, 'n_cs_precedes': 127,
+                 'mon_grouping': [], 'n_sep_by_space': 127,
+                 'decimal_point': '.', 'negative_sign': '',
+                 'positive_sign': '', 'p_sep_by_space': 127,
+                 'int_curr_symbol': '', 'p_sign_posn': 127,
+                 'thousands_sep': '', 'mon_thousands_sep': '',
+                 'frac_digits': 127, 'mon_decimal_point': '',
+                 'int_frac_digits': 127}
+     def setlocale(category, arg):
+         if category == LC_ALL:
+             return \
+                 "LC_CTYPE=C;LC_NUMERIC=C;LC_TIME=C;LC_COLLATE=C;LC_MONETARY=C;LC_MESSAGES=C"
+         else:
+             return 'C'
+     def strcoll(s1, s2): return cmp(s1, s2)
+     def strxfrm(s): return s
  
  #perform the grouping from right to left
  def _group(s):

From andy@robanal.demon.co.uk Thu Feb 17 09:43:15 2000 From: andy@robanal.demon.co.uk (Andy Robinson) Date: Thu, 17 Feb 2000 09:43:15 GMT Subject: [I18n-sig] i18n talk at Monterey in July Message-ID: <38b3c2b0.4954644@post.demon.co.uk> The submissions deadline for the July Open Source conference in Monterey is tomorrow. I'd like to ensure that there is a slot for Python internationalisation (say 45 minutes), to show how to use the Unicode features and encodings library and explain some of the problems we are trying to solve. This could be great fun - we can do nice visuals with the Japanese stuff - and will be relevant as 1.6 will be out around then. I will do this myself if needed, but is anyone else willing to co-present and help prepare the talk? The programme itself won't be published in print for some time, so I guess dropping out later is allowed, but names must go on draft proposals today/tomorrow. (e.g. Marc-Andre, Brian, Cyrus?) Who's planning to be there, anyway? - Andy Robinson From mal@lemburg.com Thu Feb 17 11:05:15 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 17 Feb 2000 12:05:15 +0100 Subject: [I18n-sig] i18n talk at Monterey in July References: <38b3c2b0.4954644@post.demon.co.uk> Message-ID: <38ABD5EB.404AC678@lemburg.com> Andy Robinson wrote: > > The submissions deadline for the July Open Source conference in > Monterey is tomorrow. I'd like to ensure that there is a slot for > Python internationalisation (say 45 minutes), to show how to use the > Unicode features and encodings library and explain some of the > problems we are trying to solve. This could be great fun - we can do > nice visuals with the Japanese stuff - and will be relevant as 1.6 > will be out around then.
> > I will do this myself if needed, but is anyone else willing to > co-present and help prepare the talk? > > The programme itself won't be published in print for some time, so I > guess dropping out later is allowed, but names must go on draft > proposals today/tomorrow. > > (e.g. Marc-Andre, Brian, Cyrus?) > > Who's planning to be there, anyway? I won't have time to spend on this, because I'm packed with work (also, I won't be online next week), sorry. Anyway, the Unicode code will go into CVS within the first two weeks in March, so you should be able to test and verify the new features really soon now :-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/