From Misha.Wolf@reuters.com Fri Sep 7 19:12:01 2001
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 07 Sep 2001 19:12:01 +0100
Subject: [I18n-sig] Last Call for Papers - 20th Unicode Conference - Jan/Feb 2001 - Washington DC
Message-ID:

>>>>>>>>>>>>>>>>>>>>>>> Last Call for Papers! <<<<<<<<<<<<<<<<<<<<<<<

Twentieth International Unicode Conference (IUC20)
Unicode and the Web: The Global Connection
http://www.unicode.org/iuc/iuc20
January 28 - February 1, 2002
Washington, DC, USA

>>>>>>>>>>>>>>>>>>>>>>>> Just 2 weeks to go! <<<<<<<<<<<<<<<<<<<<<<<<

Submissions due:       September 21, 2001
Notification date:     October 12, 2001
Completed papers due:  November 2, 2001
(in electronic form and camera-ready paper form)

>>>>>>>>>>>>>>>>>>> Send in your submission now! <<<<<<<<<<<<<<<<<<<<

The Unicode Standard has become the foundation for all modern text processing. It is used on large machines, tiny portable devices, and for distributed processing across the Internet. The standard brings cost-reducing efficiency to international applications and enables the exchange of text in an ever increasing list of natural languages.

New technologies and innovative Internet applications, as well as the evolving Unicode Standard, bring new challenges along with their new capabilities. This technical conference will explore the opportunities created by the latest advances and how to leverage them, as well as potential pitfalls to be aware of, and problem areas that need further research.

We invite you to submit papers which either define the software of tomorrow, demonstrate best practice with today's software, or articulate problems that must be solved before further advances can occur. Papers should discuss subjects in the context of Unicode, internationalization or localization.
You can view the programs of previous conferences at:
http://www.unicode.org/unicode/conference/about-conf.html

Conference attendees are generally involved in either the development, deployment or use of Unicode software or content, or the globalization of software and the Internet. They include managers, software engineers, systems analysts, font designers, graphic designers, content developers, technical writers, and product marketing personnel.

THEME & TOPICS

Computing with Unicode is the overall theme of the Conference. Presentations should be geared towards a technical audience. Topics of interest include, but are not limited to, the following (within the context of Unicode, internationalization or localization):

- UTFs: Not enough or too many?
- Security concerns e.g. Avoiding the spoofing of UTF-8 data
- Impact of new encoding standards
- Implementing Unicode: Practical and political hurdles
- Portable devices
- Implementing new features of recent versions of Unicode
- Algorithms (e.g. normalization, collation, bidirectional)
- Programming languages and libraries (Java, Perl, et al)
- The World Wide Web (WWW)
- Search engines
- Library and archival concerns
- Operating systems
- Databases
- Large scale networks
- Government applications
- Evaluations (case studies, usability studies)
- Natural language processing
- Migrating legacy applications
- Cross platform issues
- Printing and imaging
- Optimizing performance of systems and applications
- Testing applications
- XML and Web protocols
- Business models for software development (e.g. Open source)

SESSIONS

The Conference Program will provide a wide range of sessions including:

- Keynote presentations
- Workshops/Tutorials
- Technical presentations
- Panel sessions

All sessions except the Workshops/Tutorials will be of 40 minute duration. In some cases, two consecutive 40 minute program slots may be devoted to a single session.

The Workshops/Tutorials will each last approximately three hours.
They should be designed to stimulate discussion and participation, using slides and demonstrations.

PUBLICITY

If your paper is accepted, your details will be included in the Conference brochure and Web pages and the paper itself will appear on a Conference CD, with an optional printed book of Conference Proceedings.

CONFERENCE LANGUAGE

The Conference language is English. All submissions, papers and presentations should be provided in English.

SUBMISSIONS

Submissions MUST contain:

1. An abstract of 150-250 words, consisting of statement of purpose, paper description, and your conclusions or final summary.

2. A brief biography.

3. The details listed below:

   SESSION TITLE:            _________________________________________
                             _________________________________________
   YOUR TITLE (eg Prof):     _________________________________________
   YOUR NAME:                _________________________________________
   YOUR JOB TITLE:           _________________________________________
   ORGANIZATION/AFFILIATION: _________________________________________
   ORGANIZATION'S WWW URL:   _________________________________________
   YOUR WWW URL:             _________________________________________
   ADDRESS FOR PAPER MAIL:   _________________________________________
                             _________________________________________
                             _________________________________________
   TELEPHONE:                _________________________________________
   FAX:                      _________________________________________
   E-MAIL ADDRESS:           _________________________________________

   TYPE OF SESSION:          [ ] Keynote presentation
                             [ ] Workshop/Tutorial
                             [ ] Technical presentation
                             [ ] Panel

   PANELISTS (if Panel):     _________________________________________
                             _________________________________________
                             _________________________________________
                             _________________________________________
                             _________________________________________
                             _________________________________________
                             _________________________________________
                             _________________________________________

   TARGET AUDIENCE (you may select more than one category):
                             [ ] Content Developers
                             [ ] Font Designers
                             [ ] Graphic Designers
                             [ ] Managers
                             [ ] Marketers
                             [ ] Software Engineers
                             [ ] Systems Analysts
                             [ ] Technical Writers
                             [ ] Others (please specify):
                             _________________________________________
                             _________________________________________

   LEVEL OF SESSION (you may select more than one category):
                             [ ] Beginner
                             [ ] Intermediate
                             [ ] Advanced

Submissions should be sent by e-mail to either of the following addresses:

   papers@unicode.org
   info@global-conference.com

They should use ASCII, non-compressed text and the following subject line:

   Proposal for IUC 20

If desired, a copy of the submission may also be sent by post to:

   Twentieth International Unicode Conference
   c/o Global Meeting Services, Inc.
   4360 Benhurst Avenue
   San Diego, CA 92122
   USA
   Tel: +1 858 638 0206
   Fax: +1 858 638 0504

CONFERENCE PROCEEDINGS

All Conference papers will be published on CD. Printed proceedings will be offered as an option.

EXHIBIT OPPORTUNITIES

The Conference will have an Exhibition area for corporations or individuals who wish to display and promote their products, technology and/or services. Every effort will be made to provide maximum exposure and advertising.

Exhibit space is limited. For further information or to reserve a place, please contact Global Meeting Services at the above location.

CONFERENCE VENUE

   Omni Shoreham Hotel
   2500 Calvert Street, NW
   Washington, DC 20008
   USA
   Tel: +1 202 234 0700
   Fax: +1 202 265 7972

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991. It is dedicated to the development, maintenance and promotion of The Unicode Standard, a worldwide character encoding. The Unicode Standard encodes the characters of the world's principal scripts and languages, and is code-for-code identical to the international standard ISO/IEC 10646. In addition to cooperating with ISO on the future development of ISO/IEC 10646, the Consortium is responsible for providing character properties and algorithms for use in implementations.
Today the membership base of the Unicode Consortium includes major computer corporations, software producers, database vendors, research institutions, international agencies and various user groups.

For further information on the Unicode Standard, visit the Unicode Web site at http://www.unicode.org or e-mail

* * * * *

Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

-----------------------------------------------------------------
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Reuters Ltd.

From Misha.Wolf@reuters.com Wed Sep 12 16:51:49 2001
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Wed, 12 Sep 2001 16:51:49 +0100
Subject: [I18n-sig] Status of the Unicode Conference
Message-ID:

There follows a message from Lisa Moore, Unicode Conference co-chair. For Conference details, see http://www.unicode.org/iuc/iuc19 .

Misha

~~~

From the Unicode conference, let me say that yes, there is a conference underway. Certainly many people who planned to attend are no longer able to. So far, about ten of our speakers are unable to make travel arrangements.

To the best of my knowledge, no one involved in the conference was in New York or on one of the flights involved in today's tragedies. We have contacted most of the speakers who have not been able to travel, and they are well, and many are still planning on coming.

So, if you are in the Bay Area and wish to attend, please do so - we would very much like to see you.

Lisa
From rnd@onego.ru Fri Sep 14 20:38:00 2001
From: rnd@onego.ru (Roman Suzi)
Date: Fri, 14 Sep 2001 23:38:00 +0400 (MSD)
Subject: [I18n-sig] pygettext and PEP #?#
Message-ID:

Hello!

I remember we had a heated discussion about how to tell Python which encoding its code is in.

po-files use the following convention:

"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: Fri Sep 14 21:32:52 2001\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME \n"
"Language-Team: LANGUAGE \n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: ENCODING\n"
"Generated-By: pygettext.py 1.3\n"

Perhaps Python's module-level doc string could also adopt an RFC-822 style header that provides such meta-information? (Right now doc strings do not concatenate.)

That is, make __doc__ an RFC822 message whose header carries the meta-information and whose body holds the usual comments.

Sincerely yours, Roman Suzi
--
_/ Russia _/ Karelia _/ Petrozavodsk _/ rnd@onego.ru _/
_/ Friday, September 14, 2001 _/ Powered by Linux RedHat 6.2 _/
_/ "URA Redneck if you own a homemade fur coat." _/

From martin@loewis.home.cs.tu-berlin.de Fri Sep 14 21:49:08 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Fri, 14 Sep 2001 22:49:08 +0200
Subject: [I18n-sig] pygettext and PEP #?#
In-Reply-To: (message from Roman Suzi on Fri, 14 Sep 2001 23:38:00 +0400 (MSD))
References:
Message-ID: <200109142049.f8EKn8B02520@mira.informatik.hu-berlin.de>

> Perhaps Python's module-level doc string could also adopt an
> RFC-822 style header that provides such meta-information?
> (Right now doc strings do not concatenate.)
>
> That is, make __doc__ an RFC822 message whose header carries the
> meta-information and whose body holds the usual comments.

It depends on what you want to use this information for. If you want the interpreter to automatically react in some way (e.g. convert strings to Unicode objects automatically based on the module encoding), then I suggest that (ab-)using the doc string for that is a bad idea.

Furthermore, I doubt that many users of doc strings are interested in the encoding of the module doc string (which they'd get when doing help(module)); instead, they only care that it prints right even if it is not ASCII.

There are many ways to signal languages, and RFC822 headers are surely one of them (the application in GNU message catalogs originated from MIME, which is also the foundation for indicating languages in HTTP). So the problem is not so much the format of the meta information, but where to place it and how to process it.

Regards,
Martin

From Alexandre.Fayolle@logilab.fr Sat Sep 15 18:51:26 2001
From: Alexandre.Fayolle@logilab.fr (Alexandre Fayolle)
Date: Sat, 15 Sep 2001 19:51:26 +0200 (CEST)
Subject: [I18n-sig] gettext and windows
Message-ID:

Hello,

I'm testing an app that runs fine under Linux, but whose l10n fails under Windows (Win98). By tracking down the code in gettext.py, I saw that this module uses environment variables to get the current locale. However, this is not the correct way of doing things on Windows systems, since the LC_ALL variable is generally not set, resulting in the C locale being used.

I have a patch which uses locale.getdefaultlocale()[0] to get information, but I wanted to know if there was a reason why this had not been used in the first place.

Thanks

Alexandre Fayolle
--
LOGILAB, Paris (France).
http://www.logilab.com http://www.logilab.fr http://www.logilab.org
Narval, the first software agent available as free software (GPL).

From martin@loewis.home.cs.tu-berlin.de Mon Sep 17 07:09:32 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Mon, 17 Sep 2001 08:09:32 +0200
Subject: [I18n-sig] gettext and windows
In-Reply-To: (message from Alexandre Fayolle on Sat, 15 Sep 2001 19:51:26 +0200 (CEST))
References:
Message-ID: <200109170609.f8H69Wb01007@mira.informatik.hu-berlin.de>

> I have a patch which uses locale.getdefaultlocale()[0] to get information,
> but I wanted to know if there was a reason why this had not been used in
> the first place.

gettext.find uses the GNU gettext strategy for locating catalogs. I think there should be a routine that models GNU gettext as closely as possible, even on Windows - Windows also supports environment variables, after all. That routine does not need to be called gettext.find, though.

I'd agree that this algorithm is not optimal. However, just considering the default locale is not appropriate, either:

- On Windows, there is a user locale and a system locale. I don't know what the chance is that they ever differ, but if they do, this might need consideration.

- Currently, catalogs are located in /share/locale//LC_MESSAGES. This is appropriate on Unix, since each of the directories on the path will contain a lot of other stuff. It is less appropriate on Windows; we should consider placing the catalogs into a location nearer to the root of the Python installation.

- GNU gettext has the feature of fallback languages, e.g. setting LANGUAGE to "fr:es" indicates that you prefer French translations, but if none are available, you'd prefer Spanish ones over the default text (which typically is English). That may be worth exposing as well (*).

So there is something to be fixed, but it appears that more is involved than just looking at the default locale.

Regards,
Martin

(*) Of course, the fallback mechanism is not fully implemented in gettext.py yet: it will fall back on a per-catalog basis, but not on a per-message basis.
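One way to combine the two positions in this thread is to keep the GNU-style environment lookup, honour the colon-separated fallback list, and only then consult a platform default. The sketch below is an illustration under stated assumptions, not the patch Alexandre mentions: the function name preferred_languages and its parameters are hypothetical, and the platform default is passed in by the caller (for example the value of locale.getdefaultlocale()[0]) rather than looked up inside the function.

```python
import os


def preferred_languages(environ=None, platform_default=None):
    """Return an ordered list of language codes to try (hypothetical helper).

    Checks LANGUAGE, LC_ALL, LC_MESSAGES and LANG in that order, as GNU
    gettext does. If none is set (the common case on Windows), falls back
    to platform_default, which the caller supplies.
    """
    if environ is None:
        environ = os.environ
    for name in ("LANGUAGE", "LC_ALL", "LC_MESSAGES", "LANG"):
        value = environ.get(name)
        if value:
            # A value such as "fr:es" is a priority list: French first,
            # then Spanish. Single values yield a one-element list.
            return [lang for lang in value.split(":") if lang]
    if platform_default:
        return [platform_default]
    # Nothing configured anywhere: the untranslated "C" locale.
    return ["C"]


print(preferred_languages({"LANGUAGE": "fr:es"}))
print(preferred_languages({}, platform_default="fr_FR"))
```

The resulting list can be handed to gettext.translation() through its languages argument; with fallback=True the untranslated text is used when no catalog is found. As the footnote above says, that fallback works per catalog, not per message.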
From Misha.Wolf@reuters.com Wed Sep 19 06:23:07 2001
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Wed, 19 Sep 2001 06:23:07 +0100
Subject: [I18n-sig] Last Call for Papers - 20th Unicode Conference - Jan/Feb 2001 - Washington DC
Message-ID:

Because of the recent tragic events and the resulting disruption we are sending you a reminder that this is the final week for submissions for the Twentieth International Unicode Conference (IUC20).

From kajiyama@grad.sccs.chukyo-u.ac.jp Tue Sep 25 16:38:13 2001
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Wed, 26 Sep 2001 00:38:13 +0900
Subject: [I18n-sig] JapaneseCodecs 1.4 released
Message-ID: <200109251538.AAA30063@dhcp209.grad.sccs.chukyo-u.ac.jp>

Hi all,

I released JapaneseCodecs version 1.4. The source tarball is available at the following location:

http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/

The major enhancement of this release is the set of new codecs written in C. The performances in both speed and storage size would be impressive as described below. Please check it out!

Here is the result of a simple benchmark test that encodes a Unicode string and then decodes it back. The new codecs written in C are much much faster than the old codecs written in Python (time is shown in seconds).
a Unicode string of 10,000 chars
                        in Python    in C
japanese.euc-jp         1.074        0.003859
japanese.shift_jis      1.059        0.003981
japanese.iso-2022-jp    0.842        0.007737

a Unicode string of 100,000 chars
                        in Python    in C
japanese.euc-jp         11.54        0.02978
japanese.shift_jis      11.55        0.03047
japanese.iso-2022-jp    8.345        0.06522

a Unicode string of 1,000,000 chars
                        in Python    in C
japanese.euc-jp         126.7        0.2259
japanese.shift_jis      125.9        0.2276
japanese.iso-2022-jp    82.87        0.5892

The runtime memory size is also reduced drastically. On a Linux box of mine, the old codecs in Python require 3,364K bytes of runtime memory, while the new codecs in C occupy only 124K bytes. In addition, the start-up time of the Python interpreter is much shorter if one of the Japanese codecs is used as the system default encoding.

I adopted a hashing technique in order to achieve high performance in both speed and storage size. Thanks, Marc-Andre, for your advice (given in a couple of private messages a long time ago ;-).

Part of the program in src/_japanese_codecs.c is based on ms932codec.c written by Atsuo ISHIMOTO. Some helper functions are used as they are. I appreciate his invaluable work.

For developers of possible derived packages: Character mapping tables in the form of hash tables are in src/_japanese_codecs.h. This is an auto-generated file; you may want to look at the hash table generator src/hgen.py and the hash table look-up functions in src/_japanese_codecs.c (lookup_jis_map() and lookup_ucs_map()). If you are familiar with the programming of Python extension modules, you will be able to apply the code to other character encodings such as EUC-KR and BIG-5 without trouble.

The hashing function f() is (charcode % 523), and I heuristically chose the divisor (a prime number greater than 256). I believe that the value 523 is not bad in many cases. In general, the larger the divisor is, the faster the look-up functions run, and the bigger the hash tables are (and vice versa). Try other prime numbers if the resulting performance of the look-up functions and the sizes of the hash tables are not desirable.

The new codecs in C are very young, and probably have a number of bugs. Any kind of feedback is very much appreciated.

Thank you,

--
KAJIYAMA, Tamito
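The hashing scheme described above can be sketched in a few lines of Python. This is only an illustration of bucketing by (charcode % 523), not the code in src/hgen.py or src/_japanese_codecs.c: the names build_table and lookup, the bucket layout, and the three-entry sample mapping are all assumptions made for the example.

```python
HASH_SIZE = 523  # the prime divisor quoted in the announcement


def build_table(mapping, size=HASH_SIZE):
    """Group (source, target) code pairs into buckets keyed by source % size."""
    buckets = [[] for _ in range(size)]
    for src, dst in mapping.items():
        buckets[src % size].append((src, dst))
    return buckets


def lookup(buckets, charcode):
    """Return the mapped code for charcode, or None if it is unmapped."""
    # Only the one bucket selected by the hash is scanned.
    for src, dst in buckets[charcode % len(buckets)]:
        if src == charcode:
            return dst
    return None


# Three JIS X 0208 -> Unicode pairs, used here only as sample data.
sample = {0x2422: 0x3042, 0x2522: 0x30A2, 0x3021: 0x4E9C}
table = build_table(sample)

print(hex(lookup(table, 0x2422)))
print(lookup(table, 0x2421))
```

With a larger divisor each bucket holds fewer entries, so the scan inside lookup() is shorter, but the table has more buckets to store. That is the speed versus size trade-off the message describes.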