From Misha.Wolf@reuters.com Fri Feb 8 19:45:19 2002
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 08 Feb 2002 19:45:19 +0000
Subject: [I18n-sig] 21st Unicode Conference, May 2002, Dublin, Ireland
Message-ID:

>>>>>>>>>>>>>>>>>> First European IUC in two years! <<<<<<<<<<<<<<<<<<<

Twenty-first International Unicode Conference (IUC21)
Unicode, Localization and the Web: The Global Connection
http://www.unicode.org/iuc/iuc21
May 14-17, 2002
Dublin, Ireland

>>>>>>>>>>>>>>>>>>>>>>>> Just 13 weeks to go! <<<<<<<<<<<<<<<<<<<<<<<<<

The Unicode Standard has become the foundation for all modern text
processing. It is used on large machines, tiny portable devices, and
for distributed processing across the Internet. The standard brings
cost-reducing efficiency to international applications and enables
the exchange of text in an ever increasing list of natural languages.

New technologies and innovative Internet applications, as well as the
evolving Unicode Standard, bring new challenges along with their new
capabilities. The Twenty-first International Unicode Conference
(IUC21) will explore the opportunities created by the latest advances
and how to leverage them, as well as potential pitfalls to be aware
of, and problem areas that need further research.

Conference attendees will include managers, software engineers,
systems analysts, font designers, graphic designers, content
developers, technical writers, and product marketing personnel,
involved in the development, deployment or use of Unicode software or
content, and the globalization of software and the Internet.

CONFERENCE WEB SITE, PROGRAM and REGISTRATION

The Conference Program and Registration form will be available soon
at the Conference Web site:

    http://www.unicode.org/iuc/iuc21

CONFERENCE SPONSORS

    Agfa Monotype Corporation
    Basis Technology Corporation
    Localisation Research Centre
    Microsoft Corporation
    Reuters Ltd.
    Sun Microsystems, Inc.
    World Wide Web Consortium (W3C)

GLOBAL COMPUTING SHOWCASE

Visit the Showcase to find out more about products supporting the
Unicode Standard, and products and services that can help you
globalize/localize your software, documentation and Internet content.
For details, visit the Conference Web site.

CONFERENCE VENUE

The Conference will take place at:

    The Burlington Hotel
    Upper Leeson Street
    Dublin 4, Ireland

    Tel: (+353 1) 660 5222
    Fax: (+353 1) 660 8496

CONFERENCE MANAGEMENT

    Global Meeting Services Inc.
    8949 Lombard Place, #416
    San Diego, CA 92122, USA

    Tel:   +1 858 638 0206 (voice)
           +1 858 638 0504 (fax)
    Email: info@global-conference.com
    or:    conference@unicode.org

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in
1991. It is dedicated to the development, maintenance and promotion
of The Unicode Standard, a worldwide character encoding. The Unicode
Standard encodes the characters of the world's principal scripts and
languages, and is code-for-code identical to the international
standard ISO/IEC 10646. In addition to cooperating with ISO on the
future development of ISO/IEC 10646, the Consortium is responsible
for providing character properties and algorithms for use in
implementations. Today the membership base of the Unicode Consortium
includes major computer corporations, software producers, database
vendors, research institutions, international agencies and various
user groups.

For further information on the Unicode Standard, visit the Unicode
Web site at http://www.unicode.org or e-mail

* * * * *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc. Used with permission.

----------------------------------------------------------------
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of the individual
sender, except where the sender specifically states them to be
the views of Reuters Ltd.

From janssen@parc.xerox.com Fri Feb 8 23:05:34 2002
From: janssen@parc.xerox.com (Bill Janssen)
Date: Fri, 8 Feb 2002 15:05:34 PST
Subject: [I18n-sig] IANA names for character set encodings?
Message-ID: <02Feb8.150534pst."3456"@watson.parc.xerox.com>

Folks,

I've been playing with the charset support in Python 2.x, and I want
to congratulate you on a great addition to the language. It should
really be more widely advertised! I think it makes Python the premier
language for string processing.

One thing that puzzles me, though, is the lack of support for the
standard IANA-registered names for the various charsets, as given in
http://www.iana.org/assignments/character-sets. I notice that the file
encodings/aliases.py (in Python 2.2) does contain a few of these, but
other charsets like windows-1256 cannot be referred to by its standard
name -- it's cp1256 in Python. This is highly counter-intuitive when
parsing HTML for instance, with "text/plain; charset=windows-1256" as
the media type.

The IANA charset table is fairly easy to parse automatically; see the
tail end of
http://cvs.plkr.org/index.cgi/parser/python/PyPlucker/helper/CharsetMapping.py?rev=HEAD&content-type=text/vnd.viewcvs-markup
for code which does so.

I'd suggest renaming the existing codecs according to their IANA
names, then adding the current names to the aliases list.

Bill

From mal@lemburg.com Fri Feb 8 23:33:32 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 09 Feb 2002 00:33:32 +0100
Subject: [I18n-sig] IANA names for character set encodings?
References: <02Feb8.150534pst."3456"@watson.parc.xerox.com>
Message-ID: <3C64604C.1F87700B@lemburg.com>

Bill Janssen wrote:
>
> Folks,
>
> I've been playing with the charset support in Python 2.x, and I want
> to congratulate you on a great addition to the language. It should
> really be more widely advertised! I think it makes Python the premier
> language for string processing.
>
> One thing that puzzles me, though, is the lack of support for the
> standard IANA-registered names for the various charsets, as given in
> http://www.iana.org/assignments/character-sets. I notice that the file
> encodings/aliases.py (in Python 2.2) does contain a few of these, but
> other charsets like windows-1256 cannot be referred to by its standard
> name -- it's cp1256 in Python. This is highly counter-intuitive when
> parsing HTML for instance, with "text/plain; charset=windows-1256" as
> the media type.
>
> The IANA charset table is fairly easy to parse automatically; see the
> tail end of
> http://cvs.plkr.org/index.cgi/parser/python/PyPlucker/helper/CharsetMapping.py?rev=HEAD&content-type=text/vnd.viewcvs-markup
> for code which does so.
>
> I'd suggest renaming the existing codecs according to their IANA
> names, then adding the current names to the aliases list.

That won't work since you can import the codec by their current
names as normal modules. However, we could add more aliases
for them if needed.

Adding all of them seems overkill though... and cumbersome, e.g.
nobody uses names like ANSI_X3.4-1968 -- us-ascii is the
common name.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/

From janssen@parc.xerox.com Sat Feb 9 00:59:56 2002
From: janssen@parc.xerox.com (Bill Janssen)
Date: Fri, 8 Feb 2002 16:59:56 PST
Subject: [I18n-sig] IANA names for character set encodings?
In-Reply-To: Your message of "Fri, 08 Feb 2002 15:33:32 PST." <3C64604C.1F87700B@lemburg.com>
Message-ID: <02Feb8.170000pst."3456"@watson.parc.xerox.com>

> Adding all of them seems overkill though... and cumbersome, e.g.
> nobody uses names like ANSI_X3.4-1968 -- us-ascii is the
> common name.

Since the aliases can live in an aliases file, I don't see that it's a
big deal to add them all, and it will really help in dealing with
Internet protocols correctly. It is valid to put 'ANSI_X3.4-1968' in
as a charset value when sending something. I'd like my Python app to
be able to cope with that possibility.

Bill

From tree@basistech.com Sat Feb 9 01:10:16 2002
From: tree@basistech.com (Tom Emerson)
Date: Fri, 8 Feb 2002 20:10:16 -0500
Subject: [I18n-sig] IANA names for character set encodings?
In-Reply-To: <3C64604C.1F87700B@lemburg.com>
References: <02Feb8.150534pst."3456"@watson.parc.xerox.com> <3C64604C.1F87700B@lemburg.com>
Message-ID: <15460.30456.170987.457703@magrathea.basistech.com>

M.-A. Lemburg writes:
> Adding all of them seems overkill though... and cumbersome, e.g.
> nobody uses names like ANSI_X3.4-1968 -- us-ascii is the
> common name.

Sure, but I've seen machine generated markup/documents that make use
of the ANSI_X3.4-1968 name, particularly those coming out of
Government agencies.

If we are going to support the IANA names, then there is no reason not
to support all of them. Picking and choosing those that we think
aren't used is asking for a bug report.

This is a no brainer.

    -tree

--
Tom Emerson                                          Basis Technology Corp.
Sr. Computational Linguist                         http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"

From mal@lemburg.com Sat Feb 9 11:19:49 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 09 Feb 2002 12:19:49 +0100
Subject: [I18n-sig] IANA names for character set encodings?
References: <02Feb8.150534pst."3456"@watson.parc.xerox.com> <3C64604C.1F87700B@lemburg.com> <15460.30456.170987.457703@magrathea.basistech.com>
Message-ID: <3C6505D5.122D4D98@lemburg.com>

Tom Emerson wrote:
>
> M.-A. Lemburg writes:
> > Adding all of them seems overkill though... and cumbersome, e.g.
> > nobody uses names like ANSI_X3.4-1968 -- us-ascii is the
> > common name.
>
> Sure, but I've seen machine generated markup/documents that make use
> of the ANSI_X3.4-1968 name, particularly those coming out of
> Government agencies.
>
> If we are going to support the IANA names, then there is no reason not
> to support all of them. Picking and choosing those that we think
> aren't used is asking for a bug report.
>
> This is a no brainer.

How large would such an alias dictionary be ?

Looking at the IANA listing it seems rather lengthy. What I'm
worried about is that Python startup time will get worse for
programs using codecs (I sometimes wish Python had a builtin
on-disk registry where we could put static data like this).

Anyway, if you all think this is a non-issue, fine with me.

Bill, can you parse the IANA listing into dictionary
definition ?

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/

From fredrik@pythonware.com Sat Feb 9 11:43:46 2002
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Sat, 9 Feb 2002 12:43:46 +0100
Subject: [I18n-sig] IANA names for character set encodings?
References: <02Feb8.150534pst."3456"@watson.parc.xerox.com> <3C64604C.1F87700B@lemburg.com> <15460.30456.170987.457703@magrathea.basistech.com> <3C6505D5.122D4D98@lemburg.com>
Message-ID: <010901c1b15f$0bc03580$ced241d5@hagrid>

mal wrote:
> How large would such an alias dictionary be ?
>
> Looking at the IANA listing it seems rather lengthy. What I'm
> worried about is that Python startup time will get worse for
> programs using codecs (I sometimes wish Python had a builtin
> on-disk registry where we could put static data like this).

why split it up in two parts; put common aliases in one table
(latin*, utf*, us-ascii, iso-8859, iso-2022, and perhaps some
more), put that table inside __init__, and change the search
function to:

    1) look for a common aliases in the small table
    2) try importing the module
    3) if import fails, import "aliases", look it up in the
       big table, and try again

in this way, people who use the "true" names and commonly
used aliases won't have to load the big alias table at all.

From fredrik@pythonware.com Sat Feb 9 11:44:44 2002
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Sat, 9 Feb 2002 12:44:44 +0100
Subject: [I18n-sig] IANA names for character set encodings?
Message-ID: <010f01c1b15f$2ca0f500$ced241d5@hagrid>

> why split it up in two parts

I guess I meant:

> why not split it up in two parts?

From mal@lemburg.com Sat Feb 9 11:53:01 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 09 Feb 2002 12:53:01 +0100
Subject: [I18n-sig] IANA names for character set encodings?
References: <02Feb8.150534pst."3456"@watson.parc.xerox.com> <3C64604C.1F87700B@lemburg.com> <15460.30456.170987.457703@magrathea.basistech.com> <3C6505D5.122D4D98@lemburg.com> <010901c1b15f$0bc03580$ced241d5@hagrid>
Message-ID: <3C650D9D.276CE8D6@lemburg.com>

Fredrik Lundh wrote:
>
> mal wrote:
> > How large would such an alias dictionary be ?
> >
> > Looking at the IANA listing it seems rather lengthy. What I'm
> > worried about is that Python startup time will get worse for
> > programs using codecs (I sometimes wish Python had a builtin
> > on-disk registry where we could put static data like this).
>
> why split it up in two parts; put common aliases in one table
> (latin*, utf*, us-ascii, iso-8859, iso-2022, and perhaps some
> more), put that table inside __init__, and change the search
> function to:
>
>     1) look for a common aliases in the small table
>     2) try importing the module
>     3) if import fails, import "aliases", look it up in the
>        big table, and try again
>
> in this way, people who use the "true" names and commonly
> used aliases won't have to load the big alias table at all.

Good idea. Let's do it that way.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/

From janssen@parc.xerox.com Tue Feb 12 00:38:02 2002
From: janssen@parc.xerox.com (Bill Janssen)
Date: Mon, 11 Feb 2002 16:38:02 PST
Subject: [I18n-sig] IANA names for character set encodings?
In-Reply-To: Your message of "Sat, 09 Feb 2002 03:19:49 PST." <3C6505D5.122D4D98@lemburg.com>
Message-ID: <02Feb11.163812pst."3456"@watson.parc.xerox.com>

> Bill, can you parse the IANA listing into dictionary
> definition ?

Sure. What do you want as key and what as value?

Bill

From janssen@parc.xerox.com Tue Feb 12 00:40:41 2002
From: janssen@parc.xerox.com (Bill Janssen)
Date: Mon, 11 Feb 2002 16:40:41 PST
Subject: [I18n-sig] IANA names for character set encodings?
In-Reply-To: Your message of "Sat, 09 Feb 2002 03:19:49 PST." <3C6505D5.122D4D98@lemburg.com>
Message-ID: <02Feb11.164047pst."3456"@watson.parc.xerox.com>

> Looking at the IANA listing it seems rather lengthy. What I'm
> worried about is that Python startup time will get worse for
> programs using codecs (I sometimes wish Python had a builtin
> on-disk registry where we could put static data like this).

The URL I cited has such an alias listing -- not too bad. Or it could
be in an alternate module iana-charset-names, or some such, to be
loaded on demand.
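
[Editor's sketch, not part of the original thread. The three-step
lookup order proposed earlier in this thread (small table of common
aliases, then the codec module itself, then a large alias table loaded
only on demand) might look roughly like the code below. It is not the
code that was committed to Python. Every name in it (COMMON_ALIASES,
CODEC_MODULES, resolve_codec_name, big_table_loaded) is hypothetical,
the table entries are illustrative only, and the "try importing the
module" step is simulated with a set of names so the sketch runs on
its own.]

```python
# Hypothetical small table of very common aliases, always in memory.
COMMON_ALIASES = {
    "us-ascii": "ascii",
    "latin-1": "latin_1",
    "utf-8": "utf_8",
}

# Stand-in for "codec modules that can be imported by name".
# A real implementation would attempt an import instead.
CODEC_MODULES = {"ascii", "latin_1", "utf_8", "cp1256"}

_big_aliases = None  # large alias table, built only when first needed


def _load_big_aliases():
    """Simulate importing a large, IANA-derived alias module on demand."""
    global _big_aliases
    if _big_aliases is None:
        # Illustrative entries only; a real table would be generated
        # from the IANA character-sets registry.
        _big_aliases = {
            "ansi_x3.4-1968": "ascii",
            "windows-1256": "cp1256",
        }
    return _big_aliases


def big_table_loaded():
    """Report whether the large alias table has been loaded yet."""
    return _big_aliases is not None


def resolve_codec_name(name):
    """Map a charset name to a codec module name using the 3-step order."""
    key = name.strip().lower()
    # 1) look for a common alias in the small table
    if key in COMMON_ALIASES:
        return COMMON_ALIASES[key]
    # 2) try the name directly (stands in for importing the module)
    if key in CODEC_MODULES:
        return key
    # 3) only now load the big table, look the name up, and try again
    target = _load_big_aliases().get(key)
    if target in CODEC_MODULES:
        return target
    raise LookupError("unknown encoding: %s" % name)
```

[With this ordering, a program that only ever asks for "us-ascii" or
"cp1256" never pays for the large table; it is built the first time a
less common name such as "windows-1256" or "ANSI_X3.4-1968" is
requested.]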

http://cvs.plkr.org/index.cgi/parser/python/PyPlucker/helper/CharsetMapping.py?rev=HEAD&content-type=text/plain

Bill

From mal@lemburg.com Tue Feb 12 10:09:13 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 12 Feb 2002 11:09:13 +0100
Subject: [I18n-sig] IANA names for character set encodings?
References: <02Feb11.164047pst."3456"@watson.parc.xerox.com>
Message-ID: <3C68E9C9.E1E49D0C@lemburg.com>

FYI, I've added the aliases over the weekend. No need to duplicate
the work (which wasn't as easy as expected) :-)

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/

From Misha.Wolf@reuters.com Wed Feb 20 16:05:52 2002
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Wed, 20 Feb 2002 16:05:52 +0000
Subject: [I18n-sig] Character Model for the Web + Unicode in XML and other Markup Languages
Message-ID:

This week sees the publication of:

    Character Model for the World Wide Web 1.0
    W3C Working Draft 20 February 2002
    http://www.w3.org/TR/2002/WD-charmod-20020220

and:

    Unicode in XML and other Markup Languages
    W3C Note 18 February 2002
    http://www.w3.org/TR/2002/NOTE-unicode-xml-20020218

The Character Model "provides authors of specifications, software
developers, and content developers with a common reference for
interoperable text manipulation on the World Wide Web. Topics
addressed include encoding identification, early uniform
normalization, string identity matching, string indexing, and URI
conventions, building on the Universal Character Set, defined jointly
by Unicode and ISO/IEC 10646. Some introductory material on
characters and character encodings is also provided."

This specification has been extensively revised over the past year,
reflecting the Last Call Comments on:

    Character Model for the World Wide Web 1.0
    W3C Working Draft 26 January 2001
    http://www.w3.org/TR/2001/WD-charmod-20010126

The second document, Unicode in XML and other Markup Languages, is
published jointly with the Unicode Consortium. It provides
guidelines for the use of Unicode with markup languages such as XML.

Both documents are especially topical in view of the work currently
taking place, within the W3C XML Core WG, on:

    XML 1.1
    W3C Working Draft 13 December 2001
    http://www.w3.org/TR/2001/WD-xml11-20011213

Misha Wolf
W3C I18N WG Chair

----------------------------------------------------------------
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of the individual
sender, except where the sender specifically states them to be
the views of Reuters Ltd.

From Misha.Wolf@reuters.com Fri Feb 22 21:24:03 2002
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 22 Feb 2002 21:24:03 +0000
Subject: [I18n-sig] Call for Papers - 22nd Unicode Conference - September 2002 - San Jose, CA
Message-ID:

>>>>>>>>>>>>>>>>>>>>>>>>>> Call for Papers! <<<<<<<<<<<<<<<<<<<<<<<<<

Twenty-second International Unicode Conference (IUC22)
Unicode and the Web: The Global Connection
http://www.unicode.org/iuc/iuc22
September 9-13, 2002
San Jose, California

>>>>>>>>>>>>>>>>>>>> Send in your submission now! <<<<<<<<<<<<<<<<<<<

Submissions due:       May 10, 2002
Notification date:     May 31, 2002
Completed papers due : June 21, 2002
(in electronic form and camera-ready paper form)

>>>>>>>>>>>>>>>>>>>>>>>> Just 11 weeks to go! <<<<<<<<<<<<<<<<<<<<<<<

The Unicode Standard has become the foundation for all modern text
processing. It is used on large machines, tiny portable devices, and
for distributed processing across the Internet.
The standard brings cost-reducing efficiency to international
applications and enables the exchange of text in an ever increasing
list of natural languages. New technologies and innovative Internet
applications, as well as the evolving Unicode Standard, bring new
challenges along with their new capabilities.

This technical conference will explore the opportunities created by
the latest advances and how to leverage them, as well as potential
pitfalls to be aware of, and problem areas that need further
research.

We invite you to submit papers which either define the software of
tomorrow, demonstrate best practice with today's software, or
articulate problems that must be solved before further advances can
occur. Papers should discuss subjects in the context of Unicode,
internationalization or localization.

You can view the programs of previous conferences at:

    http://www.unicode.org/unicode/conference/about-conf.html

Conference attendees are generally involved in either the
development, deployment or use of Unicode software or content, or the
globalization of software and the Internet. They include managers,
software engineers, systems analysts, font designers, graphic
designers, content developers, technical writers, and product
marketing personnel.

THEME & TOPICS

Computing with Unicode is the overall theme of the Conference.
Presentations should be geared towards a technical audience. Topics
of interest include, but are not limited to, the following (within
the context of Unicode, internationalization or localization):

- UTFs: Not enough or too many?
- Security concerns e.g. Avoiding the spoofing of UTF-8 data
- Impact of new encoding standards
- Implementing Unicode: Practical and political hurdles
- Portable devices
- Implementing new features of recent versions of Unicode
- Algorithms (e.g.
  normalization, collation, bidirectional)
- Programming languages and libraries (Java, Perl, et al)
- The World Wide Web (WWW)
- Search engines
- Library and archival concerns
- Operating systems
- Databases
- Large scale networks
- Government applications
- Evaluations (case studies, usability studies)
- Natural language processing
- Migrating legacy applications
- Cross platform issues
- Printing and imaging
- Optimizing performance of systems and applications
- Testing applications
- XML and Web protocols
- Business models for software development (e.g. Open source)

SESSIONS

The Conference Program will provide a wide range of sessions
including:

- Keynote presentations
- Workshops/Tutorials
- Technical presentations
- Panel sessions

All sessions except the Workshops/Tutorials will be of 40 minute
duration. In some cases, two consecutive 40 minute program slots may
be devoted to a single session.

The Workshops/Tutorials will each last approximately three hours.
They should be designed to stimulate discussion and participation,
using slides and demonstrations.

PUBLICITY

If your paper is accepted, your details will be included in the
Conference brochure and Web pages and the paper itself will appear on
a Conference CD, with an optional printed book of Conference
Proceedings.

CONFERENCE LANGUAGE

The Conference language is English. All submissions, papers and
presentations should be provided in English.

SUBMISSIONS

Submissions MUST contain:

1. An abstract of 150-250 words, consisting of statement of purpose,
   paper description, and your conclusions or final summary.

2. A brief biography.

3. The details listed below:

   SESSION TITLE:            _________________________________________
                             _________________________________________

   TITLE (eg Dr/Mr/Mrs/Ms):  _________________________________________

   NAME:                     _________________________________________

   JOB TITLE:                _________________________________________

   ORGANIZATION/AFFILIATION: _________________________________________

   ORGANIZATION'S WWW URL:   _________________________________________

   OWN WWW URL:              _________________________________________

   ADDRESS FOR PAPER MAIL:   _________________________________________
                             _________________________________________
                             _________________________________________

   TELEPHONE:                _________________________________________

   FAX:                      _________________________________________

   E-MAIL ADDRESS:           _________________________________________

   TYPE OF SESSION:          [ ] Keynote presentation
                             [ ] Workshop/Tutorial
                             [ ] Technical presentation
                             [ ] Panel

   PANELISTS (if Panel):     _________________________________________
                             _________________________________________
                             _________________________________________
                             _________________________________________
                             _________________________________________
                             _________________________________________
                             _________________________________________
                             _________________________________________

   TARGET AUDIENCE (you may select more than one category):

                             [ ] Content Developers
                             [ ] Font Designers
                             [ ] Graphic Designers
                             [ ] Managers
                             [ ] Marketers
                             [ ] Software Engineers
                             [ ] Systems Analysts
                             [ ] Technical Writers
                             [ ] Others (please specify):
                             _________________________________________
                             _________________________________________

   LEVEL OF SESSION (you may select more than one category):

                             [ ] Beginner
                             [ ] Intermediate
                             [ ] Advanced

Submissions should be sent by e-mail to either of the following
addresses:

    papers@unicode.org
    info@global-conference.com

They should use ASCII, non-compressed text and the following subject
line:

    Proposal for IUC 22

If desired, a copy of the submission may also be sent by post to:

    22nd International Unicode Conference
    c/o Global Meeting Services, Inc.
    8949 Lombard Place #416
    San Diego, CA 92122 USA

    Tel: +1 858 638 0206
    Fax: +1 858 638 0504

CONFERENCE PROCEEDINGS

All Conference papers will be published on CD. Printed proceedings
will be offered as an option.

EXHIBIT OPPORTUNITIES

The Conference will have an Exhibition area for corporations or
individuals who wish to display and promote their products,
technology and/or services. Every effort will be made to provide
maximum exposure and advertising.

Exhibit space is limited. For further information or to reserve a
place, please contact Global Meeting Services at the above location.

CONFERENCE VENUE

    DoubleTree Hotel San Jose
    2050 Gateway Place
    San Jose, CA 95110 USA

    Telephone number: +1-408-453-4000
    Facsimile number: +1-408-437-2898

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in
1991. It is dedicated to the development, maintenance and promotion
of The Unicode Standard, a worldwide character encoding. The Unicode
Standard encodes the characters of the world's principal scripts and
languages, and is code-for-code identical to the international
standard ISO/IEC 10646. In addition to cooperating with ISO on the
future development of ISO/IEC 10646, the Consortium is responsible
for providing character properties and algorithms for use in
implementations. Today the membership base of the Unicode Consortium
includes major computer corporations, software producers, database
vendors, research institutions, international agencies and various
user groups.

For further information on the Unicode Standard, visit the Unicode
Web site at http://www.unicode.org or e-mail

* * * * *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc. Used with permission.

----------------------------------------------------------------
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of the individual
sender, except where the sender specifically states them to be
the views of Reuters Ltd.

From barry@zope.com Thu Feb 28 16:17:01 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Thu, 28 Feb 2002 11:17:01 -0500
Subject: [I18n-sig] Re: Japanese codecs (was Re: [Python-Dev] PEP 263 -- Python Source Code Encoding)
References: <200202250520.g1P5KKD01484@mira.informatik.hu-berlin.de> <3C7B5E35.129E5501@lemburg.com> <3C7B6322.440D21E7@lemburg.com> <3c7bbf00.17218508@mail.wanadoo.dk> <200202261958.g1QJwsj19402@pcp742651pcs.reston01.va.comcast.net> <3C7BECEC.E1550553@lemburg.com> <200202262037.g1QKb5S19756@pcp742651pcs.reston01.va.comcast.net> <3C7CA3E2.C3705289@lemburg.com> <3C7CAD5D.6692F44@lemburg.com> <15485.15623.543255.443894@anthem.wooz.org> <15485.25422.524082.109890@anthem.wooz.org> <3C7DE6DC.893E594B@lemburg.com>
Message-ID: <15486.22525.324049.844325@anthem.wooz.org>

[This thread probably ought to be moved to i18n-sig, so I'm CC'ing
them and will remove all future cc's to python-dev. -BAW]

>>>>> "MAL" == M writes:

    MAL> You could (and probably should) add Tamito's codecs in
    MAL> Python, but the others have licensing problems :-/

I believe I am using Tamito KAJIYAMA's codecs, from:

    http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/

Or were you thinking about some different Japanese codecs? The ones
at this url are BSD-ish and so should be compatible with the PSF
license, GPL, etc.

    MAL> It shouldn't be hard though for native speakers and
    MAL> programmers to build upon the work of Tamito and get those
    MAL> codecs done as well. Alternatively, the PSF or some company
    MAL> interested in having these codecs available could fund the
    MAL> development.

All good points. I still think that by giving more visibility to the
codecs (i.e.
adding them to the Python distro) would help bring muscle to the
effort.

>>>>> "MvL" == Martin v Loewis writes:

    MvL> I would not recommend to incorporate any of this into Python
    MvL> without asking the author(s). When doing so, it would be
    MvL> appropriate, IMO, to ask them whether they would fill out the
    MvL> contributor agreement. Then, the presumed licensing problems
    MvL> would be gone.

Agreed on both points!
-Barry

From tree@basistech.com Thu Feb 28 16:27:41 2002
From: tree@basistech.com (Tom Emerson)
Date: Thu, 28 Feb 2002 11:27:41 -0500
Subject: [I18n-sig] Re: Japanese codecs (was Re: [Python-Dev] PEP 263 -- Python Source Code Encoding)
In-Reply-To: <15486.22525.324049.844325@anthem.wooz.org>
References: <200202250520.g1P5KKD01484@mira.informatik.hu-berlin.de> <3C7B5E35.129E5501@lemburg.com> <3C7B6322.440D21E7@lemburg.com> <3c7bbf00.17218508@mail.wanadoo.dk> <200202261958.g1QJwsj19402@pcp742651pcs.reston01.va.comcast.net> <3C7BECEC.E1550553@lemburg.com> <200202262037.g1QKb5S19756@pcp742651pcs.reston01.va.comcast.net> <3C7CA3E2.C3705289@lemburg.com> <3C7CAD5D.6692F44@lemburg.com> <15485.15623.543255.443894@anthem.wooz.org> <15485.25422.524082.109890@anthem.wooz.org> <3C7DE6DC.893E594B@lemburg.com> <15486.22525.324049.844325@anthem.wooz.org>
Message-ID: <15486.23165.349397.260521@magrathea.basistech.com>

I've been working on a unified architecture for the Asian codecs. I
presented a paper about it at the last Unicode Conference in
Washington D.C. You can find it at

    http://www.basistech.com/articles/python-zh-transcoding_iuc20_TE2.pdf

The presentation concentrates on Chinese, but the architecture will
work for JK as well.

    -tree

--
Tom Emerson                                          Basis Technology Corp.
Sr. Computational Linguist                         http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"