From Misha.Wolf@reuters.com  Fri Feb  8 19:45:19 2002
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 08 Feb 2002 19:45:19 +0000
Subject: [I18n-sig] 21st Unicode Conference, May 2002, Dublin, Ireland
Message-ID: <T58f3b2277ac407b707494@reuters.com>

>>>>>>>>>>>>>>>>>> First European IUC in two years! <<<<<<<<<<<<<<<<<<<

         Twenty-first International Unicode Conference (IUC21)
        Unicode, Localization and the Web: The Global Connection
                    http://www.unicode.org/iuc/iuc21
                            May 14-17, 2002
                            Dublin, Ireland

>>>>>>>>>>>>>>>>>>>>>>>> Just 13 weeks to go! <<<<<<<<<<<<<<<<<<<<<<<<<

The Unicode Standard has become the foundation for all modern text
processing.  It is used on large machines, tiny portable devices, and
for distributed processing across the Internet.  The standard brings
cost-reducing efficiency to international applications and enables the
exchange of text in an ever increasing list of natural languages.

New technologies and innovative Internet applications, as well as the
evolving Unicode Standard, bring new challenges along with their new
capabilities.  The Twenty-first International Unicode Conference (IUC21)
will explore the opportunities created by the latest advances and how to
leverage them, as well as potential pitfalls to be aware of, and problem
areas that need further research.

Conference attendees will include managers, software engineers, systems
analysts, font designers, graphic designers, content developers,
technical writers, and product marketing personnel, involved in the
development, deployment or use of Unicode software or content, and the
globalization of software and the Internet.

CONFERENCE WEB SITE, PROGRAM and REGISTRATION

   The Conference Program and Registration form will be available soon
   at the Conference Web site:
      http://www.unicode.org/iuc/iuc21

CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   Localisation Research Centre
   Microsoft Corporation
   Reuters Ltd.
   Sun Microsystems, Inc.
   World Wide Web Consortium (W3C)

GLOBAL COMPUTING SHOWCASE

   Visit the Showcase to find out more about products supporting the
   Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.
   For details, visit the Conference Web site.

CONFERENCE VENUE

The Conference will take place at:

   The Burlington Hotel
   Upper Leeson Street
   Dublin 4, Ireland

   Tel: (+353 1) 660 5222
   Fax: (+353 1) 660 8496

CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   8949 Lombard Place, #416
   San Diego, CA 92122, USA

   Tel: +1 858 638 0206 (voice)
        +1 858 638 0504 (fax)

   Email: info@global-conference.com
      or: conference@unicode.org

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991.
It is dedicated to the development, maintenance and promotion of The
Unicode Standard, a worldwide character encoding.  The Unicode Standard
encodes the characters of the world's principal scripts and languages,
and is code-for-code identical to the international standard ISO/IEC
10646.  In addition to cooperating with ISO on the future development of
ISO/IEC 10646, the Consortium is responsible for providing character
properties and algorithms for use in implementations.  Today the
membership base of the Unicode Consortium includes major computer
corporations, software producers, database vendors, research
institutions, international agencies and various user groups.

For further information on the Unicode Standard, visit the Unicode Web
site at http://www.unicode.org or e-mail <info@unicode.org>

                           *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc.  Used with permission.


------------------------------------------------------------- ---
        Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of  the  individual
sender,  except  where  the sender specifically states them to be
the views of Reuters Ltd.


From janssen@parc.xerox.com  Fri Feb  8 23:05:34 2002
From: janssen@parc.xerox.com (Bill Janssen)
Date: Fri, 8 Feb 2002 15:05:34 PST
Subject: [I18n-sig] IANA names for character set encodings?
Message-ID: <02Feb8.150534pst."3456"@watson.parc.xerox.com>

Folks,

I've been playing with the charset support in Python 2.x, and I want
to congratulate you on a great addition to the language.  It should
really be more widely advertised!  I think it makes Python the premier
language for string processing.

One thing that puzzles me, though, is the lack of support for the
standard IANA-registered names for the various charsets, as given in
http://www.iana.org/assignments/character-sets.  I notice that the file
encodings/aliases.py (in Python 2.2) does contain a few of these, but
other charsets like windows-1256 cannot be referred to by their
standard names -- windows-1256 is cp1256 in Python.  This is highly
counter-intuitive when parsing HTML, for instance, with
"text/plain; charset=windows-1256" as the media type.
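
A minimal sketch of an application-level workaround for the mismatch
described above.  The table IANA_TO_PYTHON and the function
codec_for_media_type are hypothetical names invented for this
illustration; they are not part of Python.  The sketch only assumes the
standard codecs and email modules.

```python
import codecs
from email.message import Message

# Hypothetical application-level table: IANA charset name -> Python codec name.
# Only the example from the message above is listed here.
IANA_TO_PYTHON = {
    "windows-1256": "cp1256",
}


def codec_for_media_type(media_type, default="ascii"):
    """Return the Python codec name for a media type string."""
    # Let the email package parse the charset parameter for us.
    msg = Message()
    msg["Content-Type"] = media_type
    charset = msg.get_content_charset() or default
    # Translate the IANA name if we know it, otherwise pass it through.
    python_name = IANA_TO_PYTHON.get(charset, charset)
    # codecs.lookup raises LookupError for names Python does not know.
    return codecs.lookup(python_name).name


codec = codec_for_media_type("text/plain; charset=windows-1256")
text = b"plain ASCII payload".decode(codec)
```

With aliases in the codec registry itself, a table of this kind would
not be needed in application code.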

The IANA charset table is fairly easy to parse automatically; see the
tail end of
http://cvs.plkr.org/index.cgi/parser/python/PyPlucker/helper/CharsetMapping.py?rev=HEAD&content-type=text/vnd.viewcvs-markup
for code which does so.

I'd suggest renaming the existing codecs according to their IANA
names, then adding the current names to the aliases list.

Bill


From mal@lemburg.com  Fri Feb  8 23:33:32 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 09 Feb 2002 00:33:32 +0100
Subject: [I18n-sig] IANA names for character set encodings?
References: <02Feb8.150534pst."3456"@watson.parc.xerox.com>
Message-ID: <3C64604C.1F87700B@lemburg.com>

Bill Janssen wrote:
> 
> Folks,
> 
> I've been playing with the charset support in Python 2.x, and I want
> to congratulate you on a great addition to the language.  It should
> really be more widely advertised!  I think it makes Python the premier
> language for string processing.
> 
> One thing that puzzles me, though, is the lack of support for the
> standard IANA-registered names for the various charsets, as given in
> http://www.iana.org/assignments/character-sets.  I notice that the file
> encodings/aliases.py (in Python 2.2) does contain a few of these, but
> other charsets like windows-1256 cannot be referred to by their
> standard names -- windows-1256 is cp1256 in Python.  This is highly
> counter-intuitive when parsing HTML, for instance, with
> "text/plain; charset=windows-1256" as the media type.
> 
> The IANA charset table is fairly easy to parse automatically; see the
> tail end of
> http://cvs.plkr.org/index.cgi/parser/python/PyPlucker/helper/CharsetMapping.py?rev=HEAD&content-type=text/vnd.viewcvs-markup
> for code which does so.
> 
> I'd suggest renaming the existing codecs according to their IANA
> names, then adding the current names to the aliases list.

That won't work, since the codecs can be imported by their current
names as normal modules. However, we could add more aliases
for them if needed.

Adding all of them seems overkill though... and cumbersome, e.g.
nobody uses names like ANSI_X3.4-1968 -- us-ascii is the 
common name.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/


From janssen@parc.xerox.com  Sat Feb  9 00:59:56 2002
From: janssen@parc.xerox.com (Bill Janssen)
Date: Fri, 8 Feb 2002 16:59:56 PST
Subject: [I18n-sig] IANA names for character set encodings?
In-Reply-To: Your message of "Fri, 08 Feb 2002 15:33:32 PST."
 <3C64604C.1F87700B@lemburg.com>
Message-ID: <02Feb8.170000pst."3456"@watson.parc.xerox.com>

> Adding all of them seems overkill though... and cumbersome, e.g.
> nobody uses names like ANSI_X3.4-1968 -- us-ascii is the 
> common name.

Since the aliases can live in an aliases file, I don't see that it's a
big deal to add them all, and it will really help in dealing with
Internet protocols correctly.  It is valid to put 'ANSI_X3.4-1968' in
as a charset value when sending something.  I'd like my Python app to
be able to cope with that possibility.

Bill


From tree@basistech.com  Sat Feb  9 01:10:16 2002
From: tree@basistech.com (Tom Emerson)
Date: Fri, 8 Feb 2002 20:10:16 -0500
Subject: [I18n-sig] IANA names for character set encodings?
In-Reply-To: <3C64604C.1F87700B@lemburg.com>
References: <02Feb8.150534pst."3456"@watson.parc.xerox.com>
 <3C64604C.1F87700B@lemburg.com>
Message-ID: <15460.30456.170987.457703@magrathea.basistech.com>

M.-A. Lemburg writes:
> Adding all of them seems overkill though... and cumbersome, e.g.
> nobody uses names like ANSI_X3.4-1968 -- us-ascii is the 
> common name.

Sure, but I've seen machine generated markup/documents that make use
of the ANSI_X3.4-1968 name, particularly those coming out of
Government agencies.

If we are going to support the IANA names, then there is no reason not
to support all of them. Picking and choosing those that we think
aren't used is asking for a bug report.

This is a no brainer.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Computational Linguist                         http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From mal@lemburg.com  Sat Feb  9 11:19:49 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 09 Feb 2002 12:19:49 +0100
Subject: [I18n-sig] IANA names for character set encodings?
References: <02Feb8.150534pst."3456"@watson.parc.xerox.com>
 <3C64604C.1F87700B@lemburg.com> <15460.30456.170987.457703@magrathea.basistech.com>
Message-ID: <3C6505D5.122D4D98@lemburg.com>

Tom Emerson wrote:
> 
> M.-A. Lemburg writes:
> > Adding all of them seems overkill though... and cumbersome, e.g.
> > nobody uses names like ANSI_X3.4-1968 -- us-ascii is the
> > common name.
> 
> Sure, but I've seen machine generated markup/documents that make use
> of the ANSI_X3.4-1968 name, particularly those coming out of
> Government agencies.
> 
> If we are going to support the IANA names, then there is no reason not
> to support all of them. Picking and choosing those that we think
> aren't used is asking for a bug report.
> 
> This is a no brainer.

How large would such an alias dictionary be ? 

Looking at the IANA listing it seems rather lengthy. What I'm
worried about is that Python startup time will get worse for
programs using codecs (I sometimes wish Python had a builtin
on-disk registry where we could put static data like this).

Anyway, if you all think this is a non-issue, fine with me.

Bill, can you parse the IANA listing into a dictionary
definition?
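
A rough sketch of such a parse.  It assumes the listing is made of
"Name:" and "Alias:" lines, as in the abridged, illustrative excerpt
embedded below; the excerpt, the function name parse_iana_listing and
the choice of alias as key and registered name as value are all
assumptions made for this example.

```python
# Abridged, illustrative excerpt in the assumed "Name:/Alias:" layout.
SAMPLE_LISTING = """\
Name: ANSI_X3.4-1968                                   [RFC1345,KXS2]
MIBenum: 3
Source: ECMA registry
Alias: iso-ir-6
Alias: ISO_646.irv:1991
Alias: US-ASCII (preferred MIME name)

Name: windows-1256
MIBenum: 2256
Alias: None
"""


def parse_iana_listing(text):
    """Map each lowercased name or alias to its registered primary name."""
    table = {}
    current = None
    for line in text.splitlines():
        field, sep, rest = line.partition(":")
        tokens = rest.split()
        if not sep or not tokens:
            continue  # blank line or a line without a "Field: value" shape
        value = tokens[0]  # drop trailing notes such as "[RFC1345,KXS2]"
        field = field.strip().lower()
        if field == "name":
            current = value
            table[value.lower()] = current
        elif field == "alias" and current and value.lower() != "none":
            table[value.lower()] = current
    return table


ALIASES = parse_iana_listing(SAMPLE_LISTING)
```

Keys are lowercased so that lookups can be case-insensitive.  The
values would still have to be mapped to Python codec module names by
hand or by a second table.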

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/


From fredrik@pythonware.com  Sat Feb  9 11:43:46 2002
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Sat, 9 Feb 2002 12:43:46 +0100
Subject: [I18n-sig] IANA names for character set encodings?
References: <02Feb8.150534pst."3456"@watson.parc.xerox.com>	<3C64604C.1F87700B@lemburg.com> <15460.30456.170987.457703@magrathea.basistech.com> <3C6505D5.122D4D98@lemburg.com>
Message-ID: <010901c1b15f$0bc03580$ced241d5@hagrid>

mal wrote:
> How large would such an alias dictionary be ? 
> 
> Looking at the IANA listing it seems rather lengthy. What I'm
> worried about is that Python startup time will get worse for
> programs using codecs (I sometimes wish Python had a builtin
> on-disk registry where we could put static data like this).

why split it up in two parts; put common aliases in one table
(latin*, utf*, us-ascii, iso-8859, iso-2022, and perhaps some
more), put that table inside __init__, and change the search
function to:

    1) look for a common alias in the small table
    2) try importing the module
    3) if import fails, import "aliases", look it up in the
       big table, and try again

in this way, people who use the "true" names and commonly
used aliases won't have to load the big alias table at all.
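
The three steps above can be sketched as follows.  This is an
illustration only, not the actual encodings package code: the table
contents and the names COMMON_ALIASES, load_big_table, big_table_loaded
and resolve are made up for the example, and the "big table" is a small
stand-in for a separately imported alias module.

```python
import importlib
import re

# Step 1 table: a few common aliases, kept small (contents are illustrative).
COMMON_ALIASES = {
    "latin-1": "latin_1",
    "us-ascii": "ascii",
    "utf8": "utf_8",
}

_big_table = None  # filled in lazily, only after steps 1 and 2 both miss


def load_big_table():
    """Stand-in for importing a large alias module on demand."""
    return {"ansi_x3.4-1968": "ascii", "windows-1256": "cp1256"}


def big_table_loaded():
    return _big_table is not None


def resolve(name):
    """Return the codec module name for an encoding name or alias."""
    global _big_table
    key = name.strip().lower()
    # 1) look for a common alias in the small table
    if key in COMMON_ALIASES:
        return COMMON_ALIASES[key]
    # 2) try importing a codec module of that name
    module_name = re.sub(r"[^a-z0-9]", "_", key)
    try:
        importlib.import_module("encodings." + module_name)
        return module_name
    except ImportError:
        pass
    # 3) only now load the big table, look the name up, and give up if absent
    if _big_table is None:
        _big_table = load_big_table()
    if key in _big_table:
        return _big_table[key]
    raise LookupError("unknown encoding: " + name)
```

Callers that pass a codec's own module name or a common alias return
at step 1 or 2, so the cost of the big table is paid only by callers
that use a rare alias.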

</F>


From fredrik@pythonware.com  Sat Feb  9 11:44:44 2002
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Sat, 9 Feb 2002 12:44:44 +0100
Subject: [I18n-sig] IANA names for character set encodings?
Message-ID: <010f01c1b15f$2ca0f500$ced241d5@hagrid>

> why split it up in two parts

I guess I meant:

> why not split it up in two parts?

</F>


From mal@lemburg.com  Sat Feb  9 11:53:01 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 09 Feb 2002 12:53:01 +0100
Subject: [I18n-sig] IANA names for character set encodings?
References: <02Feb8.150534pst."3456"@watson.parc.xerox.com>	<3C64604C.1F87700B@lemburg.com> <15460.30456.170987.457703@magrathea.basistech.com> <3C6505D5.122D4D98@lemburg.com> <010901c1b15f$0bc03580$ced241d5@hagrid>
Message-ID: <3C650D9D.276CE8D6@lemburg.com>

Fredrik Lundh wrote:
> 
> mal wrote:
> > How large would such an alias dictionary be ?
> >
> > Looking at the IANA listing it seems rather lengthy. What I'm
> > worried about is that Python startup time will get worse for
> > programs using codecs (I sometimes wish Python had a builtin
> > on-disk registry where we could put static data like this).
> 
> why split it up in two parts; put common aliases in one table
> (latin*, utf*, us-ascii, iso-8859, iso-2022, and perhaps some
> more), put that table inside __init__, and change the search
> function to:
> 
>     1) look for a common alias in the small table
>     2) try importing the module
>     3) if import fails, import "aliases", look it up in the
>        big table, and try again
> 
> in this way, people who use the "true" names and commonly
> used aliases won't have to load the big alias table at all.

Good idea. Let's do it that way.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/


From janssen@parc.xerox.com  Tue Feb 12 00:38:02 2002
From: janssen@parc.xerox.com (Bill Janssen)
Date: Mon, 11 Feb 2002 16:38:02 PST
Subject: [I18n-sig] IANA names for character set encodings?
In-Reply-To: Your message of "Sat, 09 Feb 2002 03:19:49 PST."
 <3C6505D5.122D4D98@lemburg.com>
Message-ID: <02Feb11.163812pst."3456"@watson.parc.xerox.com>

> Bill, can you parse the IANA listing into a dictionary
> definition?

Sure.  What do you want as key and what as value?

Bill


From janssen@parc.xerox.com  Tue Feb 12 00:40:41 2002
From: janssen@parc.xerox.com (Bill Janssen)
Date: Mon, 11 Feb 2002 16:40:41 PST
Subject: [I18n-sig] IANA names for character set encodings?
In-Reply-To: Your message of "Sat, 09 Feb 2002 03:19:49 PST."
 <3C6505D5.122D4D98@lemburg.com>
Message-ID: <02Feb11.164047pst."3456"@watson.parc.xerox.com>

> Looking at the IANA listing it seems rather lengthy. What I'm
> worried about is that Python startup time will get worse for
> programs using codecs (I sometimes wish Python had a builtin
> on-disk registry where we could put static data like this).

The URL I cited has such an alias listing -- not too bad.  Or it could
be in an alternate module iana-charset-names, or some such, to be
loaded on demand.

http://cvs.plkr.org/index.cgi/parser/python/PyPlucker/helper/CharsetMapping.py?rev=HEAD&content-type=text/plain

Bill


From mal@lemburg.com  Tue Feb 12 10:09:13 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 12 Feb 2002 11:09:13 +0100
Subject: [I18n-sig] IANA names for character set encodings?
References: <02Feb11.164047pst."3456"@watson.parc.xerox.com>
Message-ID: <3C68E9C9.E1E49D0C@lemburg.com>

FYI, I've added the aliases over the weekend. No need to duplicate
the work (which wasn't as easy as expected) :-)

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/


From Misha.Wolf@reuters.com  Wed Feb 20 16:05:52 2002
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Wed, 20 Feb 2002 16:05:52 +0000
Subject: [I18n-sig] Character Model for the Web + Unicode in XML and other Markup Languages
Message-ID: <T5930b4bb14c407b707780@reuters.com>

This week sees the publication of:

  Character Model for the World Wide Web 1.0
  W3C Working Draft 20 February 2002
  http://www.w3.org/TR/2002/WD-charmod-20020220

and:

  Unicode in XML and other Markup Languages
  W3C Note 18 February 2002
  http://www.w3.org/TR/2002/NOTE-unicode-xml-20020218

The Character Model "provides authors of specifications, software
developers, and content developers with a common reference for
interoperable text manipulation on the World Wide Web.  Topics addressed
include encoding identification, early uniform normalization, string
identity matching, string indexing, and URI conventions, building on the
Universal Character Set, defined jointly by Unicode and ISO/IEC 10646.
Some introductory material on characters and character encodings is also
provided."

This specification has been extensively revised over the past year,
reflecting the Last Call Comments on:

  Character Model for the World Wide Web 1.0
  W3C Working Draft 26 January 2001
  http://www.w3.org/TR/2001/WD-charmod-20010126

The second document, Unicode in XML and other Markup Languages, is
published jointly with the Unicode Consortium.  It provides guidelines
for the use of Unicode with markup languages such as XML.

Both documents are especially topical in view of the work currently
taking place, within the W3C XML Core WG, on:

  XML 1.1
  W3C Working Draft 13 December 2001
  http://www.w3.org/TR/2001/WD-xml11-20011213

Misha Wolf
W3C I18N WG Chair




From Misha.Wolf@reuters.com  Fri Feb 22 21:24:03 2002
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 22 Feb 2002 21:24:03 +0000
Subject: [I18n-sig] Call for Papers - 22nd Unicode Conference - September 2002 - San Jose,
 CA
Message-ID: <T593c252d0fc407b70647c@reuters.com>

>>>>>>>>>>>>>>>>>>>>>>>>>>  Call for Papers!  <<<<<<<<<<<<<<<<<<<<<<<<<

         Twenty-second International Unicode Conference (IUC22)
               Unicode and the Web: The Global Connection
                    http://www.unicode.org/iuc/iuc22
                          September 9-13, 2002
                          San Jose, California

>>>>>>>>>>>>>>>>>>>>  Send in your submission now!  <<<<<<<<<<<<<<<<<<<

                     Submissions due: May 10, 2002
                    Notification date: May 31, 2002
                   Completed papers due: June 21, 2002
            (in electronic form and camera-ready paper form)

>>>>>>>>>>>>>>>>>>>>>>>>  Just 11 weeks to go!  <<<<<<<<<<<<<<<<<<<<<<<

The Unicode Standard has become the foundation for all modern text
processing.  It is used on large machines, tiny portable devices, and
for distributed processing across the Internet.  The standard brings
cost-reducing efficiency to international applications and enables the
exchange of text in an ever increasing list of natural languages.

New technologies and innovative Internet applications, as well as the
evolving Unicode Standard, bring new challenges along with their new
capabilities.  This technical conference will explore the opportunities
created by the latest advances and how to leverage them, as well as
potential pitfalls to be aware of, and problem areas that need further
research.

We invite you to submit papers which either define the software of
tomorrow, demonstrate best practice with today's software, or articulate
problems that must be solved before further advances can occur.  Papers
should discuss subjects in the context of Unicode, internationalization
or localization. You can view the programs of previous conferences at:
http://www.unicode.org/unicode/conference/about-conf.html

Conference attendees are generally involved in either the development,
deployment or use of Unicode software or content, or the globalization
of software and the Internet.  They include managers, software
engineers, systems analysts, font designers, graphic designers, content
developers, technical writers, and product marketing personnel.

THEME & TOPICS

Computing with Unicode is the overall theme of the Conference.
Presentations should be geared towards a technical audience.  Topics of
interest include, but are not limited to, the following (within the
context of Unicode, internationalization or localization):

- UTFs: Not enough or too many?
- Security concerns e.g. Avoiding the spoofing of UTF-8 data
- Impact of new encoding standards
- Implementing Unicode: Practical and political hurdles
- Portable devices
- Implementing new features of recent versions of Unicode
- Algorithms (e.g. normalization, collation, bidirectional)
- Programming languages and libraries (Java, Perl, et al)
- The World Wide Web (WWW)
- Search engines
- Library and archival concerns
- Operating systems
- Databases
- Large scale networks
- Government applications
- Evaluations (case studies, usability studies)
- Natural language processing
- Migrating legacy applications
- Cross platform issues
- Printing and imaging
- Optimizing performance of systems and applications
- Testing applications
- XML and Web protocols
- Business models for software development (e.g. Open source)

SESSIONS

The Conference Program will provide a wide range of sessions including:
- Keynote presentations
- Workshops/Tutorials
- Technical presentations
- Panel sessions

All sessions except the Workshops/Tutorials will be of 40 minute
duration.  In some cases, two consecutive 40 minute program slots may be
devoted to a single session.

The Workshops/Tutorials will each last approximately three hours.  They
should be designed to stimulate discussion and participation, using
slides and demonstrations.

PUBLICITY

If your paper is accepted, your details will be included in the
Conference brochure and Web pages and the paper itself will appear on a
Conference CD, with an optional printed book of Conference Proceedings.

CONFERENCE LANGUAGE

The Conference language is English.  All submissions, papers and
presentations should be provided in English.

SUBMISSIONS

Submissions MUST contain:

1. An abstract of 150-250 words, consisting of a statement of purpose,
   a description of the paper, and your conclusions or final summary.

2. A brief biography.

3. The details listed below:

   SESSION TITLE:             _________________________________________

                              _________________________________________

   TITLE (eg Dr/Mr/Mrs/Ms):   _________________________________________

   NAME:                      _________________________________________

   JOB TITLE:                 _________________________________________

   ORGANIZATION/AFFILIATION:  _________________________________________

   ORGANIZATION'S WWW URL:    _________________________________________

   OWN WWW URL:               _________________________________________

   ADDRESS FOR PAPER MAIL:    _________________________________________

                              _________________________________________

                              _________________________________________

   TELEPHONE:                 _________________________________________

   FAX:                       _________________________________________

   E-MAIL ADDRESS:            _________________________________________

   TYPE OF SESSION:           [ ] Keynote presentation

                              [ ] Workshop/Tutorial

                              [ ] Technical presentation

                              [ ] Panel

   PANELISTS (if Panel):      _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

   TARGET AUDIENCE (you may select more than one category):

                              [ ] Content Developers

                              [ ] Font Designers

                              [ ] Graphic Designers

                              [ ] Managers

                              [ ] Marketers

                              [ ] Software Engineers

                              [ ] Systems Analysts

                              [ ] Technical Writers

                              [ ] Others (please specify):

                              _________________________________________

                              _________________________________________

   LEVEL OF SESSION (you may select more than one category):

                              [ ] Beginner

                              [ ] Intermediate

                              [ ] Advanced

Submissions should be sent by e-mail to either of the following
addresses:

   papers@unicode.org

   info@global-conference.com

They should use ASCII, non-compressed text and the following subject
line:

   Proposal for IUC 22

If desired, a copy of the submission may also be sent by post to:

   22nd International Unicode Conference
   c/o Global Meeting Services, Inc.
   8949 Lombard Place #416
   San Diego, CA  92122  USA
   Tel: +1 858 638 0206
   Fax: +1 858 638 0504

CONFERENCE PROCEEDINGS

All Conference papers will be published on CD.  Printed proceedings will
be offered as an option.

EXHIBIT OPPORTUNITIES

The Conference will have an Exhibition area for corporations or
individuals who wish to display and promote their products, technology
and/or services.

Every effort will be made to provide maximum exposure and advertising.

Exhibit space is limited.  For further information or to reserve a
place, please contact Global Meeting Services at the above location.

CONFERENCE VENUE

   DoubleTree Hotel San Jose
   2050 Gateway Place
   San Jose, CA 95110
   USA

   Telephone number:  +1-408-453-4000
   Facsimile number:  +1-408-437-2898

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991.
It is dedicated to the development, maintenance and promotion of The
Unicode Standard, a worldwide character encoding.  The Unicode Standard
encodes the characters of the world's principal scripts and languages,
and is code-for-code identical to the international standard ISO/IEC
10646.  In addition to cooperating with ISO on the future development of
ISO/IEC 10646, the Consortium is responsible for providing character
properties and algorithms for use in implementations.  Today the
membership base of the Unicode Consortium includes major computer
corporations, software producers, database vendors, research
institutions, international agencies and various user groups.

For further information on the Unicode Standard, visit the Unicode Web
site at http://www.unicode.org or e-mail <info@unicode.org>

                           *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc.  Used with permission.




From barry@zope.com  Thu Feb 28 16:17:01 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Thu, 28 Feb 2002 11:17:01 -0500
Subject: [I18n-sig] Re: Japanese codecs (was Re: [Python-Dev] PEP 263 -- Python Source Code
 Encoding)
References: <200202250520.g1P5KKD01484@mira.informatik.hu-berlin.de>
 <3C7B5E35.129E5501@lemburg.com>
 <m31yf8fsxu.fsf@mira.informatik.hu-berlin.de>
 <3C7B6322.440D21E7@lemburg.com>
 <3c7bbf00.17218508@mail.wanadoo.dk>
 <200202261958.g1QJwsj19402@pcp742651pcs.reston01.va.comcast.net>
 <3C7BECEC.E1550553@lemburg.com>
 <200202262037.g1QKb5S19756@pcp742651pcs.reston01.va.comcast.net>
 <m38z9gudqd.fsf@mira.informatik.hu-berlin.de>
 <3C7CA3E2.C3705289@lemburg.com>
 <m3sn7nz360.fsf@mira.informatik.hu-berlin.de>
 <3C7CAD5D.6692F44@lemburg.com>
 <m3it8iltsx.fsf@mira.informatik.hu-berlin.de>
 <15485.15623.543255.443894@anthem.wooz.org>
 <m34rk2eg84.fsf@mira.informatik.hu-berlin.de>
 <15485.25422.524082.109890@anthem.wooz.org>
 <3C7DE6DC.893E594B@lemburg.com>
Message-ID: <15486.22525.324049.844325@anthem.wooz.org>

[This thread probably ought to be moved to i18n-sig, so I'm CC'ing
them and will remove all future cc's to python-dev.  -BAW]

>>>>> "MAL" == M  <mal@lemburg.com> writes:

    MAL> You could (and probably should) add Tamito's codecs in
    MAL> Python, but the others have licensing problems :-/

I believe I am using Tamito KAJIYAMA's codecs, from:

    http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/

Or were you thinking about some different Japanese codecs?  The ones
at this url are BSD-ish and so should be compatible with the PSF
license, GPL, etc.

    MAL> It shouldn't be hard though for native speakers and
    MAL> programmers to build upon the work of Tamito and get those
    MAL> codecs done as well. Alternatively, the PSF or some company
    MAL> interested in having these codecs available could fund the
    MAL> development.

All good points.  I still think that giving more visibility to the
codecs (i.e. adding them to the Python distro) would help bring muscle
to the effort.

>>>>> "MvL" == Martin v Loewis <martin@v.loewis.de> writes:

    MvL> I would not recommend incorporating any of this into Python
    MvL> without asking the author(s). When doing so, it would be
    MvL> appropriate, IMO, to ask them whether they would fill out the
    MvL> contributor agreement. Then, the presumed licensing problems
    MvL> would be gone.

Agreed on both points!

-Barry


From tree@basistech.com  Thu Feb 28 16:27:41 2002
From: tree@basistech.com (Tom Emerson)
Date: Thu, 28 Feb 2002 11:27:41 -0500
Subject: [I18n-sig] Re: Japanese codecs (was Re: [Python-Dev] PEP 263 -- Python Source Code
 Encoding)
In-Reply-To: <15486.22525.324049.844325@anthem.wooz.org>
References: <200202250520.g1P5KKD01484@mira.informatik.hu-berlin.de>
 <3C7B5E35.129E5501@lemburg.com>
 <m31yf8fsxu.fsf@mira.informatik.hu-berlin.de>
 <3C7B6322.440D21E7@lemburg.com>
 <3c7bbf00.17218508@mail.wanadoo.dk>
 <200202261958.g1QJwsj19402@pcp742651pcs.reston01.va.comcast.net>
 <3C7BECEC.E1550553@lemburg.com>
 <200202262037.g1QKb5S19756@pcp742651pcs.reston01.va.comcast.net>
 <m38z9gudqd.fsf@mira.informatik.hu-berlin.de>
 <3C7CA3E2.C3705289@lemburg.com>
 <m3sn7nz360.fsf@mira.informatik.hu-berlin.de>
 <3C7CAD5D.6692F44@lemburg.com>
 <m3it8iltsx.fsf@mira.informatik.hu-berlin.de>
 <15485.15623.543255.443894@anthem.wooz.org>
 <m34rk2eg84.fsf@mira.informatik.hu-berlin.de>
 <15485.25422.524082.109890@anthem.wooz.org>
 <3C7DE6DC.893E594B@lemburg.com>
 <15486.22525.324049.844325@anthem.wooz.org>
Message-ID: <15486.23165.349397.260521@magrathea.basistech.com>

I've been working on a unified architecture for the Asian codecs. I
presented a paper about it at the last Unicode Conference in
Washington D.C. You can find it at

http://www.basistech.com/articles/python-zh-transcoding_iuc20_TE2.pdf

The presentation concentrates on Chinese, but the architecture will
work for Japanese and Korean as well.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Computational Linguist                         http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"

