From paulp@ActiveState.com  Sun Jul  1 20:57:09 2001
From: paulp@ActiveState.com (Paul Prescod)
Date: Sun, 01 Jul 2001 12:57:09 -0700
Subject: [I18n-sig] PEP 261, Rev 1.3 - Support for "wide" Unicode characters
Message-ID: <3B3F8095.8D58631D@ActiveState.com>

PEP: 261
Title: Support for "wide" Unicode characters
Version: $Revision: 1.3 $
Author: paulp@activestate.com (Paul Prescod)
Status: Draft
Type: Standards Track
Created: 27-Jun-2001
Python-Version: 2.2
Post-History: 27-Jun-2001


Abstract

    Python 2.1 unicode characters can have ordinals only up to
    2**16 - 1.  This range corresponds to a range in Unicode known as
    the Basic Multilingual Plane.  There are now characters in Unicode
    that live on other "planes".  The largest addressable character in
    Unicode has the ordinal 17 * 2**16 - 1 (0x10ffff).  For
    readability, we will call this TOPCHAR and call characters in this
    range "wide characters".


Glossary

    Character
        Used by itself, means the addressable units of a Python
        Unicode string.

    Code point
        A code point is an integer between 0 and TOPCHAR.  If you
        imagine Unicode as a mapping from integers to characters, each
        integer is a code point.  But the integers between 0 and
        TOPCHAR that do not map to characters are also code points.
        Some will someday be used for characters.  Some are guaranteed
        never to be used for characters.

    Codec
        A set of functions for translating between physical encodings
        (e.g. on disk or coming in from a network) and logical Python
        objects.

    Encoding
        Mechanism for representing abstract characters in terms of
        physical bits and bytes.  Encodings allow us to store Unicode
        characters on disk and transmit them over networks in a manner
        that is compatible with other Unicode software.

    Surrogate pair
        Two physical characters that represent a single logical
        character.  Part of a convention for representing 32-bit code
        points in terms of two 16-bit code points.

    Unicode string
        A Python type representing a sequence of code points with
        "string semantics" (e.g.
case conversions, regular expression compatibility, etc.)
        Constructed with the unicode() function.


Proposed Solution

    One solution would be to merely increase the maximum ordinal to a
    larger value.  Unfortunately the only straightforward
    implementation of this idea is to use 4 bytes per character.  This
    has the effect of doubling the size of most Unicode strings.  In
    order to avoid imposing this cost on every user, Python 2.2 will
    allow the 4-byte implementation as a build-time option.  Users can
    choose whether they care about wide characters or prefer to
    preserve memory.

    The 4-byte option is called "wide Py_UNICODE".  The 2-byte option
    is called "narrow Py_UNICODE".

    Most things will behave identically in the wide and narrow worlds.

    * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a
      length-one string.

    * unichr(i) for 2**16 <= i <= TOPCHAR will return a length-one
      string on wide Python builds.  On narrow builds it will raise
      ValueError.

        ISSUE

            Python currently allows \U literals that cannot be
            represented as a single Python character.  It generates
            two Python characters known as a "surrogate pair".  Should
            this be disallowed on future narrow Python builds?

        Pro:

            Python already allows the construction of a surrogate pair
            for a large unicode literal character escape sequence.
            This is basically designed as a simple way to construct
            "wide characters" even in a narrow Python build.  It is
            also somewhat logical considering that the Unicode-literal
            syntax is basically a short-form way of invoking the
            unicode-escape codec.

        Con:

            Surrogates could be easily created this way but the user
            still needs to be careful about slicing, indexing,
            printing etc.  Therefore some have suggested that Unicode
            literals should not support surrogates.

        ISSUE

            Should Python allow the construction of characters that do
            not correspond to Unicode code points?  Unassigned Unicode
            code points should obviously be legal (because they could
            be assigned at any time).
But code points above TOPCHAR are guaranteed never to be used by
            Unicode.  Should we allow access to them anyhow?

        Pro:

            If a Python user thinks they know what they're doing why
            should we try to prevent them from violating the Unicode
            spec?  After all, we don't stop 8-bit strings from
            containing non-ASCII characters.

        Con:

            Codecs and other Unicode-consuming code will have to be
            careful of these characters which are disallowed by the
            Unicode specification.

    * ord() is always the inverse of unichr()

    * There is an integer value in the sys module that describes the
      largest ordinal for a character in a Unicode string on the
      current interpreter.  sys.maxunicode is 2**16-1 (0xffff) on
      narrow builds of Python and TOPCHAR on wide builds.

        ISSUE: Should there be distinct constants for accessing
               TOPCHAR and the real upper bound for the domain of
               unichr (if they differ)?  There has also been a
               suggestion of sys.unicodewidth which can take the
               values 'wide' and 'narrow'.

    * every Python Unicode character represents exactly one Unicode
      code point (i.e. Python Unicode Character = Abstract Unicode
      character).

    * codecs will be upgraded to support "wide characters"
      (represented directly in UCS-4, and as variable-length sequences
      in UTF-8 and UTF-16).  This is the main part of the
      implementation left to be done.

    * There is a convention in the Unicode world for encoding a 32-bit
      code point in terms of two 16-bit code points.  These are known
      as "surrogate pairs".  Python's codecs will adopt this
      convention and encode 32-bit code points as surrogate pairs on
      narrow Python builds.

        ISSUE

            Should there be a way to tell codecs not to generate
            surrogates and instead treat wide characters as errors?

        Pro:

            I might want to write code that works only with
            fixed-width characters and does not have to worry about
            surrogates.

        Con:

            No clear proposal of how to communicate this to codecs.

    * there are no restrictions on constructing strings that use code
      points "reserved for surrogates" improperly.
These are called "isolated surrogates".  The codecs should
      disallow reading these from files, but you could construct them
      using string literals or unichr().


Implementation

    There is a new (experimental) define:

        #define PY_UNICODE_SIZE 2

    There is a new configure option:

        --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses
                              wchar_t if it fits
        --enable-unicode=ucs4 configures a wide Py_UNICODE, and uses
                              wchar_t if it fits
        --enable-unicode      same as "=ucs2"

    The intention is that --disable-unicode, or --enable-unicode=no
    removes the Unicode type altogether; this is not yet implemented.

    It is also proposed that one day --enable-unicode will just
    default to the width of your platform's wchar_t.

    Windows builds will be narrow for a while, based on the fact that
    there have been few requests for wide characters, those requests
    are mostly from hard-core programmers with the ability to buy
    their own Python, and Windows itself is strongly biased towards
    16-bit characters.


Notes

    This PEP does NOT imply that people using Unicode need to use a
    4-byte encoding for their files on disk or sent over the network.
    It only allows them to do so.  For example, ASCII is still a
    legitimate (7-bit) Unicode-encoding.

    It has been proposed that there should be a module that handles
    surrogates in narrow Python builds for programmers.  If someone
    wants to implement that, it will be another PEP.  It might also be
    combined with features that allow other kinds of character-,
    word- and line-based indexing.


Rejected Suggestions

    More or less the status-quo

        We could officially say that Python characters are 16-bit and
        require programmers to implement wide characters in their
        application logic by combining surrogate pairs.  This is a
        heavy burden because emulating 32-bit characters is likely to
        be very inefficient if it is coded entirely in Python.  Plus
        these abstracted pseudo-strings would not be legal as input to
        the regular expression engine.
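    For reference, the surrogate-pair arithmetic that such
    application logic would have to combine can be sketched as follows
    (a hedged illustration in later-Python syntax; the helper names
    split_surrogates and combine_surrogates are made up for this
    sketch, not part of any proposed API):

```python
def split_surrogates(cp):
    # Encode a code point above the BMP as a UTF-16 surrogate pair.
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def combine_surrogates(high, low):
    # Recover the original code point from a surrogate pair.
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(u) for u in split_surrogates(0x10FFFF)])  # ['0xdbff', '0xdfff']
```

    Doing this for every indexing or slicing operation is exactly the
    per-character overhead the paragraph above objects to.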
    "Space-efficient Unicode" type

        Another class of solution is to use some efficient storage
        internally but present an abstraction of wide characters to
        the programmer.  Any of these would require a much more
        complex implementation than the accepted solution.  For
        instance consider the impact on the regular expression engine.
        In theory, we could move to this implementation in the future
        without breaking Python code.  A future Python could "emulate"
        wide Python semantics on narrow Python.  Guido is not willing
        to undertake the implementation right now.

    Two types

        We could introduce a 32-bit Unicode type alongside the 16-bit
        type.  There is a lot of code that expects there to be only a
        single Unicode type.

    This PEP represents the least-effort solution.  Over the next
    several years, 32-bit Unicode characters will become more common
    and that may either convince us that we need a more sophisticated
    solution or (on the other hand) convince us that simply mandating
    wide Unicode characters is an appropriate solution.  Right now the
    two options on the table are do nothing or do this.


References

    Unicode Glossary: http://www.unicode.org/glossary/


Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
End:

-- 
Take a recipe.  Leave a recipe.
Python Cookbook!  http://www.ActiveState.com/pythoncookbook

From mal@lemburg.com  Mon Jul  2 11:13:59 2001
From: mal@lemburg.com (M.-A.
Lemburg) Date: Mon, 02 Jul 2001 12:13:59 +0200 Subject: [I18n-sig] Re: [Python-Dev] PEP 261, Rev 1.3 - Support for "wide" Unicode characters References: <3B3F8095.8D58631D@ActiveState.com> Message-ID: <3B404967.14FE180F@lemburg.com> Paul Prescod wrote: > > PEP: 261 > Title: Support for "wide" Unicode characters > Version: $Revision: 1.3 $ > Author: paulp@activestate.com (Paul Prescod) > Status: Draft > Type: Standards Track > Created: 27-Jun-2001 > Python-Version: 2.2 > Post-History: 27-Jun-2001 > > Abstract > > Python 2.1 unicode characters can have ordinals only up to 2**16 > -1. > This range corresponds to a range in Unicode known as the Basic > Multilingual Plane. There are now characters in Unicode that live > on other "planes". The largest addressable character in Unicode > has the ordinal 17 * 2**16 - 1 (0x10ffff). For readability, we > will call this TOPCHAR and call characters in this range "wide > characters". > > Glossary > > Character > > Used by itself, means the addressable units of a Python > Unicode string. Please add: also known as "code unit". > Code point > > A code point is an integer between 0 and TOPCHAR. > If you imagine Unicode as a mapping from integers to > characters, each integer is a code point. But the > integers between 0 and TOPCHAR that do not map to > characters are also code points. Some will someday > be used for characters. Some are guaranteed never > to be used for characters. > > Codec > > A set of functions for translating between physical > encodings (e.g. on disk or coming in from a network) > into logical Python objects. > > Encoding > > Mechanism for representing abstract characters in terms of > physical bits and bytes. Encodings allow us to store > Unicode characters on disk and transmit them over networks > in a manner that is compatible with other Unicode software. > > Surrogate pair > > Two physical characters that represent a single logical Eeek... 
two code units (or have you ever seen a physical character walking around ;-) > character. Part of a convention for representing 32-bit > code points in terms of two 16-bit code points. > > Unicode string > > A Python type representing a sequence of code points with > "string semantics" (e.g. case conversions, regular > expression compatibility, etc.) Constructed with the > unicode() function. > > Proposed Solution > > One solution would be to merely increase the maximum ordinal > to a larger value. Unfortunately the only straightforward > implementation of this idea is to use 4 bytes per character. > This has the effect of doubling the size of most Unicode > strings. In order to avoid imposing this cost on every > user, Python 2.2 will allow the 4-byte implementation as a > build-time option. Users can choose whether they care about > wide characters or prefer to preserve memory. > > The 4-byte option is called "wide Py_UNICODE". The 2-byte option > is called "narrow Py_UNICODE". > > Most things will behave identically in the wide and narrow worlds. > > * unichr(i) for 0 <= i < 2**16 (0x10000) always returns a > length-one string. > > * unichr(i) for 2**16 <= i <= TOPCHAR will return a > length-one string on wide Python builds. On narrow builds it will > raise ValueError. > > ISSUE > > Python currently allows \U literals that cannot be > represented as a single Python character. It generates two > Python characters known as a "surrogate pair". Should this > be disallowed on future narrow Python builds? > > Pro: > > Python already the construction of a surrogate pair > for a large unicode literal character escape sequence. > This is basically designed as a simple way to construct > "wide characters" even in a narrow Python build. It is also > somewhat logical considering that the Unicode-literal syntax > is basically a short-form way of invoking the unicode-escape > codec. 
> > Con: > > Surrogates could be easily created this way but the user > still needs to be careful about slicing, indexing, printing > etc. Therefore some have suggested that Unicode > literals should not support surrogates. > > ISSUE > > Should Python allow the construction of characters that do > not correspond to Unicode code points? Unassigned Unicode > code points should obviously be legal (because they could > be assigned at any time). But code points above TOPCHAR are > guaranteed never to be used by Unicode. Should we allow > access > to them anyhow? > > Pro: > > If a Python user thinks they know what they're doing why > should we try to prevent them from violating the Unicode > spec? After all, we don't stop 8-bit strings from > containing non-ASCII characters. > > Con: > > Codecs and other Unicode-consuming code will have to be > careful of these characters which are disallowed by the > Unicode specification. > > * ord() is always the inverse of unichr() > > * There is an integer value in the sys module that describes the > largest ordinal for a character in a Unicode string on the current > interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds > of Python and TOPCHAR on wide builds. > > ISSUE: Should there be distinct constants for accessing > TOPCHAR and the real upper bound for the domain of > unichr (if they differ)? There has also been a > suggestion of sys.unicodewidth which can take the > values 'wide' and 'narrow'. > > * every Python Unicode character represents exactly one Unicode code > point (i.e. Python Unicode Character = Abstract Unicode > character). > > * codecs will be upgraded to support "wide characters" > (represented directly in UCS-4, and as variable-length sequences > in UTF-8 and UTF-16). This is the main part of the implementation > left to be done. > > * There is a convention in the Unicode world for encoding a 32-bit > code point in terms of two 16-bit code points. These are known > as "surrogate pairs". 
Python's codecs will adopt this convention > and encode 32-bit code points as surrogate pairs on narrow Python > builds. > > ISSUE > > Should there be a way to tell codecs not to generate > surrogates and instead treat wide characters as > errors? > > Pro: > > I might want to write code that works only with > fixed-width characters and does not have to worry about > surrogates. > > Con: > > No clear proposal of how to communicate this to codecs. No need to pass this information to the codec: simply write a new one and give it a clear name, e.g. "ucs-2" will generate errors while "utf-16-le" converts them to surrogates. > * there are no restrictions on constructing strings that use > code points "reserved for surrogates" improperly. These are > called "isolated surrogates". The codecs should disallow reading > these from files, but you could construct them using string > literals or unichr(). > > Implementation > > There is a new (experimental) define: > > #define PY_UNICODE_SIZE 2 > > There is a new configure option: > > --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses > wchar_t if it fits > --enable-unicode=ucs4 configures a wide Py_UNICODE, and uses > whchar_t if it fits > --enable-unicode same as "=ucs2" > > The intention is that --disable-unicode, or --enable-unicode=no > removes the Unicode type altogether; this is not yet implemented. > > It is also proposed that one day --enable-unicode will just > default to the width of your platforms wchar_t. > > Windows builds will be narrow for a while based on the fact that > there have been few requests for wide characters, those requests > are mostly from hard-core programmers with the ability to buy > their own Python and Windows itself is strongly biased towards > 16-bit characters. > > Notes > > This PEP does NOT imply that people using Unicode need to use a > 4-byte encoding for their files on disk or sent over the network. > It only allows them to do so. 
For example, ASCII is still a > legitimate (7-bit) Unicode-encoding. > > It has been proposed that there should be a module that handles > surrogates in narrow Python builds for programmers. If someone > wants to implement that, it will be another PEP. It might also be > combined with features that allow other kinds of character-, > word- and line- based indexing. > > Rejected Suggestions > > More or less the status-quo > > We could officially say that Python characters are 16-bit and > require programmers to implement wide characters in their > application logic by combining surrogate pairs. This is a heavy > burden because emulating 32-bit characters is likely to be > very inefficient if it is coded entirely in Python. Plus these > abstracted pseudo-strings would not be legal as input to the > regular expression engine. > > "Space-efficient Unicode" type > > Another class of solution is to use some efficient storage > internally but present an abstraction of wide characters to > the programmer. Any of these would require a much more complex > implementation than the accepted solution. For instance consider > the impact on the regular expression engine. In theory, we could > move to this implementation in the future without breaking > Python > code. A future Python could "emulate" wide Python semantics on > narrow Python. Guido is not willing to undertake the > implementation right now. > > Two types > > We could introduce a 32-bit Unicode type alongside the 16-bit > type. There is a lot of code that expects there to be only a > single Unicode type. > > This PEP represents the least-effort solution. Over the next > several years, 32-bit Unicode characters will become more common > and that may either convince us that we need a more sophisticated > solution or (on the other hand) convince us that simply > mandating wide Unicode characters is an appropriate solution. > Right now the two options on the table are do nothing or do > this. 
> 
> References
> 
>     Unicode Glossary: http://www.unicode.org/glossary/

Plus perhaps the Mark Davis paper at:

    http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/

> Copyright
> 
>     This document has been placed in the public domain.

Good work, Paul !

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/

From walter@livinglogic.de  Mon Jul  2 12:40:52 2001
From: walter@livinglogic.de (Walter Dörwald)
Date: Mon, 02 Jul 2001 13:40:52 +0200
Subject: [I18n-sig] Error handling (was: Re: validity of lone surrogates)
References: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> <4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> <200106271416.f5REGl519361@odiug.digicool.com> <3B3A1020.7154E4B6@livinglogic.de> <200106271753.f5RHrAB19753@odiug.digicool.com>
Message-ID: <3B405DC4.1050900@livinglogic.de>

> > How would this work together with the proposed encode error handling
> > callback feature (see patch #432401)?  Does this patch have any
> > chance of getting into Python (when it's finished)?
> 
> I don't know.  The patch looks awfully big, and the motivation seems
> thin, so I don't have high hopes.  I doubt that I would use it myself,
> and I fear that it would be pretty slow if called frequently.
Here are a few speed comparisons:

---
import time
s = u"a"*20000000
t1 = time.time()
s.encode("ascii")
t2 = time.time()
print t2-t1
---

The result with Python 2.1 is: 0.65726006031
With the patch the time is: 0.895708084106
(This is probably due to the memory reallocation tests, which could be
avoided for most encoders)

And a test script with an error handler:

---
import time
s = u"aä"*1000000
t1 = time.time()
s.encode("ascii", lambda enc,uni,pos: u"&#%d;" % ord(uni[pos]))
t2 = time.time()
print t2-t1
---

37.0272110701

There is a version of this error handler implemented in C, so replacing

   s.encode("ascii", lambda enc,uni,pos: u"&#%d;" % ord(uni[pos]))

with

   s.encode("ascii", codecs.xmlcharrefreplace_unicodeencode_errors)

gives a result of 4.77566099167

The equivalent Python code:

---
import time
s = u"aä"*1000000
t1 = time.time()
v = []
for c in s:
    try:
        v.append(c.encode("ascii"))
    except UnicodeError:
        v.append("&#%d;" % ord(c))
"".join(v)
t2 = time.time()
print t2-t1
---

345.193374991

(Note that this is not really equivalent, because it doesn't work with
stateful encoders (e.g. UTF16 generates multiple BOMs))

> An alternative way to get what you want would be to write your own
> codec.

This would have to be more like a meta codec, because this feature
should be available for every character encoding.

> Also, some standard codecs might be subclassable in a way that
> makes it easy to get the desired functionality through subclassing
> rather than through changing lots of C level APIs.

The patch changes the API in two places:

1. "PyObject *error" is used instead of "const char *error", because
error may be a callable object instead of a string.
There would be a possibility to have the error argument as
"const char *error": Define an error handling registry where error
handling functions can be registered by name:

   codec.registerError("xmlreplace",
      lambda enc,uni,pos: "&#%d;" % ord(uni[pos]))

and then the following call can be made:

   u"äöü".encode("ascii", "xmlreplace")

As soon as the first error is encountered, the encoder uses its
builtin error handling method if it recognizes the name ("strict",
"replace" or "ignore") or looks up the error handling function in the
registry if it doesn't.  In this way the speed for the backwards
compatible features is the same as before and "const char *error" can
be kept as the parameter to all encoding functions.  For speed, common
error handling names could even be implemented in the encoder itself.

2. The arguments "Py_UNICODE *str, int size" to the encoder functions
have been replaced with "PyObject *unicode".  This was done because
the original string is passed to the callback handler, which is just
an INCREF when the string is already available as "PyObject *unicode",
but a new string has to be created from str/size (though this has to
be done only once, for the first error).  So it's possible to change
this back to the original.

With this it would be possible to implement the functionality without
changing the API and without any loss of speed for already existing
functionality.  Old third party encoders will continue to work for the
old error options and would simply raise an "unknown error handling"
exception for the new ones.

Should I try this approach?  Does it have a better chance of getting
into Python?
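[For readers on a modern Python: the registry idea sketched above is
essentially what codecs.register_error later provided.  A minimal
illustration, assuming that mechanism; the handler name "xmlreplace"
is just the example name used above:]

```python
import codecs

def xmlreplace(exc):
    # Replace each unencodable character with an XML character
    # reference, then resume encoding after the failing span.
    if isinstance(exc, UnicodeEncodeError):
        refs = "".join("&#%d;" % ord(ch)
                       for ch in exc.object[exc.start:exc.end])
        return (refs, exc.end)
    raise exc

codecs.register_error("xmlreplace", xmlreplace)

print("a\xe4".encode("ascii", "xmlreplace"))  # b'a&#228;'
```

As proposed, unknown names raise a LookupError only when an error is
actually encountered, so the fast path for "strict"/"replace"/"ignore"
is unchanged.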
Bye,
   Walter Dörwald

From fredrik@pythonware.com  Mon Jul  2 17:02:31 2001
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Mon, 2 Jul 2001 18:02:31 +0200
Subject: [I18n-sig] UCS-4 configuration
References: 
Message-ID: <008a01c10310$671dc990$4ffa42d5@hagrid>

tim wrote:

> [discussion about PyUnicode_DecodeUTF16]
>
> It's nice that we got to chat about portability to Platforms from
> Mars, but is anyone actually going to work on that function?  It
> shouldn't be hard, I just don't want to see it fall thru the cracks.

isn't it about time you hacked on some unicode stuff? ;-)

Cheers /F

From pinard@iro.umontreal.ca  Mon Jul  2 19:19:05 2001
From: pinard@iro.umontreal.ca (François Pinard)
Date: 02 Jul 2001 14:19:05 -0400
Subject: [I18n-sig] Re: How does Python Unicode treat surrogates?
In-Reply-To: <87u216qluh.fsf@deneb.enyo.de>
References: <9F2D83017589D211BD1000805FA70CA703B139D6@ntxmel03.cmutual.com.au> <87u216qluh.fsf@deneb.enyo.de>
Message-ID: 

[Florian Weimer]

> ISO 10646 is the ISO standard with lowest money per page ratio ever

I heard that ISO lowered the price of 10646 indeed.  A few years ago,
we needed 10646, and the price was, euh, substantial. :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca  Mon Jul  2 20:05:35 2001
From: pinard@iro.umontreal.ca (François Pinard)
Date: 02 Jul 2001 15:05:35 -0400
Subject: [I18n-sig] Re: Unicode surrogates: just say no!
In-Reply-To: <200106271953.f5RJrPi19963@odiug.digicool.com>
References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3696.FFA7FCE@ActiveState.com> <200106271953.f5RJrPi19963@odiug.digicool.com>
Message-ID: 

[Guido van Rossum]

> When using UCS-4 mode, I was in favor of allowing unichr() and \U to
> specify any value in range(0x100000000L)

I did not check recently, but would think Unicode and 10646 are
defined on 31 bits, not 32.
If you represent a UCS-4 code within a 32-bit int, it will never be
negative.  It might be useful to rely on this.

P.S. - Would not 32 bits also require one more byte in UTF-8?

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From tim.one@home.com  Mon Jul  2 20:46:44 2001
From: tim.one@home.com (Tim Peters)
Date: Mon, 2 Jul 2001 15:46:44 -0400
Subject: [I18n-sig] UCS-4 configuration
In-Reply-To: <008a01c10310$671dc990$4ffa42d5@hagrid>
Message-ID: 

[/F]
> isn't it about time you hacked on some unicode stuff? ;-)

It's a good thing I'm out sick today, cuz they'd never pay me for
this:

http://sf.net/tracker/index.php?func=detail&aid=438013&group_id=5470&atid=305470

From deltab@osian.net  Mon Jul  2 22:21:49 2001
From: deltab@osian.net (Daniel Biddle)
Date: Mon, 2 Jul 2001 21:21:49 +0000
Subject: [I18n-sig] Re: Unicode surrogates: just say no!
In-Reply-To: ; from pinard@iro.umontreal.ca on Mon, Jul 02, 2001 at 03:05:13PM -0400
References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3696.FFA7FCE@ActiveState.com> <200106271953.f5RJrPi19963@odiug.digicool.com>
Message-ID: <20010702212149.D30109@mewtwo.espnow.com>

On Mon, Jul 02, 2001 at 03:05:13PM -0400, François Pinard wrote:
> [Guido van Rossum]
>
> > When using UCS-4 mode, I was in favor of allowing unichr() and \U to
> > specify any value in range(0x100000000L)
>
> I did not check recently, but would think Unicode and 10646 are
> defined on 31 bits, not 32.  If you represent a UCS-4 code within a
> 32-bit int, it will never be negative.  It might be useful to rely
> on this.

Certainly ISO 10646 is defined as 31-bit.  Unicode was 16-bit, but now
uses just under 20.09 bits.

> P.S. - Would not 32 bits also require one more byte in UTF-8?

Yes:

   bits     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   control  7        2        2        2        2        2        = 17
   data     1        6        6        6        6        6        = 31

UTF-8 allows at most 6 bytes, which can encode 31 bits.
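[The byte counts implied by that scheme can be sketched as follows; a
hedged illustration of the original, pre-restriction 6-byte UTF-8
ranges, where utf8_len is a made-up helper name:]

```python
def utf8_len(cp):
    # Number of bytes in the original (up to 6-byte) UTF-8 encoding
    # of a code point, covering the full 31-bit ISO 10646 range.
    if cp < 0x80:       return 1  # 7 data bits
    if cp < 0x800:      return 2  # 11 data bits
    if cp < 0x10000:    return 3  # 16 data bits
    if cp < 0x200000:   return 4  # 21 data bits
    if cp < 0x4000000:  return 5  # 26 data bits
    return 6                      # 31 data bits, up to 2**31 - 1

print(utf8_len(0x10FFFF), utf8_len(2**31 - 1))  # 4 6
```

Note that everything up to U+10FFFF, the UTF-16 limit, fits in 4
bytes; the 5- and 6-byte forms exist only for the extra 31-bit space.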
It's been proposed that UTF-8 and UTF-32 be limited to values up to
U+10FFFF, which is the limit of UTF-16.

-- 
Daniel Biddle

From Misha.Wolf@reuters.com  Fri Jul 13 14:11:32 2001
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 13 Jul 2001 14:11:32 +0100
Subject: [I18n-sig] Call for Papers - 20th Unicode Conference - Jan/Feb 2001 - Washington DC
Message-ID: 

   Twentieth International Unicode Conference (IUC20)
   Unicode and the Web: The Global Connection
   http://www.unicode.org/iuc/iuc20
   January 28 - February 1, 2002
   Washington, DC, USA

   > > > > > > >  C A L L   F O R   P A P E R S  < < < < < < <

   Submissions due:       September 21, 2001
   Notification date:     October 12, 2001
   Completed papers due:  November 2, 2001
                          (in electronic form and camera-ready paper form)

   * * * * *

The Internet and the World Wide Web continue to change the shape of
computing.  The goal of network computing and understandable text
access across wide, diverse groups of people has brought great
momentum to computing environments that build Unicode into their
foundation.
Whether it's Internet commerce, network access to data, or highly portable applications, Unicode makes a solid foundation for the network, global enterprises, and software users everywhere. The Twentieth International Unicode Conference (IUC20) will address topics ranging from Unicode use in the World Wide Web and in operating systems and databases, to the latest developments with Unicode 3.1, Java, Open Source, XML and Web protocols.

Conference attendees will include managers, software engineers, systems analysts, and product marketing personnel responsible for the development of software supporting Unicode, as well as those involved in all aspects of the globalization of software and the Internet.

THEME & TOPICS

Computing with Unicode is the overall theme of the Conference. Presentations should be geared towards a technical audience. Suggested topics of interest include, but are not limited to:

- Internationalization features of portable devices
- Implementing new features of Unicode Version 3.1
- Unicode normalization, collation
- Programming Languages and Libraries (Java, Perl, et al)
- The World Wide Web (WWW) and Unicode
- Character set issues
- Web search engines and Unicode
- Library and archival concerns
- Unicode in operating systems
- Unicode in databases
- Unicode in large scale networks
- Unicode in government applications
- The results of using Unicode applications (case studies, solutions)
- Language processing issues with Unicode data
- Migrating legacy applications to Unicode
- Cross platform issues
- Printing and imaging
- Optimizing performance of Unicode systems and applications
- Testing Unicode applications
- Usability evaluations of Unicode applications
- Internationalization and Localization

SESSIONS

The Conference Program will provide a wide range of sessions including:

- Keynote presentations
- Workshops/Tutorials
- Technical presentations
- Panel sessions

All sessions except the Workshops/Tutorials will be of 40 minute duration.
In some cases, two consecutive 40 minute program slots may be devoted to a single session. The Workshops/Tutorials will each last approximately three hours. They should be designed to stimulate discussion and participation, using slides and demonstrations.

PUBLICITY

If your paper is accepted, your details will be included in the Conference brochure and Web pages, and the paper itself will appear on a Conference CD, with an optional printed book of Conference Proceedings.

CONFERENCE LANGUAGE

The Conference language is English. All submissions, papers and presentations should be provided in English.

SUBMISSIONS

Submissions MUST contain:

1. An abstract of 150-250 words, consisting of a statement of purpose, paper description, and your conclusions or final summary.
2. A brief biography.
3. The details listed below:

SESSION TITLE:            _________________________________________
                          _________________________________________
TITLE (eg Dr/Mr/Mrs/Ms):  _________________________________________
NAME:                     _________________________________________
JOB TITLE:                _________________________________________
ORGANIZATION/AFFILIATION: _________________________________________
ORGANIZATION'S WWW URL:   _________________________________________
OWN WWW URL:              _________________________________________
ADDRESS FOR PAPER MAIL:   _________________________________________
                          _________________________________________
                          _________________________________________
TELEPHONE:                _________________________________________
FAX:                      _________________________________________
E-MAIL ADDRESS:           _________________________________________

TYPE OF SESSION:
[ ] Keynote presentation
[ ] Workshop/Tutorial
[ ] Technical presentation
[ ] Panel

PANELISTS (if Panel):     _________________________________________
                          _________________________________________
                          _________________________________________
                          _________________________________________
                          _________________________________________
                          _________________________________________
                          _________________________________________
                          _________________________________________

TARGET AUDIENCE (you may select more than one category):
[ ] Managers
[ ] Software Engineers
[ ] Systems Analysts
[ ] Marketers
[ ] Other: ______________________________

LEVEL OF SESSION (you may select more than one category):
[ ] Beginner
[ ] Intermediate
[ ] Advanced

Submissions should be sent by e-mail to either of the following addresses:

papers@unicode.org
info@global-conference.com

They should use ASCII, non-compressed text and the following subject line: Proposal for IUC 20

If desired, a copy of the submission may also be sent by post to:

Twentieth International Unicode Conference
c/o Global Meeting Services, Inc.
4360 Benhurst Avenue
San Diego, CA 92122 USA
Tel: +1 858 638 0206
Fax: +1 858 638 0504

CONFERENCE PROCEEDINGS

All Conference papers will be published on CD. Printed proceedings will be offered as an option.

EXHIBIT OPPORTUNITIES

The Conference will have an Exhibition area for corporations or individuals who wish to display and promote their products, technology and/or services. Every effort will be made to provide maximum exposure and advertising. Exhibit space is limited. For further information or to reserve a place, please contact Global Meeting Services at the above location.

CONFERENCE VENUE

Omni Shoreham Hotel
2500 Calvert Street, NW
Washington, DC 20008 USA
Tel: +1 202 234 0700
Fax: +1 202 265 7972

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991. It is dedicated to the development, maintenance and promotion of The Unicode Standard, a worldwide character encoding. The Unicode Standard encodes the characters of the world's principal scripts and languages, and is code-for-code identical to the international standard ISO/IEC 10646. In addition to cooperating with ISO on the future development of ISO/IEC 10646, the Consortium is responsible for providing character properties and algorithms for use in implementations.
Today the membership base of the Unicode Consortium includes major computer corporations, software producers, database vendors, research institutions, international agencies and various user groups. For further information on the Unicode Standard, visit the Unicode Web site at http://www.unicode.org or e-mail

* * * * *

Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

-----------------------------------------------------------------
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Reuters Ltd.

From guido@digicool.com Fri Jul 13 16:04:23 2001
From: guido@digicool.com (Guido van Rossum)
Date: Fri, 13 Jul 2001 11:04:23 -0400
Subject: [I18n-sig] Call for Papers - 20th Unicode Conference - Jan/Feb 2001 - Washington DC
In-Reply-To: Your message of "Fri, 13 Jul 2001 14:11:32 BST."
References:
Message-ID: <200107131504.f6DF4NK16532@odiug.digicool.com>

> Twentieth International Unicode Conference (IUC20)
> Unicode and the Web: The Global Connection
> http://www.unicode.org/iuc/iuc20
> January 28 - February 1, 2002
> Washington, DC, USA

If you go to this conference, you can combine it with the 10th Python conference, which will be the next week in Alexandria (a suburb of Washington). (The new conference date and location will be officially announced at the O'Reilly conference in San Diego later this month.)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From barry@zope.com Fri Jul 27 06:32:43 2001
From: barry@zope.com (Barry A. Warsaw)
Date: Fri, 27 Jul 2001 01:32:43 -0400
Subject: [I18n-sig] pygettext dilemma
Message-ID: <15200.64763.772001.53387@anthem.wooz.org>

I've got a bit of a dilemma about the right way to generate a pot file, specifically for Mailman. Because this involves docstrings, I don't think the normal gettext tools have to deal with this.

In Mailman, I've got a bunch of normal .py modules and a bunch of command line scripts. The modules have their translatable strings nicely marked with _() and only those strings should be extracted. The scripts however should have both _() and docstrings extracted, since the module docstrings include usage text. pygettext.py has a -D (--docstrings) flag that signals the program to extract docstrings even though they aren't _() marked. So far so good.
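Barry's distinction between the two classes of strings can be illustrated with a short sketch. This is hypothetical code using the stdlib ast module, not pygettext's actual tokenize-based extractor, and the SOURCE module is an invented example:

```python
import ast

# A toy module: one _()-marked string, plus a module and a function docstring.
SOURCE = '''
"""Usage: prog [options] -- a module docstring carrying usage text."""

def greet():
    """An unmarked docstring."""
    return _("Hello, world!")
'''

def extract(source, docstrings=False):
    """Collect translatable strings: _() calls, plus docstrings if requested."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        # _()-marked string literals are always extracted
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == "_"
                and node.args
                and isinstance(node.args[0], ast.Constant)):
            found.append(node.args[0].value)
    if docstrings:
        # -D-style behaviour: module/class/function docstrings are added too
        targets = [tree] + [n for n in ast.walk(tree)
                            if isinstance(n, (ast.FunctionDef, ast.ClassDef))]
        for node in targets:
            doc = ast.get_docstring(node)
            if doc:
                found.append(doc)
    return found

print(extract(SOURCE))                   # just the _()-marked string
print(extract(SOURCE, docstrings=True))  # adds the two docstrings
```

The dilemma below is exactly that the second mode is all-or-nothing per run: there is no way to enable docstring extraction for some input files and not others.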
But the problem is that my translators definitely do not want the normal .py modules' docstrings extracted, because it is difficult for them to figure out which docstrings to translate and which to ignore.

I tried to extract the two classes of files in two separate pygettext.py steps, but had trouble merging the resulting files. You can't merge them with msgmerge because that program seems to just drop all the entries from the second file (I'm guessing since there's no overlap between the first and second files). So next I tried just cat'ing the two files together, but this generates fatal exceptions in msgmerge for duplicate entries. One of the duplicates is the pot header, so I was going to add a switch to suppress that, but then realized that there'd be other duplicates anyway.

What I /think/ I want now is to be able to tell pygettext.py exactly which files to extract docstrings from and which to only extract marked strings from, and then do the extraction in one fell swoop. I propose to include a -X flag like so:

    -X filename
    --no-docstrings=filename
        Specify a file that contains a list of files that should not have
        their docstrings extracted. This is only useful in conjunction with
        the -D option above.

So with this I'd hand pygettext.py the entire list of files that it should do extraction on, include the -D option, and then include the -X option with the normal module .py's listed in an exclude-file.

Does anybody have any suggestions or better ideas?
-Barry

From keichwa@gmx.net Fri Jul 27 17:38:24 2001
From: keichwa@gmx.net (Karl Eichwalder)
Date: 27 Jul 2001 18:38:24 +0200
Subject: [I18n-sig] pygettext dilemma
In-Reply-To: <15200.64763.772001.53387@anthem.wooz.org>
References: <15200.64763.772001.53387@anthem.wooz.org>
Message-ID:

barry@zope.com (Barry A.
Warsaw) writes:

> You can't merge them with msgmerge because that program seems to just
> drop all the entries from the second file (I'm guessing since there's
> no overlap between the first and second files).

Consider using msgcomm for this job ;) Beware, all versions up to 0.10.39 are "limited" (the option --unique is broken); it's best to go for the CVS version (HEAD). You can check it out from

    :pserver:anoncvs@sourceware.cygnus.com:/cvs/gettext

Password is "anoncvs" (IIRC). Info is available somewhere on the cygnus site.

There's also msgcat; the main difference: with msgcomm the first occurrence of a message wins; msgcat concatenates, and the user has to decide which translations to keep.

> What I /think/ I want now is to be able to tell pygettext.py exactly
> which files to extract docstrings from and which to only extract
> marked strings from, and then do the extraction in one fell swoop.
>
> I propose to include a -X flag like so:
>
> -X filename
> --no-docstrings=filename
> Specify a file that contains a list of files that should not have
> their docstrings extracted. This is only useful in conjunction with
> the -D option above.

Using the combo msggrep/msgcomm you can "throw away" unwanted messages quite easily; maybe this approach will help. msgcat, msggrep, msgconv and msgexec are new tools recently developed by Bruno Haible.

-- 
ke@suse.de (work) / keichwa@gmx.net (home):  | http://www.suse.de/~ke/
                                             |    ,__o
Free Translation Project:                    |  _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/ | (*)/'(*)

From barry@zope.com Fri Jul 27 17:52:30 2001
From: barry@zope.com (Barry A.
Warsaw)
Date: Fri, 27 Jul 2001 12:52:30 -0400
Subject: [I18n-sig] pygettext dilemma
References: <15200.64763.772001.53387@anthem.wooz.org>
Message-ID: <15201.40014.142429.215469@anthem.wooz.org>

>>>>> "KE" == Karl Eichwalder writes:

KE> Consider using msgcomm for this job ;) Beware, all versions up
KE> to 0.10.39 are "limited" (the option --unique is broken); it's
KE> best to go for the CVS version (HEAD). You can check it
KE> out from

Cool, thanks for the pointer, I'll definitely check it out. Looks like my system's got an old msgcomm, so I'll suck down the cvs and install that.

Turns out that the -X option on pygettext.py works well enough, even if it is a bit of a hack. I just committed it to Python's cvs. :)

Thanks,
-Barry

From haible@ilog.fr Fri Jul 27 18:38:52 2001
From: haible@ilog.fr (Bruno Haible)
Date: Fri, 27 Jul 2001 19:38:52 +0200 (CEST)
Subject: [I18n-sig] pygettext dilemma
Message-ID: <15201.42796.108513.382321@honolulu.ilog.fr>

Barry A. Warsaw wrote:

> I tried to extract the two classes of files in two separate
> pygettext.py steps

That's most reasonable. It allows you to use different xgettext/pygettext arguments for the two sets of files.

> but had trouble merging the resulting files. You
> can't merge them with msgmerge because that program seems to just drop
> all the entries from the second file (I'm guessing since there's no
> overlap between the first and second files).

msgcomm is not really made for this task. gettext-0.11 will contain an 'msgcat' command, which works well for these cases. In the meantime, I can recommend 'cat'ing the two pot files and running 'msguniq' on the result. 'msguniq' will also be in gettext-0.11, but here is an equivalent implementation in a Python-like language ().

Bruno

============================ msguniq =============================

#!/usr/local/bin/clisp -C
;;; Remove duplicates in message catalogs.
;;; Bruno Haible 28.3.1997

;; This could roughly be implemented as
;;   cp INPUT temp1
;;   cp INPUT temp2
;;   msgcomm --more-than=1 -w 1000 -o OUTPUT temp1 temp2
;; but this has the drawback that
;; - msgcomm doesn't seem to be made for this.
;; This could also be roughly implemented as
;;   xgettext -d - --omit-header -w 1000 INPUT > OUTPUT
;; but this has the drawbacks that
;; - it sometimes reverses the list of lines belonging to the hunk,
;; - it removes the header.
;; When gettext-0.11 is released, this could also be implemented as
;;   msguniq INPUT -w 1000 -o OUTPUT
;; without any drawbacks!
;; Additionally, message translations in OLD override the ones in INPUT.

(defstruct message
  lines  ; list of all lines belonging to the hunk
  msgid  ; nil or a string
  msgstr ; nil or a string
  occurs ; list of strings "file:nn" where the message occurs
)

(defun main (infilename outfilename &optional oldfilename)
  (declare (type string infilename outfilename))
  #+UNICODE (setq *default-file-encoding* charset:iso-8859-1)
  (let ((hunk-list nil) ; list of all hunks
        (hunk-table (make-hash-table :test #'equal))
          ; (gethash msgid hunk-table) is the hunk that has the given msgid
        (eof "EOF")
       )
    (flet ((read-hunk (istream) ; reads a hunk, returns nil on eof
             (let ((line nil) (lines nil) (occurs nil))
               (loop
                 (setq line (read-line istream nil eof))
                 (when (eql line eof) (return))
                 (if (equal line "")
                   (when lines (return))
                   (progn
                     (push line lines)
                     (when (and (>= (length line) 3)
                                (string= line "#: " :end1 3))
                       (push (subseq line 3) occurs) ) ) ) )
               (when lines
                 (setq lines (nreverse lines))
                 (setq occurs (nreverse occurs))
                 (flet ((line-group (id &aux (idlen (length id)))
                          (let ((l (member-if
                                     #'(lambda (line)
                                         (and (>= (length line) idlen)
                                              (string= line id :end1 idlen) ) )
                                     lines )) )
                            (when l
                              (setq l (cons (subseq (car l) idlen) (cdr l)))
                              (let ((i (position-if-not
                                         #'(lambda (line)
                                             (and (plusp (length line))
                                                  (eql (char line 0) #\") ) )
                                         l )) )
                                (subseq l 0 i) )) ) ) )
                   (let ((msgid (line-group "msgid "))
                         (msgstr (line-group "msgstr ")))
                     (make-message :lines lines :msgid msgid
                                   :msgstr msgstr :occurs occurs ) ) ) ) )) )
      (with-open-file (istream infilename :direction :input)
        (loop
          (let ((hunk (read-hunk istream)))
            (unless hunk (return))
            (if (null (message-msgid hunk))
              (push hunk hunk-list)
              (let ((other-hunk (gethash (message-msgid hunk) hunk-table)))
                (if (not other-hunk)
                  (progn
                    (push hunk hunk-list)
                    (setf (gethash (message-msgid hunk) hunk-table) hunk) )
                  (progn
                    (unless (equal (message-msgstr hunk)
                                   (message-msgstr other-hunk) )
                      (warn "Same message, different translations: ~A and ~A"
                            (message-occurs hunk) (message-occurs other-hunk) ) )
                    (setf (message-occurs other-hunk)
                          (append (message-occurs other-hunk)
                                  (message-occurs hunk) ) ) ) ) ) ) ) )
        (setq hunk-list (nreverse hunk-list)) )
      (when oldfilename
        (with-open-file (istream oldfilename :direction :input)
          (loop
            (let ((hunk (read-hunk istream)))
              (unless hunk (return))
              (unless (null (message-msgid hunk))
                (let ((other-hunk (gethash (message-msgid hunk) hunk-table)))
                  (when other-hunk
                    (setf (message-msgstr other-hunk) (message-msgstr hunk)) ) ) ) ) ) ) )
      (with-open-file (ostream outfilename :direction :output)
        (flet ((print-hunk (hunklistr)
                 (let* ((hunk (car hunklistr))
                        (lines (message-lines hunk))
                        (msgid (message-msgid hunk))
                        (msgstr (message-msgstr hunk))
                        (occurs (message-occurs hunk)))
                   (dolist (line lines)
                     (cond ((and (>= (length line) 3) (string= line "#: " :end1 3))
                            (when occurs
                              (format ostream "#: ~{~A~^ ~}~%" occurs)
                              (setq occurs nil) ))
                           ((and (>= (length line) 1) (string= line "#" :end1 1))
                            (format ostream "~A~%" line) )
                           ((and (>= (length line) 6) (string= line "msgid " :end1 6))
                            (format ostream "msgid ~{~A~%~}" msgid) )
                           ((and (>= (length line) 7) (string= line "msgstr " :end1 7))
                            (format ostream "msgstr ~{~A~%~}" msgstr) ) ) )
                   (when (cdr hunklistr) (format ostream "~%")) )) )
          (mapl #'print-hunk hunk-list) ) ) ) )

(main (first *args*) (second *args*) (third *args*))
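The cat-then-deduplicate idea behind Bruno's CLISP script can also be sketched in Python. This is a minimal sketch under simplifying assumptions (hunks separated by blank lines, single-line msgid/msgstr, no OLD-file override), not the real msguniq: it keeps the first hunk per msgid, merges the "#: " occurrence comments, and warns when two hunks carry different translations.

```python
import sys

def parse_hunks(text):
    """Split a .po/.pot file into hunks: blocks separated by blank lines."""
    hunks = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        if not lines:
            continue
        hunks.append({
            "lines": lines,
            "msgid": next((l for l in lines if l.startswith("msgid ")), None),
            "msgstr": next((l for l in lines if l.startswith("msgstr ")), None),
            "occurs": [l[3:] for l in lines if l.startswith("#: ")],
        })
    return hunks

def msguniq(text):
    """Keep the first hunk per msgid, merging '#: ' occurrence comments."""
    kept, by_id = [], {}
    for hunk in parse_hunks(text):
        first = by_id.get(hunk["msgid"]) if hunk["msgid"] else None
        if first is None:
            kept.append(hunk)
            if hunk["msgid"]:
                by_id[hunk["msgid"]] = hunk
        else:
            if hunk["msgstr"] != first["msgstr"]:
                print("warning: same msgid, different translations:",
                      first["occurs"], "and", hunk["occurs"], file=sys.stderr)
            first["occurs"].extend(hunk["occurs"])
    blocks = []
    for hunk in kept:
        lines, emitted = [], False
        for line in hunk["lines"]:
            if line.startswith("#: "):
                if not emitted:  # merged occurrences, printed once
                    lines.append("#: " + " ".join(hunk["occurs"]))
                    emitted = True
            else:
                lines.append(line)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"

# Simulate `cat a.pot b.pot | msguniq`: two extractions share one msgid.
a = '#: foo.py:1\nmsgid "hello"\nmsgstr ""\n'
b = '#: bar.py:9\nmsgid "hello"\nmsgstr ""\n'
print(msguniq(a + "\n" + b))
```

Like the Lisp version, this prints the merged occurrence comment in place of the first "#: " line of the surviving hunk and suppresses the rest; unlike the real tools, it does nothing special for the pot header or multi-line strings.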