From Misha.Wolf@reuters.com Fri Mar 1 17:52:15 2002 From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com) Date: Fri, 01 Mar 2002 17:52:15 +0000 Subject: [I18n-sig] 21st Unicode Conference, May 2002, Dublin, Ireland Message-ID: >>>>>>>>>>>>>>>>>> First European IUC in two years! <<<<<<<<<<<<<<<<<<< Twenty-first International Unicode Conference (IUC21) Unicode, Localization and the Web: The Global Connection http://www.unicode.org/iuc/iuc21 14-17 May 2002 Dublin, Ireland >>>>>>>>>>>>>>>>>>>>>>>>> Just 10 weeks to go! <<<<<<<<<<<<<<<<<<<<<<<< NEWS * Hotel guest room group rate valid to 1 May. * Early bird registration rate valid to 1 May. * Visit the Conference Web site ( http://www.unicode.org/iuc/iuc21 ) to check the Conference program and register. To help you choose Conference sessions, we've included abstracts of talks and speakers' biographies. * The Workshop on Standards in Localisation, organised by the Localisation Research Centre (LRC), is taking place in the same venue on May 13 -- See: http://lrc.csis.ul.ie CONFERENCE SPONSORS Agfa Monotype Corporation Basis Technology Corporation Localisation Research Centre Microsoft Corporation Reuters Ltd Sun Microsystems, Inc. World Wide Web Consortium (W3C) GLOBAL COMPUTING SHOWCASE Visit the Showcase to find out more about products supporting the Unicode Standard, and products and services that can help you globalize/localize your software, documentation and Internet content. For details, visit the Conference Web site. CONFERENCE VENUE The Conference will take place at: The Burlington Hotel Upper Leeson Street Dublin 4, Ireland Tel: (+353 1) 660 5222 Fax: (+353 1) 660 8496 CONFERENCE MANAGEMENT Global Meeting Services Inc. 8949 Lombard Place, #416 San Diego, CA 92122, USA Tel: +1 858 638 0206 (voice) +1 858 638 0504 (fax) Email: info@global-conference.com or: conference@unicode.org * * * * * Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission. ------------------------------------------------------------- --- Visit our Internet site at http://www.reuters.com Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Reuters Ltd. From kajiyama@grad.sccs.chukyo-u.ac.jp Mon Mar 4 11:33:08 2002 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Mon, 4 Mar 2002 20:33:08 +0900 Subject: [I18n-sig] JapaneseCodecs 1.4.4 released Message-ID: <200203041133.UAA15626@dhcp225.grad.sccs.chukyo-u.ac.jp> Hi all, I've released JapaneseCodecs 1.4.4. The new feature is the addition of a codec for MS932 (Microsoft code page 932, i.e. a version of Shift_JIS). A source tarball is available at: http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/ The MS932 codec was written by Atsuo ISHIMOTO. I really appreciate the contribution. Thanks a lot!! Regards, -- KAJIYAMA, Tamito From kajiyama@grad.sccs.chukyo-u.ac.jp Wed Mar 6 12:05:28 2002 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Wed, 6 Mar 2002 21:05:28 +0900 Subject: [I18n-sig] PEP 263 and Japanese native encodings Message-ID: <200203061205.VAA18805@dhcp225.grad.sccs.chukyo-u.ac.jp> Hi, I read the PEP 263: Defining Python Source Code Encodings (revision 1.9). Here are some comments after a discussion on the PEP in a Japanese Python mailing list. First of all, as a Japanese Python programmer, I would like to use three Japanese native encodings EUC-JP, Shift_JIS and ISO-2022-JP as a file encoding of Python source files.
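Concretely, under the proposed scheme such a source file would simply start with a coding declaration in its first one or two lines, along the lines of the following sketch (illustrative only; the Japanese text itself is elided here):

    #!/usr/bin/env python
    # -*- coding: euc-jp -*-
    # Comments and string literals from this point on are EUC-JP byte
    # sequences; the same kind of header would name shift_jis or
    # iso-2022-jp instead.
    kotoba = "..."    # Japanese text elided; any EUC-JP-encoded bytes would do
    print kotoba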
I think these encodings are considered "ASCII compatible" in the sense you mention in the following paragraph in the "Concepts" section: Only ASCII compatible encodings are allowed as source code encoding to assure that Python language elements other than literals and comments remain readable by ASCII processing tools and to avoid problems with wide characters encodings such as UTF-16. However, a participant of the discussion in the Japanese Python mailing list says, among the three Japanese encodings, Shift_JIS and ISO-2022-JP are *not* ASCII compatible. He defines ASCII compatibility as follows: An ASCII compatible encoding (character set) is a superset of the ASCII encoding (character set) in which octets from 0x00 to 0x7f are only used to represent ASCII characters and not used in a series of bytes that represent a multibyte character (such as Kanji and Hiragana). This definition is too restrictive IMHO, but anyway the term "ASCII compatible" is somewhat obscure and needs clarification since there are at least two interpretations. For the sake of the PEP's readers, it's also useful to provide a (partial) list of encodings that can be used as a file encoding. In summary, the questions to be raised are: o What does the term "ASCII compatible" mean? o Are three Japanese native encodings EUC-JP, Shift_JIS and ISO-2022-JP "ASCII compatible"? Anyway, thank you for the great proposal. It will enhance the utility of the language for non-Latin Python programmers once implemented in the language core. I really hope that. Regards, -- KAJIYAMA, Tamito From mal@lemburg.com Wed Mar 6 12:49:58 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 06 Mar 2002 13:49:58 +0100 Subject: [I18n-sig] PEP 263 and Japanese native encodings References: <200203061205.VAA18805@dhcp225.grad.sccs.chukyo-u.ac.jp> Message-ID: <3C861076.7202C114@lemburg.com> Tamito KAJIYAMA wrote: > > I read the PEP 263: Defining Python Source Code Encodings > (revision 1.9). Here some comments after a discussion on the > PEP in a Japanese Python mailing list. > > First of all, as a Japanese Python programmer, I would like to > use three Japanese native encodings EUC-JP, Shift_JIS and > ISO-2022-JP as a file encoding of Python source files. I think > these encodings are considered "ASCII compatible" in the sense > you mention in the following paragraph in the "Concepts" section: > > Only ASCII compatible encodings are allowed as source code > encoding to assure that Python language elements other than > literals and comments remain readable by ASCII processing tools > and to avoid problems with wide characters encodings such as > UTF-16. > > However, a participant of the discussion in the Japanese Python > mailing list says, among the three Japanese encodings, Shift_JIS > and ISO-2022-JP are *not* ASCII compatible. He defines ASCII > compatibility as follows: > > An ASCII compatible encoding (character set) is a superset of > the ASCII encoding (character set) in which octets from 0x00 > to 0x7f are only used to represent ASCII characters and not > used in a series of bytes that represent a multibyte character > (such as Kanji and Hiragana). > > This definition is too restrictive IMHO, but anyway the term > "ASCII compatible" is somewhat obscure and needs clarification > since there are at least two interpretations. As far as the Python tokenizer/compiler is concerned, it will only have to be able to read the first two lines and then decode the information found there as described in the PEP. 
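In code, that first step amounts to something like the following sketch (using an RE of the kind given in the PEP; an illustration only, not the actual patch):

    import re

    coding_re = re.compile(r"coding[:=]\s*([-\w.]+)")   # RE as in the PEP

    def get_declared_encoding(filename, default='ascii'):
        # Look at the first two lines only; the declaration must live in a
        # comment, and the shebang line also starts with '#'.
        f = open(filename)
        try:
            for line in (f.readline(), f.readline()):
                if line[:1] == '#':
                    m = coding_re.search(line)
                    if m:
                        return m.group(1)
        finally:
            f.close()
        return default

Only after this two-line sniffing does the tokenizer need to care about the rest of the file.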
That said, ASCII compatible encoding in the PEP description means that you can represent the standard printable characters including the line end characters of the ASCII encoding using ASCII ordinals. I only wanted to avoid having to support two or more byte encodings such as UTF-16 since these make the magic comment recognition much more difficult. > For the sake of > the PEP's readers, it's also useful to provide a (partial) list > of encodings that can be used as a file encoding. > > In summary, the questions to be raised are: > > o What does the term "ASCII compatible" mean? > o Are three Japanese native encodings EUC-JP, Shift_JIS and > ISO-2022-JP "ASCII compatible"? Yes, provided they have no problem representing the first two lines of a source file as e.g.: #!/usr/bin/python -uOO # -*- coding: iso-2022-jp -*- > Anyway, thank you for the great proposal. It will enhance the > utility of the language for non-Latin Python programmers once > implemented in the language core. I really hope that. Thanks. Since I will be busy the next two months, Martin has volunteered to head on with the implementation. I hope that we can have phase 1 implemented in Python 2.3. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From martin@v.loewis.de Wed Mar 6 18:03:07 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 06 Mar 2002 19:03:07 +0100 Subject: [I18n-sig] PEP 263 and Japanese native encodings In-Reply-To: <200203061205.VAA18805@dhcp225.grad.sccs.chukyo-u.ac.jp> References: <200203061205.VAA18805@dhcp225.grad.sccs.chukyo-u.ac.jp> Message-ID: Tamito KAJIYAMA writes: > I think > these encodings are considered "ASCII compatible" in the sense > you mention in the following paragraph in the "Concepts" section: > > Only ASCII compatible encodings are allowed as source code > encoding to assure that Python language elements other than > literals and comments remain readable by ASCII processing tools > and to avoid problems with wide characters encodings such as > UTF-16. My original definition of "ASCII compatible" would have been "An encoding X is ASCII compatible iff a text that consists only of ASCII characters is byte-for-byte identical when encoded with X, compared to the same text encoded in ASCII" Under this definition, iso-2022-jp would be ASCII compatible, but it still is not acceptable under the implementation that I have in mind for the patch.
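That original property is easy to test mechanically for any installed codec; a throwaway check (illustrative only, not part of the patch) would be:

    def is_ascii_compatible(encoding):
        # Byte-for-byte test of the definition above: pure ASCII text must
        # come out unchanged when encoded with the candidate codec.
        sample = ''.join(map(chr, range(32, 127))) + '\t\r\n'
        return unicode(sample, 'ascii').encode(encoding) == sample

iso-2022-jp passes this test because ASCII-only text needs no shift sequences; the difficulties discussed below only start once non-ASCII characters force the encoder to emit escape sequences or multibyte characters.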
> An ASCII compatible encoding (character set) is a superset of > the ASCII encoding (character set) in which octets from 0x00 > to 0x7f are only used to represent ASCII characters and not > used in a series of bytes that represent a multibyte character > (such as Kanji and Hiragana). Indeed, this is the definition which the reference implementation of the PEP currently relies on. > This definition is too restrictive IMHO, but anyway the term > "ASCII compatible" is somewhat obscure and needs clarification > since there are at least two interpretations. It would be possible to somewhat loosen this definition, defining "ASCII string compatible" instead: An ASCII string compatible encoding (character set) is a superset of the ASCII encoding (character set) in which octets from set AS are only used to represent ASCII characters and not used in a series of bytes that represent a multibyte character (such as Kanji and Hiragana). The set AS is defined as AS = [\r\n\\'"] (newline, linefeed, backslash, single/double quote) The rationale here is that, under the PEP, non-ASCII text may only appear in comments and strings. The lexer needs the ASCII-compatible property to determine the end-of-line and end-of-string markers, at least in the phase-1 implementation. > o Are three Japanese native encodings EUC-JP, Shift_JIS and > ISO-2022-JP "ASCII compatible"? EUC-JP certainly is; ISO-2022-JP probably isn't. I cannot see the problem with Shift_JIS; I thought it uses only non-ASCII bytes for the double-byte characters (and that this is precisely what the "shift" in Shift_JIS refers to); see http://www.io.com/~kazushi/encoding/sjis.html If you are referring to the common interpretation that Shift_JIS uses JIS X 0201-1976 for the first 128 bytes, I think we can take a relaxed position here: 1. The only differences between JIS X 0201 and ISO 646 IRV (aka ASCII) are \x24 (CURRENCY SIGN vs. DOLLAR SIGN) and \x5C (YEN SIGN vs. REVERSE SOLIDUS). 2. \x24 is not in AS. 3. Backslash could cause a problem, if people insist on putting the Yen sign into a string literal. Even though this isn't strictly supported under PEP 263, people would get away with that most of the time. 4. I understand that Microsoft's interpretation of Shift_JIS actually is that \x5C *does* represent REVERSE SOLIDUS, and that only the fonts display something else.
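A codec can also be probed mechanically for trail bytes that fall into AS; a quick sketch (illustrative only, not meant for the patch, and requiring a codec for the encoding to be installed, e.g. from JapaneseCodecs):

    AS = '\r\n\\\'"'    # CR, LF, backslash, single quote, double quote

    def as_bytes_in_multibyte_chars(encoding, first=0x3000, last=0x9fff):
        # Scan a slice of the BMP and report characters whose multibyte
        # encoding contains a byte from AS.
        hits = []
        for code in range(first, last + 1):
            try:
                s = unichr(code).encode(encoding)
            except (UnicodeError, LookupError):
                continue
            if len(s) > 1:
                for byte in s:
                    if byte in AS:
                        hits.append((hex(code), repr(s)))
                        break
        return hits

For an encoding like EUC-JP this comes back empty; for Shift_JIS it lists the characters whose second byte is \x5C, such as U+8868.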
Regards, Martin From martin@v.loewis.de Wed Mar 6 18:20:48 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 06 Mar 2002 19:20:48 +0100 Subject: [I18n-sig] ICU in python-codecs CVS Message-ID: I'd like to start working on ICU codecs support, in the python-codecs CVS. For that, I'd like to import Fredrik Juhlin's picu into the CVS, and start from there. Any objection against creating a picu module in the CVS? For that to work, Fredrik needs to get write access to the CVS also. Could somebody please arrange that? Thanks, Martin From tree@basistech.com Wed Mar 6 19:53:07 2002 From: tree@basistech.com (Tom Emerson) Date: Wed, 6 Mar 2002 14:53:07 -0500 Subject: [I18n-sig] ICU in python-codecs CVS In-Reply-To: References: Message-ID: <15494.29603.884336.395622@magrathea.basistech.com> Martin v. Loewis writes: > I'd like to start working on ICU codecs support, in the python-codecs > CVS. For that, I'd like to import Fredrik Juhlin's picu into the CVS, > and start from there. Any objection against creating a picu module in > the CVS? For that to work, Fredrik needs to get write access to the > CVS also. Could somebody please arrange that? I have no objects: if you give me his SF account I'll add him to the list of maintainers. -- Tom Emerson Basis Technology Corp. Sr. Computational Linguist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From tree@basistech.com Wed Mar 6 20:11:00 2002 From: tree@basistech.com (Tom Emerson) Date: Wed, 6 Mar 2002 15:11:00 -0500 Subject: [I18n-sig] ICU in python-codecs CVS In-Reply-To: <15494.29603.884336.395622@magrathea.basistech.com> References: <15494.29603.884336.395622@magrathea.basistech.com> Message-ID: <15494.30676.799975.659849@magrathea.basistech.com> Tom Emerson writes: > I have no objects: [...] Er, s/objects/objections/ :-) -- Tom Emerson Basis Technology Corp. Sr. Computational Linguist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From goodger@users.sourceforge.net Thu Mar 7 02:02:56 2002 From: goodger@users.sourceforge.net (David Goodger) Date: Wed, 06 Mar 2002 21:02:56 -0500 Subject: [I18n-sig] raw-unicode-escape encoding Message-ID: If this isn't the correct venue, please let me know. (The right people seem to be hanging around.) I've come across something strange while adding some Unicode characters to the output generated by the Docutils projects (see my signature for URLs). I want to get 7-bit ASCII output for the test suite, but I want to keep newlines, so I'm using the 'raw-unicode-escape' codec. I assumed that this codec would convert any character whose ord(char) > 127 to "\\uXXXX". This does not seem to be the case for ord(char) between 128 and 255 inclusive. Here's my default encoding:: >>> import sys >>> sys.getdefaultencoding() 'ascii' Here's a Unicode string that works:: >>> u = u'\u2020\u2021' >>> s = u.encode('raw-unicode-escape') >>> s '\\u2020\\u2021' >>> print s \u2020\u2021 That's what I want. When I run the string (not Unicode) through the codec again, there's no change (which is good):: >>> s.encode('raw-unicode-escape') '\\u2020\\u2021' Here's a Unicode string that doesn't work:: >>> u = u'\u00A7\u00B6' >>> s = u.encode('raw-unicode-escape') >>> s '\xa7\xb6' >>> print s §¶ (The last line contained the § and ¶ characters, probably corrupted.) Note that although the characters are ordinal > 127, they don't get converted into '\\uXXXX' escapes. It seems that the 'raw-unicode-escape' codec is assuming latin-1 for output. But my default encoding is 'ascii'; doesn't that mean 7-bit ASCII? How can I get 7-bit ascii on \u0080 through \u00FF? The 'unicode-escape' codec produces '\\xa7\\xb6', but it also converts newlines to '\\n', which I don't want. Running the string (now an 8-bit string, not 7-bit ASCII) through the codec again crashes:: >>> s.encode('raw-unicode-escape') Traceback (most recent call last): File "", line 1, in ? s.encode('raw-unicode-escape') UnicodeError: ASCII decoding error: ordinal not in range(128) Is this because ``s`` is being coerced into a Unicode string, and it fails because the default encoding is 'ascii' but ``s`` contains 8-bit characters? Do I even have my terminology straight? ;-) Is this a bug? I'll open a bug report if it is. Any workarounds? I get these results with Python 2.2, on US versions of both Win2K and MacOS 8.6. On Win2K I tried this from IDLE and from a Python session within GNU Emacs 20.7.1, and on MacOS the test was done using the PythonInterpreter app.; identical results all around. -- David Goodger goodger@users.sourceforge.net Open-source projects: - Python Docstring Processing System: http://docstring.sourceforge.net - reStructuredText: http://structuredtext.sourceforge.net - The Go Tools Project: http://gotools.sourceforge.net
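A helper of the kind being asked for above -- turn every character above 0x7F into a \uXXXX escape while leaving newlines and the rest of ASCII untouched -- is only a few lines; an illustrative sketch, not an existing codec:

    def ascii_unicode_escape(u):
        # Escape all non-ASCII characters as \uXXXX (or \UXXXXXXXX) and
        # pass ASCII characters, including newlines, through unchanged.
        parts = []
        for ch in u:
            o = ord(ch)
            if o < 0x80:
                parts.append(chr(o))
            elif o <= 0xffff:
                parts.append('\\u%04x' % o)
            else:
                parts.append('\\U%08x' % o)
        return ''.join(parts)

    >>> ascii_unicode_escape(u'\u00a7\u00b6\n\u2020')
    '\\u00a7\\u00b6\n\\u2020'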
From kajiyama@grad.sccs.chukyo-u.ac.jp Thu Mar 7 05:32:52 2002 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Thu, 7 Mar 2002 14:32:52 +0900 Subject: [I18n-sig] PEP 263 and Japanese native encodings In-Reply-To: (martin@v.loewis.de) References: Message-ID: <200203070532.OAA19231@dhcp225.grad.sccs.chukyo-u.ac.jp> martin@v.loewis.de (Martin v. Loewis) writes: | | An ASCII string compatible encoding (character set) is a superset of | the ASCII encoding (character set) in which octets from set AS are | only used to represent ASCII characters and not used in a series of | bytes that represent a multibyte character (such as Kanji and | Hiragana). The set AS is defined as | | AS = [\r\n\\'"] (newline, linefeed, backslash, single/double quote) | | The rationale here is that, under the PEP, non-ASCII text may only | appear in comments and strings. The lexer needs the ASCII-compatible | property to determine the end-of-line and end-of-string markers, | atleast in the phase-1 implementation. | | > o Are three Japanese native encodings EUC-JP, Shift_JIS and | > ISO-2022-JP "ASCII compatible"? | | EUC-JP certainly is; Absolutely. | ISO-2022-JP probably isn't. Right, ISO-2022-JP is not ASCII compatible in the sense of your definition. It uses " and ' to represent both ASCII and JIS X 0208-1983 (Kanji, Hiragana, and so on). For example, an ISO-2022-JP representation of u"\u3042" (the first character of Hiragana) contains a double quote mark: >>> u"\u3042".encode("japanese.iso-2022-jp") '\033$B$"\033(B' (FYI: the first escape sequence \033$B is the mark that says the following bytes represent a series of JIS X 0208-1983 characters. The second \033(B has a similar meaning for ASCII.) | I cannot see the problem with Shift_JIS; Shift_JIS is not ASCII compatible in a similar way. It uses backslash as a second byte. Here is another example: >>> u"\u8868".encode("japanese.sjis") '\225\\' This is a well-known and highly annoying problem of Python in Japanese Windows environment in which Shift_JIS is the system's default encoding. There is a patch for Python specifically fixing this problem. So, a definition of ASCII compatible encodings is very important since it may or may not accept Shift_JIS and ISO-2022-JP. I believe other Asian native encodings are in a similar situation with the two Japanese encodings. I don't want the PEP to exclude the two widely used Japanese encodings, especially Shift_JIS. I think the only acceptable requirement for an ASCII compatible encoding is the property that it can represent the first two lines of comments only by ASCII characters. Other requirements will not make the two Japanese encodings ASCII compatible. Regards, -- KAJIYAMA, Tamito From martin@v.loewis.de Thu Mar 7 07:38:50 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 07 Mar 2002 08:38:50 +0100 Subject: [I18n-sig] raw-unicode-escape encoding In-Reply-To: References: Message-ID: David Goodger writes: > Note that although the characters are ordinal > 127, they don't get > converted into '\\uXXXX' escapes. It seems that the > 'raw-unicode-escape' codec is assuming latin-1 for output. Correct. raw-unicode-escape brings the Unicode string into a form suitable for usage in Python source code. In Python source code, bytes in range(128,256) are treated as Latin-1, regardless of your system encoding. > But my default encoding is 'ascii'; doesn't that mean 7-bit ASCII? Your system encoding is (currently) irrelevant to how non-ASCII bytes are interpreted in Python source code; this will change under PEP 263. So I think the raw-unicode-escape codec should be changed to use hex escapes for this range. > Running the string (now an 8-bit string, not 7-bit ASCII) through the > codec again crashes:: > > >>> s.encode('raw-unicode-escape') > Traceback (most recent call last): > File "", line 1, in ?
> s.encode('raw-unicode-escape') > UnicodeError: ASCII decoding error: ordinal not in range(128) That's a pilot error: use .decode to decode from some byte string into a Unicode object. Better yet, use the unicode() builtin. > Is this because ``s`` is being coerced into a Unicode string, and it > fails because the default encoding is 'ascii' but ``s`` contains 8-bit > characters? Do I even have my terminology straight? ;-) Not in this case, no. > Is this a bug? I'll open a bug report if it is. Any workarounds? It is not really a bug. Does it cause problems for you? Regards, Martin From martin@v.loewis.de Thu Mar 7 07:54:17 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 07 Mar 2002 08:54:17 +0100 Subject: [I18n-sig] PEP 263 and Japanese native encodings In-Reply-To: <200203070532.OAA19231@dhcp225.grad.sccs.chukyo-u.ac.jp> References: <200203070532.OAA19231@dhcp225.grad.sccs.chukyo-u.ac.jp> Message-ID: Tamito KAJIYAMA writes: > Shift_JIS is not ASCII compatible in a similar way. It uses > backslash as a second byte. Here is another example: > > >>> u"\u8868".encode("japanese.sjis") > '\225\\' I see. I missed the part that the second byte can be in the range 0x40-0xFC. If I understand the problem correctly, the quotation characters (", ') can *not* appear as the second byte, right? Also, there is a total of 60 characters that end in byte \x5C; and those will only cause a problem if immediately followed by a quoting character. Do you think those 60 characters would cause a problem in real life? Or is that a problem that only exists on paper? > This is a well-known and highly annoying problem of Python in > Japanese Windows environment in which Shift_JIS is the system's > default encoding. There is a patch for Python specifically > fixing this problem. A patch specifically designed for Shift_JIS probably is not acceptable to Python. A patch solving the general problem (in some way) may be. > So, a definition of ASCII compatible encodings is very important > since it may or may not accept Shift_JIS and ISO-2022-JP. I > believe other Asian native encodings are in a similar situation > with the two Japanese encodings. All the EUC encodings (EUC-KR, EUC-ZH) should be ASCII compatible. BIG5 has the same problem as Shift_JIS. Dunno about GB2312. > I don't want the PEP to exclude the two widely used Japanese > encodings, especially Shift_JIS. Then you need to propose an implementation strategy, and that strategy should *not* be "special-case Shift_JIS", and it also should not be "use the C library's multibyte functions". In phase 2 of the PEP, both Shift_JIS and ISO-2022-JP will be acceptable source encodings - but we are in search of an implementation strategy for that as well. So anybody working on this would be encouraged to implement Phase 2 of the PEP. Until then, I suggest to live with the limitation that 60 characters cannot appear as the last character in a string. Regards, Martin From kajiyama@grad.sccs.chukyo-u.ac.jp Thu Mar 7 10:15:22 2002 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Thu, 7 Mar 2002 19:15:22 +0900 Subject: [I18n-sig] PEP 263 and Japanese native encodings In-Reply-To: (martin@v.loewis.de) References: Message-ID: <200203071015.TAA19842@nat-dhcp253.grad.sccs.chukyo-u.ac.jp> martin@v.loewis.de (Martin v. Loewis) writes: | | > Shift_JIS is not ASCII compatible in a similar way. It uses | > backslash as a second byte. Here is another example: | > | > >>> u"\u8868".encode("japanese.sjis") | > '\225\\' | | I see. 
I missed the part that the second byte can be in the range | 0x40-0xFC. If I understand the problem correctly, the quotation | characters (", ') can *not* appear as the second byte, right? Right. | Also, there is a total of 60 characters that end in byte \x5C; Not right. In JIS X 0208-1983 (6877 characters) there are 37 characters that end in byte \x5C. | and those will only cause a problem if immediately followed by | a quoting character. You've described only the condition of a syntax error; backslash as a second byte causes run-time problems even when it is followed by some characters. Let's consider the following example. The byte sequence shown below represents the content of a string literal in a Shift_JIS encoded source file. Its Unicode representation is u"\u88681\u53C2\u7167" ("See Table 1" in Japanese). 95 5C 31 8E 51 8F C6 Now, the second byte is backslash and thus the third byte ("1") gets backslash-escaped ("\1"). So, Python gives the string literal the following wrong value: 95 01 8E 51 8F C6 | Do you think those 60 characters would cause a problem in real life? Yes, absolutely. | Or is that a problem that only exists on paper? No. Suppose that you could not put common English words like "table", "reserve", "ten" and "paste" in string literals; such a restriction would not be acceptable at all, right? :-) | > This is a well-known and highly annoying problem of Python in | > Japanese Windows environment in which Shift_JIS is the system's | > default encoding. There is a patch for Python specifically | > fixing this problem. | | A patch specifically designed for Shift_JIS probably is not acceptable | to Python. A patch solving the general problem (in some way) may be. Yes, I think so too. The patch I mentioned is a localization patch, not intended to be merged into the Python core. | > I don't want the PEP to exclude the two widely used Japanese | > encodings, especially Shift_JIS. | | Then you need to propose an implementation strategy, and that strategy | should *not* be "special-case Shift_JIS", and it also should not be | "use the C library's multibyte functions". I've thought that Marc-Andre's intent for ASCII compatibility (i.e., ASCII compatible encodings should be able to represent the first two lines of comments only by ASCII characters) is good enough. It appears that his requirement has no problem with regard to the implementation strategy described in the PEP (revision 1.9) *and* Japanese encodings. IMHO, the ASCII compatibility simply should not impose other requirements. Regards, -- KAJIYAMA, Tamito
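A short interactive transcript reproduces the corruption described above (a sketch under Python 2.2; eval stands in here for what the tokenizer does with the bytes of a Shift_JIS string literal):

    >>> raw = '\x95\x5c\x31\x8e\x51\x8f\xc6'   # Shift_JIS bytes of the literal above
    >>> eval('"' + raw + '"')                  # \x5c followed by "1" is read as the escape \1
    '\x95\x01\x8eQ\x8f\xc6'

The intended bytes 95 5C 31 come out as 95 01: the backslash trail byte and the following "1" have collapsed into the octal escape \001.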
From mal@lemburg.com Thu Mar 7 10:52:36 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 07 Mar 2002 11:52:36 +0100 Subject: [I18n-sig] raw-unicode-escape encoding References: Message-ID: <3C874674.AD5F62C0@lemburg.com> David Goodger wrote: > > If this isn't the correct venue, please let me know. (The right people > seem to be hanging around.) > > I've come across something strange while adding some Unicode > characters to the output generated by the Docutils projects (see my > signature for URLs). I want to get 7-bit ASCII output for the test > suite, but I want to keep newlines, so I'm using the > 'raw-unicode-escape' codec. I assumed that this codec would convert > any character whose ord(char) > 127 to "\\uXXXX". This does not seem > to be the case for ord(char) between 128 and 255 inclusive. > > Here's my default encoding:: > > >>> import sys > >>> sys.getdefaultencoding() > 'ascii' > > Here's a Unicode string that works:: > > >>> u = u'\u2020\u2021' > >>> s = u.encode('raw-unicode-escape') > >>> s > '\\u2020\\u2021' > >>> print s > \u2020\u2021 > > That's what I want. When I run the string (not Unicode) through the > codec again, there's no change (which is good):: > > >>> s.encode('raw-unicode-escape') > '\\u2020\\u2021' > > Here's a Unicode string that doesn't work:: > > >>> u = u'\u00A7\u00B6' > >>> s = u.encode('raw-unicode-escape') > >>> s > '\xa7\xb6' > >>> print s > §¶ > > (The last line contained the § and ¶ characters, probably > corrupted.) > > Note that although the characters are ordinal > 127, they don't get > converted into '\\uXXXX' escapes. It seems that the > 'raw-unicode-escape' codec is assuming latin-1 for output. But my > default encoding is 'ascii'; doesn't that mean 7-bit ASCII? How can I > get 7-bit ascii on \u0080 through \u00FF? The unicode-escape codecs (raw and normal) both extend the Latin-1 encoding with a few escaped characters. The difference between the two is mainly in the way they decode escapes; the raw codec only unescapes a small subset of escapes which the normal codec can handle. Both codecs are mainly intended to encode/decode Unicode literals in Python source code, so their functionality may differ a bit from what you have in mind. > The 'unicode-escape' codec produces '\\xa7\\xb6', but it also converts > newlines to '\\n', which I don't want. > > Running the string (now an 8-bit string, not 7-bit ASCII) through the > codec again crashes:: > > >>> s.encode('raw-unicode-escape') > Traceback (most recent call last): > File "", line 1, in ? > s.encode('raw-unicode-escape') > UnicodeError: ASCII decoding error: ordinal not in range(128) > > Is this because ``s`` is being coerced into a Unicode string, and it > fails because the default encoding is 'ascii' but ``s`` contains 8-bit > characters? Do I even have my terminology straight? ;-) > > Is this a bug? I'll open a bug report if it is. Any workarounds? You should first get a feeling for what kind of mapping you expect, i.e. which characters should be escaped or not. > I get these results with Python 2.2, on US versions of both Win2K and > MacOS 8.6. On Win2K I tried this from IDLE and from a Python session > within GNU Emacs 20.7.1, and on MacOS the test was done using the > PythonInterpreter app.; identical results all around. That's intended :-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From mal@lemburg.com Thu Mar 7 11:01:25 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 07 Mar 2002 12:01:25 +0100 Subject: [I18n-sig] PEP 263 and Japanese native encodings References: <200203070532.OAA19231@dhcp225.grad.sccs.chukyo-u.ac.jp> Message-ID: <3C874885.319595D4@lemburg.com> Tamito KAJIYAMA wrote: > > I don't want the PEP to exclude the two widely used Japanese > encodings, especially Shift_JIS. I think the only acceptable > requirement for an ASCII compatible encoding is the property > that it can represent the first two lines of comments only by > ASCII characters. Other requirements will not make the two > Japanese encodings ASCII compatible. +1, I'll add a note to the PEP about this.
The whole ASCII business is really only about the first two lines and that's it. In phase 2, the complete file will be decoded into Unicode, so the problems you now see with backslashes as final character in string literals (caused by Shift_JIS) will go away. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From mal@lemburg.com Thu Mar 7 11:22:13 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 07 Mar 2002 12:22:13 +0100 Subject: [I18n-sig] PEP 263 and Japanese native encodings References: <200203071015.TAA19842@nat-dhcp253.grad.sccs.chukyo-u.ac.jp> Message-ID: <3C874D65.A58DA645@lemburg.com> Tamito KAJIYAMA wrote: > > I've thought that Marc-Andre's intent for ASCII compatibility > (i.e., ASCII compatible encodings should be able to represent > the first two lines of comments only by ASCII characters) is > good enough. It appears that his requirement has no problem > with regard to the implementation stategy described in the PEP > (revision 1.9) *and* Japanese encodings. IMHO, the ASCII > compatibility simply should not impose other requirements. I've updated the PEP to clarify this. Basically it should be possible to do: file = open('script.py') line1 = file.readline() line2 = file.readline() # check line1 and line2 for the RE from the PEP # push the two lines back onto the file stream or handle this # situation using a line buffer. Nothing complicated, really. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From tim.one@comcast.net Thu Mar 7 17:25:30 2002 From: tim.one@comcast.net (Tim Peters) Date: Thu, 07 Mar 2002 12:25:30 -0500 Subject: [I18n-sig] PEP 263 and Japanese native encodings In-Reply-To: <3C874D65.A58DA645@lemburg.com> Message-ID: [M.-A. Lemburg] > I've updated the PEP to clarify this. Basically it should be > possible to do: > > file = open('script.py') > line1 = file.readline() > line2 = file.readline() > > # check line1 and line2 for the RE from the PEP > > # push the two lines back onto the file stream or handle this > # situation using a line buffer. > > Nothing complicated, really. A complication is that so long as Python uses C stdio to read files, there's no guarantee that "funny bytes" can be gotten from files opened in text mode. The inability to read chr(26) from a text-mode file on Windows is an infamous example of that: >>> f = open('oops', 'wb') >>> f.write('x' * 100 + chr(26) + 'x' * 100) >>> f.close() >>> f = open('oops') >>> len(f.read()) # chr(26) acts like EOF on Windows in text mode 100 >>> OTOH, if you open in binary mode instead, you have to wrestle with the platform's line-end conventions. the-devil-is-in-the-details-ly y'rs - tim From mal@lemburg.com Thu Mar 7 18:09:58 2002 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 07 Mar 2002 19:09:58 +0100 Subject: [I18n-sig] PEP 263 and Japanese native encodings References: Message-ID: <3C87ACF6.6BDE7924@lemburg.com> Tim Peters wrote: > > [M.-A. Lemburg] > > I've updated the PEP to clarify this. 
Basically it should be > > possible to do: > > > > file = open('script.py') > > line1 = file.readline() > > line2 = file.readline() > > > > # check line1 and line2 for the RE from the PEP > > > > # push the two lines back onto the file stream or handle this > > # situation using a line buffer. > > > > Nothing complicated, really. > > A complication is that so long as Python uses C stdio to read files, there's > no guarantee that "funny bytes" can be gotten from files opened in text > mode. The inability to read chr(26) from a text-mode file on Windows is an > infamous example of that: > > >>> f = open('oops', 'wb') > >>> f.write('x' * 100 + chr(26) + 'x' * 100) > >>> f.close() > >>> f = open('oops') > >>> len(f.read()) # chr(26) acts like EOF on Windows in text mode > 100 > >>> Pass that string to a teletex machine and you'll get the same result... Hmm, this should tell us something ;-) > OTOH, if you open in binary mode instead, you have to wrestle with the > platform's line-end conventions. Martin's patch leaves these "minor" issues to the tokenizer and that's good :-) I only wanted to give a very simple example of what the original idea was when I added "ASCII compatible encoding" to the PEP -- basically to simplify the coding parsing part. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/ From martin@v.loewis.de Thu Mar 7 19:42:26 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 07 Mar 2002 20:42:26 +0100 Subject: [I18n-sig] PEP 263 and Japanese native encodings In-Reply-To: <200203071015.TAA19842@nat-dhcp253.grad.sccs.chukyo-u.ac.jp> References: <200203071015.TAA19842@nat-dhcp253.grad.sccs.chukyo-u.ac.jp> Message-ID: Tamito KAJIYAMA writes: > You've described only the condition of a syntax error; backslash > as a second byte causes run-time problems even when it is > followed by some characters. I see. In phase 1 of the PEP, this problem will only occur for byte strings. For Unicode literals, those problems will not happen: Python will decode the string before escape characters are considered, so the problem won't occur in Unicode strings. For byte strings, it won't bring any changes. Your best bet is to declare them as raw. In Phase 2, the encoding will be applied to all strings. So people that want Japanese strings should use Unicode literals. > | Or is that a problem that only exists on paper? > > No. Suppose that you could not put common English words like > "table", "reserve", "ten" and "paste" in string literals; such > a restriction would not be acceptable at all, right? :-) If the restriction was that you cannot have such a word as the last word of a string (but need some spacing character after it), I think the restriction might be acceptable - although admittedly arbitrary. Also, notice that the restriction is only for byte strings. > I've thought that Marc-Andre's intent for ASCII compatibility > (i.e., ASCII compatible encodings should be able to represent > the first two lines of comments only by ASCII characters) is > good enough. It appears that his requirement has no problem > with regard to the implementation strategy described in the PEP > (revision 1.9) *and* Japanese encodings. IMHO, the ASCII > compatibility simply should not impose other requirements. That sounds nice on paper (or rather, in your email message); it simply does not work in practice.
For it to work, the lexer needs to operate on Unicode characters instead of bytes. Such a change is quite complex, and cannot be carried out until phase 2 of the PEP. Anybody interested is encouraged to discuss implementation strategies on this list. I know that I probably can't find the time to implement that part before Python 2.3. Also, I'd think that getting the Japanese codecs and other CJK codecs into Python would be a prerequisite for implementing phase 2. Regards, Martin From martin@v.loewis.de Thu Mar 7 19:48:11 2002 From: martin@v.loewis.de (Martin v. Loewis) Date: 07 Mar 2002 20:48:11 +0100 Subject: [I18n-sig] PEP 263 and Japanese native encodings In-Reply-To: <3C87ACF6.6BDE7924@lemburg.com> References: <3C87ACF6.6BDE7924@lemburg.com> Message-ID: "M.-A. Lemburg" writes: > Martin's patch leaves these "minor" issues to the tokenizer > and that's good :-) > > I only wanted to give a very simple > example of what the original idea was when I added "ASCII > compatible encoding" to the PEP -- basically to simplify > the coding parsing part. In my implementation, the "ASCII superset" restriction is stronger, though: the tokenizer needs to find the end of a string without decoding it. That is not possible for some of the encodings that pass your "ASCII superset" test. Regards, Martin From perky@fallin.lv Thu Mar 7 22:59:37 2002 From: perky@fallin.lv (Hye-Shik Chang) Date: Fri, 8 Mar 2002 07:59:37 +0900 Subject: [I18n-sig] KoreanCodecs 2.0 released Message-ID: <20020308075937.A24873@fallin.lv> Hello! I've released KoreanCodecs 2.0. It is reimplemented based on JapaneseCodecs 1.4. Supported Charsets: euc-kr (aliases: ksc5601, ksx1001) cp949 (aliases: uhc, ms949) iso-2022-kr johab unijohab qwerty2bul (aliases: 2bul) Additional Utility: korean.hangul : Korean character analyzer Some of those charsets doesn't have StreamWriter/Reader yet. And, it has only pure python implementation now. (Sorry (: I'll add C impl. soon.) http://sourceforge.net/projects/koco If you use FreeBSD, just do # cd /usr/ports/korean/pycodec # make install clean Ciao. -- Hye-Shik Chang Yonsei University, Seoul From goodger@users.sourceforge.net Fri Mar 8 02:27:09 2002 From: goodger@users.sourceforge.net (David Goodger) Date: Thu, 07 Mar 2002 21:27:09 -0500 Subject: [I18n-sig] raw-unicode-escape encoding In-Reply-To: Message-ID: [David Goodger] > > Note that although the characters are ordinal > 127, they don't > > get converted into '\\uXXXX' escapes. It seems that the > > 'raw-unicode-escape' codec is assuming latin-1 for output. [Martin v. Loewis] > Correct. raw-unicode-escape brings the Unicode string into a form > suitable for usage in Python source code. In Python source code, > bytes in range(128,256) are treated as Latin-1, regardless of your > system encoding. That seems contrary to the Python Reference Manual, chapter 2, `Lexical analysis`__: Future compatibility note: It may be tempting to assume that the character set for 8-bit characters is ISO Latin-1 ... ... it is unwise to assume either Latin-1 or UTF-8, even though the current implementation appears to favor Latin-1. This applies both to the source character set and the run-time character set. __ http://www.python.org/doc/current/ref/lexical.html "a form suitable for usage in Python source code": that's exactly what I want. Cross-platform compatibility requires 7-bit ASCII source code. The raw-unicode-escape codec produces 8-bit Latin-1, which doesn't survive the trip to MacOS. > > But my default encoding is 'ascii'; doesn't that mean 7-bit ASCII? 
> I think the raw-unicode-escape codec should be changed to use hex > escapes for this range. +1. But '\xa7' or '\u00a7' escapes? Using the former (which the unicode-escape codec currently does) assumes Latin-1 as the native encoding. Hex escapes ('\x##') know nothing about the encoding; they just produce raw bytes. Shouldn't unicode escapes always be of the '\u####' variety? For that matter, shouldn't the internal representation distinguish? :: >>> u'\u2020\u00a7' u'\u2020\xa7' If I'm not mistaken, '\xa7' is *not* the same as '\u00a7'. > > Is this a bug? I'll open a bug report if it is. Any workarounds? > > It is not really a bug. Does it cause problems for you? Yes. In the Docutils test suite, most of the tests are data-driven from (input, expected output) pairs. Here's an example:: # input: ["""\ [#autolabel]_ .. [#autolabel] text """, # expected output (indented pseudo-xml for readability): """\ 1