From mal@lemburg.com Sat Apr 1 18:43:05 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 01 Apr 2000 20:43:05 +0200 Subject: [I18n-sig] Test Suite for the Unicode codecs References: <38E52399.19220D0@lemburg.com> <38e5d0b2.10984324@post.demon.co.uk> Message-ID: <38E64339.CC0B0FB@lemburg.com> Andy Robinson wrote: > > On Sat, 01 Apr 2000 00:15:53 +0200, you wrote: > > >I would like to add some more testing to the mapping codecs > >in the Python encodings package. Right now I can only test > >for round-trips of lower character ordinal ranges and even > >those tests fail for a couple of encodings. > > > >Does anyone have access to some reference test suite for > >these mappings ? The mapping codec is probably not the > >cause for these errors. Perhaps the maps themselves > >aren't of high enough quality or maybe some mappings > >just cannot provide round-trip safety... > > > I can't give specifics off the top of my head, but mappings not giving > round trips is quite common, especially with corporate character sets. > We always handled this by framing questions differently and saying > 'what is the subset of a map that gives a full round-trip, and which > bits of my data fall outside it', and trying to get some printed code > chart to show the results; then you can quickly see if the results > make sense. If you have that knowledge, you could then build > assertions into a python-only test suite. That would be great of course... but how do we get native script readers for all those code pages ? > For testing, I think the best approach is to compare output to another > well-known mapping utility. The most convenient I know of is > uniconv.exe from http://www.basistech.com/ - not Open Source and > Windows-only, but it is a straightforward goal for us to write a > uniconv.py that perfectly mimics its behaviour. Ok, I've just downloaded it (it's a bit hidden as Demo of their C++ Unicode class lib) and will give it a try next week. > I'm in the middle of a 'work crisis' at the moment, and I know I'm not > really pulling my weight. Does anyone have a few hours to help out > with testing? If so I could outline the kind of test program that > would help us quickly validate the existing mappings, and help with > any new ones. > > Marc-Andre, do you have any preferences for where a test suite and > bunch of add-on tools live? Do you want something which fits into the > standard distribution, or can we handle it outside? Hmm, tests for the builtin codecs should live in Lib/test with the output in Lib/test/output. Tools etc. are probably best placed somewhere into the Tools/ directory (e.g. the gencodec.py script lives in Tools/scripts). Perhaps we need a separate Tools/unicode if there are going to many different scripts... -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Mon Apr 3 09:31:45 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 03 Apr 2000 10:31:45 +0200 Subject: [I18n-sig] Test Suite for the Unicode codecs References: <38E52399.19220D0@lemburg.com> <38e5d0b2.10984324@post.demon.co.uk> <38E64339.CC0B0FB@lemburg.com> <007201bf9cb4$74b4b730$01ac2ac0@boulder> Message-ID: <38E856F1.FBF66AAE@lemburg.com> [CC:ing to i18n-sig -- hope this is ok] Andy Robinson wrote: > > > > > That would be great of course... but how do we get native > > script readers for all those code pages ? > I suspect we won't. 
Unicode fonts with all 45k glyphs are not exactly > common; there is one, but it was full of holes last time I checked. There > are two approaches to viewing the CJK ones: > 1. Use IE5 or Netscape. IE5 comes with lots of font packs for most > languages, especially the Asian ones. One makes up preformatted text files > designed to mirror the vendor or standards' organisation's code chart, puts > it through a round trip, and tells the browser to display it - possibly side > by side with the original. If you feel clever, you can use tables and > highlight things which fail the round trip. Of course, this depends on the > fonts you have installed, and these vary > > 2. (Some months off) Use Acrobat 4.0 and the Language Packs from Adobe. > These are the first really platform-independent vewing technology; I have > wrapped up the Japanese one in ReportLab and used it very successfully at > Fidelity Investments last year to prove round trips from AS400, but have to > rewrite that code as it was done in-house for them. I write a loop to print > about fifteen pages of charts which are laid out exactly like the relevant > Appendix in "CJKV Information Processing", run it through some > transformations, then sit staring at all 6879 glyphs for a couple of hours. > Sometimes, while bored, I did plots to show how code points mapped from one > encoding to another; we had to reverse engineer an AS400 encoding. Adobe's > CID fonts include their own mapping tables and conversion at the PostScript > level; If I ask for the font "Mincho-UTF8", I get it encoded that way and > can feed it UTF8 strings; if I as for the font "Mincho-SJIS" I get a > Shift-JIS encoded font. This looks like an awful lot of work. Isn't there some better way to get this done ? (There might be a problem due to different composition of characters, but I think we could handle it by implementing the normalization algorithmn for Unicode.) > This is actually my main interest in the Unicode stuff; to build a global > reporting engine, we have to handle data in any encoding and feed it to the > font engine in an encoding the font can handle. > > The great thing about PDF code charts is that they are immutable and not > dependent on your PC setup. > > > > > > For testing, I think the best approach is to compare output to another > > > well-known mapping utility. The most convenient I know of is > > > uniconv.exe from http://www.basistech.com/ - not Open Source and > > > Windows-only, but it is a straightforward goal for us to write a > > > uniconv.py that perfectly mimics its behaviour. > > > > Ok, I've just downloaded it (it's a bit hidden as Demo of > > their C++ Unicode class lib) and will give it a try next week. > > > > > Marc-Andre, do you have any preferences for where a test suite and > > > bunch of add-on tools live? Do you want something which fits into the > > > standard distribution, or can we handle it outside? > > > > Hmm, tests for the builtin codecs should live in Lib/test > > with the output in Lib/test/output. Tools etc. are probably > > best placed somewhere into the Tools/ directory (e.g. the > > gencodec.py script lives in Tools/scripts). Perhaps we need > > a separate Tools/unicode if there are going to many different > > scripts... > I must admit, I was thinking of an actual web server test framework which > kept a database of sample text files, did round trip tests on demand, and > could hand out HTML and PDF files to anyone who asked - probably a bit much > for the standard Python library. 
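(For the single-byte maps, the kind of round-trip check Andy describes earlier in the thread can be sketched in a few lines. This is purely illustrative, not an existing test, and roundtrip_subset is a hypothetical helper name:)

def roundtrip_subset(encoding):
    # Which byte values survive a bytes -> Unicode -> bytes round trip
    # through a single-byte codec?  Returns the ordinals that round-trip
    # and those that do not (undefined or asymmetric mappings).
    ok, failed = [], []
    for i in range(256):
        c = chr(i)
        try:
            roundtripped = unicode(c, encoding).encode(encoding)
        except UnicodeError:
            failed.append(i)
            continue
        if roundtripped == c:
            ok.append(i)
        else:
            failed.append(i)
    return ok, failed

# e.g. ok, failed = roundtrip_subset('cp1251')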
One needs knowledge of each individual > code page and some quite devious test files to test out double-byte codecs. > For single-byte, we need a reliable way to see all the code points before we > dare rely on full round trip tests and assertions. I think we need some > separate project on starship, sourceforge or wherever to mess around with > this stuff, and then you can decide what is worth including in the main > distribution. Ok. For now I'll leave the current cp codecs in place and simply wait for people reporting bugs in the mapping tables... -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Wed Apr 5 17:24:56 2000 From: andy@reportlab.com (Andy Robinson) Date: Wed, 5 Apr 2000 17:24:56 +0100 Subject: [I18n-sig] Unicode Tutorial (Slow) Progress Message-ID: I'm part way through a tutorial at long last. My own work is pretty poor so far, but it DOES include Marc-Andre's 'console session' demos at the bottom which show the current usage. http://www.reportlab.com/i18n/python_unicode_tutorial.html If anyone can suggest topics I should cover (apart from the obvious one of using every new features at least once) or simple relevant examples, I'll try to work them in over the coming weeks. - Andy Robinson From guido@python.org Wed Apr 5 18:55:22 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 05 Apr 2000 13:55:22 -0400 Subject: [I18n-sig] Unicode Tutorial (Slow) Progress In-Reply-To: Your message of "Wed, 05 Apr 2000 17:24:56 BST." References: Message-ID: <200004051755.NAA16668@eric.cnri.reston.va.us> > I'm part way through a tutorial at long last. My own work is pretty poor so > far, but it DOES include Marc-Andre's 'console session' demos at the bottom > which show the current usage. > > http://www.reportlab.com/i18n/python_unicode_tutorial.html Thanks! Added to the i18n-sig home page *and* to the Python 1.6 page. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Wed Apr 5 19:41:30 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 05 Apr 2000 20:41:30 +0200 Subject: [I18n-sig] Unicode Tutorial (Slow) Progress References: Message-ID: <38EB88DA.9D391700@lemburg.com> Andy Robinson wrote: > > I'm part way through a tutorial at long last. My own work is pretty poor so > far, but it DOES include Marc-Andre's 'console session' demos at the bottom > which show the current usage. > > http://www.reportlab.com/i18n/python_unicode_tutorial.html > > If anyone can suggest topics I should cover (apart from the obvious one of > using every new features at least once) or simple relevant examples, I'll > try to work them in over the coming weeks. Looks great... a bit much exposure, maybe ;-) Note that the stackable stream example will need a small bit of updating (the return is wrong -- the API was changed since I programmed the example): import codecs,sys # Convert Unicode -> UTF-8 (e,d,sr,sw) = codecs.lookup('utf-8') unicode_to_utf8 = sw(sys.stdout) # Convert Latin-1 -> Unicode during .write (e,d,sr,sw) = codecs.lookup('latin-1') class StreamRewriter(codecs.StreamWriter): encode = e decode = d def write(self,object): """ Writes the object's contents encoded to self.stream and returns the number of bytes written. 
""" data,consumed = self.decode(object,self.errors) self.stream.write(data) latin1_to_utf8 = StreamRewriter(unicode_to_utf8) # Now install sys.stdout = latin1_to_utf8 # All subsequent prints will output Latin-1 strings using UTF-8 # characters... print 'Hello World !' print 'Héllò Wörld !' print 'ÄÖÜäöüß' -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Apr 5 19:58:44 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 05 Apr 2000 20:58:44 +0200 Subject: [I18n-sig] Unicode Tutorial (Slow) Progress References: <38EB88DA.9D391700@lemburg.com> Message-ID: <38EB8CE4.8CB8914@lemburg.com> I just noted a bug that appears on your page: >>> a.encode('ascii', 'ignore') # turn to zero and continue 'Andr\000' This should really give 'Andr' -- 'ignore' will simply ignore illegal input characters. I will submit a patch for this with the next Unicode patch set. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido@python.org Mon Apr 10 15:01:58 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 10 Apr 2000 10:01:58 -0400 Subject: [I18n-sig] "takeuchi": a unicode string on IDLE shell Message-ID: <200004101401.KAA00238@eric.cnri.reston.va.us> Can anyone answer this? I can reproduce the output side of this, and I believe he's right about the input side. Where should Python migrate with respect to Unicode input? I think that what Takeuchi is getting is actually better than in Pythonwin or command line (where he gets Shift-JIS)... --Guido van Rossum (home page: http://www.python.org/~guido/) ------- Forwarded Message Date: Mon, 10 Apr 2000 22:49:45 +0900 From: "takeuchi" To: Subject: a unicode string on IDLE shell Dear Guido, I plaied your latest CPython(Python1.6a1) on Win98 Japanese version, and found a strange IDLE shell behavior. I'm not sure this is a bug or feacher, so I report my story anyway. When typing a Japanese string on IDLE shell with IME , Tk8.3 seems to convert it to a UTF-8 representation. Unfortunatly Python does not know this, it is dealt with an ordinary string. >>> s = raw_input(">>>") Type Japanese characters with IME for example $B$"(B (This is the first character of Japanese alphabet, Hiragana) >>> s '\343\201\202' # UTF-8 encoded >>> print s $B$"(B # A proper griph is appear on the screen Print statement on IDLE shell works fine with a UTF-8 encoded string,however,slice operation or len() does not work. # I know this is a right result So I have to convert this string with unicode(). >>> u = unicode(s) >>> u u'\u3042' >>> print u $B$"(B # A proper griph is appear on the screen Do you think this convertion is unconfortable ? I think this behavior is inconsistant with command line Python and PythonWin. If I want the same result on command line Python shell or PythonWin shell, I have to code as follows; >>> s = raw_input(">>>") Type Japanese characters with IME for example $B$"(B >>>s '\202\240' # Shift-JIS encoded >>> print s $B$"(B # A proper griph is appear on the screen >>> u = unicode(s,"mbcs") # if I use unicode(s) then UnicodeError is raised ! >>>print u.encode("mbcs") # if I use print u then wrong griph is appear $B$"(B # A proper griph is appear on the screen This difference is confusing !! 
I do not have the best solution for this annoyance, I hope at least IDLE shell and PythonWin shell would have the same behavior . Thank you for reading. Best Regards, takeuchi ------- End of Forwarded Message From guido@python.org Mon Apr 10 15:20:34 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 10 Apr 2000 10:20:34 -0400 Subject: [I18n-sig] Re: a unicode string on IDLE shell In-Reply-To: Your message of "Mon, 10 Apr 2000 22:49:45 +0900." <002601bfa2f3$a1f67720$8f133c81@pflab.ecl.ntt.co.jp> References: <002601bfa2f3$a1f67720$8f133c81@pflab.ecl.ntt.co.jp> Message-ID: <200004101420.KAA00291@eric.cnri.reston.va.us> > Dear Guido, > > I plaied your latest CPython(Python1.6a1) on Win98 Japanese version, > and found a strange IDLE shell behavior. > > I'm not sure this is a bug or feacher, so I report my story anyway. > > When typing a Japanese string on IDLE shell with IME , > Tk8.3 seems to convert it to a UTF-8 representation. > Unfortunatly Python does not know this, > it is dealt with an ordinary string. > > >>> s = raw_input(">>>") > Type Japanese characters with IME > for example $B$"(B > (This is the first character of Japanese alphabet, Hiragana) > >>> s > '\343\201\202' # UTF-8 encoded > >>> print s > $B$"(B # A proper griph is appear on the screen > > Print statement on IDLE shell works fine with a UTF-8 encoded > string,however,slice operation or len() does not work. > # I know this is a right result > > So I have to convert this string with unicode(). > > >>> u = unicode(s) > >>> u > u'\u3042' > >>> print u > $B$"(B # A proper griph is appear on the screen > > Do you think this convertion is unconfortable ? > > I think this behavior is inconsistant with command line Python > and PythonWin. > > If I want the same result on command line Python shell or PythonWin shell, > I have to code as follows; > >>> s = raw_input(">>>") > Type Japanese characters with IME > for example $B$"(B > >>>s > '\202\240' # Shift-JIS encoded > >>> print s > $B$"(B # A proper griph is appear on the screen > >>> u = unicode(s,"mbcs") # if I use unicode(s) then UnicodeError is raised > ! > >>>print u.encode("mbcs") # if I use print u then wrong griph is appear > $B$"(B # A proper griph is appear on the screen > > This difference is confusing !! > I do not have the best solution for this annoyance, I hope at least IDLE > shell and PythonWin > shell would have the same behavior . > > Thank you for reading. > > Best Regards, > > takeuchi Dear Takeuchi, This is a feature. Tcl/Tk uses UTF-8 to encode Unicode characters throughout. This perfectly matches the Python 1.6 default use of UTF-8 when 8-bit strings are converted to Unicode. If you want to manipulate Unicode strings, you have to use unicode() to convert them to Unicode string objects. I may change IDLE so that if you enter Unicode, it will automatically return a Unicode string. This may break other code though. Regarding incompatibilities with Pythonwin and command line Python: note that there you get a different input encoding, but len() and slicing are also broken until you convert to Unicode using the correct encoding! The input encoding is simply different. I believe this will always be an issue (but there should be a way to determine what the input encoding should be!). If you have more questions about this, please subscribe to the i18n-sig mailing list (http://www.python.org/sigs/i18n-sig/) -- this is where issues like this are discussed. I'm cc'ing this there. 
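(To make the two cases concrete, a minimal sketch of the conversion step being described -- unicode_input is a hypothetical helper, not part of IDLE or the library, and the 'mbcs' codec is only available in the Windows build:)

def unicode_input(prompt, shell_encoding):
    # raw_input() returns 8-bit bytes in whatever encoding the hosting
    # shell uses; the caller has to know which one that is -- exactly
    # the open question raised above.
    s = raw_input(prompt)
    return unicode(s, shell_encoding)

# Under IDLE, Tk delivers UTF-8 bytes:
#     unicode_input(">>>", "utf-8")   ->  u'\u3042' for HIRAGANA A
# In a Japanese DOS box or Pythonwin, the bytes are Shift-JIS:
#     unicode_input(">>>", "mbcs")    ->  u'\u3042' for the same keystrokes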
--Guido van Rossum (home page: http://www.python.org/~guido/) From dae_alt3@juno.com Mon Apr 10 20:24:22 2000 From: dae_alt3@juno.com (Doug Edmunds) Date: Mon, 10 Apr 2000 12:24:22 -0700 Subject: [I18n-sig] P1.6a1- Win98 - unicode issues Message-ID: <20000410.122423.-454791.0.dae_alt3@juno.com> py-ver: 1.6a os: Win98 I am able to copy/paste Cyrillic unicode from Internet Explorer 5 into IDLE, without losing fonts. Text appears identical to original. I can wrap the text with a print statement "" and it will print the string. However writing into IDLE is a problem: if I switch to the Cyrillic keyboard layout in IDLE, the fonts change to something, but it is not Cyrillic (perhaps upper ascii??). In contrast, WIn98 Wordpad (which will read/ write unicode) associates the keyboard to the 'script' of the font. Selecting Russian keyboard automatically switches from Courier New (Western) to Courier New (Cyrillic). Can this operability be extended to IDLE? Without keyboard access If not, is there a way to change which font set appears when the Russian (or other foreign) keyboard is selected? Ideally I would write everything in unicode just as written, using WordPad (or Outlook Express, Juno, etc.) mixing the languages thusly ( simple 1 line script) RussianText.py print '???????? ?????' but IDLE won't read unicode scripts. d.edmunds 10 April 2000 example texts from internet - 1 original encoding was Win1251, IE5 browser converts to unicode, prints in IDLE and Wordpad (original Win1251 lost) ???????? ????? ??????? ? ??????????? ???????? ??????????????. ??? ????? ??? ?????? ??????? ?? ????? ????? ?????? ?? ????????????? ??????? ? ??????. example text original encoded as KOI8-r, which IE5 browser again turned into unicode. (original KOI8-r lost) ?????? ???? ???????? ???????? ??? ????? ? ?????? ??? ????? ???????? ????????? ?????. -- Kindly ignore the remainder (Juno ad) which follows -- ________________________________________________________________ YOU'RE PAYING TOO MUCH FOR THE INTERNET! Juno now offers FREE Internet Access! Try it today - there's no risk! For your FREE software, visit: http://dl.www.juno.com/get/tagj. From dae_alt3@juno.com Mon Apr 10 20:28:46 2000 From: dae_alt3@juno.com (Doug Edmunds) Date: Mon, 10 Apr 2000 12:28:46 -0700 Subject: [I18n-sig] P1.6a1- Win98 - unicode issues Message-ID: <20000410.122846.-454791.1.dae_alt3@juno.com> Apparently my efforts to send unicode via Juno failed. d.edmunds > 10 April 2000 > > example texts from internet - > 1 original encoding was Win1251, > IE5 browser converts to unicode, > prints in IDLE and Wordpad > (original Win1251 lost) > > ???????? ????? ??????? ? ??????????? ???????? ??????????????. ??? > ????? > ??? ?????? ??????? ?? ????? ????? ?????? ?? ????????????? ??????? ? > ??????. > > example text original encoded as KOI8-r, which IE5 browser > again turned into unicode. (original KOI8-r lost) > > ?????? ???? ???????? ???????? ??? ????? ? ?????? ??? ????? ???????? > ????????? ?????. > ________________________________________________________________ YOU'RE PAYING TOO MUCH FOR THE INTERNET! Juno now offers FREE Internet Access! Try it today - there's no risk! For your FREE software, visit: http://dl.www.juno.com/get/tagj. From guido@python.org Mon Apr 10 20:34:34 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 10 Apr 2000 15:34:34 -0400 Subject: [I18n-sig] P1.6a1- Win98 - unicode issues In-Reply-To: Your message of "Mon, 10 Apr 2000 12:24:22 PDT." 
<20000410.122423.-454791.0.dae_alt3@juno.com> References: <20000410.122423.-454791.0.dae_alt3@juno.com> Message-ID: <200004101934.PAA03031@eric.cnri.reston.va.us> > I am able to copy/paste Cyrillic unicode > from Internet Explorer 5 > into IDLE, without losing fonts. Text appears > identical to original. I can wrap the text with > a print statement "" > and it will print the string. > > However writing into IDLE is a problem: > if I switch to the Cyrillic keyboard layout > in IDLE, the fonts change to something, but it is not > Cyrillic (perhaps upper ascii??). > > In contrast, WIn98 Wordpad (which will read/ > write unicode) associates the keyboard to the > 'script' of the font. Selecting Russian keyboard > automatically switches from Courier New (Western) > to Courier New (Cyrillic). > > Can this operability be extended to IDLE? > Without keyboard access > If not, is there a way to change which font set > appears when the Russian (or other foreign) keyboard is > selected? > > > Ideally I would write everything in unicode > just as written, using WordPad (or Outlook Express, Juno, etc.) > mixing the languages thusly ( simple 1 line script) > > RussianText.py > print '???????? ?????' > > but IDLE won't read unicode scripts. Doug, Can you see if Tcl/Tk version 8.2 or 8.3 (downloadable from dev.scriptics.com) does what you want? IDLE is implemented using Tcl/Tk. In Python 1.6a1, I'm using Tcl/Tk 8.3.0, but in 1.6a2 I will go back to Tck/Tk 8.2.3, which appears more stable. Tcl/Tk's "wish" application supports Unicode. If it supports your Cyrillic input method, the problem is with Python's interface to Tcl/Tk. If on the other hand the problem is the same with Tcl/Tk, there's nothing I can do -- you'll have to ask the comp.lang.tcl newsgroup for help! --Guido van Rossum (home page: http://www.python.org/~guido/) From andy@reportlab.com Mon Apr 10 20:46:25 2000 From: andy@reportlab.com (Andy Robinson) Date: Mon, 10 Apr 2000 20:46:25 +0100 Subject: [I18n-sig] "takeuchi": a unicode string on IDLE shell References: <200004101401.KAA00238@eric.cnri.reston.va.us> Message-ID: <008a01bfa325$79b92f00$01ac2ac0@boulder> ----- Original Message ----- From: Guido van Rossum To: Cc: Sent: 10 April 2000 15:01 Subject: [I18n-sig] "takeuchi": a unicode string on IDLE shell > Can anyone answer this? I can reproduce the output side of this, and > I believe he's right about the input side. Where should Python > migrate with respect to Unicode input? I think that what Takeuchi is > getting is actually better than in Pythonwin or command line (where he > gets Shift-JIS)... > > --Guido van Rossum (home page: http://www.python.org/~guido/) I think what he wants, as you hinted, is to be able to specify a 'system wide' default encoding of Shift-JIS rather than UTF8. UTF-8 has a certain purity in that it equally annoys every nation, and is nobody's default encoding. What a non-ASCII user needs is a site-wide way of setting the default encoding used for standard input and output. I think this could be done with something (config file? registry key) which site.py looks at, and wraps stream encoders around stdin, stdout and stderr. To illustrate why it matters, I often used to parse data files and do queries on a Japanese name and address database; I could print my lists and tuples in interactive mode and check they worked, or initialise functions with correct data, since the OS uses Shift-JIS as its native encoding and I was manipulating Shift-JIS strings. 
I've lost that ability now due to the Unicode stuff and would need to do >>> for thing in mylist: >>> ....print mylist.encode('shift_jis') to see the contents of a database row, rather than just >>> mylist BTW, Pythonwin stopped working in this regard when Scintilla came along; it prints a byte at a time now, although kanji input is fine, as is kanji pasted into a source file, as long as you specify a Japanese font. However, this is fixable - I just need to find a spare box to run Japanese windows on and find out where the printing goes wrong. Andy Robinson ReportLab From andy@reportlab.com Mon Apr 10 20:49:16 2000 From: andy@reportlab.com (Andy Robinson) Date: Mon, 10 Apr 2000 20:49:16 +0100 Subject: [I18n-sig] Fw: Codecs for Japanese character encodings Message-ID: <009701bfa325$db0af8b0$01ac2ac0@boulder> (I forwarded this to the SIG on Friday, but it failed to appear - hope you don't all get it twice). Tamito Kajiyama has written pure Python codecs for the two main Japanese encodings! Many thanks! They include the 6879 characers in the JIS0208 character set in literal Python dictionaries; so it should be trivial to write modified ones which support vendor-specific extensions with a few extra characters, as long as the extras are in Unicode. I'm now rewriting something I did last year in-house for a customer - a script to generate HTML tables and text files which exactly match the layout of the code charts for JIS0208 in "CJKV Information Processing". I ran these through both codecs and viewed the results in IE5, and as far as I can see the results are perfect. I will post up my scripts when they look a bit prettier :-) It would be nice to put this code somewhere 'out there' so people can work on it - not just codecs, but test suites. How do people feel about starting a project on www.sourceforge.net under CVS? Since lots of us want to work on fast Asian codecs, another things we need is a 'benchmark suite' - maybe a megabyte of Japanese text (mixing everything - ASII, Kanji, half-width katakana?). We can then use these pure Python codecs as a baseline. - Andy Robinson ----- Original Message ----- From: Tamito KAJIYAMA To: Sent: 07 April 2000 18:13 Subject: Re: Codecs for Japanese character encodings > andy@reportlab.com (Andy Robinson) writes: > | > | >Based on the Python Unicode support proposal, I wrote codecs for > | >two Japanese character encodings EUC-JP and Shift_JIS. The codecs > | >are available at the following location: > | > > | >http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/tmp/japanese-codecs.tar.gz > | > | Many thanks for this! I have copied it to the Internationalisation > | Special Interest Group, where we discuss this stuff, and taken the > | liberty of copying your message. > > Good news. Thanks for the coordination. > > | We need to start coordinating a separate codecs library for > | Asian languages, and I'd like to use this as a starting point > | if OK with you. > > That's absolutely okay. I'm grad if my codecs contribute to the > the i18n SIG. I joined the i18n-sig@python.org just after I got > your message. Please carry on the further discussion about the > Japanese codecs (if any) in the list. > > Best regards, > > -- > KAJIYAMA, Tamito > From andy@reportlab.com Mon Apr 10 20:49:27 2000 From: andy@reportlab.com (Andy Robinson) Date: Mon, 10 Apr 2000 20:49:27 +0100 Subject: [I18n-sig] Codec API questions Message-ID: <009b01bfa325$e51836b0$01ac2ac0@boulder> I'm beginning to wonder about some issues with the unicode implementation. 
Bear in mind we have seven weeks left - if anyone else has issues or opinions, we should raise them now. 1. Set Default Encoding at site level ---------------------------------------------------- The default encoding is defined as UTF8, which will at least annoy all nations equally :-). It looks like you can hack this any way you want by creating your own wrappers around stdin/stdout/stderr. However, I wonder if Python should make this customizable on a site basis - for example, site.py checks for some option somewhere to say "I want to see Latin-1" or Shift-JIS or whatever. I often used to write scripts to parse files of names and addresses, and use an interactive prompt to inspect the lists and tuples directly; the convenience of typing 'print mydata' and see it properly is nice. What do people think? (Or is this feature there already and I've missed it?) 2. lookup returns Codec object rather than tuple? --------------------------------------------------------------------- I shuld have thought of this when we were in the draft stage months back, but couldn't really get my mind around it until I had something concrete to play with. Right now, codecs.lookup() returns a tuple of (encode_func, decode_func, stream_encoder_factory, stream_decoder_factory) But there is no easy way to lookup the codec object itself - indeed, no requirement that there be one. I'd like to see lookup always return a Codec object every time, which is guaranteed to have four methods as above, but might have more. (Note that a Codec object would have the ability to create StreamEncoders and StreamDecoders, but would not be one by itself). A fifth method which is potentially very useful is validate(); a sixth might be repair(). And for each language, there could be specific ones such as expanding half-width to full-width katakana. Furthermore, if we can get hold of the Codec objects, we can start to reason about codecs - for example, ask whether encodings are compatible with each other. 3. direct conversion lookups and short-circuiting Unicode ---------------------------------------------------------------------------- This is an extension rather than a change. I know what I want to do, but have only the vaguest ideas how to implement it. As noted here before, you can get from shift-JIS to EUC and vice versa without going through Unicode. Because these algorithmic conversions work on the full 94x94 'kuten space' and not just the 6879 code points in the standard, they tend to work for any vendor-specific extensions and for user-defined characters. Most other Asian native encodings have used a similar scheme. I'd like to see an 'extended API' to go from one native character set to another. As before, this comes in two flavours, string and stream: convert(string, from_enc, to_enc) returns a string. We also need ways to get hold of StreamReader and StreamWriter versions. Now one can trivially build these using Unicode in the middle codecs.lookup('from_enc', 'to_enc') would return a codec object able to convert from one encoding to another. By default, this would weld together two Unicode codecs. But if someone writes a codec to do the job directly, there should be a way to register that. From guido@python.org Mon Apr 10 21:02:22 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 10 Apr 2000 16:02:22 -0400 Subject: [I18n-sig] Fw: Codecs for Japanese character encodings In-Reply-To: Your message of "Mon, 10 Apr 2000 20:49:16 BST." 
<009701bfa325$db0af8b0$01ac2ac0@boulder> References: <009701bfa325$db0af8b0$01ac2ac0@boulder> Message-ID: <200004102002.QAA03212@eric.cnri.reston.va.us> > It would be nice to put this code somewhere 'out there' so people can work > on it - not just codecs, but test suites. How do people feel about starting > a project on www.sourceforge.net under CVS? Excellent idea -- go for it! Make sure to list it in the Vaults of Parnassus too! --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Mon Apr 10 21:45:49 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 10 Apr 2000 16:45:49 -0400 Subject: [I18n-sig] Codec API questions In-Reply-To: Your message of "Mon, 10 Apr 2000 20:49:27 BST." <009b01bfa325$e51836b0$01ac2ac0@boulder> References: <009b01bfa325$e51836b0$01ac2ac0@boulder> Message-ID: <200004102045.QAA03303@eric.cnri.reston.va.us> > 1. Set Default Encoding at site level > ---------------------------------------------------- > The default encoding is defined as UTF8, which will at least annoy all > nations equally :-). > > It looks like you can hack this any way you want by creating your own > wrappers around stdin/stdout/stderr. However, I wonder if Python should > make this customizable on a site basis - for example, site.py checks for > some option somewhere to say "I want to see Latin-1" or Shift-JIS or > whatever. I often used to write scripts to parse files of names and > addresses, and use an interactive prompt to inspect the lists and tuples > directly; the convenience of typing 'print mydata' and see it properly is > nice. What do people think? > > (Or is this feature there already and I've missed it?) Rather than doing this per site I'd suggest doing this per user. Surely each user (on a multi-user site) should be allowed to choose their own apps and settings (cf. locale). After trying to figure out how to do this, I am confused. I can do this: from codecs import EncodedFile f = EncodedFile(sys.stdout, "utf-8", "latin-1") And then I can write Unicode strings to file f, and they are written to sys.stdout as Latin-1. I can also write 8-bit strings to file f, and they are assumed to be UTF-8 and are converted properly to Latin-1. However, if I specify anythying except UTF-8 as the input encoding to EncodedFile, I can't write Unicode objects to it and have something useful happen! It seems the Unicode is always converted to UTF-8 first, and then interpreted according to the input encode. I think that a useful feature to have is a file-like object that behaves as follows: if you write an 8-bit string to it, it applies a given input encoding to turn it into Unicode; then it applies a given output encoding to convert that to (usually multibyte) output characters. If you write a Unicode string to it, it skips the input encoding (since it's already Unicode) and then applies the (same) given output encoding. Then I could write a program that mixes 8-bit strings and Unicode in its output, which encodes all its 8-bit strings in (say) Latin-1. This program must obviously be very careful when it mixes Unicode and 8-bit strings internally (always calling unicode(s, "latin-1")) to avoid getting the default (UTF-8) encoding. But I think this is something you are asking for -- right? > 2. lookup returns Codec object rather than tuple? 
> --------------------------------------------------------------------- > I shuld have thought of this when we were in the draft stage months back, > but couldn't really get my mind around it until I had something concrete to > play with. > > Right now, codecs.lookup() returns a tuple of > (encode_func, > decode_func, > stream_encoder_factory, > stream_decoder_factory) > > But there is no easy way to lookup the codec object itself - indeed, no > requirement that there be one. I'd like to see lookup always return a Codec > object > every time, which is guaranteed to have four methods as above, but might > have more. (Note that a Codec object would have the ability to create > StreamEncoders and StreamDecoders, but would not be one by itself). > > A fifth method which is potentially very useful is validate(); a sixth might > be repair(). And for each language, there could be specific ones such as > expanding half-width to full-width katakana. > > Furthermore, if we can get hold of the Codec objects, we can start to reason > about codecs - for example, ask whether encodings are compatible with each > other. I have no opinion on this; I've forgotten the issues. > 3. direct conversion lookups and short-circuiting Unicode > ---------------------------------------------------------------------------- > This is an extension rather than a change. I know what I want to do, but > have only the vaguest ideas how to implement it. > > As noted here before, you can get from shift-JIS to EUC and vice versa > without going through Unicode. Because these algorithmic conversions work > on the full 94x94 'kuten space' and not just the 6879 code points in the > standard, they tend to work for any vendor-specific extensions and for > user-defined characters. Most other Asian native encodings have used a > similar scheme. > > I'd like to see an 'extended API' to go from one native character set to > another. As before, this comes in two flavours, string and stream: > convert(string, from_enc, to_enc) returns a string. > We also need ways to get hold of StreamReader and StreamWriter versions. > Now one can trivially build these using Unicode in the middle > > codecs.lookup('from_enc', 'to_enc') would return a codec object able to > convert from one encoding to another. By default, this would weld together > two Unicode codecs. But if someone writes a codec to do the job directly, > there should be a way to register that. This could be a separate module, right? I propose that you write a separate module (extended_codecs?) that supports such an extended lookup function. What functionality would you need from the core? --Guido van Rossum (home page: http://www.python.org/~guido/) From brian_takashi@hotmail.com Mon Apr 10 22:09:58 2000 From: brian_takashi@hotmail.com (Brian Hooper) Date: Mon, 10 Apr 2000 21:09:58 GMT Subject: [I18n-sig] Codec API questions Message-ID: <20000410210958.4338.qmail@hotmail.com> Hi Andy, I've been busy recently working with the Unicode API myself and am thinking some of the same things... (BTW, for a current project I am working with Basistech's Rosette libraries, and have actually plugged them into a Python codec, so any Q's about how/what Basistech does I might be able to help with). > >I'm beginning to wonder about some issues with the unicode implementation. >Bear in mind we have seven weeks left - if anyone else has issues or >opinions, we should raise them now. > >1. 
Set Default Encoding at site level >---------------------------------------------------- >The default encoding is defined as UTF8, which will at least annoy all >nations equally :-). > >It looks like you can hack this any way you want by creating your own >wrappers around stdin/stdout/stderr. However, I wonder if Python should >make this customizable on a site basis - for example, site.py checks for >some option somewhere to say "I want to see Latin-1" or Shift-JIS or >whatever. I often used to write scripts to parse files of names and >addresses, and use an interactive prompt to inspect the lists and tuples >directly; the convenience of typing 'print mydata' and see it properly is >nice. What do people think? Is there any reason that this should be set on a per site basis - I definitely agree that it should be possible to change the interpreter encoding, but wouldn't it be nicer if it could instead be changed on a per-interpreter basis? Either via environment variables or maybe command-line flags? Would it be too much of a performance hit to look up the default on any conversion which doesn't explicitly specify the encoding - this would give the most flexibility of all... (it doesn't seem to me that this would be too slow, but I don't have very deep knowledge about this). > >(Or is this feature there already and I've missed it?) No, UTF-8 is the hardcoded default. > > >2. lookup returns Codec object rather than tuple? >--------------------------------------------------------------------- [snip] I really like this idea too, and the optional addition of validate() and repair() are good ideas too. > >3. direct conversion lookups and short-circuiting Unicode >--------------------------------------------------------------------- [snip] This also seems like a good idea to me, and something that would be really good for Japanese support. As for registering, rather than changing how that's done what about changing search functions so that they should be required to take a second argument, which is by default Unicode (UTF-16) but could also be some other encoding. The search function would always be called by the lookup procedure with a to and from encoding, and the search function could deal with the arguments by returning a direct converter or a 'welded' converter codec as appropriate. --Brian ______________________________________________________ Get Your Private, Free Email at http://www.hotmail.com From mal@lemburg.com Mon Apr 10 23:34:31 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 11 Apr 2000 00:34:31 +0200 Subject: [I18n-sig] Codec API questions References: <009b01bfa325$e51836b0$01ac2ac0@boulder> Message-ID: <38F256F7.2D8E1990@lemburg.com> Andy Robinson wrote: > > 1. Set Default Encoding at site level > ---------------------------------------------------- > The default encoding is defined as UTF8, which will at least annoy all > nations equally :-). > > It looks like you can hack this any way you want by creating your own > wrappers around stdin/stdout/stderr. However, I wonder if Python should > make this customizable on a site basis - for example, site.py checks for > some option somewhere to say "I want to see Latin-1" or Shift-JIS or > whatever. I often used to write scripts to parse files of names and > addresses, and use an interactive prompt to inspect the lists and tuples > directly; the convenience of typing 'print mydata' and see it properly is > nice. What do people think? > > (Or is this feature there already and I've missed it?) The design leaves this to user-land. 
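(For illustration, a rough sketch of what such a user-land wrapper could look like, along the lines Guido describes above: 8-bit strings are decoded with a chosen data encoding, Unicode strings pass straight through, and everything is written out in the stream's encoding. NativeWriter is a hypothetical name, not part of the codecs module, and the 'shift_jis' codec in the usage note assumes a Japanese codec package such as Tamito Kajiyama's is installed.)

import codecs

class NativeWriter:

    def __init__(self, stream, data_encoding, stream_encoding, errors='strict'):
        self.stream = stream
        self.data_encoding = data_encoding
        # codecs.lookup() returns (encoder, decoder, StreamReader, StreamWriter)
        self.encode = codecs.lookup(stream_encoding)[0]
        self.errors = errors

    def write(self, object):
        if type(object) is type(''):
            # assume 8-bit strings are in the configured data encoding
            object = unicode(object, self.data_encoding)
        data, consumed = self.encode(object, self.errors)
        self.stream.write(data)

    def writelines(self, lines):
        for line in lines:
            self.write(line)

    def __getattr__(self, name):
        # delegate flush(), close() etc. to the underlying stream
        return getattr(self.stream, name)

# e.g. in an interactive session on a Shift-JIS terminal:
#     import sys
#     sys.stdout = NativeWriter(sys.stdout, 'shift_jis', 'shift_jis')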
I'd suggest using stdin/stdout wrappers as needed, possibly only enabled in interactive sessions. > 2. lookup returns Codec object rather than tuple? > --------------------------------------------------------------------- > I shuld have thought of this when we were in the draft stage months back, > but couldn't really get my mind around it until I had something concrete to > play with. > > Right now, codecs.lookup() returns a tuple of > (encode_func, > decode_func, > stream_encoder_factory, > stream_decoder_factory) > > But there is no easy way to lookup the codec object itself - indeed, no > requirement that there be one. I'd like to see lookup always return a Codec > object > every time, which is guaranteed to have four methods as above, but might > have more. (Note that a Codec object would have the ability to create > StreamEncoders and StreamDecoders, but would not be one by itself). > > A fifth method which is potentially very useful is validate(); a sixth might > be repair(). And for each language, there could be specific ones such as > expanding half-width to full-width katakana. > > Furthermore, if we can get hold of the Codec objects, we can start to reason > about codecs - for example, ask whether encodings are compatible with each > other. Why do you want to query an object ? The factory functions will provide you with an object you can use as codec when called with the proper arguments... note that there can't be just one object alive since these objects can carry state. BTW, the Codec API is designed to work for all kinds of codecs. If you have a need for special new methods there's no problem adding them to your Codec subclass -- the standard codec mechanism won't rely on them, but you can still provide and use them. > 3. direct conversion lookups and short-circuiting Unicode > ---------------------------------------------------------------------------- > This is an extension rather than a change. I know what I want to do, but > have only the vaguest ideas how to implement it. > > As noted here before, you can get from shift-JIS to EUC and vice versa > without going through Unicode. Because these algorithmic conversions work > on the full 94x94 'kuten space' and not just the 6879 code points in the > standard, they tend to work for any vendor-specific extensions and for > user-defined characters. Most other Asian native encodings have used a > similar scheme. > > I'd like to see an 'extended API' to go from one native character set to > another. As before, this comes in two flavours, string and stream: > convert(string, from_enc, to_enc) returns a string. > We also need ways to get hold of StreamReader and StreamWriter versions. > Now one can trivially build these using Unicode in the middle > > codecs.lookup('from_enc', 'to_enc') would return a codec object able to > convert from one encoding to another. By default, this would weld together > two Unicode codecs. But if someone writes a codec to do the job directly, > there should be a way to register that. Looks like we need a set of recode codec classes here. There is already one in codecs.py: StreamRecoder. We'd probably need similar subclasses for the basic Codec class though. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Mon Apr 10 22:43:43 2000 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Mon, 10 Apr 2000 23:43:43 +0200 Subject: [I18n-sig] Fw: Codecs for Japanese character encodings References: <009701bfa325$db0af8b0$01ac2ac0@boulder> Message-ID: <38F24B0F.71F376BA@lemburg.com>

Andy Robinson wrote:
> Tamito Kajiyama has written pure Python codecs for the two main Japanese
> encodings! Many thanks!

Great !

-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

From takeuchi.shohei@lab.ntt.co.jp Tue Apr 11 05:21:05 2000 From: takeuchi.shohei@lab.ntt.co.jp (takeuchi) Date: Tue, 11 Apr 2000 13:31:05 +0900 Subject: [I18n-sig] py1.6a2p1 IDLE annoyance Message-ID: <003701bfa36d$5b03c640$8f133c81@pflab.ecl.ntt.co.jp>

Hi folks,

Thank you, Guido, for updating the IDLE shell string input features according to my private mail. I've just got your py1.6a2p1 from the site and tried it. I hate to say it like this, but the IDLE shell is worse than before. Here is another story of mine.

The IDLE shell now behaves consistently with the command line shell, except in appearance, on Win98! While the input string is saved as a properly native-encoded string (Shift-JIS), the echoed string looks broken on the screen, so the user cannot see the glyph at all!

On the Py1.6a2p1 IDLE shell with Win98 Japanese Edition:

# When typing the Japanese character HIRAGANA A with the IME
>>> s = raw_input("Echo backed broken glyph")
>>> s
'\202\240'  # Shift-JIS encoding of the input
>>> print s
(a broken glyph comes up)
>>> u = unicode(s, "mbcs")
>>> u
u'\u3042'  # the corresponding Unicode code point
>>> print u
(the proper glyph comes up)

Tk8.3 seems to handle only UTF-8 strings, so I think IDLE has to go along with that. I hope IDLE will allow customizing the shell encoding so that a Unicode object is created automatically from key input.

Any ideas?

Best Regards,

Takeuchi
From takeuchi.shohei@lab.ntt.co.jp Tue Apr 11 06:24:41 2000 From: takeuchi.shohei@lab.ntt.co.jp (takeuchi) Date: Tue, 11 Apr 2000 14:24:41 +0900 Subject: [I18n-sig] repost: py16a2p1 IDLE annoyance Message-ID: <007301bfa376$3d2b9cc0$8f133c81@pflab.ecl.ntt.co.jp>

Oops, the post with Japanese characters is not suitable here. OK, I will try again.

-----

Hi folks,

Thank you, Guido, for updating the IDLE shell string input features according to my private mail. I've just got your py1.6a2p1 from the site and tried it. I hate to say it like this, but the IDLE shell is worse than before. Here is another story of mine.

The IDLE shell now behaves consistently with the command line shell, except in appearance, on Win98! While the input string is saved as a properly native-encoded string (Shift-JIS), the echoed string looks broken on the screen, so the user cannot see the glyph at all!

On the Py1.6a2p1 IDLE shell with Win98 Japanese Edition:

# When typing the Japanese character A (please take this as a Japanese character)
>>> s = raw_input("Echo backed broken glyph")
>>> s
'\202\240'  # Shift-JIS encoding of the input
>>> print s
(echoed back as a broken glyph)
>>> u = unicode(s, "mbcs")
>>> u
u'\u3042'  # the corresponding Unicode code point
>>> print u
(a proper glyph comes up)

Tk8.3 seems to handle only UTF-8 strings, so I think IDLE has to go along with that. I hope IDLE will allow customizing the shell encoding so that a Unicode object is created automatically from key input.

Any ideas?

Best Regards,

Takeuchi

From dae_alt3@juno.com Tue Apr 11 09:10:11 2000 From: dae_alt3@juno.com (Doug Edmunds) Date: Tue, 11 Apr 2000 01:10:11 -0700 Subject: [I18n-sig] Reading UTF-16 Scripts Message-ID: <20000411.011011.-421941.3.dae_alt3@juno.com>

python ver: 1.6a os: Win98

Are there any plans to allow Python to read scripts written entirely in UTF-16 format (such as those written by Win98's WordPad program and saved as Unicode text)?

Since each of these files begins with 'FFFE', it would seem not too difficult for Python to recognize that format and convert the non-string content to 8-bit, i.e., p r i n t -> print.

The advantage is that mixed-language scripts (e.g. English/Russian) can be written and saved unambiguously, not dependent upon selection of a particular 'font script' such as cp1251 or KOI8-R for Russian. The motivation for getting away from these scripts (encodings, whatever) is to be able to write multiple languages in a single string.

This kind of scripting could be avoided:

a = unicode('Ïðàâäà - ãàçåòà', 'cp1251')
print a.encode('cp1251')

and replaced with a simpler:

print "In Russian, newspaper is ____; in Polish it is ______"

Notes:
1. Cyrillic fonts do not appear in IDLE (US English is the base).
2. In PythonWin, even with a Cyrillic 'script' selected, such as Courier New (Cyrillic), output appears in English -- the 'script' aspect is being ignored.

-- doug edmunds 11 April 2000

From mark.mcmahon@eur.autodesk.com Tue Apr 11 09:24:21 2000 From: mark.mcmahon@eur.autodesk.com (mark.mcmahon@eur.autodesk.com) Date: Tue, 11 Apr 2000 10:24:21 +0200 Subject: [I18n-sig] Changing case Message-ID:

Hi,

I can't seem to figure this out...
>>> s = unicode('\204\202', 'latin-1') >>> s u'\204\202' >>> s.upper() u'\204\202') Is this something that unicode should be able to do? Am I using the wrong encoding? Or would I have to have a particular codec to have a mapping between lower and uppercase characters. Sorry if this is basic and obvious - but as I said I can't seem to figure it out Windows NT4, (US - French regional settings), Python 1.6a1, both command line and Idle. Mark From andy@reportlab.com Tue Apr 11 10:04:19 2000 From: andy@reportlab.com (Andy Robinson) Date: Tue, 11 Apr 2000 10:04:19 +0100 Subject: [I18n-sig] P1.6a1- Win98 - unicode issues In-Reply-To: <20000410.122846.-454791.1.dae_alt3@juno.com> Message-ID: > Apparently my efforts to send unicode via Juno failed. > d.edmunds > > ???????? ????? ??????? ? ??????????? ???????? ??????????????. ??? > > ????? > > ??? ?????? ??????? ?? ????? ????? ?????? ?? ????????????? ??????? ? > > ??????. > > I think this is a Windows feature at the moment. Office 2000 apps, IE5 and Outlook Express allow input and display in any language if you have the right OS add-ons loaded. But at the moment when you past to the clipboard or save to a file, they get turned to question marks, presumably to avod upsetting older apps that are not so Unicode aware. I am told Win2000 is better - need to try it. It is this kind of thing that makes i18n really hard - even a simple cut/paste can modify your data, and it is hard to know in which piece of software things are going wrong. For Asian languages, there is a great little freeware word processor / lookup tool called "JWP" which lets you explicitly control the cut/paste and save/load encodings used. - Andy Robinson From mal@lemburg.com Tue Apr 11 13:14:00 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 11 Apr 2000 14:14:00 +0200 Subject: [I18n-sig] Reading UTF-16 Scripts References: <20000411.011011.-421941.3.dae_alt3@juno.com> Message-ID: <38F31708.AA1088B3@lemburg.com> Doug Edmunds wrote: > > python ver: 1.6a > os: Win98 > > Are there any plans to allow > python to be able to read scripts > written entirely in UTF-16 format > (such as those written by > Win98's Wordpad program and saved > as unicode text?) > > Since each of these files begin > with 'FFEE' it would seem to be > not too difficult for python > to recognize that format and convert > the non-string context to 8bit, i.e., > p r i n t -> print. As I understand, Python scripts are supposed to be ASCII (or maybe UTF-8). Your proposal would only work if *all* strings were Unicode in Python. There currently are two types: one for 8-bit strings and the 16-bit Unicode one. > The advantage is that mixed language > scripts (i.e English/Russian) can > be written and saved unambiguously, > not dependent upon selection > of a particular 'font script' such as > cp1251 or KOI8-r for Russian. > > The motivation for getting away from > these scripts (encodings, whatever) > is to be able to write multiple languages > in a single string. > > This kind of scripting could be avoided: > a = unicode ('Ïðàâäà - ãàçåòà', 'cp1251') > print a.encode('cp1251') > > and replaced with a simpler: > print "In Russian, newspaper is ____; in Polish it is ______" > > Notes: > 1. Cyrillic fonts do not appear in IDLE (US English is base). > 2. In PythonWin, even with a Cyrillic 'script' selected, > such as Courier New (Cyrillic), output appears in English > -- the 'script' aspect is being ignored. 
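(Nothing stops a tool from doing the byte-order-mark detection itself before handing source to Python, though. A rough sketch, assuming the 'utf-16-le'/'utf-16-be' codecs from the encodings package are available; read_unicode_source is a hypothetical helper, not a proposal for the core:)

def read_unicode_source(filename):
    # WordPad writes little-endian UTF-16 with a leading FF FE byte
    # order mark; big-endian files start with FE FF.
    data = open(filename, 'rb').read()
    if data[:2] == '\377\376':
        return unicode(data[2:], 'utf-16-le')
    elif data[:2] == '\376\377':
        return unicode(data[2:], 'utf-16-be')
    else:
        # no BOM: treat it as an ordinary 8-bit source file
        return data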
-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Tue Apr 11 12:40:57 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 11 Apr 2000 13:40:57 +0200 Subject: [I18n-sig] Changing case References: Message-ID: <38F30F49.405196E@lemburg.com> mark.mcmahon@eur.autodesk.com wrote: > > Hi, > > I can't seem to figure this out.. > > >>> s = unicode('\204\202', 'latin-1') > >>> s > u'\204\202' > >>> s.upper() > u'\204\202') > > Is this something that unicode should be able to do? Am I using the wrong > encoding? > > Or would I have to have a particular codec to have a mapping between lower > and uppercase characters. > > Sorry if this is basic and obvious - but as I said I can't seem to figure it > out Those two characters don't have a lower/upper case mapping: 0080;;Cc;0;BN;;;;;N;;;;; 0081;;Cc;0;BN;;;;;N;;;;; 0082;;Cc;0;BN;;;;;N;BREAK PERMITTED HERE;;;; 0083;;Cc;0;BN;;;;;N;NO BREAK HERE;;;; 0084;;Cc;0;BN;;;;;N;INDEX;;;; .lower() and .upper() only modify chars which do have such a mapping -- all others are left untouched. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mark.mcmahon@eur.autodesk.com Tue Apr 11 13:13:09 2000 From: mark.mcmahon@eur.autodesk.com (mark.mcmahon@eur.autodesk.com) Date: Tue, 11 Apr 2000 14:13:09 +0200 Subject: [I18n-sig] Changing case Message-ID: Hi Marc, I definately do not understand. \204 is lower_e_egu (spelling?) and = \204 is lower_a_umlaut. Upper case of these should be \216 and \220 = respectively. (Probably will not display properly on all machines) -------------- >>> s =3D u"=E9=E4" >>> s u'\202\204' >>> t =3D u"=C4=C9" >>> t u'\216\220' ------------- Mark Marc -> Those two characters don't have a lower/upper case mapping: .lower() and .upper() only modify chars which do have such a mapping -- all others are left untouched. --=20 Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido@python.org Tue Apr 11 14:24:48 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 11 Apr 2000 09:24:48 -0400 Subject: [I18n-sig] Changing case In-Reply-To: Your message of "Tue, 11 Apr 2000 14:13:09 +0200." References: Message-ID: <200004111324.JAA07943@eric.cnri.reston.va.us> > I definately do not understand. \204 is lower_e_egu (spelling?) and \204 is > lower_a_umlaut. Upper case of these should be \216 and \220 respectively. > > (Probably will not display properly on all machines) > -------------- > >>> s = u"éä" > >>> s > u'\202\204' > >>> t = u"ÄÉ" > >>> t > u'\216\220' > ------------- > Mark Aha, *I* understand. You must be on Windows. Windows has its own character encoding, where e-egu is \202 and a-umlaut is \204. However Python doesn't know what character set you are using, and when you typed e-egu, all it knew is that you entered \202. If you type this in a u"..." string, all codes are interpreted as if they are Latin-1, which happens to be the lower 256 bytes of Unicode. The Latin-1 character \202 (which is NOT e-egu but a control character) has no upper case equivalent. How do you get what you want? Instead of typing u"éä", you should be able to type unicode("éä", "mbcs"). HOWEVER, I can't get this to work either! 
I get unicode('\202\204','mbcs') -> u"\u201A\u201E" and the latter string doesn't have an upper case equivalent either! I had expected that these would have translated to Latin-1. Maybe I'm using the wrong MBCS code page???

> Marc ->
> Those two characters don't have a lower/upper case mapping:
>
> .lower() and .upper() only modify chars which do have such a
> mapping -- all others are left untouched.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From mal@lemburg.com  Tue Apr 11 13:49:42 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 11 Apr 2000 14:49:42 +0200
Subject: [I18n-sig] Changing case
References: 
Message-ID: <38F31F66.8C147314@lemburg.com>

mark.mcmahon@eur.autodesk.com wrote:
>
> Hi Marc,
>
> I definitely do not understand. \204 is lower_e_egu (spelling?) and \204 is
> lower_a_umlaut. Upper case of these should be \216 and \220 respectively.

Not in Latin-1... you are probably using a different code page in your editor.

>>> u'éä'
u'\351\344'
>>> u'éä'.upper()
u'\311\304'
>>> print u'éä'.encode('latin-1')
éä
>>> print u'éä'.upper().encode('latin-1')
ÉÄ

Strangely enough, I get these outputs on my Linux machine:

>>> print 'éä'.upper()
éä

Looks like the C lib doesn't know about upper case mappings for these Latin-1 characters.

> (Probably will not display properly on all machines)
> --------------
> >>> s = u"éä"
> >>> s
> u'\202\204'
> >>> t = u"ÄÉ"
> >>> t
> u'\216\220'
> -------------
> Mark
>
> Marc ->
> Those two characters don't have a lower/upper case mapping:
>
> .lower() and .upper() only modify chars which do have such a
> mapping -- all others are left untouched.
>
> --
> Marc-Andre Lemburg
> ______________________________________________________________________
> Business: http://www.lemburg.com/
> Python Pages: http://www.lemburg.com/python/
>
> _______________________________________________
> I18n-sig mailing list
> I18n-sig@python.org
> http://www.python.org/mailman/listinfo/i18n-sig

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From guido@python.org  Tue Apr 11 15:55:04 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 11 Apr 2000 10:55:04 -0400
Subject: [I18n-sig] Changing case
In-Reply-To: Your message of "Tue, 11 Apr 2000 14:49:42 +0200." <38F31F66.8C147314@lemburg.com>
References: <38F31F66.8C147314@lemburg.com>
Message-ID: <200004111455.KAA08048@eric.cnri.reston.va.us>

The story continues... I tried the following in Python 1.6a2p1 on Windows NT 4.0 in three interpreters: IDLE, command line, and Pythonwin (win32all-130 using Python 1.6a2p1). (Since I live in the US, I don't have any way to input non-ASCII characters; so I use escape sequences for input.)

>>> s = '\351\344'             # This is e-egu a-umlaut in Latin-1
>>> u = unicode(s, "latin-1")  # This simply yields u"\351\344"
>>> print s
(see table below)
>>> print u
(see table below)
>>>

I got the following results:

                  print s              print u
                  -------              -------
IDLE:             e-egu a-umlaut       e-egu a-umlaut
command line:     THETA SIGMA          three graphics + n~
Pythonwin:        e-egu a-umlaut       A~ (C) A~ o-with-cross

I tried the same thing on Solaris in IDLE and the command line; IDLE on Solaris did exactly the same thing as it did on Windows, and the command line on Solaris did exactly the same thing as Pythonwin (!) did on Windows. I tried the same thing with IDLE from Python 1.6a1 and also got the same results -- from this I conclude that Tcl/Tk 8.2 and 8.3 behave the same way in this respect.
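A sketch of the workaround, assuming the cp437 codec (the US-English OEM code page) is available: decode with the code page actually in use instead of latin-1, and encode explicitly before printing so the program, not the IDE, picks the output encoding:

    s = '\202\204'                     # e-egu, a-umlaut as typed in a US-English DOS box
    u = unicode(s, 'cp437')            # decode with the code page actually in use
    print u.upper().encode('cp437')    # E-egu, A-umlaut again, i.e. the bytes '\220\216'
    print u.encode('latin-1')          # for a Latin-1 terminal (xterm, Scintilla)
    print u.encode('utf-8')            # for a Tk 8.1+ text widget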
My theory why IDLE has the highest success rate: Tcl/Tk 8.2 uses UTF-8 internally, but falls back to Latin-1 when you use non-ASCII characters that are clearly not UTF-8. Thus, "print u" displays the correct value because Tkinter converts Unicode to UTF-8, and "print s" displays the correct value because Tcl/Tk recognizes that it's not UTF-8 and thus interprets it as Latin-1. The command line (running in a DOS box) uses a default code page which bears no relation to Latin-1; the THETA and SIGMA happen to have codes \351 and \344. The gibberish printed for u is simply what its UTF-8 encoding ('\303\251\303\244') looks like when interpreted in the same code page. Finally, Pythonwin: Scintilla (its text widget) seems to know about Latin-1 only. The four characters it prints for u are the Latin-1 characters for \303, \251, \303 and \244. This is also true for the command line on Solaris (using xterm with the default Latin-1 encoding). Note that IDLE doesn't always print Latin-1 characters correctly! I was just lucky. For example, the string "\303,\251,\303\251" prints as A~, comma, (C), comma, e-egu. In other words, \303 and \251 by themselves are interpreted as Latin-1, while taken together they are interpreted as UTF-8. What would be nice? For stdout, to be able to say *independently* what encoding 8-bit strings are to be assumed when printed, and what encoding should be used for the output stream. And for this to work in all three IDEs: IDLE, command line and Pythonwin. In IDLE, the output stream should be fixed to UTF-8, but a user working with Latin-1 strings could set the defaults 8-bit string encoding for output to be Latin-1. Then, print '\351\344' would be encoded as UTF-8: '\303\251\303\244', which prints as e-egu a-umlaut; on the other hand, print '\303\251\303\244' would be interpreted as 4 Latin-1 characters, and print as A~ (C) A~ o-with-cross. In the command line, on Windows the output encoding should be set to the default MBCS code page, but the default encoding for 8-bit strings could be set to something user-specified, e.g. Latin-1. A similar thing should happen for input (and the input and output should normally be switched together, so that a user entering e.g. shift-JIS would also get shift-JIS on putput). This is quite independent of the source encoding when reading from a file. I have some issues with the current approach (which seems to be "use whatever bytes you read" and thus defaults to Latin-1 if you use non-ASCII characters inUnicode string literals; otherwise it's whatever the user wants it to be. Note in particular that a user who edits her source code in shift-JIS can currently *not* use shift-JIS in Unicode literals -- she must use something like unicode(".....","shift-jis") to get a Unicode string containing the correct Japanese characters encoded in Unicode. Of course, when entering source code interactively, this should be tied to the encoding for stdin. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Tue Apr 11 16:38:32 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 11 Apr 2000 17:38:32 +0200 Subject: [I18n-sig] Changing case References: <38F31F66.8C147314@lemburg.com> <200004111455.KAA08048@eric.cnri.reston.va.us> Message-ID: <38F346F8.63A5C793@lemburg.com> Guido van Rossum wrote: > > This is quite independent of the source encoding when reading from a > file. 
I have some issues with the current approach (which seems to be > "use whatever bytes you read" and thus defaults to Latin-1 if you use > non-ASCII characters inUnicode string literals; otherwise it's > whatever the user wants it to be. What direction should we be heading: interpret the source files under some encoding assumption deduced from the platform, a command line switch or a #pragma, or simply fix one encoding (e.g. Latin-1) ? The current divergence between u"...chars..." and "...chars..." really only stems from the fact that "...chars..." doesn't have to know about the used encoding, while u"...chars..." does to be able to convert the data to Unicode. > Note in particular that a user who > edits her source code in shift-JIS can currently *not* use shift-JIS > in Unicode literals -- she must use something like > unicode(".....","shift-jis") to get a Unicode string containing the > correct Japanese characters encoded in Unicode. See above -- without any further knowledge about the encoding used to write the source file, there is no other way than to simply fix one encoding (which happens to be Latin-1 due to the way the first 256 Unicode ordinals are defined). Note that even if the parser would know the encoding, you'd still have a problem processing the strings at run-time: 8-bit strings do not carry any encoding information. The only ways to fix this would be to define a global 8-bit string encoding or add an encoding attribute to strings. One possible way would be to define that all 8-bit strings get converted to UTF-8 when parsed (by the compiler, eval(), etc.). This would assure that all strings used at run-time would in fact be UTF-8 and conversions to and from Unicode would be possible without information loss. The downside of this approach is that indexing and slicing do not work well with UTF-8: a single input character can be encoded by as much as 6 bytes (for 32-bit Unicode) ! I also assume that many applications rely on the fact that len("äö") == 2 and not 4. Perhaps we should just loosen the used encoding for u"...chars..." using #pragmas and/or cmd line switches. Then people around the world would at least have a simple way to write programs which still work everywhere, but can be written using any of the encodings known to Python. 8-bit "...chars..." would then be interpreted as before: user defined data using a user defined encoding (the string->Unicode conversion would still need to make the UTF-8 assumption, though). -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido@python.org Tue Apr 11 17:56:21 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 11 Apr 2000 12:56:21 -0400 Subject: [I18n-sig] Changing case In-Reply-To: Your message of "Tue, 11 Apr 2000 17:38:32 +0200." <38F346F8.63A5C793@lemburg.com> References: <38F31F66.8C147314@lemburg.com> <200004111455.KAA08048@eric.cnri.reston.va.us> <38F346F8.63A5C793@lemburg.com> Message-ID: <200004111656.MAA08887@eric.cnri.reston.va.us> > What direction should we be heading: interpret the source > files under some encoding assumption deduced from the > platform, a command line switch or a #pragma, or simply fix > one encoding (e.g. Latin-1) ? I think we'll have to allow user-specified encodings -- including UTF-8 and eventually UTF-16. 
How these are communicated to the parser is a separate design issue; we could start with a command line switch (assuming the standard library is ASCII only) and later migrate to a per-file pragma. There should also be a default encoding; I would propose UTF-8, as this is already the default encoding used at run-time. (And because it annoys everyone roughly equally. :-) Once we know the source encoding, it's obvious what to do with Unicode literals: translate from the input encoding. I want to propose a very simple rule for 8-bit literals: these use the source encoding -- in other words, they aren't changed from what is read from the file. This is most likely to yield what the user wants. Especially if the user doesn't use Unicode explicitly (neither literals nor via conversions) the user sees their native character set when editing the source file, and probably uses the same encoding for output files, so if the user simply prints strings, the right thing should happen automatically. If the user *does* use Unicode conversions, the user has to specify their encoding explicitly (unless it's UTF-8). This seems only fair -- the runtime can't know whether an 8-bit string being converted to Unicode started its life as an 8-bit literal or whether it was read from a file with an encoding that may only be known to the user. > The current divergence between u"...chars..." and "...chars..." > really only stems from the fact that "...chars..." doesn't > have to know about the used encoding, while u"...chars..." does > to be able to convert the data to Unicode. Right. Hence my deduction that currently the source encoding is Latin-1. > Note that even if the parser would know the encoding, you'd > still have a problem processing the strings at run-time: > 8-bit strings do not carry any encoding information. > The only ways to fix this would be to define a global 8-bit > string encoding or add an encoding attribute to strings. The former we decided against -- the latter can be done by the user (sublcassing UserString). > One possible way would be to define that all 8-bit strings > get converted to UTF-8 when parsed (by the compiler, eval(), etc.). > This would assure that all strings used at run-time would > in fact be UTF-8 and conversions to and from Unicode would > be possible without information loss. No -- this does NOT guarantee that all 8-bit strings are UTF-8. It doesn't cover strings explicitly encoded using octal escapes, and (much more importantly) it doesn't cover strings read from files or sockets or constructed in other ways. (We can know that all strings we get out of Tkinter are UTF-8 encoded though! Provided we're using Tcl/Tk 8.1 or higher.) > The downside of this approach is that indexing and slicing do > not work well with UTF-8: a single input character can be > encoded by as much as 6 bytes (for 32-bit Unicode) ! I also > assume that many applications rely on the fact that > len("äö") == 2 and not 4. Agreed. If we tried to make everything UTF-8, we should never have started down the path of a separate Unicode string datatype. I say: 8-bit strings have no fixed encoding -- they are 8-bit bytes and their interpretation is determined by the program. The default of UTF-8 when converting to a Unicode string is just because we need a default. > Perhaps we should just loosen the used encoding for u"...chars..." > using #pragmas and/or cmd line switches. 
Then people around the > world would at least have a simple way to write programs which > still work everywhere, but can be written using any of the > encodings known to Python. 8-bit "...chars..." would then > be interpreted as before: user defined data using a user > defined encoding (the string->Unicode conversion would still > need to make the UTF-8 assumption, though). This sounds like my proposal. Let's do it. --Guido van Rossum (home page: http://www.python.org/~guido/) From chris@ccbs.ntu.edu.tw Wed Apr 12 08:36:05 2000 From: chris@ccbs.ntu.edu.tw (Christian Wittern) Date: Wed, 12 Apr 2000 15:36:05 +0800 Subject: [I18n-sig] P1.6a1- Win98 - unicode issues In-Reply-To: Message-ID: > > Apparently my efforts to send unicode via Juno failed. > > d.edmunds > > > ???????? ????? ??????? ? ??????????? ???????? ??????????????. ??? > > > ????? > > > ??? ?????? ??????? ?? ????? ????? ?????? ?? ????????????? ??????? ? > > > ??????. > > > > I think this is a Windows feature at the moment. Office 2000 > apps, IE5 and > Outlook Express allow input and display in any language if you have the > right OS add-ons loaded. But at the moment when you past to the clipboard > or save to a file, they get turned to question marks, presumably to avod > upsetting older apps that are not so Unicode aware. I am told Win2000 is > better - need to try it. > As far as I know, the WIndows clipboard offers the text in different formats, the ??? is just the text-only fallback in cases the application does not know the magic to read the Unicode portion of the clipboard. I don't know either, but I know it is possible... All the best, Christian Wittern From mal@lemburg.com Wed Apr 12 08:59:25 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 12 Apr 2000 09:59:25 +0200 Subject: [I18n-sig] Changing case References: <38F31F66.8C147314@lemburg.com> <200004111455.KAA08048@eric.cnri.reston.va.us> <38F346F8.63A5C793@lemburg.com> <200004111656.MAA08887@eric.cnri.reston.va.us> Message-ID: <38F42CDD.49AE89B3@lemburg.com> Guido van Rossum wrote: > > > Perhaps we should just loosen the used encoding for u"...chars..." > > using #pragmas and/or cmd line switches. Then people around the > > world would at least have a simple way to write programs which > > still work everywhere, but can be written using any of the > > encodings known to Python. 8-bit "...chars..." would then > > be interpreted as before: user defined data using a user > > defined encoding (the string->Unicode conversion would still > > need to make the UTF-8 assumption, though). > > This sounds like my proposal. Let's do it. Thinking about this some more: while adding a flag to designate the u"" encoding would be easy, should the encoded string also be able to contain \uXXXX and the like sequences ? If yes, we'd need a two level approach: 1. decode the input encoding to Unicode 2. decode the embedded \uXXXX et al. escape sequences (now within Unicode) We'd need a new codec for 2 and this codec would have to be able to translate Unicode to Unicode -- nothing difficult, but a new technique since all others currently do 8-bit <-> Unicode. "Draft proposal"ing here: Let's start the experiment with a command line switch until #pragma handling has been properly defined. #pragmas should then be used for scripts read from files to ensure that they work elsewhere in the world. What command line switch should we use... -e as in "encoding" ? We'd also need an environment variable ro make things easier, say PYTHONENCODING... The value should be available within Python as e.g. 
sys.encoding. The given encoding would only be used by the compiler (the part that translates u"..." strings into objects). Usage in scripts in then up to user-land routines (via sys.encoding). To make all this work without too many hassles we'd need (at least the most commonly used) CJKV codecs in the core distribution. How big would these be ? Would someone contribute them... Tamito ? -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Wed Apr 12 09:28:49 2000 From: andy@reportlab.com (Andy Robinson) Date: Wed, 12 Apr 2000 09:28:49 +0100 Subject: [I18n-sig] Changing case In-Reply-To: <200004111656.MAA08887@eric.cnri.reston.va.us> Message-ID: > I say: 8-bit strings have no fixed encoding -- they are 8-bit bytes > and their interpretation is determined by the program. The default of > UTF-8 when converting to a Unicode string is just because we need a > default. This makes perfect sense to me and I agree 100%. Guido, thanks for summing up the issues so clearly. - Andy Robinson From mal@lemburg.com Wed Apr 12 10:30:40 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 12 Apr 2000 11:30:40 +0200 Subject: [I18n-sig] Changing case References: Message-ID: <38F44240.CB4A9291@lemburg.com> [CCing to i18n too] Andy Robinson wrote: > > > To make all this work without too many hassles we'd need > > (at least the most commonly used) CJKV codecs in the core > > distribution. How big would these be ? Would someone contribute > > them... Tamito ? > > > He may be at home by now, but he indicated to me that he was > happy for them to be used in any way. The nice things about > his codecs are > (a) one could extract the mapping tables for other codecs > from data at www.unicode org and use a very similar > approach. > (b) the mappings may be 168k, but they at least zip nicely. > I'm guessing at 5-6 such codecs in the distribution > initially. > (c) the algorithmic bit can be accelerated later in C or our > vaporware state machine, and nobody needs to change > any interfaces. > (d) if we slightly parameterise his codecs so that one could > substitute a different mapping table if needed, then > all the corporate variations just need to create a > new dictionary with the deltas - Microsoft Code Page > 932 would not be another 168k, but just a few k and > could build its mapping on the fly. Sounds ok to me. > However, I suspect putting it in the core for June 1st may > be too aggressive; if the compiler is going to use them on > every source file for a Japanese user, we really want to > move from byte-level loops in Python to something much faster. Speed is not an issue now: what we need is a good concept and some proof-of-concept code to go with it. BTW, all this will go into 1.7 AFAIK... 1.6 will have to do with what's there now. I may get a patch done for the -e command line switch -- but only as experimental feature in 1.6. Unfortunately, Guido's out at the moment, so he can't comment on this... 
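To make (d) above concrete, here is a rough sketch of the delta idea; every name in it is made up rather than taken from Tamito's package or the core:

    # One big base table, plus a small dictionary of deltas per corporate variant.
    jisx0208_decoding_map = {
        0x2121: 0x3000,    # ideographic space -- the real table has thousands of entries
        # ...
    }

    cp932_overrides = {
        0x2141: 0xFF5E,    # e.g. the wave dash cell, mapped to FULLWIDTH TILDE by Microsoft
    }

    cp932_decoding_map = jisx0208_decoding_map.copy()
    cp932_decoding_map.update(cp932_overrides)
    # a CP932 codec would then simply look characters up in cp932_decoding_map

That way the corporate variants cost a few kilobytes each instead of another full mapping table.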
-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From kajiyama@grad.sccs.chukyo-u.ac.jp Wed Apr 12 18:30:50 2000 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Thu, 13 Apr 2000 02:30:50 +0900 Subject: [I18n-sig] Changing case In-Reply-To: <38F44240.CB4A9291@lemburg.com> (mal@lemburg.com) References: <38F46F0524E.B96AOTSUK@boomt.bt.kznet> Message-ID: <200004121730.CAA03719@dhcp236.grad.sccs.chukyo-u.ac.jp> * M.-A. Lemburg: | | > > To make all this work without too many hassles we'd need | > > (at least the most commonly used) CJKV codecs in the core | > > distribution. How big would these be ? Would someone contribute | > > them... Tamito ? * Andy Robinson: | | > He may be at home by now, but he indicated to me that he was | > happy for them to be used in any way. The nice things about | > his codecs are | > (a) one could extract the mapping tables for other codecs | > from data at www.unicode org and use a very similar | > approach. In fact, I generated the mappings in my Japanese codecs using simple Python scripts based on the mapping table provided by Unicode Inc.: ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT The version I used is 0.9 (8 March 1994). The perfectness of the mappings are totally due to the authors of the original mapping table, not me ;) | > (b) the mappings may be 168k, but they at least zip nicely. | > I'm guessing at 5-6 such codecs in the distribution | > initially. Thanks for the considerations on size. I personally consider the size issue is less important than the speed issue, though. | > (c) the algorithmic bit can be accelerated later in C or our | > vaporware state machine, and nobody needs to change | > any interfaces. | > (d) if we slightly parameterise his codecs so that one could | > substitute a different mapping table if needed, then | > all the corporate variations just need to create a | > new dictionary with the deltas - Microsoft Code Page | > 932 would not be another 168k, but just a few k and | > could build its mapping on the fly. Good ideas. | > However, I suspect putting it in the core for June 1st may | > be too aggressive; if the compiler is going to use them on | > every source file for a Japanese user, we really want to | > move from byte-level loops in Python to something much faster. | | Speed is not an issue now: what we need is a good concept | and some proof-of-concept code to go with it. I think my pure Python implementation of Japanese codecs is a kind of "proof of concept" at most. I run a simple benchmark test on my codecs; it took about 7 minutes to convert a 7MB Japanese text file from EUC-JP to EUC-JP via UTF-8. It seems that my codecs are too slow to use for most applications. I believe the char-by-char iteration on strings in EUC-JP and Shift_JIS needs to be implemented in C. Best regards, -- KAJIYAMA, Tamito From guido@python.org Thu Apr 27 16:01:48 2000 From: guido@python.org (Guido van Rossum) Date: Thu, 27 Apr 2000 11:01:48 -0400 Subject: [I18n-sig] Unicode debate In-Reply-To: Your message of "Thu, 27 Apr 2000 06:42:43 BST." References: Message-ID: <200004271501.LAA13535@eric.cnri.reston.va.us> I'd like to reset this discussion. I don't think we need to involve c.l.py yet -- I haven't seen anyone with Asian language experience chime in there, and that's where this matters most. 
I am directing this to the Python i18n-sig mailing list, because that's where the debate belongs, and there interested parties can join the discussion without having to be vetted as "fit for python-dev" first. I apologize for having been less than responsive in the matter; unfortunately there's lots of other stuff on my mind right now that has recently had a tendency to distract me with higher priority crises. I've heard a few people claim that strings should always be considered to contain "characters" and that there should be one character per string element. I've also heard a clamoring that there should only be one string type. You folks have never used Asian encodings. In countries like Japan, China and Korea, encodings are a fact of life, and the most popular encodings are ASCII supersets that use a variable number of bytes per character, just like UTF-8. Each country or language uses different encodings, even though their characters look mostly the same to western eyes. UTF-8 and Unicode is having a hard time getting adopted in these countries because most software that people use deals only with the local encodings. (Sounds familiar?) These encodings are much less "pure" than UTF-8, because they only encode the local characters (and ASCII), and because of various problems with slicing: if you look "in the middle" of an encoded string or file, you may not know how to interpret the bytes you see. There are overlaps (in most of these encodings anyway) between the codes used for single-byte and double-byte encodings, and you may have to look back one or more characters to know what to make of the particular byte you see. To get an idea of the nightmares that non-UTF-8 multibyte encodings give C/C++ programmers, see the Multibyte Character Set (MBCS) Survival Guide (http://msdn.microsoft.com/library/backgrnd/html/msdn_mbcssg.htm). See also the home page of the i18n-sig for more background information on encoding (and other i18n) issues (http://www.python.org/sigs/i18n-sig/). UTF-8 attempts to solve some of these problems: the multi-byte encodings are chosen such that you can tell by the high bits of each byte whether it is (1) a single-byte (ASCII) character (top bit off), (2) the start of a multi-byte character (at least two top bits on; how many indicates the total number of bytes comprising the character), or (3) a continuation byte in a multi-byte character (top bit on, next bit off). Many of the problems with non-UTF-8 multibyte encodings are the same as for UTF-8 though: #bytes != #characters, a byte may not be a valid character, regular expression patterns using "." may give the wrong results, and so on. The truth of the matter is: the encoding of string objects is in the mind of the programmer. When I read a GIF file into a string object, the encoding is "binary goop". When I read a line of Japanese text from a file, the encoding may be JIS, shift-JIS, or ENC -- this has to be an assumption built-in to my program, or perhaps information supplied separately (there's no easy way to guess based on the actual data). When I type a string literal using Latin-1 characters, the encoding is Latin-1. When I use octal escapes in a string literal, e.g. '\303\247', the encoding could be UTF-8 (this is a cedilla). When I type a 7-bit string literal, the encoding is ASCII. The moral of all this? 8-bit strings are not going away. They are not encoded in UTF-8 henceforth. Like before, and like 8-bit text files, they are encoded in whatever encoding you want. 
All you get is an extra mechanism to convert them to Unicode, and the Unicode conversion defaults to UTF-8 because it is the only conversion that is reversible. And, as Tim Peters quoted Andy Robinson (paraphrasing Tim's paraphrase), UTF-8 annoys everyone equally. Where does the current approach require work? - We need a way to indicate the encoding of Python source code. (Probably a "magic comment".) - We need a way to indicate the encoding of input and output data files, and we need shortcuts to set the encoding of stdin, stdout and stderr (and maybe all files opened without an explicit encoding). Marc-Andre showed some sample code, but I believe it is still cumbersome. (I have to play with it more to see how it could be improved.) - We need to discuss whether there should be a way to change the default conversion between Unicode and 8-bit strings (currently hardcoded to UTF-8), in order to make life easier for people who want to continue to use their favorite 8-bit encoding (e.g. Latin-1, or shift-JIS) but who also want to make use of the new Unicode datatype. We're still in alpha, so we can still fix things. --Guido van Rossum (home page: http://www.python.org/~guido/) From gresham@mediavisual.com Thu Apr 27 17:41:04 2000 From: gresham@mediavisual.com (Paul Gresham) Date: Fri, 28 Apr 2000 00:41:04 +0800 Subject: [I18n-sig] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> Message-ID: <010f01bfb067$64e43260$9a2b440a@miv01> Hi, I'm not sure how much value I can add, as I know little about the charsets etc. and a bit more about Python. As a user of these, and running a consultancy firm in Hong Kong, I can at least pass on some points and perhaps help you with testing later on. My first touch on international PCs was fixing a Japanese 8086 back in 1989, it didn't even have colour ! Hong Kong is quite an experience as there are two formats in common use, plus occasionally another gets thrown in. In HK they use the Traditional Chinese, whereas the mainland uses Simplified, as Guido says, there are a number of different types of these. Occasionally we see the Taiwanese charsets used. It seems to me that having each individual string variable encoded might just be too atomic, perhaps creating a cumbersome overhead in the system. For most applications I can settle for the entire app to be using a single charset, however from experience there are exceptions. We are normally working with prior knowledge of the charset being used, rather than having to deal with any charset which may come along (at an application level), and therefore generally work in a context, just as a European programmer would be working in say English or German. As you know, storage/retrieval is not a problem, but manipulation and comparison is. A nice way to handle this would be like operator overloading such that string operations would be perfomed in the context of the current charset, I could then change context as needed, removing the need for metadata surrounding the actual data. This should speed things up as each overloaded library could be optimised given the different quirks, and new ones could be added easily. My code could be easily re-used on different charsets by simply changing context externally to the code, rather than passing in lots of stuff and expecting Python to deal with it. Also I'd like very much to compile/load in only the International charsets that I need. I wouldn't want to see Java type bloat occurring to Python, and adding internationalisation for everything, is huge. 
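A rough user-land sketch of that charset-context idea -- all names here are made up, and it assumes codecs for the named charsets are installed:

    _charset = ['big5']                    # the charset currently in effect

    def set_charset(name):
        _charset[0] = name

    def to_unicode(s):
        return unicode(s, _charset[0])     # decode using the current context

    def same_text(a, b):
        return to_unicode(a) == to_unicode(b)   # compare in the logical domain, not byte-wise

    set_charset('gb2312')                  # reuse the same code under another charset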
I think what I am suggesting is a different approach which obviously places more onus on the programmer rather than Python. Perhaps this is not acceptable, I don't know as I've never developed a programming language. I hope this is a helpful point of view to get you thinking further, otherwise ... please ignore me and I'll keep quiet : ) Regards Paul ----- Original Message ----- From: "Guido van Rossum" To: ; Cc: "Just van Rossum" Sent: Thursday, April 27, 2000 11:01 PM Subject: [I18n-sig] Unicode debate > I'd like to reset this discussion. I don't think we need to involve > c.l.py yet -- I haven't seen anyone with Asian language experience > chime in there, and that's where this matters most. I am directing > this to the Python i18n-sig mailing list, because that's where the > debate belongs, and there interested parties can join the discussion > without having to be vetted as "fit for python-dev" first. > > I apologize for having been less than responsive in the matter; > unfortunately there's lots of other stuff on my mind right now that > has recently had a tendency to distract me with higher priority > crises. > > I've heard a few people claim that strings should always be considered > to contain "characters" and that there should be one character per > string element. I've also heard a clamoring that there should only be > one string type. You folks have never used Asian encodings. In > countries like Japan, China and Korea, encodings are a fact of life, > and the most popular encodings are ASCII supersets that use a variable > number of bytes per character, just like UTF-8. Each country or > language uses different encodings, even though their characters look > mostly the same to western eyes. UTF-8 and Unicode is having a hard > time getting adopted in these countries because most software that > people use deals only with the local encodings. (Sounds familiar?) > > These encodings are much less "pure" than UTF-8, because they only > encode the local characters (and ASCII), and because of various > problems with slicing: if you look "in the middle" of an encoded > string or file, you may not know how to interpret the bytes you see. > There are overlaps (in most of these encodings anyway) between the > codes used for single-byte and double-byte encodings, and you may have > to look back one or more characters to know what to make of the > particular byte you see. To get an idea of the nightmares that > non-UTF-8 multibyte encodings give C/C++ programmers, see the > Multibyte Character Set (MBCS) Survival Guide > (http://msdn.microsoft.com/library/backgrnd/html/msdn_mbcssg.htm). > See also the home page of the i18n-sig for more background information > on encoding (and other i18n) issues > (http://www.python.org/sigs/i18n-sig/). > > UTF-8 attempts to solve some of these problems: the multi-byte > encodings are chosen such that you can tell by the high bits of each > byte whether it is (1) a single-byte (ASCII) character (top bit off), > (2) the start of a multi-byte character (at least two top bits on; how > many indicates the total number of bytes comprising the character), or > (3) a continuation byte in a multi-byte character (top bit on, next > bit off). > > Many of the problems with non-UTF-8 multibyte encodings are the same > as for UTF-8 though: #bytes != #characters, a byte may not be a valid > character, regular expression patterns using "." may give the wrong > results, and so on. 
> > The truth of the matter is: the encoding of string objects is in the > mind of the programmer. When I read a GIF file into a string object, > the encoding is "binary goop". When I read a line of Japanese text > from a file, the encoding may be JIS, shift-JIS, or ENC -- this has to > be an assumption built-in to my program, or perhaps information > supplied separately (there's no easy way to guess based on the actual > data). When I type a string literal using Latin-1 characters, the > encoding is Latin-1. When I use octal escapes in a string literal, > e.g. '\303\247', the encoding could be UTF-8 (this is a cedilla). > When I type a 7-bit string literal, the encoding is ASCII. > > The moral of all this? 8-bit strings are not going away. They are > not encoded in UTF-8 henceforth. Like before, and like 8-bit text > files, they are encoded in whatever encoding you want. All you get is > an extra mechanism to convert them to Unicode, and the Unicode > conversion defaults to UTF-8 because it is the only conversion that is > reversible. And, as Tim Peters quoted Andy Robinson (paraphrasing > Tim's paraphrase), UTF-8 annoys everyone equally. > > Where does the current approach require work? > > - We need a way to indicate the encoding of Python source code. > (Probably a "magic comment".) > > - We need a way to indicate the encoding of input and output data > files, and we need shortcuts to set the encoding of stdin, stdout and > stderr (and maybe all files opened without an explicit encoding). > Marc-Andre showed some sample code, but I believe it is still > cumbersome. (I have to play with it more to see how it could be > improved.) > > - We need to discuss whether there should be a way to change the > default conversion between Unicode and 8-bit strings (currently > hardcoded to UTF-8), in order to make life easier for people who want > to continue to use their favorite 8-bit encoding (e.g. Latin-1, or > shift-JIS) but who also want to make use of the new Unicode datatype. > > We're still in alpha, so we can still fix things. > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > > > From billtut@microsoft.com Fri Apr 28 00:50:53 2000 From: billtut@microsoft.com (Bill Tutt) Date: Thu, 27 Apr 2000 16:50:53 -0700 Subject: [I18n-sig] Re: Unicode debate Message-ID: <4D0A23B3F74DD111ACCD00805F31D8101D8BD020@RED-MSG-50> > Christopher Petrilli petrilli@amber.org >> Guido van Rossum [guido@python.org ] wrote: >> I've heard a few people claim that strings should always be considered >> to contain "characters" and that there should be one character per >> string element. I've also heard a clamoring that there should only be >> one string type. You folks have never used Asian encodings. In >> countries like Japan, China and Korea, encodings are a fact of life, >> and the most popular encodings are ASCII supersets that use a variable >> number of bytes per character, just like UTF-8. Each country or >> language uses different encodings, even though their characters look >> mostly the same to western eyes. UTF-8 and Unicode is having a hard >> time getting adopted in these countries because most software that >> people use deals only with the local encodings. (Sounds familiar?) > Actually a bigger concern that we hear from our customers in Japan is > that Unicode has *serious* problems in asian languages. Theey took > the "unification" of Chinese and Japanese, rather than both, and > therefore can not represent los of phrases quite right. 
I can have > someone write up a better dscription, but I was told by several > Japanese people that they wouldn't use Unicode come hell or high > water, basically. Yeah, not all of the east asian ideographs are availble in Unicode atm. :( Currently there are two pending extensions to the unified CJK ideographs. Extension A is slated as part of the BMP. 0x0000 - 0xAAFF in Plane 2 is currently slated for use by Extension B. BMP Roadmap: http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2213.pdf Plane 2 Roadmap: http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2215.pdf On top of which is there is this serious problem of end user defined characters in a number of these MBCS encodings. Win32 OSs handles mapping these characters into Unicode in the following way: In the Win32 registry at: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\EUDCCodeRan ge There exists several REG_SZ registry values. The names of the values are MBCS code pages. The values are source ranges in the codepage's code space. e.g.: 932: F040-F9FC 936: AAA1-AFFE,F8A1-FEFE,A140-A7A0 949: C9A1-C9FE,FEA1-FEFE 950: FA40-FEFE,8E40-A0FE,8140-8DFE,C6A1-C8FE etc.... These ranges get mapped into Unicode code space starting at U+E000 (the beginning of the BMP private use area). > Basically it's JJIS, Shift-JIS or nothing for most Japanese > companies. This was my experience working with Konica a few years ago > as well. Don't forget the new JIS X 0213. :) Bill From tree@basistech.com Fri Apr 28 01:01:17 2000 From: tree@basistech.com (Tom Emerson) Date: Thu, 27 Apr 2000 20:01:17 -0400 (EDT) Subject: [I18n-sig] Re: Unicode debate In-Reply-To: <4D0A23B3F74DD111ACCD00805F31D8101D8BD020@RED-MSG-50> References: <4D0A23B3F74DD111ACCD00805F31D8101D8BD020@RED-MSG-50> Message-ID: <14600.54477.689349.86328@cymru.basistech.com> Bill Tutt writes: > > Actually a bigger concern that we hear from our customers in Japan is > > that Unicode has *serious* problems in asian languages. Theey took > > the "unification" of Chinese and Japanese, rather than both, and > > therefore can not represent los of phrases quite right. I can have > > someone write up a better dscription, but I was told by several > > Japanese people that they wouldn't use Unicode come hell or high > > water, basically. Then tell them to use JIS X 0221 instead of Unicode! Since it is a Japanese National Standard they'll be pacified into using it, even though it is nothing more than the Japanese translation of ISO/IEC 10646-1.1993. This is becoming a bit of an urban legend: while it is true that during the initial Han unification period for Unicode 1.0 there was pushback from the Japanese who thought that characters were being left out. This issue is one of glyph variants between Japanese kanji, Simplified and Traditional Chinese hanzi, and Korean hanja: the same character can take different forms in each of these locales. Remember that one of the criterion for the Unified ideographs was that mapping between legacy encodings and Unicode can be accomplished. If a character can be found in an existing national standard (in the case of Japan), then chances are that code point is found in the Unicode block. > Yeah, not all of the east asian ideographs are availble in Unicode atm. :( But most, if not all, of the commonly used characters *are* available in Unicode 3.0. It is rare, especially for Japanese, to find words that cannot be encoded in Unicode. > Currently there are two pending extensions to the unified CJK ideographs. > Extension A is slated as part of the BMP. 
0x0000 - 0xAAFF in Plane 2 is Extension A is part of Unicode 3.0 and will be in the BMP when ISO/IEC 10646.2000 is released. > On top of which is there is this serious problem of end user defined > characters in a number of these MBCS encodings. Especially true when dealing with the Hong Kong Supplementary Character Set (HKSCS). However, the HKSAR provides mapping tables for between Big Five and HKSCS and ISO/IEC 10646.1993 and .2000 (two 10646 tables are required since some of the code points in the HKSCS are included in IEB-A --- the rest should appear in IEB-B). The problem is when you want to transcode between Chinese encodings: you cannot go from HKSCS to GB2312 or GBK --- the mappings simply do not exist. > Don't forget the new JIS X 0213. :) Has it been published? -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From billtut@microsoft.com Fri Apr 28 01:17:01 2000 From: billtut@microsoft.com (Bill Tutt) Date: Thu, 27 Apr 2000 17:17:01 -0700 Subject: [I18n-sig] Re: Unicode debate Message-ID: <4D0A23B3F74DD111ACCD00805F31D8101D8BD021@RED-MSG-50> > From: Tom Emerson [mailto:tree@cymru.basistech.com] > > > > Don't forget the new JIS X 0213. :) > > Has it been published? > Apparently so. http://jcs.aa.tufs.ac.jp/jcs/index-e.htm notes: The new Japanese Industrial Standard for a coded character set, JIS X0213 (an enhancement to the current X0208), has been established on January the 21th, 2000. The standard has been published on February the 29th, 2000. The standard (written in Japanese) is priced 11,000(Japanese Yen, 541pages), and is distributed by Japanese Standards Association Bill From paul@prescod.net Fri Apr 28 03:20:22 2000 From: paul@prescod.net (Paul Prescod) Date: Thu, 27 Apr 2000 21:20:22 -0500 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> Message-ID: <3908F566.8E5747C@prescod.net> Guido van Rossum wrote: > > ... > > I've heard a few people claim that strings should always be considered > to contain "characters" and that there should be one character per > string element. I've also heard a clamoring that there should only be > one string type. You folks have never used Asian encodings. In > countries like Japan, China and Korea, encodings are a fact of life, > and the most popular encodings are ASCII supersets that use a variable > number of bytes per character, just like UTF-8. Each country or > language uses different encodings, even though their characters look > mostly the same to western eyes. UTF-8 and Unicode is having a hard > time getting adopted in these countries because most software that > people use deals only with the local encodings. (Sounds familiar?) I think that maybe an important point is getting lost here. I could be wrong, but it seems that all of this emphasis on encodings is misplaced. The physical and logical makeup of character strings are entirely separate issues. Unicode is a character set. It works in the logical domain. Dozens of different physical encodings can be used for Unicode characters. There are XML users who work with XML (and thus Unicode) every day and never see UTF-8, UTF-16 or any other Unicode-consortium "sponsored" encoding. If you invent an encoding tomorrow, it can still be XML-compatible. There are many encodings older than Unicode that are XML (and Unicode) compatible. 
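That separation is easy to see at the interpreter; a small sketch using only codecs that ship with the new Unicode support:

    u = u'caf\351'                   # one logical string: c, a, f, e-acute
    print len(u)                     # 4, no matter how it gets stored
    print len(u.encode('utf-8'))     # 5 bytes in this physical form
    print len(u.encode('latin-1'))   # 4 bytes in this one
    print len(u.encode('utf-16'))    # 10 bytes here (byte order mark + 2 bytes per character)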
I have not heard complaints about the XML way of looking at the world and in fact it was explicitly endorsed by many of the world's leading experts on internationalization. I haven't followed the Java situation as closely but I have also not heard screams about its support for il8n. > The truth of the matter is: the encoding of string objects is in the > mind of the programmer. When I read a GIF file into a string object, > the encoding is "binary goop". IMHO, it's a mistake of history that you would even think it makes sense to read a GIF file into a "string" object and we should be trying to erase that mistake, as quickly as possible (which is admittedly not very quickly) not building more and more infrastructure around it. How can we make the transition to a "binary goops are not strings" world easiest? > The moral of all this? 8-bit strings are not going away. If that is a statement of your long term vision, then I think that it is very unfortunate. Treating string literals as if they were isomorphic with byte arrays was probably the right thing in 1991 but it won't be in 2005. It doesn't meet the definition of string used in the Unicode spec., nor in XML, nor in Java, nor at the W3C nor in most other up and coming specifications. From the W3C site: ""While ISO-2022-JP is not sufficient for every ISO10646 document, it is the case that ISO10646 is a sufficient document character set for any entity encoded with ISO-2022-JP."" http://www.w3.org/MarkUp/html-spec/charset-harmful.html -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From just@letterror.com Fri Apr 28 09:33:16 2000 From: just@letterror.com (Just van Rossum) Date: Fri, 28 Apr 2000 09:33:16 +0100 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: <200004271501.LAA13535@eric.cnri.reston.va.us> References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: At 11:01 AM -0400 27-04-2000, Guido van Rossum wrote: >Where does the current approach require work? > >- We need a way to indicate the encoding of Python source code. >(Probably a "magic comment".) How will other parts of a program know which encoding was used for non-unicode string literals? It seems to me that an encoding attribute for 8-bit strings solves this nicely. The attribute should only be set automatically if the encoding of the source file was specified or when the string has been encoded from a unicode string. The attribute should *only* be used when converting to unicode. (Hm, it could even be used when calling unicode() without the encoding argument.) It should *not* be used when comparing (or adding, etc.) 8-bit strings to each other, since they still may contain binary goop, even in a source file with a specified encoding! >- We need a way to indicate the encoding of input and output data >files, and we need shortcuts to set the encoding of stdin, stdout and >stderr (and maybe all files opened without an explicit encoding). Can you open a file *with* an explicit encoding? Just From mal@lemburg.com Fri Apr 28 10:39:37 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 28 Apr 2000 11:39:37 +0200 Subject: [I18n-sig] Re: [Python-Dev] Re: [XML-SIG] Python 1.6a2 Unicode experiences? 
References: <200004270208.WAA01413@newcnri.cnri.reston.va.us> <001c01bfb033$96bf66d0$01ac2ac0@boulder> <3908F5B8.9F8D8A9A@prescod.net> <20000428001229.A4790@trump.amber.org> Message-ID: <39095C59.A5916EEB@lemburg.com> [Note: These discussion should all move to 18n-sig... CCing there] Christopher Petrilli wrote: > > Paul Prescod [paul@prescod.net] wrote: > > > Even working with exotic languages, there is always a native > > > 8-bit encoding. > > > > Unicode has many encodings: Shift-JIS, Big-5, EBCDIC ... You can use > > 8-bit encodings of Unicode if you want. > > Um, if you go: > > JIS -> Unicode -> JIS > > you don't get the same thing out that you put in (at least this is > what I've been told by a lot of Japanese developers), and therefore > it's not terribly popular because of the nature of the Japanese (and > Chinese) langauge. > > My experience with Unicode is that a lot of Western people think it's > the answer to every problem asked, while most asian language people > disagree vehemently. This says the problem isn't solved yet, even if > people wish to deny it. Isn't this a problem of the translation rather than Unicode itself (Andy mentioned several times that you can use the private BMP areas to implement 1-1 round-trips) ? -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Apr 28 11:28:48 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 28 Apr 2000 12:28:48 +0200 Subject: [I18n-sig] Re: Unicode debate References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: <390967DF.5424E6DF@lemburg.com> Just van Rossum wrote: > > At 11:01 AM -0400 27-04-2000, Guido van Rossum wrote: > >Where does the current approach require work? > > > >- We need a way to indicate the encoding of Python source code. > >(Probably a "magic comment".) > > How will other parts of a program know which encoding was used for > non-unicode string literals? > > It seems to me that an encoding attribute for 8-bit strings solves this > nicely. The attribute should only be set automatically if the encoding of > the source file was specified or when the string has been encoded from a > unicode string. The attribute should *only* be used when converting to > unicode. (Hm, it could even be used when calling unicode() without the > encoding argument.) It should *not* be used when comparing (or adding, > etc.) 8-bit strings to each other, since they still may contain binary > goop, even in a source file with a specified encoding! This would indeed solve some issues... it would cost sizeof(short) per string object though (the integer would map into a table of encoding names). I'm not sure what to do with the attribute when strings with differing encodings meet. UTF-8 + ASCII will still be UTF-8, but e.g. UTF-8 + Latin will not result in meaningful data. Two ideas for coercing strings with different encodings: 1. the encoding of the resulting string is set to 'undefined' 2. coerce both strings to Unicode and then apply the action Also, how would one create a string having a specific encoding ? str(object, encname) would match unicode(object, encname)... > >- We need a way to indicate the encoding of input and output data > >files, and we need shortcuts to set the encoding of stdin, stdout and > >stderr (and maybe all files opened without an explicit encoding). > > Can you open a file *with* an explicit encoding? 
You can specify the encoding by means of using codecs.open() instead of open(), but the interface will currently only accept (.write) and return (.read) Unicode objects. We'll probably have to make these a little more comfortable, e.g. by accepting strings and Unicode objects. The needed machinery is there -- we'd only need to define a suitable interface on top of the classic file interface. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tree@basistech.com Fri Apr 28 11:44:00 2000 From: tree@basistech.com (Tom Emerson) Date: Fri, 28 Apr 2000 06:44:00 -0400 (EDT) Subject: [I18n-sig] Re: Unicode debate In-Reply-To: References: Message-ID: <14601.27504.337569.201251@cymru.basistech.com> Just van Rossum writes: > How will other parts of a program know which encoding was used for > non-unicode string literals? This is the exact reason that Unicode should be used for all string literals: from a language design perspective I don't understand the rationale for providing "traditional" and "unicode" string. > It seems to me that an encoding attribute for 8-bit strings solves this > nicely. The attribute should only be set automatically if the encoding of > the source file was specified or when the string has been encoded from a > unicode string. The attribute should *only* be used when converting to > unicode. (Hm, it could even be used when calling unicode() without the > encoding argument.) It should *not* be used when comparing (or adding, > etc.) 8-bit strings to each other, since they still may contain binary > goop, even in a source file with a specified encoding! In Dylan there is an explicit split between 'characters' (which are always Unicode) and 'bytes'. What are the compelling reasons to not use UTF-8 as the (source) document encoding? In the past the usual response is, "the tools are't there for authoring UTF-8 documents". This argument becomes more specious as more OS's move towards Unicode. I firmly believe this can be done without Java's bloat. One off-the-cuff solution is this: All character strings are Unicode (utf-8 encoding). Language terminals and operators are restricted to US-ASCII, which are identical to UTF8. The contents of comments are not interpreted in any way. > >- We need a way to indicate the encoding of input and output data > >files, and we need shortcuts to set the encoding of stdin, stdout and > >stderr (and maybe all files opened without an explicit encoding). > > Can you open a file *with* an explicit encoding? If you cannot, you lose. You absolutely must be able to specify the encoding of a file when opening it, so that the runtime can transcode into the native encoding as you read it. This should be otherwise transparent the user. -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From just@letterror.com Fri Apr 28 12:58:28 2000 From: just@letterror.com (Just van Rossum) Date: Fri, 28 Apr 2000 12:58:28 +0100 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: <390967DF.5424E6DF@lemburg.com> References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: At 12:28 PM +0200 28-04-2000, M.-A. Lemburg wrote: [ encoding attr for 8 bit strings ] >This would indeed solve some issues... it would cost sizeof(short) >per string object though (the integer would map into a table >of encoding names). 
> >I'm not sure what to do with the attribute when strings with >differing encodings meet. UTF-8 + ASCII will still be UTF-8, >but e.g. UTF-8 + Latin will not result in meaningful data. Two >ideas for coercing strings with different encodings: > > 1. the encoding of the resulting string is set to 'undefined' > > 2. coerce both strings to Unicode and then apply the action 1, because 2 can lead to surprises when two strings containing binary goop are added and only one was a literal in a source file with an explicit encoding. (Would "undefined" be the same as "default"? It would still be nice to be able to set the global default encoding.) >Also, how would one create a string having a specific encoding ? >str(object, encname) would match unicode(object, encname)... Dunno. Is such a high level interface needed? I'm not proposing to make 8-bit strings almost as powerful as unicode strings: unicode strings are just fine for those kinds of operations... Hm, I just realized that the encoding attr can't be mutable (doh!), so maybe your suggestion isn't so bad at all. Off-topic, what's the idea behind this behavior?: >>> unicode(u"abc") u'\000a\000b\000c' >> Can you open a file *with* an explicit encoding? > >You can specify the encoding by means of using codecs.open() >instead of open(), but the interface will currently only >accept (.write) and return (.read) Unicode objects. Thanks, I wasn't aware of that. Can't the builtin open() function get an additional encoding argument? Just From tree@basistech.com Fri Apr 28 11:56:50 2000 From: tree@basistech.com (Tom Emerson) Date: Fri, 28 Apr 2000 06:56:50 -0400 (EDT) Subject: [I18n-sig] Re: [Python-Dev] Re: [XML-SIG] Python 1.6a2 Unicode experiences? In-Reply-To: <39095C59.A5916EEB@lemburg.com> References: <200004270208.WAA01413@newcnri.cnri.reston.va.us> <001c01bfb033$96bf66d0$01ac2ac0@boulder> <3908F5B8.9F8D8A9A@prescod.net> <20000428001229.A4790@trump.amber.org> <39095C59.A5916EEB@lemburg.com> Message-ID: <14601.28274.667733.660938@cymru.basistech.com> M.-A. Lemburg writes: > > > Unicode has many encodings: Shift-JIS, Big-5, EBCDIC ... You can use > > > 8-bit encodings of Unicode if you want. This is meaningless: legacy encodings of national character sets such Shift-JIS, Big Five, GB2312, or TIS620 are not "encodings" of Unicode. TIS620 is a single-byte, 8-bit encoding: each character is represented by a single byte. The Japanese and Chinese encodings are multibyte, 8-bit, encodings. ISO-2022 is a multi-byte, 7-bit encoding for multiple character sets. Unicode has several possible encodings: UTF-8, UCS-2, UCS-4, UTF-16... You can view all of these as 8-bit encodings, if you like. Some are multibyte (such as UTF-8, where each character in Unicode is represented in 1 to 3 bytes) while others are fixed length, two or four bytes per character. > > Um, if you go: > > > > JIS -> Unicode -> JIS > > > > you don't get the same thing out that you put in (at least this is > > what I've been told by a lot of Japanese developers), and therefore > > it's not terribly popular because of the nature of the Japanese (and > > Chinese) langauge. This is simply not true any more. The ability to round trip between Unicode and legacy encodings is dependent on the software: being able to use code points in the PUA for this is acceptable and commonly done. The big advantage is in using Unicode as a pivot when transcoding between different CJK encodings. It is very difficult to map between, say, Shift JIS and GB2312, directly. However, Unicode provides a good go-between. 
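In code the pivot is just a decode followed by an encode -- a sketch, assuming Shift-JIS and EUC-JP codecs (e.g. Tamito's JapaneseCodecs) are registered under the names used here:

    def transcode(data, from_enc, to_enc):
        # go through Unicode instead of mapping the two legacy encodings directly
        return unicode(data, from_enc).encode(to_enc)

    # e.g., with sjis_data holding some Shift-JIS encoded text:
    # euc_data = transcode(sjis_data, 'shift_jis', 'euc_jp')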
It isn't a panacea: transcoding between legacy encodings like GB2312 and Big Five is still difficult: Unicode or not. > > My experience with Unicode is that a lot of Western people think it's > > the answer to every problem asked, while most asian language people > > disagree vehemently. This says the problem isn't solved yet, even if > > people wish to deny it. This is a shame: it is an indication that they don't understand the technology. Unicode is a tool: nothing more. -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From fredrik@pythonware.com Fri Apr 28 13:15:06 2000 From: fredrik@pythonware.com (Fredrik Lundh) Date: Fri, 28 Apr 2000 14:15:06 +0200 Subject: [I18n-sig] Re: [Python-Dev] Re: [XML-SIG] Python 1.6a2 Unicode experiences? References: <200004270208.WAA01413@newcnri.cnri.reston.va.us> <001c01bfb033$96bf66d0$01ac2ac0@boulder> <3908F5B8.9F8D8A9A@prescod.net> <20000428001229.A4790@trump.amber.org> <39095C59.A5916EEB@lemburg.com> Message-ID: <00d101bfb10b$68585800$0500a8c0@secret.pythonware.com> Christopher Petrilli wrote: >=20 > Paul Prescod [paul@prescod.net] wrote: > > > Even working with exotic languages, there is always a native > > > 8-bit encoding. > > > > Unicode has many encodings: Shift-JIS, Big-5, EBCDIC ... You can use > > 8-bit encodings of Unicode if you want. >=20 > Um, if you go: >=20 > JIS -> Unicode -> JIS >=20 > you don't get the same thing out that you put in (at least this is > what I've been told by a lot of Japanese developers), and therefore > it's not terribly popular because of the nature of the Japanese (and > Chinese) langauge. >=20 > My experience with Unicode is that a lot of Western people think it's > the answer to every problem asked, while most asian language people > disagree vehemently. This says the problem isn't solved yet, even if > people wish to deny it. this is partly true, partly caused by a confusion over what unicode really is. there are at least two issues involved here: * the unicode character repertoire is not complete unicode contains all characters from the basic JIS X character sets (please correct me if I'm wrong), but it doesn't include all characters in common use in Japan. as far as I've understood, this is mostly personal names and trade names. however, different vendors tend to use different sets, with different encodings, and there has been no consensus on which to add, and how. so in other words, if you're "transcoding" from one encoding to another (when converting data, or printing or displaying on a device assuming a different encoding), unicode isn't good enough. as MAL pointed out, you can work around this by using custom codecs, mapping the vendor specific characters that you happen to use to private regions in the unicode code space. but afaik, there is no standard way to do that at this time. (this probably applies to other "CJK languages" too. if anyone could verify that, I'd be grateful). * unicode is about characters, not languages if you have a unicode string, you still don't know how to display it. the string tells you what characters to use, not what language the text is written in. and while using one standard "glyph" per unicode character works pretty well for latin characters (no, it's not perfect, but it's not much of a problem in real life), it doesn't work for asian languages. you need extra language/locale information to pick the right glyph for any given unicode character. 
and the crux is that before unicode, this wasn't really a problem -- if you knew the encoding, you knew what language to use. when using unicode, you need to put that information somewhere else (in an XML attribute, for example). * corrections and additions are welcome, of course. From mal@lemburg.com Fri Apr 28 13:13:56 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 28 Apr 2000 14:13:56 +0200 Subject: [I18n-sig] Re: Unicode debate References: <14601.27504.337569.201251@cymru.basistech.com> Message-ID: <39098084.C9600963@lemburg.com> Tom Emerson wrote: > > Just van Rossum writes: > > How will other parts of a program know which encoding was used for > > non-unicode string literals? > > This is the exact reason that Unicode should be used for all string > literals: from a language design perspective I don't understand the > rationale for providing "traditional" and "unicode" string. > > > It seems to me that an encoding attribute for 8-bit strings solves this > > nicely. The attribute should only be set automatically if the encoding of > > the source file was specified or when the string has been encoded from a > > unicode string. The attribute should *only* be used when converting to > > unicode. (Hm, it could even be used when calling unicode() without the > > encoding argument.) It should *not* be used when comparing (or adding, > > etc.) 8-bit strings to each other, since they still may contain binary > > goop, even in a source file with a specified encoding! > > In Dylan there is an explicit split between 'characters' (which are > always Unicode) and 'bytes'. > > What are the compelling reasons to not use UTF-8 as the (source) > document encoding? In the past the usual response is, "the tools are't > there for authoring UTF-8 documents". This argument becomes more > specious as more OS's move towards Unicode. I firmly believe this can > be done without Java's bloat. > > One off-the-cuff solution is this: > > All character strings are Unicode (utf-8 encoding). Language terminals > and operators are restricted to US-ASCII, which are identical to > UTF8. The contents of comments are not interpreted in any way. That would be an option... albeit one that would probably render many of the existing programs useless (I do believe that many people have encoded their local charset into their programs, either by entering locale dependent strings directly in the source code or by making some assumption about their encoding). > > >- We need a way to indicate the encoding of input and output data > > >files, and we need shortcuts to set the encoding of stdin, stdout and > > >stderr (and maybe all files opened without an explicit encoding). > > > > Can you open a file *with* an explicit encoding? > > If you cannot, you lose. You absolutely must be able to specify the > encoding of a file when opening it, so that the runtime can transcode > into the native encoding as you read it. This should be otherwise > transparent the user. You can: codecs.open(). The interface needs some further refinement though. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Apr 28 13:09:36 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 28 Apr 2000 14:09:36 +0200 Subject: [I18n-sig] Re: Unicode debate References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: <39097F80.6A0E9FBD@lemburg.com> [Diving off into the Great Unkown... 
perhaps we'll end up with a useful proposal ;-)] Just van Rossum wrote: > > At 12:28 PM +0200 28-04-2000, M.-A. Lemburg wrote: > [ encoding attr for 8 bit strings ] > >This would indeed solve some issues... it would cost sizeof(short) > >per string object though (the integer would map into a table > >of encoding names). > > > >I'm not sure what to do with the attribute when strings with > >differing encodings meet. UTF-8 + ASCII will still be UTF-8, > >but e.g. UTF-8 + Latin will not result in meaningful data. Two > >ideas for coercing strings with different encodings: > > > > 1. the encoding of the resulting string is set to 'undefined' > > > > 2. coerce both strings to Unicode and then apply the action > > 1, because 2 can lead to surprises when two strings containing binary goop > are added and only one was a literal in a source file with an explicit > encoding. > > (Would "undefined" be the same as "default"? It would still be nice to be > able to set the global default encoding.) I should have been more precise: 2. provided both strings have encodings which can be converted to Unicode, coerce them to Unicode and then apply the action; otherwise proceed as in 1., i.e. the result has an undefined encoding. If 2. does try to convert to Unicode, conversion errors should be raised (just like they are now for Unicode coercion errors). Some more tricky business: How should str('bla', 'enc1') and str('bla', 'enc2') compare ? What about the hash values of the two ? > >Also, how would one create a string having a specific encoding ? > >str(object, encname) would match unicode(object, encname)... > > Dunno. Is such a high level interface needed? I'm not proposing to make > 8-bit strings almost as powerful as unicode strings: unicode strings are > just fine for those kinds of operations... Hm, I just realized that the > encoding attr can't be mutable (doh!), so maybe your suggestion isn't so > bad at all. That's why I was proposing str(obj, encname)... because the encoding can't be changed after creation. Default encoding would be 'undefined' for strings created dynamically using just "..." and the source code encoding in case the strings were defined in a Python source file (the compiler would set the encoding). Hmm, we'd still loose big in case someone puts a raw data string into a Python source file without changing the encoding to e.g. 'binary'. We'd then have to write: s = "...bla..." # source code encoding data = str("...data...","binary") # binary data Although binary data should really use: data = buffer("...data...") Side note: "...bla..." + buffer("...data...") currently returns "...bla......data..." -- not very useful: I would have expected a new buffer object instead. With string encoding attribute this could be remedied to produce a string having 'binary' encoding (at least). Some more issues: How should str(obj,encname) extract the information from the object: via getcharbuf or getreadbuf ? Should it take the encoding of the obj into account (in case it is a string object) ? What should str(unicode, encname) return (the same as unicode.encode(encname)) ? What would file.read() return (a string with 'undefined' encoding ?) ? An extra parameter to open() could be added to have it return strings with a predefined encoding. > Off-topic, what's the idea behind this behavior?: > >>> unicode(u"abc") > u'\000a\000b\000c' Hmm, I get: >>> unicode(u"abc") u'abc' This was fixed upon Guido's request some weeks ago. > >> Can you open a file *with* an explicit encoding? 
> > > >You can specify the encoding by means of using codecs.open() > >instead of open(), but the interface will currently only > >accept (.write) and return (.read) Unicode objects. > > Thanks, I wasn't aware of that. Can't the builtin open() function get an > additional encoding argument? That would be probably be an option after some rounds of refinement of the interface. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido@python.org Fri Apr 28 14:24:29 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 28 Apr 2000 09:24:29 -0400 Subject: [I18n-sig] Re: [Python-Dev] Re: [XML-SIG] Python 1.6a2 Unicode experiences? In-Reply-To: Your message of "Fri, 28 Apr 2000 11:39:37 +0200." <39095C59.A5916EEB@lemburg.com> References: <200004270208.WAA01413@newcnri.cnri.reston.va.us> <001c01bfb033$96bf66d0$01ac2ac0@boulder> <3908F5B8.9F8D8A9A@prescod.net> <20000428001229.A4790@trump.amber.org> <39095C59.A5916EEB@lemburg.com> Message-ID: <200004281324.JAA15642@eric.cnri.reston.va.us> > [Note: These discussion should all move to 18n-sig... CCing there] > > Christopher Petrilli wrote: > > you don't get the same thing out that you put in (at least this is > > what I've been told by a lot of Japanese developers), and therefore > > it's not terribly popular because of the nature of the Japanese (and > > Chinese) langauge. > > > > My experience with Unicode is that a lot of Western people think it's > > the answer to every problem asked, while most asian language people > > disagree vehemently. This says the problem isn't solved yet, even if > > people wish to deny it. [Marc-Andre Lenburg] > Isn't this a problem of the translation rather than Unicode > itself (Andy mentioned several times that you can use the private > BMP areas to implement 1-1 round-trips) ? Maybe, but apparently such high-quality translations are rare (note that Andy said "can"). Anyway, a word of caution here. Years ago I attended a number of IETF meetings on internationalization, in a time when Unicode wasn't as accepted as it is now. The one thing I took away from those meetings was that this is a *highly* emotional and controversial issue. As the Python community, I feel we have no need to discuss "why Unicode." Therein lies madness, controversy, and no progress. We know there's a clear demand for Unicode, and we've committed to support it. The question now at hand is "how Unicode." Let's please focus on that, e.g. in the other thread ("Unicode debate") in i18n-sig and python-dev. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri Apr 28 15:10:27 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 28 Apr 2000 10:10:27 -0400 Subject: [I18n-sig] Re: [Python-Dev] Re: Unicode debate In-Reply-To: Your message of "Fri, 28 Apr 2000 09:33:16 BST." References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: <200004281410.KAA16104@eric.cnri.reston.va.us> [GvR] > >- We need a way to indicate the encoding of Python source code. > >(Probably a "magic comment".) [JvR] > How will other parts of a program know which encoding was used for > non-unicode string literals? > > It seems to me that an encoding attribute for 8-bit strings solves this > nicely. The attribute should only be set automatically if the encoding of > the source file was specified or when the string has been encoded from a > unicode string. 
The attribute should *only* be used when converting to > unicode. (Hm, it could even be used when calling unicode() without the > encoding argument.) It should *not* be used when comparing (or adding, > etc.) 8-bit strings to each other, since they still may contain binary > goop, even in a source file with a specified encoding! Marc-Andre took this idea a bit further, but I think it's not practical given the current implementation: there are too many places where the C code would have to be changed in order to propagate the string encoding information, and there are too many sources of strings with unknown encodings to make it very useful. Plus, it would slow down 8-bit string ops. I have a better idea: rather than carrying around 8-bit strings with an encoding, use Unicode literals in your source code. If the source encoding is known, these will be converted using the appropriate codec. If you object to having to write u"..." all the time, we could say that "..." is a Unicode literal if it contains any characters with the top bit on (of course the source file encoding would be used just like for u"..."). But I think this should be enabled by a separate pragma -- people who want to write Unicode-unaware code manipulating 8-bit strings in their favorite encoding (e.g. shift-JIS or Latin-1) should not silently get Unicode strings. (I thought about an option to make *all strings* (not just literals) Unicode, but the current implementation would require too much hacking. This is what JPython does, and maybe it should be what Python 3000 does; I don't see it as a realistic option for the 1.x series.) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri Apr 28 15:32:28 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 28 Apr 2000 10:32:28 -0400 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: Your message of "Fri, 28 Apr 2000 06:44:00 EDT." <14601.27504.337569.201251@cymru.basistech.com> References: <14601.27504.337569.201251@cymru.basistech.com> Message-ID: <200004281432.KAA16418@eric.cnri.reston.va.us> > This is the exact reason that Unicode should be used for all string > literals: from a language design perspective I don't understand the > rationale for providing "traditional" and "unicode" string. In Python 3000, you would have a point. In current Python, there simply are too many programs and extensions written in other languages that manipulating 8-bit strings to ignore their existence. We're trying to add Unicode support to Python 1.6 without breaking code that used to run under Python 1.5.x; practicalities just make it impossible to go with Unicode for everything. I think that if Python didn't have so many extension modules (many maintained by 3rd party modules) it would be a lot easier to switch to Unicode for all strings (I think JavaScript has done this). In Python 3000, we'll have to seriously consider having separate character string and byte array objects, along the lines of Java's model. Note that I say "seriously consider." We'll first have to see how well the current solution works *in practice*. There's time before we fix Py3k in stone. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri Apr 28 15:50:05 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 28 Apr 2000 10:50:05 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Thu, 27 Apr 2000 21:20:22 CDT." 
<3908F566.8E5747C@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> Message-ID: <200004281450.KAA16493@eric.cnri.reston.va.us> [Paul Prescod] > I think that maybe an important point is getting lost here. I could be > wrong, but it seems that all of this emphasis on encodings is misplaced. In practical applications that manipulate text, encodings creep up all the time. I remember a talk or message by Andy Robinson about the messiness of producing printed reports in Japanese for a large investment firm. Most off the issues that took his time had to do with encodings, if I recall correctly. (Andy, do you remember what I'm talking about? Do you have a URL?) > > The truth of the matter is: the encoding of string objects is in the > > mind of the programmer. When I read a GIF file into a string object, > > the encoding is "binary goop". > > IMHO, it's a mistake of history that you would even think it makes sense > to read a GIF file into a "string" object and we should be trying to > erase that mistake, as quickly as possible (which is admittedly not very > quickly) not building more and more infrastructure around it. How can we > make the transition to a "binary goops are not strings" world easiest? I'm afraid that's a bigger issue than we can solve for Python 1.6. We're committed to by and large backwards compatibility while supporting Unicode -- the backwards compatibility with tons of extension module (many 3rd party) requires that we deal with 8-bit strings in basically the same way as we did before. > > The moral of all this? 8-bit strings are not going away. > > If that is a statement of your long term vision, then I think that it is > very unfortunate. Treating string literals as if they were isomorphic > with byte arrays was probably the right thing in 1991 but it won't be in > 2005. I think you're a tad too optimistic about the evolution speed of software (Windows 2000 *still* has to support DOS programs), but I see your point. As I stated in another message, in Python 3000 we'll have to consider a more Java-esque solution: *character* strings are Unicode, and for bytes we have (mutable!) byte arras. Certainly 8-bit bytes as the smallest storage unit aren't going away. > It doesn't meet the definition of string used in the Unicode spec., nor > in XML, nor in Java, nor at the W3C nor in most other up and coming > specifications. OK, so that's a good indication of where you're coming from. Maybe you should spend a little more time in the trenches and a little less in standards bodies. Standards are good, but sometimes disconnected from reality (remember ISO networking? :-). > From the W3C site: > > ""While ISO-2022-JP is not sufficient for every ISO10646 document, it is > the case that ISO10646 is a sufficient document character set for any > entity encoded with ISO-2022-JP."" And this is exactly why encodings will remain important: entities encoded in ISO-2022-JP have no compelling reason to be recoded permanently into ISO10646, and there are lots of forces that make it convenient to keep it encoded in ISO-2022-JP (like existing tools). > http://www.w3.org/MarkUp/html-spec/charset-harmful.html I know that document well. 
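In Python terms, working with such data comes down to decoding at the boundaries and checking that nothing is lost on the way back. A minimal round-trip check -- a sketch only, assuming an ISO-2022-JP codec is installed (at the time of writing that means an add-on package such as JapaneseCodecs rather than the standard library):

    # Does a legacy-encoded byte string survive a trip through
    # Unicode and back unchanged?
    def survives_round_trip(data, encoding):
        try:
            return unicode(data, encoding).encode(encoding) == data
        except UnicodeError:
            # undecodable (or unencodable) data counts as a failure
            return 0

    data = open('report.jis', 'rb').read()
    print survives_round_trip(data, 'iso-2022-jp')

This is essentially the per-record validation Andy describes in the case study below.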
--Guido van Rossum (home page: http://www.python.org/~guido/) From andy@reportlab.com Fri Apr 28 17:12:39 2000 From: andy@reportlab.com (Andy Robinson) Date: Fri, 28 Apr 2000 17:12:39 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200004281450.KAA16493@eric.cnri.reston.va.us> Message-ID: Guido> In practical applications that manipulate text, encodings creep up all Guido> the time. I remember a talk or message by Andy Robinson about the Guido> messiness of producing printed reports in Japanese for a large Guido> investment firm. Most off the issues that took his time had to do Guido> with encodings, if I recall correctly. (Andy, do you remember what Guido> I'm talking about? Do you have a URL?) Guido> I attach the 'Case Study' I posted to the python-dev list when I first joined. If anyone else can tell their own stories, however long or short, I feel it would be a useful addition to the present discussion. - Andy >To: python-dev@python.org >Subject: [Python-Dev] Internationalisation Case Study >From: Andy Robinson >Date: Tue, 9 Nov 1999 05:57:46 -0800 (PST) > >Guido has asked me to get involved in this discussion, >as I've been working practically full-time on i18n for >the last year and a half and have done quite a bit >with Python in this regard. I thought the most >helpful thing would be to describe the real-world >business problems I have been tackling so people can >understand what one might want from an encoding >toolkit. In this (long) post I have included: >1. who I am and what I want to do >2. useful sources of info >3. a real world i18n project >4. what I'd like to see in an encoding toolkit > > >Grab a coffee - this is a long one. > >1. Who I am >-------------- >Firstly, credentials. I'm a Python programmer by >night, and when I can involve it in my work which >happens perhaps 20% of the time. More relevantly, I >did a postgrad course in Japanese Studies and lived in >Japan for about two years; in 1990 when I returned, I >was speaking fairly fluently and could read a >newspaper with regular reference tio a dictionary. >Since then my Japanese has atrophied badly, but it is >good enough for IT purposes. For the last year and a >half I have been internationalizing a lot of systems - >more on this below. > >My main personal interest is that I am hoping to >launch a company using Python for reporting, data >cleaning and transformation. An encoding library is >sorely needed for this. > >2. Sources of Knowledge >------------------------------ >We should really go for world class advice on this. >Some people who could really contribute to this >discussion are: >- Ken Lunde, author of "CJKV Information Processing" >and head of Asian Type Development at Adobe. >- Jeffrey Friedl, author of "Mastering Regular >Expressions", and a long time Japan resident and >expert on things Japanese >- Maybe some of the Ruby community? > >I'll list up books URLs etc. for anyone who needs them >on request. > >3. A Real World Project >---------------------------- >18 months ago I was offered a contract with one of the >world's largest investment management companies (which >I will nickname HugeCo) , who (after many years having >analysts out there) were launching a business in Japan >to attract savers; due to recent legal changes, >Japanese people can now freely buy into mutual funds >run by foreign firms. Given the 2% they historically >get on their savings, and the 12% that US equities >have returned for most of this century, this is a >business with huge potential. 
I've been there for a >while now, >rotating through many different IT projects. > >HugeCo runs its non-US business out of the UK. The >core deal-processing business runs on IBM AS400s. >These are kind of a cross between a relational >database and a file system, and speak their own >encoding called EBCDIC. Five years ago the AS400 >had limited >connectivity to everything else, so they also started >deploying Sybase databases on Unix to support some >functions. This means 'mirroring' data between the >two systems on a regular basis. IBM has always >included encoding information on the AS400 and it >converts from EBCDIC to ASCII on request with most of >the transfer tools (FTP, database queries etc.) > >To make things work for Japan, everyone realised that >a double-byte representation would be needed. >Japanese has about 7000 characters in most IT-related >character sets, and there are a lot of ways to store >it. Here's a potted language lesson. (Apologies to >people who really know this field -- I am not going to >be fully pedantic or this would take forever). > >Japanese includes two phonetic alphabets (each with >about 80-90 characters), the thousands of Kanji, and >English characters, often all in the same sentence. >The first attempt to display something was to >make a single -byte character set which included >ASCII, and a simplified (and very ugly) katakana >alphabet in the upper half of the code page. So you >could spell out the sounds of Japanese words using >'half width katakana'. > >The basic 'character set' is Japan Industrial Standard >0208 ("JIS"). This was defined in 1978, the first >official Asian character set to be defined by a >government. This can be thought of as a printed >chart >showing the characters - it does not define their >storage on a computer. It defined a logical 94 x 94 >grid, and each character has an index in this grid. > >The "JIS" encoding was a way of mixing ASCII and >Japanese in text files and emails. Each Japanese >character had a double-byte value. It had 'escape >sequences' to say 'You are now entering ASCII >territory' or the opposite. In 1978 Microsoft >quickly came up with Shift-JIS, a smarter encoding. >This basically said "Look at the next byte. If below >127, it is ASCII; if between A and B, it is a >half-width >katakana; if between B and C, it is the first half of >a double-byte character and the next one is the second >half". Extended Unix Code (EUC) does similar tricks. >Both have the property that there are no control >characters, and ASCII is still ASCII. There are a few >other encodings too. > >Unfortunately for me and HugeCo, IBM had their own >standard before the Japanese government did, and it >differs; it is most commonly called DBCS (Double-Byte >Character Set). This involves shift-in and shift-out >sequences (0x16 and 0x17, cannot remember which way >round), so you can mix single and double bytes in a >field. And we used AS400s for our core processing. > >So, back to the problem. We had a FoxPro system using >ShiftJIS on the desks in Japan which we wanted to >replace in stages, and an AS400 database to replace it >with. The first stage was to hook them up so names >and addresses could be uploaded to the AS400, and data >files consisting of daily report input could be >downloaded to the PCs. The AS400 supposedly had a >library which did the conversions, but no one at IBM >knew how it worked. 
The people who did all the >evaluations had basically proved that 'Hello World' in >Japanese could be stored on an AS400, but never looked >at the conversion issues until mid-project. Not only >did we need a conversion filter, we had the problem >that the character sets were of different sizes. So >it was possible - indeed, likely - that some of our >ten thousand customers' names and addresses would >contain characters only on one system or the other, >and fail to >survive a round trip. (This is the absolute key issue >for me - will a given set of data survive a round trip >through various encoding conversions?) > >We figured out how to get the AS400 do to the >conversions during a file transfer in one direction, >and I wrote some Python scripts to make up files with >each official character in JIS on a line; these went >up with conversion, came back binary, and I was able >to build a mapping table and 'reverse engineer' the >IBM encoding. It was straightforward in theory, "fun" >in practice. I then wrote a python library which knew >about the AS400 and Shift-JIS encodings, and could >translate a string between them. It could also detect >corruption and warn us when it occurred. (This is >another key issue - you will often get badly encoded >data, half a kanji or a couple of random bytes, and >need to be clear on your strategy for handling it in >any library). It was slow, but it got us our gateway >in both directions, and it warned us of bad input. 360 >characters in the DBCS encoding actually appear twice, >so perfect round trips are impossible, but practically >you can survive with some validation of input at both >ends. The final story was that our names and >addresses were mostly safe, but a few obscure symbols >weren't. > >A big issue was that field lengths varied. An address >field 40 characters long on a PC might grow to 42 or >44 on an AS400 because of the shift characters, so the >software would truncate the address during import, and >cut a kanji in half. This resulted in a string that >was illegal DBCS, and errors in the database. To >guard against this, you need really picky input >validation. You not only ask 'is this string valid >Shift-JIS', you check it will fit on the other system >too. > >The next stage was to bring in our Sybase databases. >Sybase make a Unicode database, which works like the >usual one except that all your SQL code suddenly >becomes case sensitive - more (unrelated) fun when >you have 2000 tables. Internally it stores data in >UTF8, which is a 'rearrangement' of Unicode which is >much safer to store in conventional systems. >Basically, a UTF8 character is between one and three >bytes, there are no nulls or control characters, and >the ASCII characters are still the same ASCII >characters. UTF8<->Unicode involves some bit >twiddling but is one-to-one and entirely algorithmic. > >We had a product to 'mirror' data between AS400 and >Sybase, which promptly broke when we fed it Japanese. >The company bought a library called Unilib to do >conversions, and started rewriting the data mirror >software. This library (like many) uses Unicode as a >central point in all conversions, and offers most of >the world's encodings. We wanted to test it, and used >the Python routines to put together a regression >test. As expected, it was mostly right but had some >differences, which we were at least able to document. > >We also needed to rig up a daily feed from the legacy >FoxPro database into Sybase while it was being >replaced (about six months). 
We took the same >library, built a DLL wrapper around it, and I >interfaced to this with DynWin , so we were able to do >the low-level string conversion in compiled code and >the high-level >control in Python. A FoxPro batch job wrote out >delimited text in shift-JIS; Python read this in, ran >it through the DLL to convert it to UTF8, wrote that >out as UTF8 delimited files, ftp'ed them to an >in directory on the Unix box ready for daily import. >At this point we had a lot of fun with field widths - >Shift-JIS is much more compact than UTF8 when you have >a lot of kanji (e.g. address fields). > >Another issue was half-width katakana. These were the >earliest attempt to get some form of Japanese out of a >computer, and are single-byte characters above 128 in >Shift-JIS - but are not part of the JIS0208 standard. > >They look ugly and are discouraged; but when you ar >enterinh a long address in a field of a database, and >it won't quite fit, the temptation is to go from >two-bytes-per -character to one (just hit F7 in >windows) to save space. Unilib rejected these (as >would Java), but has optional modes to preserve them >or 'expand them out' to their full-width equivalents. > > >The final technical step was our reports package. >This is a 4GL using a really horrible 1980s Basic-like >language which reads in fixed-width data files and >writes out Postscript; you write programs saying 'go >to x,y' and 'print customer_name', and can build up >anything you want out of that. It's a monster to >develop in, but when done it really works - >million page jobs no problem. We had bought into this >on the promise that it supported Japanese; actually, I >think they had got the equivalent of 'Hello World' out >of it, since we had a lot of problems later. > >The first stage was that the AS400 would send down >fixed width data files in EBCDIC and DBCS. We ran >these through a C++ conversion utility, again using >Unilib. We had to filter out and warn about corrupt >fields, which the conversion utility would reject. >Surviving records then went into the reports program. > >It then turned out that the reports program only >supported some of the Japanese alphabets. >Specifically, it had a built in font switching system >whereby when it encountered ASCII text, it would flip >to the most recent single byte text, and when it found >a byte above 127, it would flip to a double byte font. > This is because many Chinese fonts do (or did) >not include English characters, or included really >ugly ones. This was wrong for Japanese, and made the >half-width katakana unprintable. I found out that I >could control fonts if I printed one character at a >time with a special escape sequence, so wrote my own >bit-scanning code (tough in a language without ord() >or bitwise operations) to examine a string, classify >every byte, and control the fonts the way I wanted. >So a special subroutine is used for every name or >address field. This is apparently not unusual in GUI >development (especially web browsers) - you rarely >find a complete Unicode font, so you have to switch >fonts on the fly as you print a string. > >After all of this, we had a working system and knew >quite a bit about encodings. Then the curve ball >arrived: User Defined Characters! > >It is not true to say that there are exactly 6879 >characters in Japanese, and more than counting the >number of languages on the Indian sub-continent or the >types of cheese in France. There are historical >variations and they evolve. 
Some people's names got >missed out, and others like to write a kanji in an >unusual way. Others arrived from China where they >have more complex variants of the same characters. >Despite the Japanese government's best attempts, these >people have dug their heels in and want to keep their >names the way they like them. My first reaction was >'Just Say No' - I basically said that it one of these >customers (14 out of a database of 8000) could show me >a tax form or phone bill with the correct UDC on it, >we would implement it but not otherwise (the usual >workaround is to spell their name phonetically in >katakana). But our marketing people put their foot >down. > >A key factor is that Microsoft has 'extended the >standard' a few times. First of all, Microsoft and >IBM include an extra 360 characters in their code page >which are not in the JIS0208 standard. This is well >understood and most encoding toolkits know what 'Code >Page 932' is Shift-JIS plus a few extra characters. >Secondly, Shift-JIS has a User-Defined region of a >couple of thousand characters. They have lately been >taking Chinese variants of Japanese characters (which >are readable but a bit old-fashioned - I can imagine >pipe-smoking professors using these forms as an >affectation) and adding them into their standard >Windows fonts; so users are getting used to these >being available. These are not in a standard. >Thirdly, they include something called the 'Gaiji >Editor' in Japanese Win95, which lets you add new >characters to the fonts on your PC within the >user-defined region. The first step was to review all >the PCs in the Tokyo office, and get one centralized >extension font file on a server. This was also fun as >people had assigned different code points to >characters on differene machines, so what looked >correct on your word processor was a black square on >mine. Effectively, each company has its own custom >encoding a bit bigger than the standard. > >Clearly, none of these extensions would convert >automatically to the other platforms. > >Once we actually had an agreed list of code points, we >scanned the database by eye and made sure that the >relevant people were using them. We decided that >space for 128 User-Defined Characters would be >allowed. We thought we would need a wrapper around >Unilib to intercept these values and do a special >conversion; but to our amazement it worked! Somebody >had already figured out a mapping for at least 1000 >characters for all the Japanes encodings, and they did >the round trips from Shift-JIS to Unicode to DBCS and >back. So the conversion problem needed less code than >we thought. This mapping is not defined in a standard >AFAIK (certainly not for DBCS anyway). > >We did, however, need some really impressive >validation. When you input a name or address on any >of the platforms, the system should say >(a) is it valid for my encoding? >(b) will it fit in the available field space in the >other platforms? >(c) if it contains user-defined characters, are they >the ones we know about, or is this a new guy who will >require updates to our fonts etc.? > >Finally, we got back to the display problems. Our >chosen range had a particular first byte. We built a >miniature font with the characters we needed starting >in the lower half of the code page. I then >generalized by name-printing routine to say 'if the >first character is XX, throw it away, and print the >subsequent character in our custom font'. 
This worked >beautifully - not only could we print everything, we >were using type 1 embedded fonts for the user defined >characters, so we could distill it and also capture it >for our internal document imaging systems. > >So, that is roughly what is involved in building a >Japanese client reporting system that spans several >platforms. > >I then moved over to the web team to work on our >online trading system for Japan, where I am now - >people will be able to open accounts and invest on the >web. The first stage was to prove it all worked. >With HTML, Java and the Web, I had high hopes, which >have mostly been fulfilled - we set an option in the >database connection to say 'this is a UTF8 database', >and Java converts it to Unicode when reading the >results, and we set another option saying 'the output >stream should be Shift-JIS' when we spew out the HTML. > There is one limitations: Java sticks to the JIS0208 >standard, so the 360 extra IBM/Microsoft Kanji and our >user defined characters won't work on the web. You >cannot control the fonts on someone else's web >browser; management accepted this because we gave them >no alternative. Certain customers will need to be >warned, or asked to suggest a standard version of a >charactere if they want to see their name on the web. >I really hope the web actually brings character usage >in line with the standard in due course, as it will >save a fortune. > >Our system is multi-language - when a customer logs >in, we want to say 'You are a Japanese customer of our >Tokyo Operation, so you see page X in language Y'. >The language strings all all kept in UTF8 in XML >files, so the same file can hold many languages. This >and the database are the real-world reasons why you >want to store stuff in UTF8. There are very few tools >to let you view UTF8, but luckily there is a free Word >Processor that lets you type Japanese and save it in >any encoding; so we can cut and paste between >Shift-JIS and UTF8 as needed. > >And that's it. No climactic endings and a lot of real >world mess, just like life in IT. But hopefully this >gives you a feel for some of the practical stuff >internationalisation projects have to deal with. See >my other mail for actual suggestions > >- Andy Robinson > >===== >Andy Robinson >Robinson Analytics Ltd. >------------------ >My opinions are the official policy of Robinson Analytics Ltd. >They just vary from day to day. From just@letterror.com Fri Apr 28 18:38:14 2000 From: just@letterror.com (Just van Rossum) Date: Fri, 28 Apr 2000 18:38:14 +0100 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: <39097F80.6A0E9FBD@lemburg.com> References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: At 2:09 PM +0200 28-04-2000, M.-A. Lemburg wrote: >> 1, because 2 can lead to surprises when two strings containing binary goop >> are added and only one was a literal in a source file with an explicit >> encoding. > [...] >I should have been more precise: > >2. provided both strings have encodings which can be converted > to Unicode, coerce them to Unicode and then apply the action; > otherwise proceed as in 1., i.e. the result has an undefined > encoding. > >If 2. does try to convert to Unicode, conversion errors should >be raised (just like they are now for Unicode coercion errors). But that doesn't solve the binary goop problem: two binary gooplets may have different "encodings", which happen to be valid (ie. not raise an exception). Conversion to unicode is no way what you want. 
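A small sketch of the kind of corruption meant here (the byte values are arbitrary, chosen only to show the effect):

    # Arbitrary binary data that merely *looks* like valid Latin-1 text.
    goop = '\x00\xb5\xff'
    # Coercing it to Unicode and writing it back out in another
    # encoding silently changes the bytes:
    print repr(unicode(goop, 'latin-1').encode('utf-8'))
    # -> '\x00\xc2\xb5\xc3\xbf'  (no longer the original data)

Any coercion rule therefore has to leave strings that carry binary data strictly alone.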
>Some more tricky business: > >How should str('bla', 'enc1') and str('bla', 'enc2') compare ? >What about the hash values of the two ? I proposed to *only* use the encoding attr when dealing with 8-bit string/unicode string combo's. Just ignore it completely when there's no unicode string in sight. Just From just@letterror.com Fri Apr 28 18:51:03 2000 From: just@letterror.com (Just van Rossum) Date: Fri, 28 Apr 2000 18:51:03 +0100 Subject: [I18n-sig] Re: [Python-Dev] Re: Unicode debate In-Reply-To: <200004281410.KAA16104@eric.cnri.reston.va.us> References: Your message of "Fri, 28 Apr 2000 09:33:16 BST." Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: [GvR, on string.encoding ] >Marc-Andre took this idea a bit further, but I think it's not >practical given the current implementation: there are too many places >where the C code would have to be changed in order to propagate the >string encoding information, I may miss something, but the encoding attr just travels with the string object, no? Like I said in my reply to MAL, I think it's undesirable to do *anything* with the encoding attr if not in combination with a unicode string. >and there are too many sources of strings >with unknown encodings to make it very useful. That's why the default encoding must be settable as well, as Fredrik suggested. >Plus, it would slow down 8-bit string ops. Not if you ignore it most of the time, and just pass it along when concatenating. >I have a better idea: rather than carrying around 8-bit strings with >an encoding, use Unicode literals in your source code. Explain that to newbies... I guess is that they will want simple 8 bit strings in their native encoding. Dunno. >If the source >encoding is known, these will be converted using the appropriate >codec. > >If you object to having to write u"..." all the time, we could say >that "..." is a Unicode literal if it contains any characters with the >top bit on (of course the source file encoding would be used just like >for u"..."). Only if "\377" would still yield an 8-bit string, for binary goop... Just From guido@python.org Fri Apr 28 19:31:19 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 28 Apr 2000 14:31:19 -0400 Subject: [I18n-sig] Re: [Python-Dev] Re: Unicode debate In-Reply-To: Your message of "Fri, 28 Apr 2000 18:51:03 BST." References: Your message of "Fri, 28 Apr 2000 09:33:16 BST." Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: <200004281831.OAA17406@eric.cnri.reston.va.us> > [GvR, on string.encoding ] > >Marc-Andre took this idea a bit further, but I think it's not > >practical given the current implementation: there are too many places > >where the C code would have to be changed in order to propagate the > >string encoding information, [JvR] > I may miss something, but the encoding attr just travels with the string > object, no? Like I said in my reply to MAL, I think it's undesirable to do > *anything* with the encoding attr if not in combination with a unicode > string. But just propagating affects every string op -- s+s, s*n, s[i], s[:], s.strip(), s.split(), s.lower(), ... > >and there are too many sources of strings > >with unknown encodings to make it very useful. > > That's why the default encoding must be settable as well, as Fredrik > suggested. I'm open for debate about this. There's just something about a changeable global default encoding that worries me -- like any global property, it requires conventions and defensive programming to make things work in larger programs. 
For example, a module that deals with Latin-1 strings can't just set the default encoding to Latin-1: it might be imported by a program that needs it to be UTF-8. This model is currently used by the locale in C, where all locale properties are global, and it doesn't work well. For example, Python needs to go through a lot of hoops so that Python numeric literals use "." for the decimal indicator even if the user's locale specifies "," -- we can't change Python to swap the meaning of "." and "," in all contexts. So I think that a changeable default encoding is of limited value. That's different from being able to set the *source file* encoding -- this only affects Unicode string literals. > >Plus, it would slow down 8-bit string ops. > > Not if you ignore it most of the time, and just pass it along when > concatenating. And slicing, and indexing, and... > >I have a better idea: rather than carrying around 8-bit strings with > >an encoding, use Unicode literals in your source code. > > Explain that to newbies... I guess is that they will want simple 8 bit > strings in their native encoding. Dunno. If they are hap-py with their native 8-bit encoding, there's no need for them to ever use Unicode objects in their program, so they should be fine. 8-bit strings aren't ever interpreted or encoded except when mixed with Unicode objects. > >If the source > >encoding is known, these will be converted using the appropriate > >codec. > > > >If you object to having to write u"..." all the time, we could say > >that "..." is a Unicode literal if it contains any characters with the > >top bit on (of course the source file encoding would be used just like > >for u"..."). > > Only if "\377" would still yield an 8-bit string, for binary goop... Correct. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Fri Apr 28 19:52:04 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 28 Apr 2000 20:52:04 +0200 Subject: [I18n-sig] Re: Unicode debate References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: <3909DDD4.D32296CE@lemburg.com> Just van Rossum wrote: > > At 2:09 PM +0200 28-04-2000, M.-A. Lemburg wrote: > >> 1, because 2 can lead to surprises when two strings containing binary goop > >> are added and only one was a literal in a source file with an explicit > >> encoding. > > > [...] > >I should have been more precise: > > > >2. provided both strings have encodings which can be converted > > to Unicode, coerce them to Unicode and then apply the action; > > otherwise proceed as in 1., i.e. the result has an undefined > > encoding. > > > >If 2. does try to convert to Unicode, conversion errors should > >be raised (just like they are now for Unicode coercion errors). > > But that doesn't solve the binary goop problem: two binary gooplets may > have different "encodings", which happen to be valid (ie. not raise an > exception). Conversion to unicode is no way what you want. See the first line ;-) ... "provided both strings have encodings which can be converted to Unicode" ... binary encodings would not fall under these. str('...data1...','binary') + str('...data2...','UTF-8') would yield str('...data1......data2...','undefined') Plus, we'd need to add a third case: 3. Of course, actions on strings of the same encoding should result in strings of the same encodings, e.g. 
str('...data1...','enc1') + str('...data2...','enc1') should yield str('...data1......data2...','enc1') > >Some more tricky business: > > > >How should str('bla', 'enc1') and str('bla', 'enc2') compare ? > >What about the hash values of the two ? > > I proposed to *only* use the encoding attr when dealing with 8-bit > string/unicode string combo's. Just ignore it completely when there's no > unicode string in sight. You can't ignore it completely because that would quickly render it useless: point 3. is very important to assure that strings with known encoding propogate their encoding as they get processed. Otherwise you'd soon only deal with undefined encoding strings and the whole strategy would be pointless. Hmm, I think this road doesn't lead anywhere (but it was fun anyway ;). As I've written a few times before: if you intend to go Unicode, make all your strings Unicode. Perhaps there should be an experimental command line flag which turns "..." in source code into u"..." to be able to test this setup ?! If someone is interested, I have a patch which adds a -U flag. The Python compiler will then interpret all '...' strings as u'...' strings. Hmm, that switch should probably be called something like -Py3k ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From just@letterror.com Fri Apr 28 21:04:46 2000 From: just@letterror.com (Just van Rossum) Date: Fri, 28 Apr 2000 21:04:46 +0100 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: <3909DDD4.D32296CE@lemburg.com> References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: At 8:52 PM +0200 28-04-2000, M.-A. Lemburg wrote: >See the first line ;-) ... "provided both strings have encodings >which can be converted to Unicode" ... binary encodings would >not fall under these. Won't a string literal in a source file with an explicit encoding get *that* encoding, whether the string contains binary goop or not?! Just From mal@lemburg.com Fri Apr 28 20:51:26 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 28 Apr 2000 21:51:26 +0200 Subject: [I18n-sig] Re: Unicode debate References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: <3909EBBE.FB64589D@lemburg.com> Just van Rossum wrote: > > At 8:52 PM +0200 28-04-2000, M.-A. Lemburg wrote: > >See the first line ;-) ... "provided both strings have encodings > >which can be converted to Unicode" ... binary encodings would > >not fall under these. > > Won't a string literal in a source file with an explicit encoding get > *that* encoding, whether the string contains binary goop or not?! Right. Binary data in such a string literal would have to use str('...data...','binary') to get the correct encoding attached to it. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From Moshe Zadka Sat Apr 29 03:08:48 2000 From: Moshe Zadka (Moshe Zadka) Date: Sat, 29 Apr 2000 05:08:48 +0300 (IDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200004281450.KAA16493@eric.cnri.reston.va.us> Message-ID: I agree with most of what you say, but... On Fri, 28 Apr 2000, Guido van Rossum wrote: > As I stated in another message, in Python 3000 we'll have > to consider a more Java-esque solution: *character* strings are > Unicode, and for bytes we have (mutable!) byte arras. 
I would prefer a different distinction: mutable immutable chars string string_buffer bytes bytes bytes_buffer Why not allow me the freedom to index a dictionary with goop? (Here's a sample application: UNIX "file" command) -- Moshe Zadka . http://www.oreilly.com/news/prescod_0300.html http://www.linux.org.il -- we put the penguin in .com From kentsin@poboxes.com Sat Apr 29 04:07:12 2000 From: kentsin@poboxes.com (Sin Hang Kin) Date: Sat, 29 Apr 2000 11:07:12 +0800 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> Message-ID: <003f01bfb188$0ee7bcc0$770da8c0@bbs> I am not quite follow on the discussion. But I am interested in Unicode-ify python: Python should be able to be an native language of any language. For given all nations a fair ground for computer programming. The recently english-oriented python syntax should be easily ported to other languages and python programs written in all languages can be converted to another one automatically. i.e., a french speaking children can use french command words to write python code, and this python code can convert to Englihs, Chinese, ... Backward compatibility is a must. The current implementation of unicode string might break some code. The ability to convert from/to unicode is not enough. For example, it might for a search engine to collect many text from different encoding, and I have seen that mixed encoding in a single text. I did it once with in a Chinese application, I received a collective text file which someone who collect them from mainland China with GB encoding and locally with Big-5 encoding. The one who collect them do not read them carefully, and he got a mighty environment (richwin) which automatically recognize the encoding and adapt to it. So he just paste all these text together. With such an mixed text, no conversion to/from unicode handling is able to handle. Think if you run a mailing list, one like this, with people quoting each other's message and write in their native encoding, you will get a funny text collection with different encoding. This also can happen to the digest of such an mailing list: you may try now writing in all encoding :) So, I perfer to have people choosing their encoding. Setting a flag inside a program will switch the internal handling of utf-8, 8-bit code. With time pass, we may drop that, but now, we can not abandom the 8-bit code. Rgs, Kent Sin From kentsin@poboxes.com Sat Apr 29 04:07:06 2000 From: kentsin@poboxes.com (Sin Hang Kin) Date: Sat, 29 Apr 2000 11:07:06 +0800 Subject: [I18n-sig] Re: Unicode debate References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: <003c01bfb188$0ba08100$770da8c0@bbs> For python source, we would enforce all to write in utf-8. Provided that they would freely choose their own natively encoding if they wish, but to convert them to unicode if they publish them. Rgs, Kent Sin ----- Original Message ----- From: "Just van Rossum" To: "Guido van Rossum" ; ; Sent: Friday, April 28, 2000 4:33 PM Subject: [I18n-sig] Re: Unicode debate > At 11:01 AM -0400 27-04-2000, Guido van Rossum wrote: > >Where does the current approach require work? > > > >- We need a way to indicate the encoding of Python source code. > >(Probably a "magic comment".) > > How will other parts of a program know which encoding was used for > non-unicode string literals? > > It seems to me that an encoding attribute for 8-bit strings solves this > nicely. 
The attribute should only be set automatically if the encoding of > the source file was specified or when the string has been encoded from a > unicode string. The attribute should *only* be used when converting to > unicode. (Hm, it could even be used when calling unicode() without the > encoding argument.) It should *not* be used when comparing (or adding, > etc.) 8-bit strings to each other, since they still may contain binary > goop, even in a source file with a specified encoding! > > >- We need a way to indicate the encoding of input and output data > >files, and we need shortcuts to set the encoding of stdin, stdout and > >stderr (and maybe all files opened without an explicit encoding). > > Can you open a file *with* an explicit encoding? > > Just > > > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://www.python.org/mailman/listinfo/i18n-sig > From just@letterror.com Sat Apr 29 08:03:14 2000 From: just@letterror.com (Just van Rossum) Date: Sat, 29 Apr 2000 08:03:14 +0100 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: <3909EBBE.FB64589D@lemburg.com> References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: At 9:51 PM +0200 28-04-2000, M.-A. Lemburg wrote: >Right. Binary data in such a string literal would have to >use str('...data...','binary') to get the correct encoding >attached to it. And that sucks. I stick to my point that the encoding attr should *not* be used when dealing strictly with bit strings. Ever. At all. Its' *only* purpose is to aid "upcasting" to unicode. (But maybe that purpose is too weak to warrant an entirely new attribute...) Just From mal@lemburg.com Sat Apr 29 14:25:47 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 29 Apr 2000 15:25:47 +0200 Subject: [I18n-sig] Re: Unicode debate References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." Message-ID: <390AE2DB.1EB8692A@lemburg.com> Just van Rossum wrote: > > At 9:51 PM +0200 28-04-2000, M.-A. Lemburg wrote: > >Right. Binary data in such a string literal would have to > >use str('...data...','binary') to get the correct encoding > >attached to it. > > And that sucks. Not sure why... after all the point of adding encoding information to strings was to add missing information: the current usage as binary data container would then be justified provided the strings are marked as containing binary data. > I stick to my point that the encoding attr should *not* be > used when dealing strictly with bit strings. Ever. At all. Its' *only* > purpose is to aid "upcasting" to unicode. (But maybe that purpose is too > weak to warrant an entirely new attribute...) I think the little experiment with adding an encoding attribute to strings is not going to be the right solution. People will get all confused, the implementation won't be able make much use of it without proper forarding of the information and that forwarding costs performance even for those programs which do not need this at all. Guido's suggestion is more practical: either go all the way (meaning to write all *text* as Unicode objects) or don't use Unicode at all. Note that the patch I sent to the patches list enables you to test the "go all the way" strategy in an even more radical way: it converts all "..." strings to u"..." when the -U command line option is given. I think we should use the experience gained with that patch to make the standard Python library (and the interpreter) Unicode capable. 
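To make the effect of that switch concrete, a hypothetical session might look like this (the behaviour shown is what the patch is intended to do, not that of any released interpreter):

    $ python -U
    >>> type("abc")        # plain literals are now compiled as Unicode
    <type 'unicode'>
    >>> "abc" + u"def"     # no more mixed-type coercions
    u'abcdef'

The regression failures listed next are then mostly places where C code or library modules still insist on classic 8-bit string objects.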
From mal@lemburg.com  Sat Apr 29 14:25:47 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 29 Apr 2000 15:25:47 +0200
Subject: [I18n-sig] Re: Unicode debate
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
Message-ID: <390AE2DB.1EB8692A@lemburg.com>

Just van Rossum wrote:
>
> At 9:51 PM +0200 28-04-2000, M.-A. Lemburg wrote:
> >Right. Binary data in such a string literal would have to
> >use str('...data...','binary') to get the correct encoding
> >attached to it.
>
> And that sucks.

Not sure why... after all, the point of adding encoding information to
strings was to add missing information: the current usage as binary data
container would then be justified provided the strings are marked as
containing binary data.

> I stick to my point that the encoding attr should *not* be
> used when dealing strictly with bit strings. Ever. At all. Its *only*
> purpose is to aid "upcasting" to unicode. (But maybe that purpose is too
> weak to warrant an entirely new attribute...)

I think the little experiment with adding an encoding attribute to strings
is not going to be the right solution. People will get all confused, and
the implementation won't be able to make much use of it without proper
forwarding of the information -- and that forwarding costs performance,
even for programs which do not need it at all.

Guido's suggestion is more practical: either go all the way (meaning write
all *text* as Unicode objects) or don't use Unicode at all. Note that the
patch I sent to the patches list enables you to test the "go all the way"
strategy in an even more radical way: it converts all "..." strings to
u"..." when the -U command line option is given. I think we should use the
experience gained with that patch to make the standard Python library (and
the interpreter) Unicode capable.

Here's a list of what I've found by running some of the regression tests:

* import string fails due to the way _idtable is constructed
* getattr() doesn't like Unicode as second argument, same for delattr()
  and hasattr()
* eval() expects a string object
* there are still some string exceptions around in the regression tests
  which cause a failure (Unicode exceptions don't work)
* struct.pack('s') doesn't like Unicode as argument
* re doesn't work: pcre_expand() needs a string object
* regex doesn't work either because string objects are hard-coded
* mmap doesn't like Unicode: "mmap assignment must be single-character
  string"
* cPickle.loads() doesn't like Unicode as data storage
* keywords must be strings (f(1, 2, 3, **{'a':4, 'b':5}) doesn't work)
* rotor doesn't work

Some of these could be fixed by putting a str() call around the '...'
constants. Others need fixes in C code. Yet others would be better off if
they used the buffer interfaces (basically all APIs which work on raw
data, like cPickle or rotor).

--
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
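As a concrete illustration of the kind of failure listed above, a short
sketch run with the -U switch from the patch (the file name is made up,
and the exact exceptions may differ):

    # Run as:  python -U demo.py
    # Under -U every "..." literal in the source is compiled as a u"..."
    # literal.

    def f(**kw):
        return kw

    print type('abc')               # the unicode type under -U

    try:
        print f(**{'a': 4, 'b': 5}) # keyword names become unicode under -U...
    except StandardError, why:
        print 'keyword call failed:', why   # ...but keywords must be strings

    try:
        print eval('1 + 1')         # eval() also expects a plain string object
    except StandardError, why:
        print 'eval failed:', why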
From paul@prescod.net  Sat Apr 29 15:18:05 2000
From: paul@prescod.net (Paul Prescod)
Date: Sat, 29 Apr 2000 09:18:05 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us>
Message-ID: <390AEF1D.253B93EF@prescod.net>

Guido van Rossum wrote:
>
> [Paul Prescod]
> > I think that maybe an important point is getting lost here. I could be
> > wrong, but it seems that all of this emphasis on encodings is misplaced.
>
> In practical applications that manipulate text, encodings creep up all
> the time.

I'm not saying that encodings are unimportant. I'm saying that they are
*different* from what Fredrik was talking about. He was talking about a
coherent logical model for characters and character strings based on the
conventions of more modern languages and systems than C and Python.

> > How can we
> > make the transition to a "binary goops are not strings" world easiest?
>
> I'm afraid that's a bigger issue than we can solve for Python 1.6.

I understand that we can't fix the problem now. I just think that we
shouldn't go out of our way to make it worse. If we make byte-array
strings "magically" cast themselves into character strings, people will
expect that behavior forever.

> > It doesn't meet the definition of string used in the Unicode spec., nor
> > in XML, nor in Java, nor at the W3C nor in most other up and coming
> > specifications.
>
> OK, so that's a good indication of where you're coming from. Maybe
> you should spend a little more time in the trenches and a little less
> in standards bodies. Standards are good, but sometimes disconnected
> from reality (remember ISO networking? :-).

As far as I know, XML and Java are used a fair bit in the real world...
even somewhat in Asia. In fact, there is a book titled "XML and Java"
written by three Japanese authors.

> And this is exactly why encodings will remain important: entities
> encoded in ISO-2022-JP have no compelling reason to be recoded
> permanently into ISO10646, and there are lots of forces that make it
> convenient to keep it encoded in ISO-2022-JP (like existing tools).

You cannot recode an ISO-2022-JP document into ISO10646, because 10646 is
a character *set* and not an encoding. ISO-2022-JP says how you should
represent characters in terms of bits and bytes; ISO10646 defines a
mapping from integers to characters. They are both important, but
separate. I think that this automagical re-encoding conflates them.

--
 Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
  - http://www.cs.yale.edu/~perlis-alan/quotes.html

From Fredrik Lundh
References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net>
Message-ID: <006e01bfb1ea$900d0820$34aab5d4@hagrid>

Paul Prescod wrote:
> > > I think that maybe an important point is getting lost here. I could be
> > > wrong, but it seems that all of this emphasis on encodings is misplaced.
> >
> > In practical applications that manipulate text, encodings creep up all
> > the time.
>
> I'm not saying that encodings are unimportant. I'm saying that they
> are *different* from what Fredrik was talking about. He was talking
> about a coherent logical model for characters and character strings
> based on the conventions of more modern languages and systems than
> C and Python.

note that the existing Python language reference describes this model
very clearly:

    [Sequences] represent finite ordered sets indexed by natural
    numbers. The built-in function len() returns the number of items
    of a sequence. When the length of a sequence is n, the index set
    contains the numbers 0, 1, ..., n-1. Item i of sequence a is
    selected by a[i].

    An object of an immutable sequence type cannot change once it is
    created.

    The items of a string are characters. There is no separate
    character type; a character is represented by a string of one
    item. Characters represent (at least) 8-bit bytes. The built-in
    functions chr() and ord() convert between characters and
    nonnegative integers representing the byte values. Bytes with the
    values 0-127 usually represent the corresponding ASCII values, but
    the interpretation of values is up to the program. The string data
    type is also used to represent arrays of bytes, e.g., to hold data
    read from a file.

as I've pointed out before, I want this to apply to all kinds of strings
in 1.6. imo, the cleanest way to do this is to change the last three
sentences to:

    The built-in functions chr() and ord() convert between characters
    and nonnegative integers representing the character codes.
    Character codes usually represent the corresponding unicode
    characters. The 8-bit string data type is also used to represent
    arrays of bytes, e.g., to hold data read from a file.

the encodings debate has nothing to do with this model.

... more later. gotta run.
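A short sketch of the distinction being drawn in the last two messages --
one character, one character code, several possible byte encodings (this
assumes the utf-8 and utf-16-be codecs shipped with the interpreter):

    # One character, one character code, several byte encodings.
    c = u'\u4e2d'                       # a single CJK character
    print len(c), ord(c)                # 1 20013: one item, one character code

    # Encodings are different ways of writing that code as bytes:
    print repr(c.encode('utf-8'))       # '\xe4\xb8\xad'
    print repr(c.encode('utf-16-be'))   # 'N-' (the bytes 0x4e 0x2d)

    # An 8-bit string, by contrast, is just an array of bytes:
    b = c.encode('utf-8')
    print len(b)                        # 3: three bytes, not one character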
From just@letterror.com  Sat Apr 29 18:40:22 2000
From: just@letterror.com (Just van Rossum)
Date: Sat, 29 Apr 2000 18:40:22 +0100
Subject: [I18n-sig] Re: Unicode debate
In-Reply-To: <390AE2DB.1EB8692A@lemburg.com>
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
Message-ID: 

At 3:25 PM +0200 29-04-2000, M.-A. Lemburg wrote:
>Just van Rossum wrote:
>>
>> At 9:51 PM +0200 28-04-2000, M.-A. Lemburg wrote:
>> >Right. Binary data in such a string literal would have to
>> >use str('...data...','binary') to get the correct encoding
>> >attached to it.
>>
>> And that sucks.
>
>Not sure why... after all, the point of adding encoding information
>to strings was to add missing information: the current usage
>as binary data container would then be justified provided the
>strings are marked as containing binary data.

For one, it's just too much hassle to write str('...data...','binary')...
All my proposal was, was a very lightweight way to ensure correct
translation to unicode when needed.

What you seem to suggest is that the encoding attribute could be used to
make 8-bit strings almost as powerful as unicode strings, by converting to
unicode whenever there's an action that involves two 8-bit strings with
different encodings. While I'm sure that would have its uses, I think it's
too ambitious, and it seems to get too much in the way of 8-bit strings
doubling as byte arrays.

As I've admitted before, what I had in mind for the encoding attribute is
probably too weak a use to warrant the effort, and there are indeed too
many things that can still go wrong. So for now I'll let it go... (But it
was fun indeed ;-)

(Oh, and I still stand by my and Fredrik's point that utf-8 is a poor
default choice when coercing 8-bit strings to unicode, for the sole reason
that a utf-8 string is a byte array, and not a character string.)

Just

From tree@basistech.com  Sun Apr 30 06:29:08 2000
From: tree@basistech.com (Tom Emerson)
Date: Sun, 30 Apr 2000 01:29:08 -0400 (EDT)
Subject: [I18n-sig] codec questions
Message-ID: <14603.50340.167067.470930@cymru.basistech.com>

I'm using 1.6a2 and the following doesn't run. I must be doing something
brain-dead here (I'm jet lagged right now):

--
import codecs

foo = codecs.open('Sc-orig.utf', 'rb', 'utf-8')

line = foo.readline()
while (line != ""):
    print line
    line = foo.readline()
foo.close()
--

When I attempt to run that, in the directory containing 'Sc-orig.utf',
I get:

(0) tree% python process.py
Traceback (most recent call last):
  File "process.py", line 5, in ?
    line = foo.readline()
  File "/opt/tree/lib/python1.6/codecs.py", line 318, in readline
    return self.reader.readline(size)
NameError: self

Any ideas? I'm trying to grok the architecture so I can add transcoding
support for TIS-620 (Thai, an 8-bit encoding which should work fine with
the mapping codecs), GB2312 (multibyte, simplified Chinese), and Big-5
(multibyte, traditional Chinese). But I can't even get the simplest code
to work, so I need someone to hit me with a stick.

Also, are transcoding tables loaded as needed? Or all at once? What are
the plans for managing transcoding tables?

Thanks.

-tree

--
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"

From tree@basistech.com  Sun Apr 30 21:20:18 2000
From: tree@basistech.com (Tom Emerson)
Date: Sun, 30 Apr 2000 16:20:18 -0400 (EDT)
Subject: [I18n-sig] codec questions
In-Reply-To: <14603.50340.167067.470930@cymru.basistech.com>
References: <14603.50340.167067.470930@cymru.basistech.com>
Message-ID: <14604.38274.378633.509796@cymru.basistech.com>

Tom Emerson writes:
> Any ideas? I'm trying to grok the architecture so I can add
> transcoding support for TIS-620 (Thai, an 8-bit encoding which should
[snip]

TIS-620 is mostly the same as CP874, so for my purposes this is done.
Never mind. 8-)  I looked at the encodings directory after I sent my mail.

Of course I still cannot get codecs.open() to work, so it is a small
victory.

-tree

--
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"
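A minimal sketch of reading a TIS-620 file through the existing cp874
charmap codec, assuming StreamReader.read() is not affected by the
readline() problem shown in the earlier message (the file name is just a
placeholder):

    import codecs

    # TIS-620 Thai text read via the cp874 charmap codec from the encodings
    # package, which is close enough to TIS-620 for this purpose.
    f = codecs.open('thai-sample.txt', 'rb', 'cp874')
    text = f.read()                 # one unicode object
    f.close()

    for line in text.split(u'\n'):
        print repr(line)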