From mhammond@skippinet.com.au  Mon Nov  1 01:51:56 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Mon, 1 Nov 1999 12:51:56 +1100
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
Message-ID: <002301bf240b$ae61fa00$0501a8c0@bobcat>

I have for some time been wondering about the usefulness of this mailing list.  It seems to have produced staggeringly few results since inception.  This is not a criticism of any individual, but of the process.  It is proof in my mind of how effective the benevolent dictator model is, and how ineffective a language run by committee would be.

This "committee" never seems to be capable of reaching a consensus on anything.  A number of issues don't seem to provoke any responses.  As a result, many things seem to die a slow and lingering death.  Often there is lots of interesting discussion, but still precious few results.

In the pre-python-dev days, the process seemed easier - we mailed Guido directly, and he either stated "yea" or "nay" - maybe we didn't get the response we hoped for, but at least we got a response.  Now, we have the result that even if Guido does enter into a thread, the noise seems to drown out any hope of getting anything done.  Guido seems to be faced with the dilemma of asserting his dictatorship in the face of many dissenting opinions from many people he respects, or putting it in the too-hard basket.  I fear the latter is the easiest option.

At the end of this mail I list some of the major threads over the last few months, and I can't see a single thread that has resulted in a CVS checkin, and only one that has resulted in agreement.  This, to my mind at least, is proof that things are really not working.

I long for the "good old days" - take the replacement of "ni" with built-in functionality, for example.  I posit that if this had been discussed on python-dev, it would have caused a huge flood of mail, and nothing remotely resembling a consensus.  Instead, Guido simply wrote an essay and implemented some code that he personally liked.  No debate, no discussion.  Still an excellent result.  Maybe not a perfect result, but a result nonetheless.

However, Guido's time is becoming increasingly limited.  So should we consider moving to a "benevolent lieutenant" model, in conjunction with re-ramping up the SIGs?  This would provide two ways to get things done:

* A new SIG.  Take relative imports, for example.  If we really do need a change in this fairly fundamental area, a SIG would be justified ("import-sig").  The responsibility of the SIG is to form a consensus (and code that reflects it), and report back to Guido (and the main newsgroup) with the result of this.  It worked well for RE, and allowed those of us not particularly interested to keep out of the debate.  If the SIG cannot form a consensus, then tough - it dies - and should not be mourned.  Presumably Guido would keep a watchful eye over the SIG, providing direction where necessary, but in general stay out of the day-to-day traffic.  New SIGs seem to have stopped since this list's creation, and it seems that issues that should be discussed in new SIGs are now discussed here.

* Guido could delegate some of his authority to a single individual responsible for a certain limited area - a benevolent lieutenant.  We might have a different lieutenant responsible for each area, and they could only exercise their authority over small, trivial changes.
E.g., the "getopt helper" thread - if a lieutenant was given authority for the "standard library", they could simply make a yea or nay decision, and present it to Guido.  Presumably Guido trusts the person he delegated to enough that the majority of the lieutenant's recommendations would be accepted.

Presumably there would be a small number of lieutenants, and they would then become the new "python-dev" - say up to 5 people.  This list would then discuss high-level strategies, and its members would seek direction from each other when things get murky.  This select group of people may not (indeed, probably would not) include me, but I would have no problem with that - I would prefer to see results achieved than have my own ego stroked by being included in a select, but ineffective, group.

In parting, I repeat this is not a direct criticism, simply an observation of the last few months.  I am on this list, so I am definitely as guilty as anyone else - which is "not at all" - i.e., no one is guilty; I simply see it as endemic to a committee with people of diverse backgrounds, skills and opinions.

Any thoughts?

Long live the dictator! :-)

Mark.

Recent threads, and my take on the results:

* getopt helper?
  Too much noise regarding semantic changes.

* Alternative Approach to Relative Imports
* Relative package imports
* Path hacking
* Towards a Python based import scheme
  Too much noise - no one could really agree on the semantics.  Implementation thrown in the ring, and promptly forgotten.

* Corporate installations
  Very young, but no result at all.

* Embedding Python when using different calling conventions
  Quite young, but no result as yet, and I have no reason to believe there will be.

* Catching "return" and "return expr" at compile time
  Seemed to be blessed - yay!  Don't believe I have seen a check-in yet.

* More Python command-line features
  Seemed general agreement, but nothing happened?

* Tackling circular dependencies in 2.0?
  Lots of noise, but no results other than "GC may be there in 2.0".

* Buffer interface in abstract.c
  Determined it could break - no solution proposed.  Lots of noise regarding whether it is a good idea at all!

* mmapfile module
  No result.

* Quick-and-dirty weak references
  No result.

* Portable "spawn" module for core?
  No result.

* Fake threads
  Seemed to spawn stackless Python, but in the face of Guido being "at best, lukewarm" about this issue, I would again have to conclude "no result".  An authoritative "no" in this area may have saved lots of effort and heartache.

* add Expat to 1.6
  No result.

* I'd like list.pop to accept an optional second argument giving a default value
  No result.

* etc
  No result.

From jack@oratrix.nl  Mon Nov  1 09:56:48 1999
From: jack@oratrix.nl (Jack Jansen)
Date: Mon, 01 Nov 1999 10:56:48 +0100
Subject: [Python-Dev] Embedding Python when using different calling conventions.
In-Reply-To: Message by "M.-A. Lemburg" , Sat, 30 Oct 1999 10:46:30 +0200 , <381AB066.B54A47E0@lemburg.com>
Message-ID: <19991101095648.DC2E535BB1E@snelboot.oratrix.nl>

> OTOH, we could take chance to reorganize these macros from bottom
> up: when I started coding extensions I found them not very useful
> mostly because I didn't have control over them meaning "export
> this symbol" or "import the symbol". Especially the DL_IMPORT
> macro is strange because it seems to handle both import *and*
> export depending on whether Python is compiled or not.

This would be very nice.  The DL_IMPORT/DL_EXPORT stuff is really weird unless you're working with it all the time.
We were trying to build a plugin DLL for PythonWin, and first you spend hours finding out that you have to set DL_IMPORT (and how to set it), and then you spend another few hours before you realize that you can't simply copy the DL_IMPORT and DL_EXPORT from, say, timemodule.c, because timemodule.c is going to be in the Python core (and hence can use DL_IMPORT for its init() routine declaration) while your module is going to be a plugin, so it can't.

I would opt for a scheme where the define shows where the symbol is expected to live (DL_CORE and DL_THISMODULE would be needed at least, but probably one or two more for .h files).

--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm

From jack@oratrix.nl  Mon Nov  1 10:12:37 1999
From: jack@oratrix.nl (Jack Jansen)
Date: Mon, 01 Nov 1999 11:12:37 +0100
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
In-Reply-To: Message by "Mark Hammond" , Mon, 1 Nov 1999 12:51:56 +1100 , <002301bf240b$ae61fa00$0501a8c0@bobcat>
Message-ID: <19991101101238.3D6FA35BB1E@snelboot.oratrix.nl>

I think I agree with Mark's post, although I do see a little more light (the relative imports discussion resulted in working code, for instance).

The benevolent lieutenant idea may work, _if_ the lieutenants can be found.  I myself will quickly join Mark in wishing the new python-dev well and abandoning ship (half a :-).

If that doesn't work, maybe we should try at the very least to create a "memory".  If you bring up a subject for discussion and you don't have working code, that's fine the first time.  But if anyone brings it up a second time they're supposed to have code.  That way at least we won't be rehashing old discussions (as happens on the python-list every time, with subjects like GC or optimizations).

And maybe we should limit ourselves in our replies: don't speak up too much in discussions if you're not going to write code.  I know that I'm pretty good at answering with my brilliant insights to everything myself :-).  It could well be that refining and refining the design (as in the getopt discussion) results in such a mess of opinions that no one has the guts to write the code anymore.

--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm

From mal@lemburg.com  Mon Nov  1 11:09:21 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 01 Nov 1999 12:09:21 +0100
Subject: [Python-Dev] dircache.py
References: <1270737688-19939033@hypernet.com>
Message-ID: <381D74E0.1AE3DA6A@lemburg.com>

Gordon McMillan wrote:
>
> Pursuant to my volunteering to implement Guido's plan to
> combine cmp.py, cmpcache.py, dircmp.py and dircache.py
> into filecmp.py, I did some investigating of dircache.py.
>
> I find it completely unreliable. On my NT box, the mtime of the
> directory is updated (on average) 2 secs after a file is added,
> but within 10 tries, there's always one in which it takes more
> than 100 secs (and my test script quits). My Linux box hardly
> ever detects a change within 100 secs.
>
> I've tried a number of ways of testing this ("this" being
> checking for a change in the mtime of the directory), the latest
> of which is below.
> Even if dircache can be made to work
> reliably and surprise-free on some platforms, I doubt it can be
> done cross-platform. So I'd recommend that it just get dropped.
>
> Comments?

Note that you'll have to flush and close the tmp file to actually have it written to the file system.  That's why you are not seeing any new mtimes on Linux.

Still, I'd suggest declaring it obsolete.  Filesystem access is usually cached by the underlying OS anyway, so adding another layer of caching on top of it doesn't seem worthwhile (plus, the OS knows better when and what to cache).

Another argument against using stat() time entries for caching purposes is the resolution of 1 second.  It makes dircache.py unreliable per se for fast-changing directories.  The problem is most probably even worse for NFS, and on Samba-mounted WinXX filesystems the mtime trick doesn't work at all (stat() returns the creation time for atime, mtime and ctime).

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 60 days left
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/

From gward@cnri.reston.va.us  Mon Nov  1 13:28:51 1999
From: gward@cnri.reston.va.us (Greg Ward)
Date: Mon, 1 Nov 1999 08:28:51 -0500
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
In-Reply-To: <002301bf240b$ae61fa00$0501a8c0@bobcat>; from mhammond@skippinet.com.au on Mon, Nov 01, 1999 at 12:51:56PM +1100
References: <002301bf240b$ae61fa00$0501a8c0@bobcat>
Message-ID: <19991101082851.A16952@cnri.reston.va.us>

On 01 November 1999, Mark Hammond said:
> I have for some time been wondering about the usefulness of this
> mailing list.  It seems to have produced staggeringly few results
> since inception.

Perhaps this is an indication of stability rather than stagnation.  Of course we can't have *total* stability or Python 1.6 will never appear, but...

> * Portable "spawn" module for core?
>   No result.

...I started this little thread to see if there was any interest, and to find out the easy way if VMS/Unix/DOS-style "spawn sub-process with list of strings as command-line arguments" makes any sense at all on the Mac, without actually having to go learn about the Mac.

The result: if 'spawn()' is added to the core, it should probably be 'os.spawn()', but it's not really clear if this is necessary or useful to many people; and, no, it doesn't make sense on the Mac.  That answered my questions, so I don't really see the thread as a failure.  I might still turn the distutils.spawn module into an appendage of the os module, but there doesn't seem to be a compelling reason to do so.  Not every thread has to result in working code.  In other words, negative results are results too.

        Greg

From skip@mojam.com (Skip Montanaro)  Mon Nov  1 16:58:41 1999
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Mon, 1 Nov 1999 10:58:41 -0600 (CST)
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
In-Reply-To: <002301bf240b$ae61fa00$0501a8c0@bobcat>
References: <002301bf240b$ae61fa00$0501a8c0@bobcat>
Message-ID: <14365.50881.778143.590205@dolphin.mojam.com>

    Mark> * Catching "return" and "return expr" at compile time
    Mark> Seemed to be blessed - yay! Dont believe I have seen a check-in
    Mark> yet.

I did post a patch to compile.c here and to the announce list.  I think the temporal distance between the furor in the main list and when it appeared "in print" may have been a problem.
Also, as the author of that code I surmised that compile.c was the wrong place for it.  I would have preferred to see it in some Python code somewhere, but there's no obvious place to put it.  Finally, there is as yet no convention about how to handle warnings.  (Maybe some sort of PyLint needs to be "blessed" and made part of the distribution.)

Perhaps python-dev would be a good place to generate SIGs, sort of like a hurricane spinning off tornadoes.

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...

From guido@CNRI.Reston.VA.US  Mon Nov  1 18:41:32 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 01 Nov 1999 13:41:32 -0500
Subject: [Python-Dev] Misleading syntax error text
In-Reply-To: Your message of "Mon, 01 Nov 1999 00:00:55 +0100." <381CCA27.59506CF6@lemburg.com>
References: <1270838575-13870925@hypernet.com> <381CCA27.59506CF6@lemburg.com>
Message-ID: <199911011841.NAA06233@eric.cnri.reston.va.us>

> How about chainging the com_assign_trailer function in Python/compile.c
> to:

Please don't use the python-dev list for issues like this.  The place to go is the python-bugs database (http://www.python.org/search/search_bugs.html) or you could just send me a patch (please use a context diff and include the standard disclaimer language).

--Guido van Rossum (home page: http://www.python.org/~guido/)

From mal@lemburg.com  Mon Nov  1 19:06:39 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 01 Nov 1999 20:06:39 +0100
Subject: [Python-Dev] Misleading syntax error text
References: <1270838575-13870925@hypernet.com> <381CCA27.59506CF6@lemburg.com> <199911011841.NAA06233@eric.cnri.reston.va.us>
Message-ID: <381DE4BF.951B03F0@lemburg.com>

Guido van Rossum wrote:
>
> > How about chainging the com_assign_trailer function in Python/compile.c
> > to:
>
> Please don't use the python-dev list for issues like this.  The place
> to go is the python-bugs database
> (http://www.python.org/search/search_bugs.html) or you could just send
> me a patch (please use a context diff and include the standard disclaimer
> language).

This wasn't really a bug report... I was actually looking for some feedback prior to sending a real (context) patch.

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 60 days left
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/

From jim@interet.com  Tue Nov  2 15:43:56 1999
From: jim@interet.com (James C. Ahlstrom)
Date: Tue, 02 Nov 1999 10:43:56 -0500
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
References: <002301bf240b$ae61fa00$0501a8c0@bobcat>
Message-ID: <381F06BC.CC2CBFBD@interet.com>

Mark Hammond wrote:
>
> I have for some time been wondering about the usefulness of this
> mailing list.  It seems to have produced staggeringly few results
> since inception.

I appreciate the points you made, but I think this list is still a valuable place to air design issues.  I don't want to see too many Python core changes anyway.

Just my 2.E-2 worth.

Jim Ahlstrom

From Vladimir.Marangozov@inrialpes.fr  Wed Nov  3 22:34:44 1999
From: Vladimir.Marangozov@inrialpes.fr (Vladimir Marangozov)
Date: Wed, 3 Nov 1999 23:34:44 +0100 (NFT)
Subject: [Python-Dev] paper available
Message-ID: <199911032234.XAA26442@pukapuka.inrialpes.fr>

I've OCR'd Saltzer's paper.
It's available temporarily (in MS Word format) at

    http://sirac.inrialpes.fr/~marangoz/tmp/Saltzer.zip

Since there may be legal problems with LNCS, I will disable the link shortly (so those of you who have not received a copy and are interested in reading it, please grab it quickly).

If prof. Saltzer agrees (and if he can, legally) to put it on his web page, I guess that the paper will show up at http://mit.edu/saltzer/

Jeremy, could you please check this with prof. Saltzer?

(This version might need some corrections due to the OCR process, although I've made a significant effort to clean it up.)

--
       Vladimir MARANGOZOV          | Vladimir.Marangozov@inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252

From guido@CNRI.Reston.VA.US  Thu Nov  4 20:58:53 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 04 Nov 1999 15:58:53 -0500
Subject: [Python-Dev] wish list
Message-ID: <199911042058.PAA15437@eric.cnri.reston.va.us>

I got the wish list below.  Anyone care to comment on how close we are on fulfilling some or all of this?

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date: Thu, 04 Nov 1999 20:26:54 +0700
From: "Claudio Ramón"
To: guido@python.org

Hello,

I'm a python user (excuse my english, I'm spanish and...).  I think it is a very complete language and I use it in solve statistics, phisics, mathematics, chemistry and biology problemns.  I'm not an experienced programmer, only a scientific with problems to solve.

The motive of this letter is explain to you a needs that I have in the python use and I think in the next versions...

* GNU CC for Win32 compatibility (compilation of python interpreter and "Freeze" utility).  I think MingWin32 (Mummint Khan) is a good alternative eviting the cygwin dll user.

* Add low level programming capabilities for system access and speed of code fragments eviting the C-C++ or Java code use.  Python, I think, must be a complete programming language in the "programming for every body" philosofy.

* Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI in the standard distribution.  For example, Wxpython permit an html browser.  It is very importan for document presentations.  And Wxwindows and Gtk+ are faster than tk.

* Incorporate a database system in the standard library distribution.  To be possible with relational and documental capabilites and with import facility of DBASE, Paradox, MSAccess files.

* Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to be possible with XML how internal file format).  And to be possible with Microsoft Word import export facility.  For example, AbiWord project can be an alternative but if lacks programming language.  If we can make python the programming language for AbiWord project...

Thanks.

Ramón Molina.
rmn70@hotmail.com

______________________________________________________
Get Your Private, Free Email at http://www.hotmail.com

------- End of Forwarded Message

From skip@mojam.com (Skip Montanaro)  Thu Nov  4 21:06:53 1999
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Thu, 4 Nov 1999 15:06:53 -0600 (CST)
Subject: [Python-Dev] wish list
In-Reply-To: <199911042058.PAA15437@eric.cnri.reston.va.us>
References: <199911042058.PAA15437@eric.cnri.reston.va.us>
Message-ID: <14369.62829.389307.377095@dolphin.mojam.com>

    * Incorporate a database system in the standard library distribution. To be
    possible with relational and documental capabilites and with import facility
    of DBASE, Paradox, MSAccess files.
I know Digital Creations has a dbase module knocking around there somewhere.  I hacked on it for them a couple years ago.  You might see if JimF can scrounge it up and donate it to the cause.

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...

From fdrake@acm.org  Thu Nov  4 21:08:26 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Thu, 4 Nov 1999 16:08:26 -0500 (EST)
Subject: [Python-Dev] wish list
In-Reply-To: <199911042058.PAA15437@eric.cnri.reston.va.us>
References: <199911042058.PAA15437@eric.cnri.reston.va.us>
Message-ID: <14369.62922.994300.233350@weyr.cnri.reston.va.us>

Guido van Rossum writes:
> I got the wish list below.  Anyone care to comment on how close we are
> on fulfilling some or all of this?

Claudio Ramón wrote:
> * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI
> in the standard distribution. For example, Wxpython permit an html browser.
> It is very importan for document presentations. And Wxwindows and Gtk+ are
> faster than tk.

And GTK+ looks better, too.  ;-)  None the less, I don't think GTK+ is as solid or mature as Tk.  There are still a lot of oddities, and several warnings/errors get messages printed on stderr/stdout (don't know which) rather than raising exceptions.  (This is a failing of GTK+, not PyGTK.)  There isn't an equivalent of the Tk text widget, which is a real shame.  There are people working on something better, but it's not a trivial project and I don't have any idea how it's going.

> * Incorporate a database system in the standard library distribution. To be
> possible with relational and documental capabilites and with import facility
> of DBASE, Paradox, MSAccess files.

Doesn't sound like part of a core library really, though I could see combining the Win32 extensions with the core package to produce a single installable.  That should at least provide access to MSAccess, and possibly the others, via ODBC.

> * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to
> be possible with XML how internal file format). And to be possible with
> Microsoft Word import export facility. For example, AbiWord project can be
> an alternative but if lacks programming language. If we can make python the
> programming language for AbiWord project...

I think this would be great to have.  But I wouldn't put the editor/browser in the core.  I would stick something like the XML-SIG's package in, though, once that's better polished.

  -Fred

--
Fred L. Drake, Jr.
Corporation for National Research Initiatives

From jim@interet.com  Fri Nov  5 00:09:40 1999
From: jim@interet.com (James C. Ahlstrom)
Date: Thu, 04 Nov 1999 19:09:40 -0500
Subject: [Python-Dev] wish list
References: <199911042058.PAA15437@eric.cnri.reston.va.us>
Message-ID: <38222044.46CB297E@interet.com>

Guido van Rossum wrote:
>
> I got the wish list below.  Anyone care to comment on how close we are
> on fulfilling some or all of this?

> * GNU CC for Win32 compatibility (compilation of python interpreter and
> "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative
> eviting the cygwin dll user.

I don't know what this means.

> * Add low level programming capabilities for system access and speed of code
> fragments eviting the C-C++ or Java code use. Python, I think, must be a
> complete programming language in the "programming for every body" philosofy.

I don't know what this means in practical terms either.  I use the C interface for this.
> * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI
> in the standard distribution. For example, Wxpython permit an html browser.
> It is very importan for document presentations. And Wxwindows and Gtk+ are
> faster than tk.

As a Windows user, I don't feel comfortable publishing GUI code based on these tools.  Maybe they have progressed and I should look at them again.  But I doubt the Python world is going to standardize on a single GUI anyway.

Does anyone out there publish Windows Python code with a Windows Python GUI?  If so, what GUI toolkit do you use?

Jim Ahlstrom

From rushing@nightmare.com  Fri Nov  5 07:22:22 1999
From: rushing@nightmare.com (Sam Rushing)
Date: Thu, 4 Nov 1999 23:22:22 -0800 (PST)
Subject: [Python-Dev] wish list
In-Reply-To: <668469884@toto.iv>
Message-ID: <14370.34222.884193.260990@seattle.nightmare.com>

James C. Ahlstrom writes:
> Guido van Rossum wrote:
> > I got the wish list below.  Anyone care to comment on how close we are
> > on fulfilling some or all of this?
>
> > * GNU CC for Win32 compatibility (compilation of python interpreter and
> > "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative
> > eviting the cygwin dll user.
>
> I don't know what this means.

mingw32: 'minimalist gcc for win32'.  It's gcc on win32 without trying to be unix.  It links against crtdll, so for example it can generate small executables that run on any win32 platform.  Also, an alternative to plunking down money every year to keep up with MSVC++.

I used to use mingw32 a lot, and it's even possible to set up egcs to cross-compile to it.  At one point, using egcs on linux, I was able to build a stripped-down python.exe for win32...

http://agnes.dida.physik.uni-essen.de/~janjaap/mingw32/

-Sam

From jim@interet.com  Fri Nov  5 14:04:59 1999
From: jim@interet.com (James C. Ahlstrom)
Date: Fri, 05 Nov 1999 09:04:59 -0500
Subject: [Python-Dev] wish list
References: <14370.34222.884193.260990@seattle.nightmare.com>
Message-ID: <3822E40B.99BA7CA0@interet.com>

Sam Rushing wrote:
> mingw32: 'minimalist gcc for win32'.  it's gcc on win32 without trying
> to be unix.  It links against crtdll, so for example it can generate

OK, thanks.  But I don't believe this is something that Python should pursue.  Binaries are available for Windows, and Visual C++ is widely available and has a professional debugger (etc.).

Jim Ahlstrom

From skip@mojam.com (Skip Montanaro)  Fri Nov  5 17:17:58 1999
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Fri, 5 Nov 1999 11:17:58 -0600 (CST)
Subject: [Python-Dev] paper available
In-Reply-To: <199911032234.XAA26442@pukapuka.inrialpes.fr>
References: <199911032234.XAA26442@pukapuka.inrialpes.fr>
Message-ID: <14371.4422.96832.498067@dolphin.mojam.com>

    Vlad> I've OCR'd Saltzer's paper. It's available temporarily (in MS Word
    Vlad> format) at http://sirac.inrialpes.fr/~marangoz/tmp/Saltzer.zip

I downloaded it and took a very quick peek at it, but its applicability to Python wasn't immediately obvious to me.  Did you download it in response to some other thread I missed somewhere?

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...

From gstein@lyra.org  Fri Nov  5 22:19:49 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 5 Nov 1999 14:19:49 -0800 (PST)
Subject: [Python-Dev] wish list
In-Reply-To: <3822E40B.99BA7CA0@interet.com>
Message-ID:

On Fri, 5 Nov 1999, James C. Ahlstrom wrote:
> Sam Rushing wrote:
> > mingw32: 'minimalist gcc for win32'.
> > it's gcc on win32 without trying
> > to be unix.  It links against crtdll, so for example it can generate
>
> OK, thanks.  But I don't believe this is something that
> Python should pursue.  Binaries are available for Windows
> and Visual C++ is widely available and has a professional
> debugger (etc.).

If somebody is willing to submit patches, then I don't see a problem with it.  There are quite a few people who are unable/unwilling to purchase VC++.  People may also need to build their own Python rather than using the prebuilt binaries.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

From gstein@lyra.org  Sun Nov  7 13:24:24 1999
From: gstein@lyra.org (Greg Stein)
Date: Sun, 7 Nov 1999 05:24:24 -0800 (PST)
Subject: [Python-Dev] updated modules
Message-ID:

Hi all...

I've updated some of the modules at http://www.lyra.org/greg/python/.  Specifically, there is a new httplib.py, davlib.py, qp_xml.py, and a new imputil.py.  The latter will be updated again RSN with some patches from Jim Ahlstrom.

Besides some tweaks/fixes/etc, I've also clarified the ownership and licensing of the things.  httplib and davlib are (C) Guido, licensed under the Python license (well... anything he chooses :-).  qp_xml and imputil are still Public Domain.  I also added some comments into the headers to note where they come from (I've had a few people remark that they ran across the module but had no idea who wrote it or where to get updated versions :-), and I inserted a CVS Id to track the versions (yes, I put them into CVS just now).

Note: as soon as I figure out the paperwork or whatever, I'll also be skipping the whole "wetsign.txt" thingy and just transfer everything to Guido.  He remarked a while ago that he will finally own some code in the Python distribution(!) despite not writing it :-)  I might encourage others to consider the same...

Cheers,
-g

--
Greg Stein, http://www.lyra.org/

From mal@lemburg.com  Mon Nov  8 09:33:30 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 08 Nov 1999 10:33:30 +0100
Subject: [Python-Dev] wish list
References: <199911042058.PAA15437@eric.cnri.reston.va.us>
Message-ID: <382698EA.4DBA5E4B@lemburg.com>

Guido van Rossum wrote:
>
> * GNU CC for Win32 compatibility (compilation of python interpreter and
> "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative
> eviting the cygwin dll user.

I think this would be a good alternative for all those not having MS VC for one reason or another.  Since Mingw32 is free, this might be an appropriate solution for e.g. schools which don't want to spend lots of money for VC licenses.

> * Add low level programming capabilities for system access and speed of code
> fragments eviting the C-C++ or Java code use. Python, I think, must be a
> complete programming language in the "programming for every body" philosofy.

Don't know what he meant here...

> * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI
> in the standard distribution. For example, Wxpython permit an html browser.
> It is very importan for document presentations. And Wxwindows and Gtk+ are
> faster than tk.

GUIs tend to be fast-moving targets, better leave them out of the main distribution.

> * Incorporate a database system in the standard library distribution. To be
> possible with relational and documental capabilites and with import facility
> of DBASE, Paradox, MSAccess files.

Database interfaces are usually way too complicated and largish for the standard dist.  IMHO, they should always be packaged separately.
Note that simple interfaces such as a standard CSV file import/export module would be neat extensions to the dist.

> * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to
> be possible with XML how internal file format). And to be possible with
> Microsoft Word import export facility. For example, AbiWord project can be
> an alternative but if lacks programming language. If we can make python the
> programming language for AbiWord project...

I'm getting the feeling that Ramon is looking for a complete visual programming environment here.  XML support in the standard dist (faster than xmllib.py) would be nice.  Before that we'd need solid builtin Unicode support though...

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 53 days left
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/

From andy@robanal.demon.co.uk  Tue Nov  9 13:57:46 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 9 Nov 1999 05:57:46 -0800 (PST)
Subject: [Python-Dev] Internationalisation Case Study
Message-ID: <19991109135746.20446.rocketmail@web608.mail.yahoo.com>

Guido has asked me to get involved in this discussion, as I've been working practically full-time on i18n for the last year and a half and have done quite a bit with Python in this regard.

I thought the most helpful thing would be to describe the real-world business problems I have been tackling, so people can understand what one might want from an encoding toolkit.  In this (long) post I have included:

1. who I am and what I want to do
2. useful sources of info
3. a real world i18n project
4. what I'd like to see in an encoding toolkit

Grab a coffee - this is a long one.

1. Who I am
--------------

Firstly, credentials.  I'm a Python programmer by night, and when I can involve it in my work, which happens perhaps 20% of the time.  More relevantly, I did a postgrad course in Japanese Studies and lived in Japan for about two years; in 1990 when I returned, I was speaking fairly fluently and could read a newspaper with regular reference to a dictionary.  Since then my Japanese has atrophied badly, but it is good enough for IT purposes.

For the last year and a half I have been internationalizing a lot of systems - more on this below.

My main personal interest is that I am hoping to launch a company using Python for reporting, data cleaning and transformation.  An encoding library is sorely needed for this.

2. Sources of Knowledge
------------------------------

We should really go for world-class advice on this.  Some people who could really contribute to this discussion are:

- Ken Lunde, author of "CJKV Information Processing" and head of Asian Type Development at Adobe.
- Jeffrey Friedl, author of "Mastering Regular Expressions", and a long-time Japan resident and expert on things Japanese.
- Maybe some of the Ruby community?

I'll list books, URLs etc. for anyone who needs them on request.

3. A Real World Project
----------------------------

18 months ago I was offered a contract with one of the world's largest investment management companies (which I will nickname HugeCo), who (after many years having analysts out there) were launching a business in Japan to attract savers; due to recent legal changes, Japanese people can now freely buy into mutual funds run by foreign firms.  Given the 2% they historically get on their savings, and the 12% that US equities have returned for most of this century, this is a business with huge potential.
I've been there for a while now, rotating through many different IT projects.

HugeCo runs its non-US business out of the UK.  The core deal-processing business runs on IBM AS400s.  These are kind of a cross between a relational database and a file system, and speak their own encoding called EBCDIC.  Five years ago the AS400 had limited connectivity to everything else, so they also started deploying Sybase databases on Unix to support some functions.  This means 'mirroring' data between the two systems on a regular basis.  IBM has always included encoding information on the AS400, and it converts from EBCDIC to ASCII on request with most of the transfer tools (FTP, database queries etc.)

To make things work for Japan, everyone realised that a double-byte representation would be needed.  Japanese has about 7000 characters in most IT-related character sets, and there are a lot of ways to store it.

Here's a potted language lesson.  (Apologies to people who really know this field -- I am not going to be fully pedantic or this would take forever.)  Japanese includes two phonetic alphabets (each with about 80-90 characters), the thousands of Kanji, and English characters, often all in the same sentence.  The first attempt to display something was to make a single-byte character set which included ASCII, and a simplified (and very ugly) katakana alphabet in the upper half of the code page.  So you could spell out the sounds of Japanese words using 'half-width katakana'.

The basic 'character set' is Japan Industrial Standard 0208 ("JIS").  This was defined in 1978, the first official Asian character set to be defined by a government.  It can be thought of as a printed chart showing the characters - it does not define their storage on a computer.  It defined a logical 94 x 94 grid, and each character has an index in this grid.

The "JIS" encoding was a way of mixing ASCII and Japanese in text files and emails.  Each Japanese character had a double-byte value.  It had 'escape sequences' to say 'You are now entering ASCII territory' or the opposite.  Microsoft quickly came up with Shift-JIS, a smarter encoding.  This basically said "Look at the next byte.  If below 127, it is ASCII; if between A and B, it is a half-width katakana; if between B and C, it is the first half of a double-byte character and the next one is the second half."  Extended Unix Code (EUC) does similar tricks.  Both have the property that there are no control characters, and ASCII is still ASCII.

There are a few other encodings too.

Unfortunately for me and HugeCo, IBM had their own standard before the Japanese government did, and it differs; it is most commonly called DBCS (Double-Byte Character Set).  This involves shift-in and shift-out sequences (0x16 and 0x17, cannot remember which way round), so you can mix single and double bytes in a field.  And we used AS400s for our core processing.

So, back to the problem.  We had a FoxPro system using Shift-JIS on the desks in Japan which we wanted to replace in stages, and an AS400 database to replace it with.  The first stage was to hook them up so names and addresses could be uploaded to the AS400, and data files consisting of daily report input could be downloaded to the PCs.

The AS400 supposedly had a library which did the conversions, but no one at IBM knew how it worked.  The people who did all the evaluations had basically proved that 'Hello World' in Japanese could be stored on an AS400, but never looked at the conversion issues until mid-project.
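[Editor's note: a minimal sketch (in present-day Python 3, not anything posted at the time) of the byte classification just described for Shift-JIS; the range boundaries follow the common CP932 layout, and the function names are invented for illustration.]

    def classify_sjis_byte(b):
        # Classify a byte in a Shift-JIS string, following the ranges
        # described above: plain ASCII, half-width katakana, or the
        # lead byte of a double-byte character.
        if b < 0x80:
            return 'ascii'
        if 0xA1 <= b <= 0xDF:
            return 'halfwidth_katakana'
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            return 'double_byte_lead'
        return 'invalid'

    def iter_sjis_chars(data):
        # Walk a Shift-JIS byte string one character at a time,
        # pairing each lead byte with its trail byte.
        i = 0
        while i < len(data):
            kind = classify_sjis_byte(data[i])
            if kind == 'double_byte_lead':
                yield kind, data[i:i + 2]
                i += 2
            else:
                yield kind, data[i:i + 1]
                i += 1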
Not only did we need a conversion filter, we had the problem that the character sets were of different sizes.  So it was possible - indeed, likely - that some of our ten thousand customers' names and addresses would contain characters only on one system or the other, and fail to survive a round trip.  (This is the absolute key issue for me - will a given set of data survive a round trip through various encoding conversions?)

We figured out how to get the AS400 to do the conversions during a file transfer in one direction, and I wrote some Python scripts to make up files with each official character in JIS on a line; these went up with conversion, came back binary, and I was able to build a mapping table and 'reverse engineer' the IBM encoding.  It was straightforward in theory, "fun" in practice.  I then wrote a Python library which knew about the AS400 and Shift-JIS encodings, and could translate a string between them.  It could also detect corruption and warn us when it occurred.  (This is another key issue - you will often get badly encoded data, half a kanji or a couple of random bytes, and need to be clear on your strategy for handling it in any library.)  It was slow, but it got us our gateway in both directions, and it warned us of bad input.  360 characters in the DBCS encoding actually appear twice, so perfect round trips are impossible, but practically you can survive with some validation of input at both ends.

The final story was that our names and addresses were mostly safe, but a few obscure symbols weren't.

A big issue was that field lengths varied.  An address field 40 characters long on a PC might grow to 42 or 44 on an AS400 because of the shift characters, so the software would truncate the address during import, and cut a kanji in half.  This resulted in a string that was illegal DBCS, and errors in the database.  To guard against this, you need really picky input validation.  You not only ask 'is this string valid Shift-JIS', you check it will fit on the other system too.

The next stage was to bring in our Sybase databases.  Sybase make a Unicode database, which works like the usual one except that all your SQL code suddenly becomes case-sensitive - more (unrelated) fun when you have 2000 tables.  Internally it stores data in UTF8, which is a 'rearrangement' of Unicode which is much safer to store in conventional systems.  Basically, a UTF8 character is between one and three bytes, there are no nulls or control characters, and the ASCII characters are still the same ASCII characters.  UTF8<->Unicode involves some bit twiddling but is one-to-one and entirely algorithmic.

We had a product to 'mirror' data between AS400 and Sybase, which promptly broke when we fed it Japanese.  The company bought a library called Unilib to do conversions, and started rewriting the data mirror software.  This library (like many) uses Unicode as a central point in all conversions, and offers most of the world's encodings.  We wanted to test it, and used the Python routines to put together a regression test.  As expected, it was mostly right but had some differences, which we were at least able to document.

We also needed to rig up a daily feed from the legacy FoxPro database into Sybase while it was being replaced (about six months).  We took the same library, built a DLL wrapper around it, and I interfaced to this with DynWin, so we were able to do the low-level string conversion in compiled code and the high-level control in Python.
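[Editor's note: the "bit twiddling" mentioned a couple of paragraphs above is small enough to show.  A sketch of the one-to-three-byte UTF8 packing for code points up to 0xFFFF, in present-day Python; real code would simply use the built-in codecs, this is only to make the algorithm concrete.]

    def utf8_encode_bmp(code_point):
        # Pack a single code point (0..0xFFFF, ignoring surrogates) into UTF-8.
        if code_point < 0x80:                        # one byte: plain ASCII
            return bytes([code_point])
        if code_point < 0x800:                       # two bytes: 110xxxxx 10xxxxxx
            return bytes([0xC0 | (code_point >> 6),
                          0x80 | (code_point & 0x3F)])
        return bytes([0xE0 | (code_point >> 12),     # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])

    # quick check against the built-in codec
    assert utf8_encode_bmp(0x65E5) == u'\u65e5'.encode('utf-8')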
A FoxPro batch job wrote out delimited text in Shift-JIS; Python read this in, ran it through the DLL to convert it to UTF8, wrote that out as UTF8 delimited files, and ftp'ed them to an 'in' directory on the Unix box ready for daily import.  At this point we had a lot of fun with field widths - Shift-JIS is much more compact than UTF8 when you have a lot of kanji (e.g. address fields).

Another issue was half-width katakana.  These were the earliest attempt to get some form of Japanese out of a computer, and are single-byte characters above 128 in Shift-JIS - but are not part of the JIS0208 standard.  They look ugly and are discouraged; but when you are entering a long address in a field of a database, and it won't quite fit, the temptation is to go from two-bytes-per-character to one (just hit F7 in Windows) to save space.  Unilib rejected these (as would Java), but has optional modes to preserve them or 'expand them out' to their full-width equivalents.

The final technical step was our reports package.  This is a 4GL using a really horrible 1980s Basic-like language which reads in fixed-width data files and writes out Postscript; you write programs saying 'go to x,y' and 'print customer_name', and can build up anything you want out of that.  It's a monster to develop in, but when done it really works - million-page jobs no problem.  We had bought into this on the promise that it supported Japanese; actually, I think they had got the equivalent of 'Hello World' out of it, since we had a lot of problems later.

The first stage was that the AS400 would send down fixed-width data files in EBCDIC and DBCS.  We ran these through a C++ conversion utility, again using Unilib.  We had to filter out and warn about corrupt fields, which the conversion utility would reject.  Surviving records then went into the reports program.

It then turned out that the reports program only supported some of the Japanese alphabets.  Specifically, it had a built-in font-switching system whereby when it encountered ASCII text, it would flip to the most recent single-byte font, and when it found a byte above 127, it would flip to a double-byte font.  This is because many Chinese fonts do (or did) not include English characters, or included really ugly ones.  This was wrong for Japanese, and made the half-width katakana unprintable.

I found out that I could control fonts if I printed one character at a time with a special escape sequence, so wrote my own bit-scanning code (tough in a language without ord() or bitwise operations) to examine a string, classify every byte, and control the fonts the way I wanted.  So a special subroutine is used for every name or address field.  This is apparently not unusual in GUI development (especially web browsers) - you rarely find a complete Unicode font, so you have to switch fonts on the fly as you print a string.

After all of this, we had a working system and knew quite a bit about encodings.  Then the curve ball arrived: User Defined Characters!

It is not true to say that there are exactly 6879 characters in Japanese, any more than you can count the number of languages on the Indian sub-continent or the types of cheese in France.  There are historical variations and they evolve.  Some people's names got missed out, and others like to write a kanji in an unusual way.  Others arrived from China, where they have more complex variants of the same characters.  Despite the Japanese government's best attempts, these people have dug their heels in and want to keep their names the way they like them.
My first reaction was 'Just Say No' - I basically said that if one of these customers (14 out of a database of 8000) could show me a tax form or phone bill with the correct UDC on it, we would implement it, but not otherwise (the usual workaround is to spell their name phonetically in katakana).  But our marketing people put their foot down.

A key factor is that Microsoft has 'extended the standard' a few times.  First of all, Microsoft and IBM include an extra 360 characters in their code page which are not in the JIS0208 standard.  This is well understood, and most encoding toolkits know that 'Code Page 932' is Shift-JIS plus a few extra characters.  Secondly, Shift-JIS has a User-Defined region of a couple of thousand characters.  They have lately been taking Chinese variants of Japanese characters (which are readable but a bit old-fashioned - I can imagine pipe-smoking professors using these forms as an affectation) and adding them into their standard Windows fonts; so users are getting used to these being available.  These are not in a standard.  Thirdly, they include something called the 'Gaiji Editor' in Japanese Win95, which lets you add new characters to the fonts on your PC within the user-defined region.

The first step was to review all the PCs in the Tokyo office, and get one centralized extension font file on a server.  This was also fun, as people had assigned different code points to characters on different machines, so what looked correct on your word processor was a black square on mine.  Effectively, each company has its own custom encoding a bit bigger than the standard.

Clearly, none of these extensions would convert automatically to the other platforms.  Once we actually had an agreed list of code points, we scanned the database by eye and made sure that the relevant people were using them.  We decided that space for 128 User-Defined Characters would be allowed.

We thought we would need a wrapper around Unilib to intercept these values and do a special conversion; but to our amazement it worked!  Somebody had already figured out a mapping for at least 1000 characters for all the Japanese encodings, and they did the round trips from Shift-JIS to Unicode to DBCS and back.  So the conversion problem needed less code than we thought.  This mapping is not defined in a standard AFAIK (certainly not for DBCS anyway).

We did, however, need some really impressive validation.  When you input a name or address on any of the platforms, the system should say (a) is it valid for my encoding?  (b) will it fit in the available field space on the other platforms?  (c) if it contains user-defined characters, are they the ones we know about, or is this a new guy who will require updates to our fonts etc.?

Finally, we got back to the display problems.  Our chosen range had a particular first byte.  We built a miniature font with the characters we needed starting in the lower half of the code page.  I then generalized my name-printing routine to say 'if the first character is XX, throw it away, and print the subsequent character in our custom font'.  This worked beautifully - not only could we print everything, we were using Type 1 embedded fonts for the user-defined characters, so we could distill it and also capture it for our internal document imaging systems.

So, that is roughly what is involved in building a Japanese client reporting system that spans several platforms.
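[Editor's note: the "will it survive a round trip" and "will it fit in the field" checks that recur throughout this post are easy to express with today's standard codecs.  A hedged sketch, using the stock 'cp932' codec rather than the Unilib/AS400 converters discussed above; the function names are invented for illustration.]

    def survives_round_trip(text, encoding):
        # Can this Unicode string be encoded and decoded back without loss?
        try:
            return text.encode(encoding).decode(encoding) == text
        except UnicodeError:
            return False

    def validate_field(text, encoding, max_bytes):
        # The two validations described above: is the value representable in
        # the target encoding at all, and does it still fit the fixed field
        # width once encoded (a kanji may take more bytes in another encoding)?
        if not survives_round_trip(text, encoding):
            return 'unrepresentable'
        if len(text.encode(encoding)) > max_bytes:
            return 'too long'
        return 'ok'

    print(validate_field(u'\u6771\u4eac\u90fd\u6e2f\u533a', 'cp932', 40))   # -> 'ok'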
I then moved over to the web team to work on our online trading system for Japan, where I am now - people will be able to open accounts and invest on the web.  The first stage was to prove it all worked.

With HTML, Java and the Web, I had high hopes, which have mostly been fulfilled - we set an option in the database connection to say 'this is a UTF8 database', and Java converts it to Unicode when reading the results, and we set another option saying 'the output stream should be Shift-JIS' when we spew out the HTML.  There is one limitation: Java sticks to the JIS0208 standard, so the 360 extra IBM/Microsoft kanji and our user-defined characters won't work on the web.  You cannot control the fonts on someone else's web browser; management accepted this because we gave them no alternative.  Certain customers will need to be warned, or asked to suggest a standard version of a character if they want to see their name on the web.  I really hope the web actually brings character usage in line with the standard in due course, as it will save a fortune.

Our system is multi-language - when a customer logs in, we want to say 'You are a Japanese customer of our Tokyo Operation, so you see page X in language Y'.  The language strings are all kept in UTF8 in XML files, so the same file can hold many languages.  This and the database are the real-world reasons why you want to store stuff in UTF8.  There are very few tools to let you view UTF8, but luckily there is a free word processor that lets you type Japanese and save it in any encoding; so we can cut and paste between Shift-JIS and UTF8 as needed.

And that's it.  No climactic endings and a lot of real-world mess, just like life in IT.  But hopefully this gives you a feel for some of the practical stuff internationalisation projects have to deal with.  See my other mail for actual suggestions.

- Andy Robinson

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com

From andy@robanal.demon.co.uk  Tue Nov  9 13:58:39 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 9 Nov 1999 05:58:39 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>

Here are the features I'd like to see in a Python Internationalisation Toolkit.  I'm very open to persuasion about APIs and how to do it, but this is roughly the functionality I would have wanted for the last year (see separate post "Internationalization Case Study"):

Built-in types:
---------------

"Unicode String" and "Normal String".  The normal string can hold all 256 possible byte values and is analogous to Java's byte array - in other words, an ordinary Python string.  Unicode strings iterate (and are manipulated) per character, not per byte.  You knew that already.

To manipulate anything in a funny encoding, you convert it to Unicode, manipulate it there, then convert it back.

Easy Conversions
----------------------

This is modelled on Java, which I think has it right.  When you construct a Unicode string, you may supply an optional encoding argument.  I'm not bothered if conversion happens in a global function, a constructor method or whatever.
MyUniString = ToUnicode('hello')                                   # assumes ASCII
MyUniString = ToUnicode('pretend this is Japanese', 'ShiftJIS')    # specified

The converse applies when converting back.

The encoding designators should agree with Java.  If data is encountered which is not valid for the encoding, there are several strategies, and it would be nice if they could be specified explicitly:

1. replace offending characters with a question mark
2. try to recover intelligently (possible in some cases)
3. raise an exception

A 'Unicode' designator is needed which performs a dummy conversion.

File Opening:
---------------

It should be possible to work with files as we do now - just streams of binary data.  It should also be possible to read, say, a file of locally encoded addresses into a Unicode string, e.g. open(myfile, 'r', 'ShiftJIS').  It should also be possible to open a raw Unicode file and read the bytes into ordinary Python strings, or Unicode strings.  In this case one needs to watch out for the byte-order marks at the beginning of the file.  Not sure of a good API to do this.  We could have OrdinaryFile objects and UnicodeFile objects, or proliferate the arguments to 'open'.

Doing the Conversions
----------------------------

All conversions should go through Unicode as the central point.

Here is where we can start to define the territory.  Some conversions are algorithmic, some are lookups, many are a mixture with some simple state transitions (e.g. shift characters to denote switches from double-byte to single-byte).  I'd like to see an 'encoding engine' modelled on something like mxTextTools - a state machine with a few simple actions, effectively a mini-language for doing simple operations.  Then a new encoding can be added in a data-driven way, and still go at C-like speeds.  Making this open and extensible (and preferably not needing to code C to do it) is the only way I can see to get a really good, solid encodings library.  Not all encodings need go in the standard distribution, but all should be downloadable from www.python.org.

A generalized two-byte-to-two-byte mapping is 128kb.  But there are compact forms which can reduce these to a few kb, and also make the data intelligible.  It is obviously desirable to store stuff compactly if we can unpack it fast.

Typed Strings
----------------

When you are writing data conversion tools to sit in the middle of a bunch of databases, you could save a lot of grief with a string that knows its encoding.  What follows could be done as a Python wrapper around ordinary strings rather than as a new type, and thus need not be part of the language.  This is analogous to Martin Fowler's Quantity pattern in Analysis Patterns, where a number knows its units and you cannot add dollars and pounds accidentally.

These would do implicit conversions; and they would stop you assigning or confusing differently encoded strings.  They would also validate when constructed.  'Typecasting' would be allowed but would require explicit code.  So maybe something like...

>>> ts1 = TypedString('hello', 'cp932ms')   # specify encoding, it remembers it
>>> ts2 = TypedString('goodbye', 'cp5035')
>>> ts1 + ts2   # or any of a host of other encoding options
EncodingError
>>> ts3 = TypedString(ts1, 'cp5035')   # converts it implicitly going via Unicode
>>> ts4 = ts1.cast('ShiftJIS')   # the developer knows that in this case the string is compatible
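[Editor's note: a bare-bones sketch of the TypedString idea in present-day Python, just to make the proposed behaviour concrete - validate on construction, refuse to mix encodings silently, allow an explicit cast.  The class and method names follow the examples above; everything else is invented for illustration and is not part of the proposal.]

    class EncodingError(Exception):
        pass

    class TypedString:
        # A byte string that remembers its encoding: validated when built,
        # concatenated only with strings in the same encoding, converted
        # implicitly via Unicode, and relabelled only by an explicit cast().

        def __init__(self, data, encoding):
            self.encoding = encoding
            if isinstance(data, TypedString):
                data = data.as_unicode().encode(encoding)   # implicit conversion via Unicode
            elif isinstance(data, str):
                data = data.encode(encoding)                # raises if not representable
            else:
                bytes(data).decode(encoding)                # raises if not valid for encoding
            self.data = bytes(data)

        def as_unicode(self):
            return self.data.decode(self.encoding)

        def __add__(self, other):
            if not isinstance(other, TypedString) or other.encoding != self.encoding:
                raise EncodingError('cannot mix encodings without an explicit cast')
            return TypedString(self.data + other.data, self.encoding)

        def cast(self, encoding):
            # explicit relabelling: keep the bytes, revalidate against the new encoding
            return TypedString(self.data, encoding)

    ascii_ok = TypedString(u'hello', 'cp932').cast('shift_jis')   # plain ASCII is valid in both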
Going Deeper
----------------

The project I describe involved many more issues than just a straight conversion.  I envisage an encodings package or module which power users could get at directly.  We have to be able to answer the questions:

'is string X a valid instance of encoding Y?'

'is string X nearly a valid instance of encoding Y, maybe with a little corruption, or is it something totally different?' - this one might be a task left to a programmer, but the toolkit should help where it can.

'can string X be converted from encoding Y to encoding Z without loss of data?  If not, exactly what will get trashed?'  This is a really useful utility.

More generally, I want tools to reason about character sets and encodings.  I have 'Character Set' and 'Character Mapping' classes - very app-specific and proprietary - which let me express and answer questions about whether one character set is a superset of another, and reason about round trips.  I'd like to do these properly for the toolkit.  They would need some C support for speed, but I think they could still be data-driven.  So we could have an Encoding object which could be pickled, and we could keep a directory full of them as our database.  There might actually be two encoding objects - one for single-byte, one for multi-byte, with the same API.

There are so many subtle differences between encodings (even within the Shift-JIS family) - company X has ten extra characters, and that is technically a new encoding.  So it would be really useful to reason about these and say 'find me all JIS-compatible encodings', or 'report on the differences between Shift-JIS and cp932ms'.

GUI Issues
-------------

The new Pythonwin breaks somewhat on Japanese - editor windows are fine, but console output is shown as single-byte garbage.  I will try to evaluate IDLE on a Japanese test box this week.  I think these two need to work for double-byte languages for our credibility.

Verifiability and printing
-----------------------------

We will need to prove it all works.  This means looking at text on a screen or on paper.  A really wicked demo utility would be a GUI which could open files and convert encodings in an editor window or spreadsheet window, and specify conversions on copy/paste.  If it could save a page as HTML (just an encoding tag and data between
 tags, then we
could use Netscape/IE for verification.  Better still,
a web server demo could convert on python.org and tag
the pages appropriately - browsers support most common
encodings.

All the encoding stuff is ultimately a bit meaningless
without a way to display a character.  I am hoping
that PDF and PDFgen may add a lot of value here. 
Adobe (and Ken Lunde) have spent years coming up with
a general architecture for this stuff in PDF. 
Basically, the multi-byte fonts they use are encoding
independent, and come with a whole bunch of mapping
tables.  So I can ask for the same Japanese font in
any of about ten encodings - font name is a
combination of face name and encoding.  The font
itself does the remapping.  They make available
downloadable font packs for Acrobat 4.0 for most
languages now; these are good places to raid for
building encoding databases.  

It also means that I can write a Python script to
crank out beautiful-looking code page charts for all
of our encodings from the database, and input and
output to regression tests.  I've done it for
Shift-JIS at Fidelity, and would have to rewrite it
once I am out of here.  But I think that some good
graphic design here would lead to a product that blows
people away - an encodings library that can print out
its own contents for viewing and thus help demonstrate
its own correctness (or make errors stick out like a
sore thumb).

Am I mad?  Have I put you off forever?  What I outline
above would be a serious project needing months of
work; I'd be really happy to take a role, if we could
find sponsors for the project.  But I believe we could
define the standard for years to come.  Furthermore,
it would go a long way to making Python the corporate
choice for data cleaning and transformation -
territory I think we should own.

Regards,

Andy Robinson
Robinson Analytics Ltd.









=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From guido@CNRI.Reston.VA.US  Tue Nov  9 16:46:41 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 09 Nov 1999 11:46:41 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: Your message of "Tue, 09 Nov 1999 05:58:39 PST."
 <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
Message-ID: <199911091646.LAA21467@eric.cnri.reston.va.us>

Andy,

Thanks a bundle for your case study and your toolkit proposal.  It's
interesting that you haven't touched upon internationalization of user
interfaces (dialog text, menus etc.) -- that's a whole nother can of
worms.

Marc-Andre Lemburg has a proposal for work that I'm asking him to do
(under pressure from HP who want Python i18n badly and are willing to
pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt

I think his proposal will go a long way towards your toolkit.  I hope
to hear soon from anybody who disagrees with Marc-Andre's proposal,
because without opposition this is going to be Python 1.6's offering
for i18n...  (Together with a new Unicode regex engine by /F.)

One specific question: in your discussion of typed strings, I'm not
sure why you couldn't convert everything to Unicode and be done with
it.  I have a feeling that the answer is somewhere in your case study
-- maybe you can elaborate?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From akuchlin@mems-exchange.org  Tue Nov  9 17:21:03 1999
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 9 Nov 1999 12:21:03 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
 <199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <14376.22527.323888.677816@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>I think his proposal will go a long way towards your toolkit.  I hope
>to hear soon from anybody who disagrees with Marc-Andre's proposal,
>because without opposition this is going to be Python 1.6's offering
>for i18n...  

The proposal seems reasonable to me.

>(Together with a new Unicode regex engine by /F.)

This is good news!  Would it be a from-scratch regex implementation,
or would it be an adaptation of an existing engine?  Would it involve
modifications to the existing re module, or a completely new unicodere
module?  (If, unlike re.py, it has POSIX longest-match semantics, that
would pretty much settle the question.)

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
All around me darkness gathers, fading is the sun that shone, we must speak of
other matters, you can be me when I'm gone...
    -- The train's clattering, in SANDMAN #67: "The Kindly Ones:11"



From guido@CNRI.Reston.VA.US  Tue Nov  9 17:26:38 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 09 Nov 1999 12:26:38 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: Your message of "Tue, 09 Nov 1999 12:21:03 EST."
 <14376.22527.323888.677816@amarok.cnri.reston.va.us>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us>
 <14376.22527.323888.677816@amarok.cnri.reston.va.us>
Message-ID: <199911091726.MAA21754@eric.cnri.reston.va.us>

[AMK]
> The proposal seems reasonable to me.

Thanks.  I really hope that this time we can move forward united...

> >(Together with a new Unicode regex engine by /F.)
> 
> This is good news!  Would it be a from-scratch regex implementation,
> or would it be an adaptation of an existing engine?  Would it involve
> modifications to the existing re module, or a completely new unicodere
> module?  (If, unlike re.py, it has POSIX longest-match semantics, that
> would pretty much settle the question.)

It's from scratch, and I believe it's got Perl style, not POSIX style
semantics -- per Tim Peters' recommendations.  Do we need to open the
discussion again?

It involves a redone re module (supporting Unicode as well as 8-bit),
but its API could be unchanged.  /F does the parsing and compilation
in Python, only the matching engine is in C -- not sure how that
impacts performance, but I imagine with aggressive caching it would be
okay.
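
For what it's worth, the kind of caching meant is nothing fancier
than a dictionary of compiled patterns -- a rough sketch, with
made-up names:

import re

_cache = {}

def cached_compile(pattern, flags=0):
    # The Python-level parser/compiler runs only the first time a
    # given (pattern, flags) pair is seen; after that it's a dict lookup.
    key = (pattern, flags)
    prog = _cache.get(key)
    if prog is None:
        prog = re.compile(pattern, flags)
        _cache[key] = prog
    return prog

so the cost of doing the compiler in Python is paid once per
distinct pattern.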

--Guido van Rossum (home page: http://www.python.org/~guido/)


From akuchlin@mems-exchange.org  Tue Nov  9 17:40:07 1999
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 9 Nov 1999 12:40:07 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091726.MAA21754@eric.cnri.reston.va.us>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
 <199911091646.LAA21467@eric.cnri.reston.va.us>
 <14376.22527.323888.677816@amarok.cnri.reston.va.us>
 <199911091726.MAA21754@eric.cnri.reston.va.us>
Message-ID: <14376.23671.250752.637144@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>It's from scratch, and I believe it's got Perl style, not POSIX style
>semantics -- per Tim Peters' recommendations.  Do we need to open the
>discussion again?

No, no; I'm actually happier with Perl-style, because it's far better
documented and familiar to people. Worse *is* better, after all.

My concern is simply that I've started translating re.py into C, and
wonder how this affects the translation.  This isn't a pressing issue,
because the C version isn't finished yet.

>It involves a redone re module (supporting Unicode as well as 8-bit),
>but its API could be unchanged.  /F does the parsing and compilation
>in Python, only the matching engine is in C -- not sure how that
>impacts performance, but I imagine with aggressive caching it would be
>okay.

Can I get my paws on a copy of the modified re.py to see what
ramifications it has, or is this all still an unreleased
work-in-progress?

Doing the compilation in Python is a good idea, and will make it
possible to implement alternative syntaxes.  I would have liked to
make it possible to generate PCRE bytecodes from Python, but what
stopped me is the chance of bogus bytecode causing the engine to dump
core, loop forever, or some other nastiness.  (This is particularly
important for code that uses rexec.py, because you'd expect regexes to
be safe.)  Fixing the engine to be stable when faced with bad
bytecodes appears to require many additional checks that would slow
down the common case of correct code, which is unappealing.


-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Anybody else on the list got an opinion? Should I change the language or not?
    -- Guido van Rossum, 28 Dec 91



From ping@lfw.org  Tue Nov  9 18:08:05 1999
From: ping@lfw.org (Ka-Ping Yee)
Date: Tue, 9 Nov 1999 10:08:05 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <14376.23671.250752.637144@amarok.cnri.reston.va.us>
Message-ID: 

On Tue, 9 Nov 1999, Andrew M. Kuchling wrote:
> Guido van Rossum writes:
> >It's from scratch, and I believe it's got Perl style, not POSIX style
> >semantics -- per Tim Peters' recommendations.  Do we need to open the
> >discussion again?
> 
> No, no; I'm actually happier with Perl-style, because it's far better
> documented and familiar to people. Worse *is* better, after all.

I would concur with the preference for Perl-style semantics.
Aside from the issue of consistency with other scripting
languages, i think it's easier to predict the behaviour of
these semantics.  You can run the algorithm in your head,
and try the backtracking yourself.  It's good for the algorithm
to be predictable and well understood.

> Doing the compilation in Python is a good idea, and will make it
> possible to implement alternative syntaxes.

Also agree.  I still have some vague wishes for a simpler,
more readable (more Pythonian?) way to express patterns --
perhaps not as powerful as full regular expressions, but
useful for many simpler cases (an 80-20 solution).


-- ?!ng



From bwarsaw@cnri.reston.va.us (Barry A. Warsaw)  Tue Nov  9 18:15:04 1999
From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw)
Date: Tue, 9 Nov 1999 13:15:04 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
 <199911091646.LAA21467@eric.cnri.reston.va.us>
 <14376.22527.323888.677816@amarok.cnri.reston.va.us>
 <199911091726.MAA21754@eric.cnri.reston.va.us>
 <14376.23671.250752.637144@amarok.cnri.reston.va.us>
Message-ID: <14376.25768.368164.88151@anthem.cnri.reston.va.us>

>>>>> "AMK" == Andrew M Kuchling  writes:

    AMK> No, no; I'm actually happier with Perl-style, because it's
    AMK> far better documented and familiar to people. Worse *is*
    AMK> better, after all.

Plus, you can't change re's semantics and I think it makes sense if
the Unicode engine is as close semantically as possible to the
existing engine.

We need to be careful not to worsen performance for 8bit strings.  I
think we're already on the edge of acceptability w.r.t. P*** and
hopefully we can /improve/ performance here.

MAL's proposal seems quite reasonable.  It would be excellent to see
these things done for Python 1.6.  There's still some discussion on
supporting internationalization of applications, e.g. using gettext
but I think those are smaller in scope.

-Barry


From akuchlin@mems-exchange.org  Tue Nov  9 19:36:28 1999
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 9 Nov 1999 14:36:28 -0500 (EST)
Subject: [Python-Dev] I18N Toolkit
In-Reply-To: <14376.25768.368164.88151@anthem.cnri.reston.va.us>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
 <199911091646.LAA21467@eric.cnri.reston.va.us>
 <14376.22527.323888.677816@amarok.cnri.reston.va.us>
 <199911091726.MAA21754@eric.cnri.reston.va.us>
 <14376.23671.250752.637144@amarok.cnri.reston.va.us>
 <14376.25768.368164.88151@anthem.cnri.reston.va.us>
Message-ID: <14376.30652.201552.116828@amarok.cnri.reston.va.us>

Barry A. Warsaw writes:
(in relation to support for Unicode regexes)
>We need to be careful not to worsen performance for 8bit strings.  I
>think we're already on the edge of acceptability w.r.t. P*** and
>hopefully we can /improve/ performance here.

I don't think that will be a problem, given that the Unicode engine
would be a separate C implementation.  A bit of 'if type(strg) ==
UnicodeType' in re.py isn't going to cost very much speed.
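
Roughly this, i.e. one cheap test up front (the Unicode branch is
hypothetical, since that engine doesn't exist yet):

import re
from types import StringType

def _compile(pattern, flags=0):
    # Dispatch once, at compile time; the existing 8-bit path is untouched.
    if type(pattern) is StringType:
        return re.compile(pattern, flags)   # current PCRE-based engine
    # a Unicode pattern would be handed to the new engine here
    raise NotImplementedError('unicode engine not available yet')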

(Speeding up PCRE -- that's another question.  I'm often tempted to
rewrite pcre_compile to generate an easier-to-analyse parse tree,
instead of its current complicated-but-memory-parsimonious compiler,
but I'm very reluctant to introduce a fork like that.)

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
The world does so well without me, that I am moved to wish that I could do
equally well without the world.
    -- Robertson Davies, _The Diary of Samuel Marchbanks_



From mhammond@skippinet.com.au  Tue Nov  9 22:27:45 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 10 Nov 1999 09:27:45 +1100
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <001c01bf2b01$a58d5d50$0501a8c0@bobcat>

> I think his proposal will go a long way towards your toolkit.  I
hope
> to hear soon from anybody who disagrees with Marc-Andre's proposal,

No disagreement as such, but a small hole:

From the proposal:

Internal Argument Parsing:
--------------------------
...
's':	For Unicode objects: auto convert them to the 
	and return a pointer to the object's  buffer.

--
Excellent - if someone passes a Unicode object, it can be
auto-converted to a string.  This will allow "open()" to accept
Unicode strings.

However, there doesn't appear to be a reverse.  Eg, if my extension
module interfaces to a library that uses Unicode natively, how can I
get a Unicode object when the user passes a string?  If I had to
explicitly check for a string, then check for a Unicode on failure it
would get messy pretty quickly...  Is it not possible to have "U" also
do a conversion?

Mark.



From tim_one@email.msn.com  Wed Nov 10 05:57:14 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 10 Nov 1999 00:57:14 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091726.MAA21754@eric.cnri.reston.va.us>
Message-ID: <000001bf2b40$70183840$d82d153f@tim>

[Guido, on "a new Unicode regex engine by /F"]

> It's from scratch, and I believe it's got Perl style, not POSIX style
> semantics -- per Tim Peters' recommendations.  Do we need to open the
> discussion again?

No, but I get to whine just a little :  I didn't recommend either
approach.  I asked many futile questions about HP's requirements, and
sketched implications either way.  If HP *has* a requirement wrt
POSIX-vs-Perl, it would be good to find that out before it's too late.

I personally prefer POSIX semantics -- but, as Andrew so eloquently said,
worse is better here; all else being equal it's best to follow JPython's
Perl-compatible re lead.

last-time-i-ever-say-what-i-really-think-ly y'rs  - tim




From tim_one@email.msn.com  Wed Nov 10 06:25:07 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 10 Nov 1999 01:25:07 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <000201bf2b44$55b8ad00$d82d153f@tim>

> Marc-Andre Lemburg has a proposal for work that I'm asking him to do
> (under pressure from HP who want Python i18n badly and are willing to
> pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt

I can't make time for a close review now.  Just one thing that hit my eye
early:

    Python should provide a built-in constructor for Unicode strings
    which is available through __builtins__:

    u = unicode([,=
                                         ])

    u = u''

Two points on the Unicode literals (u'abc'):

UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by
hand -- it breaks apart and rearranges bytes at the bit level, and
everything other than 7-bit ASCII requires solid strings of "high-bit"
characters.  This is painful for people to enter manually on both counts --
and no common reference gives the UTF-8 encoding of glyphs directly.  So, as
discussed earlier, we should follow Java's lead and also introduce a \u
escape sequence:

    octet:           hexdigit hexdigit
    unicodecode:     octet octet
    unicode_escape:  "\\u" unicodecode

Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
Unicode character at the unicodecode code position.  For consistency, then,
it should probably expand the same way inside "regular strings" too.  Unlike
Java does, I'd rather not give it a meaning outside string literals.

The other point is a nit:  The vast bulk of UTF-8 encodings encode
characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
those must either be explicitly outlawed, or explicitly defined.  I vote for
outlawed, in the sense of detected error that raises an exception.  That
leaves our future options open.

BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
inverse in the Unicode world?  Both seem essential.

international-in-spite-of-himself-ly y'rs  - tim




From fredrik@pythonware.com  Wed Nov 10 08:08:06 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 09:08:06 +0100
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> http://starship.skyport.net/~lemburg/unicode-proposal.txt

Marc-Andre writes:

    The internal format for Unicode objects should either use a Python
    specific fixed cross-platform format  (e.g. 2-byte
    little endian byte order) or a compiler provided wchar_t format (if
    available). Using the wchar_t format will ease embedding of Python in
    other Unicode aware applications, but will also make internal format
    dumps platform dependent. 

having been there and done that, I strongly suggest
a third option: a 16-bit unsigned integer, in platform
specific byte order (PY_UNICODE_T).  along all other
roads lie code bloat and speed penalties...

(besides, this is exactly how it's already done in
unicode.c and what 'sre' prefers...)





From andy@robanal.demon.co.uk  Wed Nov 10 08:09:26 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 00:09:26 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>

In general, I like this proposal a lot, but I think it
only covers half the story.  How we actually build the
encoder/decoder for each encoding is a very big issue.
 Thoughts on this below.

First, a little nit
>  u = u''
I don't like using funny prime characters - why not an
explicit function like "utf8()"


On to the important stuff:

>  unicodec.register(,,
>  [,, ])

> This registers the codecs under the given encoding
> name in the module global dictionary 
> unicodec.codecs. Stream codecs are optional: 
> the unicodec module will provide appropriate
> wrappers around  and
>  if not given.

I would MUCH prefer a single 'Encoding' class or type
to wrap up these things, rather than up to four
disconnected objects/functions.  Essentially it would
be an interface standard and would offer methods to do
the four things above.  

There are several reasons for this.  
(1) there are quite a lot of things you might want to
do with an encoding object, and we could extend the
interface in future easily.  As a minimum, give it the
four methods implied by the above, two of which can be
defaults.  But I'd like an encoding to be able to tell
me the set of characters to which it applies; validate
a string; and maybe tell me if it is a subset or
superset of another.

(2) especially with double-byte encodings, they will
need to load up some kind of database on startup and
use this for both encoding and decoding - much better
to share it and encapsulate it inside one object

(3) for some languages, there are extra functions
wanted.  For Japanese, you need two or three functions
to expand half-width to full-width katakana, convert
double-byte english to single-byte and vice versa.  A
Japanese encoding object would be a handy place to put
this knowledge.

(4) In the real world you get many encodings which are
subtle variations of the same thing, plus or minus a
few characters.  One bit of code might be able to
share the work of several encodings, by setting a few
flags.  Certainly true of Japanese.

(5) encoding/decoding algorithms can be program or
data or (very often) a bit of both.  We have not yet
discussed where to keep all the mapping tables, but if
data is involved it should be hidden in an object.

(6) See my comments on a state machine for doing the
encodings.  If this is done well, we might need two
different standard objects which conform to the
Encoding interface (a really light one for single-byte
encodings, and a bigger one for multi-byte), and
everything else could be data driven.  

(7) Easy to grow - encodings can be prototyped and
proven in Python, ported to C if needed or when ready.
 

In summary, firm up the concept of an Encoding object
and give it room to grow - that's the key to
real-world usefulness.   If people feel the same way
I'll have a go at an interface for that, and try to show
how it would have simplified specific problems I have
faced.
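
To make that concrete, here is roughly the shape I
have in mind - nothing more than a sketch, and every
name below is up for grabs:

class Encoding:
    "Sketch of an Encoding interface; all names illustrative only."

    name = 'undefined'

    def encode(self, u):
        "Unicode in, byte string in this encoding out."
        raise NotImplementedError

    def decode(self, s):
        "Byte string in this encoding in, Unicode out."
        raise NotImplementedError

    def validate(self, s):
        "Is s a legal byte sequence for this encoding?"
        raise NotImplementedError

    def characters(self):
        "Return the characters this encoding can represent."
        raise NotImplementedError

    def is_subset_of(self, other):
        "Can everything in this encoding be represented in 'other'?"
        for c in self.characters():
            if c not in other.characters():
                return 0
        return 1

A Japanese subclass would be the natural home for the
extra half-width/full-width conversion functions
mentioned in (3).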

We also need to think about where encoding info will
live.  You cannot avoid mapping tables, although you
can hide them inside code modules or pickled objects
if you want.  Should there be a standard 
"..\Python\Enc" directory?

And we're going to need some kind of testing and
certification procedure when adding new encodings. 
This stuff has to be right.  

Guido asked about TypedString.  This can probably be
done on top of the built-in stuff - it is just a
convenience which would clarify intent, reduce lines
of code and prevent people shooting themselves in the
foot when juggling a lot of strings in different
(non-Unicode) encodings.  I can do a Python module to
implement that on top of whatever is built.
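
Very roughly, and purely as a sketch (the conversion
underneath is stubbed out until the real machinery
exists):

class EncodingError(Exception):
    pass

def _convert(data, from_enc, to_enc):
    # Placeholder for the real conversion, which would go via Unicode.
    if from_enc != to_enc:
        raise EncodingError('no converter from %s to %s yet' % (from_enc, to_enc))
    return data

class TypedString:
    "A byte string that remembers which encoding it is in."

    def __init__(self, data, encoding):
        if hasattr(data, 'encoding'):    # building from another TypedString
            data = _convert(data.data, data.encoding, encoding)
        self.data = data                 # raw bytes
        self.encoding = encoding         # e.g. 'ShiftJIS', 'cp5035'

    def __add__(self, other):
        # refuse to mix encodings silently
        if not hasattr(other, 'encoding') or other.encoding != self.encoding:
            raise EncodingError('cannot concatenate differently encoded strings')
        return TypedString(self.data + other.data, self.encoding)

    def cast(self, encoding):
        # explicit relabelling, no conversion - 'I know what I am doing'
        return TypedString(self.data, encoding)

The ts1 + ts2 example from my earlier mail would then
raise EncodingError, and ts1.cast('ShiftJIS') just
relabels the bytes without touching them.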


Regards,

Andy








=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From fredrik@pythonware.com  Wed Nov 10 08:14:21 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 09:14:21 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000201bf2b44$55b8ad00$d82d153f@tim>
Message-ID: <00f501bf2b53$9872e610$f29b12c2@secret.pythonware.com>

Tim Peters wrote:
> UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by
> hand -- it breaks apart and rearranges bytes at the bit level, and
> everything other than 7-bit ASCII requires solid strings of "high-bit"
> characters.

unless you're using a UTF-8 aware editor, of course ;-)

(some days, I think we need some way to tell the compiler
what encoding we're using for the source file...)

> This is painful for people to enter manually on both counts --
> and no common reference gives the UTF-8 encoding of glyphs
> directly.  So, as discussed earlier, we should follow Java's lead
> and also introduce a \u escape sequence:
> 
>     octet:           hexdigit hexdigit
>     unicodecode:     octet octet
>     unicode_escape:  "\\u" unicodecode
> 
> Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
> Unicode character at the unicodecode code position.  For consistency, then,
> it should probably expand the same way inside "regular strings" too.  Unlike
> Java does, I'd rather not give it a meaning outside string literals.

good idea.  and for some reason, patches for this are included
in the unicode distribution (see the attached str2utf.c).

> The other point is a nit:  The vast bulk of UTF-8 encodings encode
> characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
> those must either be explicitly outlawed, or explicitly defined.  I vote for
> outlawed, in the sense of detected error that raises an exception.  That
> leaves our future options open.

I vote for 'outlaw'.




/* A small code snippet that translates \uxxxx syntax to UTF-8 text.
   To be cut and pasted into Python/compile.c */

/* Written by Fredrik Lundh, January 1999. */

/* Documentation (for the language reference):

\uxxxx -- Unicode character with hexadecimal value xxxx.  The
character is stored using UTF-8 encoding, which means that this
sequence can result in up to three encoded characters.

Note that the 'u' must be followed by four hexadecimal digits.  If
fewer digits are given, the sequence is left in the resulting string
exactly as given.  If more digits are given, only the first four are
translated to Unicode, and the remaining digits are left in the
resulting string.

*/

#include <ctype.h>
#include <stdio.h>

#define Py_CHARMASK(ch) ch

void
convert(const char *s, char *p)
{
    while (*s) {
        if (*s != '\\') {
            *p++ = *s++;
            continue;
        }
        s++;
        switch (*s++) {

/* -------------------------------------------------------------------- */
/* copy this section to the appropriate place in compile.c... */

        case 'u':
            /* \uxxxx => UTF-8 encoded unicode character */
            if (isxdigit(Py_CHARMASK(s[0])) && isxdigit(Py_CHARMASK(s[1])) &&
                isxdigit(Py_CHARMASK(s[2])) && isxdigit(Py_CHARMASK(s[3]))) {
                /* fetch hexadecimal character value */
                unsigned int n, ch = 0;
                for (n = 0; n < 4; n++) {
                    int c = Py_CHARMASK(*s);
                    s++;
                    ch = (ch << 4) & ~0xF;
                    if (isdigit(c))
                        ch += c - '0';
                    else if (islower(c))
                        ch += 10 + c - 'a';
                    else
                        ch += 10 + c - 'A';
                }
                /* store as UTF-8 */
                if (ch < 0x80)
                    *p++ = (char) ch;
                else {
                    if (ch < 0x800) {
                        *p++ = 0xc0 | (ch >> 6);
                        *p++ = 0x80 | (ch & 0x3f);
                    } else {
                        *p++ = 0xe0 | (ch >> 12);
                        *p++ = 0x80 | ((ch >> 6) & 0x3f);
                        *p++ = 0x80 | (ch & 0x3f);
                    }
                }
                break;
            } else
                goto bogus;

/* -------------------------------------------------------------------- */

        default:

bogus:      *p++ = '\\';
            *p++ = s[-1];
            break;
        }
    }
    *p++ = '\0';
}

int
main()
{
    int i;
    unsigned char buffer[100];

    convert("Link\\u00f6ping", (char *) buffer);

    for (i = 0; buffer[i]; i++)
        if (buffer[i] < 0x20 || buffer[i] >= 0x80)
            printf("\\%03o", buffer[i]);
        else
            printf("%c", buffer[i]);
}




From gstein@lyra.org  Thu Nov 11 09:18:52 1999
From: gstein@lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 01:18:52 -0800 (PST)
Subject: [Python-Dev] Re: Internal Format
In-Reply-To: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>
Message-ID: 

On Wed, 10 Nov 1999, Fredrik Lundh wrote:
> Marc-Andre writes:
> 
>     The internal format for Unicode objects should either use a Python
>     specific fixed cross-platform format  (e.g. 2-byte
>     little endian byte order) or a compiler provided wchar_t format (if
>     available). Using the wchar_t format will ease embedding of Python in
>     other Unicode aware applications, but will also make internal format
>     dumps platform dependent. 
> 
> having been there and done that, I strongly suggest
> a third option: a 16-bit unsigned integer, in platform
> specific byte order (PY_UNICODE_T).  along all other
> roads lie code bloat and speed penalties...

I agree 100% !!

wchar_t will introduce portability issues right on up into the Python
level. The byte-order introduces speed issues and OS interoperability
issues, yet solves no portability problems (Byte Order Marks should still
be present and used).

There are two "platforms" out there that use Unicode: Win32 and Java. They
both use UCS-2, AFAIK.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From fredrik@pythonware.com  Wed Nov 10 08:24:16 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 09:24:16 +0100
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> One specific question: in you discussion of typed strings, I'm not
> sure why you couldn't convert everything to Unicode and be done with
> it.  I have a feeling that the answer is somewhere in your case study
> -- maybe you can elaborate?

Marc-Andre writes:

    Unicode objects should have a pointer to a cached (read-only) char
    buffer  holding the object's value using the current
    .  This is needed for performance and internal
    parsing (see below) reasons. The buffer is filled when the first
    conversion request to the  is issued on the object.

keeping track of an external encoding is better left
for the application programmers -- I'm pretty sure that
different application builders will want to handle this
in radically different ways, depending on their environ-
ment, underlying user interface toolkit, etc.

besides, this is how Tcl would have done it.  Python's
not Tcl, and I think you need *very* good arguments
for moving in that direction.





From mal@lemburg.com  Wed Nov 10 09:04:39 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 10:04:39 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <001c01bf2b01$a58d5d50$0501a8c0@bobcat>
Message-ID: <38293527.3CF5C7B0@lemburg.com>

Mark Hammond wrote:
> 
> > I think his proposal will go a long way towards your toolkit.  I
> hope
> > to hear soon from anybody who disagrees with Marc-Andre's proposal,
> 
> No disagreement as such, but a small hole:
> 
> >From the proposal:
> 
> Internal Argument Parsing:
> --------------------------
> ...
> 's':    For Unicode objects: auto convert them to the 
>         and return a pointer to the object's  buffer.
> 
> --
> Excellent - if someone passes a Unicode object, it can be
> auto-converted to a string.  This will allow "open()" to accept
> Unicode strings.

Well almost... it depends on the current value of the default encoding.
If it's UTF8 and you only use normal ASCII characters the above is indeed
true, but UTF8 can go far beyond ASCII and have up to 3 bytes per
character (for UCS2, even more for UCS4). With the default encoding set
to other exotic encodings this is likely to fail though.
 
> However, there doesnt appear to be a reverse.  Eg, if my extension
> module interfaces to a library that uses Unicode natively, how can I
> get a Unicode object when the user passes a string?  If I had to
> explicitely check for a string, then check for a Unicode on failure it
> would get messy pretty quickly...  Is it not possible to have "U" also
> do a conversion?

"U" is meant to simplify checks for Unicode objects, much like "S".
It returns a reference to the object. Auto-conversions are not possible
due to this, because they would create new objects which don't get
properly garbage collected later on.

Another problem is that Unicode types differ between platforms
(MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit
wchar_t). Depending on the internal format of Unicode objects
this could mean calling different conversion APIs.

BTW, I'm still not too sure about the underlying internal format.
The problem here is that Unicode started out as 2-byte fixed length
representation (UCS2) but then shifted towards a 4-byte fixed length
representation known as UCS4. Since having 4 bytes per character
is a hard sell to customers, UTF16 was created to stuff the UCS4
code points (this is what character entities are called in Unicode)
into 2 bytes... with a variable length encoding.

Some platforms that started early into the Unicode business
such as the MS ones use UCS2 as wchar_t, while more recent
ones (e.g. the glibc2 on Linux) use UCS4 for wchar_t. I haven't
yet checked in what ways the two are compatible (I would suspect
the top bytes in UCS4 being 0 for UCS2 codes), but would like
to hear whether it wouldn't be a better idea to use UTF16
as internal format. The latter works in 2 bytes for most
characters and conversion to UCS2|4 should be fast. Still,
conversion to UCS2 could fail.

The downside of using UTF16: it is a variable length format,
so iterations over it will be slower than for UCS4.

Simply sticking to UCS2 is probably out of the question,
since Unicode 3.0 requires UCS4 and we are targetting
Unicode 3.0.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Wed Nov 10 09:49:01 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 10:49:01 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000201bf2b44$55b8ad00$d82d153f@tim>
Message-ID: <38293F8D.F60AE605@lemburg.com>

Tim Peters wrote:
> 
> > Marc-Andre Lemburg has a proposal for work that I'm asking him to do
> > (under pressure from HP who want Python i18n badly and are willing to
> > pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt
> 
> I can't make time for a close review now.  Just one thing that hit my eye
> early:
> 
>     Python should provide a built-in constructor for Unicode strings
>     which is available through __builtins__:
> 
>     u = unicode([,=
>                                          ])
> 
>     u = u''
> 
> Two points on the Unicode literals (u'abc'):
> 
> UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by
> hand -- it breaks apart and rearranges bytes at the bit level, and
> everything other than 7-bit ASCII requires solid strings of "high-bit"
> characters.  This is painful for people to enter manually on both counts --
> and no common reference gives the UTF-8 encoding of glyphs directly.  So, as
> discussed earlier, we should follow Java's lead and also introduce a \u
> escape sequence:
> 
>     octet:           hexdigit hexdigit
>     unicodecode:     octet octet
>     unicode_escape:  "\\u" unicodecode
> 
> Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
> Unicode character at the unicodecode code position.  For consistency, then,
> it should probably expand the same way inside "regular strings" too.  Unlike
> Java does, I'd rather not give it a meaning outside string literals.

It would conform better to use the Unicode ordinal (instead of
interpreting the number as UTF8 encoding), e.g. \u03C0 for Pi. The
codes are easy to look up in the standard's UnicodeData.txt file or the
Unicode book for that matter.
 
> The other point is a nit:  The vast bulk of UTF-8 encodings encode
> characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
> those must either be explicitly outlawed, or explicitly defined.  I vote for
> outlawed, in the sense of detected error that raises an exception.  That
> leaves our future options open.

See my other post for a discussion of UCS4 vs. UTF16 vs. UCS2.

Perhaps we could add a flag to Unicode objects stating whether the characters
can be treated as UCS4 limited to the lower 16 bits (UCS4 and UTF16 are
the same in most ranges).

This flag could then be used to choose optimized algorithms for scanning
the strings. Fredrik's implementation currently uses UCS2, BTW.

> BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
> inverse in the Unicode world?  Both seem essential.

Good points.

How about 

  uniord(u[:1]) --> Unicode ordinal number (32-bit)

  unichr(i) --> Unicode object for character i (provided it is 32-bit);
                ValueError otherwise

They are inverse of each other, but note that Unicode allows 
private encodings too, which will of course not necessarily make
it across platforms or even from one PC to the next (see Andy Robinson's
interesting case study).
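
For example (hypothetical, since neither function exists yet;
using the \u notation from Tim's proposal):

    >>> uniord(u'\u03C0')	# GREEK SMALL LETTER PI
    960
    >>> unichr(960) == u'\u03C0'
    1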

I've uploaded a new version of the proposal (0.3) to the URL:

	http://starship.skyport.net/~lemburg/unicode-proposal.txt

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From fredrik@pythonware.com  Wed Nov 10 10:50:05 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 11:50:05 +0100
Subject: regexp performance (Re: [Python-Dev] I18N Toolkit
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com><199911091646.LAA21467@eric.cnri.reston.va.us><14376.22527.323888.677816@amarok.cnri.reston.va.us><199911091726.MAA21754@eric.cnri.reston.va.us><14376.23671.250752.637144@amarok.cnri.reston.va.us><14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us>
Message-ID: <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com>

Andrew M. Kuchling  wrote:
> (Speeding up PCRE -- that's another question.  I'm often tempted to
> rewrite pcre_compile to generate an easier-to-analyse parse tree,
> instead of its current complicated-but-memory-parsimonious compiler,
> but I'm very reluctant to introduce a fork like that.)

any special pattern constructs that are in need of per-
formance improvements?  (compared to Perl, that is).

or maybe anyone has an extensive performance test
suite for perlish regular expressions?  (preferrably based
on how real people use regular expressions, not only on
things that are known to be slow if not optimized)





From gstein@lyra.org  Thu Nov 11 10:46:55 1999
From: gstein@lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 02:46:55 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <38293527.3CF5C7B0@lemburg.com>
Message-ID: 

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
>...
> Well almost... it depends on the current value of the default encoding.

Default encodings are kind of nasty when they can be altered. The same
problem occurred with import hooks. Only one can be present at a time.
This implies that modules, packages, subsystems, whatever, cannot set a
default encoding because something else might depend on it having a
different value. In the end, nobody uses the default encoding because it
is unreliable, so you end up with extra implementation/semantics that
aren't used/needed.

Have you ever noticed how Python modules, packages, tools, etc, never
define an import hook?

I'll bet nobody ever monkeys with the default encoding either...

I say axe it and say "UTF-8" is the fixed, default encoding. If you want
something else, then do that explicitly.

>...
> Another problem is that Unicode types differ between platforms
> (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit
> wchar_t). Depending on the internal format of Unicode objects
> this could mean calling different conversion APIs.

Exactly the reason to avoid wchar_t.

> BTW, I'm still not too sure about the underlying internal format.
> The problem here is that Unicode started out as 2-byte fixed length
> representation (UCS2) but then shifted towards a 4-byte fixed length
> reprensetation known as UCS4. Since having 4 bytes per character
> is hard sell to customers, UTF16 was created to stuff the UCS4
> code points (this is how character entities are called in Unicode)
> into 2 bytes... with a variable length encoding.

History is basically irrelevant. What is the situation today? What is in
use, and what are people planning for right now?

>...
> The downside of using UTF16: it is a variable length format,
> so iterations over it will be slower than for UCS4.

Bzzt. May as well go with UTF-8 as the internal format, much like Perl is
doing (as I recall).

Why go with a variable length format, when people seem to be doing fine
with UCS-2?

Like I said in the other mail note: two large platforms out there are
UCS-2 based. They seem to be doing quite well with that approach.

If people truly need UCS-4, then they can work with that on their own. One
of the major reasons for putting Unicode into Python is to
increase/simplify its ability to speak to the underlying platform. Hey!
Guess what? That generally means UCS2.

If we didn't need to speak to the OS with these Unicode values, then
people can work with the values entirely in Python,
PyUnicodeType-be-damned.

Are we digging a hole for ourselves? Maybe. But there are two other big
platforms that have the same hole to dig out of *IF* it ever comes to
that. I posit that it won't be necessary; that the people needing UCS-4
can do so entirely in Python.

Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and
vice-versa. But: it only does it from String to String -- you can't use
Unicode objects anywhere in there.

> Simply sticking to UCS2 is probably out of the question,
> since Unicode 3.0 requires UCS4 and we are targetting
> Unicode 3.0.

Oh? Who says?

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From fredrik@pythonware.com  Wed Nov 10 10:52:28 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 11:52:28 +0100
Subject: [Python-Dev] I18N Toolkit
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com><199911091646.LAA21467@eric.cnri.reston.va.us><14376.22527.323888.677816@amarok.cnri.reston.va.us><199911091726.MAA21754@eric.cnri.reston.va.us><14376.23671.250752.637144@amarok.cnri.reston.va.us><14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us>
Message-ID: <029c01bf2b69$af0da250$f29b12c2@secret.pythonware.com>

(a copy was sent to comp.lang.python by mistake;
sorry for that).

Andrew M. Kuchling  wrote:
> I don't think that will be a problem, given that the Unicode engine
> would be a separate C implementation.  A bit of 'if type(strg) ==
> UnicodeType' in re.py isn't going to cost very much speed.

a slightly hairer design issue is what combinations
of pattern and string the new 're' will handle.

the first two are obvious:
 
     ordinary pattern, ordinary string
     unicode pattern, unicode string
 
 but what about these?
 
     ordinary pattern, unicode string
     unicode pattern, ordinary string
 
 "coercing" patterns (i.e. recompiling, on demand)
 seem to be a somewhat risky business ;-)
 
 



From gstein@lyra.org  Thu Nov 11 10:50:56 1999
From: gstein@lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 02:50:56 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <38293F8D.F60AE605@lemburg.com>
Message-ID: 

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
> Tim Peters wrote:
> > BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
> > inverse in the Unicode world?  Both seem essential.
> 
> Good points.
> 
> How about 
> 
>   uniord(u[:1]) --> Unicode ordinal number (32-bit)
> 
>   unichr(i) --> Unicode object for character i (provided it is 32-bit);
>                 ValueError otherwise

Why new functions? Why not extend the definition of ord() and chr()?

In terms of backwards compatibility, the only issue could possibly be that
people relied on chr(x) to throw an error when x>=256. They certainly
couldn't pass a Unicode object to ord(), so that function can safely be
extended to accept a Unicode object and return a larger integer.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From jcw@equi4.com  Wed Nov 10 11:14:17 1999
From: jcw@equi4.com (Jean-Claude Wippler)
Date: Wed, 10 Nov 1999 12:14:17 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <38295389.397DDE5E@equi4.com>

Greg Stein wrote:
[MAL:]
> > The downside of using UTF16: it is a variable length format,
> > so iterations over it will be slower than for UCS4.
> 
> Bzzt. May as well go with UTF-8 as the internal format, much like Perl
> is doing (as I recall).

Ehm, pardon me for asking - what is the brief rationale for selecting
UCS2/4, or whatever it ends up being, over UTF8?

I couldn't find a discussion in the last months of the string SIG, was
this decided upon and frozen long ago?

I'm not trying to re-open a can of worms, just to understand.

-- Jean-Claude


From gstein@lyra.org  Thu Nov 11 11:17:56 1999
From: gstein@lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 03:17:56 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <38295389.397DDE5E@equi4.com>
Message-ID: 

On Wed, 10 Nov 1999, Jean-Claude Wippler wrote:
> Greg Stein wrote:
> > Bzzt. May as well go with UTF-8 as the internal format, much like Perl
> > is doing (as I recall).
> 
> Ehm, pardon me for asking - what is the brief rationale for selecting
> UCS2/4, or whetever it ends up being, over UTF8?
> 
> I couldn't find a discussion in the last months of the string SIG, was
> this decided upon and frozen long ago?

Try sometime last year :-) ... something like July thru September as I
recall.

Things will be a lot faster if we have a fixed-size character. Variable
length formats like UTF-8 are a lot harder to slice, search, etc. Also,
(IMO) a big reason for this new type is for interaction with the
underlying OS/platform. I don't know of any platforms right now that
really use UTF-8 as their Unicode string representation (meaning we'd
have to convert back/forth from our UTF-8 representation to talk to the
OS).

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From mal@lemburg.com  Wed Nov 10 09:55:42 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 10:55:42 +0100
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us> <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>
Message-ID: <3829411E.FD32F8CC@lemburg.com>

Fredrik Lundh wrote:
> 
> Guido van Rossum  wrote:
> > One specific question: in you discussion of typed strings, I'm not
> > sure why you couldn't convert everything to Unicode and be done with
> > it.  I have a feeling that the answer is somewhere in your case study
> > -- maybe you can elaborate?
> 
> Marc-Andre writes:
> 
>     Unicode objects should have a pointer to a cached (read-only) char
>     buffer  holding the object's value using the current
>     .  This is needed for performance and internal
>     parsing (see below) reasons. The buffer is filled when the first
>     conversion request to the  is issued on the object.
> 
> keeping track of an external encoding is better left
> for the application programmers -- I'm pretty sure that
> different application builders will want to handle this
> in radically different ways, depending on their environ-
> ment, underlying user interface toolkit, etc.

It's not that hard to implement. All you have to do is check
whether the current encoding of the cached buffer is still the same
as the thread's view of the default encoding. The cached
buffer is needed to implement "s" et al. argument parsing
anyways.
 
> besides, this is how Tcl would have done it.  Python's
> not Tcl, and I think you need *very* good arguments
> for moving in that direction.
> 
> 
> 
> _______________________________________________
> Python-Dev maillist  -  Python-Dev@python.org
> http://www.python.org/mailman/listinfo/python-dev

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Wed Nov 10 11:42:00 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 12:42:00 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
Message-ID: <38295A08.D3928401@lemburg.com>

Andy Robinson wrote:
> 
> In general, I like this proposal a lot, but I think it
> only covers half the story.  How we actually build the
> encoder/decoder for each encoding is a very big issue.
>  Thoughts on this below.
> 
> First, a little nit
> >  u = u''
> I don't like using funny prime characters - why not an
> explicit function like "utf8()"

u = unicode('...I am UTF8...','utf-8')

will do just that. I've moved to Tim's proposal with the
\uXXXX encoding for u'', BTW.
 
> On to the important stuff:>
> >  unicodec.register(,,
> >  [,, ])
> 
> > This registers the codecs under the given encoding
> > name in the module global dictionary
> > unicodec.codecs. Stream codecs are optional:
> > the unicodec module will provide appropriate
> > wrappers around  and
> >  if not given.
> 
> I would MUCH prefer a single 'Encoding' class or type
> to wrap up these things, rather than up to four
> disconnected objects/functions.  Essentially it would
> be an interface standard and would offer methods to do
> the four things above.
> 
> There are several reasons for this.
>
> ...
>
> In summary, firm up the concept of an Encoding object
> and give it room to grow - that's the key to
> real-world usefulness.   If people feel the same way
> I'll have a go at an interface for that, and try show
> how it would have simplified specific problems I have
> faced.

Ok, you have a point there.

Here's a proposal (note that this only defines an interface,
not a class structure):

Codec Interface Definition:
---------------------------

The following base class should be defined in the module unicodec.

class Codec:

    def encode(self,u):
	
	""" Return the Unicode object u encoded as Python string.

	"""
	...

    def decode(self,s):

	""" Return an equivalent Unicode object for the encoded Python
	    string s.

	""" 
	...
	
    def dump(self,u,stream,slice=None):

	""" Writes the Unicode object's contents encoded to the stream.

	    stream must be a file-like object open for writing binary
	    data.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is written.

        """
	... the base class should provide a default implementation
	    of this method using self.encode ...
	
    def load(self,stream,length=None):

	""" Reads an encoded string (up to  bytes) from the
	    stream and returns an equivalent Unicode object.

	    stream must be a file-like object open for reading binary
	    data.

	    If length is given, only length bytes are read. Note that
	    this can cause the decoding algorithm to fail due to
	    truncations in the encoding.

        """
	... the base class should provide a default implementation
	    of this method using self.decode ...

Codecs should raise an UnicodeError in case the conversion is
not possible.

It is not required by the unicodec.register() API to provide a
subclass of this base class, only the 4 given methods must be present.
This allows writing Codecs as extension types.
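
A trivial codec written in Python might look like this -- just to
illustrate the shape; it assumes that the Unicode object supports
len(), indexing and ord(), and that unichr() exists, none of which
is implemented yet (UnicodeError is the exception named above):

import string

class AsciiCodec(Codec):

    def encode(self, u):
        """ Return the Unicode object u as a plain ASCII string.
        """
        chars = []
        for i in range(len(u)):
            n = ord(u[i])
            if n >= 128:
                raise UnicodeError('character not representable in ASCII')
            chars.append(chr(n))
        return string.join(chars, '')

    def decode(self, s):
        """ Return a Unicode object for the plain ASCII string s.
        """
        chars = []
        for c in s:
            if ord(c) >= 128:
                raise UnicodeError('not an ASCII string')
            chars.append(unichr(ord(c)))
        return string.join(chars, '')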

XXX Still to be discussed: 

    · support for line breaks (see
      http://www.unicode.org/unicode/reports/tr13/ )

    · support for case conversion: 

      Problems: string lengths can change due to multiple
      characters being mapped to a single new one, capital letters
      starting a word can be different than ones occurring in the
      middle, there are locale dependent deviations from the standard
      mappings.

    · support for numbers, digits, whitespace, etc.

    · support (or no support) for private code point areas


> We also need to think about where encoding info will
> live.  You cannot avoid mapping tables, although you
> can hide them inside code modules or pickled objects
> if you want.  Should there be a standard
> "..\Python\Enc" directory?

Mapping tables should be incorporated into the codec
modules preferably as static C data. That way multiple
processes can share the same data.

> And we're going to need some kind of testing and
> certification procedure when adding new encodings.
> This stuff has to be right.

I will have to rely on your cooperation for the test data.
Roundtrip testing is easy to implement, but I will also have
to verify the output against prechecked data which is probably only
creatable using visual tools to which I don't have access
(e.g. a Japanese Windows installation).
 
> Guido asked about TypedString.  This can probably be
> done on top of the built-in stuff - it is just a
> convenience which would clarify intent, reduce lines
> of code and prevent people shooting themselves in the
> foot when juggling a lot of strings in different
> (non-Unicode) encodings.  I can do a Python module to
> implement that on top of whatever is built.

Ok.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mal@lemburg.com  Wed Nov 10 10:03:36 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 11:03:36 +0100
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>
Message-ID: <382942F8.1921158E@lemburg.com>

Fredrik Lundh wrote:
> 
> Guido van Rossum  wrote:
> > http://starship.skyport.net/~lemburg/unicode-proposal.txt
> 
> Marc-Andre writes:
> 
>     The internal format for Unicode objects should either use a Python
>     specific fixed cross-platform format  (e.g. 2-byte
>     little endian byte order) or a compiler provided wchar_t format (if
>     available). Using the wchar_t format will ease embedding of Python in
>     other Unicode aware applications, but will also make internal format
>     dumps platform dependent.
> 
> having been there and done that, I strongly suggest
> a third option: a 16-bit unsigned integer, in platform
> specific byte order (PY_UNICODE_T).  along all other
> roads lie code bloat and speed penalties...
>
> (besides, this is exactly how it's already done in
> unicode.c and what 'sre' prefers...)

Ok, byte order can cause a speed penalty, so it might be
worthwhile introducing sys.bom (or sys.endianness) for this
reason and sticking to 16-bit integers as you have already done
in unicode.h.

What I don't like is using wchar_t if available (and then addressing
it as if it were defined as unsigned integer). IMO, it's better
to define a Python Unicode representation which then gets converted
to whatever wchar_t represents on the target machine.

Another issue is whether to use UCS2 (as you have done) or UTF16
(which is what Unicode 3.0 requires)... see my other post
for a discussion.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fredrik@pythonware.com  Wed Nov 10 12:32:16 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 13:32:16 +0100
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> <382942F8.1921158E@lemburg.com>
Message-ID: <038501bf2b77$a06046f0$f29b12c2@secret.pythonware.com>

> What I don't like is using wchar_t if available (and then addressing
> it as if it were defined as unsigned integer). IMO, it's better
> to define a Python Unicode representation which then gets converted
> to whatever wchar_t represents on the target machine.

you should read the unicode.h file a bit more carefully:

...

/* Unicode declarations. Tweak these to match your platform */

/* set this flag if the platform has "wchar.h", "wctype.h" and the
   wchar_t type is a 16-bit unsigned type */
#define HAVE_USABLE_WCHAR_H

#if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H)

    (this uses wchar_t, and also iswspace and friends)

...

#else

/* Use if you have a standard ANSI compiler, without wchar_t support.
   If a short is not 16 bits on your platform, you have to fix the
   typedef below, or the module initialization code will complain. */

    (this maps iswspace to isspace, for 8-bit characters).

#endif

...

the plan was to use the second solution (using "configure"
to figure out what integer type to use), and its own uni-
code database table for the is/to primitives

(iirc, the unicode.txt file discussed this, but that one
seems to be missing from the zip archive).





From fredrik@pythonware.com  Wed Nov 10 12:39:56 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 13:39:56 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <039c01bf2b78$b234d520$f29b12c2@secret.pythonware.com>

Greg Stein  wrote:
> Have you ever noticed how Python modules, packages, tools, etc, never
> define an import hook?

hey, didn't MAL use one in one of his mx kits? ;-)

> I say axe it and say "UTF-8" is the fixed, default encoding. If you want
> something else, then do that explicitly.

exactly.

modes are evil.  python is not perl.  etc.

> Are we digging a hole for ourselves? Maybe. But there are two other big
> platforms that have the same hole to dig out of *IF* it ever comes to
> that. I posit that it won't be necessary; that the people needing UCS-4
> can do so entirely in Python.

last time I checked, there were no characters (even in the
ISO standard) outside the 16-bit range.  has that changed?





From mal@lemburg.com  Wed Nov 10 12:44:39 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 13:44:39 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <382968B7.ABFFD4C0@lemburg.com>

Greg Stein wrote:
> 
> On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
> > Tim Peters wrote:
> > > BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
> > > inverse in the Unicode world?  Both seem essential.
> >
> > Good points.
> >
> > How about
> >
> >   uniord(u[:1]) --> Unicode ordinal number (32-bit)
> >
> >   unichr(i) --> Unicode object for character i (provided it is 32-bit);
> >                 ValueError otherwise
> 
> Why new functions? Why not extend the definition of ord() and chr()?
> 
> In terms of backwards compatibility, the only issue could possibly be that
> people relied on chr(x) to throw an error when x>=256. They certainly
> couldn't pass a Unicode object to ord(), so that function can safely be
> extended to accept a Unicode object and return a larger integer.

Because unichr() will always have to return Unicode objects. You don't
want chr(i) to return Unicode for i>255 and strings for i<256.

OTOH, ord() could probably be extended to also work on Unicode objects.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Wed Nov 10 13:08:30 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 14:08:30 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <38296E4E.914C0ED7@lemburg.com>

Greg Stein wrote:
> 
> On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
> >...
> > Well almost... it depends on the current value of the <default encoding>.
> 
> Default encodings are kind of nasty when they can be altered. The same
> problem occurred with import hooks. Only one can be present at a time.
> This implies that modules, packages, subsystems, whatever, cannot set a
> default encoding because something else might depend on it having a
> different value. In the end, nobody uses the default encoding because it
> is unreliable, so you end up with extra implementation/semantics that
> aren't used/needed.

I know, but this is a little different: you use strings a lot while
import hooks are rarely used directly by the user.

E.g. people in Europe will probably prefer Latin-1 as default
encoding while people in Asia will use one of the common CJK encodings.

The <default encoding> decides what encoding to use for many typical
tasks: printing, str(u), "s" argument parsing, etc.

Note that setting the <default encoding> is not intended to be
done prior to single operations. It is meant to be settable at
thread creation time.

> [...]
> 
> > BTW, I'm still not too sure about the underlying internal format.
> > The problem here is that Unicode started out as 2-byte fixed length
> > representation (UCS2) but then shifted towards a 4-byte fixed length
> > reprensetation known as UCS4. Since having 4 bytes per character
> > is hard sell to customers, UTF16 was created to stuff the UCS4
> > code points (this is how character entities are called in Unicode)
> > into 2 bytes... with a variable length encoding.
> 
> History is basically irrelevant. What is the situation today? What is in
> use, and what are people planning for right now?
> 
> >...
> > The downside of using UTF16: it is a variable length format,
> > so iterations over it will be slower than for UCS4.
> 
> Bzzt. May as well go with UTF-8 as the internal format, much like Perl is
> doing (as I recall).
> 
> Why go with a variable length format, when people seem to be doing fine
> with UCS-2?

The reason for UTF-16 is simply that it is identical to UCS-2
over large ranges, which makes optimizations (e.g. the UCS2 flag
I mentioned in an earlier post) feasible and effective. UTF-8
slows things down for CJK encodings, since the APIs will very often
have to scan the string to find the correct logical position in
the data.
 
Here's a quote from the Unicode FAQ (http://www.unicode.org/unicode/faq/ ):
"""
Q: How about using UCS-4 interfaces in my APIs?

Given an internal UTF-16 storage, you can, of course, still index into text
using UCS-4 indices. However, while converting from a UCS-4 index to a
UTF-16 index or vice versa is fairly straightforward, it does involve a
scan through the 16-bit units up to the index point. In a test run, for
example, accessing UTF-16 storage as UCS-4 characters results in a
10X degradation. Of course, the precise differences will depend on the
compiler, and there are some interesting optimizations that can be
performed, but it will always be slower on average. This kind of
performance hit is unacceptable in many environments.

Most Unicode APIs are using UTF-16. The low-level character indexing
are at the common storage level, with higher-level mechanisms for
graphemes or words specifying their boundaries in terms of the storage
units. This provides efficiency at the low levels, and the required
functionality at the high levels.

Convenience APIs can be produced that take parameters in UCS-4
methods for common utilities: e.g. converting UCS-4 indices back and
forth, accessing character properties, etc. Outside of indexing, differences
between UCS-4 and UTF-16 are not as important. For most other APIs
outside of indexing, characters values cannot really be considered
outside of their context--not when you are writing internationalized code.
For such operations as display, input, collation, editing, and even upper
and lowercasing, characters need to be considered in the context of a
string. That means that in any event you end up looking at more than one
character. In our experience, the incremental cost of doing surrogates is
pretty small.
"""

> Like I said in the other mail note: two large platforms out there are
> UCS-2 based. They seem to be doing quite well with that approach.
> 
> If people truly need UCS-4, then they can work with that on their own. One
> of the major reasons for putting Unicode into Python is to
> increase/simplify its ability to speak to the underlying platform. Hey!
> Guess what? That generally means UCS2.

All those formats are upward compatible (within certain ranges) and
the Python Unicode API will provide converters between its internal
format and the few common Unicode implementations, e.g. for MS
compilers (16-bit UCS2 AFAIK), GLIBC (32-bit UCS4).
 
> If we didn't need to speak to the OS with these Unicode values, then
> people can work with the values entirely in Python,
> PyUnicodeType-be-damned.
> 
> Are we digging a hole for ourselves? Maybe. But there are two other big
> platforms that have the same hole to dig out of *IF* it ever comes to
> that. I posit that it won't be necessary; that the people needing UCS-4
> can do so entirely in Python.
> 
> Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and
> vice-versa. But: it only does it from String to String -- you can't use
> Unicode objects anywhere in there.

See above.
 
> > Simply sticking to UCS2 is probably out of the question,
> > since Unicode 3.0 requires UCS4 and we are targetting
> > Unicode 3.0.
> 
> Oh? Who says?

From the FAQ:
"""
Q: What is UTF-16?

Unicode was originally designed as a pure 16-bit encoding, aimed at
representing all modern scripts. (Ancient scripts were to be represented
with private-use characters.) Over time, and especially after the addition
of over 14,500 composite characters for compatibility with legacy sets, it
became clear that 16-bits were not sufficient for the user community. Out
of this arose UTF-16.
"""

Note that there currently are no defined surrogate pairs for
UTF-16, meaning that in practice the difference between UCS-2 and
UTF-16 is probably negligible, e.g. we could define the internal
format to be UTF-16 and raise an exception whenever the boundary between
UTF-16 and UCS-2 is crossed -- sort of as a political compromise ;-).

But... I think HP has the last word on this one.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Wed Nov 10 12:36:44 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 13:36:44 +0100
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> <382942F8.1921158E@lemburg.com> <038501bf2b77$a06046f0$f29b12c2@secret.pythonware.com>
Message-ID: <382966DC.F33E340E@lemburg.com>

Fredrik Lundh wrote:
> 
> > What I don't like is using wchar_t if available (and then addressing
> > it as if it were defined as unsigned integer). IMO, it's better
> > to define a Python Unicode representation which then gets converted
> > to whatever wchar_t represents on the target machine.
> 
> you should read the unicode.h file a bit more carefully:
> 
> ...
> 
> /* Unicode declarations. Tweak these to match your platform */
> 
> /* set this flag if the platform has "wchar.h", "wctype.h" and the
>    wchar_t type is a 16-bit unsigned type */
> #define HAVE_USABLE_WCHAR_H
> 
> #if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H)
> 
>     (this uses wchar_t, and also iswspace and friends)
> 
> ...
> 
> #else
> 
> /* Use if you have a standard ANSI compiler, without wchar_t support.
>    If a short is not 16 bits on your platform, you have to fix the
>    typedef below, or the module initialization code will complain. */
> 
>     (this maps iswspace to isspace, for 8-bit characters).
> 
> #endif
> 
> ...
> 
> the plan was to use the second solution (using "configure"
> to figure out what integer type to use), and its own uni-
> code database table for the is/to primitives

Oh, I did read unicode.h, stumbled across the mixed usage
and decided not to like it ;-)

Seriously, I find the second solution where you use the 'unsigned
short' much more portable and straightforward. You never know what
the compiler does for isw*() and it's probably better sticking
to one format for all platforms. Only endianness gets in the way,
but that's easy to handle.

So I opt for 'unsigned short'. The encoding used in these 2 bytes
is a different question though. If HP insists on Unicode 3.0, there's
probably no other way than to use UTF-16.
 
> (iirc, the unicode.txt file discussed this, but that one
> seems to be missing from the zip archive).

It's not in the file I downloaded from your site. Could you post
it here ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Wed Nov 10 13:13:10 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 14:13:10 +0100
Subject: [Python-Dev] Internationalization Toolkit
References:  <38295389.397DDE5E@equi4.com>
Message-ID: <38296F66.5DF9263E@lemburg.com>

Jean-Claude Wippler wrote:
> 
> Greg Stein wrote:
> [MAL:]
> > > The downside of using UTF16: it is a variable length format,
> > > so iterations over it will be slower than for UCS4.
> >
> > Bzzt. May as well go with UTF-8 as the internal format, much like Perl
> > is doing (as I recall).
> 
> Ehm, pardon me for asking - what is the brief rationale for selecting
UCS2/4, or whatever it ends up being, over UTF8?

UCS-2 is the native format on major platforms (meaning straight
fixed length encoding using 2 bytes), ie. interfacing between
Python's Unicode object and the platform APIs will be simple and
fast.

UTF-8 is compact for ASCII users, but imposes a performance
hit on the CJK (Asian character sets) world, since UTF-8 uses a
*variable* length encoding.
 
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From akuchlin@mems-exchange.org  Wed Nov 10 14:56:16 1999
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Wed, 10 Nov 1999 09:56:16 -0500 (EST)
Subject: [Python-Dev] Re: regexp performance
In-Reply-To: <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
 <199911091646.LAA21467@eric.cnri.reston.va.us>
 <14376.22527.323888.677816@amarok.cnri.reston.va.us>
 <199911091726.MAA21754@eric.cnri.reston.va.us>
 <14376.23671.250752.637144@amarok.cnri.reston.va.us>
 <14376.25768.368164.88151@anthem.cnri.reston.va.us>
 <14376.30652.201552.116828@amarok.cnri.reston.va.us>
 <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com>
Message-ID: <14377.34704.639462.794509@amarok.cnri.reston.va.us>

[Cc'ed to the String-SIG; sheesh, what's the point of having SIGs
otherwise?]

Fredrik Lundh writes:
>any special pattern constructs that are in need of per-
>formance improvements?  (compared to Perl, that is).

In the 1.5 source tree, I think one major slowdown is coming from the
malloc'ed failure stack.  This was introduced in order to prevent an
expression like (x)* from filling the stack when applied to a string
containing 50,000 'x' characters (hence 50,000 recursive function
calls).  I'd like to get rid of this stack because it's slow and
requires much tedious patching of the upstream PCRE.
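
For illustration, the pathological case is simply something like

    import re
    # with a naively recursive engine this means one C-level call per 'x';
    # the malloc'ed failure stack was added to avoid blowing the C stack
    re.match(r'(x)*', 'x' * 50000)

which looks harmless enough but is deadly for a recursive matcher.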

>or maybe anyone has an extensive performance test
>suite for perlish regular expressions?  (preferably based
>on how real people use regular expressions, not only on
>things that are known to be slow if not optimized)

Friedl's book describes several optimizations which aren't implemented
in PCRE.  The problem is that PCRE never builds a parse tree, and
parse trees are easy to analyse recursively.  Instead, PCRE's
functions actually look at the compiled byte codes (for example, look
at find_firstchar or is_anchored in pypcre.c), but this makes analysis
functions hard to write, and rearranging the code near-impossible.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
I didn't say it was my fault. I said it was my responsibility. I know the
difference.
    -- Rose Walker, in SANDMAN #60: "The Kindly Ones:4"


From jack@oratrix.nl  Wed Nov 10 15:04:58 1999
From: jack@oratrix.nl (Jack Jansen)
Date: Wed, 10 Nov 1999 16:04:58 +0100
Subject: [Python-Dev] I18N Toolkit
In-Reply-To: Message by "Fredrik Lundh"  ,
 Wed, 10 Nov 1999 11:52:28 +0100 , <029c01bf2b69$af0da250$f29b12c2@secret.pythonware.com>
Message-ID: <19991110150458.B542735BB1E@snelboot.oratrix.nl>

> a slightly hairer design issue is what combinations
> of pattern and string the new 're' will handle.
> 
> the first two are obvious:
>  
>      ordinary pattern, ordinary string
>      unicode pattern, unicode string
>  
>  but what about these?
>  
>      ordinary pattern, unicode string
>      unicode pattern, ordinary string

I think the logical thing to do would be to "promote" the ordinary pattern or 
string to unicode, in a similar way to what happens if you combine ints and 
floats in a single expression.

The result may be a bit surprising if your pattern is in ascii and you've 
never been aware of unicode and are given such a string from somewhere else, 
but then if you're only aware of integer arithmetic and are suddenly presented 
with a couple of floats you'll also be pretty surprised at the result. At 
least it's easily explained.
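
In rough pseudo-code the promotion would be something like the following
(names and the Latin-1 choice are only for illustration; "bytes" stands
for the ordinary 8-bit string type):

    def promote(pattern, string):
        # promote whichever argument is an 8-bit string to Unicode, much
        # as int/float mixing promotes the int
        if isinstance(pattern, bytes):
            pattern = pattern.decode('latin-1')
        if isinstance(string, bytes):
            string = string.decode('latin-1')
        return pattern, string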
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 




From fdrake@acm.org  Wed Nov 10 15:22:17 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Wed, 10 Nov 1999 10:22:17 -0500 (EST)
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
 <199911091646.LAA21467@eric.cnri.reston.va.us>
 <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>
Message-ID: <14377.36265.315127.788319@weyr.cnri.reston.va.us>

Fredrik Lundh writes:
 > having been there and done that, I strongly suggest
 > a third option: a 16-bit unsigned integer, in platform
 > specific byte order (PY_UNICODE_T).  along all other

  I actually like this best, but I understand that there are reasons
for using wchar_t, especially for interfacing with other code that
uses Unicode.
  Perhaps someone who knows more about the specific issues with
interfacing using wchar_t can summarize them, or point me to whatever
I've already missed.  p-)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives


From skip@mojam.com (Skip Montanaro)  Wed Nov 10 15:54:30 1999
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Wed, 10 Nov 1999 09:54:30 -0600 (CST)
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
 <199911091646.LAA21467@eric.cnri.reston.va.us>
 <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>
Message-ID: <14377.38198.793496.870273@dolphin.mojam.com>

Just a couple observations from the peanut gallery...

1. I'm glad I don't have to do this Unicode/UTF/internationalization stuff.
   Seems like it would be easier to just get the whole world speaking
   Esperanto.

2. Are there plans for an internationalization session at IPC8?  Perhaps a
   few key players could be locked into a room for a couple days, to emerge
   bloodied, but with an implementation in-hand...

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...


From fdrake@acm.org  Wed Nov 10 15:58:30 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Wed, 10 Nov 1999 10:58:30 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <38295A08.D3928401@lemburg.com>
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
 <38295A08.D3928401@lemburg.com>
Message-ID: <14377.38438.615701.231437@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 >     def encode(self,u):
 > 	
 > 	""" Return the Unicode object u encoded as Python string.

  This should accept an optional slice parameter, and use it in the
same way as .dump().

 >     def dump(self,u,stream,slice=None):
...
 >     def load(self,stream,length=None):

  Why not have something like .wrapFile(f) that returns a file-like
object with all the file methods implemented, and doing to "right
thing" regarding encoding/decoding?  That way, the new file-like
object can be used directly with code that works with files and
doesn't care whether it uses 8-bit or unicode strings.
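
  Roughly (a sketch only, written in terms of the encode()/dump()/load()
codec methods quoted above -- not a proposed implementation):

    class UnicodeFileWrapper:
        # wraps an open binary-mode file; reads return Unicode objects and
        # writes accept them, with the codec doing all conversions
        def __init__(self, file, codec):
            self.file = file
            self.codec = codec
        def write(self, u):
            self.codec.dump(u, self.file)
        def read(self, length=None):
            return self.codec.load(self.file, length)
        def close(self):
            self.file.close()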

 > Codecs should raise an UnicodeError in case the conversion is
 > not possible.

  I think that should be ValueError, or UnicodeError should be a
subclass of ValueError.
  (Can the -X interpreter option be removed yet?)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives


From bwarsaw@cnri.reston.va.us (Barry A. Warsaw)  Wed Nov 10 16:41:29 1999
From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw)
Date: Wed, 10 Nov 1999 11:41:29 -0500 (EST)
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
 <199911091646.LAA21467@eric.cnri.reston.va.us>
 <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>
 <14377.38198.793496.870273@dolphin.mojam.com>
Message-ID: <14377.41017.413515.887236@anthem.cnri.reston.va.us>

>>>>> "SM" == Skip Montanaro  writes:

    SM> 2. Are there plans for an internationalization session at
    SM> IPC8?  Perhaps a few key players could be locked into a room
    SM> for a couple days, to emerge bloodied, but with an
    SM> implementation in-hand...

I'm starting to think about devday topics.  Sounds like an I18n
session would be very useful.  Champions?

-Barry


From mal@lemburg.com  Wed Nov 10 13:31:47 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 14:31:47 +0100
Subject: [Python-Dev] Internationalization Toolkit
References:  <039c01bf2b78$b234d520$f29b12c2@secret.pythonware.com>
Message-ID: <382973C3.DCA77051@lemburg.com>

Fredrik Lundh wrote:
> 
> Greg Stein  wrote:
> > Have you ever noticed how Python modules, packages, tools, etc, never
> > define an import hook?
> 
> hey, didn't MAL use one in one of his mx kits? ;-)

Not yet, but I will unless my last patch ("walk me up, Scotty" - import)
goes into the core interpreter.
 
> > I say axe it and say "UTF-8" is the fixed, default encoding. If you want
> > something else, then do that explicitly.
> 
> exactly.
> 
> modes are evil.  python is not perl.  etc.

But a requirement by the customer... they want to be able to set the locale
on a per thread basis. Not exactly my preference (I think all locale
settings should be passed as parameters, not via globals).
 
> > Are we digging a hole for ourselves? Maybe. But there are two other big
> > platforms that have the same hole to dig out of *IF* it ever comes to
> > that. I posit that it won't be necessary; that the people needing UCS-4
> > can do so entirely in Python.
> 
> last time I checked, there were no characters (even in the
> ISO standard) outside the 16-bit range.  has that changed?

No, but people are already thinking about it and there is
a defined range in the >16-bit area for private encodings
(F0000..FFFFD and 100000..10FFFD).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mhammond@skippinet.com.au  Wed Nov 10 21:36:04 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Thu, 11 Nov 1999 08:36:04 +1100
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382973C3.DCA77051@lemburg.com>
Message-ID: <005701bf2bc3$980f4d60$0501a8c0@bobcat>

Marc writes:

> > modes are evil.  python is not perl.  etc.
>
> But a requirement by the customer... they want to be able to
> set the locale
> on a per thread basis. Not exactly my preference (I think all locale
> settings should be passed as parameters, not via globals).

Sure - that is what this customer wants, but we need to be clear about
the "best thing" for Python generally versus what this particular
client wants.

For example, if we went with UTF-8 as the only default encoding, then
HP may be forced to use a helper function to perform the conversion,
rather than the built-in functions.  This helper function can use TLS
(in Python) to store the encoding.  At least it is localized.

I agree that having a default encoding that can be changed is a bad
idea.  It may make 3 line scripts that need to print something easier
to work with, but at the cost of reliability in large systems.  Kinda
like the existing "locale" support, which is thread specific, and is
well known to cause these sorts of problems.  The end result is that
in your app, you find _someone_ has changed the default encoding, and
some code no longer works.  So the solution is to change the default
encoding back, so _your_ code works again.  You just know that whoever
it was that changed the default encoding in the first place is now
going to break - but what else can you do?

Having a fixed, default encoding may make life slightly more difficult
when you want to work primarily in a different encoding, but at least
your system is predictable and reliable.

Mark.

>
> > > Are we digging a hole for ourselves? Maybe. But there are
> two other big
> > > platforms that have the same hole to dig out of *IF* it
> ever comes to
> > > that. I posit that it won't be necessary; that the people
> needing UCS-4
> > > can do so entirely in Python.
> >
> > last time I checked, there were no characters (even in the
> > ISO standard) outside the 16-bit range.  has that changed?
>
> No, but people are already thinking about it and there is
> a defined range in the >16-bit area for private encodings
> (F0000..FFFFD and 100000..10FFFD).
>
> --
> Marc-Andre Lemburg
>
______________________________________________________________________
> Y2000:                                                    51 days left
> Business:                                      http://www.lemburg.com/
> Python Pages:                           http://www.lemburg.com/python/
>
>
> _______________________________________________
> Python-Dev maillist  -  Python-Dev@python.org
> http://www.python.org/mailman/listinfo/python-dev
>



From gstein@lyra.org  Thu Nov 11 23:14:55 1999
From: gstein@lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 15:14:55 -0800 (PST)
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
In-Reply-To: <005701bf2bc3$980f4d60$0501a8c0@bobcat>
Message-ID: 

On Thu, 11 Nov 1999, Mark Hammond wrote:
> Marc writes:
> > > modes are evil.  python is not perl.  etc.
> >
> > But a requirement by the customer... they want to be able to
> > set the locale
> > on a per thread basis. Not exactly my preference (I think all locale
> > settings should be passed as parameters, not via globals).
> 
> Sure - that is what this customer wants, but we need to be clear about
> the "best thing" for Python generally versus what this particular
> client wants.

Ha! I was getting ready to say exactly the same thing. Are we building Python
for a particular customer, or are we building it to Do The Right Thing?

I've been getting increasingly annoyed at "well, HP says this" or "HP
wants that." I'm ecstatic that they are a Consortium member and are
helping to fund the development of Python. However, if that means we are
selling Python's soul to corporate wishes rather than programming and
design ideals... well, it reduces my enthusiasm :-)

>...
> I agree that having a default encoding that can be changed is a bad
> idea.  It may make 3 line scripts that need to print something easier
> to work with, but at the cost of reliability in large systems.  Kinda
> like the existing "locale" support, which is thread specific, and is
> well known to cause these sorts of problems.  The end result is that
> in your app, you find _someone_ has changed the default encoding, and
> some code no longer works.  So the solution is to change the default
> encoding back, so _your_ code works again.  You just know that whoever
> it was that changed the default encoding in the first place is now
> going to break - but what else can you do?

Yes! Yes! Example #2.

My first example (import hooks) was shrugged off by some as "well, nobody
uses those." Okay, maybe people don't use them (but I believe that is
*because* of this kind of problem).

In Mark's example, however... this is a definite problem. I ran into this
when I was building some code for Microsoft Site Server. IIS was setting a
different locale on my thread -- one that I definitely was not expecting.
All of a sudden, strlwr() no longer worked as I expected -- certain
characters didn't get lower-cased, so my dictionary lookups failed because
the keys were not all lower-cased.

Solution? Before passing control from C++ into Python, I set the locale to
the default locale. Restored it on the way back out. Extreme measures, and
costly to do, but it had to be done.

I think I'll pick up Fredrik's phrase here...

(chanting) "Modes Are Evil!"  "Modes Are Evil!"  "Down with Modes!"

:-)

> Having a fixed, default encoding may make life slightly more difficult
> when you want to work primarily in a different encoding, but at least
> your system is predictable and reliable.

*bing*

I'm with Mark on this one. Global modes and state are a serious pain when
it comes to developing a system.

Python is very amenable to utility functions and classes. Any "customer"
can use a utility function to manually do the encoding according to a
per-thread setting stashed in some module-global dictionary (map thread-id
to default-encoding). Done. Keep it out of the interpreter...
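
Something along these lines would do (a sketch only; the names are made
up, not a proposed API):

    import threading

    _encodings = {}                      # map thread id -> encoding name

    def set_default_encoding(name):
        _encodings[threading.get_ident()] = name

    def encode(u):
        # fall back to a fixed encoding if this thread never set one
        return u.encode(_encodings.get(threading.get_ident(), 'utf-8'))

A few lines of pure Python, and no interpreter magic needed.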

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From da@ski.org  Wed Nov 10 23:21:54 1999
From: da@ski.org (David Ascher)
Date: Wed, 10 Nov 1999 15:21:54 -0800 (Pacific Standard Time)
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
In-Reply-To: 
Message-ID: 

On Thu, 11 Nov 1999, Greg Stein wrote:

> Ha! I was getting ready to say exactly the same thing. Are we building Python
> for a particular customer, or are we building it to Do The Right Thing?
> 
> I've been getting increasingly annoyed at "well, HP says this" or "HP
> wants that." I'm ecstatic that they are a Consortium member and are
> helping to fund the development of Python. However, if that means we are
> selling Python's soul to corporate wishes rather than programming and
> design ideals... well, it reduces my enthusiasm :-)

What about just explaining the rationale for the default-less point of
view to whoever is in charge of this at HP and see why they came up with
their rationale in the first place?  They might have a good reason, or
they might be willing to change said requirement.

--david



From gstein@lyra.org  Thu Nov 11 23:31:43 1999
From: gstein@lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 15:31:43 -0800 (PST)
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
In-Reply-To: 
Message-ID: 

Damn, you're smooth... maybe you should have run for SF Mayor...

:-)

On Wed, 10 Nov 1999, David Ascher wrote:
> On Thu, 11 Nov 1999, Greg Stein wrote:
> 
> > Ha! I was getting ready to say exactly the same thing. Are we building Python
> > for a particular customer, or are we building it to Do The Right Thing?
> > 
> > I've been getting increasingly annoyed at "well, HP says this" or "HP
> > wants that." I'm ecstatic that they are a Consortium member and are
> > helping to fund the development of Python. However, if that means we are
> > selling Python's soul to corporate wishes rather than programming and
> > design ideals... well, it reduces my enthusiasm :-)
> 
> What about just explaining the rationale for the default-less point of
> view to whoever is in charge of this at HP and see why they came up with
> their rationale in the first place?  They might have a good reason, or
> they might be willing to change said requirement.
> 
> --david
> 

--
Greg Stein, http://www.lyra.org/



From tim_one@email.msn.com  Thu Nov 11 06:25:27 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 01:25:27 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <00f501bf2b53$9872e610$f29b12c2@secret.pythonware.com>
Message-ID: <000201bf2c0d$8b866160$262d153f@tim>

[/F, dripping with code]
> ...
> Note that the 'u' must be followed by four hexadecimal digits.  If
> fewer digits are given, the sequence is left in the resulting string
> exactly as given.

Yuck -- don't let probable error pass without comment.  "must be" == "must
be"!

[moving backwards]
> \uxxxx -- Unicode character with hexadecimal value xxxx.  The
> character is stored using UTF-8 encoding, which means that this
> sequence can result in up to three encoded characters.

The code is fine, but I've gotten confused about what the intent is now.
Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8
literals, but now he's got Unicode-escaped literals instead -- and you favor
an internal 2-byte-per-char Unicode storage format.  In that combination of
worlds, is there any use in the *language* (as opposed to in a runtime
module) for \uxxxx -> UTF-8 conversion?

And MAL, if you're listening, I'm not clear on what a Unicode-escaped
literal means.  When you had UTF-8 literals, the meaning of something like

    u"a\340\341"

was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals
were just a way of specifying a byte stream.  As a Unicode-escaped string, I
assume the "a" maps to the Unicode "a", but what of the rest?  Are the octal
escapes to be taken as two separate Latin-1 characters (in their role as a
Unicode subset), or as an especially clumsy way to specify a single 16-bit
Unicode character?  I'm afraid I'd vote for the former.  Same issue wrt \x
escapes.
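
Concretely, the two readings would give (illustration only):

    u"a\340\341"
    # former: u'a', u'\xe0', u'\xe1'  -- three chars, Latin-1 as a subset
    # latter: u'a', u'\ue0e1'         -- two chars, the octal escapes glued
    #                                    into one 16-bit code point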

One other issue:  are there "raw" Unicode strings too, as in ur"\u20ac"?
There probably should be; and while Guido will hate this, a ur string should
probably *not* leave \uxxxx escapes untouched.  Nasties like this are why
Java defines \uxxxx expansion as occurring in a preprocessing step.

BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or
isn't \uxxxx allowed in a non-Unicode string?  that's what I would do ...).




From tim_one@email.msn.com  Thu Nov 11 06:49:16 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 01:49:16 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: 
Message-ID: <000501bf2c10$df4679e0$262d153f@tim>

[ Greg Stein]
> ...
> Things will be a lot faster if we have a fixed-size character. Variable
> length formats like UTF-8 are a lot harder to slice, search, etc.

The initial byte of any UTF-8 encoded character never appears in a
*non*-initial position of any UTF-8 encoded character.  Which means
searching is not only tractable in UTF-8, but also that whatever optimized
8-bit clean string searching routines you happen to have sitting around
today can be used as-is on UTF-8 encoded strings.  This is not true of UCS-2
encoded strings (in which "the first" byte is not distinguished, so 8-bit
search is vulnerable to finding a hit starting "in the middle" of a
character).  More, to the extent that the bulk of your text is plain ASCII,
the UTF-8 search will run much faster than when using a 2-byte encoding,
simply because it has half as many bytes to chew over.
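
A tiny demonstration (illustration only, not a benchmark):

    needle = u'\u4e2d'                   # one CJK character
    hay = u'\u004e\u2d00'                # two characters, neither is needle

    # byte-level search in a 2-byte encoding: a false hit straddling a
    # character boundary
    hay.encode('utf-16-be').find(needle.encode('utf-16-be'))   # -> 1
    # the same byte-level search on UTF-8 cannot produce such a hit
    hay.encode('utf-8').find(needle.encode('utf-8'))           # -> -1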

UTF-8 is certainly slower for random-access indexing, including slicing.

I don't know what "etc" means, but if it follows the pattern so far,
sometimes it's faster and sometimes it's slower.

> (IMO) a big reason for this new type is for interaction with the
> underlying OS/platform. I don't know of any platforms right now that
> really use UTF-8 as their Unicode string representation (meaning we'd
> have to convert back/forth from our UTF-8 representation to talk to the
> OS).

No argument here.




From tim_one@email.msn.com  Thu Nov 11 06:56:35 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 01:56:35 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382968B7.ABFFD4C0@lemburg.com>
Message-ID: <000601bf2c11$e4b07920$262d153f@tim>

[MAL, on Unicode chr() and ord()]
> ...
> Because unichr() will always have to return Unicode objects. You don't
> want chr(i) to return Unicode for i>255 and strings for i<256.

Indeed I do not!

> OTOH, ord() could probably be extended to also work on Unicode objects.

I think it should be -- it's a good & natural use of polymorphism; introducing
a new function *here* would be as odd as introducing a unilen() function to
get the length of a Unicode string.
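
That is, simply (assuming the extension goes in):

    ord('a')          # ->   97, as today
    ord(u'\u20ac')    # -> 8364, same builtin, Unicode argument
    unichr(8364)      # -> u'\u20ac', always a Unicode object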




From tim_one@email.msn.com  Thu Nov 11 07:03:34 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 02:03:34 -0500
Subject: [Python-Dev] RE: [String-SIG] Re: regexp performance
In-Reply-To: <14377.34704.639462.794509@amarok.cnri.reston.va.us>
Message-ID: <000701bf2c12$de8bca80$262d153f@tim>

[Andrew M. Kuchling]
> ...
> Friedl's book describes several optimizations which aren't implemented
> in PCRE.  The problem is that PCRE never builds a parse tree, and
> parse trees are easy to analyse recursively.  Instead, PCRE's
> functions actually look at the compiled byte codes (for example, look
> at find_firstchar or is_anchored in pypcre.c), but this makes analysis
> functions hard to write, and rearranging the code near-impossible.

This is wonderfully & ironically Pythonic.  That is, the Python compiler
itself goes straight to byte code, and the optimization that's done works at
the latter low level.  Luckily , very little optimization is
attempted, and what's there only replaces one bytecode with another of the
same length.  If it tried to do more, it would have to rearrange the code
...

the-more-things-differ-the-more-things-don't-ly y'rs  - tim




From tim_one@email.msn.com  Thu Nov 11 07:27:52 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 02:27:52 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382973C3.DCA77051@lemburg.com>
Message-ID: <000801bf2c16$43f9a4c0$262d153f@tim>

[/F]
> last time I checked, there were no characters (even in the
> ISO standard) outside the 16-bit range.  has that changed?

[MAL]
> No, but people are already thinking about it and there is
> a defined range in the >16-bit area for private encodings
> (F0000..FFFFD and 100000..10FFFD).

Over the decades I've developed a rule of thumb that has never wound up
stuck in my ass:  If I engineer code that I expect to be in use for N
years, I make damn sure that every internal limit is at least 10x larger
than the largest I can conceive of a user making reasonable use of at the
end of those N years.  The invariable result is that the N years pass, and
fewer than half of the users have bumped into the limit <0.5 wink>.

At the risk of offending everyone, I'll suggest that, qualitatively
speaking, Unicode is as Eurocentric as ASCII is Anglocentric.  We've just
replaced "256 characters?!  We'll *never* run out of those!" with 64K.  But
when Asian languages consume them 7K at a pop, 64K isn't even in my 10x
comfort range for some individual languages.  In just a few months, Unicode
3 will already have used up > 56K of the 64K slots.

As I understand it, UTF-16 "only" adds 1M new code points.  That's in my 10x
zone, for about a decade.

predicting-we'll-live-to-regret-it-either-way-ly y'rs  - tim




From andy@robanal.demon.co.uk  Thu Nov 11 07:29:05 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 23:29:05 -0800 (PST)
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
Message-ID: <19991111072905.25203.rocketmail@web607.mail.yahoo.com>

> 2. Are there plans for an internationalization
> session at IPC8?  Perhaps a
>    few key players could be locked into a room for a
> couple days, to emerge
>    bloodied, but with an implementation in-hand...

Excellent idea.  

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From tim_one@email.msn.com  Thu Nov 11 07:29:50 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 02:29:50 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <005701bf2bc3$980f4d60$0501a8c0@bobcat>
Message-ID: <000901bf2c16$8a107420$262d153f@tim>

[Mark Hammond]
> Sure - that is what this customer wants, but we need to be clear about
> the "best thing" for Python generally versus what this particular
> client wants.
> ...
> Having a fixed, default encoding may make life slightly more difficult
> when you want to work primarily in a different encoding, but at least
> your system is predictable and reliable.

Well said, Mark!  Me too.  It's like HP is suffering from Windows envy.




From andy@robanal.demon.co.uk  Thu Nov 11 07:30:53 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 23:30:53 -0800 (PST)
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
Message-ID: <19991111073053.7884.rocketmail@web602.mail.yahoo.com>


--- "Barry A. Warsaw" 
wrote:
> 
> I'm starting to think about devday topics.  Sounds
> like an I18n
> session would be very useful.  Champions?
> 
I'm willing to explain what the fuss is about to
bemused onlookers and give some examples of problems
it should be able to solve - plenty of good slides and
screen shots.  I'll stay well away from the C
implementation issues.

Regards,

Andy

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From andy@robanal.demon.co.uk  Thu Nov 11 07:33:25 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 23:33:25 -0800 (PST)
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
Message-ID: <19991111073325.8024.rocketmail@web602.mail.yahoo.com>

> 
> What about just explaining the rationale for the
> default-less point of
> view to whoever is in charge of this at HP and see
> why they came up with
> their rationale in the first place?  They might have
> a good reason, or
> they might be willing to change said requirement.
> 
> --david

For that matter (I came into this a bit late), is
there a statement somewhere of what HP actually want
to do?  

- Andy

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From andy@robanal.demon.co.uk  Thu Nov 11 07:44:50 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 23:44:50 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991111074450.20451.rocketmail@web606.mail.yahoo.com>

> I say axe it and say "UTF-8" is the fixed, default
> encoding. If you want
> something else, then do that explicitly.
> 
Let me tell you why you would want to have an encoding
which can be set:

(1) Say I am on a Japanese Windows box, I have a
string called 'address' and I do 'print address'.  If
I see utf8, I see garbage.  If I see Shift-JIS, I see
the correct Japanese address.  At this point in time,
utf8 is an interchange format but 99% of the world's
data is in various native encodings.  

Analogous problems occur on input.

(2) I'm using htmlgen, which 'prints' objects to
standard output.  My web site is supposed to be
encoded in Shift-JIS (or EUC, or Big 5 for Taiwan,
etc.)  Yes, browsers CAN detect and display UTF8 but
you just don't find UTF8 sites in the real world - and
most users just don't know about the encoding menu,
and will get pissed off if they have to reach for it.

Ditto for streaming output in some protocol.

Java solves this (and we could too by hacking stdout)
using Writer classes which are created as wrappers
around an output stream and can take an encoding, but
you lose the flexibility to 'just print'.  

I think being able to change encoding would be useful.
 What I do not want is to auto-detect it from the
operating system when Python boots - that would be a
portability nightmare. 

Regards,

Andy





=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From fredrik@pythonware.com  Thu Nov 11 08:06:04 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Thu, 11 Nov 1999 09:06:04 +0100
Subject: [Python-Dev] RE: [String-SIG] Re: regexp performance
References: <000701bf2c12$de8bca80$262d153f@tim>
Message-ID: <009201bf2c1b$9a5c1b90$f29b12c2@secret.pythonware.com>

Tim Peters  wrote:
> > The problem is that PCRE never builds a parse tree, and
> > parse trees are easy to analyse recursively.  Instead, PCRE's
> > functions actually look at the compiled byte codes (for example, look
> > at find_firstchar or is_anchored in pypcre.c), but this makes analysis
> > functions hard to write, and rearranging the code near-impossible.
> 
> This is wonderfully & ironically Pythonic.  That is, the Python compiler
> itself goes straight to byte code, and the optimization that's done works at
> the latter low level.

yeah, but for some reason, people (including GvR) expect a
regular expression machinery to be more optimized than the
language interpreter ;-)





From tim_one@email.msn.com  Thu Nov 11 08:01:58 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 03:01:58 -0500
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
In-Reply-To: <19991111073325.8024.rocketmail@web602.mail.yahoo.com>
Message-ID: <000c01bf2c1b$0734c060$262d153f@tim>

[Andy Robinson]
> For that matter (I came into this a bit late), is
> there a statement somewhere of what HP actually want
> to do?

On this list, the best explanation we got was from Guido:  they want
"internationalization", and "Perl-compatible Unicode regexps".  I'm not sure
they even know the two aren't identical <0.9 wink>.

code-without-requirements-is-like-sex-without-consequences-ly y'rs  - tim




From guido@CNRI.Reston.VA.US  Thu Nov 11 12:03:51 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 11 Nov 1999 07:03:51 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: Your message of "Wed, 10 Nov 1999 23:44:50 PST."
 <19991111074450.20451.rocketmail@web606.mail.yahoo.com>
References: <19991111074450.20451.rocketmail@web606.mail.yahoo.com>
Message-ID: <199911111203.HAA24221@eric.cnri.reston.va.us>

> Let me tell you why you would want to have an encoding
> which can be set:
> 
> (1) Say I am on a Japanese Windows box, I have a
> string called 'address' and I do 'print address'.  If
> I see utf8, I see garbage.  If I see Shift-JIS, I see
> the correct Japanese address.  At this point in time,
> utf8 is an interchange format but 99% of the world's
> data is in various native encodings.  
> 
> Analogous problems occur on input.
> 
> (2) I'm using htmlgen, which 'prints' objects to
> standard output.  My web site is supposed to be
> encoded in Shift-JIS (or EUC, or Big 5 for Taiwan,
> etc.)  Yes, browsers CAN detect and display UTF8 but
> you just don't find UTF8 sites in the real world - and
> most users just don't know about the encoding menu,
> and will get pissed off if they have to reach for it.
> 
> Ditto for streaming output in some protocol.
> 
> Java solves this (and we could too by hacking stdout)
> using Writer classes which are created as wrappers
> around an output stream and can take an encoding, but
> you lose the flexibility to 'just print'.  
> 
> I think being able to change encoding would be useful.
>  What I do not want is to auto-detect it from the
> operating system when Python boots - that would be a
> portability nightmare. 

You almost convinced me there, but I think this can still be done
without changing the default encoding: simply reopen stdout with a
different encoding.  This is how Java does it.  I/O streams with an
encoding specified at open() are a very powerful feature.  You can
hide this in your $PYTHONSTARTUP.
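
Something like this, say from $PYTHONSTARTUP (a sketch only -- the
wrapper class is invented for illustration, not part of the proposal):

    import sys

    class EncodedWriter:
        # wraps a byte stream; write() takes Unicode and encodes it
        def __init__(self, stream, encoding):
            self.stream = stream
            self.encoding = encoding
        def write(self, text):
            self.stream.write(text.encode(self.encoding))
        def flush(self):
            self.stream.flush()

    # sys.stdout = EncodedWriter(sys.stdout, 'shift_jis')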

François Pinard might not like it though...

BTW, someone asked what HP asked for: I can't reveal what exactly they
asked for, basically because they don't seem to agree amongst
themselves.  The only firm statements I have is that they want i18n
and that they want it fast (before the end of the year).

The desire for Perl-compatible regexps comes from me, and the only
reason is compatibility with re.py.  (HP did ask for regexps, but they
wouldn't know the difference between POSIX and Perl if it poked them in
the eye.)

--Guido van Rossum (home page: http://www.python.org/~guido/)


From gstein@lyra.org  Thu Nov 11 12:20:39 1999
From: gstein@lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 04:20:39 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit (fwd)
Message-ID: 

Andy originally sent this just to me... I replied in kind, but saw that he
sent another copy to python-dev. Sending my reply there...

---------- Forwarded message ----------
Date: Thu, 11 Nov 1999 04:00:38 -0800 (PST)
From: Greg Stein 
To: andy@robanal.demon.co.uk
Subject: Re: [Python-Dev] Internationalization Toolkit

[ note: you sent direct to me; replying in kind in case that was your
  intent ]

On Wed, 10 Nov 1999, [iso-8859-1] Andy Robinson wrote:
>...
> Let me tell you why you would want to have an encoding
> which can be set:
>...snip: two examples of how "print" fails...

Neither of those examples are solid reasons for having a default encoding
that can be changed. Both can easily be altered at the Python level by
using an encoding function before printing.

You're asking for convenience, *not* providing a reason.

> Java solves this (and we could too) using Writer
> classes which are created as wrappers around an output
> stream and can take an encoding, but you lose the
> flexibility to just print.  

Not flexibility: convenience. You can certainly do:

  print encode(u,'Shift-JIS')

> I think being able to change encoding would be useful.
>  What I do not want is to auto-detect it from the
> operating system when Python boots - that would be a
> portability nightmare. 

Useful, but not a requirement.

Keep the interpreter simple, understandable, and predictable. A module
that changes the default over to 'utf-8' because it is interacting with a
network object is going to screw up your app if you're relying on an
encoding of 'shift-jis' to be present.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From andy@robanal.demon.co.uk  Thu Nov 11 12:49:10 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Thu, 11 Nov 1999 04:49:10 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991111124910.6373.rocketmail@web603.mail.yahoo.com>

> You almost convinced me there, but I think this can
> still be done
> without changing the default encoding: simply reopen
> stdout with a
> different encoding.  This is how Java does it.  I/O
> streams with an
> encoding specified at open() are a very powerful
> feature.  You can
> hide this in your $PYTHONSTARTUP.

Good point, I'm happy with this.  Make sure we specify
it in the docs as the right way to do it.  In an IDE,
we'd have an Options screen somewhere for the output
encoding.

What the Java code I have seen does is to open a raw
file and construct wrappers (InputStreamReader,
OutputStreamWriter) around it to do an encoding
conversion.  This kind of obfuscates what is going on
- Python just needs the extra argument.  

- Andy








=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From mal@lemburg.com  Thu Nov 11 12:42:51 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 13:42:51 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
 <38295A08.D3928401@lemburg.com> <14377.38438.615701.231437@weyr.cnri.reston.va.us>
Message-ID: <382AB9CB.634A9782@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  >     def encode(self,u):
>  >
>  >      """ Return the Unicode object u encoded as Python string.
> 
>   This should accept an optional slice parameter, and use it in the
> same way as .dump().

Ok.
 
>  >     def dump(self,u,stream,slice=None):
> ...
>  >     def load(self,stream,length=None):
> 
>   Why not have something like .wrapFile(f) that returns a file-like
> object with all the file methods implemented, and doing to "right
> thing" regarding encoding/decoding?  That way, the new file-like
> object can be used directly with code that works with files and
> doesn't care whether it uses 8-bit or unicode strings.

See File Output of the latest version:

File/Stream Output:
-------------------

Since file.write(object) and most other stream writers use the 's#'
argument parsing marker, the buffer interface implementation
determines the encoding to use (see Buffer Interface).

For explicit handling of Unicode using files, the unicodec module
could provide stream wrappers which provide transparent
encoding/decoding for any open stream (file-like object):

  import unicodec
  file = open('mytext.txt','rb')
  ufile = unicodec.stream(file,'utf-16')
  u = ufile.read()
  ...
  ufile.close()

XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as
    short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which
    also assures that <mode> contains the 'b' character when needed.
 
>  > Codecs should raise an UnicodeError in case the conversion is
>  > not possible.
> 
>   I think that should be ValueError, or UnicodeError should be a
> subclass of ValueError.

Ok.

>   (Can the -X interpreter option be removed yet?)

Doesn't Python convert class exceptions to strings when -X is
used ? I would guess that many scripts already rely on the class
based mechanism (much of my stuff does for sure), so by the time
1.6 is out, I think -X should be considered an option to run
pre 1.5 code rather than using it for performance reasons.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 11 13:01:40 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 14:01:40 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <005701bf2bc3$980f4d60$0501a8c0@bobcat>
Message-ID: <382ABE34.5D27C701@lemburg.com>

Mark Hammond wrote:
> 
> Marc writes:
> 
> > > modes are evil.  python is not perl.  etc.
> >
> > But a requirement by the customer... they want to be able to
> > set the locale
> > on a per thread basis. Not exactly my preference (I think all locale
> > settings should be passed as parameters, not via globals).
> 
> Sure - that is what this customer wants, but we need to be clear about
> the "best thing" for Python generally versus what this particular
> client wants.
> 
> For example, if we went with UTF-8 as the only default encoding, then
> HP may be forced to use a helper function to perform the conversion,
> rather than the built-in functions.  This helper function can use TLS
> (in Python) to store the encoding.  At least it is localized.
> 
> I agree that having a default encoding that can be changed is a bad
> idea.  It may make 3 line scripts that need to print something easier
> to work with, but at the cost of reliability in large systems.  Kinda
> like the existing "locale" support, which is thread specific, and is
> well known to cause these sorts of problems.  The end result is that
> in your app, you find _someone_ has changed the default encoding, and
> some code no longer works.  So the solution is to change the default
> encoding back, so _your_ code works again.  You just know that whoever
> it was that changed the default encoding in the first place is now
> going to break - but what else can you do?
> 
> Having a fixed, default encoding may make life slightly more difficult
> when you want to work primarily in a different encoding, but at least
> your system is predictable and reliable.

I think the discussion on this is getting a little too hot. The point
is simply that the option of changing the per-thread default encoding
is there. You are not required to use it and if you do you are on
your own when something breaks.

Think of it as an HP-specific feature... perhaps I should wrap the code
in #ifdefs and leave it undocumented.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fdrake@acm.org  Thu Nov 11 15:02:32 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Thu, 11 Nov 1999 10:02:32 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382AB9CB.634A9782@lemburg.com>
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
 <38295A08.D3928401@lemburg.com>
 <14377.38438.615701.231437@weyr.cnri.reston.va.us>
 <382AB9CB.634A9782@lemburg.com>
Message-ID: <14378.55944.371933.613604@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > For explicit handling of Unicode using files, the unicodec module
 > could provide stream wrappers which provide transparent
 > encoding/decoding for any open stream (file-like object):

  Sounds good to me!  I guess I just missed it; there's been so much
going on lately.

 > XXX unicodec.file(<filename>,<mode>,<encoding>) could be provided as
 >     short-hand for unicodec.file(open(<filename>,<mode>),<encoding>) which
 >     also assures that <mode> contains the 'b' character when needed.

  Actually, I'd call it unicodec.open().

I asked:
 >   (Can the -X interpreter option be removed yet?)

You commented:
 > Doesn't Python convert class exceptions to strings when -X is
 > used ? I would guess that many scripts already rely on the class
 > based mechanism (much of my stuff does for sure), so by the time
 > 1.6 is out, I think -X should be considered an option to run
 > pre 1.5 code rather than using it for performance reasons.

  Gosh, I never thought of it as a performance issue!
  What I'd like to do is avoid code like this:

        try:
            class UnicodeError(ValueError):
                # well, something would probably go here...
                pass
        except TypeError:
            class UnicodeError:
                # something slightly different for this one...
                pass

  Trying to use class exceptions can be really tedious, and often I'd
like to pick up the stuff from Exception.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives


From mal@lemburg.com  Thu Nov 11 14:21:50 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:21:50 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000201bf2c0d$8b866160$262d153f@tim>
Message-ID: <382AD0FE.B604876A@lemburg.com>

Tim Peters wrote:
> 
> [/F, dripping with code]
> > ...
> > Note that the 'u' must be followed by four hexadecimal digits.  If
> > fewer digits are given, the sequence is left in the resulting string
> > exactly as given.
> 
> Yuck -- don't let probable error pass without comment.  "must be" == "must
> be"!

I second that.
 
> [moving backwards]
> > \uxxxx -- Unicode character with hexadecimal value xxxx.  The
> > character is stored using UTF-8 encoding, which means that this
> > sequence can result in up to three encoded characters.
> 
> The code is fine, but I've gotten confused about what the intent is now.
> Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8
> literals, but now he's got Unicode-escaped literals instead -- and you favor
> an internal 2-byte-per-char Unicode storage format.  In that combination of
> worlds, is there any use in the *language* (as opposed to in a runtime
> module) for \uxxxx -> UTF-8 conversion?

No, no...  :-) 

I think it was a simple misunderstanding... \uXXXX is only to be
used within u'' strings and then gets expanded to *one* character
encoded in the internal Python format (which is heading towards UTF-16
without surrogates).
 
> And MAL, if you're listening, I'm not clear on what a Unicode-escaped
> literal means.  When you had UTF-8 literals, the meaning of something like
> 
>     u"a\340\341"
> 
> was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals
> were just a way of specifying a byte stream.  As a Unicode-escaped string, I
> assume the "a" maps to the Unicode "a", but what of the rest?  Are the octal
> escapes to be taken as two separate Latin-1 characters (in their role as a
> Unicode subset), or as an especially clumsy way to specify a single 16-bit
> Unicode character?  I'm afraid I'd vote for the former.  Same issue wrt \x
> escapes.

Good points.

The conversion goes as follows (a short example is given below):
· for single characters (and this includes all \XXX sequences except \uXXXX),
  take the ordinal and interpret it as Unicode ordinal
· for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX
  instead
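
Assuming the proposed u'...' literal syntax, those two rules would mean
(a sketch of the intended semantics, not working 1.5.2 code):

    u1 = u"a\340\341"        # octal escapes: two separate Latin-1 characters
    assert len(u1) == 3
    assert ord(u1[1]) == 0xE0        # \340 taken as a single Unicode ordinal

    u2 = u"a\u0103"          # \uXXXX: exactly one character with ordinal 0x0103
    assert len(u2) == 2
    assert ord(u2[1]) == 0x0103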
 
> One other issue:  are there "raw" Unicode strings too, as in ur"\u20ac"?
> There probably should be; and while Guido will hate this, a ur string should
> probably *not* leave \uxxxx escapes untouched.  Nasties like this are why
> Java defines \uxxxx expansion as occurring in a preprocessing step.

Not sure whether we really need to make this even more complicated...
The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or filenames
won't hurt much in the context of those \uXXXX monsters :-)

> BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or
> isn't \uxxxx allowed in a non-Unicode string?  that's what I would do ...).

Right. \uXXXX will only be allowed in u'' strings, not in "normal"
strings.

BTW, if you want to type in UTF-8 strings and have them converted
to Unicode, you can use the standard:

u = unicode('...string with UTF-8 encoded characters...','utf-8')

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 11 14:23:45 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:23:45 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000601bf2c11$e4b07920$262d153f@tim>
Message-ID: <382AD171.D22A1D6E@lemburg.com>

Tim Peters wrote:
> 
> [MAL, on Unicode chr() and ord()
> > ...
> > Because unichr() will always have to return Unicode objects. You don't
> > want chr(i) to return Unicode for i>255 and strings for i<256.
> 
> Indeed I do not!
> 
> > OTOH, ord() could probably be extended to also work on Unicode objects.
> 
> I think should be -- it's a good & natural use of polymorphism; introducing
> a new function *here* would be as odd as introducing a unilen() function to
> get the length of a Unicode string.

Fine. So I'll drop the uniord() API and extend ord() instead.
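
In other words (assuming the proposed unichr() and the extended ord()):

    assert ord('a') == 97            # 8-bit string, as before
    assert ord(u'a') == 97           # ord() extended to length-1 Unicode strings
    assert unichr(0x20AC) == u'\u20ac'   # unichr() always returns a Unicode object
    assert chr(97) == 'a'            # chr() keeps returning 8-bit strings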

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 11 14:36:41 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:36:41 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000901bf2c16$8a107420$262d153f@tim>
Message-ID: <382AD479.5261B43B@lemburg.com>

Tim Peters wrote:
> 
> [Mark Hammond]
> > Sure - that is what this customer wants, but we need to be clear about
> > the "best thing" for Python generally versus what this particular
> > client wants.
> > ...
> > Having a fixed, default encoding may make life slightly more difficult
> > when you want to work primarily in a different encoding, but at least
> > your system is predictable and reliable.
> 
> Well said, Mark!  Me too.  It's like HP is suffering from Windows envy
> .

See my other post on the subject...

Note that if we make UTF-8 the standard encoding, nearly all 
special Latin-1 characters will produce UTF-8 errors on input
and unreadable garbage on output. That will probably be unacceptable
in Europe. To remedy this, one would *always* have to use
u.encode('latin-1') to get readable output for Latin-1 strings
represented in Unicode.

I'd rather see this happen the other way around: *always* explicitly
state the encoding you want in case you rely on it, e.g. write

file.write(u.encode('utf-8'))

instead of

file.write(u) # let's hope this goes out as UTF-8...

Using the <default encoding> as a site-dependent setting is useful
for convenience in those cases where the output format should be
readable rather than parseable.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 11 14:26:59 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:26:59 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000801bf2c16$43f9a4c0$262d153f@tim>
Message-ID: <382AD233.BE6DE888@lemburg.com>

Tim Peters wrote:
> 
> [/F]
> > last time I checked, there were no characters (even in the
> > ISO standard) outside the 16-bit range.  has that changed?
> 
> [MAL]
> > No, but people are already thinking about it and there is
> > a defined range in the >16-bit area for private encodings
> > (F0000..FFFFD and 100000..10FFFD).
> 
> Over the decades I've developed a rule of thumb that has never wound up
> stuck in my ass :  If I engineer code that I expect to be in use for N
> years, I make damn sure that every internal limit is at least 10x larger
> than the largest I can conceive of a user making reasonable use of at the
> end of those N years.  The invariable result is that the N years pass, and
> fewer than half of the users have bumped into the limit <0.5 wink>.
> 
> At the risk of offending everyone, I'll suggest that, qualitatively
> speaking, Unicode is as Eurocentric as ASCII is Anglocentric.  We've just
> replaced "256 characters?!  We'll *never* run out of those!" with 64K.  But
> when Asian languages consume them 7K at a pop, 64K isn't even in my 10x
> comfort range for some individual languages.  In just a few months, Unicode
> 3 will already have used up > 56K of the 64K slots.
> 
> As I understand it, UTF-16 "only" adds 1M new code points.  That's in my 10x
> zone, for about a decade.

If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
signal failure of this assertion at Unicode object construction time
via an exception. That way we are within the standard, can use
reasonably fast code for Unicode manipulation and add those extra 1M
characters at a later stage.
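
A rough sketch of that construction-time check (the names are illustrative
only -- the real check would of course live in the C implementation):

    def check_ucs2(ordinals):
        # Refuse surrogate code units and anything above 0xFFFF until
        # real UTF-16 (surrogate pair) support is added later.
        for o in ordinals:
            if 0xD800 <= o <= 0xDFFF:
                raise UnicodeError("surrogate code unit not supported: 0x%04X" % o)
            if o > 0xFFFF:
                raise UnicodeError("ordinal 0x%X outside the UCS-2 range" % o)
        return ordinals

    # check_ucs2([0x41, 0x20AC]) passes; check_ucs2([0xD800]) raises UnicodeError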

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 11 14:47:49 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:47:49 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> <199911111203.HAA24221@eric.cnri.reston.va.us>
Message-ID: <382AD715.66DBA125@lemburg.com>

Guido van Rossum wrote:
> 
> > Let me tell you why you would want to have an encoding
> > which can be set:
> >
> > (1) sday I am on a Japanese Windows box, I have a
> > string called 'address' and I do 'print address'.  If
> > I see utf8, I see garbage.  If I see Shift-JIS, I see
> > the correct Japanese address.  At this point in time,
> > utf8 is an interchange format but 99% of the world's
> > data is in various native encodings.
> >
> > Analogous problems occur on input.
> >
> > (2) I'm using htmlgen, which 'prints' objects to
> > standard output.  My web site is supposed to be
> > encoded in Shift-JIS (or EUC, or Big 5 for Taiwan,
> > etc.)  Yes, browsers CAN detect and display UTF8 but
> > you just don't find UTF8 sites in the real world - and
> > most users just don't know about the encoding menu,
> > and will get pissed off if they have to reach for it.
> >
> > Ditto for streaming output in some protocol.
> >
> > Java solves this (and we could too by hacking stdout)
> > using Writer classes which are created as wrappers
> > around an output stream and can take an encoding, but
> > you lose the flexibility to 'just print'.
> >
> > I think being able to change encoding would be useful.
> >  What I do not want is to auto-detect it from the
> > operating system when Python boots - that would be a
> > portability nightmare.
> 
> You almost convinced me there, but I think this can still be done
> without changing the default encoding: simply reopen stdout with a
> different encoding.  This is how Java does it.  I/O streams with an
> encoding specified at open() are a very powerful feature.  You can
> hide this in your $PYTHONSTARTUP.

True and it probably covers all cases where setting the
default encoding to something other than UTF-8 makes sense.

I guess you've convinced me there ;-)

The current proposal has wrappers around stream for this purpose:

For explicit handling of Unicode using files, the unicodec module
could provide stream wrappers which provide transparent
encoding/decoding for any open stream (file-like object):

  import unicodec
  file = open('mytext.txt','rb')
  ufile = unicodec.stream(file,'utf-16')
  u = ufile.read()
  ...
  ufile.close()

XXX unicodec.file(,,) could be provided as
    short-hand for unicodec.file(open(,),) which
    also assures that  contains the 'b' character when needed.

The above can be done using:

import sys,unicodec
sys.stdin = unicodec.stream(sys.stdin,'jis')
sys.stdout = unicodec.stream(sys.stdout,'jis')

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From jack@oratrix.nl  Thu Nov 11 15:58:39 1999
From: jack@oratrix.nl (Jack Jansen)
Date: Thu, 11 Nov 1999 16:58:39 +0100
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: Message by "M.-A. Lemburg"  ,
 Thu, 11 Nov 1999 15:23:45 +0100 , <382AD171.D22A1D6E@lemburg.com>
Message-ID: <19991111155839.BFB0235BB1E@snelboot.oratrix.nl>

> > [MAL, on Unicode chr() and ord()
> > > ...
> > > Because unichr() will always have to return Unicode objects. You don't
> > > want chr(i) to return Unicode for i>255 and strings for i<256.

> > > OTOH, ord() could probably be extended to also work on Unicode objects.

> Fine. So I'll drop the uniord() API and extend ord() instead.

Hmm, then wouldn't it be more logical to drop unichr() too, but add an 
optional parameter to chr() to specify what sort of a string you want? The 
type-object of a unicode string comes to mind...
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 




From bwarsaw@cnri.reston.va.us (Barry A. Warsaw)  Thu Nov 11 16:04:29 1999
From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw)
Date: Thu, 11 Nov 1999 11:04:29 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
 <38295A08.D3928401@lemburg.com>
 <14377.38438.615701.231437@weyr.cnri.reston.va.us>
 <382AB9CB.634A9782@lemburg.com>
Message-ID: <14378.59661.376434.449820@anthem.cnri.reston.va.us>

>>>>> "M" == M   writes:

    M> Doesn't Python convert class exceptions to strings when -X is
    M> used ? I would guess that many scripts already rely on the
    M> class based mechanism (much of my stuff does for sure), so by
    M> the time 1.6 is out, I think -X should be considered an option
    M> to run pre 1.5 code rather than using it for performance
    M> reasons.

This is a little off-topic so I'll be brief.  When using -X Python
never even creates the class exceptions, so it isn't really a
conversion.  It just uses string exceptions and tries to craft tuples
for what would be the superclasses in the class-based exception
hierarchy.  Yes, class-based exceptions are a bit of a performance hit
when you are catching exceptions in Python (because they need to be
instantiated), but they're just so darn *useful*.  I wouldn't mind
seeing the -X option go away for 1.6.

-Barry


From andy@robanal.demon.co.uk  Thu Nov 11 16:08:15 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Thu, 11 Nov 1999 08:08:15 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991111160815.5235.rocketmail@web608.mail.yahoo.com>

> See my other post on the subject...
> 
> Note that if we make UTF-8 the standard encoding,
> nearly all 
> special Latin-1 characters will produce UTF-8 errors
> on input
> and unreadable garbage on output. That will probably
> be unacceptable
> in Europe. To remedy this, one would *always* have
> to use
> u.encode('latin-1') to get readable output for
> Latin-1 strings
> repesented in Unicode.

You beat me to it - a colleague and I were just
discussing this verbally.  Specifically we Brits will
get annoyed as soon as we read in a text file with
pound (sterling) signs.

We concluded that the only reasonable default (if you
have one at all) is pure ASCII.  At least that way I
will get a clear and intelligible warning when I load
in such a file, and will remember to specify
ISO-Latin-1.  

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From mal@lemburg.com  Thu Nov 11 15:59:21 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 16:59:21 +0100
Subject: [Python-Dev] Unicode proposal: %-formatting ?
Message-ID: <382AE7D9.147D58CB@lemburg.com>

I wonder how we could add %-formatting to Unicode strings without
duplicating the PyString_Format() logic.

First, do we need Unicode object %-formatting at all ?

Second, here is an emulation using strings and <default encoding>
that should give an idea of how one could work with the different
encodings:

    s = '%s %i abcäöü' # a Latin-1 encoded string
    t = (u,3)

    # Convert Latin-1 s to a <default encoding> string via Unicode
    s1 = unicode(s,'latin-1').encode()

    # The '%s' will now add u in <default encoding>
    s2 = s1 % t

    # Finally, convert the <default encoding> encoded string to Unicode
    u1 = unicode(s2)

Note that .encode() defaults to the current setting of
<default encoding>.

Provided u maps to Latin-1, an alternative would be:

    u1 = unicode('%s %i abcäöü' % (u.encode('latin-1'),3), 'latin-1')
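
The emulation could of course be wrapped up in a small helper; a sketch,
assuming the proposed unicode()/.encode() APIs and Latin-1 as the working
encoding:

    def uformat(fmt, args, encoding='latin-1'):
        # fmt is an 8-bit format string in `encoding`; Unicode arguments
        # are first encoded into that encoding, then the formatted result
        # is converted back to a Unicode object.
        encoded = []
        for arg in args:
            if isinstance(arg, type(u'')):
                arg = arg.encode(encoding)
            encoded.append(arg)
        return unicode(fmt % tuple(encoded), encoding)

    # u1 = uformat('%s %i abc', (u'xyz', 3))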

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 11 17:04:37 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 18:04:37 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991111155839.BFB0235BB1E@snelboot.oratrix.nl>
Message-ID: <382AF725.FC66C9B6@lemburg.com>

Jack Jansen wrote:
> 
> > > [MAL, on Unicode chr() and ord()
> > > > ...
> > > > Because unichr() will always have to return Unicode objects. You don't
> > > > want chr(i) to return Unicode for i>255 and strings for i<256.
> 
> > > > OTOH, ord() could probably be extended to also work on Unicode objects.
> 
> > Fine. So I'll drop the uniord() API and extend ord() instead.
> 
> Hmm, then wouldn't it be more logical to drop unichr() too, but add an
> optional parameter to chr() to specify what sort of a string you want? The
> type-object of a unicode string comes to mind...

Like:

import types
uc = chr(12,types.UnicodeType)

... looks overly complicated, IMHO.

uc = unichr(12)

and

u = unicode('abc')

look pretty intuitive to me.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 11 17:31:34 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 18:31:34 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991111160815.5235.rocketmail@web608.mail.yahoo.com>
Message-ID: <382AFD76.A0D3FEC4@lemburg.com>

Andy Robinson wrote:
> 
> > See my other post on the subject...
> >
> > Note that if we make UTF-8 the standard encoding,
> > nearly all
> > special Latin-1 characters will produce UTF-8 errors
> > on input
> > and unreadable garbage on output. That will probably
> > be unacceptable
> > in Europe. To remedy this, one would *always* have
> > to use
> > u.encode('latin-1') to get readable output for
> > Latin-1 strings
> > repesented in Unicode.
> 
> You beat me to it - a colleague and I were just
> discussing this verbally.  Specifically we Brits will
> get annoyed as soon as we read in a text file with
> pound (sterling) signs.
> 
> We concluded that the only reasonable default (if you
> have one at all) is pure ASCII.  At least that way I
> will get a clear and intelligible warning when I load
> in such a file, and will remember to specify
> ISO-Latin-1.

Well, Guido's post made me rethink the approach...

1. Setting <default encoding> to any non-UTF encoding
   will result in data loss due to the encoding limits
   imposed by the other formats -- this is dangerous and
   will result in errors (some of which may not even be
   noticed due to the interpreter ignoring them) in case
   your strings use non-encodable characters.

2. You basically only want to set <default encoding> to
   anything other than UTF-8 for stream input and output.
   This can be done using the unicodec stream wrapper without
   too much inconvenience. (We'll have to extend the wrapper a little,
   though, because it currently only accepts Unicode objects for
   writing and always returns Unicode objects when reading -- see
   the sketch below.)

3. We should leave the issue open until some code is there
   to be tested... I have a feeling that there will be quite
   a few strange effects when APIs expecting strings are fed
   with Unicode objects returning UTF-8.
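
Regarding the extension mentioned in point 2, the wrapper's write() could
simply accept both string types; a minimal sketch, again built only on the
proposed APIs:

    class MixedStreamWriter:
        # Sketch: pass 8-bit strings through unchanged, encode Unicode
        # objects into the stream's encoding before writing.
        def __init__(self, stream, encoding):
            self.stream = stream
            self.encoding = encoding

        def write(self, obj):
            if isinstance(obj, type(u'')):
                obj = obj.encode(self.encoding)
            self.stream.write(obj)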

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mhammond@skippinet.com.au  Fri Nov 12 01:10:09 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 12 Nov 1999 12:10:09 +1100
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382ABE34.5D27C701@lemburg.com>
Message-ID: <007a01bf2caa$aabdef60$0501a8c0@bobcat>

> Mark Hammond wrote:
> > Having a fixed, default encoding may make life slightly
> more difficult
> > when you want to work primarily in a different encoding,
> but at least
> > your system is predictable and reliable.
>
> I think the discussion on this is getting a little too hot.

Really - I see it as moving to a rational consensus that doesn't
support the proposal in this regard.  I see no heat in it at all.  I'm
sorry if you saw my post or any of the followups as "emotional", but I'm
certainly not getting passionate about this.  I don't see any of this
as affecting me personally.  I believe that I can replace my Unicode
implementation with this either way we go.  Just because we are
trying to get it right doesn't mean we are getting heated.

> The point
> is simply that the option of changing the per-thread default
encoding
> is there. You are not required to use it and if you do you are on
> your own when something breaks.

Hrm - I'm having serious trouble following your logic here.  If I make
_any_ assumptions about a default encoding, I am in danger of
breaking.  I may not choose to change the default, but as soon as
_anyone_ does, unrelated code may break.

I agree that I will be "on my own", but I won't necessarily have been
the one that changed it :-(

The only answer I can see is, as you suggest, to ignore the fact that
there is _any_ default.  Always specify the encoding.  But obviously
this is not good enough for HP:

> Think of it as a HP specific feature... perhaps I should wrap the
code
> in #ifdefs and leave it undocumented.

That would work - just ensure that no standard Python has those
#ifdefs turned on :-)  I would be sorely disappointed if the fact that
HP are throwing money for this means they get every whim implemented
in the core language.  Imagine the outcry if it were instead MS'
money, and you were attempting to put an MS spin on all this.

Are you writing a module for HP, or writing a module for Python that
HP are assisting by providing some funding?  Clear difference.  IMO,
it must also be seen that there is a clear difference.

Maybe I'm missing something.  Can you explain why it is good enough for
everyone else to be required to assume there is no default encoding,
but HP get their thread-specific global?  Are their requirements
greater than anyone else's?  Is everyone else not as important?  What
would you, as a consultant, recommend to people who aren't HP, but have
a similar requirement?  It would seem obvious to me that HP's
requirement can be met in "pure Python", thereby keeping this out of
the core altogether...

Mark.



From gmcm@hypernet.com  Fri Nov 12 02:01:23 1999
From: gmcm@hypernet.com (Gordon McMillan)
Date: Thu, 11 Nov 1999 21:01:23 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <007a01bf2caa$aabdef60$0501a8c0@bobcat>
References: <382ABE34.5D27C701@lemburg.com>
Message-ID: <1269750417-7621469@hypernet.com>

[per-thread defaults]

C'mon guys, hasn't anyone ever played consultant before? The 
idea is obviously brain-dead. OTOH, they asked for it 
specifically, meaning they have some assumptions about how 
they think they're going to use it. If you give them what they 
ask for, you'll only have to fix it when they realize there are 
other ways of doing things that don't work with per-thread 
defaults. So, you find out why they think it's a good thing; you 
make it easy for them to code this way (without actually using 
per-thread defaults) and you don't make a fuss about it. More 
than likely, they won't either.

"requirements"-are-only-useful-as-clues-to-the-objectives-
behind-them-ly y'rs



- Gordon


From tim_one@email.msn.com  Fri Nov 12 05:04:44 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 00:04:44 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382AB9CB.634A9782@lemburg.com>
Message-ID: <000a01bf2ccb$6f59c2c0$fd2d153f@tim>

[MAL]
>>> Codecs should raise an UnicodeError in case the conversion is
>>> not possible.

[Fred L. Drake, Jr.]
>>   I think that should be ValueError, or UnicodeError should be a
>> subclass of ValueError.
>>   (Can the -X interpreter option be removed yet?)

[MAL]
> Doesn't Python convert class exceptions to strings when -X is
> used ? I would guess that many scripts already rely on the class
> based mechanism (much of my stuff does for sure), so by the time
> 1.6 is out, I think -X should be considered an option to run
> pre 1.5 code rather than using it for performance reasons.

-X is a red herring.  That is, do what seems best without regard for -X.  I
already added one subclass exception to the CVS tree (UnboundLocalError as a
subclass of NameError), and in doing that had to figure out how to make it
do the right thing under -X too.  It's a bit clumsy to arrange, but not a
problem.




From tim_one@email.msn.com  Fri Nov 12 05:18:09 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 00:18:09 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <382AD0FE.B604876A@lemburg.com>
Message-ID: <000e01bf2ccd$4f4b0e60$fd2d153f@tim>

[MAL]
> ...
> The conversion goes as follows:
> · for single characters (and this includes all \XXX sequences
>   except \uXXXX), take the ordinal and interpret it as Unicode
>   ordinal
> · for \uXXXX sequences, insert the Unicode character
>   with ordinal 0xXXXX instead

Perfect!

[about "raw" Unicode strings]
> ...
> Not sure whether we really need to make this even more complicated...
> The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or
> filenames won't hurt much in the context of those \uXXXX monsters :-)

Alas, this won't stand over the long term.  Eventually people will write
Python using nothing but Unicode strings -- "regular strings" will
eventually become a backward compatibility headache <0.7 wink>.  IOW,
Unicode regexps and Unicode docstrings and Unicode formatting ops ...
nothing will escape.  Nor should it.

I don't think it all needs to be done at once, though -- existing languages
usually take years to graft in gimmicks to cover all the fine points.  So,
happy to let raw Unicode strings pass for now, as a relatively minor point,
but without agreeing it can be ignored forever.

> ...
> BTW, if you want to type in UTF-8 strings and have them converted
> to Unicode, you can use the standard:
>
> u = unicode('...string with UTF-8 encoded characters...','utf-8')

That's what I figured, and thanks for the confirmation.




From tim_one@email.msn.com  Fri Nov 12 05:42:32 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 00:42:32 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382AD233.BE6DE888@lemburg.com>
Message-ID: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>

[MAL]
> If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
> signal failure of this assertion at Unicode object construction time
> via an exception. That way we are within the standard, can use
> reasonably fast code for Unicode manipulation and add those extra 1M
> character at a later stage.

I think this is reasonable.

Using UTF-8 internally is also reasonable, and if it's being rejected on the
grounds of supposed slowness, that deserves a closer look (it's an ingenious
encoding scheme that works correctly with a surprising number of existing
8-bit string routines as-is).  Indexing UTF-8 strings is greatly speeded by
adding a simple finger (i.e., store along with the string an index+offset
pair identifying the most recent position indexed to -- since string
indexing is overwhelmingly sequential, this makes most indexing
constant-time; and UTF-8 can be scanned either forward or backward from a
random internal point because "the first byte" of each encoding is
recognizable as such).
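
A sketch of that finger idea in Python (purely illustrative -- a real
implementation would live in C inside the string object):

    class UTF8Finger:
        # Caches (character index, byte offset) of the last lookup so that
        # mostly-sequential indexing into a UTF-8 byte string stays cheap.
        def __init__(self, data):
            self.data = data            # UTF-8 encoded 8-bit string
            self.index = 0              # character index of the finger
            self.offset = 0             # byte offset of that character's lead byte

        def _is_lead(self, byte):
            # continuation bytes look like 10xxxxxx; anything else starts a character
            return (ord(byte) & 0xC0) != 0x80

        def byte_offset(self, i):
            index, offset = self.index, self.offset
            if i >= index:
                step = 1
            else:
                step = -1
            while index != i:
                offset = offset + step
                while 0 < offset < len(self.data) and \
                      not self._is_lead(self.data[offset]):
                    offset = offset + step      # skip continuation bytes
                index = index + step
            self.index, self.offset = i, offset
            return offset

        def char(self, i):
            start = self.byte_offset(i)
            end = start + 1
            while end < len(self.data) and not self._is_lead(self.data[end]):
                end = end + 1
            return self.data[start:end]         # UTF-8 bytes of character i

    # f = UTF8Finger('ab\xc3\xa9cd')   # 'ab' + e-acute + 'cd' in UTF-8
    # f.char(2) == '\xc3\xa9'; subsequent f.char(3), f.char(4) reuse the finger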

I expect either would work well.  It's at least curious that Perl and Tcl
both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
people here saying UCS-2 is the obviously better choice are all from the
Microsoft camp.  It's not obvious to me, but then neither do I claim
that UTF-8 is obviously better.




From tim_one@email.msn.com  Fri Nov 12 06:02:01 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 01:02:01 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382AD479.5261B43B@lemburg.com>
Message-ID: <001001bf2cd3$6fa57820$fd2d153f@tim>

[MAL]
> Note that if we make UTF-8 the standard encoding, nearly all
> special Latin-1 characters will produce UTF-8 errors on input
> and unreadable garbage on output. That will probably be unacceptable
> in Europe. To remedy this, one would *always* have to use
> u.encode('latin-1') to get readable output for Latin-1 strings
> repesented in Unicode.

I think it's time for the Europeans to pronounce on what's acceptable in
Europe.  To the limited extent that I can pretend I'm European, I'm happy
with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea.

> I'd rather see this happen the other way around: *always* explicitly
> state the encoding you want in case you rely on it, e.g. write
>
> file.write(u.encode('utf-8'))
>
> instead of
>
> file.write(u) # let's hope this goes out as UTF-8...

By the same argument, those pesky Europeans who are relying on Latin-1
should write

file.write(u.encode('latin-1'))

instead of

file.write(u)  # let's hope this goes out as Latin-1

> Using the  as site dependent setting is useful
> for convenience in those cases where the output format should be
> readable rather than parseable.

Well, "convenience" is always the argument advanced in favor of modes.
Conflicts and nasty intermittent bugs are always the result.  The latter
will happen under Guido's idea too, as various careless modules rebind stdin
& stdout to their own ideas of what "the proper" encoding should be.  But at
least the blame doesn't fall on the core language then <0.3 wink>.

Since there doesn't appear to be anything (either good or bad) you can do
(or avoid) by using Guido's scheme instead of magical core thread state,
there's no *need* for the latter.  That is, it can be done with a user-level
API without involving the core.




From tim_one@email.msn.com  Fri Nov 12 06:17:08 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 01:17:08 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <007a01bf2caa$aabdef60$0501a8c0@bobcat>
Message-ID: <001501bf2cd5$8c380140$fd2d153f@tim>

[Mark Hammond]
> ...
> Are you writing a module for HP, or writing a module for Python that
> HP are assisting by providing some funding?  Clear difference.  IMO,
> it must also be seen that there is a clear difference.

I can resolve this easily, but only with input from Guido.  Guido, did HP's
check clear yet?  If so, we can ignore them.




From andy@robanal.demon.co.uk  Fri Nov 12 08:15:19 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Fri, 12 Nov 1999 00:15:19 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991112081519.20636.rocketmail@web603.mail.yahoo.com>

--- Gordon McMillan  wrote:
> [per-thread defaults]
> 
> C'mon guys, hasn't anyone ever played consultant
> before? The 
> idea is obviously brain-dead. OTOH, they asked for
> it 
> specifically, meaning they have some assumptions
> about how 
> they think they're going to use it. If you give them
> what they 
> ask for, you'll only have to fix it when they
> realize there are 
> other ways of doing things that don't work with
> per-thread 
> defaults. So, you find out why they think it's a
> good thing; you 
> make it easy for them to code this way (without
> actually using 
> per-thread defaults) and you don't make a fuss about
> it. More 
> than likely, they won't either.
> 

I wrote directly to ask them exactly this last night. 
Let's forget the per-thread thing until we get an
answer.

- Andy




=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From mal@lemburg.com  Fri Nov 12 09:27:29 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:27:29 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000e01bf2ccd$4f4b0e60$fd2d153f@tim>
Message-ID: <382BDD81.458D3125@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > ...
> > The conversion goes as follows:
> > · for single characters (and this includes all \XXX sequences
> >   except \uXXXX), take the ordinal and interpret it as Unicode
> >   ordinal
> > · for \uXXXX sequences, insert the Unicode character
> >   with ordinal 0xXXXX instead
> 
> Perfect!

Thanks :-)
 
> [about "raw" Unicode strings]
> > ...
> > Not sure whether we really need to make this even more complicated...
> > The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or
> > filenames won't hurt much in the context of those \uXXXX monsters :-)
> 
> Alas, this won't stand over the long term.  Eventually people will write
> Python using nothing but Unicode strings -- "regular strings" will
> eventurally become a backward compatibility headache <0.7 wink>.  IOW,
> Unicode regexps and Unicode docstrings and Unicode formatting ops ...
> nothing will escape.  Nor should it.
> 
> I don't think it all needs to be done at once, though -- existing languages
> usually take years to graft in gimmicks to cover all the fine points.  So,
> happy to let raw Unicode strings pass for now, as a relatively minor point,
> but without agreeing it can be ignored forever.

Agreed... note that you could also write your own codec for just this
reason and then use:

u = unicode('....\u1234...\...\...','raw-unicode-escaped')

Put that into a function called 'ur' and you have:

u = ur('...\u4545...\...\...')

which is not that far away from ur'...' w/r to cosmetics.
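
The decoding side of such a codec would not be hard to emulate in Python
either; a sketch (expand \uXXXX only, leave every other backslash alone,
treat the remaining bytes as Latin-1) -- the codec name and its hookup into
unicode() are still assumptions:

    import re

    _u_escape = re.compile(r'\\u([0-9a-fA-F]{4})')

    def raw_unicode_escape_decode(s):
        # expand \uXXXX escapes to single characters, leave other
        # backslash sequences untouched (as a raw literal would)
        def repl(m):
            return unichr(int(m.group(1), 16))
        return _u_escape.sub(repl, unicode(s, 'latin-1'))

    # raw_unicode_escape_decode(r'\d+ \u20ac') == u'\\d+ \u20ac'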

> > ...
> > BTW, if you want to type in UTF-8 strings and have them converted
> > to Unicode, you can use the standard:
> >
> > u = unicode('...string with UTF-8 encoded characters...','utf-8')
> 
> That's what I figured, and thanks for the confirmation.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Fri Nov 12 09:00:47 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:00:47 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991112081519.20636.rocketmail@web603.mail.yahoo.com>
Message-ID: <382BD73E.E6729C79@lemburg.com>

Andy Robinson wrote:
> 
> --- Gordon McMillan  wrote:
> > [per-thread defaults]
> >
> > C'mon guys, hasn't anyone ever played consultant
> > before? The
> > idea is obviously brain-dead. OTOH, they asked for
> > it
> > specifically, meaning they have some assumptions
> > about how
> > they think they're going to use it. If you give them
> > what they
> > ask for, you'll only have to fix it when they
> > realize there are
> > other ways of doing things that don't work with
> > per-thread
> > defaults. So, you find out why they think it's a
> > good thing; you
> > make it easy for them to code this way (without
> > actually using
> > per-thread defaults) and you don't make a fuss about
> > it. More
> > than likely, they won't either.
> >
> 
> I wrote directly to ask them exactly this last night.
> Let's forget the per-thread thing until we get an
> answer.

That's the way to go, Andy.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Fri Nov 12 09:44:14 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:44:14 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <007a01bf2caa$aabdef60$0501a8c0@bobcat>
Message-ID: <382BE16E.D17C80E1@lemburg.com>

Mark Hammond wrote:
> 
> > Mark Hammond wrote:
> > > Having a fixed, default encoding may make life slightly
> > more difficult
> > > when you want to work primarily in a different encoding,
> > but at least
> > > your system is predictable and reliable.
> >
> > I think the discussion on this is getting a little too hot.
> 
> Really - I see it as moving to a rational consensus that doesnt
> support the proposal in this regard.  I see no heat in it at all.  Im
> sorry if you saw my post or any of the followups as "emotional", but I
> certainly not getting passionate about this.  I dont see any of this
> as affecting me personally.  I believe that I can replace my Unicode
> implementation with this either way we go.  Just because a we are
> trying to get it right doesnt mean we are getting heated.

Naa... with "heated" I meant the "HP wants this, HP wants that" side
of things. We'll just have to wait for their answer on this one.

> > The point
> > is simply that the option of changing the per-thread default
> encoding
> > is there. You are not required to use it and if you do you are on
> > your own when something breaks.
> 
> Hrm - Im having serious trouble following your logic here.  If make
> _any_ assumptions about a default encoding, I am in danger of
> breaking.  I may not choose to change the default, but as soon as
> _anyone_ does, unrelated code may break.
> 
> I agree that I will be "on my own", but I wont necessarily have been
> the one that changed it :-(

Sure there are some very subtle dangers in setting the default
to anything other than the default ;-) For some this risk may
be worth taking, for others not. In fact, in large projects
I would never take such a risk... I'm sure we can get this 
message across to them.
 
> The only answer I can see is, as you suggest, to ignore the fact that
> there is _any_ default.  Always specify the encoding.  But obviously
> this is not good enough for HP:
> 
> > Think of it as a HP specific feature... perhaps I should wrap the
> code
> > in #ifdefs and leave it undocumented.
> 
> That would work - just ensure that no standard Python has those
> #ifdefs turned on :-)  I would be sorely dissapointed if the fact that
> HP are throwing money for this means they get every whim implemented
> in the core language.  Imagine the outcry if it were instead MS'
> money, and you were attempting to put an MS spin on all this.
> 
> Are you writing a module for HP, or writing a module for Python that
> HP are assisting by providing some funding?  Clear difference.  IMO,
> it must also be seen that there is a clear difference.
> 
> Maybe Im missing something.  Can you explain why it is good enough
> everyone else to be required to assume there is no default encoding,
> but HP get their thread specific global?  Are their requirements
> greater than anyone elses?  Is everyone else not as important?  What
> would you, as a consultant, recommend to people who arent HP, but have
> a similar requirement?  It would seem obvious to me that HPs
> requirement can be met in "pure Python", thereby keeping this out of
> the core all together...

Again, all I can try is convince them of not really needing
settable default encodings.


Since this is the first time a Python Consortium member is
pushing development, I think we can learn a lot here. For one,
it should be clear that money doesn't buy everything, OTOH,
we cannot put the whole thing at risk just because
of some minor disagreement that cannot be solved between the
parties. The standard solution for the latter should be a
customized Python interpreter.


-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Fri Nov 12 09:04:31 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:04:31 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <001001bf2cd3$6fa57820$fd2d153f@tim>
Message-ID: <382BD81F.B2BC896A@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > Note that if we make UTF-8 the standard encoding, nearly all
> > special Latin-1 characters will produce UTF-8 errors on input
> > and unreadable garbage on output. That will probably be unacceptable
> > in Europe. To remedy this, one would *always* have to use
> > u.encode('latin-1') to get readable output for Latin-1 strings
> > repesented in Unicode.
> 
> I think it's time for the Europeans to pronounce on what's acceptable in
> Europe.  To the limited extent that I can pretend I'm Eurpoean, I'm happy
> with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea.

Agreed.
 
> > I'd rather see this happen the other way around: *always* explicitly
> > state the encoding you want in case you rely on it, e.g. write
> >
> > file.write(u.encode('utf-8'))
> >
> > instead of
> >
> > file.write(u) # let's hope this goes out as UTF-8...
> 
> By the same argument, those pesky Europeans who are relying on Latin-1
> should write
> 
> file.write(u.encode('latin-1'))
> 
> instead of
> 
> file.write(u)  # let's hope this goes out as Latin-1

Right.
 
> > Using the  as site dependent setting is useful
> > for convenience in those cases where the output format should be
> > readable rather than parseable.
> 
> Well, "convenience" is always the argument advanced in favor of modes.
> Conflicts and nasty intermittent bugs are always the result.  The latter
> will happen under Guido's idea too, as various careless modules rebind stdin
> & stdout to their own ideas of what "the proper" encoding should be.  But at
> least the blame doesn't fall on the core language then <0.3 wink>.
> 
> Since there doesn't appear to be anything (either or good or bad) you can do
> (or avoid) by using Guido's scheme instead of magical core thread state,
> there's no *need* for the latter.  That is, it can be done with a user-level
> API without involving the core.

Dito :-)

I have nothing against telling people to take care about the problem
in user space (meaning: not done by the core interpreter) and I'm
pretty sure that HP will agree on this too, provided we give them
the proper user space tools like file wrappers et al.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Fri Nov 12 09:16:57 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:16:57 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
Message-ID: <382BDB09.55583F28@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
> > signal failure of this assertion at Unicode object construction time
> > via an exception. That way we are within the standard, can use
> > reasonably fast code for Unicode manipulation and add those extra 1M
> > character at a later stage.
> 
> I think this is reasonable.
> 
> Using UTF-8 internally is also reasonable, and if it's being rejected on the
> grounds of supposed slowness, that deserves a closer look (it's an ingenious
> encoding scheme that works correctly with a surprising number of existing
> 8-bit string routines as-is).  Indexing UTF-8 strings is greatly speeded by
> adding a simple finger (i.e., store along with the string an index+offset
> pair identifying the most recent position indexed to -- since string
> indexing is overwhelmingly sequential, this makes most indexing
> constant-time; and UTF-8 can be scanned either forward or backward from a
> random internal point because "the first byte" of each encoding is
> recognizable as such).

Here are some arguments for using the proposed UTF-16 strategy instead:

· all characters have the same length; indexing is fast
· conversion APIs to the platform-dependent wchar_t implementation are fast
  because they can either simply copy the content or extend the 2 bytes
  to 4 bytes
· UTF-8 needs 2 bytes for all the compound Latin-1 characters (e.g. u
  with two dots) which are used in many non-English languages
· from the Unicode Consortium FAQ: "Most Unicode APIs are using UTF-16."

Besides, the Unicode object will have a buffer containing the
<default encoding> representation of the object, which, if all goes
well, will always hold the UTF-8 value. RE engines etc. can then directly
work with this buffer.
 
> I expect either would work well.  It's at least curious that Perl and Tcl
> both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
> people here saying UCS-2 is the obviously better choice are all from the
> Microsoft camp .  It's not obvious to me, but then neither do I claim
> that UTF-8 is obviously better.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From gstein@lyra.org  Fri Nov 12 10:20:16 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 02:20:16 -0800 (PST)
Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit)
In-Reply-To: <382BE16E.D17C80E1@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> 
> Since this is the first time a Python Consortium member is
> pushing development, I think we can learn a lot here. For one,
> it should be clear that money doesn't buy everything, OTOH,
> we cannot put the whole thing at risk just because
> of some minor disagreement that cannot be solved between the
> parties. The standard solution for the latter should be a
> customized Python interpreter.
> 

hehe... funny you mention this. Go read the Consortium docs. Last time
that I read them, there are no "parties" to reach consensus. *Every*
technical decision regarding the Python language falls to the Technical
Director (Guido, of course). I looked. I found nothing that can override
the T.D.'s decisions and no way to force a particular decision.

Guido is still the Benevolent Dictator :-)

Cheers,
-g

p.s. yes, there is always the caveat that "sure, Guido has final say" but
"Al can fire him at will for being too stubborn" :-) ... but hey, Guido's
title does have the word Benevolent in it, so things are cool...

--
Greg Stein, http://www.lyra.org/




From gstein@lyra.org  Fri Nov 12 10:24:56 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 02:24:56 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382BE16E.D17C80E1@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> Sure there are some very subtile dangers in setting the default
> to anything other than the default ;-) For some this risk may
> be worthwhile taking, for others not. In fact, in large projects
> I would never take such a risk... I'm sure we can get this 
> message across to them.

It's a lot easier to just never provide the rope (per-thread default
encodings) in the first place.

If the feature exists, then it will be used. Period. Try to get the
message across until you're blue in the face, but it would be used.

Anyhow... discussion is pretty moot until somebody can state that it
is/isn't a "real requirement" and/or until The Guido takes a position.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Fri Nov 12 10:30:04 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 02:30:04 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
Message-ID: 

On Fri, 12 Nov 1999, Tim Peters wrote:
>...
> Using UTF-8 internally is also reasonable, and if it's being rejected on the
> grounds of supposed slowness

No... my main point was interaction with the underlying OS. I made a SWAG
(Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower
for various types of operations. As always, your infernal meddling has
dashed that hypothesis, so I must retreat...

>...
> I expect either would work well.  It's at least curious that Perl and Tcl
> both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
> people here saying UCS-2 is the obviously better choice are all from the
> Microsoft camp .  It's not obvious to me, but then neither do I claim
> that UTF-8 is obviously better.

Probably for the exact reason that you stated in your messages: many 8-bit
(7-bit?) functions continue to work quite well when given a UTF-8-encoded
string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter
to deal with a new string type.

I'd guess it is a helluva lot easier for us to add a Python Type than for
Perl or TCL to whack around with new string types (since they use strings
so heavily).

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From mal@lemburg.com  Fri Nov 12 10:30:28 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 11:30:28 +0100
Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization
 Toolkit)
References: 
Message-ID: <382BEC44.A2541C7E@lemburg.com>

Greg Stein wrote:
> 
> On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> > 
> > Since this is the first time a Python Consortium member is
> > pushing development, I think we can learn a lot here. For one,
> > it should be clear that money doesn't buy everything, OTOH,
> > we cannot put the whole thing at risk just because
> > of some minor disagreement that cannot be solved between the
> > parties. The standard solution for the latter should be a
> > customized Python interpreter.
> > 
> 
> hehe... funny you mention this. Go read the Consortium docs. Last time
> that I read them, there are no "parties" to reach consensus. *Every*
> technical decision regarding the Python language falls to the Technical
> Director (Guido, of course). I looked. I found nothing that can override
> the T.D.'s decisions and no way to force a particular decision.
> 
> Guido is still the Benevolent Dictator :-)

Sure, but have you considered the option of a member simply bailing
out ? HP could always stop funding Unicode integration. That wouldn't
help us either...
 
> Cheers,
> -g
> 
> p.s. yes, there is always the caveat that "sure, Guido has final say" but
> "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's
> title does have the word Benevolent in it, so things are cool...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From gstein@lyra.org  Fri Nov 12 10:39:45 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 02:39:45 -0800 (PST)
Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization
 Toolkit)
In-Reply-To: <382BEC44.A2541C7E@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
>...
> Sure, but have you considered the option of a member simply bailing
> out ? HP could always stop funding Unicode integration. That wouldn't
> help us either...

I'm not that dumb... come on. That was my whole point about "Benevolent"
below... Guido is a fair and reasonable Dictator... he wouldn't let that
happen.

>...
> > p.s. yes, there is always the caveat that "sure, Guido has final say" but
> > "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's
> > title does have the word Benevolent in it, so things are cool...


Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From Mike.Da.Silva@uk.fid-intl.com  Fri Nov 12 11:00:49 1999
From: Mike.Da.Silva@uk.fid-intl.com (Da Silva, Mike)
Date: Fri, 12 Nov 1999 11:00:49 -0000
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: 

Most of the ASCII string functions do indeed work for UTF-8.  I have made
extensive use of this feature when writing translation logic to harmonize
ASCII text (an SQL statement) with substitution parameters that must be
converted from IBM EBCDIC code pages (5035, 1027) into UTF8.  Since UTF-8 is
a superset of ASCII, this all works fine.

Some of the character classification functions etc can be flaky when used
with UTF8 characters outside the ASCII range, but simple string operations
work fine.

As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an
internal string representation are:

1.	UTF-8 allows all characters to be displayed (in some form or other)
on the users machine, with or without native fonts installed.  Naturally
anything outside the ASCII range will be garbage, but it is an immense
debugging aid when working with character encodings to be able to touch and
feel something recognizable.  Trying to decode a block of raw UTF-16 is a
pain.
2.	UTF-8 works with most existing string manipulation libraries quite
happily.  It is also portable (a char is always 8 bits, regardless of
platform; wchar_t varies between 16 and 32 bits depending on the underlying
operating system, although unsigned short does seem to work across
platforms, in my experience).
3.	UTF-16 has some advantages in providing fixed-width characters and
(ignoring surrogate pairs etc.) a modeless encoding space.  This is an
advantage for fast string operations, especially on CPUs that have
efficient operations for handling 16-bit data.
4.	UTF-16 would directly support a tightly coupled character properties
engine, which would enable Unicode compliant case folding and character
decomposition to be performed without an intermediate UTF-8 <----> UTF-16
translation step.
5.	UTF-16 requires string operations that do not make assumptions about
nulls - this means re-implementing most of the C runtime functions to work
with unsigned shorts.
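
A minimal sketch to make points 1-3 concrete (it assumes a Unicode string
type with an encode() method and codec names along the lines being
discussed; the sample text and byte counts are purely illustrative):

    s = u"Link\u00f6ping \u20ac"     # 11 characters: ASCII, Latin-1 and the euro sign
    utf8 = s.encode("utf-8")         # 14 bytes: one to three bytes per character
    utf16 = s.encode("utf-16-be")    # 22 bytes: a fixed two bytes per BMP character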

Regards,
Mike da Silva

	-----Original Message-----
	From:	Greg Stein [SMTP:gstein@lyra.org]
	Sent:	12 November 1999 10:30
	To:	Tim Peters
	Cc:	python-dev@python.org
	Subject:	RE: [Python-Dev] Internationalization Toolkit

	On Fri, 12 Nov 1999, Tim Peters wrote:
	>...
	> Using UTF-8 internally is also reasonable, and if it's being rejected on the
	> grounds of supposed slowness

	No... my main point was interaction with the underlying OS. I made a SWAG
	(Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower
	for various types of operations. As always, your infernal meddling has
	dashed that hypothesis, so I must retreat...

	>...
	> I expect either would work well.  It's at least curious that Perl and Tcl
	> both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
	> people here saying UCS-2 is the obviously better choice are all from the
	> Microsoft camp .  It's not obvious to me, but then neither do I claim
	> that UTF-8 is obviously better.

	Probably for the exact reason that you stated in your messages: many 8-bit
	(7-bit?) functions continue to work quite well when given a UTF-8-encoded
	string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter
	to deal with a new string type.

	I'd guess it is a helluva lot easier for us to add a Python Type than for
	Perl or TCL to whack around with new string types (since they use strings
	so heavily).

	Cheers,
	-g

	--
	Greg Stein, http://www.lyra.org/




From fredrik@pythonware.com  Fri Nov 12 11:23:24 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 12:23:24 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com>
Message-ID: <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>

> Besides, the Unicode object will have a buffer containing the
> <default encoding> representation of the object, which, if all goes
> well, will always hold the UTF-8 value.



over my dead body, that one...

(fwiw, over the last 20 years, I've implemented about a
dozen image processing libraries, supporting loads of
pixel layouts and file formats.  one important lesson
from that is to stick to a single internal representation,
and let the application programmers build their own
layers if they need to speed things up -- yes, they're
actually happier that way.  and text strings are not
that different from pixel buffers or sound streams or
scientific data sets, after all...)

(and sticks and modes will break your bones, but you
know that...)

> RE engines etc. can then directly work with this buffer.

sidebar: the RE engine that's being developed for this
project can handle 8-bit, 16-bit, and (optionally) 32-bit
text buffers. a single compiled expression can be used
with any character size, and performance is about the
same for all sizes (at least on any decent cpu).

> > I expect either would work well.  It's at least curious that Perl and Tcl
> > both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
> > people here saying UCS-2 is the obviously better choice are all from the
> > Microsoft camp .

(hey, I'm not a microsofter.  but I've been writing "i/o
libraries" for various "object types" all my life, so I do
have strong preferences on what works, and what
doesn't...  I use Python for good reasons, you know ;-)



thanks.  I feel better now.





From fredrik@pythonware.com  Fri Nov 12 11:23:38 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 12:23:38 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <027f01bf2d00$648745e0$f29b12c2@secret.pythonware.com>

> 5. UTF-16 requires string operations that do not make assumptions about
> nulls - this means re-implementing most of the C runtime functions to work
> with unsigned shorts.

footnote: the mad scientist has been there
and done that:

http://www.pythonware.com/madscientist/

(and you can replace "unsigned short" with
"whatever's suitable on this platform")





From fredrik@pythonware.com  Fri Nov 12 11:36:03 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 12:36:03 +0100
Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit)
References: 
Message-ID: <02a701bf2d02$20c66280$f29b12c2@secret.pythonware.com>

> Guido is a fair and reasonable Dictator... he wouldn't let that
> happen.

...but where is he when we need him? ;-)





From Mike.Da.Silva@uk.fid-intl.com  Fri Nov 12 11:43:21 1999
From: Mike.Da.Silva@uk.fid-intl.com (Da Silva, Mike)
Date: Fri, 12 Nov 1999 11:43:21 -0000
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: 

Fredrik Lundh wrote:

> 5. UTF-16 requires string operations that do not make assumptions about
> nulls - this means re-implementing most of the C runtime functions to work
> with unsigned shorts.

footnote: the mad scientist has been there and done that:
http://www.pythonware.com/madscientist/
 
(and you can replace "unsigned short" with "whatever's suitable on this
platform")

Surely using a different type on different platforms means that we throw
away the concept of a platform independent Unicode string?
I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits.
Does this mean that to transfer a file between a Windows box and Solaris, an
implicit conversion has to be done to go from 16 bits to 32 bits (and vice
versa)?  What about byte ordering issues?
Or do you mean whatever 16 bit data type is available on the platform, with
a standard (platform independent) byte ordering maintained?
Mike da S


From fredrik@pythonware.com  Fri Nov 12 12:16:24 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 13:16:24 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>

Mike wrote:
> Surely using a different type on different platforms means that we throw
> away the concept of a platform independent Unicode string?
> I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits.

so?  the interchange format doesn't have to be
the same as the internal format, does it?

> Does this mean that to transfer a file between a Windows box and Solaris, an
> implicit conversion has to be done to go from 16 bits to 32 bits (and vice
> versa)?  What about byte ordering issues?

no problem at all: unicode has special byte order
marks for this purpose (and utf-8 doesn't care, of
course).

> Or do you mean whatever 16 bit data type is available on the platform, with
> a standard (platform independent) byte ordering maintained?

well, my preference is a 16-bit data type in the plat-
form's native byte order (exactly how it's done in the
unicode module -- for the moment, it can use the
platform's wchar_t, but only if it happens to be a
16-bit unsigned type).  gives you good performance,
compact storage, and cleanest possible code.

...

anyway, I think it would help the discussion a little bit
if people looked at (and played with) the existing code
base.  at least that'll change arguments like "but then
we have to implement that" to "but then we have to
maintain that code" ;-)





From andy@robanal.demon.co.uk  Fri Nov 12 12:13:03 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Fri, 12 Nov 1999 04:13:03 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991112121303.27452.rocketmail@ web605.yahoomail.com>

--- "Da Silva, Mike" <Mike.Da.Silva@uk.fid-intl.com> wrote:
> As I see it, the relative pros and cons of UTF-8
> versus UTF-16 for use as an
> internal string representation are:
> [snip]
> Regards,
> Mike da Silva
> 

Note that by going with UTF16, we get both.  We will
certainly have a codec for utf8, just as we will for
ISO-Latin-1, Shift-JIS or whatever.  And a perfectly
ordinary Python string is a great place to hold UTF8;
you can look at it and use most of the ordinary string
algorithms on it.  
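
A small illustration of that point -- nothing new here, just the plain
string module applied to UTF-8 bytes:

    import string
    data = 'caf\303\251 society'     # UTF-8 bytes for "cafe" with an e-acute
    string.find(data, 'society')     # -> 6: works, but that is a byte offset
    len(data)                        # -> 13 bytes for what is really 12 characters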

I presume no one is actually advocating dropping
ordinary Python strings, or the ability to do
   rawdata = open('myfile.txt', 'rb').read()
without any transformations?


- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From mhammond@skippinet.com.au  Fri Nov 12 12:27:19 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 12 Nov 1999 23:27:19 +1100
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
Message-ID: <007e01bf2d09$44738440$0501a8c0@bobcat>

/F writes
> anyway, I think it would help the discussion a little bit
> if people looked at (and played with) the existing code
> base.  at least that'll change arguments like "but then
> we have to implement that" to "but then we have to
> maintain that code" ;-)

I second that.  It is good enough for me (although my requirements
aren't stringent) - it's been used on CE, so would slot directly into
the win32 stuff.  It is pretty much the consensus of the string-sig of
last year, but as code!

The only "problem" with it is the code that hasn't been written yet,
specifically:
* Encoders as streams, and a concrete proposal for them.
* Decent PyArg_ParseTuple support and Py_BuildValue support.
* The ord(), chr() stuff, and other stuff around the edges no doubt.

Couldn't we start with Fredrik's implementation, and see how the rest
turns out?  Even if we do choose to change the underlying Unicode
implementation to use a different native encoding, the interface to
the PyUnicode_Type would remain pretty similar.  The advantage is that
we have something now to start working with for the rest of the
support we need.

Mark.



From mal@lemburg.com  Fri Nov 12 12:38:44 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 13:38:44 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.4
Message-ID: <382C0A54.E6E8328D@lemburg.com>

I've uploaded a new version of the proposal which incorporates
a lot of what has been discussed on the list.

Thanks to everybody who helped so far. Note that I have extended
the list of references for those who want to join in, but are
in need of more background information.

The latest version of the proposal is available at:

	http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

	http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

    · support for line breaks (see
      http://www.unicode.org/unicode/reports/tr13/ )

    · support for case conversion:

      Problems: string lengths can change due to multiple
      characters being mapped to a single new one, capital letters
      starting a word can be different than ones occurring in the
      middle, there are locale dependent deviations from the standard
      mappings.

    · support for numbers, digits, whitespace, etc.

    · support (or no support) for private code point areas

    · should Unicode objects support %-formatting ?

    One possibility would be to emulate this via strings and
    <default encoding>:

    s = '%s %i abcäöü' # a Latin-1 encoded string
    t = (u,3)

    # Convert Latin-1 s to a <default encoding> string
    s1 = unicode(s,'latin-1').encode()

    # The '%s' will now add u in <default encoding>
    s2 = s1 % t

    # Finally, convert the <default encoding> encoded string to Unicode
    u1 = unicode(s2)

    · specifying file wrappers:

    Open issues: what to do with Python strings
    fed to the .write() method (may need to know the encoding of the
    strings) and when/if to return Python strings through the .read()
    method.

    Perhaps we need more than one type of wrapper here.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Fri Nov 12 13:11:26 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 14:11:26 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
Message-ID: <382C11FE.D7D9F916@lemburg.com>

Fredrik Lundh wrote:
> 
> > Besides, the Unicode object will have a buffer containing the
> > > <default encoding> representation of the object, which, if all goes
> > well, will always hold the UTF-8 value.
> 
> 
> 
> over my dead body, that one...

Such a buffer is needed to implement "s" and "s#" argument
parsing. It's a simple requirement to support those two
parsing markers -- there's not much to argue about, really...
unless, of course, you want to give up Unicode object support
for all APIs using these parsers.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mal@lemburg.com  Fri Nov 12 13:01:28 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 14:01:28 +0100
Subject: [Python-Dev] Internationalization Toolkit
References:  <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
Message-ID: <382C0FA8.ACB6CCD6@lemburg.com>

Fredrik Lundh wrote:
> 
> Mike wrote:
> > Surely using a different type on different platforms means that we throw
> > away the concept of a platform independent Unicode string?
> > I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits.
> 
> so?  the interchange format doesn't have to be
> the same as the internal format, does it?

The interchange format (marshal + pickle) is defined as UTF-8,
so there's no problem with endianness or missing bits w/r to
shipping Unicode data from one platform to another.
 
> > Does this mean that to transfer a file between a Windows box and Solaris, an
> > implicit conversion has to be done to go from 16 bits to 32 bits (and vice
> > versa)?  What about byte ordering issues?
> 
> no problem at all: unicode has special byte order
> marks for this purpose (and utf-8 doesn't care, of
> course).

Access to this mark will go into sys: sys.bom.
 
> > Or do you mean whatever 16 bit data type is available on the platform, with
> > a standard (platform independent) byte ordering maintained?
> 
> well, my preference is a 16-bit data type in the plat-
> form's native byte order (exactly how it's done in the
> unicode module -- for the moment, it can use the
> platform's wchar_t, but only if it happens to be a
> 16-bit unsigned type).  gives you good performance,
> compact storage, and cleanest possible code.

The 0.4 proposal fixes this to 16-bit unsigned short
using UTF-16 encoding with checks for surrogates. This covers
all defined standard Unicode character points, is fast, etc. pp...
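
For anyone who hasn't looked at surrogates yet, the check boils down to
something like this (sketch only; the helper names are made up, the
constants come straight from the Unicode standard):

    def utf16_surrogates(code):
        # split a code point above 0xFFFF into a UTF-16 surrogate pair
        code = code - 0x10000
        return 0xD800 | (code >> 10), 0xDC00 | (code & 0x3FF)

    def is_surrogate(unit):
        # any 16-bit unit in the range D800-DFFF is half of such a pair
        return 0xD800 <= unit <= 0xDFFF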

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Fri Nov 12 11:15:15 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 12:15:15 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <382BF6C3.D79840EC@lemburg.com>

"Da Silva, Mike" wrote:
> 
> Most of the ASCII string functions do indeed work for UTF-8.  I have made
> extensive use of this feature when writing translation logic to harmonize
> ASCII text (an SQL statement) with substitution parameters that must be
> converted from IBM EBCDIC code pages (5035, 1027) into UTF8.  Since UTF-8 is
> a superset of ASCII, this all works fine.
> 
> Some of the character classification functions etc can be flaky when used
> with UTF8 characters outside the ASCII range, but simple string operations
> work fine.

That's why there's the <default encoding> buffer which holds the UTF-8
encoded value...
 
> As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an
> internal string representation are:
> 
> 1.      UTF-8 allows all characters to be displayed (in some form or other)
> on the users machine, with or without native fonts installed.  Naturally
> anything outside the ASCII range will be garbage, but it is an immense
> debugging aid when working with character encodings to be able to touch and
> feel something recognizable.  Trying to decode a block of raw UTF-16 is a
> pain.

True.

> 2.      UTF-8 works with most existing string manipulation libraries quite
> happily.  It is also portable (a char is always 8 bits, regardless of
> platform; wchar_t varies between 16 and 32 bits depending on the underlying
> operating system (although unsigned short does seems to work across
> platforms, in my experience).

You mean with the compiler applying the needed 16->32 bit extension ?

> 3.      UTF-16 has some advantages in providing fixed width characters and,
> (ignoring surrogate pairs etc) a modeless encoding space.  This is an
> advantage for fast string operations, especially on CPU's that have
> efficient operations for handling 16bit data.

Right, and this is a major argument for using 16-bit encodings without
state internally.

> 4.      UTF-16 would directly support a tightly coupled character properties
> engine, which would enable Unicode compliant case folding and character
> decomposition to be performed without an intermediate UTF-8 <----> UTF-16
> translation step.

Could you elaborate on this one ? It is one of the open issues
in the proposal.

> 5.      UTF-16 requires string operations that do not make assumptions about
> nulls - this means re-implementing most of the C runtime functions to work
> with unsigned shorts.

AFAIK, the RE engines in Python are 8-bit clean...

BTW, wouldn't it be possible to take pcre and have it
use Py_Unicode instead of char ? [Of course, there would have to
be some extensions for character classes etc.]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fredrik@pythonware.com  Fri Nov 12 13:43:12 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 14:43:12 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com>
Message-ID: <005201bf2d13$ddd75ad0$f29b12c2@secret.pythonware.com>

> > > Besides, the Unicode object will have a buffer containing the
> > > <default encoding> representation of the object, which, if all goes
> > > well, will always hold the UTF-8 value.
> > 
> > 
> > 
> > over my dead body, that one...
> 
> Such a buffer is needed to implement "s" and "s#" argument
> parsing. It's a simple requirement to support those two
> parsing markers -- there's not much to argue about, really...

why?  I don't understand why "s" and "s#" has
to deal with encoding issues at all...

> unless, of course, you want to give up Unicode object support
> for all APIs using these parsers.

hmm.  maybe that's exactly what I want...





From fdrake@acm.org  Fri Nov 12 14:34:56 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Fri, 12 Nov 1999 09:34:56 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: <382C11FE.D7D9F916@lemburg.com>
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
 <382BDB09.55583F28@lemburg.com>
 <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
 <382C11FE.D7D9F916@lemburg.com>
Message-ID: <14380.9616.245419.138261@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > Such a buffer is needed to implement "s" and "s#" argument
 > parsing. It's a simple requirement to support those two
 > parsing markers -- there's not much to argue about, really...
 > unless, of course, you want to give up Unicode object support
 > for all APIs using these parsers.

  Perhaps I missed the agreement that these should always receive
UTF-8 from Unicode strings.  Was this agreed upon, or has it simply
not been argued over in favor of other topics?
  If this has indeed been agreed upon... at least it can be computed
on demand rather than at initialization!  Perhaps there should be two
pointers: one to the UTF-8 buffer and one to a PyObject; if the
PyObject is there it's a "old-style" string that's actually providing
the buffer.  This may or may not be a good idea; there's a lot of
memory expense for long Unicode strings converted from UTF-8 that
aren't ever converted back to UTF-8 or accessed using "s" or "s#".
Ok, I've talked myself out of that.  ;-)


  -Fred

--
Fred L. Drake, Jr.	     <fdrake@acm.org>
Corporation for National Research Initiatives


From fdrake@acm.org  Fri Nov 12 14:57:15 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Fri, 12 Nov 1999 09:57:15 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382C0FA8.ACB6CCD6@lemburg.com>
References: 
 <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
 <382C0FA8.ACB6CCD6@lemburg.com>
Message-ID: <14380.10955.420102.327867@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > Access to this mark will go into sys: sys.bom.

  Can the name in sys be a little more descriptive?
sys.byte_order_mark would be reasonable.
  I think that a support module (possibly unicodec) should provide
constants for all four byte order marks as strings (2- & 4-byte,
little- and big-endian).  Names could be short BOM_2_LE, BOM_4_LE,
etc.


  -Fred

--
Fred L. Drake, Jr.	     <fdrake@acm.org>
Corporation for National Research Initiatives


From fredrik@pythonware.com  Fri Nov 12 15:00:45 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 16:00:45 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim><382BDB09.55583F28@lemburg.com><027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com><382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us>
Message-ID: <009101bf2d1f$21f5b490$f29b12c2@secret.pythonware.com>

Fred L. Drake, Jr.  wrote:
> M.-A. Lemburg writes:
>  > Such a buffer is needed to implement "s" and "s#" argument
>  > parsing. It's a simple requirement to support those two
>  > parsing markers -- there's not much to argue about, really...
>  > unless, of course, you want to give up Unicode object support
>  > for all APIs using these parsers.
>
>   Perhaps I missed the agreement that these should always receive
> UTF-8 from Unicode strings.

from unicode import *

def getname():
    # hidden in some database engine, or so...
    return unicode("Linköping", "iso-8859-1")

...

name = getname()

# emulate automatic conversion to utf-8
name = str(name)

# print it in uppercase, in the usual way
import string
print string.upper(name)

## LINKÃ¶PING

I don't know, but I think that I think that it
perhaps should raise an exception instead...





From mal@lemburg.com  Fri Nov 12 15:17:43 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 16:17:43 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <005201bf2d13$ddd75ad0$f29b12c2@secret.pythonware.com>
Message-ID: <382C2F97.8E7D7A4D@lemburg.com>

Fredrik Lundh wrote:
> 
> > > > Besides, the Unicode object will have a buffer containing the
> > > > <default encoding> representation of the object, which, if all goes
> > > > well, will always hold the UTF-8 value.
> > >
> > > 
> > >
> > > over my dead body, that one...
> >
> > Such a buffer is needed to implement "s" and "s#" argument
> > parsing. It's a simple requirement to support those two
> > parsing markers -- there's not much to argue about, really...
> 
> why?  I don't understand why "s" and "s#" has
> to deal with encoding issues at all...
> 
> > unless, of course, you want to give up Unicode object support
> > for all APIs using these parsers.
> 
> hmm.  maybe that's exactly what I want...

If we don't add that support, lots of existing APIs won't
accept Unicode objects instead of strings. While it could be
argued that automatic conversion to UTF-8 is not transparent
enough for the user, the other solution of using str(u)
everywhere would probably make writing Unicode-aware code a
rather clumsy task and introduce other pitfalls, since str(obj)
calls PyObject_Str() which also works on integers, floats,
etc.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Fri Nov 12 15:50:33 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 16:50:33 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
 <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
 <382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us>
Message-ID: <382C3749.198EEBC6@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > Access to this mark will go into sys: sys.bom.
> 
>   Can the name in sys be a little more descriptive?
> sys.byte_order_mark would be reasonable.

The abbreviation BOM is quite common w/r to Unicode.

>   I think that a support module (possibly unicodec) should provide
> constants for all four byte order marks as strings (2- & 4-byte,
> little- and big-endian).  Names could be short BOM_2_LE, BOM_4_LE,
> etc.

Good idea...

sys.bom should return the byte order mark (BOM) for the format used
internally. The unicodec module should provide symbols for all
possible values of this variable:

  BOM_BE: '\376\377' 
    (corresponds to Unicode 0x0000FEFF in UTF-16 
     == ZERO WIDTH NO-BREAK SPACE)

  BOM_LE: '\377\376' 
    (corresponds to Unicode 0x0000FFFE in UTF-16 
     == illegal Unicode character)

  BOM4_BE: '\000\000\376\377'
    (corresponds to Unicode 0x0000FEFF in UCS-4)

  BOM4_LE: '\377\376\000\000'
    (corresponds to Unicode 0xFFFE0000 in UCS-4
     == illegal Unicode character)

Note that Unicode sees big endian byte order as being "correct". The
swapped order is taken to be an indicator for a "wrong" format, hence
the illegal character definition.
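
A byte order check based on these constants could look something like this
(sketch only; the constant values are repeated from above and the function
name is made up):

    BOM_BE = '\376\377'
    BOM_LE = '\377\376'

    def detect_byte_order(data):
        # peek at the first two bytes of UTF-16 data
        if data[:2] == BOM_BE:
            return 'big'
        elif data[:2] == BOM_LE:
            return 'little'
        else:
            return None    # no BOM -- the caller has to assume an order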

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mal@lemburg.com  Fri Nov 12 15:24:33 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 16:24:33 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
 <382BDB09.55583F28@lemburg.com>
 <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
 <382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us>
Message-ID: <382C3131.A8965CA5@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > Such a buffer is needed to implement "s" and "s#" argument
>  > parsing. It's a simple requirement to support those two
>  > parsing markers -- there's not much to argue about, really...
>  > unless, of course, you want to give up Unicode object support
>  > for all APIs using these parsers.
> 
>   Perhaps I missed the agreement that these should always receive
> UTF-8 from Unicode strings.  Was this agreed upon, or has it simply
> not been argued over in favor of other topics?

It's been in the proposal since version 0.1. The idea is to
provide a decent way of making existing scripts Unicode aware.

>   If this has indeed been agreed upon... at least it can be computed
> on demand rather than at initialization!

This is what I intended to implement. The <default encoding> buffer
will be filled upon the first request to the UTF-8 encoding.
"s" and "s#" are examples of such requests. The buffer will
remain intact until the object is destroyed (since other code
could store the pointer received via e.g. "s").

> Perhaps there should be two
> pointers: one to the UTF-8 buffer and one to a PyObject; if the
> PyObject is there it's a "old-style" string that's actually providing
> the buffer.  This may or may not be a good idea; there's a lot of
> memory expense for long Unicode strings converted from UTF-8 that
> aren't ever converted back to UTF-8 or accessed using "s" or "s#".
> Ok, I've talked myself out of that.  ;-)

Note that Unicode objects are a completely different beast ;-)
String objects are not touched in any way by the proposal.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fdrake@acm.org  Fri Nov 12 16:22:24 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Fri, 12 Nov 1999 11:22:24 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382C3749.198EEBC6@lemburg.com>
References: 
 <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
 <382C0FA8.ACB6CCD6@lemburg.com>
 <14380.10955.420102.327867@weyr.cnri.reston.va.us>
 <382C3749.198EEBC6@lemburg.com>
Message-ID: <14380.16064.723277.586881@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > The abbreviation BOM is quite common w/r to Unicode.

  Yes: "w/r to Unicode".  In sys, it's out of context and should
receive a more descriptive name.  I think using BOM in unicodec is
good.

 >   BOM_BE: '\376\377' 
 >     (corresponds to Unicode 0x0000FEFF in UTF-16 
 >      == ZERO WIDTH NO-BREAK SPACE)

  I'd also add BOM to be the same as sys.byte_order_mark.  Perhaps
even instead of sys.byte_order_mark (just to localize the areas of
code that are affected).

 > Note that Unicode sees big endian byte order as being "correct". The

  A lot of us do.  ;-)


  -Fred

--
Fred L. Drake, Jr.	     <fdrake@acm.org>
Corporation for National Research Initiatives


From fdrake@acm.org  Fri Nov 12 16:28:37 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Fri, 12 Nov 1999 11:28:37 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: <382C3131.A8965CA5@lemburg.com>
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
 <382BDB09.55583F28@lemburg.com>
 <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
 <382C11FE.D7D9F916@lemburg.com>
 <14380.9616.245419.138261@weyr.cnri.reston.va.us>
 <382C3131.A8965CA5@lemburg.com>
Message-ID: <14380.16437.71847.832880@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > It's been in the proposal since version 0.1. The idea is to
 > provide a decent way of making existing script Unicode aware.

  Ok, so I haven't read closely enough.

 > This is what I intended to implement. The <default encoding> buffer
 > will be filled upon the first request to the UTF-8 encoding.
 > "s" and "s#" are examples of such requests. The buffer will
 > remain intact until the object is destroyed (since other code
 > could store the pointer received via e.g. "s").

  Right.

 > Note that Unicode object are completely different beast ;-)
 > String object are not touched in any way by the proposal.

  I wasn't suggesting the PyStringObject be changed, only that the
PyUnicodeObject could maintain a reference.  Consider:

        s = fp.read()
        u = unicode(s, 'utf-8')

u would now hold a reference to s, and s/s# would return a pointer
into s instead of re-building the UTF-8 form.  I talked myself out of
this because it would be too easy to keep a lot more string objects
around than were actually needed.


  -Fred

--
Fred L. Drake, Jr.	     <fdrake@acm.org>
Corporation for National Research Initiatives


From jack@oratrix.nl  Fri Nov 12 16:33:46 1999
From: jack@oratrix.nl (Jack Jansen)
Date: Fri, 12 Nov 1999 17:33:46 +0100
Subject: [Python-Dev] just say no...
In-Reply-To: Message by "M.-A. Lemburg"  ,
 Fri, 12 Nov 1999 16:24:33 +0100 , <382C3131.A8965CA5@lemburg.com>
Message-ID: <19991112163347.5527635BB1E@snelboot.oratrix.nl>

The problem with "s" and "s#"  is that they're already semantically 
overloaded, and will become more so with support for multiple charsets.

Some modules use "s#" when they mean "give me a pointer to an area of memory 
and its length". Writing to binary files is an example of this.

Some modules use it to mean "give me a pointer to a string". Writing to a text 
file is (probably) an example of this.

Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This 
is the case if we're going to actually look at the contents (think of 
string.upper() and such).

I think that the only real solution is to define what "s" means, come up with 
new getarg-formats for the other two use cases and convert all modules to use 
the new standard. It'll still cause grief to extension modules that aren't 
part of the core, but at least the problem will go away after a while.
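
From the Python side, the three cases look roughly like this (the file names
and values are made up; the point is only which meaning of "give me a
string" each call relies on):

    import string
    data = '\000\001\377'                  # arbitrary binary bytes
    text = 'hello world\n'
    name = 'MixedCase.txt'
    open("dump.bin", "wb").write(data)     # 1: a chunk of memory; only the length matters
    open("notes.txt", "w").write(text)     # 2: a string headed for a text file
    string.upper(name)                     # 3: 8-bit text whose contents get inspected
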
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 




From mal@lemburg.com  Fri Nov 12 18:36:55 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 19:36:55 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
 <382BDB09.55583F28@lemburg.com>
 <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
 <382C11FE.D7D9F916@lemburg.com>
 <14380.9616.245419.138261@weyr.cnri.reston.va.us>
 <382C3131.A8965CA5@lemburg.com> <14380.16437.71847.832880@weyr.cnri.reston.va.us>
Message-ID: <382C5E47.21FB4DD@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > It's been in the proposal since version 0.1. The idea is to
>  > provide a decent way of making existing script Unicode aware.
> 
>   Ok, so I haven't read closely enough.
> 
>  > This is what I intended to implement. The <default encoding> buffer
>  > will be filled upon the first request to the UTF-8 encoding.
>  > "s" and "s#" are examples of such requests. The buffer will
>  > remain intact until the object is destroyed (since other code
>  > could store the pointer received via e.g. "s").
> 
>   Right.
> 
>  > Note that Unicode object are completely different beast ;-)
>  > String object are not touched in any way by the proposal.
> 
>   I wasn't suggesting the PyStringObject be changed, only that the
> PyUnicodeObject could maintain a reference.  Consider:
> 
>         s = fp.read()
>         u = unicode(s, 'utf-8')
> 
> u would now hold a reference to s, and s/s# would return a pointer
> into s instead of re-building the UTF-8 form.  I talked myself out of
> this because it would be too easy to keep a lot more string objects
> around than were actually needed.

Agreed. Also, the encoding would always be correct. The buffer
will always hold the <default encoding> version (which should
be UTF-8...).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From gstein@lyra.org  Fri Nov 12 22:19:15 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 14:19:15 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <007e01bf2d09$44738440$0501a8c0@bobcat>
Message-ID: 

On Fri, 12 Nov 1999, Mark Hammond wrote:
> Couldnt we start with Fredriks implementation, and see how the rest
> turns out?  Even if we do choose to change the underlying Unicode
> implementation to use a different native encoding, the interface to
> the PyUnicode_Type would remain pretty similar.  The advantage is that
> we have something now to start working with for the rest of the
> support we need.

I agree with "start with" here, and will go one step further (which Mark
may have implied) -- *check in* Fredrik's code.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Fri Nov 12 22:59:03 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 14:59:03 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <382C11FE.D7D9F916@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> Fredrik Lundh wrote:
> > > Besides, the Unicode object will have a buffer containing the
> > > <default encoding> representation of the object, which, if all goes
> > > well, will always hold the UTF-8 value.
> > 
> > 
> > 
> > over my dead body, that one...
> 
> Such a buffer is needed to implement "s" and "s#" argument
> parsing. It's a simple requirement to support those two
> parsing markers -- there's not much to argue about, really...
> unless, of course, you want to give up Unicode object support
> for all APIs using these parsers.

Bull!

You can easily support "s#" by returning the pointer to the
Unicode buffer. The *entire* reason for introducing "t#" is to
differentiate between returning a pointer to an 8-bit [character] buffer
and a not-8-bit buffer.

In other words, the work done to introduce "t#" was done *SPECIFICALLY* to
allow "s#" to return a pointer to the Unicode data.

I am with Fredrik on that auxiliary buffer. You'll have two dead bodies
to deal with :-)

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Fri Nov 12 23:05:11 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 15:05:11 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <19991112163347.5527635BB1E@snelboot.oratrix.nl>
Message-ID: 

This was done last year!! We have "s#" meaning "give me some bytes." We
have "t#" meaning "give me some 8-bit characters." The Python distribution
has been completely updated to use the appropriate format in each call.

This was done *specifically* to support the introduction of a Unicode type.
The intent was that "s#" returns the *raw* bytes of the Unicode string --
NOT a UTF-8 encoding!

As a separate argument, MAL can argue that "t#" should create an internal,
associated buffer to hold a UTF-8 encoding and then return that. But the
"s#" should return the raw bytes!
[ and I'll argue against the response to "t#" anyhow... ]

-g

On Fri, 12 Nov 1999, Jack Jansen wrote:
> The problem with "s" and "s#"  is that they're already semantically 
> overloaded, and will become more so with support for multiple charsets.
> 
> Some modules use "s#" when they mean "give me a pointer to an area of memory 
> and its length". Writing to binary files is an example of this.
> 
> Some modules use it to mean "give me a pointer to a string". Writing to a text 
> file is (probably) an example of this.
> 
> Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This 
> is the case if we're going to actually look at the contents (think of 
> string.upper() and such).
> 
> I think that the only real solution is to define what "s" means, come up with 
> new getarg-formats for the other two use cases and convert all modules to use 
> the new standard. It'll still cause grief to extension modules that aren't 
> part of the core, but at least the problem will go away after a while.
> --
> Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
> Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
> www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 
> 
> 
> 
> 

--
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Fri Nov 12 23:09:13 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 15:09:13 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <382C2F97.8E7D7A4D@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> Fredrik Lundh wrote:
>...
> > why?  I don't understand why "s" and "s#" has
> > to deal with encoding issues at all...
> > 
> > > unless, of course, you want to give up Unicode object support
> > > for all APIs using these parsers.
> > 
> > hmm.  maybe that's exactly what I want...
> 
> If we don't add that support, lot's of existing APIs won't
> accept Unicode object instead of strings. While it could be
> argued that automatic conversion to UTF-8 is not transparent
> enough for the user, the other solution of using str(u)
> everywhere would probably make writing Unicode-aware code a
> rather clumsy task and introduce other pitfalls, since str(obj)
> calls PyObject_Str() which also works on integers, floats,
> etc.

No no no...

"s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are
supposed to return the raw bytes.

If a caller wants 8-bit characters, then that caller will use "t#".

If you want to argue for that separate, encoded buffer, then argue for it
for support for the "t#" format. But do NOT say that it is needed for "s#"
which simply means "give me some bytes."

-g

--
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Fri Nov 12 23:26:08 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 15:26:08 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <14380.16064.723277.586881@weyr.cnri.reston.va.us>
Message-ID: 

On Fri, 12 Nov 1999, Fred L. Drake, Jr. wrote:
> M.-A. Lemburg writes:
>  > The abbreviation BOM is quite common w/r to Unicode.

True.

>   Yes: "w/r to Unicode".  In sys, it's out of context and should
> receive a more descriptive name.  I think using BOM in unicodec is
> good.

I agree and believe that we can avoid putting it into sys altogether.

>  >   BOM_BE: '\376\377' 
>  >     (corresponds to Unicode 0x0000FEFF in UTF-16 
>  >      == ZERO WIDTH NO-BREAK SPACE)

Are you sure about that interpretation? I thought the BOM characters
(0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space.

>   I'd also add BOM to be the same as sys.byte_order_mark.  Perhaps
> even instead of sys.byte_order_mark (just to localize the areas of
> code that are affected).

### unicodec.py ###
import struct

BOM = struct.pack('h', 0x0000FEFF)
BOM_BE = '\376\377'
...


If somebody needs the BOM, then they should go to unicodec.py (or some
other module). I do not believe we need to put that stuff into the sys
module. It is just too easy to create the value in Python.

Cheers,
-g

p.s. to be pedantic, the pack() format could be '@h'

--
Greg Stein, http://www.lyra.org/



From mhammond@skippinet.com.au  Fri Nov 12 23:41:16 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 13 Nov 1999 10:41:16 +1100
Subject: [Python-Dev] just say no...
In-Reply-To: 
Message-ID: <008601bf2d67$6a9982b0$0501a8c0@bobcat>

[Greg writes]

> As a separate argument, MAL can argue that "t#" should create
> an internal,
> associated buffer to hold a UTF-8 encoding and then return
> that. But the
> "s#" should return the raw bytes!
> [ and I'll argue against the response to "t#" anyhow... ]

Hmm.  Climbing over these dead bodies could get a bit smelly :-)

I'm inclined to agree that holding 2 internal buffers for the unicode
object is not ideal.  However, I _am_ concerned with getting decent
PyArg_ParseTuple and Py_BuildValue support, and if the cost is an
extra buffer I will survive.  So let's look for solutions that don't
require it, rather than holding it up as evil when no other solution
is obvious.

My requirements appear to me to be very simple (for an anglophile):

Let's say I have a platform Unicode value - eg, I got a Unicode value
from some external library (say COM :-)  Let's assume for now that the
Unicode string is fully representable as ASCII  - say a file or
directory name that COM gave me.  I simply want to be able to pass
this Unicode object to "open()", and have it work.  This assumes that
open() will not become "native unicode", simply as the underlying C
support is not unicode aware - it needs to be converted to a "char *"
(ie, will use the "t#" format)

The second side of the equation is when I expose a Python function
that talks Unicode - eg, I need to _pass_ a platform Unicode value to
an external library.  The Python programmer should be able to pass a
Unicode object (no problem), or a PyString object.

In code terms:
Prob1:
  name = SomeComObject.GetFileName() # A Unicode object
  f = open(name)
Prob2:
  SomeComObject.SetFileName("foo.txt")

IMO it is important that we have a good strategy for dealing with this
for extensions.  MAL addresses one direction, but not the other.

Maybe if we toss around general solutions for this the implementation
will fall out.  MAL's idea of the additional buffer starts to address
this, but isn't the whole story.

Any ideas on this?



From gstein@lyra.org  Sat Nov 13 00:49:34 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 16:49:34 -0800 (PST)
Subject: [Python-Dev] argument parsing (was: just say no...)
In-Reply-To: <008601bf2d67$6a9982b0$0501a8c0@bobcat>
Message-ID: 

On Sat, 13 Nov 1999, Mark Hammond wrote:
>...
> Im inclined to agree that holding 2 internal buffers for the unicode
> object is not ideal.  However, I _am_ concerned with getting decent
> PyArg_ParseTuple and Py_BuildValue support, and if the cost is an
> extra buffer I will survive.  So lets look for solutions that dont
> require it, rather than holding it up as evil when no other solution
> is obvious.

I believe Py_BuildValue is pretty straight-forward. Simply state that it
is allowed to perform conversions and place the resulting object into the
resulting tuple.
(with appropriate refcounting)

In other words:

  tuple = Py_BuildValue("U", stringOb);

The stringOb will be converted to a Unicode object. The new Unicode object
will go into the tuple (with the tuple holding the only reference!). The
stringOb will NOT acquire any additional references.

[ "U" format may be wrong; it is here for example purposes ]


Okay... now the PyArg_ParseTuple() is the *real* kicker.

>...
> Prob1:
>   name = SomeComObject.GetFileName() # A Unicode object
>   f = open(name)
> Prob2:
>   SomeComObject.SetFileName("foo.txt")

Both of these issues are due to PyArg_ParseTuple. In Prob1, you want a
string-like object which can be passed to the OS as an 8-bit string. In
Prob2, you want a string-like object which can be passed to the OS as a
Unicode string.

I see three options for PyArg_ParseTuple:

1) allow it to return NEW objects which must be DECREF'd.
   [ current policy only loans out references ]

   This option could be difficult in the presence of errors during the
   parse. For example, the current idiom is:

     if (!PyArg_ParseTuple(args, "..."))
        return NULL;

   If an object was produced, but then a later argument caused a failure,
   then who is responsible for freeing the object?

2) like step 1, but PyArg_ParseTuple is smart enough to NOT return any new
   objects when an error occurred.

   This basically answers the last question in option (1) -- ParseTuple is
   responsible.

3) Return loaned-out-references to objects which have been tested for
   convertability. Helper functions perform the conversion and the caller
   will then free the reference.
   [ this is the model used in PyWin32 ]

   Code in PyWin32 typically looks like:

     if (!PyArg_ParseTuple(args, "O", &ob))
       return NULL;
     if ((unicodeOb = GiveMeUnicode(ob)) == NULL)
       return NULL;
     ...
     Py_DECREF(unicodeOb);

   [ GiveMeUnicode is descriptive here; I forget the name used in PyWin32 ]

   In a "real" situation, the ParseTuple format would be "U" and the
   object would be type-tested for PyStringType or PyUnicodeType.

   Note that GiveMeUnicode() would also do a type-test, but it can't
   produce a *specific* error like ParseTuple (e.g. "string/unicode object
   expected" vs "parameter 3 must be a string/unicode object")

Are there more options? Anybody?


All three of these avoid the secondary buffer. The last is cleanest w.r.t.
keeping the existing "loaned references" behavior, but can get a bit
wordy when you need to convert a bunch of string arguments.

Option (2) adds a good amount of complexity to PyArg_ParseTuple -- it
would need to keep a "free list" in case an error occurred.

Option (1) adds DECREF logic to callers to ensure they clean up. The add'l
logic isn't much more than the other two options (the only change is
adding DECREFs before returning NULL from the "if (!PyArg_ParseTuple..."
condition). Note that the caller would probably need to initialize each
object to NULL before calling ParseTuple.


Personally, I prefer (3) as it makes it very clear that a new object has
been created and must be DECREF'd at some point. Also note that
GiveMeUnicode() could also accept a second argument for the type of
decoding to do (or NULL meaning "UTF-8").

Oh: note there are equivalents of all options for going from
unicode-to-string; the above is all about string-to-unicode. However, the
tricky part of unicode-to-string is determining whether backwards
compatibility will be a requirement. i.e. does existing code that uses the
"t" format suddenly achieve the capability to accept a Unicode object?
This obviously causes problems in all three options: since a new reference
must be created to handle the situation, then who DECREF's it? The old
code certainly doesn't.
[  I'm with Fredrik in saying "no, old code *doesn't* suddenly get
  the ability to accept a Unicode object." The Python code must use str() to
  do the encoding manually (until the old code is upgraded to one of the
  above three options).  ]
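
In other words, until an extension is upgraded, the Python caller does the
encoding explicitly, along the lines of the sketch below (old_api() is a
made-up stand-in; unicode() and str() behave as described in the proposal):

    def old_api(s):
        # stands in for any extension function that still wants an 8-bit string
        return len(s)

    name = unicode("Link\366ping", "latin-1")   # some Unicode value
    old_api(str(name))                          # str() hands over the default (UTF-8) encoding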

I think that's it for me. In the several years I've been thinking on this
problem, I haven't come up with anything but the above three. There may be
a whole new paradigm for argument parsing, but I haven't tried to think on
that one (and just fit in around ParseTuple).

Cheers,
-g

--
Greg Stein, http://www.lyra.org/





From mal@lemburg.com  Fri Nov 12 18:49:52 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 19:49:52 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
 <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
 <382C0FA8.ACB6CCD6@lemburg.com>
 <14380.10955.420102.327867@weyr.cnri.reston.va.us>
 <382C3749.198EEBC6@lemburg.com> <14380.16064.723277.586881@weyr.cnri.reston.va.us>
Message-ID: <382C6150.53BDC803@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > The abbreviation BOM is quite common w/r to Unicode.
> 
>   Yes: "w/r to Unicode".  In sys, it's out of context and should
> receive a more descriptive name.  I think using BOM in unicodec is
> good.

Guido proposed to add it to sys. I originally had it defined in
unicodec.

Perhaps a sys.endian would be more appropriate for sys
with values 'little' and 'big' or '<' and '>' to be conform
to the struct module.

unicodec could then define unicodec.bom depending on the setting
in sys.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Sat Nov 13 09:37:35 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 13 Nov 1999 10:37:35 +0100
Subject: [Python-Dev] just say no...
References: 
Message-ID: <382D315F.A7ADEC42@lemburg.com>

Greg Stein wrote:
> 
> On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> > Fredrik Lundh wrote:
> >...
> > > why?  I don't understand why "s" and "s#" has
> > > to deal with encoding issues at all...
> > >
> > > > unless, of course, you want to give up Unicode object support
> > > > for all APIs using these parsers.
> > >
> > > hmm.  maybe that's exactly what I want...
> >
> > If we don't add that support, lot's of existing APIs won't
> > accept Unicode object instead of strings. While it could be
> > argued that automatic conversion to UTF-8 is not transparent
> > enough for the user, the other solution of using str(u)
> > everywhere would probably make writing Unicode-aware code a
> > rather clumsy task and introduce other pitfalls, since str(obj)
> > calls PyObject_Str() which also works on integers, floats,
> > etc.
> 
> No no no...
> 
> "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are
> supposed to return the raw bytes.

[I've waited quite some time for you to chime in on this one ;-)]

Let me summarize a bit on the general ideas behind "s", "s#"
and the extra buffer:

First, we have a general design question here: should old code
become Unicode compatible or not. As I recall, the original idea
about Unicode integration was to follow Perl's idea to have
scripts become Unicode aware by simply adding a 'use utf8;'.

If this is still the case, then we'll have to come up with a
reasonable approach for integrating classical string based
APIs with the new type.

Since UTF-8 is a standard (some, e.g. the Latin-1 folks, would
probably prefer UTF-7,5) which has some very nice features (see
http://czyborra.com/utf/ ) and which is a true extension of ASCII,
this encoding seems the best fit for the purpose.

However, one should not forget that UTF-8 is in fact a
variable-length encoding of Unicode characters, that is, up to
3 bytes may form a *single* character. This is obviously not compatible
with definitions that explicitly state the data to be using an
8-bit single-character encoding; e.g. indexing in UTF-8 doesn't
work like it does in Latin-1 text.

So if we are to do the integration, we'll have to choose
argument parser markers that allow for multi-byte characters.
"t#" does not fall into this category, "s#" certainly does,
"s" is arguable.

Also note that we have to watch out for embedded NULL bytes.
UTF-16 has NULL bytes for every character from the Latin-1
domain. If "s" were to give back a pointer to the internal
buffer which is encoded in UTF-16, you would lose data.
UTF-8 doesn't have this problem, since only NULL bytes
map to (single) NULL bytes.
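
To make that concrete (a toy illustration, with the UTF-16 bytes
written out by hand):

    utf16_le = 'A\x00B\x00C\x00'   # "ABC" in little-endian UTF-16
    utf8     = 'ABC'               # the same text in UTF-8 -- no NULs
    # Anything that treats its input as a C string stops at the first NUL:
    truncated = utf16_le[:utf16_le.index('\x00')]   # == 'A'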

Now Greg would chime in with the buffer interface and
argue that it should make the underlying internal
format accessible. This is a bad idea, IMHO, since you
shouldn't really have to know what the internal data format
is.

Defining "s#" to return UTF-8 data does not only
make "s" and "s#" return the same data format (which should
always be the case, IMO), but also hides the internal
format from the user and gives him a reliable cross-platform
data representation of Unicode data (note that UTF-8 doesn't
have the byte order problems of UTF-16).

If you are still with me, let's look at what "s" and "s#"
do: they return pointers into data areas which have to
be kept alive until the corresponding object dies.

The only way to support this feature is by allocating
a buffer for just this purpose (on the fly and only if
needed to prevent excessive memory load). The other
options of adding new magic parser markers or switching
to a more generic one all have the same downside: you need to
change existing code, which is in conflict with the idea
we started out with.

So, again, the question is: do we want this magical
integration or not ? Note that this is a design question,
not one of memory consumption...

--

Ok, the above covered Unicode -> String conversion. Mark
mentioned that he wanted the other way around to also
work in the same fashion, ie. automatic String -> Unicode
conversion. 

This could also be done in the same way by
interpreting the string as UTF-8 encoded Unicode... but we
have the same problem: where to put the data without
generating new intermediate objects. Since only newly
written code will use this feature, there is a way to do
this, though:

PyArg_ParseTuple(args,"s#",&utf8,&len);

If your C API understands UTF-8 there's nothing more to do,
if not, take Greg's option 3 approach:

PyArg_ParseTuple(args,"O",&obj);
unicode = PyUnicode_FromObject(obj);
...
Py_DECREF(unicode);

Here PyUnicode_FromObject() will return a new
reference if obj is a Unicode object, or create a new
Unicode object by interpreting str(obj) as a UTF-8 encoded string.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    48 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From guido@CNRI.Reston.VA.US  Sat Nov 13 12:12:41 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Sat, 13 Nov 1999 07:12:41 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Fri, 12 Nov 1999 14:59:03 PST."
 
References: 
Message-ID: <199911131212.HAA25895@eric.cnri.reston.va.us>

> I am with Fredrik on that auxilliary buffer. You'll have two dead bodies
> to deal with :-)

I haven't made up my mind yet (due to a very successful
Python-promoting visit to SD'99 east, I'm about 100 msgs behind in
this thread alone) but let me warn you that I can deal with the
carnage, if necessary. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)


From gstein@lyra.org  Sat Nov 13 12:23:54 1999
From: gstein@lyra.org (Greg Stein)
Date: Sat, 13 Nov 1999 04:23:54 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <199911131212.HAA25895@eric.cnri.reston.va.us>
Message-ID: 

On Sat, 13 Nov 1999, Guido van Rossum wrote:
> > I am with Fredrik on that auxilliary buffer. You'll have two dead bodies
> > to deal with :-)
> 
> I haven't made up my mind yet (due to a very successful
> Python-promoting visit to SD'99 east, I'm about 100 msgs behind in
> this thread alone) but let me warn you that I can deal with the
> carnage, if necessary. :-)

Bring it on, big boy!

:-)

--
Greg Stein, http://www.lyra.org/



From mhammond@skippinet.com.au  Sat Nov 13 12:52:18 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 13 Nov 1999 23:52:18 +1100
Subject: [Python-Dev] argument parsing (was: just say no...)
In-Reply-To: 
Message-ID: <00b301bf2dd5$ec4df840$0501a8c0@bobcat>

[Lamenting about PyArg_ParseTuple and managing memory buffers for
String/Unicode conversions.]

So what is really wrong with Marc's proposal about the extra pointer
on the Unicode object?  And to double the carnage, why not add the
equivalent native Unicode buffer to the PyString object?

These would only ever be filled when requested by the conversion
routines.  They have no other effect than that their memory is managed
by the object itself; they are simply a convenience to avoid having
extension modules manage the conversion buffers.

The only overheads appear to be:
* The conversion buffers may be slightly (or much :-) longer-lived -
ie, they are not freed until the object itself is freed.
* String objects become slightly bigger, and slightly slower to destroy.

It appears to solve the problems, and the cost doesn't seem too high...

Mark.



From guido@CNRI.Reston.VA.US  Sat Nov 13 13:06:26 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Sat, 13 Nov 1999 08:06:26 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Sat, 13 Nov 1999 10:37:35 +0100."
 <382D315F.A7ADEC42@lemburg.com>
References: 
 <382D315F.A7ADEC42@lemburg.com>
Message-ID: <199911131306.IAA26030@eric.cnri.reston.va.us>

I think I have a reasonable grasp of the issues here, even though I
still haven't read about 100 msgs in this thread.  Note that t# and
the charbuffer addition to the buffer API were added by Greg Stein
with my support; I'll attempt to reconstruct our thinking at the
time...

[MAL]
> Let me summarize a bit on the general ideas behind "s", "s#"
> and the extra buffer:

I think you left out t#.

> First, we have a general design question here: should old code
> become Unicode compatible or not. As I recall the original idea
> about Unicode integration was to follow Perl's idea to have
> scripts become Unicode aware by simply adding a 'use utf8;'.

I've never heard of this idea before -- or am I taking it too literally?
It smells of a mode to me :-)  I'd rather live in a world where
Unicode just works as long as you use u'...' literals or whatever
convention we decide.

> If this is still the case, then we'll have to come with a
> resonable approach for integrating classical string based
> APIs with the new type.
> 
> Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g.
> the Latin-1 folks) which has some very nice features (see
> http://czyborra.com/utf/ ) and which is a true extension of ASCII,
> this encoding seems best fit for the purpose.

Yes, especially if we fix the default encoding as UTF-8.  (I'm
expecting feedback from HP on this next week, hopefully when I see the
details, it'll be clear that they don't need a per-thread default encoding
to solve their problems; that's quite a likely outcome.  If not, we
have a real-world argument for allowing a variable default encoding,
without carnage.)

> However, one should not forget that UTF-8 is in fact a
> variable length encoding of Unicode characters, that is up to
> 3 bytes form a *single* character. This is obviously not compatible
> with definitions that explicitly state data to be using a
> 8-bit single character encoding, e.g. indexing in UTF-8 doesn't
> work like it does in Latin-1 text.

Sure, but where in current Python are there such requirements?

> So if we are to do the integration, we'll have to choose
> argument parser markers that allow for multi byte characters.
> "t#" does not fall into this category, "s#" certainly does,
> "s" is argueable.

I disagree.  I grepped through the source for s# and t#.  Here's a bit
of background.  Before t# was introduced, s# was being used for two
distinct purposes: (1) to get an 8-bit text string plus its length, in
situations where the length was needed; (2) to get binary data (e.g.
GIF data read from a file in "rb" mode).  Greg pointed out that if we
ever introduced some form of Unicode support, these two had to be
disambiguated.  We found that the majority of uses was for (2)!
Therefore we decided to change the definition of s# to mean only (2),
and introduced t# to mean (1).  Also, we introduced getcharbuffer
corresponding to t#, while getreadbuffer was meant for s#.

Note that the definition of the 's' format was left alone -- as
before, it means you need an 8-bit text string not containing null
bytes.

Our expectation was that a Unicode string passed to an s# situation
would give a pointer to the internal format plus a byte count (not a
character count!) while t# would get a pointer to some kind of 8-bit
translation/encoding plus a byte count, with the explicit requirement
that the 8-bit translation would have the same lifetime as the
original unicode object.  We decided to leave it up to the next
generation (i.e., Marc-Andre :-) to decide what kind of translation to
use and what to do when there is no reasonable translation.

Any of the following choices is acceptable (from the point of view of
not breaking the intended t# semantics; we can now start deciding
which we like best):

- utf-8
- latin-1
- ascii
- shift-jis
- lower byte of unicode ordinal
- some user- or os-specified multibyte encoding

As far as t# is concerned, for encodings that don't encode all of
Unicode, untranslatable characters could be dealt with in any number
of ways (raise an exception, ignore, replace with '?', make best
effort, etc.).
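
Just to make those options concrete, a tiny (purely illustrative)
Python sketch -- 'encode_char' stands in for whatever per-character
mapping a codec would use, and is assumed to raise ValueError for
untranslatable characters:

    def encode_with_policy(text, encode_char, errors='strict'):
        result = ''
        for ch in text:
            try:
                result = result + encode_char(ch)
            except ValueError:
                if errors == 'strict':
                    raise                  # the usual Python way
                elif errors == 'ignore':
                    pass                   # silently drop the character
                elif errors == 'replace':
                    result = result + '?'  # best-effort substitution
        return result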

Given the current context, it should probably be the same as the
default encoding -- i.e., utf-8.  If we end up making the default
user-settable, we'll have to decide what to do with untranslatable
characters -- but that will probably be decided by the user too (it
would be a property of a specific translation specification).

In any case, I feel that t# could receive a multi-byte encoding, 
s# should receive raw binary data, and they should correspond to
getcharbuffer and getreadbuffer, respectively.

(Aside: the symmetry between 's' and 's#' is now lost; 's' matches
't#', there's no match for 's#'.)

> Also note that we have to watch out for embedded NULL bytes.
> UTF-16 has NULL bytes for every character from the Latin-1
> domain. If "s" were to give back a pointer to the internal
> buffer which is encoded in UTF-16, you would loose data.
> UTF-8 doesn't have this problem, since only NULL bytes
> map to (single) NULL bytes.

This is a red herring given my explanation above.

> Now Greg would chime in with the buffer interface and
> argue that it should make the underlying internal
> format accessible. This is a bad idea, IMHO, since you
> shouldn't really have to know what the internal data format
> is.

This is for C code.  Quite likely it *does* know what the internal
data format is!

> Defining "s#" to return UTF-8 data does not only
> make "s" and "s#" return the same data format (which should
> always be the case, IMO),

That was before t# was introduced.  No more, alas.  If you replace s#
with t#, I agree with you completely.

> but also hides the internal
> format from the user and gives him a reliable cross-platform
> data representation of Unicode data (note that UTF-8 doesn't
> have the byte order problems of UTF-16).
> 
> If you are still with, let's look at what "s" and "s#"

(and t#, which is more relevant here)

> do: they return pointers into data areas which have to
> be kept alive until the corresponding object dies.
> 
> The only way to support this feature is by allocating
> a buffer for just this purpose (on the fly and only if
> needed to prevent excessive memory load). The other
> options of adding new magic parser markers or switching
> to more generic one all have one downside: you need to
> change existing code which is in conflict with the idea
> we started out with.

Agreed.  I think this was our thinking when Greg & I introduced t#.
My own preference would be to allocate a whole string object, not
just a buffer; this could then also be used for the .encode() method
using the default encoding.
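
In plain-Python terms (just a sketch of the idea -- the real thing
would live inside the C unicode object, and 'default_encoder' is a
stand-in for the default codec, not an existing API):

    class UnicodeSketch:
        def __init__(self, data, default_encoder):
            self.data = data              # internal representation
            self.default_encoder = default_encoder
            self.cached = None            # lazily built encoded string

        def default_encoded(self):
            # Build the default-encoded string object once and keep it
            # alive; .encode() and the t# marker could then both hand
            # out the very same object.
            if self.cached is None:
                self.cached = self.default_encoder(self.data)
            return self.cached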

> So, again, the question is: do we want this magical
> intergration or not ? Note that this is a design question,
> not one of memory consumption...

Yes, I want it.

Note that this doesn't guarantee that all old extensions will work
flawlessly when passed Unicode objects; but I think that it covers
most cases where you could have a reasonable expectation that it
works.

(Hm, unfortunately many reasonable expectations seem to involve
the current user's preferred encoding. :-( )

> --
> 
> Ok, the above covered Unicode -> String conversion. Mark
> mentioned that he wanted the other way around to also
> work in the same fashion, ie. automatic String -> Unicode
> conversion. 
> 
> This could also be done in the same way by
> interpreting the string as UTF-8 encoded Unicode... but we
> have the same problem: where to put the data without
> generating new intermediate objects. Since only newly
> written code will use this feature there is a way to do
> this though:
> 
> PyArg_ParseTuple(args,"s#",&utf8,&len);

No!  That is supposed to give the native representation of the string
object.

I agree that Mark's problem requires a solution too, but it doesn't
have to use existing formatting characters, since there's no backwards
compatibility issue.

> If your C API understands UTF-8 there's nothing more to do,
> if not, take Greg's option 3 approach:
> 
> PyArg_ParseTuple(args,"O",&obj);
> unicode = PyUnicode_FromObject(obj);
> ...
> Py_DECREF(unicode);
> 
> Here PyUnicode_FromObject() will return a new
> reference if obj is an Unicode object or create a new
> Unicode object by interpreting str(obj) as UTF-8 encoded string.

This might work.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From mal@lemburg.com  Sat Nov 13 13:06:35 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 13 Nov 1999 14:06:35 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.5
References: <382C0A54.E6E8328D@lemburg.com>
Message-ID: <382D625B.DC14DBDE@lemburg.com>

FYI, I've uploaded a new version of the proposal which incorporates
proposals for line breaks, case mapping, character properties and
private code points support.

The latest version of the proposal is available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

    * should Unicode objects support %-formatting ?

    One possibility would be to emulate this via strings and the
    default encoding:

    s = '%s %i abcäöü' # a Latin-1 encoded string
    t = (u,3)

    # Convert Latin-1 s to a default encoding string
    s1 = unicode(s,'latin-1').encode()

    # The '%s' will now add u in the default encoding
    s2 = s1 % t

    # Finally, convert the default-encoded string to Unicode
    u1 = unicode(s2)

    * specifying file wrappers:

    Open issues: what to do with Python strings
    fed to the .write() method (may need to know the encoding of the
    strings) and when/if to return Python strings through the .read()
    method.

    Perhaps we need more than one type of wrapper here.
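
    Purely as a strawman for that discussion, the simplest wrapper
    could look roughly like this (assuming the proposed unicode()
    constructor and .encode() method; nothing here is settled, and
    multi-byte encodings would need smarter read() handling):

        class EncodedFile:
            def __init__(self, file, encoding):
                self.file = file
                self.encoding = encoding

            def write(self, u):
                # encode Unicode on the way out
                self.file.write(u.encode(self.encoding))

            def read(self, size=-1):
                # decode raw bytes on the way in (naive: a size-limited
                # read may split a multi-byte character)
                return unicode(self.file.read(size), self.encoding)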

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    48 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From jack@oratrix.nl  Sat Nov 13 16:40:34 1999
From: jack@oratrix.nl (Jack Jansen)
Date: Sat, 13 Nov 1999 17:40:34 +0100
Subject: [Python-Dev] just say no...
In-Reply-To: Message by Greg Stein  ,
 Fri, 12 Nov 1999 15:05:11 -0800 (PST) , 
Message-ID: <19991113164039.9B697EA11A@oratrix.oratrix.nl>

Recently, Greg Stein  said:
> This was done last year!! We have "s#" meaning "give me some bytes." We
> have "t#" meaning "give me some 8-bit characters." The Python distribution
> has been completely updated to use the appropriate format in each call.

Oops...

I remember the discussion but I wasn't aware that someone had actually
_implemented_ this :-). Part of my misunderstanding was also caused by
the fact that I inspected what I thought would be the prime candidate
for t#: file.write() to a non-binary file, and it doesn't use the new
format.

I also noted a few inconsistencies at first glance, by the way: most
modules seem to use s# for things like filenames and other
data-that-is-readable-but-shouldn't-be-messed-with, but binascii is an 
exception and it uses t# for uuencoded strings...
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 


From guido@CNRI.Reston.VA.US  Sat Nov 13 19:20:51 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Sat, 13 Nov 1999 14:20:51 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Sat, 13 Nov 1999 17:40:34 +0100."
 <19991113164039.9B697EA11A@oratrix.oratrix.nl>
References: <19991113164039.9B697EA11A@oratrix.oratrix.nl>
Message-ID: <199911131920.OAA26165@eric.cnri.reston.va.us>

> I remember the discussion but I wasn't aware that somone had actually
> _implemented_ this:-). Part of my misunderstanding was also caused by
> the fact that I inspected what I thought would be the prime candidate
> for t#: file.write() to a non-binary file, and it doesn't use the new
> format.

I guess that's because file.write() doesn't distinguish between text
and binary files.  Maybe it should: the current implementation
together with my proposed semantics for Unicode strings would mean that
printing a unicode string (to stdout) would dump the internal encoding
to the file.  I guess it should do so only when the file is opened in
binary mode; for files opened in text mode it should use an encoding
(opening a file can specify an encoding; can we change the encoding of
an existing file?).

> I also noted a few inconsistencies at first glance, by the way: most
> modules seem to use s# for things like filenames and other
> data-that-is-readable-but-shouldn't-be-messed-with, but binascii is an 
> exception and it uses t# for uuencoded strings...

Actually, binascii seems to do it right: s# for binary data, t# for
text (uuencoded, hqx, base64).  That is, the b2a variants use s# while
the a2b variants use t#.  The only thing I'm not sure about in that
module are binascii_rledecode_hqx() and binascii_rlecode_hqx() -- I
don't understand where these stand in the complexity of binhex
en/decoding.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal@lemburg.com  Sun Nov 14 22:11:54 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sun, 14 Nov 1999 23:11:54 +0100
Subject: [Python-Dev] just say no...
References: 
 <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us>
Message-ID: <382F33AA.C3EE825A@lemburg.com>

Guido van Rossum wrote:
> 
> I think I have a reasonable grasp of the issues here, even though I
> still haven't read about 100 msgs in this thread.  Note that t# and
> the charbuffer addition to the buffer API were added by Greg Stein
> with my support; I'll attempt to reconstruct our thinking at the
> time...
>
> [MAL]
> > Let me summarize a bit on the general ideas behind "s", "s#"
> > and the extra buffer:
> 
> I think you left out t#.

On purpose -- according to my thinking. I see "t#" as an interface
to bf_getcharbuf, which I understand as an 8-bit character buffer...
UTF-8 is a multi-byte encoding. It is still character data, but
not necessarily 8 bits in length (up to 24 bits are used).

Anyway, I'm not really interested in having an argument about
this. If you say "t#" fits the purpose, then that's fine with
me. Still, we should clearly define that "t#" returns
text data and "s#" binary data. Encoding, bit length, etc. should
explicitly be left undefined.

> > First, we have a general design question here: should old code
> > become Unicode compatible or not. As I recall the original idea
> > about Unicode integration was to follow Perl's idea to have
> > scripts become Unicode aware by simply adding a 'use utf8;'.
> 
> I've never heard of this idea before -- or am I taking it too literal?
> It smells of a mode to me :-)  I'd rather live in a world where
> Unicode just works as long as you use u'...' literals or whatever
> convention we decide.
> 
> > If this is still the case, then we'll have to come with a
> > resonable approach for integrating classical string based
> > APIs with the new type.
> >
> > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g.
> > the Latin-1 folks) which has some very nice features (see
> > http://czyborra.com/utf/ ) and which is a true extension of ASCII,
> > this encoding seems best fit for the purpose.
> 
> Yes, especially if we fix the default encoding as UTF-8.  (I'm
> expecting feedback from HP on this next week, hopefully when I see the
> details, it'll be clear that don't need a per-thread default encoding
> to solve their problems; that's quite a likely outcome.  If not, we
> have a real-world argument for allowing a variable default encoding,
> without carnage.)

Fair enough :-)
 
> > However, one should not forget that UTF-8 is in fact a
> > variable length encoding of Unicode characters, that is up to
> > 3 bytes form a *single* character. This is obviously not compatible
> > with definitions that explicitly state data to be using a
> > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't
> > work like it does in Latin-1 text.
> 
> Sure, but where in current Python are there such requirements?

It was my understanding that "t#" refers to single-byte character
data. That's what the above arguments were aiming at...
 
> > So if we are to do the integration, we'll have to choose
> > argument parser markers that allow for multi byte characters.
> > "t#" does not fall into this category, "s#" certainly does,
> > "s" is argueable.
> 
> I disagree.  I grepped through the source for s# and t#.  Here's a bit
> of background.  Before t# was introduced, s# was being used for two
> distinct purposes: (1) to get an 8-bit text string plus its length, in
> situations where the length was needed; (2) to get binary data (e.g.
> GIF data read from a file in "rb" mode).  Greg pointed out that if we
> ever introduced some form of Unicode support, these two had to be
> disambiguated.  We found that the majority of uses was for (2)!
> Therefore we decided to change the definition of s# to mean only (2),
> and introduced t# to mean (1).  Also, we introduced getcharbuffer
> corresponding to t#, while getreadbuffer was meant for s#.

I know it's too late now, but I can't really follow the arguments
here: in what way are (1) and (2) different from the implementation's
point of view ? If "t#" is to return UTF-8, then the length of the
buffer will not equal the length of the string, so both parser markers
return essentially the same information. The only difference would be
on the semantic side: (1) means "give me text data", while (2) does
not specify the data type.

Perhaps I'm missing something...
 
> Note that the definition of the 's' format was left alone -- as
> before, it means you need an 8-bit text string not containing null
> bytes.

This definition should then be changed to "text string without
null bytes", dropping the 8-bit reference.
 
> Our expectation was that a Unicode string passed to an s# situation
> would give a pointer to the internal format plus a byte count (not a
> character count!) while t# would get a pointer to some kind of 8-bit
> translation/encoding plus a byte count, with the explicit requirement
> that the 8-bit translation would have the same lifetime as the
> original unicode object.  We decided to leave it up to the next
> generation (i.e., Marc-Andre :-) to decide what kind of translation to
> use and what to do when there is no reasonable translation.

Hmm, I would strongly object to making "s#" return the internal
format. file.write() would then default to writing UTF-16 data
instead of UTF-8 data. This could result in strange errors
due to the UTF-16 format being endian dependent.

It would also break the symmetry between file.write(u) and
unicode(file.read()), since the default encoding is not used as
internal format for other reasons (see proposal).

> Any of the following choices is acceptable (from the point of view of
> not breaking the intended t# semantics; we can now start deciding
> which we like best):

I think we have already agreed on using UTF-8 for the default
encoding. It has quite a few advantages. See

	http://czyborra.com/utf/

for a good overview of the pros and cons.

> - utf-8
> - latin-1
> - ascii
> - shift-jis
> - lower byte of unicode ordinal
> - some user- or os-specified multibyte encoding
> 
> As far as t# is concerned, for encodings that don't encode all of
> Unicode, untranslatable characters could be dealt with in any number
> of ways (raise an exception, ignore, replace with '?', make best
> effort, etc.).

The usual Python way would be: raise an exception. This is what
the proposal defines for Codecs in case an encoding/decoding
mapping is not possible, BTW. (UTF-8 will always succeed on
output.)
 
> Given the current context, it should probably be the same as the
> default encoding -- i.e., utf-8.  If we end up making the default
> user-settable, we'll have to decide what to do with untranslatable
> characters -- but that will probably be decided by the user too (it
> would be a property of a specific translation specification).
> 
> In any case, I feel that t# could receive a multi-byte encoding,
> s# should receive raw binary data, and they should correspond to
> getcharbuffer and getreadbuffer, respectively.

Why would you want to have "s#" return the raw binary data for
Unicode objects ? 

Note that it is not mentioned anywhere that
"s#" and "t#" necessarily have to return different things
(binary being a superset of text). I'd opt for "s#" and "t#" both
returning UTF-8 data. This can be implemented by delegating the
buffer slots to the default-encoding buffer object (see below).

> > Now Greg would chime in with the buffer interface and
> > argue that it should make the underlying internal
> > format accessible. This is a bad idea, IMHO, since you
> > shouldn't really have to know what the internal data format
> > is.
> 
> This is for C code.  Quite likely it *does* know what the internal
> data format is!

C code can use the PyUnicode_* APIs to access the data. I
don't think that argument parsing is powerful enough to
provide the C code with enough information about the data
contents, e.g. it can only state the encoding length, not the
string length.
 
> > Defining "s#" to return UTF-8 data does not only
> > make "s" and "s#" return the same data format (which should
> > always be the case, IMO),
> 
> That was before t# was introduced.  No more, alas.  If you replace s#
> with t#, I agree with you completely.

Done :-)
 
> > but also hides the internal
> > format from the user and gives him a reliable cross-platform
> > data representation of Unicode data (note that UTF-8 doesn't
> > have the byte order problems of UTF-16).
> >
> > If you are still with, let's look at what "s" and "s#"
> 
> (and t#, which is more relevant here)
> 
> > do: they return pointers into data areas which have to
> > be kept alive until the corresponding object dies.
> >
> > The only way to support this feature is by allocating
> > a buffer for just this purpose (on the fly and only if
> > needed to prevent excessive memory load). The other
> > options of adding new magic parser markers or switching
> > to more generic one all have one downside: you need to
> > change existing code which is in conflict with the idea
> > we started out with.
> 
> Agreed.  I think this was our thinking when Greg & I introduced t#.
> My own preference would be to allocate a whole string object, not
> just a buffer; this could then also be used for the .encode() method
> using the default encoding.

Good point. I'll change the default-encoding buffer to a Python
string object created on request.
 
> > So, again, the question is: do we want this magical
> > intergration or not ? Note that this is a design question,
> > not one of memory consumption...
> 
> Yes, I want it.
> 
> Note that this doesn't guarantee that all old extensions will work
> flawlessly when passed Unicode objects; but I think that it covers
> most cases where you could have a reasonable expectation that it
> works.
> 
> (Hm, unfortunately many reasonable expectations seem to involve
> the current user's preferred encoding. :-( )

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    47 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From akuchlin@mems-exchange.org  Mon Nov 15 01:49:08 1999
From: akuchlin@mems-exchange.org (A.M. Kuchling)
Date: Sun, 14 Nov 1999 20:49:08 -0500
Subject: [Python-Dev] PyErr_Format security note
Message-ID: <199911150149.UAA00408@mira.erols.com>

I noticed this in PyErr_Format(exception, format, va_alist):

	char buffer[500]; /* Caller is responsible for limiting the format */
	...
	vsprintf(buffer, format, vargs);

Making the caller responsible for this is error-prone.  The danger, of
course, is a buffer overflow caused by generating an error string
that's larger than the buffer, possibly letting people execute
arbitrary code.  We could add a test to the configure script for
vsnprintf() and use it when possible, but that only fixes the problem
on platforms which have it.  Can we find an implementation of
vsnprintf() someplace?

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
One form to rule them all, one form to find them, one form to bring them all
and in the darkness rewrite the hell out of them.
    -- Digital Equipment Corporation, in a comment from SENDMAIL Ruleset 3



From gstein@lyra.org  Mon Nov 15 02:11:39 1999
From: gstein@lyra.org (Greg Stein)
Date: Sun, 14 Nov 1999 18:11:39 -0800 (PST)
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: <199911150149.UAA00408@mira.erols.com>
Message-ID: 

On Sun, 14 Nov 1999, A.M. Kuchling wrote:
> Making the caller responsible for this is error-prone.  The danger, of
> course, is a buffer overflow caused by generating an error string
> that's larger than the buffer, possibly letting people execute
> arbitrary code.  We could add a test to the configure script for
> vsnprintf() and use it when possible, but that only fixes the problem
> on platforms which have it.  Can we find an implementation of
> vsnprintf() someplace?

Apache has a safe implementation (they have reviewed the heck out of it
for obvious reasons :-).

In the Apache source distribution, it is located in src/ap/ap_snprintf.c.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From mal@lemburg.com  Mon Nov 15 08:09:07 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 09:09:07 +0100
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com>
Message-ID: <382FBFA3.B28B8E1E@lemburg.com>

"A.M. Kuchling" wrote:
> 
> I noticed this in PyErr_Format(exception, format, va_alist):
> 
>         char buffer[500]; /* Caller is responsible for limiting the format */
>         ...
>         vsprintf(buffer, format, vargs);
> 
> Making the caller responsible for this is error-prone.  The danger, of
> course, is a buffer overflow caused by generating an error string
> that's larger than the buffer, possibly letting people execute
> arbitrary code.  We could add a test to the configure script for
> vsnprintf() and use it when possible, but that only fixes the problem
> on platforms which have it.  Can we find an implementation of
> vsnprintf() someplace?

In sysmodule.c, the following check is done, which should be safe enough
since no "return" is issued (Py_FatalError() does an abort()):

  if (vsprintf(buffer, format, va) >= sizeof(buffer))
    Py_FatalError("PySys_WriteStdout/err: buffer overrun");


-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From gstein@lyra.org  Mon Nov 15 09:28:06 1999
From: gstein@lyra.org (Greg Stein)
Date: Mon, 15 Nov 1999 01:28:06 -0800 (PST)
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: <382FBFA3.B28B8E1E@lemburg.com>
Message-ID: 

On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
>...
> In sysmodule.c, this check is done which should be safe enough
> since no "return" is issued (Py_FatalError() does an abort()):
> 
>   if (vsprintf(buffer, format, va) >= sizeof(buffer))
>     Py_FatalError("PySys_WriteStdout/err: buffer overrun");

I believe the return from vsprintf() itself would be the problem.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From mal@lemburg.com  Mon Nov 15 09:49:26 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 10:49:26 +0100
Subject: [Python-Dev] PyErr_Format security note
References: 
Message-ID: <382FD726.6ACB912F@lemburg.com>

Greg Stein wrote:
> 
> On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
> >...
> > In sysmodule.c, this check is done which should be safe enough
> > since no "return" is issued (Py_FatalError() does an abort()):
> >
> >   if (vsprintf(buffer, format, va) >= sizeof(buffer))
> >     Py_FatalError("PySys_WriteStdout/err: buffer overrun");
> 
> I believe the return from vsprintf() itself would be the problem.

Ouch, yes, you are right... but who could exploit this security
hole ? Since PyErr_Format() is only reachable from C code, only
bad programming style in extensions could make it exploitable
via user input.

Wouldn't it be possible to use thread-global buffers for these
functions ? These would live on the heap instead of
on the stack and eliminate the buffer overrun possibilities
(I guess -- I don't have any experience with these...).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From akuchlin@mems-exchange.org  Mon Nov 15 15:17:58 1999
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Mon, 15 Nov 1999 10:17:58 -0500 (EST)
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: <382FD726.6ACB912F@lemburg.com>
References: 
 <382FD726.6ACB912F@lemburg.com>
Message-ID: <14384.9254.152604.11688@amarok.cnri.reston.va.us>

M.-A. Lemburg writes:
>Ouch, yes, you are right... but who could exploit this security
>hole ? Since PyErr_Format() is only reachable for C code, only
>bad programming style in extensions could make it exploitable
>via user input.

99% of security holes arise out of carelessness, and besides, this
buffer size doesn't seem to be documented in either api.tex or
ext.tex.  I'll look into borrowing Apache's implementation and
modifying it into a varargs form.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
I can also withstand considerably more G-force than most people, even though I
do say so myself.
    -- The Doctor, in "The Ambassadors of Death"



From guido@CNRI.Reston.VA.US  Mon Nov 15 15:23:57 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 10:23:57 -0500
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: Your message of "Sun, 14 Nov 1999 20:49:08 EST."
 <199911150149.UAA00408@mira.erols.com>
References: <199911150149.UAA00408@mira.erols.com>
Message-ID: <199911151523.KAA27163@eric.cnri.reston.va.us>

> I noticed this in PyErr_Format(exception, format, va_alist):
> 
> 	char buffer[500]; /* Caller is responsible for limiting the format */
> 	...
> 	vsprintf(buffer, format, vargs);
> 
> Making the caller responsible for this is error-prone.

Agreed.  The limit of 500 chars, while technically undocumented, is
part of the specs for PyErr_Format (which is currently wholly
undocumented).  The current callers all have explicit precautions, but
of course I agree that this is a potential danger.

> The danger, of
> course, is a buffer overflow caused by generating an error string
> that's larger than the buffer, possibly letting people execute
> arbitrary code.  We could add a test to the configure script for
> vsnprintf() and use it when possible, but that only fixes the problem
> on platforms which have it.  Can we find an implementation of
> vsnprintf() someplace?

Assuming that Linux and Solaris have vsnprintf(), can't we just use
the configure script to detect it, and issue a warning blaming the
platform for those platforms that don't have it?  That seems much
simpler (from a maintenance perspective) than carrying our own
implementation around (even if we can borrow the Apache version).

--Guido van Rossum (home page: http://www.python.org/~guido/)


From fdrake@acm.org  Mon Nov 15 15:24:27 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Mon, 15 Nov 1999 10:24:27 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382C6150.53BDC803@lemburg.com>
References: 
 <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
 <382C0FA8.ACB6CCD6@lemburg.com>
 <14380.10955.420102.327867@weyr.cnri.reston.va.us>
 <382C3749.198EEBC6@lemburg.com>
 <14380.16064.723277.586881@weyr.cnri.reston.va.us>
 <382C6150.53BDC803@lemburg.com>
Message-ID: <14384.9643.145759.816037@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > Guido proposed to add it to sys. I originally had it defined in
 > unicodec.

  Well, he clearly didn't ask me!  ;-)

 > Perhaps a sys.endian would be more appropriate for sys
 > with values 'little' and 'big' or '<' and '>' to be conform
 > to the struct module.
 > 
 > unicodec could then define unicodec.bom depending on the setting
 > in sys.

  This seems more reasonable, though I'd go with BOM instead of bom.
But that's a style issue, so not so important.  If you write bom,
I'll write bom.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives


From andy@robanal.demon.co.uk  Mon Nov 15 15:30:45 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Mon, 15 Nov 1999 07:30:45 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
Message-ID: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>

Some thoughts on the codecs...

1. Stream interface
At the moment a codec has dump and load methods which
read a (slice of a) stream into a string in memory and
vice versa.  As the proposal notes, this could lead to
errors if you take a slice out of a stream.   This is
not just due to character truncation; some Asian
encodings are modal and have shift-in and shift-out
sequences as they move from Western single-byte
characters to double-byte ones.   It also seems a bit
pointless to me as the source (or target) is still a
Unicode string in memory.

This is a real problem - a filter to convert big files
between two encodings should be possible without
knowledge of the particular encoding, as should one on
the input/output of some server.  We can still give a
default implementation for single-byte encodings.

What's a good API for real stream conversion?  Just
Codec.encodeStream(infile, outfile)?  Or is it more useful to feed
the codec with data a chunk at a time?  (A rough sketch of the
chunked variant is at the end of this mail.)


2. Data driven codecs
I really like codecs being objects, and believe we
could build support for a lot more encodings, a lot
sooner than is otherwise possible, by making them data
driven rather than making each one compiled C code with
static mapping tables.  What do people think about the
approach below?

First of all, the ISO8859-1 series are straight
mappings to Unicode code points.  So one Python script
could parse these files and build the mapping table,
and a very small data file could hold these encodings.
  A compiled helper function analogous to
string.translate() could deal with most of them.

Secondly, the double-byte ones involve a mixture of
algorithms and data.  The worst cases I know are modal
encodings which need a single-byte lookup table, a
double-byte lookup table, and have some very simple
rules about escape sequences in between them.  A
simple state machine could still handle these (and the
single-byte mappings above become extra-simple special
cases); I could imagine feeding it a totally
data-driven set of rules.  

Third, we can massively compress the mapping tables
using a notation which just lists contiguous ranges;
and very often there are relationships between
encodings.  For example, "cpXYZ is just like cpXYY but
with an extra 'smiley' at 0XFE32".  In these cases, a
script can build a family of related codecs in an
auditable manner. 

3. What encodings to distribute?
The only clean answers to this are 'almost none', or
'everything that Unicode 3.0 has a mapping for'.  The
latter is going to add some weight to the
distribution.  What are people's feelings?  Do we ship
any at all apart from the Unicode ones?  Should new
encodings be downloadable from www.python.org?  Should
there be an optional package outside the main
distribution?
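
Purely to make the chunked-stream question from point 1 concrete, the
filter I have in mind could be shaped like this (just a sketch: decode
and encode are assumed to be stateful callables, so that modal and
multi-byte encodings survive chunk boundaries -- which is exactly the
part that needs designing):

    def recode(infile, outfile, decode, encode, chunksize=16384):
        # Convert infile to outfile a chunk at a time, without ever
        # holding the whole file in memory.
        while 1:
            chunk = infile.read(chunksize)
            if not chunk:
                break
            outfile.write(encode(decode(chunk)))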

Thanks,

Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From akuchlin@mems-exchange.org  Mon Nov 15 15:36:47 1999
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Mon, 15 Nov 1999 10:36:47 -0500 (EST)
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: <199911151523.KAA27163@eric.cnri.reston.va.us>
References: <199911150149.UAA00408@mira.erols.com>
 <199911151523.KAA27163@eric.cnri.reston.va.us>
Message-ID: <14384.10383.718373.432606@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>Assuming that Linux and Solaris have vsnprintf(), can't we just use
>the configure script to detect it, and issue a warning blaming the
>platform for those platforms that don't have it?  That seems much

But people using an already-installed Python binary won't see any such
configure-time warning, and won't find out about the potential
problem.  Plus, how do people fix the problem on platforms that don't
have vsnprintf() -- switch to Solaris or Linux?  Not much of a
solution.  (vsnprintf() isn't ANSI C, though it's a common extension,
so platforms that lack it aren't really deficient.)

Hmm... could we maybe use Python's existing (string % vars) machinery?
No, that seems to be hard, because it would want
PyObjects, and we can't know what Python types to convert the varargs
to, unless we parse the format string (at which point we may as well
get a vsnprintf() implementation).

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
A successful tool is one that was used to do something undreamed of by its
author.
    -- S.C. Johnson



From guido@CNRI.Reston.VA.US  Mon Nov 15 15:50:24 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 10:50:24 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Sun, 14 Nov 1999 23:11:54 +0100."
 <382F33AA.C3EE825A@lemburg.com>
References:  <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us>
 <382F33AA.C3EE825A@lemburg.com>
Message-ID: <199911151550.KAA27188@eric.cnri.reston.va.us>

> On purpose -- according to my thinking. I see "t#" as an interface
> to bf_getcharbuf which I understand as 8-bit character buffer...
> UTF-8 is a multi byte encoding. It still is character data, but
> not necessarily 8 bits in length (up to 24 bits are used).
> 
> Anyway, I'm not really interested in having an argument about
> this. If you say, "t#" fits the purpose, then that's fine with
> me. Still, we should clearly define that "t#" returns
> text data and "s#" binary data. Encoding, bit length, etc. should
> explicitly remain left undefined.

Thanks for not picking an argument.  Multibyte encodings typically
have ASCII as a subset (in such a way that an ASCII string is
represented as itself in bytes).  This is the characteristic that's
needed in my view.

> > > First, we have a general design question here: should old code
> > > become Unicode compatible or not. As I recall the original idea
> > > about Unicode integration was to follow Perl's idea to have
> > > scripts become Unicode aware by simply adding a 'use utf8;'.
> > 
> > I've never heard of this idea before -- or am I taking it too literal?
> > It smells of a mode to me :-)  I'd rather live in a world where
> > Unicode just works as long as you use u'...' literals or whatever
> > convention we decide.
> > 
> > > If this is still the case, then we'll have to come with a
> > > resonable approach for integrating classical string based
> > > APIs with the new type.
> > >
> > > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g.
> > > the Latin-1 folks) which has some very nice features (see
> > > http://czyborra.com/utf/ ) and which is a true extension of ASCII,
> > > this encoding seems best fit for the purpose.
> > 
> > Yes, especially if we fix the default encoding as UTF-8.  (I'm
> > expecting feedback from HP on this next week, hopefully when I see the
> > details, it'll be clear that don't need a per-thread default encoding
> > to solve their problems; that's quite a likely outcome.  If not, we
> > have a real-world argument for allowing a variable default encoding,
> > without carnage.)
> 
> Fair enough :-)
>  
> > > However, one should not forget that UTF-8 is in fact a
> > > variable length encoding of Unicode characters, that is up to
> > > 3 bytes form a *single* character. This is obviously not compatible
> > > with definitions that explicitly state data to be using a
> > > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't
> > > work like it does in Latin-1 text.
> > 
> > Sure, but where in current Python are there such requirements?
> 
> It was my understanding that "t#" refers to single byte character
> data. That's where the above arguments were aiming at...

t# refers to byte-encoded data.  Multibyte encodings are explicitly
designed to be passed cleanly through processing steps that handle
single-byte character data, as long as they are 8-bit clean and don't
do too much processing.

> > > So if we are to do the integration, we'll have to choose
> > > argument parser markers that allow for multi byte characters.
> > > "t#" does not fall into this category, "s#" certainly does,
> > > "s" is argueable.
> > 
> > I disagree.  I grepped through the source for s# and t#.  Here's a bit
> > of background.  Before t# was introduced, s# was being used for two
> > distinct purposes: (1) to get an 8-bit text string plus its length, in
> > situations where the length was needed; (2) to get binary data (e.g.
> > GIF data read from a file in "rb" mode).  Greg pointed out that if we
> > ever introduced some form of Unicode support, these two had to be
> > disambiguated.  We found that the majority of uses was for (2)!
> > Therefore we decided to change the definition of s# to mean only (2),
> > and introduced t# to mean (1).  Also, we introduced getcharbuffer
> > corresponding to t#, while getreadbuffer was meant for s#.
> 
> I know its too late now, but I can't really follow the arguments
> here: in what ways are (1) and (2) different from the implementations
> point of view ? If "t#" is to return UTF-8 then the length of the
> buffer will not equal the length of the string, so both parser markers
> return essentially the same information. The only difference would be
> on the semantic side: (1) means: give me text data, while (2) does
> not specify the data type.
> 
> Perhaps I'm missing something...

The idea is that (1)/s# disallows any translation of the data, while
(2)/t# requires translation of the data to an ASCII superset (possibly
multibyte, such as UTF-8 or shift-JIS).  (2)/t# assumes that the data
contains text and that if the text consists of only ASCII characters
they are represented as themselves.  (1)/s# makes no such assumption.

In terms of implementation, Unicode objects should translate
themselves to the default encoding for t# (if possible), but they
should make the native representation available for s#.

For example, take an encryption engine.  While it is defined in terms
of byte streams, there's no requirement that the bytes represent
characters -- they could be the bytes of a GIF file, an MP3 file, or a
gzipped tar file.  If we pass Unicode to an encryption engine, we want
Unicode to come out at the other end, not UTF-8.  (If we had wanted to
encrypt UTF-8, we should have fed it UTF-8.)
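
Spelled out in (hypothetical) Python -- encrypt and decrypt stand for
some byte-oriented engine, and .encode()/unicode() are the proposed
APIs -- the explicit version would be:

    def roundtrip(u, encrypt, decrypt):
        # Explicitly pick the byte representation before encrypting,
        # and reverse exactly the same choice afterwards.
        ciphertext = encrypt(u.encode('utf-8'))
        return unicode(decrypt(ciphertext), 'utf-8')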

> > Note that the definition of the 's' format was left alone -- as
> > before, it means you need an 8-bit text string not containing null
> > bytes.
> 
> This definition should then be changed to "text string without
> null bytes" dropping the 8-bit reference.

Aha, I think there's a confusion about what "8-bit" means.  For me, a
multibyte encoding like UTF-8 is still 8-bit.  Am I alone in this?
(As far as I know, C uses char* to represent multibyte characters.)
Maybe we should disambiguate it more explicitly?

> > Our expectation was that a Unicode string passed to an s# situation
> > would give a pointer to the internal format plus a byte count (not a
> > character count!) while t# would get a pointer to some kind of 8-bit
> > translation/encoding plus a byte count, with the explicit requirement
> > that the 8-bit translation would have the same lifetime as the
> > original unicode object.  We decided to leave it up to the next
> > generation (i.e., Marc-Andre :-) to decide what kind of translation to
> > use and what to do when there is no reasonable translation.
> 
> Hmm, I would strongly object to making "s#" return the internal
> format. file.write() would then default to writing UTF-16 data
> instead of UTF-8 data. This could result in strange errors
> due to the UTF-16 format being endian dependent.

But this was the whole design.  file.write() needs to be changed to
use s# when the file is open in binary mode and t# when the file is
open in text mode.

> It would also break the symmetry between file.write(u) and
> unicode(file.read()), since the default encoding is not used as
> internal format for other reasons (see proposal).

If the file is encoded using UTF-16 or UCS-2, you should open it in
binary mode and use unicode(file.read(), 'utf-16').  (Or perhaps the
app should read the first 2 bytes and check for a BOM and then decide
to choose between 'utf-16-be' and 'utf-16-le'.)

> > Any of the following choices is acceptable (from the point of view of
> > not breaking the intended t# semantics; we can now start deciding
> > which we like best):
> 
> I think we have already agreed on using UTF-8 for the default
> encoding. It has quite a few advantages. See
> 
> 	http://czyborra.com/utf/
> 
> for a good overview of the pros and cons.

Of course.  I was just presenting the list as an argument that if
we changed our mind about the default encoding, t# should follow the
default encoding (and not pick an encoding by other means).

> > - utf-8
> > - latin-1
> > - ascii
> > - shift-jis
> > - lower byte of unicode ordinal
> > - some user- or os-specified multibyte encoding
> > 
> > As far as t# is concerned, for encodings that don't encode all of
> > Unicode, untranslatable characters could be dealt with in any number
> > of ways (raise an exception, ignore, replace with '?', make best
> > effort, etc.).
> 
> The usual Python way would be: raise an exception. This is what
> the proposal defines for Codecs in case an encoding/decoding
> mapping is not possible, BTW. (UTF-8 will always succeed on
> output.)

Did you read Andy Robinson's case study?  He suggested that for
certain encodings there may be other things you can do that are more
user-friendly than raising an exception, depending on the application.
I am proposing to leave this a detail of each specific translation.
There may even be translations that do the same thing except they have
a different behavior for untranslatable cases -- e.g. a strict version
that raises an exception and a non-strict version that replaces bad
characters with '?'.  I think this is one of the powers of having an
extensible set of encodings.

> > Given the current context, it should probably be the same as the
> > default encoding -- i.e., utf-8.  If we end up making the default
> > user-settable, we'll have to decide what to do with untranslatable
> > characters -- but that will probably be decided by the user too (it
> > would be a property of a specific translation specification).
> > 
> > In any case, I feel that t# could receive a multi-byte encoding,
> > s# should receive raw binary data, and they should correspond to
> > getcharbuffer and getreadbuffer, respectively.
> 
> Why would you want to have "s#" return the raw binary data for
> Unicode objects ? 

Because file.write() for a binary file, and other similar things
(e.g. the encryption engine example I mentioned above) must have
*some* way to get at the raw bits.

> Note that it is not mentioned anywhere that
> "s#" and "t#" do have to necessarily return different things
> (binary being a superset of text). I'd opt for "s#" and "t#" both
> returning UTF-8 data. This can be implemented by delegating the
> buffer slots to the default-encoding buffer object (see below).

This would defeat the whole purpose of introducing t#.  We might as
well drop t# then altogether if we adopt this.

> > > Now Greg would chime in with the buffer interface and
> > > argue that it should make the underlying internal
> > > format accessible. This is a bad idea, IMHO, since you
> > > shouldn't really have to know what the internal data format
> > > is.
> > 
> > This is for C code.  Quite likely it *does* know what the internal
> > data format is!
> 
> C code can use the PyUnicode_* APIs to access the data. I
> don't think that argument parsing is powerful enough to
> provide the C code with enough information about the data
> contents, e.g. it can only state the encoding length, not the
> string length.

Typically, all the C code does is pass multibyte encoded strings on to
other library routines that know what to do to them, or simply give
them back unchanged at a later time.  It is essential to know the
number of bytes, for memory allocation purposes.  The number of
characters is totally immaterial (and multibyte-handling code knows
how to calculate the number of characters anyway).
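
For what it's worth, the character count is easy to recover from a
UTF-8 byte string -- every byte outside the continuation range
0x80-0xBF starts a new character:

    def utf8_len(s):
        # Count the characters in a UTF-8 encoded byte string.
        n = 0
        for c in s:
            if not (0x80 <= ord(c) < 0xC0):
                n = n + 1
        return n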

> > > Defining "s#" to return UTF-8 data does not only
> > > make "s" and "s#" return the same data format (which should
> > > always be the case, IMO),
> > 
> > That was before t# was introduced.  No more, alas.  If you replace s#
> > with t#, I agree with you completely.
> 
> Done :-)
>  
> > > but also hides the internal
> > > format from the user and gives him a reliable cross-platform
> > > data representation of Unicode data (note that UTF-8 doesn't
> > > have the byte order problems of UTF-16).
> > >
> > > If you are still with, let's look at what "s" and "s#"
> > 
> > (and t#, which is more relevant here)
> > 
> > > do: they return pointers into data areas which have to
> > > be kept alive until the corresponding object dies.
> > >
> > > The only way to support this feature is by allocating
> > > a buffer for just this purpose (on the fly and only if
> > > needed to prevent excessive memory load). The other
> > > options of adding new magic parser markers or switching
> > > to more generic one all have one downside: you need to
> > > change existing code which is in conflict with the idea
> > > we started out with.
> > 
> > Agreed.  I think this was our thinking when Greg & I introduced t#.
> > My own preference would be to allocate a whole string object, not
> > just a buffer; this could then also be used for the .encode() method
> > using the default encoding.
> 
> Good point. I'll change  to , a Python
> string object created on request.
>  
> > > So, again, the question is: do we want this magical
> > > intergration or not ? Note that this is a design question,
> > > not one of memory consumption...
> > 
> > Yes, I want it.
> > 
> > Note that this doesn't guarantee that all old extensions will work
> > flawlessly when passed Unicode objects; but I think that it covers
> > most cases where you could have a reasonable expectation that it
> > works.
> > 
> > (Hm, unfortunately many reasonable expectations seem to involve
> > the current user's preferred encoding. :-( )
> 
> -- 
> Marc-Andre Lemburg

--Guido van Rossum (home page: http://www.python.org/~guido/)


From Mike.Da.Silva@uk.fid-intl.com  Mon Nov 15 16:01:59 1999
From: Mike.Da.Silva@uk.fid-intl.com (Da Silva, Mike)
Date: Mon, 15 Nov 1999 16:01:59 -0000
Subject: [Python-Dev] Some thoughts on the codecs...
Message-ID: 

Andy Robinson wrote:
1.	Stream interface
At the moment a codec has dump and load methods which read a (slice of a)
stream into a string in memory and vice versa.  As the proposal notes, this
could lead to errors if you take a slice out of a stream.   This is not just
due to character truncation; some Asian encodings are modal and have
shift-in and shift-out sequences as they move from Western single-byte
characters to double-byte ones.   It also seems a bit pointless to me as the
source (or target) is still a Unicode string in memory.
This is a real problem - a filter to convert big files between two encodings
should be possible without knowledge of the particular encoding, as should
one on the input/output of some server.  We can still give a default
implementation for single-byte encodings.
What's a good API for real stream conversion?   just
Codec.encodeStream(infile, outfile)  ?  or is it more useful to feed the
codec with data a chunk at a time?

A user defined chunking factor (suitably defaulted) would be useful for
processing large files.
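
A chunked recoding filter along those lines might be as small as the sketch
below; reader and writer stand for decoding and encoding stream objects
(hypothetical here), and the default chunk size is an arbitrary assumption:

DEFAULT_CHUNK = 16 * 1024       # arbitrary default for the chunking factor

def recode(reader, writer, chunk_size=DEFAULT_CHUNK):
    # reader.read() returns a chunk of Unicode data decoded from the
    # input encoding; writer.write() encodes it into the output encoding.
    # Both objects are assumed to keep any shift state between calls,
    # so chunk boundaries are safe even for modal encodings.
    while 1:
        data = reader.read(chunk_size)
        if not data:
            break
        writer.write(data)
    writer.flush()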

2.	Data driven codecs
I really like codecs being objects, and believe we could build support for a
lot more encodings, a lot sooner than is otherwise possible, by making them
data driven rather than making each one compiled C code with static mapping
tables.  What do people think about the approach below?
First of all, the ISO8859-1 series are straight mappings to Unicode code
points.  So one Python script could parse these files and build the mapping
table, and a very small data file could hold these encodings.  A compiled
helper function analogous to string.translate() could deal with most of
them.
Secondly, the double-byte ones involve a mixture of algorithms and data.
The worst cases I know are modal encodings which need a single-byte lookup
table, a double-byte lookup table, and have some very simple rules about
escape sequences in between them.  A simple state machine could still handle
these (and the single-byte mappings above become extra-simple special
cases); I could imagine feeding it a totally data-driven set of rules.  
Third, we can massively compress the mapping tables using a notation which
just lists contiguous ranges; and very often there are relationships between
encodings.  For example, "cpXYZ is just like cpXYY but with an extra
'smiley' at 0XFE32".  In these cases, a script can build a family of related
codecs in an auditable manner. 

The problem here is that we need to decide whether we are Unicode-centric,
or whether Unicode is just another supported encoding. If we are
Unicode-centric, then all code-page translations will require static mapping
tables between the appropriate Unicode character and the relevant code
points in the other encoding.  This would involve (worst case) 64k static
tables for each supported encoding.  Unfortunately this also precludes the
use of algorithmic conversions and/or sparse conversion tables, because most
of these transformations are relative to a source and target non-Unicode
encoding, e.g. JIS <----> EUC-JIS.  If we are taking the IBM approach (see
CDRA), then we can mix and match approaches, and treat Unicode strings as
just Unicode, and normal strings as being any arbitrary MBCS encoding.

To guarantee the utmost interoperability and Unicode 3.0 (and beyond)
compliance, we should probably assume that all core encodings are relative
to Unicode as the pivot encoding.  This should hopefully avoid any gotcha's
with roundtrips between any two arbitrary native encodings.  The downside is
this will probably be slower than an optimised algorithmic transformation.
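
For the single-byte case mentioned under point 2, the data-driven idea can be
made concrete with very little code; the table layout, the derive() helper and
the override values below are illustrative assumptions, not from the proposal:

# A single-byte encoding as a 256-entry list: entry i is the Unicode
# ordinal assigned to byte i (ISO 8859-1 maps every byte to itself).
latin1_table = range(256)

def derive(base_table, overrides):
    # "cpXYZ is just like cpXYY but with a few extra characters":
    # a base table plus a small dictionary of byte -> ordinal overrides.
    table = base_table[:]
    for byte, ordinal in overrides.items():
        table[byte] = ordinal
    return table

# e.g. a fictional codepage that adds a smiley in one slot:
smiley_page = derive(latin1_table, {0xFE: 0x263A})

A compiled helper analogous to string.translate() could then apply such a
table to 8-bit data at C speed, as suggested above.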

3.	What encodings to distribute?
The only clean answers to this are 'almost none', or 'everything that
Unicode 3.0 has a mapping for'.  The latter is going to add some weight to
the distribution.  What are people's feelings?  Do we ship any at all apart
from the Unicode ones?  Should new encodings be downloadable from
www.python.org  ?  Should there be an optional
package outside the main distribution?
Ship with Unicode encodings in the core, the rest should be an add on
package.

If we are truly Unicode-centric, this gives us the most value in terms of
accessing a Unicode character properties database, which will provide
language neutral case folding, Hankaku <----> Zenkaku folding (Japan
specific), and composition / normalisation between composed characters and
their component nonspacing characters.

Regards,
Mike da Silva


From andy@robanal.demon.co.uk  Mon Nov 15 16:18:13 1999
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Mon, 15 Nov 1999 08:18:13 -0800 (PST)
Subject: [Python-Dev] just say no...
Message-ID: <19991115161813.13111.rocketmail@web606.mail.yahoo.com>

--- Guido van Rossum  wrote:

> Did you read Andy Robinson's case study?  He 
> suggested that for certain encodings there may be 
> other things you can do that are more
> user-friendly than raising an exception, depending
> on the application. I am proposing to leave this a
> detail of each specific translation.
> There may even be translations that do the same
thing
> except they have a different behavior for 
> untranslatable cases -- e.g. a strict version
> that raises an exception and a non-strict version
> that replaces bad characters with '?'.  I think this
> is one of the powers of having an extensible set of 
> encodings.

This would be a desirable option in almost every case.
 Default is an exception (I want to know my data is
not clean), but an option to specify an error
character.  It is usually a question mark but Mike
tells me that some encodings specify the error
character to use.  

Example - I query a Sybase Unicode database containing
European accents or Japanese.  By default it will give
me question marks.  If I issue the command 'set
char_convert utf8', then I see the lot (as garbage,
but never mind).  If it always errored whenever a
query result contained unexpected data, it would be
almost impossible to maintain the database.

If I wrote my own codec class for a family of
encodings, I'd give it an even wider variety of
error-logging options - maybe a mode where it told me
where in the file the dodgy characters were.

We've already taken the key step by allowing codecs to
be separate objects registered at run-time,
implemented in either C or Python.  This means that
once again Python will have the most flexible solution
around.
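
A run-time registration could be as simple as dropping an object into the
registry; the unicodec.codecs mapping is the notation Guido uses later in
this thread, and the codec class here is only a stand-in:

import unicodec                 # hypothetical module from the proposal

class MyCodec:
    def encode(self, u):
        # return an 8-bit string for the Unicode object u
        raise NotImplementedError
    def decode(self, s):
        # return a Unicode object for the 8-bit string s
        raise NotImplementedError

unicodec.codecs['x-my-encoding'] = MyCodec()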

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From jim@digicool.com  Mon Nov 15 16:29:13 1999
From: jim@digicool.com (Jim Fulton)
Date: Mon, 15 Nov 1999 11:29:13 -0500
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com>
Message-ID: <383034D9.6E1E74D4@digicool.com>

"A.M. Kuchling" wrote:
> 
> I noticed this in PyErr_Format(exception, format, va_alist):
> 
>         char buffer[500]; /* Caller is responsible for limiting the format */
>         ...
>         vsprintf(buffer, format, vargs);
> 
> Making the caller responsible for this is error-prone.  The danger, of
> course, is a buffer overflow caused by generating an error string
> that's larger than the buffer, possibly letting people execute
> arbitrary code.  We could add a test to the configure script for
> vsnprintf() and use it when possible, but that only fixes the problem
> on platforms which have it.  Can we find an implementation of
> vsnprintf() someplace?

I would prefer to see a different interface altogether:

  PyObject *PyErr_StringFormat(errtype, format, buildformat, ...)

So, you could generate an error like this:

  return PyErr_StringFormat(ErrorObject, 
     "You had too many, %d, foos. The last one was %s", 
     "iO", n, someObject)

I implemented this in cPickle. See cPickle_ErrFormat.
(Note that it always returns NULL.)

Jim

--
Jim Fulton           mailto:jim@digicool.com   Python Powered!        
Technical Director   (888) 344-4332            http://www.python.org  
Digital Creations    http://www.digicool.com   http://www.zope.org    

Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email
address may not be added to any commercial mail list with out my
permission.  Violation of my privacy with advertising or SPAM will
result in a suit for a MINIMUM of $500 damages/incident, $1500 for
repeats.


From bwarsaw@cnri.reston.va.us (Barry A. Warsaw)  Mon Nov 15 16:54:10 1999
From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw)
Date: Mon, 15 Nov 1999 11:54:10 -0500 (EST)
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com>
 <199911151523.KAA27163@eric.cnri.reston.va.us>
Message-ID: <14384.15026.392781.151886@anthem.cnri.reston.va.us>

>>>>> "Guido" == Guido van Rossum  writes:

    Guido> Assuming that Linux and Solaris have vsnprintf(), can't we
    Guido> just use the configure script to detect it, and issue a
    Guido> warning blaming the platform for those platforms that don't
    Guido> have it?  That seems much simpler (from a maintenance
    Guido> perspective) than carrying our own implementation around
    Guido> (even if we can borrow the Apache version).

Mailman uses vsnprintf in its C wrapper.  There's a simple configure
test...

# Checks for library functions.
AC_CHECK_FUNCS(vsnprintf)

...and for systems that don't have a vsnprintf, I modified a version
from GNU screen.  It may not have gone through the scrutiny of
Apache's implementation, but for Mailman it was more important that it
be GPL'd (not a Python requirement).

-Barry


From jim@digicool.com  Mon Nov 15 16:56:38 1999
From: jim@digicool.com (Jim Fulton)
Date: Mon, 15 Nov 1999 11:56:38 -0500
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com>
 <199911151523.KAA27163@eric.cnri.reston.va.us> <14384.10383.718373.432606@amarok.cnri.reston.va.us>
Message-ID: <38303B46.F6AEEDF1@digicool.com>

"Andrew M. Kuchling" wrote:
> 
> Guido van Rossum writes:
> >Assuming that Linux and Solaris have vsnprintf(), can't we just use
> >the configure script to detect it, and issue a warning blaming the
> >platform for those platforms that don't have it?  That seems much
> 
> But people using an already-installed Python binary won't see any such
> configure-time warning, and won't find out about the potential
> problem.  Plus, how do people fix the problem on platforms that don't
> have vsnprintf() -- switch to Solaris or Linux?  Not much of a
> solution.  (vsnprintf() isn't ANSI C, though it's a common extension,
> so platforms that lack it aren't really deficient.)
> 
> Hmm... could we maybe use Python's existing (string % vars) machinery?
>  No, that seems to be hard, because it would want
> PyObjects, and we can't know what Python types to convert the varargs
> to, unless we parse the format string (at which point we may as well
> get a vsnprintf() implementation.

It's easy. You use two format strings. One a Python string format, 
and the other a Py_BuildValue format. See my other note.

Jim


--
Jim Fulton           mailto:jim@digicool.com   Python Powered!        
Technical Director   (888) 344-4332            http://www.python.org  
Digital Creations    http://www.digicool.com   http://www.zope.org    

Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email
address may not be added to any commercial mail list with out my
permission.  Violation of my privacy with advertising or SPAM will
result in a suit for a MINIMUM of $500 damages/incident, $1500 for
repeats.


From tismer@appliedbiometrics.com  Mon Nov 15 17:02:20 1999
From: tismer@appliedbiometrics.com (Christian Tismer)
Date: Mon, 15 Nov 1999 18:02:20 +0100
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us>
Message-ID: <38303C9C.42C5C830@appliedbiometrics.com>


Guido van Rossum wrote:
> 
> > I noticed this in PyErr_Format(exception, format, va_alist):
> >
> >       char buffer[500]; /* Caller is responsible for limiting the format */
> >       ...
> >       vsprintf(buffer, format, vargs);
> >
> > Making the caller responsible for this is error-prone.
> 
> Agreed.  The limit of 500 chars, while technically undocumented, is
> part of the specs for PyErr_Format (which is currently wholly
> undocumented).  The current callers all have explicit precautions, but
> of course I agree that this is a potential danger.

All but one (checked them all):
In ceval.c, function call_builtin, there is a possible security hole.
If an extension module happens to create a very long type name
(maybe just via a bug), we will crash.

	}
	PyErr_Format(PyExc_TypeError, "call of non-function (type %s)",
		     func->ob_type->tp_name);
	return NULL;
}

ciao - chris

-- 
Christian Tismer             :^)   
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101   :    *Starship* http://starship.python.net
10553 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home


From guido@CNRI.Reston.VA.US  Mon Nov 15 19:32:00 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 14:32:00 -0500
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: Your message of "Mon, 15 Nov 1999 18:02:20 +0100."
 <38303C9C.42C5C830@appliedbiometrics.com>
References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us>
 <38303C9C.42C5C830@appliedbiometrics.com>
Message-ID: <199911151932.OAA28008@eric.cnri.reston.va.us>

> All but one (checked them all):

Thanks for checking.

> In ceval.c, function call_builtin, there is a possible security hole.
> If an extension module happens to create a very long type name
> (maybe just via a bug), we will crash.
> 
> 	}
> 	PyErr_Format(PyExc_TypeError, "call of non-function (type %s)",
> 		     func->ob_type->tp_name);
> 	return NULL;
> }

I would think that an extension module with a name of nearly 500
characters would draw a lot of attention as being ridiculous.  If
there was a bug through which you could make tp_name point to such a
long string, you could probably exploit that bug without having to use
this particular PyErr_Format() statement.

However, I agree it's better to be safe than sorry, so I've checked in
a fix making it %.400s.
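
For reference, the precision in '%.400s' simply truncates the inserted string
to at most 400 characters, which is what bounds the buffer usage; the same
notation works in Python's % formatting, so a quick illustration:

name = 'x' * 1000
msg = "call of non-function (type %.400s)" % name
assert len(msg) == 27 + 400 + 1   # fixed text + at most 400 chars of name + ')'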

--Guido van Rossum (home page: http://www.python.org/~guido/)


From tismer@appliedbiometrics.com  Mon Nov 15 19:41:14 1999
From: tismer@appliedbiometrics.com (Christian Tismer)
Date: Mon, 15 Nov 1999 20:41:14 +0100
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us>
 <38303C9C.42C5C830@appliedbiometrics.com> <199911151932.OAA28008@eric.cnri.reston.va.us>
Message-ID: <383061DA.CA5CB373@appliedbiometrics.com>


Guido van Rossum wrote:
> 
> > All but one (checked them all):

[ceval.c without limits]

> I would think that an extension module with a name of nearly 500
> characters would draw a lot of attention as being ridiculous.  If
> there was a bug through which you could make tp_name point to such a
> long string, you could probably exploit that bug without having to use
> this particular PyErr_Format() statement.

Of course this case is very unlikely.
My primary intent was to create such a mess without
an extension, and ExtensionClass seemed to be a candidate since
it synthetizes a type name at runtime (!).
This would have been dangerous since EC is in the heart of Zope.

But, I could not get at this special case since EC always
passes the class/instance checks and so this case can never happen :(

The above lousy result was just to say *something* after no success.

> However, I agree it's better to be safe than sorry, so I've checked in
> a fix making it %.400s.

cheap, consistent, fine - thanks - chris

-- 
Christian Tismer             :^)   
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101   :    *Starship* http://starship.python.net
10553 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home


From mal@lemburg.com  Mon Nov 15 19:04:59 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 20:04:59 +0100
Subject: [Python-Dev] just say no...
References:  <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us>
 <382F33AA.C3EE825A@lemburg.com> <199911151550.KAA27188@eric.cnri.reston.va.us>
Message-ID: <3830595B.348E8CC7@lemburg.com>

Guido van Rossum wrote:
> 
> [Misunderstanding in the reasoning behind "t#" and "s#"]
> 
> Thanks for not picking an argument.  Multibyte encodings typically
> have ASCII as a subset (in such a way that an ASCII string is
> represented as itself in bytes).  This is the characteristic that's
> needed in my view.
> 
> > It was my understanding that "t#" refers to single byte character
> > data. That's where the above arguments were aiming at...
> 
> t# refers to byte-encoded data.  Multibyte encodings are explicitly
> designed to be passed cleanly through processing steps that handle
> single-byte character data, as long as they are 8-bit clean and don't
> do too much processing.

Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
"8-bit clean" as you obviously did.
 
> > Perhaps I'm missing something...
> 
> The idea is that (1)/s# disallows any translation of the data, while
> (2)/t# requires translation of the data to an ASCII superset (possibly
> multibyte, such as UTF-8 or shift-JIS).  (2)/t# assumes that the data
> contains text and that if the text consists of only ASCII characters
> they are represented as themselves.  (1)/s# makes no such assumption.
> 
> In terms of implementation, Unicode objects should translate
> themselves to the default encoding for t# (if possible), but they
> should make the native representation available for s#.
> 
> For example, take an encryption engine.  While it is defined in terms
> of byte streams, there's no requirement that the bytes represent
> characters -- they could be the bytes of a GIF file, an MP3 file, or a
> gzipped tar file.  If we pass Unicode to an encryption engine, we want
> Unicode to come out at the other end, not UTF-8.  (If we had wanted to
> encrypt UTF-8, we should have fed it UTF-8.)
> 
> > > Note that the definition of the 's' format was left alone -- as
> > > before, it means you need an 8-bit text string not containing null
> > > bytes.
> >
> > This definition should then be changed to "text string without
> > null bytes" dropping the 8-bit reference.
> 
> Aha, I think there's a confusion about what "8-bit" means.  For me, a
> multibyte encoding like UTF-8 is still 8-bit.  Am I alone in this?
> (As far as I know, C uses char* to represent multibyte characters.)
> Maybe we should disambiguate it more explicitly?

There should be some definition for the two markers and the
ideas behind them in the API guide, I guess.
 
> > Hmm, I would strongly object to making "s#" return the internal
> > format. file.write() would then default to writing UTF-16 data
> > instead of UTF-8 data. This could result in strange errors
> > due to the UTF-16 format being endian dependent.
> 
> But this was the whole design.  file.write() needs to be changed to
> use s# when the file is open in binary mode and t# when the file is
> open in text mode.

Ok, that would make the situation a little clearer (even though
I expect the two different encodings to produce some FAQs). 

I still don't feel very comfortable about the fact that all
existing APIs using "s#" will suddenly receive UTF-16 data if
being passed Unicode objects: this probably won't get us the
"magical" Unicode integration we invision, since "t#" usage is not
very wide spread and character handling code will probably not
work well with UTF-16 encoded strings.

Anyway, we should probably try out both methods...

> > It would also break the symmetry between file.write(u) and
> > unicode(file.read()), since the default encoding is not used as
> > internal format for other reasons (see proposal).
> 
> If the file is encoded using UTF-16 or UCS-2, you should open it in
> binary mode and use unicode(file.read(), 'utf-16').  (Or perhaps the
> app should read the first 2 bytes and check for a BOM and then decide
> to choose between 'utf-16-be' and 'utf-16-le'.)

Right, that's the idea (there is a note on this in the Standard
Codec section of the proposal).
 
> > > Any of the following choices is acceptable (from the point of view of
> > > not breaking the intended t# semantics; we can now start deciding
> > > which we like best):
> >
> > I think we have already agreed on using UTF-8 for the default
> > encoding. It has quite a few advantages. See
> >
> >       http://czyborra.com/utf/
> >
> > for a good overview of the pros and cons.
> 
> Of course.  I was just presenting the list as an argument that if
> we changed our mind about the default encoding, t# should follow the
> default encoding (and not pick an encoding by other means).

Ok.
 
> > > - utf-8
> > > - latin-1
> > > - ascii
> > > - shift-jis
> > > - lower byte of unicode ordinal
> > > - some user- or os-specified multibyte encoding
> > >
> > > As far as t# is concerned, for encodings that don't encode all of
> > > Unicode, untranslatable characters could be dealt with in any number
> > > of ways (raise an exception, ignore, replace with '?', make best
> > > effort, etc.).
> >
> > The usual Python way would be: raise an exception. This is what
> > the proposal defines for Codecs in case an encoding/decoding
> > mapping is not possible, BTW. (UTF-8 will always succeed on
> > output.)
> 
> Did you read Andy Robinson's case study?  He suggested that for
> certain encodings there may be other things you can do that are more
> user-friendly than raising an exception, depending on the application.
> I am proposing to leave this a detail of each specific translation.
> There may even be translations that do the same thing except they have
> a different behavior for untranslatable cases -- e.g. a strict version
> that raises an exception and a non-strict version that replaces bad
> characters with '?'.  I think this is one of the powers of having an
> extensible set of encodings.

Agreed, the Codecs should decide for themselves what to do. I'll
add a note to the next version of the proposal.
 
> > > Given the current context, it should probably be the same as the
> > > default encoding -- i.e., utf-8.  If we end up making the default
> > > user-settable, we'll have to decide what to do with untranslatable
> > > characters -- but that will probably be decided by the user too (it
> > > would be a property of a specific translation specification).
> > >
> > > In any case, I feel that t# could receive a multi-byte encoding,
> > > s# should receive raw binary data, and they should correspond to
> > > getcharbuffer and getreadbuffer, respectively.
> >
> > Why would you want to have "s#" return the raw binary data for
> > Unicode objects ?
> 
> Because file.write() for a binary file, and other similar things
> (e.g. the encryption engine example I mentioned above) must have
> *some* way to get at the raw bits.

What for ? Any lossless encoding should do the trick... UTF-8
is just as good as UTF-16 for binary files; plus it's more compact
for ASCII data. I don't really see a need to get explicitly
at the internal data representation because both encodings are
in fact "internal" w/r to Unicode objects.

The only argument I can come up with is that using UTF-16 for
binary files could (possibly) eliminate the UTF-8 conversion step
which is otherwise always needed.
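
The compactness point (UTF-8 vs. UTF-16 for ASCII data) is easy to quantify
for BMP text; a small sketch that does not depend on any of the proposed APIs:

def utf8_length(ordinals):
    # Bytes needed to encode the given Unicode ordinals (BMP only) as UTF-8.
    n = 0
    for o in ordinals:
        if o < 0x80:
            n = n + 1
        elif o < 0x800:
            n = n + 2
        else:
            n = n + 3
    return n

ascii_text = map(ord, "just plain ASCII")   # 16 characters
assert utf8_length(ascii_text) == 16        # UTF-8: one byte per ASCII char
# UTF-16 needs 2 * 16 == 32 bytes for the same data.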
 
> > Note that it is not mentioned anywhere that
> > "s#" and "t#" do have to necessarily return different things
> > (binary being a superset of text). I'd opt for "s#" and "t#" both
> > returning UTF-8 data. This can be implemented by delegating the
> > buffer slots to the  object (see below).
> 
> This would defeat the whole purpose of introducing t#.  We might as
> well drop t# then altogether if we adopt this.

Well... yes ;-)
 
> > > > Now Greg would chime in with the buffer interface and
> > > > argue that it should make the underlying internal
> > > > format accessible. This is a bad idea, IMHO, since you
> > > > shouldn't really have to know what the internal data format
> > > > is.
> > >
> > > This is for C code.  Quite likely it *does* know what the internal
> > > data format is!
> >
> > C code can use the PyUnicode_* APIs to access the data. I
> > don't think that argument parsing is powerful enough to
> > provide the C code with enough information about the data
> > contents, e.g. it can only state the encoding length, not the
> > string length.
> 
> Typically, all the C code does is pass multibyte encoded strings on to
> other library routines that know what to do to them, or simply give
> them back unchanged at a later time.  It is essential to know the
> number of bytes, for memory allocation purposes.  The number of
> characters is totally immaterial (and multibyte-handling code knows
> how to calculate the number of characters anyway).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Mon Nov 15 19:20:55 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 20:20:55 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>
Message-ID: <38305D17.60EC94D0@lemburg.com>

Andy Robinson wrote:
> 
> Some thoughts on the codecs...
> 
> 1. Stream interface
> At the moment a codec has dump and load methods which
> read a (slice of a) stream into a string in memory and
> vice versa.  As the proposal notes, this could lead to
> errors if you take a slice out of a stream.   This is
> not just due to character truncation; some Asian
> encodings are modal and have shift-in and shift-out
> sequences as they move from Western single-byte
> characters to double-byte ones.   It also seems a bit
> pointless to me as the source (or target) is still a
> Unicode string in memory.
> 
> This is a real problem - a filter to convert big files
> between two encodings should be possible without
> knowledge of the particular encoding, as should one on
> the input/output of some server.  We can still give a
> default implementation for single-byte encodings.
> 
> What's a good API for real stream conversion?   just
> Codec.encodeStream(infile, outfile)  ?  or is it more
> useful to feed the codec with data a chunk at a time?

The idea was to use Unicode as intermediate for all
encoding conversions. 

What you envision here are stream recoders. They can
easily be implemented as a useful addition to the Codec
subclasses, but I don't think that these have to go
into the core.
 
> 2. Data driven codecs
> I really like codecs being objects, and believe we
> could build support for a lot more encodings, a lot
> sooner than is otherwise possible, by making them data
> driven rather making each one compiled C code with
> static mapping tables.  What do people think about the
> approach below?
> 
> First of all, the ISO8859-1 series are straight
> mappings to Unicode code points.  So one Python script
> could parse these files and build the mapping table,
> and a very small data file could hold these encodings.
>   A compiled helper function analogous to
> string.translate() could deal with most of them.

The problem with these large tables is that currently
Python modules are not shared among processes since
every process builds its own table.

Static C data has the advantage of being shareable at
the OS level.

You can of course implement Python based lookup tables,
but these would probably be too large...
 
> Secondly, the double-byte ones involve a mixture of
> algorithms and data.  The worst cases I know are modal
> encodings which need a single-byte lookup table, a
> double-byte lookup table, and have some very simple
> rules about escape sequences in between them.  A
> simple state machine could still handle these (and the
> single-byte mappings above become extra-simple special
> cases); I could imagine feeding it a totally
> data-driven set of rules.
> 
> Third, we can massively compress the mapping tables
> using a notation which just lists contiguous ranges;
> and very often there are relationships between
> encodings.  For example, "cpXYZ is just like cpXYY but
> with an extra 'smiley' at 0XFE32".  In these cases, a
> script can build a family of related codecs in an
> auditable manner.

These are all great ideas, but I think they unnecessarily
complicate the proposal.
 
> 3. What encodings to distribute?
> The only clean answers to this are 'almost none', or
> 'everything that Unicode 3.0 has a mapping for'.  The
> latter is going to add some weight to the
> distribution.  What are people's feelings?  Do we ship
> any at all apart from the Unicode ones?  Should new
> encodings be downloadable from www.python.org?  Should
> there be an optional package outside the main
> distribution?

Since Codecs can be registered at runtime, there is quite
some potential there for extension writers coding their
own fast codecs. E.g. one could use mxTextTools as codec
engine working at C speeds.

I would propose to only add some very basic encodings to
the standard distribution, e.g. the ones mentioned under
Standard Codecs in the proposal:

  'utf-8':		8-bit variable length encoding
  'utf-16':		16-bit variable length encoding (little/big endian)
  'utf-16-le':		utf-16 but explicitly little endian
  'utf-16-be':		utf-16 but explicitly big endian
  'ascii':		7-bit ASCII codepage
  'latin-1':		Latin-1 codepage
  'html-entities':	Latin-1 + HTML entities;
			see htmlentitydefs.py from the standard Python Lib
  'jis' (a popular version XXX):
			Japanese character encoding
  'unicode-escape':	See Unicode Constructors for a definition
  'native':		Dump of the Internal Format used by Python

Perhaps not even 'html-entities' (even though it would make
a cool replacement for cgi.escape()) and maybe we should
also place the JIS encoding into a separate Unicode package.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Mon Nov 15 19:26:16 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 20:26:16 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: 
Message-ID: <38305E58.28B20E24@lemburg.com>

"Da Silva, Mike" wrote:
> 
> Andy Robinson wrote:
> --
> 1.      Stream interface
> At the moment a codec has dump and load methods which read a (slice of a)
> stream into a string in memory and vice versa.  As the proposal notes, this
> could lead to errors if you take a slice out of a stream.   This is not just
> due to character truncation; some Asian encodings are modal and have
> shift-in and shift-out sequences as they move from Western single-byte
> characters to double-byte ones.   It also seems a bit pointless to me as the
> source (or target) is still a Unicode string in memory.
> This is a real problem - a filter to convert big files between two encodings
> should be possible without knowledge of the particular encoding, as should
> one on the input/output of some server.  We can still give a default
> implementation for single-byte encodings.
> What's a good API for real stream conversion?   just
> Codec.encodeStream(infile, outfile)  ?  or is it more useful to feed the
> codec with data a chunk at a time?
> --
> A user defined chunking factor (suitably defaulted) would be useful for
> processing large files.
> --
> 2.      Data driven codecs
> I really like codecs being objects, and believe we could build support for a
> lot more encodings, a lot sooner than is otherwise possible, by making them
> data driven rather making each one compiled C code with static mapping
> tables.  What do people think about the approach below?
> First of all, the ISO8859-1 series are straight mappings to Unicode code
> points.  So one Python script could parse these files and build the mapping
> table, and a very small data file could hold these encodings.  A compiled
> helper function analogous to string.translate() could deal with most of
> them.
> Secondly, the double-byte ones involve a mixture of algorithms and data.
> The worst cases I know are modal encodings which need a single-byte lookup
> table, a double-byte lookup table, and have some very simple rules about
> escape sequences in between them.  A simple state machine could still handle
> these (and the single-byte mappings above become extra-simple special
> cases); I could imagine feeding it a totally data-driven set of rules.
> Third, we can massively compress the mapping tables using a notation which
> just lists contiguous ranges; and very often there are relationships between
> encodings.  For example, "cpXYZ is just like cpXYY but with an extra
> 'smiley' at 0XFE32".  In these cases, a script can build a family of related
> codecs in an auditable manner.
> --
> The problem here is that we need to decide whether we are Unicode-centric,
> or whether Unicode is just another supported encoding. If we are
> Unicode-centric, then all code-page translations will require static mapping
> tables between the appropriate Unicode character and the relevant code
> points in the other encoding.  This would involve (worst case) 64k static
> tables for each supported encoding.  Unfortunately this also precludes the
> use of algorithmic conversions and or sparse conversion tables because most
> of these transformations are relative to a source and target non-Unicode
> encoding, eg JIS <---->EUCJIS.  If we are taking the IBM approach (see
> CDRA), then we can mix and match approaches, and treat Unicode strings as
> just Unicode, and normal strings as being any arbitrary MBCS encoding.
> 
> To guarantee the utmost interoperability and Unicode 3.0 (and beyond)
> compliance, we should probably assume that all core encodings are relative
> to Unicode as the pivot encoding.  This should hopefully avoid any gotcha's
> with roundtrips between any two arbitrary native encodings.  The downside is
> this will probably be slower than an optimised algorithmic transformation.

Optimizations should go into separate packages for direct EncodingA
-> EncodingB conversions. I don't think we need them in the core.

> --
> 3.      What encodings to distribute?
> The only clean answers to this are 'almost none', or 'everything that
> Unicode 3.0 has a mapping for'.  The latter is going to add some weight to
> the distribution.  What are people's feelings?  Do we ship any at all apart
> from the Unicode ones?  Should new encodings be downloadable from
> www.python.org  ?  Should there be an optional
> package outside the main distribution?
> --
> Ship with Unicode encodings in the core, the rest should be an add on
> package.
> 
> If we are truly Unicode-centric, this gives us the most value in terms of
> accessing a Unicode character properties database, which will provide
> language neutral case folding, Hankaku <----> Zenkaku folding (Japan
> specific), and composition / normalisation between composed characters and
> their component nonspacing characters.

From the proposal:

"""
Unicode Character Properties:
-----------------------------

A separate module "unicodedata" should provide a compact interface to
all Unicode character properties defined in the standard's
UnicodeData.txt file.

Among other things, these properties provide ways to recognize
numbers, digits, spaces, whitespace, etc.

Since this module will have to provide access to all Unicode
characters, it will eventually have to contain the data from
UnicodeData.txt which takes up around 200kB. For this reason, the data
should be stored in static C data. This enables compilation as shared
module which the underlying OS can shared between processes (unlike
normal Python code modules).

XXX Define the interface...

"""

Special CJK packages can then access this data for the purposes
you mentioned above.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From guido@CNRI.Reston.VA.US  Mon Nov 15 21:37:28 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 16:37:28 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: Your message of "Mon, 15 Nov 1999 20:20:55 +0100."
 <38305D17.60EC94D0@lemburg.com>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>
 <38305D17.60EC94D0@lemburg.com>
Message-ID: <199911152137.QAA28280@eric.cnri.reston.va.us>

> Andy Robinson wrote:
> > 
> > Some thoughts on the codecs...
> > 
> > 1. Stream interface
> > At the moment a codec has dump and load methods which
> > read a (slice of a) stream into a string in memory and
> > vice versa.  As the proposal notes, this could lead to
> > errors if you take a slice out of a stream.   This is
> > not just due to character truncation; some Asian
> > encodings are modal and have shift-in and shift-out
> > sequences as they move from Western single-byte
> > characters to double-byte ones.   It also seems a bit
> > pointless to me as the source (or target) is still a
> > Unicode string in memory.
> > 
> > This is a real problem - a filter to convert big files
> > between two encodings should be possible without
> > knowledge of the particular encoding, as should one on
> > the input/output of some server.  We can still give a
> > default implementation for single-byte encodings.
> > 
> > What's a good API for real stream conversion?   just
> > Codec.encodeStream(infile, outfile)  ?  or is it more
> > useful to feed the codec with data a chunk at a time?

M.-A. Lemburg responds:

> The idea was to use Unicode as intermediate for all
> encoding conversions. 
> 
> What you invision here are stream recoders. The can
> easily be implemented as an useful addition to the Codec
> subclasses, but I don't think that these have to go
> into the core.

What I wanted was a codec API that acts somewhat like a buffered file;
the buffer makes it possible to handle shift states efficiently.  This
is not exactly what Andy shows, but it's not what Marc's current spec
has either.

I had thought something more like what Java does: an output stream
codec's constructor takes a writable file object and the object
returned by the constructor has a write() method, a flush() method and
a close() method.  It acts like a buffering interface to the
underlying file; this allows it to generate the minimal number of
shift sequences.  Similar for input stream codecs.

Andy's file translation example could then be written as follows:

# assuming variables input_file, input_encoding, output_file,
# output_encoding, and constant BUFFER_SIZE

f = open(input_file, "rb")
f1 = unicodec.codecs[input_encoding].stream_reader(f)
g = open(output_file, "wb")
g1 = unicodec.codecs[output_encoding].stream_writer(g)

while 1:
      buffer = f1.read(BUFFER_SIZE)
      if not buffer:
	 break
      g1.write(buffer)

g1.close()
f1.close()

Note that we could possibly make these the only API that a codec needs
to provide; the string object <--> unicode object conversions can be
done using this and the cStringIO module.  (On the other hand it seems
a common case that would be quite useful.)
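
Under that scheme the string-level conversions could be layered on the stream
codecs roughly as follows; unicodec and its stream_reader/stream_writer slots
are the hypothetical names used in the example above:

import cStringIO
import unicodec                      # hypothetical, as above

def encode(u, encoding):
    # Unicode object -> encoded 8-bit string, built on the stream API.
    buf = cStringIO.StringIO()
    writer = unicodec.codecs[encoding].stream_writer(buf)
    writer.write(u)
    writer.flush()
    return buf.getvalue()

def decode(s, encoding):
    # Encoded 8-bit string -> Unicode object.
    reader = unicodec.codecs[encoding].stream_reader(cStringIO.StringIO(s))
    return reader.read()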

> > 2. Data driven codecs
> > I really like codecs being objects, and believe we
> > could build support for a lot more encodings, a lot
> > sooner than is otherwise possible, by making them data
> > driven rather making each one compiled C code with
> > static mapping tables.  What do people think about the
> > approach below?
> > 
> > First of all, the ISO8859-1 series are straight
> > mappings to Unicode code points.  So one Python script
> > could parse these files and build the mapping table,
> > and a very small data file could hold these encodings.
> >   A compiled helper function analogous to
> > string.translate() could deal with most of them.
> 
> The problem with these large tables is that currently
> Python modules are not shared among processes since
> every process builds its own table.
> 
> Static C data has the advantage of being shareable at
> the OS level.

Don't worry about it.  128K is too small to care, I think...

> You can of course implement Python based lookup tables,
> but these should be too large...
>  
> > Secondly, the double-byte ones involve a mixture of
> > algorithms and data.  The worst cases I know are modal
> > encodings which need a single-byte lookup table, a
> > double-byte lookup table, and have some very simple
> > rules about escape sequences in between them.  A
> > simple state machine could still handle these (and the
> > single-byte mappings above become extra-simple special
> > cases); I could imagine feeding it a totally
> > data-driven set of rules.
> > 
> > Third, we can massively compress the mapping tables
> > using a notation which just lists contiguous ranges;
> > and very often there are relationships between
> > encodings.  For example, "cpXYZ is just like cpXYY but
> > with an extra 'smiley' at 0XFE32".  In these cases, a
> > script can build a family of related codecs in an
> > auditable manner.
> 
> These are all great ideas, but I think they unnecessarily
> complicate the proposal.

Agreed, let's leave the *implementation* of codecs out of the current
efforts.

However I want to make sure that the *interface* to codecs is defined
right, because changing it will be expensive.  (This is Linus
Torvalds' philosophy on drivers -- he doesn't care about bugs in
drivers, as they will get fixed; however he greatly cares about
defining the driver APIs correctly.)

> > 3. What encodings to distribute?
> > The only clean answers to this are 'almost none', or
> > 'everything that Unicode 3.0 has a mapping for'.  The
> > latter is going to add some weight to the
> > distribution.  What are people's feelings?  Do we ship
> > any at all apart from the Unicode ones?  Should new
> > encodings be downloadable from www.python.org?  Should
> > there be an optional package outside the main
> > distribution?
> 
> Since Codecs can be registered at runtime, there is quite
> some potential there for extension writers coding their
> own fast codecs. E.g. one could use mxTextTools as codec
> engine working at C speeds.

(Do you think you'll be able to extort some money from HP for these? :-)

> I would propose to only add some very basic encodings to
> the standard distribution, e.g. the ones mentioned under
> Standard Codecs in the proposal:
> 
>   'utf-8':		8-bit variable length encoding
>   'utf-16':		16-bit variable length encoding (litte/big endian)
>   'utf-16-le':		utf-16 but explicitly little endian
>   'utf-16-be':		utf-16 but explicitly big endian
>   'ascii':		7-bit ASCII codepage
>   'latin-1':		Latin-1 codepage
>   'html-entities':	Latin-1 + HTML entities;
> 			see htmlentitydefs.py from the standard Pythin Lib
>   'jis' (a popular version XXX):
> 			Japanese character encoding
>   'unicode-escape':	See Unicode Constructors for a definition
>   'native':		Dump of the Internal Format used by Python
> 
> Perhaps not even 'html-entities' (even though it would make
> a cool replacement for cgi.escape()) and maybe we should
> also place the JIS encoding into a separate Unicode package.

I'd drop html-entities, it seems too cutesie.  (And who uses these
anyway, outside browsers?)

For JIS (shift-JIS?) I hope that Andy can help us with some pointers
and validation.

And unicode-escape: now that you mention it, this is a section of
the proposal that I don't understand.  I quote it here:

| Python should provide a built-in constructor for Unicode strings which
| is available through __builtins__:
| 
|   u = unicode([,=])
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

What do you mean by this notation?  Since encoding names are not
always legal Python identifiers (most contain hyphens), I don't
understand what you really meant here.  Do you mean to say that it has
to be a keyword argument?  I would disagree; and then I would have
expected the notation [,encoding=].

| With the 'unicode-escape' encoding being defined as:
| 
|   u = u''
| 
| ˇ for single characters (and this includes all \XXX sequences except \uXXXX),
|   take the ordinal and interpret it as Unicode ordinal;
| 
| ˇ for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX 
|   instead, e.g. \u03C0 to represent the character Pi.

I've looked at this several times and I don't see the difference
between the two bullets.  (Ironically, you are using a non-ASCII
character here that doesn't always display, depending on where I look
at your mail :-).

Can you give some examples?

Is u'\u0020' different from u'\x20' (a space)?

Does '\u0020' (no u prefix) have a meaning?

Also, I remember reading Tim Peters who suggested that a "raw unicode"
notation (ur"...") might be necessary, to encode regular expressions.
I tend to agree.

While I'm on the topic, I don't see in your proposal a description of
the source file character encoding.  Currently, this is undefined, and
in fact can be (ab)used to enter non-ASCII in string literals.  For
example, a programmer named François might write a file containing
this statement:

  print "Written by François." # (There's a cedilla in there!)

(He assumes his source character encoding is Latin-1, and he doesn't
want to have to type \347 when he can type a cedilla on his keyboard.)

If his source file (or .pyc file!)  is executed by a Japanese user,
this will probably print some garbage.

Using the new Unicode strings, François could change his program as
follows:

  print unicode("Written by François.", "latin-1")

Assuming that François sets his sys.stdout to use Latin-1, while the
Japanese user sets his to shift-JIS (or whatever his kanjiterm uses).

But when the Japanese user views François' source file, he will again
see garbage.  If he uses a generic tool to translate latin-1 files to
shift-JIS (assuming shift-JIS has a cedilla character) the program
will no longer work correctly -- the string "latin-1" has to be
changed to "shift-jis".

What should we do about this?  The safest and most radical solution is
to disallow non-ASCII source characters; François will then have to
type

  print u"Written by Fran\u00E7ois."

but, knowing François, he probably won't like this solution very much
(since he didn't like the \347 version either).

--Guido van Rossum (home page: http://www.python.org/~guido/)


From andy@robanal.demon.co.uk  Mon Nov 15 21:41:21 1999
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Mon, 15 Nov 1999 21:41:21 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <38305D17.60EC94D0@lemburg.com>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com>
Message-ID: <38307984.12653394@post.demon.co.uk>

On Mon, 15 Nov 1999 20:20:55 +0100, you wrote:

>These are all great ideas, but I think they unnecessarily
>complicate the proposal.

However, to claim that Python is properly internationalized, we will
need a large number of multi-byte encodings to be available.  It's a
large amount of work, it must be provably correct, and someone's going
to have to do it.  So if anyone with more C expertise than me - not
hard :-) - is interested, please talk to me.

I'm not suggesting putting my points in the Unicode proposal - in
fact, I'm very happy we have a proposal which allows for extension,
and lets us work on the encodings separately (and later).

>Since Codecs can be registered at runtime, there is quite
>some potential there for extension writers coding their
>own fast codecs. E.g. one could use mxTextTools as codec
>engine working at C speeds.
Exactly my thoughts, although I was thinking of a more slimmed-down
and specialized one.  The right tool might be usable for things like
compression algorithms too.  Separate project to the Unicode stuff,
but if anyone is interested, talk to me.

>I would propose to only add some very basic encodings to
>the standard distribution, e.g. the ones mentioned under
>Standard Codecs in the proposal:
>
>  'utf-8':		8-bit variable length encoding
>  'utf-16':		16-bit variable length encoding (litte/big endian)
>  'utf-16-le':		utf-16 but explicitly little endian
>  'utf-16-be':		utf-16 but explicitly big endian
>  'ascii':		7-bit ASCII codepage
>  'latin-1':		Latin-1 codepage
>  'html-entities':	Latin-1 + HTML entities;
>			see htmlentitydefs.py from the standard Pythin Lib
>  'jis' (a popular version XXX):
>			Japanese character encoding
>  'unicode-escape':	See Unicode Constructors for a definition
>  'native':		Dump of the Internal Format used by Python
>
Leave JISXXX and the CJK stuff out.  If you get into Japanese, you
really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there
are lots of options about how to do it.  The other ones are
algorithmic and can be small and fast and fit into the core.

Ditto with HTML, and maybe even escaped-unicode too.

In summary, the current discussion is clearly doing the right things,
but is only covering a small percentage of what needs to be done to
internationalize Python fully.

- Andy



From guido@CNRI.Reston.VA.US  Mon Nov 15 21:49:26 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 16:49:26 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: Your message of "Mon, 15 Nov 1999 21:41:21 GMT."
 <38307984.12653394@post.demon.co.uk>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com>
 <38307984.12653394@post.demon.co.uk>
Message-ID: <199911152149.QAA28345@eric.cnri.reston.va.us>

> In summary, the current discussion is clearly doing the right things,
> but is only covering a small percentage of what needs to be done to
> internationalize Python fully.

Agreed.  So let's focus on defining interfaces that are correct and
convenient so others who want to add codecs won't have to fight our
architecture!

Is the current architecture good enough so that the Japanese codecs
will fit in it?  (I'm particularly worried about the stream codecs,
see my previous message.)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From andy@robanal.demon.co.uk  Mon Nov 15 21:58:34 1999
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Mon, 15 Nov 1999 21:58:34 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <199911152149.QAA28345@eric.cnri.reston.va.us>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com>   <38307984.12653394@post.demon.co.uk> <199911152149.QAA28345@eric.cnri.reston.va.us>
Message-ID: <3831806d.14422147@post.demon.co.uk>

On Mon, 15 Nov 1999 16:49:26 -0500, you wrote:

>> In summary, the current discussion is clearly doing the right things,
>> but is only covering a small percentage of what needs to be done to
>> internationalize Python fully.
>
>Agreed.  So let's focus on defining interfaces that are correct and
>convenient so others who want to add codecs won't have to fight our
>architecture!
>
>Is the current architecture good enough so that the Japanese codecs
>will fit in it?  (I'm particularly worried about the stream codecs,
>see my previous message.)
>
No, I don't think it is good enough.  We need a stream codec, and as
you said the string and file interfaces can be built out of that.  

You guys will know better than me what the best patterns for that
are...

- Andy






From andy@robanal.demon.co.uk  Mon Nov 15 22:30:53 1999
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Mon, 15 Nov 1999 22:30:53 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>   <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: <383086da.16067684@post.demon.co.uk>

On Mon, 15 Nov 1999 16:37:28 -0500, you wrote:

># assuming variables input_file, input_encoding, output_file,
># output_encoding, and constant BUFFER_SIZE
>
>f = open(input_file, "rb")
>f1 = unicodec.codecs[input_encoding].stream_reader(f)
>g = open(output_file, "wb")
>g1 = unicodec.codecs[output_encoding].stream_writer(g)
>
>while 1:
>      buffer = f1.read(BUFFER_SIZE)
>      if not buffer:
>	 break
>      g1.write(buffer)
>
>g1.close()
>f1.close()
>
>Note that we could possibly make these the only API that a codec needs
>to provide; the string object <--> unicode object conversions can be
>done using this and the cStringIO module.  (On the other hand it seems
>a common case that would be quite useful.)
Perfect.  I'd keep the string ones - easy to implement but a big
convenience.

The proposal also says:
>For explicit handling of Unicode using files, the unicodec module
>could provide stream wrappers which provide transparent
>encoding/decoding for any open stream (file-like object):
>
>  import unicodec
>  file = open('mytext.txt','rb')
>  ufile = unicodec.stream(file,'utf-16')
>  u = ufile.read()
>  ...
>  ufile.close()

It seems to me that if we go for stream_reader, it replaces this bit
of the proposal too - no need for unicodec to provide anything.  If
you want to have a convenience function there to save a line or two,
you could have
	unicodec.open(filename, mode, encoding)
which returned a stream_reader.
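
That convenience wrapper would only be a few lines; a sketch, assuming the
stream_reader/stream_writer interface discussed above and binary-mode files:

import unicodec                 # hypothetical registry module, as above

def open_unicode(filename, mode, encoding):
    # What a unicodec.open() could do: open the file in binary mode and
    # hand it to the appropriate stream codec for the given encoding.
    f = open(filename, mode + 'b')
    codec = unicodec.codecs[encoding]
    if 'w' in mode or 'a' in mode:
        return codec.stream_writer(f)
    else:
        return codec.stream_reader(f)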


- Andy



From mal@lemburg.com  Mon Nov 15 22:54:38 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 23:54:38 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>
 <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: <38308F2E.44B9C6BF@lemburg.com>

[I'll get back on this tomorrow, just some quick notes here...]

Guido van Rossum wrote:
> 
> > Andy Robinson wrote:
> > >
> > > Some thoughts on the codecs...
> > >
> > > 1. Stream interface
> > > At the moment a codec has dump and load methods which
> > > read a (slice of a) stream into a string in memory and
> > > vice versa.  As the proposal notes, this could lead to
> > > errors if you take a slice out of a stream.   This is
> > > not just due to character truncation; some Asian
> > > encodings are modal and have shift-in and shift-out
> > > sequences as they move from Western single-byte
> > > characters to double-byte ones.   It also seems a bit
> > > pointless to me as the source (or target) is still a
> > > Unicode string in memory.
> > >
> > > This is a real problem - a filter to convert big files
> > > between two encodings should be possible without
> > > knowledge of the particular encoding, as should one on
> > > the input/output of some server.  We can still give a
> > > default implementation for single-byte encodings.
> > >
> > > What's a good API for real stream conversion?   just
> > > Codec.encodeStream(infile, outfile)  ?  or is it more
> > > useful to feed the codec with data a chunk at a time?
> 
> M.-A. Lemburg responds:
> 
> > The idea was to use Unicode as intermediate for all
> > encoding conversions.
> >
> > What you envision here are stream recoders. They can
> > easily be implemented as a useful addition to the Codec
> > subclasses, but I don't think that these have to go
> > into the core.
> 
> What I wanted was a codec API that acts somewhat like a buffered file;
> the buffer makes it possible to efficient handle shift states.  This
> is not exactly what Andy shows, but it's not what Marc's current spec
> has either.
> 
> I had thought something more like what Java does: an output stream
> codec's constructor takes a writable file object and the object
> returned by the constructor has a write() method, a flush() method and
> a close() method.  It acts like a buffering interface to the
> underlying file; this allows it to generate the minimal number of
> shift sequences.  Similar for input stream codecs.

The Codecs provide implementations for encoding and decoding,
they are not intended as complete wrappers for e.g. files or
sockets.

The unicodec module will define a generic stream wrapper
(which is yet to be defined) for dealing with files, sockets,
etc. It will use the codec registry to do the actual codec
work.
 
From the proposal:
"""
For explicit handling of Unicode using files, the unicodec module
could provide stream wrappers which provide transparent
encoding/decoding for any open stream (file-like object):

  import unicodec
  file = open('mytext.txt','rb')
  ufile = unicodec.stream(file,'utf-16')
  u = ufile.read()
  ...
  ufile.close()

XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as
    short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which
    also assures that <mode> contains the 'b' character when needed.

XXX Specify the wrapper(s)...

    Open issues: what to do with Python strings
    fed to the .write() method (may need to know the encoding of the
    strings) and when/if to return Python strings through the .read()
    method.

    Perhaps we need more than one type of wrapper here.
"""

> Andy's file translation example could then be written as follows:
> 
> # assuming variables input_file, input_encoding, output_file,
> # output_encoding, and constant BUFFER_SIZE
> 
> f = open(input_file, "rb")
> f1 = unicodec.codecs[input_encoding].stream_reader(f)
> g = open(output_file, "wb")
> g1 = unicodec.codecs[output_encoding].stream_writer(g)
> 
> while 1:
>       buffer = f1.read(BUFFER_SIZE)
>       if not buffer:
>          break
>       g1.write(buffer)
> 
> g1.close()
> f1.close()

 
> Note that we could possibly make these the only API that a codec needs
> to provide; the string object <--> unicode object conversions can be
> done using this and the cStringIO module.  (On the other hand it seems
> a common case that would be quite useful.)

You wouldn't want to go via cStringIO for *every* encoding
translation.

The Codec interface defines two pairs of methods
on purpose: one which works internally (ie. directly between
strings and Unicode objects), and one which works externally
(directly between a stream and Unicode objects).
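
Roughly, the two pairs look like this (method names follow the draft as
described earlier in the thread -- encode/decode for the internal pair,
dump/load for the stream pair -- the exact signatures are still open):

    class Codec:
        # internal pair: works directly between strings and Unicode objects
        def encode(self, u):
            # return an 8-bit string holding the encoded form of u
            raise NotImplementedError
        def decode(self, s):
            # return a Unicode object decoded from the 8-bit string s
            raise NotImplementedError
        # external pair: works directly between a stream and Unicode objects
        def dump(self, u, stream):
            # encode u and write the result to the open stream
            stream.write(self.encode(u))
        def load(self, stream, length=-1):
            # read (up to length bytes of) encoded data and decode it
            return self.decode(stream.read(length))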

> > > 2. Data driven codecs
> > > I really like codecs being objects, and believe we
> > > could build support for a lot more encodings, a lot
> > > sooner than is otherwise possible, by making them data
> > > driven rather making each one compiled C code with
> > > static mapping tables.  What do people think about the
> > > approach below?
> > >
> > > First of all, the ISO8859-1 series are straight
> > > mappings to Unicode code points.  So one Python script
> > > could parse these files and build the mapping table,
> > > and a very small data file could hold these encodings.
> > >   A compiled helper function analogous to
> > > string.translate() could deal with most of them.
> >
> > The problem with these large tables is that currently
> > Python modules are not shared among processes since
> > every process builds its own table.
> >
> > Static C data has the advantage of being shareable at
> > the OS level.
> 
> Don't worry about it.  128K is too small to care, I think...

Huh ? 128K for every process using Python ? That quickly
sums up to lots of megabytes lying around pretty much unused.

> > You can of course implement Python based lookup tables,
> > but these should be too large...
> >
> > > Secondly, the double-byte ones involve a mixture of
> > > algorithms and data.  The worst cases I know are modal
> > > encodings which need a single-byte lookup table, a
> > > double-byte lookup table, and have some very simple
> > > rules about escape sequences in between them.  A
> > > simple state machine could still handle these (and the
> > > single-byte mappings above become extra-simple special
> > > cases); I could imagine feeding it a totally
> > > data-driven set of rules.
> > >
> > > Third, we can massively compress the mapping tables
> > > using a notation which just lists contiguous ranges;
> > > and very often there are relationships between
> > > encodings.  For example, "cpXYZ is just like cpXYY but
> > > with an extra 'smiley' at 0XFE32".  In these cases, a
> > > script can build a family of related codecs in an
> > > auditable manner.
> >
> > These are all great ideas, but I think they unnecessarily
> > complicate the proposal.
> 
> Agreed, let's leave the *implementation* of codecs out of the current
> efforts.
> 
> However I want to make sure that the *interface* to codecs is defined
> right, because changing it will be expensive.  (This is Linus
> Torvald's philosophy on drivers -- he doesn't care about bugs in
> drivers, as they will get fixed; however he greatly cares about
> defining the driver APIs correctly.)
> 
> > > 3. What encodings to distribute?
> > > The only clean answers to this are 'almost none', or
> > > 'everything that Unicode 3.0 has a mapping for'.  The
> > > latter is going to add some weight to the
> > > distribution.  What are people's feelings?  Do we ship
> > > any at all apart from the Unicode ones?  Should new
> > > encodings be downloadable from www.python.org?  Should
> > > there be an optional package outside the main
> > > distribution?
> >
> > Since Codecs can be registered at runtime, there is quite
> > some potential there for extension writers coding their
> > own fast codecs. E.g. one could use mxTextTools as codec
> > engine working at C speeds.
> 
> (Do you think you'll be able to extort some money from HP for these? :-)

Don't know, it depends on what their specs look like. I use
mxTextTools for fast HTML file processing. It uses a small
Turing machine with some extra magic and is programmable via
Python tuples.
 
> > I would propose to only add some very basic encodings to
> > the standard distribution, e.g. the ones mentioned under
> > Standard Codecs in the proposal:
> >
> >   'utf-8':            8-bit variable length encoding
>   'utf-16':           16-bit variable length encoding (little/big endian)
> >   'utf-16-le':                utf-16 but explicitly little endian
> >   'utf-16-be':                utf-16 but explicitly big endian
> >   'ascii':            7-bit ASCII codepage
> >   'latin-1':          Latin-1 codepage
> >   'html-entities':    Latin-1 + HTML entities;
>                       see htmlentitydefs.py from the standard Python Lib
> >   'jis' (a popular version XXX):
> >                       Japanese character encoding
> >   'unicode-escape':   See Unicode Constructors for a definition
> >   'native':           Dump of the Internal Format used by Python
> >
> > Perhaps not even 'html-entities' (even though it would make
> > a cool replacement for cgi.escape()) and maybe we should
> > also place the JIS encoding into a separate Unicode package.
> 
> I'd drop html-entities, it seems too cutesie.  (And who uses these
> anyway, outside browsers?)

Ok.
 
> For JIS (shift-JIS?) I hope that Andy can help us with some pointers
> and validation.
> 
> And unicode-escape: now that you mention it, this is a section of
> the proposal that I don't understand.  I quote it here:
> 
> | Python should provide a built-in constructor for Unicode strings which
> | is available through __builtins__:
> |
> |   u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
>                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I meant this as an optional second argument defaulting to
whatever we define <default encoding> to mean, e.g. 'utf-8'.

u = unicode("string","utf-8") == unicode("string")

The <encoding name> argument must be a string identifying one
of the registered codecs.
 
> | With the 'unicode-escape' encoding being defined as:
> |
> |   u = u'<unicode-escape encoded Python string>'
> |
> | · for single characters (and this includes all \XXX sequences except \uXXXX),
> |   take the ordinal and interpret it as Unicode ordinal;
> |
> | · for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX
> |   instead, e.g. \u03C0 to represent the character Pi.
> 
> I've looked at this several times and I don't see the difference
> between the two bullets.  (Ironically, you are using a non-ASCII
> character here that doesn't always display, depending on where I look
> at your mail :-).

The first bullet covers the normal Python string characters
and escapes, e.g. \n and \267 (the center dot ;-), while the
second explains how \uXXXX is interpreted.
 
> Can you give some examples?
> 
> Is u'\u0020' different from u'\x20' (a space)?

No, they both map to the same Unicode ordinal.

> Does '\u0020' (no u prefix) have a meaning?

No, \uXXXX is only defined for u"" strings or strings that are
used to build Unicode objects with this encoding:

u = u'\u0020' == unicode(r'\u0020','unicode-escape')

Note that writing \uXX is an error, e.g. u"\u12 " will cause
a syntax error.
 
Aside: I just noticed that '\x2010' doesn't give '\x20' + '10'
but instead '\x10' -- is this intended ?

> Also, I remember reading Tim Peters who suggested that a "raw unicode"
> notation (ur"...") might be necessary, to encode regular expressions.
> I tend to agree.

This can be had via unicode():

u = unicode(r'\a\b\c\u0020','unicode-escaped')

If that's too long, define a ur() function which wraps up the
above line in a function.
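
A one-line sketch of that helper (using the 'unicode-escape' codec name
from the proposal's list of standard codecs):

    def ur(s):
        # s should itself be a raw string, e.g. ur(r'\u0020...'),
        # so that the other backslash escapes survive untouched
        return unicode(s, 'unicode-escape')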

> While I'm on the topic, I don't see in your proposal a description of
> the source file character encoding.  Currently, this is undefined, and
> in fact can be (ab)used to enter non-ASCII in string literals.  For
> example, a programmer named François might write a file containing
> this statement:
> 
>   print "Written by François." # (There's a cedilla in there!)
> 
> (He assumes his source character encoding is Latin-1, and he doesn't
> want to have to type \347 when he can type a cedilla on his keyboard.)
> 
> If his source file (or .pyc file!)  is executed by a Japanese user,
> this will probably print some garbage.
> 
> Using the new Unicode strings, François could change his program as
> follows:
> 
>   print unicode("Written by François.", "latin-1")
> 
> Assuming that François sets his sys.stdout to use Latin-1, while the
> Japanese user sets his to shift-JIS (or whatever his kanjiterm uses).
> 
> But when the Japanese user views François' source file, he will again
> see garbage.  If he uses a generic tool to translate latin-1 files to
> shift-JIS (assuming shift-JIS has a cedilla character) the program
> will no longer work correctly -- the string "latin-1" has to be
> changed to "shift-jis".
> 
> What should we do about this?  The safest and most radical solution is
> to disallow non-ASCII source characters; François will then have to
> type
> 
>   print u"Written by Fran\u00E7ois."
> 
> but, knowing François, he probably won't like this solution very much
> (since he didn't like the \347 version either).

I think best is to leave it undefined... as with all files,
only the programmer knows what format and encoding it contains,
e.g. a Japanese programmer might want to use a shift-JIS editor
to enter strings directly in shift-JIS via

u = unicode("...shift-JIS encoded text...","shift-jis")

Of course, this is not readable using an ASCII editor, but
Python will continue to produce the intended string.
NLS strings don't belong in program text anyway: i18n usually
takes the gettext() approach to handle these issues.
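
For instance, a gettext-style lookup keeps the translatable text out of
the literals entirely (the catalog below is just hypothetical stand-in
data for whatever message catalog binding is actually used):

    catalog = {"Written by Francois.": "Ecrit par Fran\347ois."}

    def _(message):
        # return the translated string if the catalog has one,
        # otherwise fall back to the original text
        return catalog.get(message, message)

    print _("Written by Francois.")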

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From andy@robanal.demon.co.uk  Tue Nov 16 00:09:28 1999
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Tue, 16 Nov 1999 00:09:28 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <38308F2E.44B9C6BF@lemburg.com>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>   <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <38308F2E.44B9C6BF@lemburg.com>
Message-ID: <3839a078.22625844@post.demon.co.uk>

On Mon, 15 Nov 1999 23:54:38 +0100, you wrote:

>[I'll get back on this tomorrow, just some quick notes here...]
>The Codecs provide implementations for encoding and decoding,
>they are not intended as complete wrappers for e.g. files or
>sockets.
>
>The unicodec module will define a generic stream wrapper
>(which is yet to be defined) for dealing with files, sockets,
>etc. It will use the codec registry to do the actual codec
>work.
> 
>XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as
>    short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which
>    also assures that <mode> contains the 'b' character when needed.
>
>The Codec interface defines two pairs of methods
>on purpose: one which works internally (ie. directly between
>strings and Unicode objects), and one which works externally
>(directly between a stream and Unicode objects).

That's the problem Guido and I are worried about.  Your present API is
not enough to build stream encoders.  The 'slurp it into a unicode
string in one go' approach fails for big files or for network
connections.  And you just cannot build a generic stream reader/writer
by slicing it into strings.   The solution must be specific to the
codec - only it knows how much to buffer, when to flip states etc.  

So the codec should provide proper stream reading and writing
services.  

Unicodec can then wrap those up in labour-saving ways - I'm not fussy
which but I like the one-line file-open utility.


- Andy







From tim_one@email.msn.com  Tue Nov 16 05:38:32 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:38:32 -0500
Subject: [Python-Dev] Unicode proposal: %-formatting ?
In-Reply-To: <382AE7D9.147D58CB@lemburg.com>
Message-ID: <000001bf2ff4$d36e2540$042d153f@tim>

[MAL]
> I wonder how we could add %-formatting to Unicode strings without
> duplicating the PyString_Format() logic.
>
> First, do we need Unicode object %-formatting at all ?

Sure -- in the end, all the world speaks Unicode natively and encodings
become historical baggage.  Granted I won't live that long, but I may last
long enough to see encodings become almost purely an I/O hassle, with all
computation done in Unicode.

> Second, here is an emulation using strings and <default encoding>
> that should give an idea of how one could work with the different
> encodings:
>
>     s = '%s %i abcäöü' # a Latin-1 encoded string
>     t = (u,3)

What's u?  A Unicode object?  Another Latin-1 string?  A default-encoded
string?  How does the following know the difference?

>     # Convert Latin-1 s to a <default encoding> string via Unicode
>     s1 = unicode(s,'latin-1').encode()
>
>     # The '%s' will now add u in <default encoding>
>     s2 = s1 % t
>
>     # Finally, convert the <default encoding> encoded string to Unicode
>     u1 = unicode(s2)

I don't expect this actually works:  for example, change %s to %4s.
Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to
know that some (or all) characters in u consume multiple bytes, so can't
extract "the right" number of bytes from u.  I think % formating has to know
the truth of what you're doing.

> Note that .encode() defaults to the current setting of
> <default encoding>.
>
> Provided u maps to Latin-1, an alternative would be:
>
>     u1 = unicode('%s %i abcäöü' % (u.encode('latin-1'),3), 'latin-1')

More interesting is fmt % tuple where everything is Unicode; people can muck
with Latin-1 directly today using regular strings, so the example above
mostly shows artificial convolution.




From tim_one@email.msn.com  Tue Nov 16 05:38:40 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:38:40 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <382BDD81.458D3125@lemburg.com>
Message-ID: <000101bf2ff4$d636bb20$042d153f@tim>

[MAL, on raw Unicode strings]
> ...
> Agreed... note that you could also write your own codec for just this
> reason and then use:
>
> u = unicode('....\u1234...\...\...','raw-unicode-escaped')
>
> Put that into a function called 'ur' and you have:
>
> u = ur('...\u4545...\...\...')
>
> which is not that far away from ur'...' w/r to cosmetics.

Well, not quite.  In general you need to pass raw strings:

u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
            ^
u = ur(r'...\u4545...\...\...')
       ^

else Python will replace all the other backslash sequences.  This is a
crucial distinction at times; e.g., else \b in a Unicode regexp will expand
into a backspace character before the regexp processor ever sees it (\b is
supposed to be a word boundary assertion).
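
A quick way to see the difference (ordinary Python strings, nothing
Unicode-specific):

    s1 = '\b'     # one character: a backspace (ASCII 8)
    s2 = r'\b'    # two characters: a backslash and the letter 'b'
    assert len(s1) == 1 and ord(s1) == 8
    assert len(s2) == 2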




From tim_one@email.msn.com  Tue Nov 16 05:44:42 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:44:42 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: 
Message-ID: <000201bf2ff5$ae6aefc0$042d153f@tim>

[Tim, wonders why Perl and Tcl went w/ UTF-8 internally]

[Greg Stein]
> Probably for the exact reason that you stated in your messages: many
> 8-bit (7-bit?) functions continue to work quite well when given a
> UTF-8-encoded string. i.e. they didn't have to rewrite the entire
> Perl/TCL interpreter to deal with a new string type.
>
> I'd guess it is a helluva lot easier for us to add a Python Type than
> for Perl or TCL to whack around with new string types (since they use
> strings so heavily).

Sounds convincing to me!  Bumped into an old thread on c.l.p.m. that
suggested Perl was also worried about UCS-2's 64K code point limit.  But I'm
already on record as predicting we'll regret any decision .




From tim_one@email.msn.com  Tue Nov 16 05:52:12 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:52:12 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: 
Message-ID: <000501bf2ff6$ba943a80$042d153f@tim>

[Da Silva, Mike]
> ...
> 5.	UTF-16 requires string operations that do not make assumptions
> about nulls - this means re-implementing most of the C runtime
> functions to work with unsigned shorts.

Python strings are already null-friendly, so Python has already recoded
everything it needs to get away from the no-null assumption; stropmodule.c
is < 1,500 lines of code, and MAL can turn it into C++ template functions in
his sleep .




From tim_one@email.msn.com  Tue Nov 16 05:56:18 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:56:18 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <19991112121303.27452.rocketmail@ web605.yahoomail.com>
Message-ID: <000601bf2ff7$4d8a4c80$042d153f@tim>

[Andy Robinson]
> ...
> I presume no one is actually advocating dropping
> ordinary Python strings, or the ability to do
>    rawdata = open('myfile.txt', 'rb').read()
> without any transformations?

If anyone has advocated either, they've successfully hidden it from me.
Anyone?




From tim_one@email.msn.com  Tue Nov 16 06:09:04 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 01:09:04 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382BF6C3.D79840EC@lemburg.com>
Message-ID: <000701bf2ff9$15cecda0$042d153f@tim>

[MAL]
> BTW, wouldn't it be possible to take pcre and have it
> use Py_Unicode instead of char ? [Of course, there would have to
> be some extensions for character classes etc.]

No, alas.  The assumption that characters are 8 bits is ubiquitous, in both
obvious and subtle ways.

if ((start_bits[c/8] & (1 << (c&7))) == 0) start_match++; else break;




From tim_one@email.msn.com  Tue Nov 16 06:19:16 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 01:19:16 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382C3749.198EEBC6@lemburg.com>
Message-ID: <000801bf2ffa$82273400$042d153f@tim>

[MAL]
> sys.bom should return the byte order mark (BOM) for the format used
> internally. The unicodec module should provide symbols for all
> possible values of this variable:
>
>   BOM_BE: '\376\377' 
>     (corresponds to Unicode 0x0000FEFF in UTF-16 
>      == ZERO WIDTH NO-BREAK SPACE)
>
>   BOM_LE: '\377\376' 
>     (corresponds to Unicode 0x0000FFFE in UTF-16 
>      == illegal Unicode character)
>
>   BOM4_BE: '\000\000\377\376'
>     (corresponds to Unicode 0x0000FEFF in UCS-4)

Should be
    BOM4_BE: '\000\000\376\377'   
 
>   BOM4_LE: '\376\377\000\000'
>     (corresponds to Unicode 0x0000FFFE in UCS-4)

Should be
    BOM4_LE: '\377\376\000\000'
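
For illustration, a small sketch of how the 16-bit constants would be
used to sniff the byte order of a UTF-16 stream (values as corrected
above; the function name is made up):

    BOM_BE = '\376\377'
    BOM_LE = '\377\376'

    def guess_byteorder(f):
        # read the first two bytes of a file opened in binary mode
        prefix = f.read(2)
        if prefix == BOM_BE:
            return 'big'
        elif prefix == BOM_LE:
            return 'little'
        return None    # no BOM; the caller has to decide (and re-seek)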




From tim_one@email.msn.com  Tue Nov 16 06:31:39 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 01:31:39 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: <14380.16437.71847.832880@weyr.cnri.reston.va.us>
Message-ID: <000901bf2ffc$3d4bb8e0$042d153f@tim>

[Fred L. Drake, Jr.]
> ...
>   I wasn't suggesting the PyStringObject be changed, only that the
> PyUnicodeObject could maintain a reference.  Consider:
>
>         s = fp.read()
>         u = unicode(s, 'utf-8')
>
> u would now hold a reference to s, and s/s# would return a pointer
> into s instead of re-building the UTF-8 form.  I talked myself out of
> this because it would be too easy to keep a lot more string objects
> around than were actually needed.

Yet another use for a weak reference <0.5 wink>.




From tim_one@email.msn.com  Tue Nov 16 06:41:44 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 01:41:44 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: 
Message-ID: <000b01bf2ffd$a5ad69a0$042d153f@tim>

[MAL]
>   BOM_BE: '\376\377'
>     (corresponds to Unicode 0x0000FEFF in UTF-16
>      == ZERO WIDTH NO-BREAK SPACE)

[Greg Stein]
> Are you sure about that interpretation? I thought the BOM characters
> (0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space.

I can't speak to MAL's degree of certainty , but he's right about this
stuff.  There is only one BOM character, U+FEFF, which is the zero-width
no-break space.  The byte-swapped form is not only reserved, it's guaranteed
never to be assigned to a character.




From tim_one@email.msn.com  Tue Nov 16 07:47:06 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 02:47:06 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: <000d01bf3006$c7823700$042d153f@tim>

[Guido]
> ...
> While I'm on the topic, I don't see in your proposal a description of
> the source file character encoding.  Currently, this is undefined, and
> in fact can be (ab)used to enter non-ASCII in string literals.
> ...
> What should we do about this?  The safest and most radical solution is
> to disallow non-ASCII source characters; François will then have to
> type
>
>   print u"Written by Fran\u00E7ois."
>
> but, knowing François, he probably won't like this solution very much
> (since he didn't like the \347 version either).

So long as Python opens source files using libc text mode, it can't
guarantee more than C does:  the presence of any character other than tab,
newline, and ASCII 32-126 inclusive renders the file contents undefined.

Go beyond that, and you've got the same problem as mailers and browsers, and
so also the same solution:  open source files in binary mode, and add a
pragma specifying the intended charset.

As a practical matter, declare that Python source is Latin-1 for now, and
declare any *system* that doesn't support that non-conforming .

python-is-the-measure-of-all-things-ly y'rs  - tim




From tim_one@email.msn.com  Tue Nov 16 07:47:08 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 02:47:08 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <38308F2E.44B9C6BF@lemburg.com>
Message-ID: <000e01bf3006$c8c11fa0$042d153f@tim>

[Guido]
>> Does '\u0020' (no u prefix) have a meaning?

[MAL]
> No, \uXXXX is only defined for u"" strings or strings that are
> used to build Unicode objects with this encoding:

I believe your intent is that '\u0020' be exactly those 6 characters, just
as today.  That is, it does have a meaning, but its meaning differs between
Unicode string literals and regular string literals.

> Note that writing \uXX is an error, e.g. u"\u12 " will cause
> a syntax error.

Although I believe your intent  is that, just as today, '\u12' is not
an error.

> Aside: I just noticed that '\x2010' doesn't give '\x20' + '10'
> but instead '\x10' -- is this intended ?

Yes; see 2.4.1 ("String literals") of the Lang Ref.  Blame the C committee
for not defining \x in a platform-independent way.  Note that a Python \x
escape consumes *all* following hex characters, no matter how many -- and
ignores all but the last two.

> This [raw Unicode strings] can be had via unicode():
>
> u = unicode(r'\a\b\c\u0020','unicode-escaped')
>
> If that's too long, define a ur() function which wraps up the
> above line in a function.

As before, I think that's fine for now, but won't stand forever.




From fredrik@pythonware.com  Tue Nov 16 08:39:20 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 09:39:20 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>             <38305D17.60EC94D0@lemburg.com>  <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: <010001bf300e$14741310$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> I had thought something more like what Java does: an output stream
> codec's constructor takes a writable file object and the object
> returned by the constructor has a write() method, a flush() method and
> a close() method.  It acts like a buffering interface to the
> underlying file; this allows it to generate the minimal number of
> shift sequences.  Similar for input stream codecs.

note that the html/sgml/xml parsers generally
support the feed/close protocol.  to be able
to use these codecs in that context, we need

1) codecs written according to the "data
   consumer model", instead of the "stream"
   model.

        class myDecoder:
            def __init__(self, target):
                self.target = target
                self.state = ...
            def feed(self, data):
                ... extract as much data as possible ...
                self.target.feed(extracted data)
            def close(self):
                ... extract what's left ...
                self.target.feed(additional data)
                self.target.close()

or

2) make threads mandatory, just like in Java.

or

3) add light-weight threads (ala stackless python)
   to the interpreter...

(I vote for alternative 3, but that's another story ;-)
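
for what it's worth, chaining such a consumer-style decoder in front of
an existing feed/close parser would look roughly like this (myDecoder is
the sketch above; Sink is just a stand-in for any feed/close target):

    class Sink:
        # trivial downstream target; collects whatever the decoder emits
        def __init__(self):
            self.parts = []
        def feed(self, data):
            self.parts.append(data)
        def close(self):
            pass

    target = Sink()
    decoder = myDecoder(target)
    for chunk in ('first chunk', 'second chunk'):   # e.g. reads from a socket
        decoder.feed(chunk)
    decoder.close()    # flushes any pending state and closes the target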





From fredrik@pythonware.com  Tue Nov 16 08:58:50 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 09:58:50 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf2ff4$d636bb20$042d153f@tim>
Message-ID: <016a01bf3010$cde52620$f29b12c2@secret.pythonware.com>

Tim Peters  wrote:
> (\b is supposed to be a word boundary assertion).

in some places, that is.



    Main Entry: reg·u·lar
    Pronunciation: 're-gy&-l&r, 're-g(&-)l&r

    1 : belonging to a religious order
    2 a : formed, built, arranged, or ordered according
    to some established rule, law, principle, or type ...
    3 a : ORDERLY, METHODICAL  ...
    4 a : constituted, conducted, or done in conformity
    with established or prescribed usages, rules, or
    discipline ...



From jack@oratrix.nl  Tue Nov 16 11:05:55 1999
From: jack@oratrix.nl (Jack Jansen)
Date: Tue, 16 Nov 1999 12:05:55 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: Message by "M.-A. Lemburg"  ,
 Mon, 15 Nov 1999 20:20:55 +0100 , <38305D17.60EC94D0@lemburg.com>
Message-ID: <19991116110555.8B43335BB1E@snelboot.oratrix.nl>

> I would propose to only add some very basic encodings to
> the standard distribution, e.g. the ones mentioned under
> Standard Codecs in the proposal:
> 
>   'utf-8':		8-bit variable length encoding
>   'utf-16':		16-bit variable length encoding (little/big endian)
>   'utf-16-le':		utf-16 but explicitly little endian
>   'utf-16-be':		utf-16 but explicitly big endian
>   'ascii':		7-bit ASCII codepage
>   'latin-1':		Latin-1 codepage
>   'html-entities':	Latin-1 + HTML entities;
> 			see htmlentitydefs.py from the standard Python Lib
>   'jis' (a popular version XXX):
> 			Japanese character encoding
>   'unicode-escape':	See Unicode Constructors for a definition
>   'native':		Dump of the Internal Format used by Python

I would suggest adding the Dos, Windows and Macintosh standard 8-bit charsets 
(their equivalents of latin-1) too, as documents in these encodings are pretty 
ubiquitous. But maybe these should only be added on the respective platforms.
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 




From mal@lemburg.com  Tue Nov 16 08:35:28 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 09:35:28 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <000e01bf3006$c8c11fa0$042d153f@tim>
Message-ID: <38311750.22D17EC1@lemburg.com>

Tim Peters wrote:
> 
> [Guido]
> >> Does '\u0020' (no u prefix) have a meaning?
> 
> [MAL]
> > No, \uXXXX is only defined for u"" strings or strings that are
> > used to build Unicode objects with this encoding:
> 
> I believe your intent is that '\u0020' be exactly those 6 characters, just
> as today.  That is, it does have a meaning, but its meaning differs between
> Unicode string literals and regular string literals.

Right.
 
> > Note that writing \uXX is an error, e.g. u"\u12 " will cause
> > a syntax error.
> 
> Although I believe your intent  is that, just as today, '\u12' is not
> an error.

Right again :-) "\u12" gives a 4 byte string, u"\u12" produces an
exception.
 
> > Aside: I just noticed that '\x2010' doesn't give '\x20' + '10'
> > but instead '\x10' -- is this intended ?
> 
> Yes; see 2.4.1 ("String literals") of the Lang Ref.  Blame the C committee
> for not defining \x in a platform-independent way.  Note that a Python \x
> escape consumes *all* following hex characters, no matter how many -- and
> ignores all but the last two.

Strange definition...
 
> > This [raw Unicode strings] can be had via unicode():
> >
> > u = unicode(r'\a\b\c\u0020','unicode-escaped')
> >
> > If that's too long, define a ur() function which wraps up the
> > above line in a function.
> 
> As before, I think that's fine for now, but won't stand forever.

If Guido agrees to ur"", I can put that into the proposal too
-- it's just that things are starting to get a little crowded
for a strawman proposal ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue Nov 16 10:50:31 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 11:50:31 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <38307984.12653394@post.demon.co.uk>
Message-ID: <383136F7.AB73A90@lemburg.com>

Andy Robinson wrote:
> 
> Leave JISXXX and the CJK stuff out.  If you get into Japanese, you
> really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there
> are lots of options about how to do it.  The other ones are
> algorithmic and can be small and fast and fit into the core.
> 
> Ditto with HTML, and maybe even escaped-unicode too.

So I can drop JIS ? [I won't be able to drop the escaped unicode
codec because this is needed for u"" and ur"".]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue Nov 16 10:42:19 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 11:42:19 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf2ff4$d636bb20$042d153f@tim>
Message-ID: <3831350B.8F69CB6D@lemburg.com>

Tim Peters wrote:
> 
> [MAL, on raw Unicode strings]
> > ...
> > Agreed... note that you could also write your own codec for just this
> > reason and then use:
> >
> > u = unicode('....\u1234...\...\...','raw-unicode-escaped')
> >
> > Put that into a function called 'ur' and you have:
> >
> > u = ur('...\u4545...\...\...')
> >
> > which is not that far away from ur'...' w/r to cosmetics.
> 
> Well, not quite.  In general you need to pass raw strings:
> 
> u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
>             ^
> u = ur(r'...\u4545...\...\...')
>        ^
> 
> else Python will replace all the other backslash sequences.  This is a
> crucial distinction at times; e.g., else \b in a Unicode regexp will expand
> into a backspace character before the regexp processor ever sees it (\b is
> supposed to be a word boundary assertion).

Right.

Here is a sample implementation of what I had in mind:

""" Demo for 'unicode-escape' encoding.
"""
import struct,string,re

pack_format = '>H'

def convert_string(s):

    l = map(None,s)
    for i in range(len(l)):
	l[i] = struct.pack(pack_format,ord(l[i]))
    return l

u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})')

def unicode_unescape(s):

    l = []
    start = 0
    while start < len(s):
	m = u_escape.search(s,start)
	if not m:
	    l[len(l):] = convert_string(s[start:])
	    break
	m_start,m_end = m.span()
	if m_start > start:
	    l[len(l):] = convert_string(s[start:m_start])
	hexcode = m.group(1)
	#print hexcode,start,m_start
	if len(hexcode) != 4:
	    raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode
	ordinal = string.atoi(hexcode,16)
	l.append(struct.pack(pack_format,ordinal))
	start = m_end
    #print l
    return string.join(l,'')
    
def hexstr(s,sep=''):

    return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % ord(x),s),sep)
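
For example, with the big-endian pack_format above (expected results
shown as comments; the second call exercises the error check):

    hexstr(unicode_unescape(r'A\u0020B'), ' ')   # -> '00 41 00 20 00 42'
    unicode_unescape(r'A\u12 B')                 # -> SyntaxError: illegal \uXXXX sequence: \u12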

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue Nov 16 10:40:42 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 11:40:42 +0100
Subject: [Python-Dev] Unicode proposal: %-formatting ?
References: <000001bf2ff4$d36e2540$042d153f@tim>
Message-ID: <383134AA.4B49D178@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > I wonder how we could add %-formatting to Unicode strings without
> > duplicating the PyString_Format() logic.
> >
> > First, do we need Unicode object %-formatting at all ?
> 
> Sure -- in the end, all the world speaks Unicode natively and encodings
> become historical baggage.  Granted I won't live that long, but I may last
> long enough to see encodings become almost purely an I/O hassle, with all
> computation done in Unicode.
> 
> > Second, here is an emulation using strings and <default encoding>
> > that should give an idea of how one could work with the different
> > encodings:
> >
> >     s = '%s %i abcäöü' # a Latin-1 encoded string
> >     t = (u,3)
> 
> What's u?  A Unicode object?  Another Latin-1 string?  A default-encoded
> string?  How does the following know the difference?

u refers to a Unicode object in the proposal. Sorry, forgot to
mention that.
 
> >     # Convert Latin-1 s to a <default encoding> string via Unicode
> >     s1 = unicode(s,'latin-1').encode()
> >
> >     # The '%s' will now add u in <default encoding>
> >     s2 = s1 % t
> >
> >     # Finally, convert the <default encoding> encoded string to Unicode
> >     u1 = unicode(s2)
> 
> I don't expect this actually works:  for example, change %s to %4s.
> Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to
> know that some (or all) characters in u consume multiple bytes, so can't
> extract "the right" number of bytes from u.  I think % formating has to know
> the truth of what you're doing.

Hmm, guess you're right... format parameters should indeed refer
to characters rather than number of encoding bytes.

This means a new PyUnicode_Format() implementation mapping
Unicode format objects to Unicode objects.
 
> > Note that .encode() defaults to the current setting of
> > <default encoding>.
> >
> > Provided u maps to Latin-1, an alternative would be:
> >
> >     u1 = unicode('%s %i abcäöü' % (u.encode('latin-1'),3), 'latin-1')
> 
> More interesting is fmt % tuple where everything is Unicode; people can muck
> with Latin-1 directly today using regular strings, so the example above
> mostly shows artificial convolution.

... hmm, there is a problem there: how should the PyUnicode_Format()
API deal with '%s' when it sees a Unicode object as argument ?

E.g. what would you get in these cases:

u = u"%s %s" % (u"abc", "abc")

Perhaps we need a new marker for "insert Unicode object here".

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue Nov 16 10:48:13 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 11:48:13 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>   <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <38308F2E.44B9C6BF@lemburg.com> <3839a078.22625844@post.demon.co.uk>
Message-ID: <3831366D.8A09E194@lemburg.com>

Andy Robinson wrote:
> 
> On Mon, 15 Nov 1999 23:54:38 +0100, you wrote:
> 
> >[I'll get back on this tomorrow, just some quick notes here...]
> >The Codecs provide implementations for encoding and decoding,
> >they are not intended as complete wrappers for e.g. files or
> >sockets.
> >
> >The unicodec module will define a generic stream wrapper
> >(which is yet to be defined) for dealing with files, sockets,
> >etc. It will use the codec registry to do the actual codec
> >work.
> >
> >XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as
> >    short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which
> >    also assures that <mode> contains the 'b' character when needed.
> >
> >The Codec interface defines two pairs of methods
> >on purpose: one which works internally (ie. directly between
> >strings and Unicode objects), and one which works externally
> >(directly between a stream and Unicode objects).
> 
> That's the problem Guido and I are worried about.  Your present API is
> not enough to build stream encoders.  The 'slurp it into a unicode
> string in one go' approach fails for big files or for network
> connections.  And you just cannot build a generic stream reader/writer
> by slicing it into strings.   The solution must be specific to the
> codec - only it knows how much to buffer, when to flip states etc.
> 
> So the codec should provide proper stream reading and writing
> services.

I guess I'll have to rethink the Codec specs. Some leads:

1. introduce a new StreamCodec class which is designed for
   handling stream encoding and decoding (and supports
   state)

2. give more information to the unicodec registry: 
   one could register classes instead of instances which the Unicode
   implementation would then instantiate whenever it needs to
   apply the conversion; since this is only needed for encodings
   maintaining state, the registry would only have to do the
   instantiation for these codecs and could use cached instances for
   stateless codecs (see the sketch below).
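
A rough sketch of lead (2), with made-up registry helpers:

    registry = {}     # encoding name -> (codec class, stateless flag)
    _shared = {}      # cached instances for codecs without state

    def register(encoding, codec_class, stateless=1):
        registry[encoding] = (codec_class, stateless)

    def lookup(encoding):
        codec_class, stateless = registry[encoding]
        if not stateless:
            # codecs keeping state (shift sequences etc.) get a fresh instance
            return codec_class()
        # stateless codecs can safely be shared between all users
        if not _shared.has_key(encoding):
            _shared[encoding] = codec_class()
        return _shared[encoding]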
 
> Unicodec can then wrap those up in labour-saving ways - I'm not fussy
> which but I like the one-line file-open utility.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fredrik@pythonware.com  Tue Nov 16 11:38:31 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 12:38:31 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com>
Message-ID: <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com>

> I would propose to only add some very basic encodings to
> the standard distribution, e.g. the ones mentioned under
> Standard Codecs in the proposal:
> 
>   'utf-8': 8-bit variable length encoding
>   'utf-16': 16-bit variable length encoding (litte/big endian)
>   'utf-16-le': utf-16 but explicitly little endian
>   'utf-16-be': utf-16 but explicitly big endian
>   'ascii': 7-bit ASCII codepage
>   'latin-1': Latin-1 codepage
>   'html-entities': Latin-1 + HTML entities;
> see htmlentitydefs.py from the standard Python Lib
>   'jis' (a popular version XXX):
> Japanese character encoding
>   'unicode-escape': See Unicode Constructors for a definition
>   'native': Dump of the Internal Format used by Python

since this is already very close, maybe we could adopt
the naming guidelines from XML:

    In an encoding declaration, the values "UTF-8", "UTF-16",
    "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used
    for the various encodings and transformations of
    Unicode/ISO/IEC 10646, the values "ISO-8859-1",
    "ISO-8859-2", ... "ISO-8859-9" should be used for the parts
    of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS",
    and "EUC-JP" should be used for the various encoded
    forms of JIS X-0208-1997.

    XML processors may recognize other encodings; it is
    recommended that character encodings registered
    (as charsets) with the Internet Assigned Numbers
    Authority [IANA], other than those just listed,
    should be referred to using their registered names.

    Note that these registered names are defined to be
    case-insensitive, so processors wishing to match
    against them should do so in a case-insensitive way.

(ie "iso-8859-1" instead of "latin-1", etc -- at least as
aliases...).
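
a tiny sketch of what the alias handling could look like (the table
below is just an example, not a complete list):

    import string

    aliases = {
        'latin-1':  'iso-8859-1',
        'latin1':   'iso-8859-1',
        'utf8':     'utf-8',
    }

    def normalize_encoding(name):
        # IANA charset names are case-insensitive, so fold before lookup
        name = string.lower(string.strip(name))
        return aliases.get(name, name)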





From gstein@lyra.org  Tue Nov 16 11:45:48 1999
From: gstein@lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 03:45:48 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com>
Message-ID: 

On Tue, 16 Nov 1999, Fredrik Lundh wrote:
>...
> since this is already very close, maybe we could adopt
> the naming guidelines from XML:
> 
>     In an encoding declaration, the values "UTF-8", "UTF-16",
>     "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used
>     for the various encodings and transformations of
>     Unicode/ISO/IEC 10646, the values "ISO-8859-1",
>     "ISO-8859-2", ... "ISO-8859-9" should be used for the parts
>     of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS",
>     and "EUC-JP" should be used for the various encoded
>     forms of JIS X-0208-1997.
> 
>     XML processors may recognize other encodings; it is
>     recommended that character encodings registered
>     (as charsets) with the Internet Assigned Numbers
>     Authority [IANA], other than those just listed,
>     should be referred to using their registered names.
> 
>     Note that these registered names are defined to be
>     case-insensitive, so processors wishing to match
>     against them should do so in a case-insensitive way.
> 
> (ie "iso-8859-1" instead of "latin-1", etc -- at least as
> aliases...).

+1

(as we'd say in Apache-land... :-)

-g

--
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Tue Nov 16 12:04:47 1999
From: gstein@lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 04:04:47 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <3830595B.348E8CC7@lemburg.com>
Message-ID: 

On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
> Guido van Rossum wrote:
>...
> > t# refers to byte-encoded data.  Multibyte encodings are explicitly
> > designed to be passed cleanly through processing steps that handle
> > single-byte character data, as long as they are 8-bit clean and don't
> > do too much processing.
> 
> Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
> "8-bit clean" as you obviously did.

Hrm. That might be dangerous. Many of the functions that use "t#" assume
that each character is 8-bits long. i.e. the returned length == the number
of characters.

I'm not sure what the implications would be if you interpret the semantics
of "t#" as multi-byte characters.

>...
> > For example, take an encryption engine.  While it is defined in terms
> > of byte streams, there's no requirement that the bytes represent
> > characters -- they could be the bytes of a GIF file, an MP3 file, or a
> > gzipped tar file.  If we pass Unicode to an encryption engine, we want
> > Unicode to come out at the other end, not UTF-8.  (If we had wanted to
> > encrypt UTF-8, we should have fed it UTF-8.)

Heck. I just want to quickly throw the data onto my disk. I'll write a
BOM, followed by the raw data. Done. It's even portable.

>...
> > Aha, I think there's a confusion about what "8-bit" means.  For me, a
> > multibyte encoding like UTF-8 is still 8-bit.  Am I alone in this?

Maybe. I don't see multi-byte characters as 8-bit (in the sense of the "t"
format).

> > (As far as I know, C uses char* to represent multibyte characters.)
> > Maybe we should disambiguate it more explicitly?

We can disambiguate with a new format character, or we can clarify the
semantics of "t" to mean single- *or* multi- byte characters. Again, I
think there may be trouble if the semantics of "t" are defined to allow
multibyte characters.

> There should be some definition for the two markers and the
> ideas behind them in the API guide, I guess.

Certainly.

[ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]

> > > Hmm, I would strongly object to making "s#" return the internal
> > > format. file.write() would then default to writing UTF-16 data
> > > instead of UTF-8 data. This could result in strange errors
> > > due to the UTF-16 format being endian dependent.
> > 
> > But this was the whole design.  file.write() needs to be changed to
> > use s# when the file is open in binary mode and t# when the file is
> > open in text mode.

Interesting idea, but that presumes that "t" will be defined for the
Unicode
object (i.e. it implements the getcharbuffer type slot). Because of the
multi-byte problem, I don't think it will.
[ not to mention, that I don't think the Unicode object should implicitly
  do a UTF-8 conversion and hold a ref to the resulting string ]

>...
> I still don't feel very comfortable about the fact that all
> existing APIs using "s#" will suddenly receive UTF-16 data if
> being passed Unicode objects: this probably won't get us the
> "magical" Unicode integration we invision, since "t#" usage is not
> very wide spread and character handling code will probably not
> work well with UTF-16 encoded strings.

I'm not sure that we should definitely go for "magical." Perl has magic in
it, and that is one of its worst faults. Go for clean and predictable, and
leave as much logic to the Python level as possible. The interpreter
should provide a minimum of functionality, rather than second-guessing and
trying to be neat and sneaky with its operation.

>...
> > Because file.write() for a binary file, and other similar things
> > (e.g. the encryption engine example I mentioned above) must have
> > *some* way to get at the raw bits.
> 
> What for ?

How about: "because I'm the application developer, and I say that I want
the raw bytes in the file."

> Any lossless encoding should do the trick... UTF-8
> is just as good as UTF-16 for binary files; plus it's more compact
> for ASCII data. I don't really see a need to get explicitly
> at the internal data representation because both encodings are
> in fact "internal" w/r to Unicode objects.
> 
> The only argument I can come up with is that using UTF-16 for
> binary files could (possibly) eliminate the UTF-8 conversion step
> which is otherwise always needed.

The argument that I come up with is "don't tell me how to design my
storage format, and don't make Python force me into one."

If I want to write Unicode text to a file, the most natural thing to do
is:

open('file', 'w').write(u)

If you do a conversion on me, then I'm not writing Unicode. I've got to go
and do some nasty conversion which just monkeys up my program.

If I have a Unicode object, but I *want* to write UTF-8 to the file, then
the cleanest thing is:

open('file', 'w').write(encode(u, 'utf-8'))

This is clear that I've got a Unicode object input, but I'm writing UTF-8.

I have a second argument, too: See my first argument. :-)

Really... this is kind of what Fredrik was trying to say: don't get in the
way of the application programmer. Give them tools, but avoid policy and
gimmicks and other "magic".

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Tue Nov 16 12:09:17 1999
From: gstein@lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 04:09:17 -0800 (PST)
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: 

On Mon, 15 Nov 1999, Guido van Rossum wrote:
>...
> > The problem with these large tables is that currently
> > Python modules are not shared among processes since
> > every process builds its own table.
> > 
> > Static C data has the advantage of being shareable at
> > the OS level.
> 
> Don't worry about it.  128K is too small to care, I think...

This is the reason Python starts up so slow and has a large memory
footprint. There hasn't been any concern for moving stuff into shared data
pages. As a result, a process must map in a bunch of vmem pages, for no
other reason than to allocate Python structures in that memory and copy
constants in.

Go start Perl 100 times, then do the same with Python. Python is
significantly slower. I've actually written a web app in PHP because
another one that I did in Python had slow response time.
[ yah: the Real Man Answer is to write a real/good mod_python. ]

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From andy@robanal.demon.co.uk  Tue Nov 16 12:18:19 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 16 Nov 1999 04:18:19 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
Message-ID: <19991116121819.21509.rocketmail@web606.mail.yahoo.com>


--- "M.-A. Lemburg"  wrote:
> So I can drop JIS ? [I won't be able to drop the
> escaped unicode
> codec because this is needed for u"" and ur"".]

Drop Japanese from the core language.  

JIS0208 is a big character set with three popular
encodings (Shift-JIS, EUC-JP and JIS), and a host of
slight variations; it has 6879 characters, and there
are a range of options a user might need to set for it
to be useful.  So let's assume for now this is a separate
package.  There's a good chance I'll do it but it is
not a small job.  If you start statically linking in
tables of 7000 characters for one Asian language,
you'll have to do the lot.

As for the single-byte Latin ones, a prototype Python
module could be whipped up in a couple of evenings,
and a tiny C function which does single-byte to
double-byte mappings and vice versa could make it
fast.  We can have an extensible, data driven solution
in no time without having to build it into the core.
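
A back-of-the-envelope sketch of such a data-driven single-byte decoder
(pure Python; the table format is made up for illustration):

    import struct, string

    def make_decoding_table(codepoints):
        # codepoints[i] is the Unicode ordinal assigned to byte value i,
        # e.g. parsed from one of the published 8-bit-to-Unicode mapping files
        table = {}
        for i in range(len(codepoints)):
            table[chr(i)] = struct.pack('>H', codepoints[i])
        return table

    def decode(s, table):
        # 8-bit string -> 2-bytes-per-character (UCS-2, big endian) string
        chunks = []
        for ch in s:
            chunks.append(table[ch])
        return string.join(chunks, '')

    # ISO 8859-1 maps every byte straight to the same Unicode ordinal:
    latin1_table = make_decoding_table(range(256))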

The way I see it, to claim that python has i18n, a
serious effort is needed to ensure every major
encoding in the world is available to Python users.  
But that's separate from the core language.  Your spec
should only cover what is going to be hard-coded into
Python.  

I'd like to see one paragraph in your spec stating
that our architecture separates the encodings
themselves from the core language changes, and that
getting them sorted is a logically separate (but
important) project.  Ideally, we could put together a
separate proposal for the encoding library itself and
run it by some world class experts in that field, but
after yours is done.


- Andy

 



=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From guido@CNRI.Reston.VA.US  Tue Nov 16 13:28:42 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 08:28:42 -0500
Subject: [Python-Dev] Unicode proposal: %-formatting ?
In-Reply-To: Your message of "Tue, 16 Nov 1999 11:40:42 +0100."
 <383134AA.4B49D178@lemburg.com>
References: <000001bf2ff4$d36e2540$042d153f@tim>
 <383134AA.4B49D178@lemburg.com>
Message-ID: <199911161328.IAA29042@eric.cnri.reston.va.us>

> ... hmm, there is a problem there: how should the PyUnicode_Format()
> API deal with '%s' when it sees a Unicode object as argument ?
> 
> E.g. what would you get in these cases:
> 
> u = u"%s %s" % (u"abc", "abc")

From the user's perspective, it should clearly return u"abc abc".

> Perhaps we need a new marker for "insert Unicode object here".

No, please!

BTW, we also need to look at the proposal from JPython's perspective
(where all strings are Unicode; I don't know if they are UTF-16 or
UCS-2).  It should be possible to add a small number of dummy things
to JPython so that a CPython program using unicode can be run
unchanged there.  A minimal set seems to be:

- u"..." is treated the same as "..."; and ur"..." (if accepted) is r"..."
- unichr(c) is the same as chr(c)
- unicode(s[,encoding]) is added
- s.encode([encoding]) is added

Anything I forgot?
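
A rough sketch of those dummy definitions (the u"..." literal syntax
obviously can't be emulated this way, and the .encode() method would
need to be grafted onto the string type by the implementation):

    def unichr(c):
        # in JPython all strings are Unicode already
        return chr(c)

    def unicode(s, encoding=None):
        # native strings are Unicode; nothing to convert
        return s

    def encode(s, encoding=None):
        # stand-in for the proposed s.encode([encoding]) method
        return s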

The default encoding may be tricky; it makes most sense to let the
default encoding be "native" so that unicode(s) and s.encode() can
return s unchanged.  This can occasionally cause programs to fail that
work in CPython, e.g. a program that opens a file in binary mode,
reads a string from it, and converts it to unicode using the default
encoding.  But such programs are on thin ice already (it's always
better to be explicit about encodings).

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@CNRI.Reston.VA.US  Tue Nov 16 13:45:17 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 08:45:17 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Tue, 16 Nov 1999 04:04:47 PST."
 
References: 
Message-ID: <199911161345.IAA29064@eric.cnri.reston.va.us>

> > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
> > "8-bit clean" as you obviously did.
> 
> Hrm. That might be dangerous. Many of the functions that use "t#" assume
> that each character is 8-bits long. i.e. the returned length == the number
> of characters.
> 
> I'm not sure what the implications would be if you interpret the semantics
> of "t#" as multi-byte characters.

Hrm.  Can you quote examples of users of t# who would be confused by
multibyte characters?  I guess that there are quite a few places where
they will be considered illegal, but that's okay -- the string will be
parsed at some point and rejected, e.g. as an illegal filename,
hostname or whatever.  On the other hand, there are quite some places
where I would think that multibyte characters would do just the right
thing.  Many places using t# could just as well be using 's' except
they need to know the length and they don't want to call strlen().
In all cases I've looked at, the reason they need the length is that
they are allocating a buffer (or checking whether it fits in a
statically allocated buffer) -- and there the number of bytes in a
multibyte string is just fine.

Note that I take the same stance on 's' -- it should return multibyte
characters.

> > What for ?
> 
> How about: "because I'm the application developer, and I say that I want
> the raw bytes in the file."

Here I'm with you, man!

> Greg Stein, http://www.lyra.org/

--Guido van Rossum (home page: http://www.python.org/~guido/)


From gward@cnri.reston.va.us  Tue Nov 16 14:10:33 1999
From: gward@cnri.reston.va.us (Greg Ward)
Date: Tue, 16 Nov 1999 09:10:33 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: ; from gstein@lyra.org on Tue, Nov 16, 1999 at 04:09:17AM -0800
References: <199911152137.QAA28280@eric.cnri.reston.va.us> 
Message-ID: <19991116091032.A4063@cnri.reston.va.us>

On 16 November 1999, Greg Stein said:
> This is the reason Python starts up so slow and has a large memory
> footprint. There hasn't been any concern for moving stuff into shared data
> pages. As a result, a process must map in a bunch of vmem pages, for no
> other reason than to allocate Python structures in that memory and copy
> constants in.
> 
> Go start Perl 100 times, then do the same with Python. Python is
> significantly slower. I've actually written a web app in PHP because
> another one that I did in Python had slow response time.
> [ yah: the Real Man Answer is to write a real/good mod_python. ]

I don't think this is the only factor in startup overhead.  Try looking
into the number of system calls for the trivial startup case of each
interpreter:

  $ truss perl -e 1 2> perl.log 
  $ truss python -c 1 2> python.log

(This is on Solaris; I did the same thing on Linux with "strace", and on
IRIX with "par -s -SS".  Dunno about other Unices.)  The results are
interesting, and useful despite the platform and version disparities.

(For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on
Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX.  The Solaris is 2.6,
using the Official CNRI Python Build by Barry, and the ditto Perl build
by me; the Linux system is starship, using whatever Perl and Python the
Starship Masters provide us with; the IRIX box is an elderly but
well-maintained SGI Challenge running IRIX 5.3.)

Also, this is with an empty PYTHONPATH.  The Solaris build of Python has
different prefix and exec_prefix, but on the Linux and IRIX builds, they
are the same.  (I think this will reflect poorly on the Solaris
version.)  PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect
startup of the trivial "1" script, so I haven't paid attention to them.

First, the size of log files (in lines), i.e. number of system calls:

               Solaris     Linux    IRIX[1]
  Perl              88        85      70
  Python           425       316     257

[1] after chopping off the summary counts from the "par" output -- ie.
    these really are the number of system calls, not the number of
    lines in the log files

Next, the number of "open" calls:

               Solaris     Linux    IRIX
  Perl             16         10       9
  Python          107         71      48

(It looks as though *all* of the Perl 'open' calls are due to the
dynamic linker going through /usr/lib and/or /lib.)

And the number of unsuccessful "open" calls:

               Solaris     Linux    IRIX
  Perl              6          1       3
  Python           77         49      32

Number of "mmap" calls:

               Solaris     Linux    IRIX
  Perl              25        25       1
  Python            36        24       1

...nope, guess we can't blame mmap for any Perl/Python startup
disparity.

How about "brk":

               Solaris     Linux    IRIX
  Perl               6        11      12
  Python            47        39      25

...ok, looks like Greg's gripe about memory holds some water.

Rerunning "truss" on Solaris with "python -S -c 1" drastically reduces
the startup overhead as measured by "number of system calls".  Some
quick timing experiments show a drastic speedup (in wall-clock time) by
adding "-S": about 37% faster under Solaris, 56% faster under Linux, and
35% under IRIX.  These figures should be taken with a large grain of
salt, as the Linux and IRIX systems were fairly well loaded at the time,
and the wall-clock results I measured had huge variance.  Still, it gets
the point across.

Oh, also for the record, all timings were done like:

   perl -e 'for $i (1 .. 100) { system "python", "-S", "-c", "1"; }'

because I wanted to guarantee no shell was involved in the Python
startup.
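
A rough Python equivalent of that loop, as a sketch (it likewise
avoids the shell by forking and exec'ing the interpreter directly):

    import os, time

    start = time.time()
    for i in range(100):
        pid = os.fork()
        if pid == 0:
            # child: become the interpreter under test
            os.execvp("python", ["python", "-S", "-c", "1"])
        os.waitpid(pid, 0)
    print "100 startups: %.2f seconds" % (time.time() - start)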

        Greg
-- 
Greg Ward - software developer                    gward@cnri.reston.va.us
Corporation for National Research Initiatives    
1895 Preston White Drive                           voice: +1-703-620-8990
Reston, Virginia, USA  20191-5434                    fax: +1-703-620-0913


From mal@lemburg.com  Tue Nov 16 11:33:07 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 12:33:07 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991116110555.8B43335BB1E@snelboot.oratrix.nl>
Message-ID: <383140F3.EDDB307A@lemburg.com>

Jack Jansen wrote:
> 
> > I would propose to only add some very basic encodings to
> > the standard distribution, e.g. the ones mentioned under
> > Standard Codecs in the proposal:
> >
> >   'utf-8':            8-bit variable length encoding
> >   'utf-16':           16-bit variable length encoding (little/big endian)
> >   'utf-16-le':                utf-16 but explicitly little endian
> >   'utf-16-be':                utf-16 but explicitly big endian
> >   'ascii':            7-bit ASCII codepage
> >   'latin-1':          Latin-1 codepage
> >   'html-entities':    Latin-1 + HTML entities;
> >                       see htmlentitydefs.py from the standard Python Lib
> >   'jis' (a popular version XXX):
> >                       Japanese character encoding
> >   'unicode-escape':   See Unicode Constructors for a definition
> >   'native':           Dump of the Internal Format used by Python
> 
> I would suggest adding the Dos, Windows and Macintosh standard 8-bit charsets
> (their equivalents of latin-1) too, as documents in these encoding are pretty
> ubiquitous. But maybe these should only be added on the respective platforms.

Good idea. What code pages would that be ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue Nov 16 14:13:25 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 15:13:25 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.6
References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com>
Message-ID: <38316685.7977448D@lemburg.com>

FYI, I've uploaded a new version of the proposal which incorporates
many things we have discussed lately, e.g. the buffer interface,
"s#" vs. "t#", etc.

The latest version of the proposal is available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

    · Unicode objects support for %-formatting

    · specifying StreamCodecs

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mal@lemburg.com  Tue Nov 16 12:54:51 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 13:54:51 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com>
Message-ID: <3831541B.B242FFA9@lemburg.com>

Fredrik Lundh wrote:
> 
> > I would propose to only add some very basic encodings to
> > the standard distribution, e.g. the ones mentioned under
> > Standard Codecs in the proposal:
> >
> >   'utf-8': 8-bit variable length encoding
> >   'utf-16': 16-bit variable length encoding (little/big endian)
> >   'utf-16-le': utf-16 but explicitly little endian
> >   'utf-16-be': utf-16 but explicitly big endian
> >   'ascii': 7-bit ASCII codepage
> >   'latin-1': Latin-1 codepage
> >   'html-entities': Latin-1 + HTML entities;
> > see htmlentitydefs.py from the standard Python Lib
> >   'jis' (a popular version XXX):
> > Japanese character encoding
> >   'unicode-escape': See Unicode Constructors for a definition
> >   'native': Dump of the Internal Format used by Python
> 
> since this is already very close, maybe we could adopt
> the naming guidelines from XML:
> 
>     In an encoding declaration, the values "UTF-8", "UTF-16",
>     "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used
>     for the various encodings and transformations of
>     Unicode/ISO/IEC 10646, the values "ISO-8859-1",
>     "ISO-8859-2", ... "ISO-8859-9" should be used for the parts
>     of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS",
>     and "EUC-JP" should be used for the various encoded
>     forms of JIS X-0208-1997.
> 
>     XML processors may recognize other encodings; it is
>     recommended that character encodings registered
>     (as charsets) with the Internet Assigned Numbers
>     Authority [IANA], other than those just listed,
>     should be referred to using their registered names.
> 
>     Note that these registered names are defined to be
>     case-insensitive, so processors wishing to match
>     against them should do so in a case-insensitive way.
> 
> (ie "iso-8859-1" instead of "latin-1", etc -- at least as
> aliases...).

From the proposal:
"""
General Remarks:
----------------

· Unicode encoding names should be lower case on output and
  case-insensitive on input (they will be converted to lower case
  by all APIs taking an encoding name as input).

  Encoding names should follow the name conventions as used by the
  Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is
  written as 'utf-16'.
"""

Is there a naming scheme definition for these encoding names?
(The quote you gave above doesn't really sound like a definition
to me.)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue Nov 16 13:15:19 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 14:15:19 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991116121819.21509.rocketmail@web606.mail.yahoo.com>
Message-ID: <383158E7.BC574A1F@lemburg.com>

Andy Robinson wrote:
> 
> --- "M.-A. Lemburg"  wrote:
> > So I can drop JIS ? [I won't be able to drop the
> > escaped unicode
> > codec because this is needed for u"" and ur"".]
> 
> Drop Japanese from the core language.

Done ... that one was easy ;-)
 
> JIS0208 is a big character set with three popular
> encodings (Shift-JIS, EUC-JP and JIS), and a host of
> slight variations; it has 6879 characters, and there
> are a range of options a user might need to set for it
> to be useful.  So let's assume for now this a separate
> package.  There's a good chance I'll do it but it is
> not a small job.  If you start statically linking in
> tables of 7000 characters for one Asian language,
> you'll have to do the lot.
> 
> As for the single-byte Latin ones, a prototype Python
> module could be whipped up in a couple of evenings,
> and a tiny C function which does single-byte to
> double-byte mappings and vice versa could make it
> fast.  We can have an extensible, data driven solution
> in no time without having to build it into the core.

Perhaps these helper functions could be integrated into
the core to avoid compilation when adding a new codec.
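
As a sketch, such a data-driven single-byte codec is indeed small
(illustrative only; it assumes the proposed unichr() and Unicode
strings are available):

    import string

    # 256-entry table mapping each byte value to a Unicode character
    # (identity shown here; a real codec would load the table from data)
    decoding_table = map(unichr, range(256))
    encoding_map = {}
    for i in range(256):
        encoding_map[decoding_table[i]] = chr(i)

    def decode(s):
        return string.join(map(lambda c: decoding_table[ord(c)], s), u"")

    def encode(u):
        return string.join(map(lambda c: encoding_map[c], u), "")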

> The way I see it, to claim that python has i18n, a
> serious effort is needed to ensure every major
> encoding in the world is available to Python users.
> But that's separate from the core language.  Your spec
> should only cover what is going to be hard-coded into
> Python.

Right.
 
> I'd like to see one paragraph in your spec stating
> that our architecture separates the encodings
> themselves from the core language changes, and that
> getting them sorted is a logically separate (but
> important) project.  Ideally, we could put together a
> separate proposal for the encoding library itself and
> run it by some world class experts in that field, but
> after yours is done.

I've added:
All other encodings, such as the CJK ones needed to support Asian
scripts, should be implemented in separate packages which do not get
included in the core Python distribution and are not part of this proposal.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue Nov 16 13:06:39 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 14:06:39 +0100
Subject: [Python-Dev] just say no...
References: 
Message-ID: <383156DF.2209053F@lemburg.com>

Greg Stein wrote:
> 
> On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
> > Guido van Rossum wrote:
> >...
> > > t# refers to byte-encoded data.  Multibyte encodings are explicitly
> > > designed to be passed cleanly through processing steps that handle
> > > single-byte character data, as long as they are 8-bit clean and don't
> > > do too much processing.
> >
> > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
> > "8-bit clean" as you obviously did.
> 
> Hrm. That might be dangerous. Many of the functions that use "t#" assume
> that each character is 8-bits long. i.e. the returned length == the number
> of characters.
> 
> I'm not sure what the implications would be if you interpret the semantics
> of "t#" as multi-byte characters.

FYI, the next version of the proposal now says "s#" gives you
UTF-16 and "t#" returns UTF-8. File objects opened in text mode
will use "t#" and binary ones use "s#".

I'll just use explicit u.encode('utf-8') calls if I want to write
UTF-8 to binary files -- perhaps everyone else should too ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From akuchlin@mems-exchange.org  Tue Nov 16 14:35:39 1999
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 16 Nov 1999 09:35:39 -0500 (EST)
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <19991116091032.A4063@cnri.reston.va.us>
References: <199911152137.QAA28280@eric.cnri.reston.va.us>
 
 <19991116091032.A4063@cnri.reston.va.us>
Message-ID: <14385.27579.292173.433577@amarok.cnri.reston.va.us>

Greg Ward writes:
>Next, the number of "open" calls:
>               Solaris     Linux    IRIX
>  Perl             16         10       9
>  Python          107         71      48

Running 'python -v' explains this:

amarok akuchlin>python -v
# /usr/local/lib/python1.5/exceptions.pyc matches /usr/local/lib/python1.5/exceptions.py
import exceptions # precompiled from /usr/local/lib/python1.5/exceptions.pyc
# /usr/local/lib/python1.5/site.pyc matches /usr/local/lib/python1.5/site.py
import site # precompiled from /usr/local/lib/python1.5/site.pyc
# /usr/local/lib/python1.5/os.pyc matches /usr/local/lib/python1.5/os.py
import os # precompiled from /usr/local/lib/python1.5/os.pyc
import posix # builtin
# /usr/local/lib/python1.5/posixpath.pyc matches /usr/local/lib/python1.5/posixpath.py
import posixpath # precompiled from /usr/local/lib/python1.5/posixpath.pyc
# /usr/local/lib/python1.5/stat.pyc matches /usr/local/lib/python1.5/stat.py
import stat # precompiled from /usr/local/lib/python1.5/stat.pyc
# /usr/local/lib/python1.5/UserDict.pyc matches /usr/local/lib/python1.5/UserDict.py
import UserDict # precompiled from /usr/local/lib/python1.5/UserDict.pyc
Python 1.5.2 (#80, May 25 1999, 18:06:07)  [GCC 2.8.1] on sunos5
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
import readline # dynamically loaded from /usr/local/lib/python1.5/lib-dynload/readline.so

And each import tries several different forms of the module name:

stat("/usr/local/lib/python1.5/os", 0xEFFFD5E0) Err#2 ENOENT
open("/usr/local/lib/python1.5/os.so", O_RDONLY) Err#2 ENOENT
open("/usr/local/lib/python1.5/osmodule.so", O_RDONLY) Err#2 ENOENT
open("/usr/local/lib/python1.5/os.py", O_RDONLY) = 4

I don't see how this is fixable, unless we strip down site.py, which
drags in os, which drags in os.path and stat and UserDict. 
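
For what it's worth, the probe sequence above is easy to reproduce
without truss -- a quick sketch that counts how many filenames get
tried before a module is found (the list of forms follows the trace
above and is only an approximation):

    import os, sys

    def probes(modname):
        forms = [modname, modname + ".so", modname + "module.so",
                 modname + ".py", modname + ".pyc"]
        count = 0
        for dir in sys.path:
            for form in forms:
                count = count + 1
                if os.path.exists(os.path.join(dir, form)):
                    return count
        return count

    print probes("os"), "filenames probed before 'os' is found"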

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
I'm going stir-crazy, and I've joined the ranks of the walking brain-dead, but
otherwise I'm just peachy.
    -- Lyta Hall on parenthood, in SANDMAN #40: "Parliament of Rooks"



From guido@CNRI.Reston.VA.US  Tue Nov 16 14:43:07 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 09:43:07 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Tue, 16 Nov 1999 14:06:39 +0100."
 <383156DF.2209053F@lemburg.com>
References: 
 <383156DF.2209053F@lemburg.com>
Message-ID: <199911161443.JAA29149@eric.cnri.reston.va.us>

> FYI, the next version of the proposal now says "s#" gives you
> UTF-16 and "t#" returns UTF-8. File objects opened in text mode
> will use "t#" and binary ones use "s#".

Good.

> I'll just use explicit u.encode('utf-8') calls if I want to write
> UTF-8 to binary files -- perhaps everyone else should too ;-)

You could write UTF-8 to files opened in text mode too; at least most
actual systems will leave the UTF-8 escapes alone and just do LF ->
CRLF translation, which should be fine.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fdrake@acm.org  Tue Nov 16 14:50:55 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 09:50:55 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: <000901bf2ffc$3d4bb8e0$042d153f@tim>
References: <14380.16437.71847.832880@weyr.cnri.reston.va.us>
 <000901bf2ffc$3d4bb8e0$042d153f@tim>
Message-ID: <14385.28495.685427.598748@weyr.cnri.reston.va.us>

Tim Peters writes:
 > Yet another use for a weak reference <0.5 wink>.

  Those just keep popping up!  I seem to recall Diane Hackborne
actually implemented these under the name "vref" long ago; perhaps
that's worth revisiting after all?  (Not the implementation so much as 
the idea.)  I think to make it general would cost one PyObject* in
each object's structure, and some code in some constructors (maybe),
and all destructors, but not much.
  Is this worth pursuing, or is it locked out of the core because of
the added space for the PyObject*?  (Note that the concept isn't
necessarily useful for all object types -- numbers in particular --
but it only makes sense to bother if it works for everything, even if
it's not very useful in some cases.)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives


From fdrake@acm.org  Tue Nov 16 15:12:43 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 10:12:43 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: 
References: <3830595B.348E8CC7@lemburg.com>
 
Message-ID: <14385.29803.459364.456840@weyr.cnri.reston.va.us>

Greg Stein writes:
 > [ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]

  And the sooner I receive them, the sooner they can be integrated!
Any plans to get them to me?  I'll probably want to do another release 
before the IPC8.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives


From mal@lemburg.com  Tue Nov 16 14:36:54 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 15:36:54 +0100
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
References: <199911152137.QAA28280@eric.cnri.reston.va.us>  <19991116091032.A4063@cnri.reston.va.us>
Message-ID: <38316C06.8B0E1D7B@lemburg.com>

Greg Ward wrote:
> 
> > Go start Perl 100 times, then do the same with Python. Python is
> > significantly slower. I've actually written a web app in PHP because
> > another one that I did in Python had slow response time.
> > [ yah: the Real Man Answer is to write a real/good mod_python. ]
> 
> I don't think this is the only factor in startup overhead.  Try looking
> into the number of system calls for the trivial startup case of each
> interpreter:
> 
>   $ truss perl -e 1 2> perl.log
>   $ truss python -c 1 2> python.log
> 
> (This is on Solaris; I did the same thing on Linux with "strace", and on
> IRIX with "par -s -SS".  Dunno about other Unices.)  The results are
> interesting, and useful despite the platform and version disparities.
> 
> (For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on
> Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX.  The Solaris is 2.6,
> using the Official CNRI Python Build by Barry, and the ditto Perl build
> by me; the Linux system is starship, using whatever Perl and Python the
> Starship Masters provide us with; the IRIX box is an elderly but
> well-maintained SGI Challenge running IRIX 5.3.)
> 
> Also, this is with an empty PYTHONPATH.  The Solaris build of Python has
> different prefix and exec_prefix, but on the Linux and IRIX builds, they
> are the same.  (I think this will reflect poorly on the Solaris
> version.)  PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect
> startup of the trivial "1" script, so I haven't paid attention to them.

For kicks I've done a similar test with cgipython, the 
one file version of Python 1.5.2:
 
> First, the size of log files (in lines), i.e. number of system calls:
> 
>                Solaris     Linux    IRIX[1]
>   Perl              88        85      70
>   Python           425       316     257

    cgipython                  182 
 
> [1] after chopping off the summary counts from the "par" output -- ie.
>     these really are the number of system calls, not the number of
>     lines in the log files
> 
> Next, the number of "open" calls:
> 
>                Solaris     Linux    IRIX
>   Perl             16         10       9
>   Python          107         71      48

    cgipython                   33 

> (It looks as though *all* of the Perl 'open' calls are due to the
> dynamic linker going through /usr/lib and/or /lib.)
> 
> And the number of unsuccessful "open" calls:
> 
>                Solaris     Linux    IRIX
>   Perl              6          1       3
>   Python           77         49      32

    cgipython                   28

Note that cgipython does search for sitecustomize.py.

> 
> Number of "mmap" calls:
> 
>                Solaris     Linux    IRIX
>   Perl              25        25       1
>   Python            36        24       1

    cgipython                   13

> 
> ...nope, guess we can't blame mmap for any Perl/Python startup
> disparity.
> 
> How about "brk":
> 
>                Solaris     Linux    IRIX
>   Perl               6        11      12
>   Python            47        39      25

    cgipython                   41 (?)

So at least in theory, using cgipython for the intended
purpose should gain some performance.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Tue Nov 16 16:00:58 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 17:00:58 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
Message-ID: <38317FBA.4F3D6B1F@lemburg.com>

Here is a new proposal for the codec interface:

class Codec:

    def encode(self,u,slice=None):
	
	""" Return the Unicode object u encoded as Python string.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is encoded.

	    The method may not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	"""
	...

    def decode(self,s,slice=None):

	""" Return an equivalent Unicode object for the encoded Python
	    string s.

	    If slice is given (as slice object), only the sliced part
	    of the Python string is decoded and returned as Unicode
	    object.  Note that this can cause the decoding algorithm
	    to fail due to truncations in the encoding.

	    The method may not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	""" 
	...
	

class StreamCodec(Codec):

    def __init__(self,stream=None,errors='strict'):

	""" Creates a StreamCodec instance.

	    stream must be a file-like object open for reading and/or
	    writing binary data depending on the intended codec
            action or None.

	    The StreamCodec may implement different error handling
	    schemes by providing the errors argument. These parameters
	    are known (they need not all be supported by StreamCodec
            subclasses): 

	     'strict' - raise a UnicodeError (or a subclass)
	     'ignore' - ignore the character and continue with the next
	     <char> (a single character)
	              - replace erroneous characters with the given
	                character (may also be a Unicode character)

	"""
	self.stream = stream

    def write(self,u,slice=None):

	""" Writes the Unicode object's contents encoded to self.stream.

	    stream must be a file-like object open for writing binary
	    data.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is written.

        """
	... the base class should provide a default implementation
	    of this method using self.encode ...
	
    def read(self,length=None):

	""" Reads an encoded string from the stream and returns
	    an equivalent Unicode object.

	    If length is given, only length Unicode characters are
	    returned (the StreamCodec instance reads as many raw bytes
            as needed to fulfill this requirement). Otherwise, all
	    available data is read and decoded.

        """
	... the base class should provide a default implementation
	    of this method using self.decode ...


It is not required by the unicodec.register() API to provide a
subclass of these base classes; only the given methods must be
present.  This allows writing Codecs as extension types.  All Codecs must
provide the .encode()/.decode() methods. Codecs having the .read()
and/or .write() methods are considered to be StreamCodecs.

The Unicode implementation will by itself only use the
stateless .encode() and .decode() methods.

All other conversions have to be done by explicitly instantiating
the appropriate [Stream]Codec.
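
For illustration, usage might look roughly like this (the lookup
call and the codec names are assumptions, not part of the proposal):

    u = u"some text"

    # stateless use, as the Unicode implementation itself would do it:
    codec = unicodec.lookup('latin-1')       # hypothetical registry call,
                                             # assumed to return an instance
    s = codec.encode(u)                      # Unicode -> 8-bit string
    u2 = codec.decode(s)                     # 8-bit string -> Unicode

    # explicit StreamCodec use for writing to a binary file:
    f = open('out.bin', 'wb')
    writer = Latin1StreamCodec(stream=f)     # hypothetical subclass
    writer.write(u)
    f.close()
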
--

Feel free to beat on this one ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mal@lemburg.com  Tue Nov 16 16:08:49 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 17:08:49 +0100
Subject: [Python-Dev] just say no...
References: <14380.16437.71847.832880@weyr.cnri.reston.va.us>
 <000901bf2ffc$3d4bb8e0$042d153f@tim> <14385.28495.685427.598748@weyr.cnri.reston.va.us>
Message-ID: <38318191.11D93903@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> Tim Peters writes:
>  > Yet another use for a weak reference <0.5 wink>.
> 
>   Those just keep popping up!  I seem to recall Diane Hackborne
> actually implemented these under the name "vref" long ago; perhaps
> that's worth revisiting after all?  (Not the implementation so much as
> the idea.)  I think to make it general would cost one PyObject* in
> each object's structure, and some code in some constructors (maybe),
> and all destructors, but not much.
>   Is this worth pursuing, or is it locked out of the core because of
> the added space for the PyObject*?  (Note that the concept isn't
> necessarily useful for all object types -- numbers in particular --
> but it only makes sense to bother if it works for everything, even if
> it's not very useful in some cases.)

FYI, there's mxProxy which implements a flavor of them. Look
in the standard places for mx stuff ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From fdrake@acm.org  Tue Nov 16 16:14:06 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 11:14:06 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: <38318191.11D93903@lemburg.com>
References: <14380.16437.71847.832880@weyr.cnri.reston.va.us>
 <000901bf2ffc$3d4bb8e0$042d153f@tim>
 <14385.28495.685427.598748@weyr.cnri.reston.va.us>
 <38318191.11D93903@lemburg.com>
Message-ID: <14385.33486.855802.187739@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > FYI, there's mxProxy which implements a flavor of them. Look
 > in the standard places for mx stuff ;-)

  Yes, but still not in the core.  So we have two general examples
(vrefs and mxProxy) and there's WeakDict (or something like that).  I
think there really needs to be a core facility for this.  There are a
lot of users (including myself) who think that things are far less
useful if they're not in the core.  (No, I'm not saying that
everything should be in the core, or even that it needs a lot more
stuff.  I just don't want to be writing code that requires a lot of
separate packages to be installed.  At least not until we can tell an
installation tool to "install this and everything it depends on." ;)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives


From bwarsaw@cnri.reston.va.us (Barry A. Warsaw)  Tue Nov 16 16:14:55 1999
From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw)
Date: Tue, 16 Nov 1999 11:14:55 -0500 (EST)
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
References: <199911152137.QAA28280@eric.cnri.reston.va.us>
 
 <19991116091032.A4063@cnri.reston.va.us>
 <14385.27579.292173.433577@amarok.cnri.reston.va.us>
Message-ID: <14385.33535.23316.286575@anthem.cnri.reston.va.us>

>>>>> "AMK" == Andrew M Kuchling  writes:

    AMK> I don't see how this is fixable, unless we strip down
    AMK> site.py, which drags in os, which drags in os.path and stat
    AMK> and UserDict.

One approach might be to support loading modules out of jar files (or
whatever) using Greg's imputil.  We could put the bootstrap .pyc files
in this jar and teach Python to import from it first.  Python
installations could even craft their own modules.jar file to include
whatever modules they are willing to "hard code".  This, with -S, might
make Python start up much faster, at the small cost of some
flexibility (which could be regained with a c.l. switch or other
mechanism to bypass modules.jar).
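
A very rough sketch of the mechanics (illustrative only; it bypasses
imputil and just pulls a pre-compiled module out of an archive by
hand -- the archive name and layout are assumptions):

    import zipfile, marshal, imp, sys

    def import_from_jar(jarpath, modname):
        zf = zipfile.ZipFile(jarpath)
        data = zf.read(modname + ".pyc")      # e.g. "stat.pyc"
        zf.close()
        # skip the 8-byte .pyc header (magic + mtime), then unmarshal
        code = marshal.loads(data[8:])
        mod = imp.new_module(modname)
        sys.modules[modname] = mod
        exec code in mod.__dict__
        return mod

    # usage: stat = import_from_jar("modules.jar", "stat")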

-Barry


From guido@CNRI.Reston.VA.US  Tue Nov 16 16:20:28 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 11:20:28 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Tue, 16 Nov 1999 17:00:58 +0100."
 <38317FBA.4F3D6B1F@lemburg.com>
References: <38317FBA.4F3D6B1F@lemburg.com>
Message-ID: <199911161620.LAA02643@eric.cnri.reston.va.us>

> It is not required by the unicodec.register() API to provide a
> subclass of these base class, only the given methods must be present;
> this allows writing Codecs as extensions types.  All Codecs must
> provide the .encode()/.decode() methods. Codecs having the .read()
> and/or .write() methods are considered to be StreamCodecs.
> 
> The Unicode implementation will by itself only use the
> stateless .encode() and .decode() methods.
> 
> All other conversion have to be done by explicitly instantiating
> the appropriate [Stream]Codec.

Looks okay, although I'd like someone to implement a simple
shift-state-based stream codec to check this out further.

I have some questions about the constructor.  You seem to imply
that instantiating the class without arguments creates a codec without
state.  That's fine.  When given a stream argument, shouldn't the
direction of the stream be given as an additional argument, so the
proper state for encoding or decoding can be set up?  I can see that
for an implementation it might be more convenient to have separate
classes for encoders and decoders -- certainly the state being kept is
very different.

Also, I don't want to ignore the alternative interface that was
suggested by /F.  It uses feed() similar to htmllib c.s.  This has
some advantages (although we might want to define some compatibility
so it can also feed directly into a file).

Perhaps someone should go ahead and implement prototype codecs using
either paradigm and then write some simple apps, so we can make a
better decision.

In any case I think the specs for the codec registry API aren't on the
critical path; integration of /F's basic unicode object is the first
thing we need.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@CNRI.Reston.VA.US  Tue Nov 16 16:27:53 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 11:27:53 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: Your message of "Tue, 16 Nov 1999 11:14:55 EST."
 <14385.33535.23316.286575@anthem.cnri.reston.va.us>
References: <199911152137.QAA28280@eric.cnri.reston.va.us>  <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us>
 <14385.33535.23316.286575@anthem.cnri.reston.va.us>
Message-ID: <199911161627.LAA02665@eric.cnri.reston.va.us>

> >>>>> "AMK" == Andrew M Kuchling  writes:
> 
>     AMK> I don't see how this is fixable, unless we strip down
>     AMK> site.py, which drags in os, which drags in os.path and stat
>     AMK> and UserDict.
> 
> One approach might be to support loading modules out of jar files (or
> whatever) using Greg imputils.  We could put the bootstrap .pyc files
> in this jar and teach Python to import from it first.  Python
> installations could even craft their own modules.jar file to include
> whatever modules they are willing to "hard code".  This, with -S might
> make Python start up much faster, at the small cost of some
> flexibility (which could be regained with a c.l. switch or other
> mechanism to bypass modules.jar).

A completely different approach (which, incidentally, HP has lobbied
for before; and which has been implemented by Sjoerd Mullender for one
particular application) would be to cache a mapping from module names
to filenames in a dbm file.  For Sjoerd's app (which imported hundreds
of modules) this made a huge difference.  The problem is that it's
hard to deal with issues like updating the cache while sharing it with
other processes and even other users...  But if those can be solved,
this could greatly reduce the number of stats and unsuccessful opens,
without having to resort to jar files.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From gmcm@hypernet.com  Tue Nov 16 16:56:19 1999
From: gmcm@hypernet.com (Gordon McMillan)
Date: Tue, 16 Nov 1999 11:56:19 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <14385.33535.23316.286575@anthem.cnri.reston.va.us>
Message-ID: <1269351119-9152905@hypernet.com>

Barry A. Warsaw writes:

> One approach might be to support loading modules out of jar files
> (or whatever) using Greg imputils.  We could put the bootstrap
> .pyc files in this jar and teach Python to import from it first. 
> Python installations could even craft their own modules.jar file
> to include whatever modules they are willing to "hard code". 
> This, with -S might make Python start up much faster, at the
> small cost of some flexibility (which could be regained with a
> c.l. switch or other mechanism to bypass modules.jar).

Couple hundred Windows users have been doing this for 
months (http://starship.python.net/crew/gmcm/install.html). 
The .pyz files are cross-platform, although the "embedding" 
app would have to be redone for *nix, (and all the embedding 
really does is keep Python from hunting all over your disk). 
Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a 
diskette with a little room left over.

but-since-its-WIndows-it-must-be-tainted-ly y'rs


- Gordon


From guido@CNRI.Reston.VA.US  Tue Nov 16 17:00:15 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 12:00:15 -0500
Subject: [Python-Dev] Python 1.6 status
Message-ID: <199911161700.MAA02716@eric.cnri.reston.va.us>

Greg Stein recently reminded me that he was holding off on 1.6 patches
because he was under the impression that I wasn't accepting them yet.

The situation is rather more complicated than that.  There are a great
many things that need to be done, and for many of them I'd be most
happy to receive patches!  For other things, however, I'm still in the
requirements analysis phase, and patches might be premature (e.g., I
want to redesign the import mechanisms, and while I like some of the
prototypes that have been posted, I'm not ready to commit to any
specific implementation).

How do you know for which things I'm ready for patches?  Ask me.  I've
tried to make lists before, and there are probably some hints in the
TODO FAQ wizard as well as in the "requests" section of the Python
Bugs List.

Greg also suggested that I might receive more patches if I opened up
the CVS tree for checkins by certain valued contributors.  On the one
hand I'm reluctant to do that (I feel I have a pretty good track
record of checking in patches that are mailed to me, assuming I agree
with them) but on the other hand there might be something to say for
this, because it gives contributors more of a sense of belonging to
the inner core.  Of course, checkin privileges don't mean you can
check in anything you like -- as in the Apache world, changes must be
discussed and approved by the group, and I would like to have a veto.
However once a change is approved, it's much easier if the contributor
can check the code in without having to go through me all the time.

A drawback may be that some people will make very forceful requests to
be given checkin privileges, only to never use them; just like there
are some members of python-dev who have never contributed.  I
definitely want to limit the number of privileged contributors to a
very small number (e.g. 10-15).

One additional detail is the legal side -- contributors will have to
sign some kind of legal document similar to the current (wetsign.html)
release form, but guiding all future contributions.  I'll have to
discuss this with CNRI's legal team.

Greg, I understand you have checkin privileges for Apache.  What is
the procedure there for handing out those privileges?  What is the
procedure for using them?  (E.g. if you made a bogus change to part of
Apache you're not supposed to work on, what happens?)

I'm hoping for several kinds of responses to this email:

- uncontroversial patches

- questions about whether specific issues are sufficiently settled to
start coding a patch

- discussion threads opening up some issues that haven't been settled
yet (like the current, very productive, thread in i18n)

- posts summarizing issues that were settled long ago in the past,
requesting reverification that the issue is still settled

- suggestions for new issues that maybe ought to be settled in 1.6

- requests for checkin privileges, preferably with a specific issue or
area of expertise for which the requestor will take responsibility

--Guido van Rossum (home page: http://www.python.org/~guido/)


From akuchlin@mems-exchange.org  Tue Nov 16 17:11:48 1999
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 16 Nov 1999 12:11:48 -0500 (EST)
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us>
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: <14385.36948.610106.195971@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>I'm hoping for several kind of responses to this email:

My list of things to do for 1.6 is:

   * Translate re.py to C and switch to the latest PCRE 2 codebase
(mostly done, perhaps ready for public review in a week or so).

   * Go through the O'Reilly POSIX book and draw up a list of missing
POSIX functions that aren't available in the posix module.  This
was sparked by Greg Ward showing me a Perl daemonize() function
he'd written, and I realized that some of the functions it used
weren't available in Python at all.  (setsid() was one of them, I
think.)

   * A while back I got approval to add the mmapfile module to the
core.  The outstanding issue there is that the constructor has a
different interface on Unix and Windows platforms.

On Windows:
mm = mmapfile.mmapfile("filename", "tag name", )

On Unix, it looks like the mmap() function:

mm = mmapfile.mmapfile(<filedesc>, <size>,
                        <flags> (like MAP_SHARED),
                        <prot>  (like PROT_READ, PROT_READWRITE)
                      )

Can we reconcile these interfaces, have two different function names,
or what?

>- suggestions for new issues that maybe ought to be settled in 1.6

Perhaps we should figure out what new capabilities, if any, should be
added in 1.6.  Fred has mentioned weak references, and there are other
possibilities such as ExtensionClass.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Society, my dear, is like salt water, good to swim in but hard to swallow.
    -- Arthur Stringer, _The Silver Poppy_



From beazley@cs.uchicago.edu  Tue Nov 16 17:24:24 1999
From: beazley@cs.uchicago.edu (David Beazley)
Date: Tue, 16 Nov 1999 11:24:24 -0600 (CST)
Subject: [Python-Dev] Python 1.6 status
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
 <14385.36948.610106.195971@amarok.cnri.reston.va.us>
Message-ID: <199911161724.LAA13496@gargoyle.cs.uchicago.edu>

Andrew M. Kuchling writes:
> Guido van Rossum writes:
> >I'm hoping for several kind of responses to this email:
> 
>    * Go through the O'Reilly POSIX book and draw up a list of missing
> POSIX functions that aren't available in the posix module.  This
> was sparked by Greg Ward showing me a Perl daemonize() function
> he'd written, and I realized that some of the functions it used
> weren't available in Python at all.  (setsid() was one of them, I
> think.)
> 

I second this!   This was one of the things I noticed when doing the
Essential Reference Book.   Assuming no one has done it already,
I wouldn't mind volunteering to take a crack at it.

Cheers,

Dave




From fdrake@acm.org  Tue Nov 16 17:25:02 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 12:25:02 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <199911161620.LAA02643@eric.cnri.reston.va.us>
References: <38317FBA.4F3D6B1F@lemburg.com>
 <199911161620.LAA02643@eric.cnri.reston.va.us>
Message-ID: <14385.37742.816993.642515@weyr.cnri.reston.va.us>

Guido van Rossum writes:
 > Also, I don't want to ignore the alternative interface that was
 > suggested by /F.  It uses feed() similar to htmllib c.s.  This has
 > some advantages (although we might want to define some compatibility
 > so it can also feed directly into a file).

  I think one or the other can be used, and then a wrapper that
converts to the other interface.  Perhaps the encoders should provide
feed(), and a file-like wrapper can convert write() to feed().  It
could also be done the other way; I'm not sure if it matters which is
"normal."  (Or perhaps feed() was badly named and should be write()?
The general intent was a little different, I think, but an output file 
is very much a stream consumer.)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives


From akuchlin@mems-exchange.org  Tue Nov 16 17:32:41 1999
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 16 Nov 1999 12:32:41 -0500 (EST)
Subject: [Python-Dev] mmapfile module
In-Reply-To: <199911161720.MAA02764@eric.cnri.reston.va.us>
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
 <14385.36948.610106.195971@amarok.cnri.reston.va.us>
 <199911161720.MAA02764@eric.cnri.reston.va.us>
Message-ID: <14385.38201.301429.786642@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>Hm, this seems to require a higher-level Python module to hide the
>differences.  Maybe the Unix version could also use a filename?  I
>would think that mmap'ed files should always be backed by a file (not
>by a pipe, socket etc.).  Or is there an issue with secure creation of
>temp files?  This is a question for a separate thread.

Hmm... I don't know of any way to use mmap() on non-file things,
either; there are odd special cases, like using MAP_ANONYMOUS on
/dev/zero to allocate memory, but that's still using a file.  On the
other hand, there may be some special case where you need to do that.
We could add a fileno() method to get the file descriptor, but I don't
know if that's useful to Windows.  (Is Sam Rushing, the original
author of the Win32 mmapfile, on this list?)  

What do we do about the tagname, which is a Win32 argument that has no
Unix counterpart -- I'm not even sure what its function is.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
I had it in me to be the Pierce Brosnan of my generation.
    -- Vincent Me's past career plans in EGYPT #1


From mal@lemburg.com  Tue Nov 16 17:53:46 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 18:53:46 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us>
Message-ID: <38319A2A.4385D2E7@lemburg.com>

Guido van Rossum wrote:
> 
> > It is not required by the unicodec.register() API to provide a
> > subclass of these base class, only the given methods must be present;
> > this allows writing Codecs as extensions types.  All Codecs must
> > provide the .encode()/.decode() methods. Codecs having the .read()
> > and/or .write() methods are considered to be StreamCodecs.
> >
> > The Unicode implementation will by itself only use the
> > stateless .encode() and .decode() methods.
> >
> > All other conversion have to be done by explicitly instantiating
> > the appropriate [Stream]Codec.
> 
> Looks okay, although I'd like someone to implement a simple
> shift-state-based stream codec to check this out further.
> 
> I have some questions about the constructor.  You seem to imply
> that instantiating the class without arguments creates a codec without
> state.  That's fine.  When given a stream argument, shouldn't the
> direction of the stream be given as an additional argument, so the
> proper state for encoding or decoding can be set up?  I can see that
> for an implementation it might be more convenient to have separate
> classes for encoders and decoders -- certainly the state being kept is
> very different.

Wouldn't it be possible to have the read/write methods set up
the state when called for the first time ?

Note that I wrote ".read() and/or .write() methods" in the proposal
on purpose: you can of course implement Codecs which only implement
one of them, i.e. Readers and Writers. The registry doesn't care
about them anyway :-)

Then, if you use a Reader for writing, it will result in an
AttributeError...
 
> Also, I don't want to ignore the alternative interface that was
> suggested by /F.  It uses feed() similar to htmllib c.s.  This has
> some advantages (although we might want to define some compatibility
> so it can also feed directly into a file).

AFAIK, .feed() and .finalize() (or .close() etc.) have a different
background: you add data in chunks and then process it at some
final stage rather than for each feed. This is often more
efficient.

With respect to codecs this would mean that you buffer the
output in memory, first doing only preliminary operations on
the feeds and then apply some final logic to the buffer at
the time .finalize() is called.

We could define a StreamCodec subclass for this kind of operation.

> Perhaps someone should go ahead and implement prototype codecs using
> either paradigm and then write some simple apps, so we can make a
> better decision.
> 
> In any case I think the specs codec registry API aren't on the
> critical path, integration of /F's basic unicode object is the first
> thing we need.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From gward@cnri.reston.va.us  Tue Nov 16 17:54:06 1999
From: gward@cnri.reston.va.us (Greg Ward)
Date: Tue, 16 Nov 1999 12:54:06 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <199911161627.LAA02665@eric.cnri.reston.va.us>; from guido@cnri.reston.va.us on Tue, Nov 16, 1999 at 11:27:53AM -0500
References: <199911152137.QAA28280@eric.cnri.reston.va.us>  <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> <199911161627.LAA02665@eric.cnri.reston.va.us>
Message-ID: <19991116125405.B4063@cnri.reston.va.us>

On 16 November 1999, Guido van Rossum said:
> A completely different approach (which, incidentally, HP has lobbied
> for before; and which has been implemented by Sjoerd Mullender for one
> particular application) would be to cache a mapping from module names
> to filenames in a dbm file.  For Sjoerd's app (which imported hundreds
> of modules) this made a huge difference.

Hey, this could be a big win for Zope startup.  Dunno how much of that
20-30 sec startup overhead is due to loading modules, but I'm sure it's
a sizeable percentage.  Any Zope-heads listening?

> The problem is that it's
> hard to deal with issues like updating the cache while sharing it with
> other processes and even other users...

Probably not a concern in the case of Zope: one installation, one
process, only gets started when it's explicitly shut down and
restarted.  HmmmMMMMmmm...

        Greg


From petrilli@amber.org  Tue Nov 16 18:04:46 1999
From: petrilli@amber.org (Christopher Petrilli)
Date: Tue, 16 Nov 1999 13:04:46 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <19991116125405.B4063@cnri.reston.va.us>; from gward@cnri.reston.va.us on Tue, Nov 16, 1999 at 12:54:06PM -0500
References: <199911152137.QAA28280@eric.cnri.reston.va.us>  <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> <199911161627.LAA02665@eric.cnri.reston.va.us> <19991116125405.B4063@cnri.reston.va.us>
Message-ID: <19991116130446.A3068@trump.amber.org>

Greg Ward [gward@cnri.reston.va.us] wrote:
> On 16 November 1999, Guido van Rossum said:
> > A completely different approach (which, incidentally, HP has lobbied
> > for before; and which has been implemented by Sjoerd Mullender for one
> > particular application) would be to cache a mapping from module names
> > to filenames in a dbm file.  For Sjoerd's app (which imported hundreds
> > of modules) this made a huge difference.
> 
> Hey, this could be a big win for Zope startup.  Dunno how much of that
> 20-30 sec startup overhead is due to loading modules, but I'm sure it's
> a sizeable percentage.  Any Zope-heads listening?

Wow, that's a huge start up that I've personally never seen.  I can't
imagine... even loading the Oracle libraries dynamically, which are HUGE
(2Mb or so), it's only a couple seconds.  

> > The problem is that it's
> > hard to deal with issues like updating the cache while sharing it with
> > other processes and even other users...
> 
> Probably not a concern in the case of Zope: one installation, one
> process, only gets started when it's explicitly shut down and
> restarted.  HmmmMMMMmmm...

This doesn't resolve it for a lot of other users of Python, however... and Zope
would always benefit, especially when you're running multiple instances
on the same machine... they would perhaps share more code.

Chris
-- 
| Christopher Petrilli
| petrilli@amber.org


From gmcm@hypernet.com  Tue Nov 16 18:04:41 1999
From: gmcm@hypernet.com (Gordon McMillan)
Date: Tue, 16 Nov 1999 13:04:41 -0500
Subject: [Python-Dev] mmapfile module
In-Reply-To: <14385.38201.301429.786642@amarok.cnri.reston.va.us>
References: <199911161720.MAA02764@eric.cnri.reston.va.us>
Message-ID: <1269347016-9399681@hypernet.com>

Andrew M. Kuchling wrote:

> Hmm... I don't know of any way to use mmap() on non-file things,
> either; there are odd special cases, like using MAP_ANONYMOUS on
> /dev/zero to allocate memory, but that's still using a file.  On
> the other hand, there may be some special case where you need to
> do that. We could add a fileno() method to get the file
> descriptor, but I don't know if that's useful to Windows.  (Is
> Sam Rushing, the original author of the Win32 mmapfile, on this
> list?)  
> 
> What do we do about the tagname, which is a Win32 argument that
> has no Unix counterpart -- I'm not even sure what its function
> is.

On Windows, a mmap is always backed by disk (swap 
space), but is not necessarily associated with a (user-land) 
file. The tagname is like the "name" associated with a 
semaphore; two processes opening the same tagname get 
shared memory.

Fileno (in the c runtime sense) would be useless on Windows. 
As with all Win32 resources, there's a "handle", which is 
analogous. But different enough, it seems to me, to confound 
any attempts at a common API.

Another fundamental difference (IIRC) is that Windows mmap's 
can be resized on the fly.

- Gordon


From guido@CNRI.Reston.VA.US  Tue Nov 16 18:09:43 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 13:09:43 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Tue, 16 Nov 1999 18:53:46 +0100."
 <38319A2A.4385D2E7@lemburg.com>
References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us>
 <38319A2A.4385D2E7@lemburg.com>
Message-ID: <199911161809.NAA02894@eric.cnri.reston.va.us>

> > I have some questions about the constructor.  You seem to imply
> > that instantiating the class without arguments creates a codec without
> > state.  That's fine.  When given a stream argument, shouldn't the
> > direction of the stream be given as an additional argument, so the
> > proper state for encoding or decoding can be set up?  I can see that
> > for an implementation it might be more convenient to have separate
> > classes for encoders and decoders -- certainly the state being kept is
> > very different.
> 
> Wouldn't it be possible to have the read/write methods set up
> the state when called for the first time ?

Hm, I'd rather be explicit.  We don't do this for files either.

> Note that I wrote ".read() and/or .write() methods" in the proposal
> on purpose: you can of course implement Codecs which only implement
> one of them, i.e. Readers and Writers. The registry doesn't care
> about them anyway :-)
> 
> Then, if you use a Reader for writing, it will result in an
> AttributeError...
>  
> > Also, I don't want to ignore the alternative interface that was
> > suggested by /F.  It uses feed() similar to htmllib c.s.  This has
> > some advantages (although we might want to define some compatibility
> > so it can also feed directly into a file).
> 
> AFAIK, .feed() and .finalize() (or .close() etc.) have a different
> background: you add data in chunks and then process it at some
> final stage rather than for each feed. This is often more
> efficient.
> 
> With respect to codecs this would mean that you buffer the
> output in memory, first doing only preliminary operations on
> the feeds and then apply some final logic to the buffer at
> the time .finalize() is called.

This is part of the purpose, yes.

> We could define a StreamCodec subclass for this kind of operation.

The difference is that to decode from a file, your proposed interface
is to call read() on the codec which will in turn call read() on the
stream.  In /F's version, I call read() on the stream (getting multibyte
encoded data), feed() that to the codec, which in turn calls feed() to
some other back end -- perhaps another codec which in turn feed()s its
converted data to another file, perhaps an XML parser.
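
Schematically, the chain might look like this (a rough sketch only; the
class name and feed() details are made up, and buffering of incomplete
multibyte characters is ignored):

class FeedDecoder:
    # hypothetical feed()-style decoder that pushes its output
    # on to another feed()-style object
    def __init__(self, decodefunc, target):
        self.decode = decodefunc  # e.g. a shift-jis -> unicode function
        self.target = target      # another codec, a file, an XML parser...
    def feed(self, data):
        self.target.feed(self.decode(data))
    def close(self):
        self.target.close()

# the driving loop then reads from the stream and feeds the chain:
#     while 1:
#         chunk = stream.read(BUFFER_SIZE)
#         if not chunk:
#             break
#         decoder.feed(chunk)
#     decoder.close()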

--Guido van Rossum (home page: http://www.python.org/~guido/)


From fdrake@acm.org  Tue Nov 16 18:16:42 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 13:16:42 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <38319A2A.4385D2E7@lemburg.com>
References: <38317FBA.4F3D6B1F@lemburg.com>
 <199911161620.LAA02643@eric.cnri.reston.va.us>
 <38319A2A.4385D2E7@lemburg.com>
Message-ID: <14385.40842.709711.12141@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > Wouldn't it be possible to have the read/write methods set up
 > the state when called for the first time ?

  That slows things down; the constructor should handle initialization.
Perhaps what gets registered should be:  encoding function, decoding
function, stream encoder factory (can be a class), stream decoder
factory (again, can be a class).  These can be encapsulated either
before or after hitting the registry, and can be None.  The registry
can provide default implementations from what is provided (stream
handlers from the functions, or functions from the stream handlers) as 
required.
  Ideally, I should be able to write a module with four well-known
entry points and then provide the module object itself as the
registration entry.  Or I could construct a new object that has the
right interface and register that if it made more sense for the
encoding.
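
  For instance (a rough sketch only, with made-up names -- none of this
is a proposed API):

_registry = {}

def register(name, provider):
    # provider may be a module or any object with some subset of the
    # four well-known attributes; missing ones are recorded as None
    entry = {}
    for attr in ('encode', 'decode', 'stream_encoder', 'stream_decoder'):
        if hasattr(provider, attr):
            entry[attr] = getattr(provider, attr)
        else:
            entry[attr] = None
    # this is where the registry could synthesize missing stream
    # handlers from the plain functions, or vice versa
    _registry[name] = entry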

 > AFAIK, .feed() and .finalize() (or .close() etc.) have a different
 > background: you add data in chunks and then process it at some
 > final stage rather than for each feed. This is often more

  Many of the classes that provide feed() do as much work as possible
as data is fed into them (see htmllib.HTMLParser); this structure is
commonly used to support asynchronous operation.

 > With respect to codecs this would mean that you buffer the
 > output in memory, first doing only preliminary operations on
 > the feeds and then apply some final logic to the buffer at
 > the time .finalize() is called.

  That depends on the encoding.  I'd expect it to feed encoded data to 
a sink as quickly as it could and let the target decide what needs to
happen.  If buffering is needed, the target could be a StringIO or
whatever.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives


From fredrik@pythonware.com  Tue Nov 16 19:32:21 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 20:32:21 +0100
Subject: [Python-Dev] mmapfile module
References: <199911161700.MAA02716@eric.cnri.reston.va.us><14385.36948.610106.195971@amarok.cnri.reston.va.us><199911161720.MAA02764@eric.cnri.reston.va.us> <14385.38201.301429.786642@amarok.cnri.reston.va.us>
Message-ID: <002201bf3069$4e232a50$f29b12c2@secret.pythonware.com>

> Hmm... I don't know of any way to use mmap() on non-file things,
> either; there are odd special cases, like using MAP_ANONYMOUS on
> /dev/zero to allocate memory, but that's still using a file.

but that's not always the case -- OSF/1 supports
truly anonymous mappings, for example.  in fact,
it bombs if you use ANONYMOUS with a file handle:

$ man mmap

    ...

    If MAP_ANONYMOUS is set in the flags parameter:

        +  A new memory region is created and initialized to all zeros.  This
           memory region can be shared only with descendents of the current pro-
           cess.

        +  If the filedes parameter is not -1, the mmap() function fails.

    ...

(btw, doing anonymous maps isn't exactly an odd special
case under this operating system; it's the only memory-
allocation mechanism provided by the kernel...)





From fredrik@pythonware.com  Tue Nov 16 19:33:52 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 20:33:52 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>  <199911161620.LAA02643@eric.cnri.reston.va.us>
Message-ID: <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> Also, I don't want to ignore the alternative interface that was
> suggested by /F.  It uses feed() similar to htmllib c.s.  This has
> some advantages (although we might want to define some
> compatibility so it can also feed directly into a file).

seeing this made me switch on my brain for a moment,
and recall how things are done in PIL (which is, as I've
bragged about before, another library with an internal
format, and many possible external encodings).  among
other things, PIL lets you read and write images to both
ordinary files and arbitrary file objects, but it also lets
you incrementally decode images by feeding it chunks
of data (through ImageFile.Parser).  and it's fast -- it has
to be, since images tend to contain lots of pixels...

anyway, here's what I came up with (code will follow,
if someone's interested).

--------------------------------------------------------------------
A PIL-like Unicode Codec Proposal
--------------------------------------------------------------------

In the PIL model, the codecs are called with a piece of data, and
return the result to the caller.  The codecs maintain internal state
when needed.

class decoder:

    def decode(self, s, offset=0):
        # decode as much data as we possibly can from the
        # given string.  if there's not enough data in the
        # input string to form a full character, return
        # what we've got this far (this might be an empty
        # string).

    def flush(self):
        # flush the decoding buffers.  this should usually
        # return None, unless knowing that the input stream
        # has ended means that the state can be interpreted
        # in a meaningful way.  however, if the state
        # indicates that the last character was not
        # finished, this method should raise a UnicodeError
        # exception.

class encoder:

    def encode(self, u, offset=0, buffersize=0):
        # encode data from the given offset in the input
        # unicode string into a buffer of the given size
        # (or slightly larger, if required to proceed).
        # if the buffer size is 0, the encoder is free
        # to pick a suitable size itself (if at all
        # possible, it should make it large enough to
        # encode the entire input string).  returns a
        # 2-tuple containing the encoded data, and the
        # number of characters consumed by this call.

    def flush(self):
        # flush the encoding buffers.  returns an ordinary
        # string (which may be empty), or None.

Note that a codec instance can be used for a single string; the codec
registry should hold codec factories, not codec instances.  In
addition, you may use a single type or class to implement both
interfaces at once.

--------------------------------------------------------------------
Use Cases
--------------------------------------------------------------------

A null decoder:

    class decoder:
        def decode(self, s, offset=0):
            return s[offset:]
        def flush(self):
            pass

A null encoder:

    class encoder:
        def encode(self, s, offset=0, buffersize=0):
            if buffersize:
                s = s[offset:offset+buffersize]
            else:
                s = s[offset:]
            return s, len(s)
        def flush(self):
            pass

Decoding a string:

    def decode(s, encoding):
        c = registry.getdecoder(encoding)
        u = c.decode(s)
        t = c.flush()
        if not t:
            return u
        return u + t # not very common

Encoding a string:

    def encode(u, encoding):
        c = registry.getencoder(encoding)
        p = []
        o = 0
        while o < len(u):
            s, n = c.encode(u, o)
            p.append(s)
            o = o + n
        if len(p) == 1:
            return p[0]
        return string.join(p, "") # not very common

Implementing stream codecs is left as an exercise (see the zlib
material in the eff-bot guide for a decoder example).

--- end of proposal
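
(for the impatient: here's a rough, untested sketch of a reading
wrapper built on the decoder interface above.  it assumes the decoder
carries incomplete input over in its internal state; error handling
omitted.)

import string

class StreamReader:
    def __init__(self, file, decoder, blocksize=512):
        self.file = file          # any object with a read() method
        self.decoder = decoder    # a decoder instance, as defined above
        self.blocksize = blocksize
    def read(self):
        # decode the whole stream; a real implementation would
        # honour a size argument
        parts = []
        while 1:
            chunk = self.file.read(self.blocksize)
            if not chunk:
                break
            parts.append(self.decoder.decode(chunk))
        tail = self.decoder.flush()
        if tail:
            parts.append(tail)
        return string.join(parts, "")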



From fredrik@pythonware.com  Tue Nov 16 19:37:40 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 20:37:40 +0100
Subject: [Python-Dev] Python 1.6 status
References: <199911161700.MAA02716@eric.cnri.reston.va.us> <14385.36948.610106.195971@amarok.cnri.reston.va.us>
Message-ID: <003d01bf306a$0bdea330$f29b12c2@secret.pythonware.com>

>    * Go through the O'Reilly POSIX book and draw up a list of missing
> POSIX functions that aren't available in the posix module.  This
> was sparked by Greg Ward showing me a Perl daemonize() function
> he'd written, and I realized that some of the functions it used
> weren't available in Python at all.  (setsid() was one of them, I
> think.)

$ python
Python 1.5.2 (#1, Aug 23 1999, 14:42:39)  [GCC 2.7.2.3] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import os
>>> os.setsid






From mhammond@skippinet.com.au  Tue Nov 16 21:54:15 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 17 Nov 1999 08:54:15 +1100
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <19991116110555.8B43335BB1E@snelboot.oratrix.nl>
Message-ID: <00f701bf307d$20f0cb00$0501a8c0@bobcat>

[Andy writes:]
> Leave JISXXX and the CJK stuff out.  If you get into Japanese, you
> really need to cover ShiftJIS, EUC-JP and JIS, they are big, and
there

[Then Marc replies:]
> 2. give more information to the unicodec registry:
>    one could register classes instead of instances which the Unicode

[Jack chimes in with:]
> I would suggest adding the Dos, Windows and Macintosh
> standard 8-bit charsets
> (their equivalents of latin-1) too, as documents in these
> encoding are pretty
> ubiquitous. But maybe these should only be added on the
> respective platforms.

[And the conversation twisted around to Greg noting:]
> Next, the number of "open" calls:
>
>               Solaris     Linux    IRIX
>  Perl             16         10       9
>  Python          107         71      48

This is leading me to conclude that our "codec registry" should be the
file system, and Python modules.

Would it be possible to define a "standard package" called
"encodings", and when we need an encoding, we simply attempt to load a
module from that package?  The key benefits I see are:

* No need to load modules simply to register a codec (which would make
the number of open calls even higher, and the startup time even
slower.)  This makes it truly demand-loading of the codecs, rather
than explicit load-and-register.

* Making language specific distributions becomes simple - simply
select a different set of modules from the "encodings" directory.  The
Python source distribution has them all, but (say) the Windows binary
installer selects only a few.  The Japanese binary installer for
Windows installs a few more.

* Installing new codecs becomes trivial - no need to hack site.py
etc - simply copy the new "codec module" to the encodings directory
and you are done.

* No serious problem for GMcM's installer nor for freeze

We would probably need to assume that certain codecs exist for _all_
platforms and languages - but this is no different to assuming that
"exceptions.py" also exists for all platforms.

Is this worthy of consideration?

Mark.



From andy@robanal.demon.co.uk  Wed Nov 17 00:14:06 1999
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Wed, 17 Nov 1999 00:14:06 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <010001bf300e$14741310$f29b12c2@secret.pythonware.com>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>             <38305D17.60EC94D0@lemburg.com>  <199911152137.QAA28280@eric.cnri.reston.va.us> <010001bf300e$14741310$f29b12c2@secret.pythonware.com>
Message-ID: <3836f28c.4929177@post.demon.co.uk>

On Tue, 16 Nov 1999 09:39:20 +0100, you wrote:

>1) codes written according to the "data
>   consumer model", instead of the "stream"
>   model.
>
>        class myDecoder:
>            def __init__(self, target):
>                self.target = target
>                self.state = ...
>            def feed(self, data):
>                ... extract as much data as possible ...
>                self.target.feed(extracted data)
>            def close(self):
>                ... extract what's left ...
>                self.target.feed(additional data)
>                self.target.close()
>
Apart from feed() instead of write(), how is that different from a
Java-like Stream writer as Guido suggested?  He said:

>Andy's file translation example could then be written as follows:
>
># assuming variables input_file, input_encoding, output_file,
># output_encoding, and constant BUFFER_SIZE
>
>f = open(input_file, "rb")
>f1 = unicodec.codecs[input_encoding].stream_reader(f)
>g = open(output_file, "wb")
>g1 = unicodec.codecs[output_encoding].stream_writer(g)
>
>while 1:
>      buffer = f1.read(BUFFER_SIZE)
>      if not buffer:
>	 break
>      g1.write(buffer)
>
>g1.close()
>f1.close()
>
>Note that we could possibly make these the only API that a codec needs
>to provide; the string object <--> unicode object conversions can be
>done using this and the cStringIO module.  (On the other hand it seems
>a common case that would be quite useful.)

- Andy


From gstein@lyra.org  Wed Nov 17 02:03:21 1999
From: gstein@lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 18:03:21 -0800 (PST)
Subject: [Python-Dev] shared data
In-Reply-To: <1269351119-9152905@hypernet.com>
Message-ID: 

On Tue, 16 Nov 1999, Gordon McMillan wrote:
> Barry A. Warsaw writes:
> > One approach might be to support loading modules out of jar files
> > (or whatever) using Greg imputils.  We could put the bootstrap
> > .pyc files in this jar and teach Python to import from it first. 
> > Python installations could even craft their own modules.jar file
> > to include whatever modules they are willing to "hard code". 
> > This, with -S might make Python start up much faster, at the
> > small cost of some flexibility (which could be regained with a
> > c.l. switch or other mechanism to bypass modules.jar).
> 
> Couple hundred Windows users have been doing this for 
> months (http://starship.python.net/crew/gmcm/install.html). 
> The .pyz files are cross-platform, although the "embedding" 
> app would have to be redone for *nix, (and all the embedding 
> really does is keep Python from hunting all over your disk). 
> Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a 
> diskette with a little room left over.

I've got a patch from Jim Ahlstrom to provide a "standardized" library
file. I've got to review and fold that thing in (I'll post here when that
is done).

As Gordon states: yes, the startup time is considerably improved.

The DBM approach is interesting. That could definitely be used thru an
imputils Importer; it would be quite interesting to try that out.

(Note that the library-style approach would make updates even harder to
deal with, relative to what Sjoerd saw with the DBM approach; I would guess 
that the "right" approach is to rebuild the library from scratch and
atomically replace the thing (but that would bust people with open
references...))

Certainly something to look at.

Cheers,
-g

p.s. I also want to try mmap'ing a library and creating code objects that
use PyBufferObjects (rather than PyStringObjects) that refer to portions
of the mmap. Presuming the mmap is shared, there "should" be a large
reduction in heap usage. Question is that I don't know the proportion of
code bytes to other heap usage caused by loading a .pyc.

p.p.s. I also want to try the buffer approach for frozen code.

--
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Wed Nov 17 02:29:42 1999
From: gstein@lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 18:29:42 -0800 (PST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <14385.40842.709711.12141@weyr.cnri.reston.va.us>
Message-ID: 

On Tue, 16 Nov 1999, Fred L. Drake, Jr. wrote:
> M.-A. Lemburg writes:
>  > Wouldn't it be possible to have the read/write methods set up
>  > the state when called for the first time ?
> 
>   That slows things down; the constructor should handle initialization.
> Perhaps what gets registered should be:  encoding function, decoding
> function, stream encoder factory (can be a class), stream decoder
> factory (again, can be a class).  These can be encapsulated either
> before or after hitting the registry, and can be None.  The registry

I'm with Fred here; he beat me to the punch (and his email is better than 
what I'd write anyhow :-).

I'd like to see the API be *functions* rather than a particular class
specification. If the spec is going to say "do not alter/store state",
then a function makes much more sense than a method on an object.

Of course, bound method objects could be registered. This might occur if
you have a general JIS encode/decoder but need to instantiate it a little
differently for each JIS variant.
(Andy also mentioned something about "options" in JIS encoding/decoding)

> can provide default implementations from what is provided (stream
> handlers from the functions, or functions from the stream handlers) as 
> required.

Excellent idea...

"I'll provide the encode/decode functions, but I don't have a spiffy
algorithm for streaming -- please provide a stream wrapper for my
functions."

>   Ideally, I should be able to write a module with four well-known
> entry points and then provide the module object itself as the
> registration entry.  Or I could construct a new object that has the
> right interface and register that if it made more sense for the
> encoding.

Mark's idea about throwing these things into a package for on-demand
registrations is much better than a "register-beforehand" model. When the
module is loaded from the package, it calls a registration function to
insert its 4-tuple of registration data.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Wed Nov 17 02:40:07 1999
From: gstein@lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 18:40:07 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <00f701bf307d$20f0cb00$0501a8c0@bobcat>
Message-ID: 

On Wed, 17 Nov 1999, Mark Hammond wrote:
>...
> Would it be possible to define a "standard package" called
> "encodings", and when we need an encoding, we simply attempt to load a
> module from that package?  The key benefits I see are:
>...
> Is this worthy of consideration?

Absolutely!

You will need to provide a way for a module (in the "codec" package) to
state *beforehand* that it should be loaded for the X, Y, and Z encodings.
This might be in terms of little "info" files that get dropped into the
package. The __init__.py module scans the directory for the info files and
loads them to build an encoding => module-name mapping.
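
Something like this in the package's __init__.py, perhaps (rough
sketch; the ".info" format is made up here -- say, one encoding name
per line):

import os, string

table = {}      # encoding name -> module name

def _scan():
    dir = os.path.dirname(__file__)
    for name in os.listdir(dir):
        if name[-5:] != '.info':
            continue
        modname = name[:-5]
        for line in open(os.path.join(dir, name)).readlines():
            encoding = string.strip(line)
            if encoding:
                table[encoding] = modname

_scan()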

The alternative would be to have stub modules like:

iso-8859-1.py:

import unicodec

def encode_1(...)
  ...
def encode_2(...)
  ...
...

unicodec.register('iso-8859-1', encode_1, decode_1)
unicodec.register('iso-8859-2', encode_2, decode_2)
...


iso-8859-2.py:
import iso-8859-1


I believe that encoding names are legitimate file names, but they aren't
necessarily Python identifiers. That kind of bungs up "import
codec.iso-8859-1". The codec package would need to programmatically import
the modules. Clients should not be directly importing the modules, so I
don't see a difficulty here.
[ if we do decide to allow clients access to the modules, then maybe they
  have to arrive through a "helper" module that has a nice name, or the
  codec package provides a "module = codec.load('iso-8859-1')" idiom. ]

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From mhammond@skippinet.com.au  Wed Nov 17 02:57:48 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 17 Nov 1999 13:57:48 +1100
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: 
Message-ID: <010501bf30a7$88c00320$0501a8c0@bobcat>

> You will need to provide a way for a module (in the "codec"
> package) to
> state *beforehand* that it should be loaded for the X, Y, and
...

> The alternative would be to have stub modules like:

Actually, I was thinking even more radically - drop the codec registry
altogether, and use modules with "well-known" names (a slight
precedent, but Python isn't averse to well-known names in general)

eg:
iso-8859-1.py:

import unicodec
def encode(...):
  ...
def decode(...):
  ...

iso-8859-2.py:
from iso-8859-1 import *

The codec registry then is trivial, and effectively does not exist
(can't get much more trivial than something that doesn't exist :-):

def getencoder(encoding):
  # pass a fromlist so __import__ returns the codec submodule itself,
  # not the top-level "encodings" package
  mod = __import__("encodings." + encoding, {}, {}, ["encode"])
  return getattr(mod, "encode")


> I believe that encoding names are legitimate file names, but
> they aren't
> necessarily Python identifiers. That kind of bungs up "import
> codec.iso-8859-1".

Agreed - clients should never need to import them, and codecs that
wish to import other codecs could use "__import__"

Of course, I am not averse to the idea of a registry as well, with
the modules manually registering themselves - but it doesn't seem
to buy much, and the logic for getting a codec becomes more complex -
i.e., it needs to determine the module to import, then look in the
registry - if it needs to determine the module anyway, why not just
get it from the module and be done with it?

Mark.



From andy@robanal.demon.co.uk  Wed Nov 17 00:18:22 1999
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Wed, 17 Nov 1999 00:18:22 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <00f701bf307d$20f0cb00$0501a8c0@bobcat>
References: <00f701bf307d$20f0cb00$0501a8c0@bobcat>
Message-ID: <3837f379.5166829@post.demon.co.uk>

On Wed, 17 Nov 1999 08:54:15 +1100, you wrote:

>This is leading me to conclude that our "codec registry" should be the
>file system, and Python modules.
>
>Would it be possible to define a "standard package" called
>"encodings", and when we need an encoding, we simply attempt to load a
>module from that package?  The key benefits I see are:
[snip]
>Is this worthy of consideration?

Exactly what I am aiming for.  The real icing on the cake would be a
small state machine or some helper functions in C which made it
possible to write fast codecs in pure Python, but that can come a bit
later when we have examples up and running.   

- Andy




From andy@robanal.demon.co.uk  Wed Nov 17 00:08:01 1999
From: andy@robanal.demon.co.uk (Andy Robinson)
Date: Wed, 17 Nov 1999 00:08:01 GMT
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <000601bf2ff7$4d8a4c80$042d153f@tim>
References: <000601bf2ff7$4d8a4c80$042d153f@tim>
Message-ID: <3834f142.4599884@post.demon.co.uk>

On Tue, 16 Nov 1999 00:56:18 -0500, you wrote:

>[Andy Robinson]
>> ...
>> I presume no one is actually advocating dropping
>> ordinary Python strings, or the ability to do
>>    rawdata = open('myfile.txt', 'rb').read()
>> without any transformations?
>
>If anyone has advocated either, they've successfully hidden it from me.
>Anyone?

Well, I hear statements looking forward to when all string-handling is
done in Unicode internally.  This scares the hell out of me - it is
what VB does and that bit us badly on simple stream operations.  For
encoding work, you will always need raw strings, and often need
Unicode ones.

- Andy


From tim_one@email.msn.com  Wed Nov 17 07:33:06 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 02:33:06 -0500
Subject: [Python-Dev] Unicode proposal: %-formatting ?
In-Reply-To: <383134AA.4B49D178@lemburg.com>
Message-ID: <000001bf30cd$fd6be9c0$a42d153f@tim>

[MAL]
> ...
> This means a new PyUnicode_Format() implementation mapping
> Unicode format objects to Unicode objects.

It's a bitch, isn't it <0.5 wink>?  I hope they're paying you a lot for
this!

> ... hmm, there is a problem there: how should the PyUnicode_Format()
> API deal with '%s' when it sees a Unicode object as argument ?

Anything other than taking the Unicode characters as-is would be
incomprehensible.  I mean, it's a Unicode format string sucking up Unicode
strings -- what else could possibly make *sense*?

> E.g. what would you get in these cases:
>
> u = u"%s %s" % (u"abc", "abc")

That u"abc" gets substituted as-is seems screamingly necessary to me.

I'm more baffled about what "abc" should do.  I didn't understand the t#/s#
etc arguments, and how those do or don't relate to what str() does.  On the
face of it, the idea that a gazillion and one distinct encodings all get
lumped into "a string object" without remembering their nature makes about
as much sense as if Python were to treat all instances of all user-defined
classes as being of a single InstanceType type  -- except in the
latter case you at least get a __class__ attribute to find your way home
again.

As an ignorant user, I would hope that

    u"%s" % string

had enough sense to know what string's encoding is all on its own, and
promote it correctly to Unicode by magic.

> Perhaps we need a new marker for "insert Unicode object here".

%s means string, and at this level a Unicode object *is* "a string".  If
this isn't obvious, it's likely because we're too clever about what
non-Unicode string objects do in this context.




From andy@robanal.demon.co.uk  Wed Nov 17 07:53:53 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 16 Nov 1999 23:53:53 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
Message-ID: <19991117075353.16046.rocketmail@web606.mail.yahoo.com>

--- Mark Hammond  wrote:
> Actually, I was thinking even more radically - drop
> the codec registry
> all together, and use modules with "well-known"
> names  (a slight
> precedent, but Python isnt adverse to well-known
> names in general)
> 
> eg:
> iso-8859-1.py:
> 
> import unicodec
> def encode(...):
>   ...
> def decode(...):
>   ...
> 
> iso-8859-2.py:
> from iso-8859-1 import *
> 
This is the simplest if each codec really is likely to
be implemented in a separate module.  But just look at
the data!  All the iso-8859 encodings need identical
functionality, and just have a different mapping table
with 256 elements.  It would be trivial to implement
these in one module.  And the wide variety of Japanese
encodings (mostly corporate or historical variants of
the same character set) are again best treated from
one code base with a bunch of mapping tables and
routines to generate the variants - basically one can
store the deltas.

So the choice is between possibly having a lot of
almost-dummy modules, or having Python modules which
generate and register a logical family of encodings.  
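
Roughly what I have in mind (an untested sketch; the
table contents would come from the standard mapping
files, and unichr() is assumed to be the builtin from
the proposed Unicode type):

# one module, a whole family of single-byte encodings,
# each described by a 256-entry table of Unicode ordinals
tables = {
    'iso-8859-1': tuple(range(256)),  # latin-1 maps straight through
    # 'iso-8859-2': (...),            # 256 ordinals from the mapping file
}

def getdecoder(encoding):
    table = tables[encoding]
    def decode(s, table=table):
        # quadratic string building -- fine for a sketch
        u = u""
        for ch in s:
            u = u + unichr(table[ord(ch)])
        return u
    return decode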

I may have some time next week and will try to code up
a few so we can pound on something.

- Andy



=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From andy@robanal.demon.co.uk  Wed Nov 17 07:58:23 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 16 Nov 1999 23:58:23 -0800 (PST)
Subject: [Python-Dev] Unicode proposal: %-formatting ?
Message-ID: <19991117075823.6498.rocketmail@web602.mail.yahoo.com>


--- Tim Peters  wrote:
> I'm more baffled about what "abc" should do.  I
> didn't understand the t#/s#
> etc arguments, and how those do or don't relate to
> what str() does.  On the
> face of it, the idea that a gazillion and one
> distinct encodings all get
> lumped into "a string object" without remembering
> their nature makes about
> as much sense as if Python were to treat all
> instances of all user-defined
> classes as being of a single InstanceType type
>  -- except in the
> latter case you at least get a __class__ attribute
> to find your way home
> again.

Well said.  When the core stuff is done, I'm going to
implement a set of "TypedString" helper routines which
will remember what they are encoded in and won't let
you abuse them by concatenating or otherwise mixing
different encodings.  If you are consciously working
with multi-encoding data, this higher level of
abstraction is really useful.  But I reckon that can
be done in pure Python (just overload '%', '+' etc.
with some encoding checks).
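
Something along these lines (a pure Python sketch; only
'+' is shown, and all the names are made up):

class TypedString:
    # a string that knows its encoding and refuses to be
    # mixed with strings in other encodings
    def __init__(self, data, encoding):
        self.data = data
        self.encoding = encoding
    def __add__(self, other):
        if not isinstance(other, TypedString):
            raise TypeError, "can only add another TypedString"
        if other.encoding != self.encoding:
            raise ValueError, "mixing %s and %s data" % (
                  self.encoding, other.encoding)
        return TypedString(self.data + other.data, self.encoding)
    def __str__(self):
        return self.data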

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From mal@lemburg.com  Wed Nov 17 10:03:59 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Nov 1999 11:03:59 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000201bf30d3$cb2cb240$a42d153f@tim>
Message-ID: <38327D8F.7A5352E6@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > ...demo script...
> 
> It looks like
> 
>     r'\\u0000'
> 
> will get translated into a 2-character Unicode string.

Right...

> That's probably not
> good, if for no other reason than that Java would not do this (it would
> create the obvious 7-character Unicode string), and having something that
> looks like a Java escape that doesn't *work* like the Java escape will be
> confusing as heck for JPython users.  Keeping track of even-vs-odd number of
> backslashes can't be done with a regexp search, but is easy if the code is
> simple :
> ...Tim's version of the demo...

Guido and I have decided to turn \uXXXX into a standard
escape sequence with no further magic applied. \uXXXX will
only be expanded in u"" strings.

Here's the new scheme:

With the 'unicode-escape' encoding being defined as:

· all non-escape characters represent themselves as a Unicode ordinal
  (e.g. 'a' -> U+0061).

· all existing defined Python escape sequences are interpreted as
  Unicode ordinals; note that \xXXXX can represent all Unicode
  ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
  error to have fewer than 4 digits after \u.

Examples:

u'abc'          -> U+0061 U+0062 U+0063
u'\u1234'       -> U+1234
u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000A

Now how should we define ur"abc\u1234\n"  ... ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    44 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From tim_one@email.msn.com  Wed Nov 17 09:31:27 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 04:31:27 -0500
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: <000801bf30de$85bea500$a42d153f@tim>

[Guido]
> ...
> I'm hoping for several kind of responses to this email:
> ...
> - requests for checkin privileges, preferably with a specific issue
> or area of expertise for which the requestor will take responsibility.

I'm specifically requesting not to have checkin privileges.  So there.

I see two problems:

1. When patches go thru you, you at least eyeball them.  This catches bugs
and design errors early.

2. For a multi-platform app, few people have adequate resources for testing;
e.g., I can test under an obsolete version of Win95, and NT if I have to,
but that's it.  You may not actually do better testing than that, but having
patches go thru you allows me the comfort of believing you do .




From mal@lemburg.com  Wed Nov 17 10:11:05 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Nov 1999 11:11:05 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <00f701bf307d$20f0cb00$0501a8c0@bobcat>
Message-ID: <38327F39.AA381647@lemburg.com>

Mark Hammond wrote:
> 
> This is leading me to conclude that our "codec registry" should be the
> file system, and Python modules.
> 
> Would it be possible to define a "standard package" called
> "encodings", and when we need an encoding, we simply attempt to load a
> module from that package?  The key benefits I see are:
> 
> * No need to load modules simply to register a codec (which would make
> the number of open calls even higher, and the startup time even
> slower.)  This makes it truly demand-loading of the codecs, rather
> than explicit load-and-register.
> 
> * Making language specific distributions becomes simple - simply
> select a different set of modules from the "encodings" directory.  The
> Python source distribution has them all, but (say) the Windows binary
> installer selects only a few.  The Japanese binary installer for
> Windows installs a few more.
> 
> * Installing new codecs becomes trivial - no need to hack site.py
> etc - simply copy the new "codec module" to the encodings directory
> and you are done.
> 
> * No serious problem for GMcM's installer nor for freeze
> 
> We would probably need to assume that certain codecs exist for _all_
> platforms and languages - but this is no different to assuming that
> "exceptions.py" also exists for all platforms.
> 
> Is this worthy of consideration?

Why not... using the new registry scheme I proposed in the
thread "Codecs and StreamCodecs" you could implement this
via factory_functions and lazy imports (with the encoding
name folded to make up a proper Python identifier, e.g.
hyphens get converted to '' and spaces to '_').
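
The folding could be as simple as (sketch):

import string

def fold_encoding_name(name):
    # 'ISO-8859-1' -> 'iso88591', 'Shift JIS' -> 'shift_jis'
    name = string.lower(name)
    name = string.join(string.split(name, '-'), '')
    name = string.join(string.split(name, ' '), '_')
    return name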

I'd suggest grouping encodings:

[encodings]
	[iso]
		[iso88591]
		[iso88592]
	[jis]
		...
	[cyrillic]
		...
	[misc]

The unicodec registry could then query encodings.get(encoding,action)
and the package would take care of the rest.

Note that the "walk-me-up-scotty" import patch would probably
be nice in this situation too, e.g. to reach the modules in
[misc] or in higher levels such the ones in [iso] from
[iso88591].

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    44 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mal@lemburg.com  Wed Nov 17 09:29:34 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Nov 1999 10:29:34 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>  <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com>
Message-ID: <3832757E.B9503606@lemburg.com>

Fredrik Lundh wrote:
> 
> --------------------------------------------------------------------
> A PIL-like Unicode Codec Proposal
> --------------------------------------------------------------------
> 
> In the PIL model, the codecs are called with a piece of data, and
> return the result to the caller.  The codecs maintain internal state
> when needed.
> 
> class decoder:
> 
>     def decode(self, s, offset=0):
>         # decode as much data as we possibly can from the
>         # given string.  if there's not enough data in the
>         # input string to form a full character, return
>         # what we've got this far (this might be an empty
>         # string).
> 
>     def flush(self):
>         # flush the decoding buffers.  this should usually
>         # return None, unless knowing that the input stream
>         # has ended means that the state can be interpreted
>         # in a meaningful way.  however, if the state
>         # indicates that the last character was not
>         # finished, this method should raise a UnicodeError
>         # exception.

Could you explain the reason for having a .flush() method
and what it should return?

Note that the .decode method is not so much different
from my Codec.decode method except that it uses a single
offset where my version uses a slice (the offset is probably
the better variant, because it avoids data truncation).
 
> class encoder:
> 
>     def encode(self, u, offset=0, buffersize=0):
>         # encode data from the given offset in the input
>         # unicode string into a buffer of the given size
>         # (or slightly larger, if required to proceed).
>         # if the buffer size is 0, the encoder is free
>         # to pick a suitable size itself (if at all
>         # possible, it should make it large enough to
>         # encode the entire input string).  returns a
>         # 2-tuple containing the encoded data, and the
>         # number of characters consumed by this call.

Ditto.
 
>     def flush(self):
>         # flush the encoding buffers.  returns an ordinary
>         # string (which may be empty), or None.
> 
> Note that a codec instance can be used for a single string; the codec
> registry should hold codec factories, not codec instances.  In
> addition, you may use a single type or class to implement both
> interfaces at once.

Perhaps I'm missing something, but how would you define
stream codecs using this interface ? 

> Implementing stream codecs is left as an exercise (see the zlib
> material in the eff-bot guide for a decoder example).

...?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    44 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Wed Nov 17 09:55:05 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Nov 1999 10:55:05 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>
 <199911161620.LAA02643@eric.cnri.reston.va.us>
 <38319A2A.4385D2E7@lemburg.com> <14385.40842.709711.12141@weyr.cnri.reston.va.us>
Message-ID: <38327B79.2415786B@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > Wouldn't it be possible to have the read/write methods set up
>  > the state when called for the first time ?
> 
>   That slows the down; the constructor should handle initialization.
> Perhaps what gets registered should be:  encoding function, decoding
> function, stream encoder factory (can be a class), stream decoder
> factory (again, can be a class).

Guido proposed the factory approach too, though not separated
into these 4 APIs (note that your proposal looks very much like
what I had in the early version of my proposal).

Anyway, I think that factory functions are the way to go,
because they offer more flexibility w/r to reusing already
instantiated codecs, importing modules on-the-fly as was
suggested in another thread (thereby making codec module
import lazy) or mapping encoder and decoder requests all
to one class.

So here's a new registry approach:

unicodec.register(encoding,factory_function,action)

with 
	encoding - name of the supported encoding, e.g. Shift_JIS
	factory_function - a function that returns an object
                   or function ready to be used for action
	action - a string stating the supported action:
			'encode'
			'decode'
			'stream write'
			'stream read'

The factory_function API depends on the implementation of
the codec. The returned object's interface depends on the value of action:

Codecs:
-------

obj = factory_function_for_<encoding>(errors='strict')

'encode': obj(u,slice=None) -> Python string
'decode': obj(s,offset=0,chunksize=0) -> (Unicode object, bytes consumed)

factory_functions are free to return simple function objects
for stateless encodings.

StreamCodecs:
-------------

obj = factory_function_for_<encoding>(stream,errors='strict')

obj should provide access to all methods defined for the stream
object, overriding these:

'stream write': obj.write(u,slice=None) -> bytes written to stream
		obj.flush() -> ???
'stream read':  obj.read(chunksize=0) -> (Unicode object, bytes read)
		obj.flush() -> ???

errors is defined like in my Codec spec. The codecs are
expected to use this argument to handle error conditions.

I'm not sure what Fredrik intended with the .flush() methods,
so the definition is still open. I would expect it to do some
finalization of state.

Perhaps we need another set of actions for the .feed()/.close()
approach...

As in earlier version of the proposal:
The registry should provide default implementations for
missing action factory_functions using the other registered
functions, e.g. 'stream write' can be emulated using
'encode' and 'stream read' using 'decode'. The same probably
holds for the feed approach.
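
E.g. the 'stream write' emulation could look something like this
(sketch only; encode_obj is whatever the registered 'encode'
factory_function returned):

class EmulatedStreamWriter:
    def __init__(self, stream, encode_obj):
        self.stream = stream
        self.encode = encode_obj
    def write(self, u, slice=None):
        # encode the Unicode object and pass the bytes on to the stream
        data = self.encode(u, slice)
        self.stream.write(data)
        return len(data)
    def flush(self):
        self.stream.flush()
    def __getattr__(self, name):
        # delegate all other stream methods to the underlying stream
        return getattr(self.stream, name)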

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    44 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From tim_one@email.msn.com  Wed Nov 17 08:14:38 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 03:14:38 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <3831350B.8F69CB6D@lemburg.com>
Message-ID: <000201bf30d3$cb2cb240$a42d153f@tim>

[MAL]
> ...
> Here is a sample implementation of what I had in mind:
>
> """ Demo for 'unicode-escape' encoding.
> """
> import struct,string,re
>
> pack_format = '>H'
>
> def convert_string(s):
>
>     l = map(None,s)
>     for i in range(len(l)):
> 	l[i] = struct.pack(pack_format,ord(l[i]))
>     return l
>
> u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})')
>
> def unicode_unescape(s):
>
>     l = []
>     start = 0
>     while start < len(s):
> 	m = u_escape.search(s,start)
> 	if not m:
> 	    l[len(l):] = convert_string(s[start:])
> 	    break
> 	m_start,m_end = m.span()
> 	if m_start > start:
> 	    l[len(l):] = convert_string(s[start:m_start])
> 	hexcode = m.group(1)
> 	#print hexcode,start,m_start
> 	if len(hexcode) != 4:
> 	    raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode
> 	ordinal = string.atoi(hexcode,16)
> 	l.append(struct.pack(pack_format,ordinal))
> 	start = m_end
>     #print l
>     return string.join(l,'')
>
> def hexstr(s,sep=''):
>
>     return string.join(map(lambda x,hex=hex,ord=ord: '%02x' %
> ord(x),s),sep)

It looks like

    r'\\u0000'

will get translated into a 2-character Unicode string.  That's probably not
good, if for no other reason than that Java would not do this (it would
create the obvious 7-character Unicode string), and having something that
looks like a Java escape that doesn't *work* like the Java escape will be
confusing as heck for JPython users.  Keeping track of even-vs-odd number of
backslashes can't be done with a regexp search, but is easy if the code is
simple :

def unicode_unescape(s):
    from string import atoi
    import array
    i, n = 0, len(s)
    result = array.array('H') # unsigned short, native order
    while i < n:
        ch = s[i]
        i = i+1
        if ch != "\\":
            result.append(ord(ch))
            continue
        if i == n:
            raise ValueError("string ends with lone backslash")
        ch = s[i]
        i = i+1
        if ch != "u":
            result.append(ord("\\"))
            result.append(ord(ch))
            continue
        hexchars = s[i:i+4]
        if len(hexchars) != 4:
            raise ValueError("\\u escape at end not followed by "
                             "at least 4 characters")
        i = i+4
        for ch in hexchars:
            if ch not in "01234567890abcdefABCDEF":
                raise ValueError("\\u" + hexchars + " contains "
                                 "non-hex characters")
        result.append(atoi(hexchars, 16))

    # print result
    return result.tostring()




From tim_one@email.msn.com  Wed Nov 17 08:47:48 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 03:47:48 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: <383156DF.2209053F@lemburg.com>
Message-ID: <000401bf30d8$6cf30bc0$a42d153f@tim>

[MAL]
> FYI, the next version of the proposal ...
> File objects opened in text mode will use "t#" and binary ones use "s#".

Am I the only one who sees magical distinctions between text and binary mode
as a Really Bad Idea?  I wouldn't have guessed the Unix natives here would
quietly acquiesce to importing a bit of Windows madness .




From tim_one@email.msn.com  Wed Nov 17 08:47:46 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 03:47:46 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <383140F3.EDDB307A@lemburg.com>
Message-ID: <000301bf30d8$6bbd4ae0$a42d153f@tim>

[Jack Jansen]
> I would suggest adding the Dos, Windows and Macintosh standard
> 8-bit charsets (their equivalents of latin-1) too, as documents
> in these encoding are pretty ubiquitous. But maybe these should
> only be added on the respective platforms.

[MAL]
> Good idea. What code pages would that be ?

I'm not clear on what's being suggested; e.g., Windows supports *many*
different "code pages".  CP 1252 is default in the U.S., and is an extension
of Latin-1.  See e.g.

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

which appears to be up-to-date (has 0x80 as the euro symbol, Unicode
U+20AC -- although whether your version of U.S. Windows actually has this
depends on whether you installed the service pack that added it!).

See

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP850.TXT

for the closest DOS got.
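
Turning one of those mapping files into a 256-entry decode table is
straightforward; here's a sketch from memory (double-check the file
format before trusting it!):

import string

def read_mapping(filename):
    # data lines look like "0x41<tab>0x0041<tab># LATIN CAPITAL LETTER A";
    # unmapped bytes have no second hex field and are left as None
    table = [None] * 256
    for line in open(filename).readlines():
        if not line or line[0] == '#':
            continue
        fields = string.split(line)
        if len(fields) < 2 or fields[1][:2] != '0x':
            continue            # undefined code point
        table[string.atoi(fields[0], 16)] = string.atoi(fields[1], 16)
    return table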




From tim_one@email.msn.com  Wed Nov 17 09:05:21 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 04:05:21 -0500
Subject: Weak refs (was [Python-Dev] just say no...)
In-Reply-To: <14385.33486.855802.187739@weyr.cnri.reston.va.us>
Message-ID: <000601bf30da$e069d820$a42d153f@tim>

[Fred L. Drake, Jr., pines for some flavor of weak refs; MAL reminds us
 of his work; & back to Fred]

>   Yes, but still not in the core.  So we have two general examples
> (vrefs and mxProxy) and there's WeakDict (or something like that).  I
> think there really needs to be a core facility for this.

This kind of thing certainly belongs in the core (for efficiency and smooth
integration) -- if it belongs in the language at all.  This was discussed at
length here some months ago; that's what prompted MAL to "do something"
about it.  Guido hasn't shown visible interest, and nobody has been willing
to fight him to the death over it.  So it languishes.  Buy him lunch
tomorrow and get him excited .




From tim_one@email.msn.com  Wed Nov 17 09:10:24 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 04:10:24 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <1269351119-9152905@hypernet.com>
Message-ID: <000701bf30db$94d4ac40$a42d153f@tim>

[Gordon McMillan]
> ...
> Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a
> diskette with a little room left over.

That's truly remarkable (he says while waiting for the Inbox Repair Tool to
finish repairing his 50Mb Outlook mail file ...)!

> but-since-its-WIndows-it-must-be-tainted-ly y'rs

Indeed -- if it runs on Windows, it's a worthless piece o' crap .




From fredrik@pythonware.com  Wed Nov 17 11:00:10 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Wed, 17 Nov 1999 12:00:10 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>  <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com>
Message-ID: <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com>

M.-A. Lemburg  wrote:
> >     def flush(self):
> >         # flush the decoding buffers.  this should usually
> >         # return None, unless knowing that the input stream
> >         # has ended means that the state can be interpreted
> >         # in a meaningful way.  however, if the state
> >         # indicates that the last character was not
> >         # finished, this method should raise a UnicodeError
> >         # exception.
>
> Could you explain for reason for having a .flush() method
> and what it should return.

in most cases, it should either return None, or
raise a UnicodeError exception:

    >>> u = unicode("å i åa ä e ö", "iso-latin-1")
    >>> # yes, that's a valid Swedish sentence ;-)
    >>> s = u.encode("utf-8")
    >>> d = decoder("utf-8")
    >>> d.decode(s[:-1])
    "ĺ i ĺa ä e "
    >>> d.flush()
    UnicodeError: last character not complete

on the other hand, there are situations where it
might actually return a string.  consider a "HTML
entity decoder" which uses the following pattern
to match a character entity: "&\w+;?" (note that
the trailing semicolon is optional).

    >>> u = unicode("å i åa ä e ö", "iso-latin-1")
    >>> s = u.encode("html-entities")
    >>> d = decoder("html-entities")
    >>> d.decode(s[:-1])
    "ĺ i ĺa ä e "
    >>> d.flush()
    "ö"

> Perhaps I'm missing something, but how would you define
> stream codecs using this interface ?

input: read chunks of data, decode, and
keep extra data in a local buffer.

output: encode data into suitable chunks,
and write to the output stream (that's why
there's a buffersize argument to encode --
if someone writes a 10mb unicode string to
an encoded stream, python shouldn't allocate
an extra 10-30 megabytes just to be able to
encode the darn thing...)
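
in other words, something like this on the output
side (rough sketch, using the encoder interface
from my proposal):

    def write_encoded(stream, encoder, u, buffersize=8192):
        # push a (possibly huge) unicode string out in pieces
        offset = 0
        while offset < len(u):
            data, consumed = encoder.encode(u, offset, buffersize)
            stream.write(data)
            offset = offset + consumed
        tail = encoder.flush()
        if tail:
            stream.write(tail)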

> > Implementing stream codecs is left as an exercise (see the zlib
> > material in the eff-bot guide for a decoder example).

everybody should have a copy of the eff-bot guide ;-)

(but alright, I plan to post a complete utf-8 implementation
in a not too distant future).





From gstein@lyra.org  Wed Nov 17 10:57:36 1999
From: gstein@lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 02:57:36 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <38327F39.AA381647@lemburg.com>
Message-ID: 

On Wed, 17 Nov 1999, M.-A. Lemburg wrote:
>...
> I'd suggest grouping encodings:
> 
> [encodings]
> 	[iso}
> 		[iso88591]
> 		[iso88592]
> 	[jis]
> 		...
> 	[cyrillic]
> 		...
> 	[misc]

WHY?!?!

This is taking a simple solution and making it complicated. I see no
benefit to creating yet-another-level-of-hierarchy. Why should they be
grouped?

Leave the modules just under "encodings" and be done with it.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Wed Nov 17 11:14:01 1999
From: gstein@lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 03:14:01 -0800 (PST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <38327B79.2415786B@lemburg.com>
Message-ID: 

On Wed, 17 Nov 1999, M.-A. Lemburg wrote:
>...
> Anyway, I think that factory functions are the way to go,
> because they offer more flexibility w/r to reusing already
> instantiated codecs, importing modules on-the-fly as was
> suggested in another thread (thereby making codec module
> import lazy) or mapping encoder and decoder requests all
> to one class.

Why a factory? I've got a simple encode() function. I don't need a
factory. "flexibility" at the cost of complexity (IMO).

> So here's a new registry approach:
> 
> unicodec.register(encoding,factory_function,action)
> 
> with 
> 	encoding - name of the supported encoding, e.g. Shift_JIS
> 	factory_function - a function that returns an object
>                    or function ready to be used for action
> 	action - a string stating the supported action:
> 			'encode'
> 			'decode'
> 			'stream write'
> 			'stream read'

This action thing is subject to error. *if* you're wanting to go this
route, then have:

unicodec.register_encode(...)
unicodec.register_decode(...)
unicodec.register_stream_write(...)
unicodec.register_stream_read(...)

They are equivalent. Guido has also told me in the past that he dislikes
parameters that alter semantics -- preferring different functions instead.
(this is why there are a good number of PyBufferObject interfaces; I had
fewer to start with)

This suggested approach is also quite a bit more wordy/annoying than
Fred's alternative:

unicodec.register('iso-8859-1', encoder, decoder, None, None)

And don't say "future compatibility allows us to add new actions." Well,
those same future changes can add new registration functions or additional
parameters to the single register() function.

Not that I'm advocating it, but register() could also take a single
parameter: if a class, then instantiate it and call methods for each
action; if an instance, then just call methods for each action.

[ and the third/original variety: a function object as the first param is
  the actual hook, and params 2 thru 4 (each are optional, or just the
  stream funcs?) are the other hook functions ]

> The factory_function API depends on the implementation of
> the codec. The returned object's interface depends on the value of action:
> 
> Codecs:
> -------
> 
> obj = factory_function_for_(errors='strict')

Where does this "errors" value come from? How does a user alter that
value? Without an ability to change this, I see no reason for a factory.
[ and no: don't tell me it is a thread-state value :-) ]

On the other hand: presuming the "errors" thing is valid, *then* I see a
need for a factory.

Truly... I dislike factories. IMO, they just add code/complexity in many
cases where the functionality isn't needed. But that's just me :-)

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From andy@robanal.demon.co.uk  Wed Nov 17 11:17:00 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 17 Nov 1999 03:17:00 -0800 (PST)
Subject: [Python-Dev] Rosette i18n API
Message-ID: <19991117111700.8831.rocketmail@web603.mail.yahoo.com>

There is a very capable C++ library at

http://rosette.basistech.com/

It is well worth looking at the things this API
actually lets you do for ideas on patterns.

- Andy

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com


From gstein@lyra.org  Wed Nov 17 11:21:18 1999
From: gstein@lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 03:21:18 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <000401bf30d8$6cf30bc0$a42d153f@tim>
Message-ID: 

On Wed, 17 Nov 1999, Tim Peters wrote:
> [MAL]
> > FYI, the next version of the proposal ...
> > File objects opened in text mode will use "t#" and binary ones use "s#".
> 
> Am I the only one who sees magical distinctions between text and binary mode
> as a Really Bad Idea?  I wouldn't have guessed the Unix natives here would
> quietly acquiesce to importing a bit of Windows madness .

It's a seductive idea... yes, it feels wrong, but then... it seems kind of
right, too...

:-)

Yes. It is a mode. Is it bad? Not sure. You've already told the system
that you want to treat the file differently. Much like you're treating it
differently when you specify 'r' vs. 'w'.

The real annoying thing would be to assume that opening a file as 'r'
means that I *meant* text mode and to start using "t#". In actuality, I
typically open files that way since I do most of my coding on Linux. If
I now have to pay attention to things and open it as 'rb', then I'll be
pissed.

And the change in behavior and bugs that interpreting 'r' as text would
introduce? Ack!

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From fredrik@pythonware.com  Wed Nov 17 11:36:32 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Wed, 17 Nov 1999 12:36:32 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: 
Message-ID: <001b01bf30ef$ffb08a20$f29b12c2@secret.pythonware.com>

Greg Stein  wrote:
> Why a factory? I've got a simple encode() function. I don't need a
> factory. "flexibility" at the cost of complexity (IMO).

so where do you put the state?

how do you reset the state between
strings?

how do you handle incremental
decoding/encoding?

etc.

(I suggest taking another look at PIL's codec
design.  it solves all these problems with a
minimum of code, and it works -- people
have been hammering on PIL for years...)





From gstein@lyra.org  Wed Nov 17 11:34:30 1999
From: gstein@lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 03:34:30 -0800 (PST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <001b01bf30ef$ffb08a20$f29b12c2@secret.pythonware.com>
Message-ID: 

On Wed, 17 Nov 1999, Fredrik Lundh wrote:
> Greg Stein  wrote:
> > Why a factory? I've got a simple encode() function. I don't need a
> > factory. "flexibility" at the cost of complexity (IMO).
> 
> so where do you put the state?

encode() is not supposed to retain state. It is supposed to do a complete
translation. It is not a stream thingy, which may have received partial
characters.

> how do you reset the state between
> strings?

There is none :-)

> how do you handle incremental
> decoding/encoding?

Streams.
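
A toy sketch of the distinction (the names are invented, not part of any
proposed API): the one-shot function keeps no state, while the stream
reader carries a buffer for partial input between calls.

    def toy_encode(s):
        # one-shot translation: no state survives between calls
        return s.upper()

    class ToyStreamReader:
        # incremental counterpart: input that cannot be handled yet
        # stays buffered on the instance
        def __init__(self, stream):
            self.stream = stream
            self.buffer = ''
        def read(self, size=1024):
            data = self.buffer + self.stream.read(size)
            # pretend the final character may be incomplete; a real codec
            # would flush it once the input really ends
            complete, self.buffer = data[:-1], data[-1:]
            return toy_encode(complete)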

-g

--
Greg Stein, http://www.lyra.org/



From fredrik@pythonware.com  Wed Nov 17 11:46:01 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Wed, 17 Nov 1999 12:46:01 +0100
Subject: [Python-Dev] Python 1.6 status
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: <004c01bf30f1$537102b0$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> - suggestions for new issues that maybe ought to be settled in 1.6

three things: imputil, imputil, imputil





From fredrik@pythonware.com  Wed Nov 17 11:51:33 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Wed, 17 Nov 1999 12:51:33 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: 
Message-ID: <006201bf30f2$194626f0$f29b12c2@secret.pythonware.com>

Greg Stein  wrote:
> > so where do you put the state?
>
> encode() is not supposed to retain state. It is supposed to do a complete
> translation. It is not a stream thingy, which may have received partial
> characters.
>
> > how do you handle incremental
> > decoding/encoding?
> 
> Streams.

hmm.  why have two different mechanisms when
you can do the same thing with one?





From gstein@lyra.org  Wed Nov 17 13:01:47 1999
From: gstein@lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 05:01:47 -0800 (PST)
Subject: [Python-Dev] Apache process (was: Python 1.6 status)
In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: 

On Tue, 16 Nov 1999, Guido van Rossum wrote:
>...
> Greg, I understand you have checkin privileges for Apache.  What is
> the procedure there for handing out those privileges?  What is the
> procedure for using them?  (E.g. if you made a bogus change to part of
> Apache you're not supposed to work on, what happens?)

Somebody proposes that a person is added to the list of people with
checkin privileges. If nobody else in the group vetoes that, then they're
in (their system doesn't require continual participation by each member,
so it can only operate at a veto level, rather than a unanimous assent).
It is basically determined on the basis of merit -- has the person been
active (on the Apache developer's mailing list) and has the person
contributed something significant? Further, by providing commit access,
will they further the goals of Apache? And, of course, does their
temperament seem to fit in with the other group members?

I can make any change that I'd like. However, there are about 20 other
people who can easily revert or alter my changes if they're bogus.
There are no programmatic restrictions.... You could say it is based on
mutual respect and a social contract of behavior. Large changes should be
discussed before committing to CVS. Bug fixes, doc enhancements, minor
functional improvements, etc, all follow a commit-then-review process. I
just check the thing in. Others see the diff (emailed to the checkins
mailing list (this is different from Python-checkins which only says what
files are changed, rather than providing the diff)) and can comment on the
change, make their own changes, etc.

To be concrete: I added the Expat code that now appears in Apache 1.3.9.
Before doing so, I queried the group. There were some issues that I dealt
with before finally committing Expat to the CVS repository. On another
occasion, I added a new API to Apache; again, I proposed it first, got an
"all OK" and committed it. I've done a couple bug fixes which I just
checked in.
[ "all OK" means three +1 votes and no vetoes. everybody has veto
  ability (but the responsibility to explain why and to remove their veto 
  when their concerns are addressed). ]

On many occasions, I've reviewed the diffs that were posted to the
checkins list, and made comments back to the author. I've caught a few
problems this way.

For Apache 2.0, even large changes are commit-then-review at this point.
At some point, it will switch over to review-then-commit and the project
will start moving towards stabilization/release. (bug fixes and stuff will
always remain commit-then-review)

I'll note that the process works very well given that diffs are emailed. I
doubt that it would be effective if people had to fetch CVS diffs
themselves.

Your note also implies "areas of ownership". This doesn't really exist
within Apache. There aren't even "primary authors" or things like that. I
have the ability/rights to change any portions: from the low-level
networking, to the documentation, to the server-side include processing.
Of course, if I'm going to make a big change, then I'll be posting a patch
for review first, and whoever has worked in that area in the past
may/will/should comment.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/



From guido@CNRI.Reston.VA.US  Wed Nov 17 13:32:05 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:32:05 -0500
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: Your message of "Wed, 17 Nov 1999 04:31:27 EST."
 <000801bf30de$85bea500$a42d153f@tim>
References: <000801bf30de$85bea500$a42d153f@tim>
Message-ID: <199911171332.IAA03266@kaluha.cnri.reston.va.us>

> I'm specifically requesting not to have checkin privileges.  So there.

I will force nobody to use checkin privileges.  However I see that
for some contributors, checkin privileges will save me and them time.

> I see two problems:
> 
> 1. When patches go thru you, you at least eyeball them.  This catches bugs
> and design errors early.

I will still eyeball them -- only after the fact.  Since checkins are
pretty public, being slapped on the wrist for a bad checkin is a
pretty big embarrassment, so few contributors will check in buggy code
more than once.  Moreover, there will be more eyeballs.

> 2. For a multi-platform app, few people have adequate resources for testing;
> e.g., I can test under an obsolete version of Win95, and NT if I have to,
> but that's it.  You may not actually do better testing than that, but having
> patches go thru you allows me the comfort of believing you do .

I expect that the same mechanisms will apply.  I have access to
Solaris, Linux and Windows (NT + 98) but it's actually a lot easier to
check portability after things have been checked in.  And again, there
will be more testers.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@CNRI.Reston.VA.US  Wed Nov 17 13:34:23 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:34:23 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: Your message of "Tue, 16 Nov 1999 23:53:53 PST."
 <19991117075353.16046.rocketmail@web606.mail.yahoo.com>
References: <19991117075353.16046.rocketmail@web606.mail.yahoo.com>
Message-ID: <199911171334.IAA03374@kaluha.cnri.reston.va.us>

> This is the simplest if each codec really is likely to
> be implemented in a separate module.  But just look at
> the data!  All the iso-8859 encodings need identical
> functionality, and just have a different mapping table
> with 256 elements.  It would be trivial to implement
> these in one module.  And the wide variety of Japanese
> encodings (mostly corporate or historical variants of
> the same character set) are again best treated from
> one code base with a bunch of mapping tables and
> routines to generate the variants - basically one can
> store the deltas.
> 
> So the choice is between possibly having a lot of
> almost-dummy modules, or having Python modules which
> generate and register a logical family of encodings.  
> 
> I may have some time next week and will try to code up
> a few so we can pound on something.

I see no problem with having a lot of near-dummy modules if it
simplifies the architecture.  You can still do code sharing.  Files
are cheap; APIs are expensive.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@CNRI.Reston.VA.US  Wed Nov 17 13:38:35 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:38:35 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: Your message of "Wed, 17 Nov 1999 02:57:36 PST."
 
References: 
Message-ID: <199911171338.IAA03511@kaluha.cnri.reston.va.us>

> This is taking a simple solution and making it complicated. I see no
> benefit to creating yet-another-level-of-hierarchy. Why should they be
> grouped?
> 
> Leave the modules just under "encodings" and be done with it.

Agreed.  Tim Peters once remarked that Python likes shallow encodings
(or perhaps that *I* like them :-).  This is one such case where I
would strongly urge for the simplicity of a shallow hierarchy.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@CNRI.Reston.VA.US  Wed Nov 17 13:43:44 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:43:44 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Wed, 17 Nov 1999 03:14:01 PST."
 
References: 
Message-ID: <199911171343.IAA03636@kaluha.cnri.reston.va.us>

> Why a factory? I've got a simple encode() function. I don't need a
> factory. "flexibility" at the cost of complexity (IMO).

Unless there are certain cases where factories are useful.  But let's
read on...

> > 	action - a string stating the supported action:
> > 			'encode'
> > 			'decode'
> > 			'stream write'
> > 			'stream read'
> 
> This action thing is subject to error. *if* you're wanting to go this
> route, then have:
> 
> unicodec.register_encode(...)
> unicodec.register_decode(...)
> unicodec.register_stream_write(...)
> unicodec.register_stream_read(...)
> 
> They are equivalent. Guido has also told me in the past that he dislikes
> parameters that alter semantics -- preferring different functions instead.

Yes, indeed!  (But weren't we going to do away with the whole registry
idea in favor of an encodings package?)

> Not that I'm advocating it, but register() could also take a single
> parameter: if a class, then instantiate it and call methods for each
> action; if an instance, then just call methods for each action.

Nah, that's bad -- a class is just a factory, and once you are
allowing classes it's really good to also allow factory functions.

> [ and the third/original variety: a function object as the first param is
>   the actual hook, and params 2 thru 4 (each are optional, or just the
>   stream funcs?) are the other hook functions ]

Fine too.  They should all be optional.

> > obj = factory_function_for_(errors='strict')
> 
> Where does this "errors" value come from? How does a user alter that
> value? Without an ability to change this, I see no reason for a factory.
> [ and no: don't tell me it is a thread-state value :-) ]
> 
> On the other hand: presuming the "errors" thing is valid, *then* I see a
> need for a factory.

The idea is that various places that take an encoding name can also
take a codec instance.  So the user can call the factory function /
class constructor.

> Truly... I dislike factories. IMO, they just add code/complexity in many
> cases where the functionality isn't needed. But that's just me :-)

Get over it...  In a sense, every Python class is a factory for its
own instances!  I think you must be confusing Python with Java or
C++. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@CNRI.Reston.VA.US  Wed Nov 17 13:56:56 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:56:56 -0500
Subject: [Python-Dev] Apache process (was: Python 1.6 status)
In-Reply-To: Your message of "Wed, 17 Nov 1999 05:01:47 PST."
 
References: 
Message-ID: <199911171356.IAA04005@kaluha.cnri.reston.va.us>

> Somebody proposes that a person is added to the list of people with
> checkin privileges. If nobody else in the group vetoes that, then they're
> in (their system doesn't require continual participation by each member,
> so it can only operate at a veto level, rather than a unanimous assent).
> It is basically determined on the basis of merit -- has the person been
> active (on the Apache developer's mailing list) and has the person
> contributed something significant? Further, by providing commit access,
> will they further the goals of Apache? And, of course, does their
> temperament seem to fit in with the other group members?

This makes sense, but I have one concern: if somebody who isn't liked
very much (say a capable hacker who is a real troublemaker) asks for
privileges, would people veto this?  I'd be reluctant to go on record
as veto'ing a particular person.  (E.g. there are a few troublemakers
in c.l.py, and I would never want them to join python-dev let alone
give them commit privileges, but I'm not sure if I would want to
discuss this on a publicly archived mailing list -- or even on a
privately archived mailing list, given that the number of members
might be in the hundreds.)

[...stuff I like...]

> I'll note that the process works very well given that diffs are emailed. I
> doubt that it would be effective if people had to fetch CVS diffs
> themselves.

That's a great idea; I'll see if we can do that to our checkin email,
regardless of whether we hand out commit privileges.

> Your note also implies "areas of ownership". This doesn't really exist
> within Apache. There aren't even "primary authors" or things like that. I
> have the ability/rights to change any portions: from the low-level
> networking, to the documentation, to the server-side include processing.

But that's Apache, which is explicitly run as a collective.  In
Python, I definitely want to have ownership of certain sections of the
code.  But I agree that this doesn't need to be formalized by access
control lists; the social process you describe sounds like it will
work just fine.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From fdrake@acm.org  Wed Nov 17 14:44:25 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Wed, 17 Nov 1999 09:44:25 -0500 (EST)
Subject: Weak refs (was [Python-Dev] just say no...)
In-Reply-To: <000601bf30da$e069d820$a42d153f@tim>
References: <14385.33486.855802.187739@weyr.cnri.reston.va.us>
 <000601bf30da$e069d820$a42d153f@tim>
Message-ID: <14386.48969.630893.119344@weyr.cnri.reston.va.us>

Tim Peters writes:
 > about it.  Guido hasn't shown visible interest, and nobody has been willing
 > to fight him to the death over it.  So it languishes.  Buy him lunch
 > tomorrow and get him excited .

  Guido has asked me to pursue this topic, so I'll be checking out
available implementations and seeing if any are adoptable or if
something different is needed to be fully general and
well-integrated.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives


From tim_one@email.msn.com  Thu Nov 18 03:21:16 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:21:16 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <38327D8F.7A5352E6@lemburg.com>
Message-ID: <000101bf3173$f9805340$c0a0143f@tim>

[MAL]
> Guido and I have decided to turn \uXXXX into a standard
> escape sequence with no further magic applied. \uXXXX will
> only be expanded in u"" strings.

Does that exclude ur"" strings?  Not arguing either way, just don't know
what all this means.

> Here's the new scheme:
>
> With the 'unicode-escape' encoding being defined as:
>
> · all non-escape characters represent themselves as a Unicode ordinal
>   (e.g. 'a' -> U+0061).

Same as before (scream if that's wrong).

> · all existing defined Python escape sequences are interpreted as
>   Unicode ordinals;

Same as before (ditto).

> note that \xXXXX can represent all Unicode ordinals,

This means that the definition of \xXXXX has changed, then -- as you pointed
out just yesterday , \xABCDq currently acts like \xCDq.  Does the new
\x definition apply only in u"" strings, or in "" strings too?  What is the
new \x definition?

> and \OOO (octal) can represent Unicode ordinals up to U+01FF.

Same as before (ditto).

> · a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
>   error to have fewer than 4 digits after \u.

Same as before (ditto).

IOW, I don't see anything that's changed other than an unspecified new
treatment of \x escapes, and possibly that ur"" strings don't expand \u
escapes.

> Examples:
>
> u'abc'          -> U+0061 U+0062 U+0063
> u'\u1234'       -> U+1234
> u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+05c

The last example is damaged (U+05c isn't legit).  Other than that, these
look the same as before.

> Now how should we define ur"abc\u1234\n"  ... ?

If strings carried an encoding tag with them, the obvious answer is that
this acts exactly like r"abc\u1234\n" acts today except gets a
"unicode-escaped" encoding tag instead of a "[whatever the default is
today]" encoding tag.

If strings don't carry an encoding tag with them, you're in a bit of a
pickle:  you'll have to convert it to a regular string or a Unicode string,
but in either case have no way to communicate that it may need further
processing; i.e., no way to distinguish it from a regular or Unicode string
produced by any other mechanism.  The code I posted yesterday remains my
best answer to that unpleasant puzzle (i.e., produce a Unicode string,
fiddling with backslashes just enough to get the \u escapes expanded, in the
same way Java's (conceptual) preprocessor does it).




From tim_one@email.msn.com  Thu Nov 18 03:21:19 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:21:19 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: 
Message-ID: <000201bf3173$fb7f7ea0$c0a0143f@tim>

[MAL]
> File objects opened in text mode will use "t#" and binary
> ones use "s#".

[Greg Stein]
> ...
> The real annoying thing would be to assume that opening a file as 'r'
> means that I *meant* text mode and to start using "t#".

Isn't that exactly what MAL said would happen?  Note that a "t" flag for
"text mode" is an MS extension -- C doesn't define "t", and Python doesn't
either; a lone "r" has always meant text mode.

> In actuality, I typically open files that way since I do most of my
> coding on Linux. If I now have to pay attention to things and open it
> as 'rb', then I'll be pissed.
>
> And the change in behavior and bugs that interpreting 'r' as text would
> introduce? Ack!

'r' is already interpreted as text mode, but so far, on Unix-like systems,
there's been no difference between text and binary modes.  Introducing a
distinction will certainly cause problems.  I don't know what the
compensating advantages are thought to be.




From tim_one@email.msn.com  Thu Nov 18 03:23:00 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:23:00 -0500
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: <199911171332.IAA03266@kaluha.cnri.reston.va.us>
Message-ID: <000301bf3174$37b465c0$c0a0143f@tim>

[Guido]
> I will force nobody to use checkin privileges.

That almost went without saying .

> However I see that for some contributors, checkin privileges will
> save me and them time.

Then it's Good!  Provided it doesn't hurt language stability.  I agree that
changing the system to mail out diffs addresses what I was worried about
there.




From tim_one@email.msn.com  Thu Nov 18 03:31:38 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:31:38 -0500
Subject: [Python-Dev] Apache process (was: Python 1.6 status)
In-Reply-To: <199911171356.IAA04005@kaluha.cnri.reston.va.us>
Message-ID: <000401bf3175$6c089660$c0a0143f@tim>

[Greg]
> ...
> Somebody proposes that a person is added to the list of people with
> checkin privileges. If nobody else in the group vetoes that, then
> they're in ...

[Guido]
> This makes sense, but I have one concern: if somebody who isn't liked
> very much (say a capable hacker who is a real troublemaker) asks for
> privileges, would people veto this?

It seems that a key point in Greg's description is that people don't propose
*themselves* for checkin.  They have to talk someone else into proposing
them.  That should keep Endang out of the running for a few years .

After that, I care more about their code than their personalities.  If the
stuff they check in is good, fine; if it's not, lock 'em out for direct
cause.

> I'd be reluctant to go on record as veto'ing a particular person.

Secret Ballot run off a web page -- although not so secret you can't see who
voted for what .




From tim_one@email.msn.com  Thu Nov 18 03:37:18 1999
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:37:18 -0500
Subject: Weak refs (was [Python-Dev] just say no...)
In-Reply-To: <14386.48969.630893.119344@weyr.cnri.reston.va.us>
Message-ID: <000501bf3176$36a5ca00$c0a0143f@tim>

[Fred L. Drake, Jr.]
> Guido has asked me to pursue this topic [weak refs], so I'll be
> checking out available implementations and seeing if any are
> adoptable or if something different is needed to be fully general
> and well-integrated.

Just don't let "fully general" stop anything for its sake alone; e.g., if
there's a slick trick that *could* exempt numbers, that's all to the good!
Adding a pointer to every object is really unattractive, while adding a flag
or two to type objects is dirt cheap.

Note in passing that current Java addresses weak refs too (several flavors
of 'em! -- very elaborate).




From gstein@lyra.org  Thu Nov 18 08:09:24 1999
From: gstein@lyra.org (Greg Stein)
Date: Thu, 18 Nov 1999 00:09:24 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <000201bf3173$fb7f7ea0$c0a0143f@tim>
Message-ID: 

On Wed, 17 Nov 1999, Tim Peters wrote:
>...
> 'r' is already interpreted as text mode, but so far, on Unix-like systems,
> there's been no difference between text and binary modes.  Introducing a
> distinction will certainly cause problems.  I don't know what the
> compensating advantages are thought to be.

Wow. "compensating advantages" ... Excellent "power phrase" there.

hehe...

-g

--
Greg Stein, http://www.lyra.org/



From mal@lemburg.com  Thu Nov 18 08:15:04 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:15:04 +0100
Subject: [Python-Dev] just say no...
References: <000201bf3173$fb7f7ea0$c0a0143f@tim>
Message-ID: <3833B588.1E31F01B@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > File objects opened in text mode will use "t#" and binary
> > ones use "s#".
> 
> [Greg Stein]
> > ...
> > The real annoying thing would be to assume that opening a file as 'r'
> > means that I *meant* text mode and to start using "t#".
> 
> Isn't that exactly what MAL said would happen?  Note that a "t" flag for
> "text mode" is an MS extension -- C doesn't define "t", and Python doesn't
> either; a lone "r" has always meant text mode.

Em, I think you've got something wrong here: "t#" refers to the
parsing marker used for writing data to files opened in text mode.

Until now, all files used the "s#" parsing marker for writing
data, regardless of being opened in text or binary mode. The
new interpretation (new, because there previously was none ;-)
of the buffer interface forces this to be changed to regain
conformance.

> > In actuality, I typically open files that way since I do most of my
> > coding on Linux. If I now have to pay attention to things and open it
> > as 'rb', then I'll be pissed.
> >
> > And the change in behavior and bugs that interpreting 'r' as text would
> > introduce? Ack!
> 
> 'r' is already interpreted as text mode, but so far, on Unix-like systems,
> there's been no difference between text and binary modes.  Introducing a
> distinction will certainly cause problems.  I don't know what the
> compensating advantages are thought to be.

I guess you won't notice any difference: strings define both
interfaces ("s#" and "t#") to mean the same thing. Only other
buffer compatible types may now fail to write to text files
-- which is not so bad, because it forces the programmer to
rethink what he really intended when opening the file in text
mode.

Besides, if you are writing portable scripts you should pay
close attention to "r" vs. "rb" anyway.

[Strange, I find myself arguing for a feature that I don't
like myself ;-)]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 18 08:59:21 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:59:21 +0100
Subject: [Python-Dev] Python 1.6 status
References: <199911161700.MAA02716@eric.cnri.reston.va.us> <004c01bf30f1$537102b0$f29b12c2@secret.pythonware.com>
Message-ID: <3833BFE9.6FD118B1@lemburg.com>

Fredrik Lundh wrote:
> 
> Guido van Rossum  wrote:
> > - suggestions for new issues that maybe ought to be settled in 1.6
> 
> three things: imputil, imputil, imputil

But please don't add the current version as default importer...
its strategy is way too slow for real life apps (yes, I've tested
this: imports typically take twice as long as with the builtin
importer).

I'd opt for an import manager which provides a useful API for
import hooks to register themselves with. What we really need
is not yet another complete reimplementation of what the
builtin importer does, but rather a more detailed exposure of
the various import aspects: finding modules and loading modules.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 18 08:50:36 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:50:36 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>  <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com>
Message-ID: <3833BDDC.7CD2CC1F@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg  wrote:
> > >     def flush(self):
> > >         # flush the decoding buffers.  this should usually
> > >         # return None, unless the fact that knowing that the
> > >         # input stream has ended means that the state can be
> > >         # interpreted in a meaningful way.  however, if the
> > >         # state indicates that the last character was not
> > >         # finished, this method should raise a UnicodeError
> > >         # exception.
> >
> > Could you explain for reason for having a .flush() method
> > and what it should return.
> 
> in most cases, it should either return None, or
> raise a UnicodeError exception:
> 
>     >>> u = unicode("å i åa ä e ö", "iso-latin-1")
>     >>> # yes, that's a valid Swedish sentence ;-)
>     >>> s = u.encode("utf-8")
>     >>> d = decoder("utf-8")
>     >>> d.decode(s[:-1])
>     "å i åa ä e "
>     >>> d.flush()
>     UnicodeError: last character not complete
> 
> on the other hand, there are situations where it
> might actually return a string.  consider a "HTML
> entity decoder" which uses the following pattern
> to match a character entity: "&\w+;?" (note that
> the trailing semicolon is optional).
> 
>     >>> u = unicode("å i åa ä e ö", "iso-latin-1")
>     >>> s = u.encode("html-entities")
>     >>> d = decoder("html-entities")
>     >>> d.decode(s[:-1])
>     "å i åa ä e "
>     >>> d.flush()
>     "ö"

Ah, ok. So the .flush() method checks for proper
string endings and then either returns the remaining
input or raises an error.
 
> > Perhaps I'm missing something, but how would you define
> > stream codecs using this interface ?
> 
> input: read chunks of data, decode, and
> keep extra data in a local buffer.
> 
> output: encode data into suitable chunks,
> and write to the output stream (that's why
> there's a buffersize argument to encode --
> if someone writes a 10mb unicode string to
> an encoded stream, python shouldn't allocate
> an extra 10-30 megabytes just to be able to
> encode the darn thing...)

So the stream codecs would be wrappers around the
string codecs.

Have you read my latest version of the Codec interface ?
Wouldn't that be a reasonable approach ? Note that I have
integrated your ideas into the new API -- it's basically
only missing the .flush() methods, which I can add now
that I know what you meant.
 
> > > Implementing stream codecs is left as an exercise (see the zlib
> > > material in the eff-bot guide for a decoder example).
> 
> everybody should have a copy of the eff-bot guide ;-)

Sure, but the format, the format... make it printed and add
a CD and you would probably have a good selling book
there ;-)
 
> (but alright, I plan to post a complete utf-8 implementation
> in a not too distant future).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 18 08:16:48 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:16:48 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: 
Message-ID: <3833B5F0.FA4620AD@lemburg.com>

Greg Stein wrote:
> 
> On Wed, 17 Nov 1999, M.-A. Lemburg wrote:
> >...
> > I'd suggest grouping encodings:
> >
> > [encodings]
> >       [iso]
> >               [iso88591]
> >               [iso88592]
> >       [jis]
> >               ...
> >       [cyrillic]
> >               ...
> >       [misc]
> 
> WHY?!?!
> 
> This is taking a simple solution and making it complicated. I see no
> benefit to creating yet-another-level-of-hierarchy. Why should they be
> grouped?
> 
> Leave the modules just under "encodings" and be done with it.

Nevermind, was just an idea...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 18 08:43:31 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:43:31 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References:  <199911171343.IAA03636@kaluha.cnri.reston.va.us>
Message-ID: <3833BC33.66E134F@lemburg.com>

Guido van Rossum wrote:
> 
> > Why a factory? I've got a simple encode() function. I don't need a
> > factory. "flexibility" at the cost of complexity (IMO).
> 
> Unless there are certain cases where factories are useful.  But let's
> read on...
>
> > >     action - a string stating the supported action:
> > >                     'encode'
> > >                     'decode'
> > >                     'stream write'
> > >                     'stream read'
> >
> > This action thing is subject to error. *if* you're wanting to go this
> > route, then have:
> >
> > unicodec.register_encode(...)
> > unicodec.register_decode(...)
> > unicodec.register_stream_write(...)
> > unicodec.register_stream_read(...)
> >
> > They are equivalent. Guido has also told me in the past that he dislikes
> > parameters that alter semantics -- preferring different functions instead.
> 
> Yes, indeed!

Ok.

> (But weren't we going to do away with the whole registry
> idea in favor of an encodings package?)

One way or another, the Unicode implementation will have to
access a dictionary containing references to the codecs for
a particular encoding. You won't get around registering these
at some point... be it in a lazy way, on-the-fly or by some
other means.

What we could do is implement the lookup like this:

1. call encodings.lookup_(encoding) and use the
   return value for the conversion
2. if all fails, cop out with an error

Step 1. would do all the import magic and then register
the found codecs in some dictionary for faster access
(perhaps this could be done in a way that is directly
available to the Unicode implementation, e.g. in a
global internal dictionary -- the one I originally had in
mind for the unicodec registry).
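
A rough sketch of that lazy lookup-plus-cache, assuming each module under
the encodings package exposes plain encode() and decode() callables (the
attribute names and the single lookup() entry point are illustrative only):

    _codec_cache = {}

    def lookup(encoding):
        # step 1: try the cache, then fall back to importing a module
        # named after the encoding from the encodings package
        try:
            return _codec_cache[encoding]
        except KeyError:
            modname = encoding.replace('-', '_').lower()
            module = __import__('encodings.' + modname, {}, {}, ['*'])
            pair = (module.encode, module.decode)   # assumed module interface
            _codec_cache[encoding] = pair           # "register" for next time
            return pair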

> > Not that I'm advocating it, but register() could also take a single
> > parameter: if a class, then instantiate it and call methods for each
> > action; if an instance, then just call methods for each action.
> 
> Nah, that's bad -- a class is just a factory, and once you are
> > allowing classes it's really good to also allow factory functions.
> 
> > [ and the third/original variety: a function object as the first param is
> >   the actual hook, and params 2 thru 4 (each are optional, or just the
> >   stream funcs?) are the other hook functions ]
> 
> Fine too.  They should all be optional.

Ok.
 
> > > obj = factory_function_for_(errors='strict')
> >
> > Where does this "errors" value come from? How does a user alter that
> > value? Without an ability to change this, I see no reason for a factory.
> > [ and no: don't tell me it is a thread-state value :-) ]
> >
> > On the other hand: presuming the "errors" thing is valid, *then* I see a
> > need for a factory.
> 
> The idea is that various places that take an encoding name can also
> take a codec instance.  So the user can call the factory function /
> class constructor.

Right. The argument is reachable via:

Codec = encodings.lookup_encode('utf-8')
codec = Codec(errors='?')
s = codec(u"abcäöäü")

s would then equal 'abc??'.

--

Should I go ahead then and change the registry business to
the new strategy (via the encodings package in the above
sense) ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mhammond@skippinet.com.au  Thu Nov 18 10:57:44 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Thu, 18 Nov 1999 21:57:44 +1100
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <3833BC33.66E134F@lemburg.com>
Message-ID: <002401bf31b3$bf16c230$0501a8c0@bobcat>

[Guido]
> > (But weren't we going to do away with the whole registry
> > idea in favor of an encodings package?)
>
[MAL]
> One way or another, the Unicode implementation will have to
> access a dictionary containing references to the codecs for
> a particular encoding. You won't get around registering these
> at some point... be it in a lazy way, on-the-fly or by some
> other means.

What is wrong with my idea of using well-known-names from the encoding
module?  The dict then is "encodings.<encodingname>.__dict__".  All
encodings "just work" because of the leverage from the Python module
system.  Unless I'm missing something, there is no need for any extra
registry at all.  I guess it would actually resolve to 2 dict lookups,
but that's OK surely?

Mark.



From mal@lemburg.com  Thu Nov 18 09:39:30 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 10:39:30 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf3173$f9805340$c0a0143f@tim>
Message-ID: <3833C952.C6F154B1@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > Guido and I have decided to turn \uXXXX into a standard
> > escape sequence with no further magic applied. \uXXXX will
> > only be expanded in u"" strings.
> 
> Does that exclude ur"" strings?  Not arguing either way, just don't know
> what all this means.
> 
> > Here's the new scheme:
> >
> > With the 'unicode-escape' encoding being defined as:
> >
> > · all non-escape characters represent themselves as a Unicode ordinal
> >   (e.g. 'a' -> U+0061).
> 
> Same as before (scream if that's wrong).
> 
> > · all existing defined Python escape sequences are interpreted as
> >   Unicode ordinals;
> 
> Same as before (ditto).
> 
> > note that \xXXXX can represent all Unicode ordinals,
> 
> This means that the definition of \xXXXX has changed, then -- as you pointed
> out just yesterday , \xABCDq currently acts like \xCDq.  Does the new
> \x definition apply only in u"" strings, or in "" strings too?  What is the
> new \x definition?

Guido decided to make \xYYXX return U+YYXX *only* within u""
strings. In  "" (Python strings) the same sequence will result
in chr(0xXX).
 
> > and \OOO (octal) can represent Unicode ordinals up to U+01FF.
> 
> Same as before (ditto).
> 
> > · a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
> >   error to have fewer than 4 digits after \u.
> 
> Same as before (ditto).
> 
> IOW, I don't see anything that's changed other than an unspecified new
> treatment of \x escapes, and possibly that ur"" strings don't expand \u
> escapes.

The difference is that we no longer take the two-step approach.
\uXXXX is treated at the same time as all other escape sequences
are decoded (the previous version first scanned and decoded
all standard Python sequences and then turned to the \uXXXX
sequences in a second scan).
 
> > Examples:
> >
> > u'abc'          -> U+0061 U+0062 U+0063
> > u'\u1234'       -> U+1234
> > u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+05c
> 
> The last example is damaged (U+05c isn't legit).  Other than that, these
> look the same as before.

Corrected; thanks.
 
> > Now how should we define ur"abc\u1234\n"  ... ?
> 
> If strings carried an encoding tag with them, the obvious answer is that
> this acts exactly like r"abc\u1234\n" acts today except gets a
> "unicode-escaped" encoding tag instead of a "[whatever the default is
> today]" encoding tag.
> 
> If strings don't carry an encoding tag with them, you're in a bit of a
> pickle:  you'll have to convert it to a regular string or a Unicode string,
> but in either case have no way to communicate that it may need further
> processing; i.e., no way to distinguish it from a regular or Unicode string
> produced by any other mechanism.  The code I posted yesterday remains my
> best answer to that unpleasant puzzle (i.e., produce a Unicode string,
> fiddling with backslashes just enough to get the \u escapes expanded, in the
> same way Java's (conceptual) preprocessor does it).

They don't have such tags... so I guess we're in trouble ;-)

I guess to make ur"" have a meaning at all, we'd need to go
the Java preprocessor way here, i.e. scan the string *only*
for \uXXXX sequences, decode these and convert the rest as-is
to Unicode ordinals.

Would that be ok ?
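
A minimal sketch of that preprocessor-style pass, ignoring the
doubled-backslash corner case Tim's version handled; chr() here stands in
for whatever the Unicode-character constructor ends up being called:

    import re

    def expand_u_escapes(raw):
        # expand \uXXXX only; every other backslash sequence is left alone
        return re.sub(r'\\u([0-9a-fA-F]{4})',
                      lambda m: chr(int(m.group(1), 16)),
                      raw)

    # expand_u_escapes(r"abc\u1234\n") turns \u1234 into one character
    # and keeps the backslash-n pair literal.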

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 18 11:41:32 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 12:41:32 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
Message-ID: <3833E5EC.AAFE5016@lemburg.com>

Mark Hammond wrote:
> 
> [Guido]
> > > (But weren't we going to do away with the whole registry
> > > idea in favor of an encodings package?)
> >
> [MAL]
> > One way or another, the Unicode implementation will have to
> > access a dictionary containing references to the codecs for
> > a particular encoding. You won't get around registering these
> > at some point... be it in a lazy way, on-the-fly or by some
> > other means.
> 
> What is wrong with my idea of using well-known-names from the encoding
> module?  The dict then is "encodings.<encodingname>.__dict__".  All
> encodings "just work" because of the leverage from the Python module
> system.  Unless I'm missing something, there is no need for any extra
> registry at all.  I guess it would actually resolve to 2 dict lookups,
> but that's OK surely?

The problem is that the encoding names are not Python identifiers,
e.g. iso-8859-1 is not allowed as an identifier. This and
the fact that applications may want to ship their own codecs (which
do not get installed under the system wide encodings package)
make the registry necessary.

I don't see a problem with the registry though -- the encodings
package can take care of the registration process without any
user interaction. There would only have to be an API for
looking up an encoding published by the encodings package for
the Unicode implementation to use. The magic behind that API
is left to the encodings package...

BTW, nothing's wrong with your idea :-) In fact, I like it
a lot because it keeps the encoding modules out of the
top-level scope which is good.

PS: we could probably even take the whole codec idea one step
further and also allow other input/output formats to be registered,
e.g. stream ciphers or pickle mechanisms. The step in that
direction is not a big one: we'd only have to drop the specification
of the Unicode object in the spec and replace it with an arbitrary
object. Of course, this will still have to be a Unicode object
for use by the Unicode implementation.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From gmcm@hypernet.com  Thu Nov 18 14:19:48 1999
From: gmcm@hypernet.com (Gordon McMillan)
Date: Thu, 18 Nov 1999 09:19:48 -0500
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: <3833BFE9.6FD118B1@lemburg.com>
Message-ID: <1269187709-18981857@hypernet.com>

Marc-Andre wrote:

> Fredrik Lundh wrote:
> > 
> > Guido van Rossum  wrote:
> > > - suggestions for new issues that maybe ought to be settled in 1.6
> > 
> > three things: imputil, imputil, imputil
> 
> But please don't add the current version as default importer...
> its strategy is way too slow for real life apps (yes, I've tested
> this: imports typically take twice as long as with the builtin
> importer).

I think imputil's emulation of the builtin importer is more of a 
demonstration than a serious implementation. As for speed, it 
depends on the test. 
 
> I'd opt for an import manager which provides a useful API for
> import hooks to register themselves with. 

I think that rather than blindly chain themselves together, there 
should be a simple minded manager. This could let the 
programmer prioritize them.
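
For concreteness, a toy version of such a manager; the get_module() hook
on each importer is made up purely to show the prioritized-chain idea:

    class ImportManager:
        def __init__(self):
            self.importers = []                  # (priority, importer) pairs
        def install(self, importer, priority=0):
            self.importers.append((priority, importer))
            self.importers.sort(key=lambda pair: pair[0])  # lowest tried first
        def import_module(self, name):
            for priority, importer in self.importers:
                module = importer.get_module(name)          # hypothetical hook
                if module is not None:
                    return module
            raise ImportError('no installed importer handles %r' % name)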

> What we really need
> is not yet another complete reimplementation of what the
> builtin importer does, but rather a more detailed exposure of
> the various import aspects: finding modules and loading modules.

The first clause I sort of agree with - the current 
implementation is a fine implementation of a filesystem 
directory based importer.

I strongly disagree with the second clause. The current import 
hooks are just such a detailed exposure; and they are 
incomprehensible and unmanageable.

I guess you want to tweak the "finding" part of the builtin 
import mechanism. But that's no reason to ask all importers 
to break themselves up into "find" and "load" pieces. It's a 
reason to ask that the standard importer be, in some sense, 
"subclassable" (ie, expose hooks, or perhaps be an extension 
class like thingie).

- Gordon


From jim@interet.com  Thu Nov 18 14:39:20 1999
From: jim@interet.com (James C. Ahlstrom)
Date: Thu, 18 Nov 1999 09:39:20 -0500
Subject: [Python-Dev] Python 1.6 status
References: <1269187709-18981857@hypernet.com>
Message-ID: <38340F98.212F61@interet.com>

Gordon McMillan wrote:
> 
> Marc-Andre wrote:
> 
> > Fredrik Lundh wrote:
> > >
> > > Guido van Rossum  wrote:
> > > > - suggestions for new issues that maybe ought to be settled in 1.6
> > >
> > > three things: imputil, imputil, imputil
> >
> > But please don't add the current version as default importer...
> > its strategy is way too slow for real life apps (yes, I've tested
> > this: imports typically take twice as long as with the builtin
> > importer).
> 
> I think imputil's emulation of the builtin importer is more of a
> demonstration than a serious implementation. As for speed, it
> depends on the test.

IMHO the current import mechanism is good for developers who must
work on the library code in the directory tree, but a disaster
for sysadmins who must distribute Python applications either
internally to a number of machines or commercially.  What we
need is a standard Python library file like a Java "Jar" file.
Imputil can support this as 130 lines of Python.  I have also
written one in C.  I like the imputil approach, but if we want
to add a library importer to import.c, I volunteer to write it.

I don't want to just add more complicated and unmanageable hooks
which people will all use different ways and just add to the
confusion.

It is easy to install packages by just making them into a library
file and throwing it into a directory.  So why aren't we doing it?
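
As a rough illustration of the library-file idea (not imputil's actual
API; the archive layout and names are invented), a few lines around the
standard zipfile module are enough to pull a module out of an archive:

    import types, zipfile

    def import_from_library(archive_path, module_name):
        # read foo.py out of the archive and run it in a fresh module object
        source = zipfile.ZipFile(archive_path).read(module_name + '.py')
        module = types.ModuleType(module_name)
        exec(source, module.__dict__)
        return module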

Jim Ahlstrom


From guido@CNRI.Reston.VA.US  Thu Nov 18 15:30:28 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 10:30:28 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
In-Reply-To: Your message of "Thu, 18 Nov 1999 09:19:48 EST."
 <1269187709-18981857@hypernet.com>
References: <1269187709-18981857@hypernet.com>
Message-ID: <199911181530.KAA03887@eric.cnri.reston.va.us>

Gordon McMillan wrote:

> Marc-Andre wrote:
> 
> > Fredrik Lundh wrote:
> >
> > > Guido van Rossum  wrote:
> > > > - suggestions for new issues that maybe ought to be settled in 1.6
> > > 
> > > three things: imputil, imputil, imputil
> > 
> > But please don't add the current version as default importer...
> > its strategy is way too slow for real life apps (yes, I've tested
> > this: imports typically take twice as long as with the builtin
> > importer).
> 
> I think imputil's emulation of the builtin importer is more of a 
> demonstration than a serious implementation. As for speed, it 
> depends on the test. 

Agreed.  I like some of imputil's features, but I think the API
needs to be redesigned.

> > I'd opt for an import manager which provides a useful API for
> > import hooks to register themselves with. 
> 
> I think that rather than blindly chain themselves together, there 
> should be a simple minded manager. This could let the 
> programmer prioritize them.

Indeed.  (A list of importers has been suggested, to replace the list
of directories currently used.)

> > What we really need
> > is not yet another complete reimplementation of what the
> > builtin importer does, but rather a more detailed exposure of
> > the various import aspects: finding modules and loading modules.
> 
> The first clause I sort of agree with - the current 
> implementation is a fine implementation of a filesystem 
> directory based importer.
> 
> I strongly disagree with the second clause. The current import 
> hooks are just such a detailed exposure; and they are 
> incomprehensible and unmanageable.

Based on how many people have successfully written import hooks, I
have to agree. :-(

> I guess you want to tweak the "finding" part of the builtin 
> import mechanism. But that's no reason to ask all importers 
> to break themselves up into "find" and "load" pieces. It's a 
> reason to ask that the standard importer be, in some sense, 
> "subclassable" (ie, expose hooks, or perhaps be an extension 
> class like thingie).

Agreed.  Subclassing is a good way towards flexibility.

And Jim Ahlstrom writes:

> IMHO the current import mechanism is good for developers who must
> work on the library code in the directory tree, but a disaster
> for sysadmins who must distribute Python applications either
> internally to a number of machines or commercially.

Unfortunately, you're right. :-(

> What we need is a standard Python library file like a Java "Jar"
> file.  Imputil can support this as 130 lines of Python.  I have also
> written one in C.  I like the imputil approach, but if we want to
> add a library importer to import.c, I volunteer to write it.

Please volunteer to design or at least review the grand architecture
-- see below.

> I don't want to just add more complicated and unmanageable hooks
> which people will all use different ways and just add to the
> confusion.

You're so right!

> It is easy to install packages by just making them into a library
> file and throwing it into a directory.  So why aren't we doing it?

Rhetorical question. :-)

So here's a challenge: redesign the import API from scratch.

Let me start with some requirements.

Compatibility issues:
---------------------

- the core API may be incompatible, as long as compatibility layers
can be provided in pure Python

- support for rexec functionality

- support for freeze functionality

- load .py/.pyc/.pyo files and shared libraries from files

- support for packages

- sys.path and sys.modules should still exist; sys.path might
have a slightly different meaning

- $PYTHONPATH and $PYTHONHOME should still be supported

(I wouldn't mind a splitting up of importdl.c into several
platform-specific files, one of which is chosen by the configure
script; but that's a bit of a separate issue.)

New features:
-------------

- Integrated support for Greg Ward's distribution utilities (i.e. a
  module prepared by the distutil tools should install painlessly)

- Good support for prospective authors of "all-in-one" packaging tools
  like Gordon McMillan's win32 installer or /F's squish.  (But
  I *don't* require backwards compatibility for existing tools.)

- Standard import from zip or jar files, in two ways:

  (1) an entry on sys.path can be a zip/jar file instead of a directory;
      its contents will be searched for modules or packages

  (2) a file in a directory that's on sys.path can be a zip/jar file;
      its contents will be considered as a package (note that this is
      different from (1)!)

  I don't particularly care about supporting all zip compression
  schemes; if Java gets away with only supporting gzip compression
  in jar files, so can we.

- Easy ways to subclass or augment the import mechanism along
  different dimensions.  For example, while none of the following
  features should be part of the core implementation, it should be
  easy to add any or all:

  - support for a new compression scheme to the zip importer

  - support for a new archive format, e.g. tar

  - a hook to import from URLs or other data sources (e.g. a
    "module server" imported in CORBA) (this needn't be supported
    through $PYTHONPATH though)

  - a hook that imports from compressed .py or .pyc/.pyo files

  - a hook to auto-generate .py files from other filename
    extensions (as currently implemented by ILU)

  - a cache for file locations in directories/archives, to improve
    startup time

  - a completely different source of imported modules, e.g. for an
    embedded system or PalmOS (which has no traditional filesystem)

- Note that different kinds of hooks should (ideally, and within
  reason) properly combine, as follows: if I write a hook to recognize
  .spam files and automatically translate them into .py files, and you
  write a hook to support a new archive format, then if both hooks are
  installed together, it should be possible to find a .spam file in an
  archive and do the right thing, without any extra action.  Right?

- It should be possible to write hooks in C/C++ as well as Python

- Applications embedding Python may supply their own implementations,
  default search path, etc., but don't have to if they want to piggyback
  on an existing Python installation (even though the latter is
  fraught with risk, it's cheaper and easier to understand).

Implementation:
---------------

- There must clearly be some code in C that can import certain
  essential modules (to solve the chicken-or-egg problem), but I don't
  mind if the majority of the implementation is written in Python.
  Using Python makes it easy to subclass.

- In order to support importing from zip/jar files using compression,
  we'd at least need the zlib extension module and hence libz itself,
  which may not be available everywhere.

- I suppose that the bootstrap is solved using a mechanism very
  similar to what freeze currently uses (other solutions seem to be
  platform dependent).

- I also want to still support importing *everything* from the
  filesystem, if only for development.  (It's hard enough to deal with
  the fact that exceptions.py is needed during Py_Initialize();
  I want to be able to hack on the import code written in Python
  without having to rebuild the executable all the time.)

Let's first complete the requirements gathering.  Are these
requirements reasonable?  Will they make an implementation too
complex?  Am I missing anything?

Finally, to what extent does this impact the desire for dealing
differently with the Python bytecode compiler (e.g. supporting
optimizers written in Python)?  And does it affect the desire to
implement the read-eval-print loop (the >>> prompt) in Python?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@CNRI.Reston.VA.US  Thu Nov 18 15:37:49 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 10:37:49 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Thu, 18 Nov 1999 12:41:32 +0100."
 <3833E5EC.AAFE5016@lemburg.com>
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
 <3833E5EC.AAFE5016@lemburg.com>
Message-ID: <199911181537.KAA03911@eric.cnri.reston.va.us>

> The problem is that the encoding names are not Python identifiers,
> e.g. iso-8859-1 is not allowed as an identifier.

This is easily taken care of by translating each string of consecutive
non-identifier-characters to an underscore, so this would import the
iso_8859_1.py module.  (I also noticed in an earlier post that the
official name for Shift_JIS has an underscore, while most other
encodings use hyphens.)
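
That translation is essentially a one-liner; a sketch (the lower-casing is
an extra assumption, not something specified above):

    import re

    def encoding_to_module(name):
        # each run of non-identifier characters becomes a single underscore
        return re.sub(r'[^0-9A-Za-z_]+', '_', name).lower()

    # encoding_to_module('ISO-8859-1') -> 'iso_8859_1'
    # encoding_to_module('Shift_JIS')  -> 'shift_jis'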

> This and
> the fact that applications may want to ship their own codecs (which
> do not get installed under the system wide encodings package)
> make the registry necessary.

But it could be enough to register a package where to look for
encodings (in addition to the system package).

Or there could be a registry for encoding search functions.  (See the
import discussion.)
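
Such a registry of search functions could be as small as this sketch
(names invented; each search function returns codec info for an encoding
it recognizes, or None):

    _search_functions = []

    def register(search_function):
        _search_functions.append(search_function)

    def lookup(encoding):
        for search in _search_functions:
            found = search(encoding)
            if found is not None:
                return found
        raise LookupError('unknown encoding: %s' % encoding)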

> I don't see a problem with the registry though -- the encodings
> package can take care of the registration process without any
> user interaction. There would only have to be an API for
> looking up an encoding published by the encodings package for
> the Unicode implementation to use. The magic behind that API
> is left to the encodings package...

I think that the collection of encodings will eventually grow large
enough to make it a requirement to avoid doing work proportional to
the number of supported encodings at startup (or even when an encoding
is referenced for the first time).  Any "lazy" mechanism (of which
module search is an example) will do.

> BTW, nothing's wrong with your idea :-) In fact, I like it
> a lot because it keeps the encoding modules out of the
> top-level scope which is good.

Yes.

> PS: we could probably even take the whole codec idea one step
> further and also allow other input/output formats to be registered,
> e.g. stream ciphers or pickle mechanisms. The step in that
> direction is not a big one: we'd only have to drop the specification
> of the Unicode object in the spec and replace it with an arbitrary
> object. Of course, this will still have to be a Unicode object
> for use by the Unicode implementation.

This is a step towards Java's architecture of stackable streams.

But I'm always in favor of tackling what we know we need before
tackling the most generalized version of the problem.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From mal@lemburg.com  Thu Nov 18 15:52:26 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 16:52:26 +0100
Subject: [Python-Dev] Python 1.6 status
References: <1269187709-18981857@hypernet.com> <38340F98.212F61@interet.com>
Message-ID: <383420BA.EF8A6AC5@lemburg.com>

[imputil and friends]

"James C. Ahlstrom" wrote:
> 
> IMHO the current import mechanism is good for developers who must
> work on the library code in the directory tree, but a disaster
> for sysadmins who must distribute Python applications either
> internally to a number of machines or commercially.  What we
> need is a standard Python library file like a Java "Jar" file.
> Imputil can support this as 130 lines of Python.  I have also
> written one in C.  I like the imputil approach, but if we want
> to add a library importer to import.c, I volunteer to write it.
> 
> I don't want to just add more complicated and unmanageable hooks
> which people will all use different ways and just add to the
> confusion.
> 
> It is easy to install packages by just making them into a library
> file and throwing it into a directory.  So why aren't we doing it?

Perhaps we ought to rethink the strategy under a different
light: what are the real requirements we have for Python imports ?

Perhaps the outcome is only the addition of say one or two features
and those can probably easily be added to the builtin system...
then we can just forget about the whole import hook dilemma
for quite a while (AFAIK, this is how we got packages into the
core -- people weren't happy with the import hook).

Well, just an idea... I have other threads to follow :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From fdrake@acm.org  Thu Nov 18 16:01:47 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Thu, 18 Nov 1999 11:01:47 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <3833E5EC.AAFE5016@lemburg.com>
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
 <3833E5EC.AAFE5016@lemburg.com>
Message-ID: <14388.8939.911928.41746@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > The problem is that the encoding names are not Python identifiers,
 > e.g. iso-8859-1 is not allowed as an identifier. This and
 > the fact that applications may want to ship their own codecs (which
 > do not get installed under the system wide encodings package)
 > make the registry necessary.

  This isn't a substantial problem.  Try this on for size (probably
not too different from what everyone is already thinking, but let's
make it clear).  This could be in encodings/__init__.py; I've tried to 
be really clear on the names.  (No testing, only partially complete.)

------------------------------------------------------------------------
import string
import sys

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO


class EncodingError(Exception):
    def __init__(self, encoding, error):
        self.encoding = encoding
        self.strerror = "%s %s" % (error, `encoding`)
        self.error = error
        Exception.__init__(self, encoding, error)


_registry = {}

def registerEncoding(encoding, encode=None, decode=None,
                     make_stream_encoder=None, make_stream_decoder=None):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        info = _registry[encoding]
    else:
        info = _registry[encoding] = Codec(encoding)
    info._update(encode, decode,
                 make_stream_encoder, make_stream_decoder)


def getCodec(encoding):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        return _registry[encoding]

    # load the module
    modname = "encodings." + encoding.replace("-", "_")
    try:
        __import__(modname)
    except ImportError:
        raise EncodingError("unknown encoding " + `encoding`)

    # if the module registered, use the codec as-is:
    if _registry.has_key(encoding):
        return _registry[encoding]

    # nothing registered, use well-known names
    module = sys.modules[modname]
    codec = _registry[encoding] = Codec(encoding)
    encode = getattr(module, "encode", None)
    decode = getattr(module, "decode", None)
    make_stream_encoder = getattr(module, "make_stream_encoder", None)
    make_stream_decoder = getattr(module, "make_stream_decoder", None)
    codec._update(encode, decode,
                  make_stream_encoder, make_stream_decoder)
    return codec


class Codec:
    __encode = None
    __decode = None
    __stream_encoder_factory = None
    __stream_decoder_factory = None

    def __init__(self, name):
        self.name = name

    def encode(self, u):
        if self.__stream_encoder_factory:
            sio = StringIO()
            encoder = self.__stream_encoder_factory(sio)
            encoder.write(u)
            encoder.flush()
            return sio.getvalue()
        else:
            raise EncodingError("no encoder available for " + `self.name`)

    # similar for decode()...

    def make_stream_encoder(self, target):
        if self.__stream_encoder_factory:
            return self.__stream_encoder_factory(target)
        elif self.__encode:
            return DefaultStreamEncoder(target, self.__encode)
        else:
            raise EncodingError("no encoder available for " + `self.name`)

    # similar for make_stream_decoder()...

    def _update(self, encode, decode,
                make_stream_encoder, make_stream_decoder):
        self.__encode = encode or self.__encode
        self.__decode = decode or self.__decode
        self.__stream_encoder_factory = (
            make_stream_encoder or self.__stream_encoder_factory)
        self.__stream_decoder_factory = (
            make_stream_decoder or self.__stream_decoder_factory)
------------------------------------------------------------------------

 > I don't see a problem with the registry though -- the encodings
 > package can take care of the registration process without any

  No problem at all; we just need to make sure the right magic is
there for the "normal" case.

 > PS: we could probably even take the whole codec idea one step
 > further and also allow other input/output formats to be registered,

  File formats are different from text encodings, so let's keep them
separate.  Yes, a registry can be a good approach whenever the various 
things being registered are sufficiently similar semantically, but the 
behavior of the registry/lookup can be very different for each type of 
thing.  Let's not over-generalize.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives


From fdrake@acm.org  Thu Nov 18 16:02:45 1999
From: fdrake@acm.org (Fred L. Drake, Jr.)
Date: Thu, 18 Nov 1999 11:02:45 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <3833E5EC.AAFE5016@lemburg.com>
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
 <3833E5EC.AAFE5016@lemburg.com>
Message-ID: <14388.8997.703108.401808@weyr.cnri.reston.va.us>

  Er, I should note that the sample code I just sent makes use of
string methods.  ;)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives


From mal@lemburg.com  Thu Nov 18 16:23:09 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 17:23:09 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
 <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us>
Message-ID: <383427ED.45A01BBB@lemburg.com>

Guido van Rossum wrote:
> 
> > The problem is that the encoding names are not Python identifiers,
> > e.g. iso-8859-1 is not allowed as an identifier.
> 
> This is easily taken care of by translating each string of consecutive
> non-identifier-characters to an underscore, so this would import the
> iso_8859_1.py module.  (I also noticed in an earlier post that the
> official name for Shift_JIS has an underscore, while most other
> encodings use hyphens.)

Right. That's one way of doing it.

> > This and
> > the fact that applications may want to ship their own codecs (which
> > do not get installed under the system wide encodings package)
> > make the registry necessary.
> 
> But it could be enough to register a package where to look for
> encodings (in addition to the system package).
> 
> Or there could be a registry for encoding search functions.  (See the
> import discussion.)

Like a path of search functions ? Not a bad idea... I will still
want the internal dict for caching purposes though. I'm not sure
how often these encodings will be looked up, but even a few hundred
function calls will slow down the Unicode implementation quite a bit.

The implementation could proceed as follows:

def lookup(encoding):

    codecs = _internal_dict.get(encoding,None)
    if codecs:
	return codecs
    for query in sys.encoders:
	codecs = query(encoding)
	if codecs:
	    break
    else:
	raise UnicodeError,'unknown encoding: %s' % encoding
    _internal_dict[encoding] = codecs
    return codecs

For simplicity, codecs should be a tuple (encoder,decoder,
stream_writer,stream_reader) of factory functions.

...that is if we can agree on these 4 APIs :-) Here are my
current versions:
-----------------------------------------------------------------------
class Codec:

    """ Defines the interface for stateless encoders/decoders.
    """

    def __init__(self,errors='strict'):

	""" Creates a Codec instance.

	    The Codec may implement different error handling
	    schemes by providing the errors argument. These parameters
	    are defined:

	     'strict' - raise an UnicodeError (or a subclass)
	     'ignore' - ignore the character and continue with the next
	     (a single character)
	              - replace erroneous characters with the given
	                character (may also be a Unicode character)

	"""
	self.errors = errors

    def encode(self,u,slice=None):
	
	""" Return the Unicode object u encoded as Python string.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is encoded.

	    The method may not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	"""
	...

    def decode(self,s,offset=0):

	""" Decodes data from the Python string s and returns a tuple 
	    (Unicode object, bytes consumed).
	
	    If offset is given, the decoding process starts at
	    s[offset]. It defaults to 0.

	    The method may not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	""" 
	...


StreamWriter and StreamReader define the interface for stateful
encoders/decoders:

class StreamWriter(Codec):

    def __init__(self,stream,errors='strict'):

	""" Creates a StreamWriter instance.

	    stream must be a file-like object open for writing
	    (binary) data.

	    The StreamWriter may implement different error handling
	    schemes by providing the errors argument. These parameters
	    are defined:

	     'strict' - raise an UnicodeError (or a subclass)
	     'ignore' - ignore the character and continue with the next
	     (a single character)
	              - replace erroneous characters with the given
	                character (may also be a Unicode character)

	"""
	self.stream = stream

    def write(self,u,slice=None):

	""" Writes the Unicode object's contents encoded to self.stream
	    and returns the number of bytes written.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is written.

        """
	... the base class should provide a default implementation
	    of this method using self.encode ...
	
    def flush(self):

	""" Flushes the codec buffers used for keeping state.

	    Return values are not defined. Implementations are free to
	    return None, raise an exception (in case there is pending
	    data in the buffers which could not be decoded) or
	    return any remaining data from the state buffers used.

	"""
	pass

class StreamReader(Codec):

    def __init__(self,stream,errors='strict'):

	""" Creates a StreamReader instance.

	    stream must be a file-like object open for reading
	    (binary) data.

	    The StreamReader may implement different error handling
	    schemes by providing the errors argument. These parameters
	    are defined:

	     'strict' - raise an UnicodeError (or a subclass)
	     'ignore' - ignore the character and continue with the next
	     (a single character)
	              - replace erroneous characters with the given
	                character (may also be a Unicode character)

	"""
	self.stream = stream

    def read(self,chunksize=0):

	""" Decodes data from the stream self.stream and returns a tuple 
	    (Unicode object, bytes consumed).

	    chunksize indicates the approximate maximum number of
	    bytes to read from the stream for decoding purposes. The
	    decoder can modify this setting as appropriate. The default
	    value 0 indicates to read and decode as much as possible.
	    The chunksize is intended to prevent having to decode huge
	    files in one step.

        """
	... the base class should provide a default implementation
	    of this method using self.decode ...

    def flush(self):

	""" Flushes the codec buffers used for keeping state.

	    Return values are not defined. Implementations are free to
	    return None, raise an exception (in case there is pending
	    data in the buffers which could not be decoded) or
	    return any remaining data from the state buffers used.

	"""

In addition to the above methods, the StreamWriter and StreamReader
instances should also provide access to all other methods defined for
the stream object.

Stream codecs are free to combine the StreamWriter and StreamReader
interfaces into one class.
-----------------------------------------------------------------------

> > I don't see a problem with the registry though -- the encodings
> > package can take care of the registration process without any
> > user interaction. There would only have to be an API for
> > looking up an encoding published by the encodings package for
> > the Unicode implementation to use. The magic behind that API
> > is left to the encodings package...
> 
> I think that the collection of encodings will eventually grow large
> enough to make it a requirement to avoid doing work proportional to
> the number of supported encodings at startup (or even when an encoding
> is referenced for the first time).  Any "lazy" mechanism (of which
> module search is an example) will do.

Right. The list of search functions should provide this kind
of lazyness. It also provides ways to implement other strategies
to look for codecs, e.g. PIL could provide such a search function
for its codecs, mxCrypto for the included ciphers, etc.
 
> > BTW, nothing's wrong with your idea :-) In fact, I like it
> > a lot because it keeps the encoding modules out of the
> > top-level scope which is good.
> 
> Yes.
> 
> > PS: we could probably even take the whole codec idea one step
> > further and also allow other input/output formats to be registered,
> > e.g. stream ciphers or pickle mechanisms. The step in that
> > direction is not a big one: we'd only have to drop the specification
> > of the Unicode object in the spec and replace it with an arbitrary
> > object. Of course, this will still have to be a Unicode object
> > for use by the Unicode implementation.
> 
> This is a step towards Java's architecture of stackable streams.
> 
> But I'm always in favor of tackling what we know we need before
> tackling the most generalized version of the problem.

Well, I just wanted to mention the possibility... might be
something to look into next year. I find it rather thrilling
to be able to create encrypted streams by just hooking together
a few stream codecs...

f = open('myfile.txt','w')

CipherWriter = sys.codec('rc5-cipher')[2]
sf = CipherWriter(f,key='xxxxxxxx')

UTF8Writer = sys.codec('utf-8')[2]
sfx = UTF8Writer(sf)

sfx.write('asdfasdfasdfasdf')
sfx.close()

Hmm, we should probably define the additional constructor
arguments to be keyword arguments... writers/readers other
than Unicode ones will probably need different kinds of
parameters (such as the key in the above example).

Ahem, ...I'm getting distracted here :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From bwarsaw@cnri.reston.va.us (Barry A. Warsaw)  Thu Nov 18 16:23:41 1999
From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw)
Date: Thu, 18 Nov 1999 11:23:41 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
 <3833E5EC.AAFE5016@lemburg.com>
 <14388.8997.703108.401808@weyr.cnri.reston.va.us>
Message-ID: <14388.10253.902424.904199@anthem.cnri.reston.va.us>

>>>>> "Fred" == Fred L Drake, Jr  writes:

    Fred>   Er, I should note that the sample code I just sent makes
    Fred> use of string methods.  ;)

Yay!


From guido@CNRI.Reston.VA.US  Thu Nov 18 16:37:08 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 11:37:08 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Thu, 18 Nov 1999 17:23:09 +0100."
 <383427ED.45A01BBB@lemburg.com>
References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us>
 <383427ED.45A01BBB@lemburg.com>
Message-ID: <199911181637.LAA04260@eric.cnri.reston.va.us>

> Like a path of search functions ? Not a bad idea... I will still
> want the internal dict for caching purposes though. I'm not sure
> how often these encodings will be, but even a few hundred function
> call will slow down the Unicode implementation quite a bit.

Of course.  (It's like sys.modules caching the results of an import).

[...]
>     def flush(self):
> 
> 	""" Flushed the codec buffers used for keeping state.
> 
> 	    Returns values are not defined. Implementations are free to
> 	    return None, raise an exception (in case there is pending
> 	    data in the buffers which could not be decoded) or
> 	    return any remaining data from the state buffers used.
> 
> 	"""

I don't know where this came from, but a flush() should work like
flush() on a file.  It doesn't return a value, it just sends any
remaining data to the underlying stream (for output).  For input it
shouldn't be supported at all.

The idea is that flush() should do the same to the encoder state that
close() followed by a reopen() would do.  Well, more or less.  But if
the process were to be killed right after a flush(), the data written
to disk should be a complete encoding, and not have a lingering shift
state.
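
As a toy illustration of those semantics (the class and the shift bytes
are invented here and don't correspond to any real codec): flush() returns
nothing, it just leaves the output in the neutral shift state and pushes
whatever is pending down to the underlying stream.

class ToyShiftStateWriter:
    SHIFT_IN, SHIFT_OUT = b'\x0e', b'\x0f'    # enter / leave the "wide" state

    def __init__(self, stream):
        self.stream = stream          # file-like object opened for binary output
        self.shifted = False

    def write(self, text):
        for ch in text:
            if ord(ch) > 127 and not self.shifted:
                self.stream.write(self.SHIFT_IN)
                self.shifted = True
            elif ord(ch) <= 127 and self.shifted:
                self.stream.write(self.SHIFT_OUT)
                self.shifted = False
            self.stream.write(ch.encode('utf-8'))

    def flush(self):
        if self.shifted:                      # never leave a lingering shift state
            self.stream.write(self.SHIFT_OUT)
            self.shifted = False
        self.stream.flush()                   # like file.flush(): no return value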

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@CNRI.Reston.VA.US  Thu Nov 18 16:59:06 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 11:59:06 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Thu, 18 Nov 1999 09:50:36 +0100."
 <3833BDDC.7CD2CC1F@lemburg.com>
References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com>
 <3833BDDC.7CD2CC1F@lemburg.com>
Message-ID: <199911181659.LAA04303@eric.cnri.reston.va.us>

[Responding to some lingering mails]

[/F]
> >     >>> u = unicode("å i åa ä e ö", "iso-latin-1")
> >     >>> s = u.encode("html-entities")
> >     >>> d = decoder("html-entities")
> >     >>> d.decode(s[:-1])
> >     "å i åa ä e "
> >     >>> d.flush()
> >     "ö"

[MAL]
> Ah, ok. So the .flush() method checks for proper
> string endings and then either returns the remaining
> input or raises an error.

No, please.  See my previous post on flush().

> > input: read chunks of data, decode, and
> > keep extra data in a local buffer.
> > 
> > output: encode data into suitable chunks,
> > and write to the output stream (that's why
> > there's a buffersize argument to encode --
> > if someone writes a 10mb unicode string to
> > an encoded stream, python shouldn't allocate
> > an extra 10-30 megabytes just to be able to
> > encode the darn thing...)
> 
> So the stream codecs would be wrappers around the
> string codecs.

No -- the other way around.  Think of the stream encoder as a little
FSM engine that you feed with unicode characters and which sends bytes
to the backend stream.  When a unicode character comes in that
requires a particular shift state, and the FSM isn't in that shift
state, it emits the escape sequence to enter that shift state first.
It should use standard buffered writes to the output stream; i.e. one
call to feed the encoder could cause several calls to write() on the
output stream, or vice versa (if you fed the encoder a single
character it might keep it in its own buffer).  That's all up to the
codec implementation.

The flush() forces the FSM into the "neutral" shift state, possibly
writing an escape sequence to leave the current shift state, and
empties the internal buffer.

The string codec CONCEPTUALLY uses the stream codec to a cStringIO
object, using flush() to force the final output.  However the
implementation may take a shortcut.  For stateless encodings the
stream codec may call on the string codec, but that's all an
implementation issue.
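
As a sketch of that conceptual relationship (io.BytesIO standing in for
the cStringIO of the day; make_stream_encoder is whatever stream-writer
factory the codec provides -- nothing here is a fixed API):

import io

def encode_string(make_stream_encoder, text):
    buf = io.BytesIO()
    encoder = make_stream_encoder(buf)
    encoder.write(text)
    encoder.flush()          # force the neutral shift state, empty the buffers
    return buf.getvalue()    # the complete encoding of 'text'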

For input, things are slightly different (you don't know how much
encoded data you must read to give you N Unicode characters, so you
may have to make a guess and hold on to some data that you read
unnecessarily -- either in encoded form or in Unicode form, at the
discretion of the implementation).  Using seek() on the input stream is
forbidden (it could be a pipe or socket).

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@CNRI.Reston.VA.US  Thu Nov 18 17:11:51 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 12:11:51 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: Your message of "Thu, 18 Nov 1999 10:39:30 +0100."
 <3833C952.C6F154B1@lemburg.com>
References: <000101bf3173$f9805340$c0a0143f@tim>
 <3833C952.C6F154B1@lemburg.com>
Message-ID: <199911181711.MAA04342@eric.cnri.reston.va.us>

> > > Now how should we define ur"abc\u1234\n"  ... ?
> > 
> > If strings carried an encoding tag with them, the obvious answer is that
> > this acts exactly like r"abc\u1234\n" acts today except gets a
> > "unicode-escaped" encoding tag instead of a "[whatever the default is
> > today]" encoding tag.
> > 
> > If strings don't carry an encoding tag with them, you're in a bit of a
> > pickle:  you'll have to convert it to a regular string or a Unicode string,
> > but in either case have no way to communicate that it may need further
> > processing; i.e., no way to distinguish it from a regular or Unicode string
> > produced by any other mechanism.  The code I posted yesterday remains my
> > best answer to that unpleasant puzzle (i.e., produce a Unicode string,
> > fiddling with backslashes just enough to get the \u escapes expanded, in the
> > same way Java's (conceptual) preprocessor does it).
> 
> They don't have such tags... so I guess we're in trouble ;-)
> 
> I guess to make ur"" have a meaning at all, we'd need to go
> the Java preprocessor way here, i.e. scan the string *only*
> for \uXXXX sequences, decode these and convert the rest as-is
> to Unicode ordinals.
> 
> Would that be ok ?

Read Tim's code (posted about 40 messages ago in this list).

Like Java, it interprets \u.... when the number of backslashes is odd,
but not when it's even.  So \\u.... returns exactly that, while
\\\u.... returns two backslashes and a unicode character.

This is nice and can be done regardless of whether we are going to
interpret other \ escapes or not.
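
A rough sketch of that odd/even rule (the helper name is made up):

import re

_u_escape = re.compile(r'(\\+)u([0-9a-fA-F]{4})')

def expand_u_escapes(s):
    def repl(m):
        slashes, hexdigits = m.group(1), m.group(2)
        if len(slashes) % 2:          # odd: the last backslash starts \uXXXX
            return slashes[:-1] + chr(int(hexdigits, 16))
        return m.group(0)             # even: leave the text untouched
    return _u_escape.sub(repl, s)

# expand_u_escapes(r'\u0041')   -> 'A'
# expand_u_escapes(r'\\u0041')  -> r'\\u0041'  (even: unchanged)
# expand_u_escapes(r'\\\u0041') -> r'\\' + 'A'  (two literal backslashes, then the character)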

--Guido van Rossum (home page: http://www.python.org/~guido/)


From skip@mojam.com (Skip Montanaro)  Thu Nov 18 17:34:51 1999
From: skip@mojam.com (Skip Montanaro)
Date: Thu, 18 Nov 1999 11:34:51 -0600 (CST)
Subject: [Python-Dev] just say no...
In-Reply-To: <000401bf30d8$6cf30bc0$a42d153f@tim>
References: <383156DF.2209053F@lemburg.com>
 <000401bf30d8$6cf30bc0$a42d153f@tim>
Message-ID: <14388.14523.158050.594595@dolphin.mojam.com>

    >> FYI, the next version of the proposal ...  File objects opened in
    >> text mode will use "t#" and binary ones use "s#".

    Tim> Am I the only one who sees magical distinctions between text and
    Tim> binary mode as a Really Bad Idea? 

No.

    Tim> I wouldn't have guessed the Unix natives here would quietly
    Tim> acquiesce to importing a bit of Windows madness .

We figured you and Guido would come to our rescue... ;-)

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...


From mal@lemburg.com  Thu Nov 18 18:15:54 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 19:15:54 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.7
References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com>
Message-ID: <3834425A.8E9C3B7E@lemburg.com>

FYI, I've uploaded a new version of the proposal which includes
new codec APIs, a new codec search mechanism and some minor
fixes here and there.

The latest version of the proposal is available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

    · Unicode objects support for %-formatting

    · Design of the internal C API and the Python API for
      the Unicode character properties database

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 18 18:32:49 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 19:32:49 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf3173$f9805340$c0a0143f@tim>
 <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us>
Message-ID: <38344651.960878A2@lemburg.com>

Guido van Rossum wrote:
> 
> > I guess to make ur"" have a meaning at all, we'd need to go
> > the Java preprocessor way here, i.e. scan the string *only*
> > for \uXXXX sequences, decode these and convert the rest as-is
> > to Unicode ordinals.
> >
> > Would that be ok ?
> 
> Read Tim's code (posted about 40 messages ago in this list).

I did, but wasn't sure whether he was arguing for going the
Java way...
 
> Like Java, it interprets \u.... when the number of backslashes is odd,
> but not when it's even.  So \\u.... returns exactly that, while
> \\\u.... returns two backslashes and a unicode character.
> 
> This is nice and can be done regardless of whether we are going to
> interpret other \ escapes or not.

So I'll take that as: this is what we want in Python too :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From mal@lemburg.com  Thu Nov 18 18:38:41 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 19:38:41 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf3173$f9805340$c0a0143f@tim>
 <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us>
Message-ID: <383447B1.1B7B594C@lemburg.com>

Would this definition be fine ?
"""

  u = ur''

The 'raw-unicode-escape' encoding is defined as follows:

· \uXXXX sequences represent the U+XXXX Unicode character if and
  only if the number of leading backslashes is odd

· all other characters represent themselves as Unicode ordinals
  (e.g. 'b' -> U+0062)

"""

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From guido@CNRI.Reston.VA.US  Thu Nov 18 18:46:35 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 13:46:35 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Thu, 18 Nov 1999 11:34:51 CST."
 <14388.14523.158050.594595@dolphin.mojam.com>
References: <383156DF.2209053F@lemburg.com> <000401bf30d8$6cf30bc0$a42d153f@tim>
 <14388.14523.158050.594595@dolphin.mojam.com>
Message-ID: <199911181846.NAA04547@eric.cnri.reston.va.us>

>     >> FYI, the next version of the proposal ...  File objects opened in
>     >> text mode will use "t#" and binary ones use "s#".
> 
>     Tim> Am I the only one who sees magical distinctions between text and
>     Tim> binary mode as a Really Bad Idea? 
> 
> No.
> 
>     Tim> I wouldn't have guessed the Unix natives here would quietly
>     Tim> acquiesce to importing a bit of Windows madness .
> 
> We figured you and Guido would come to our rescue... ;-)

Don't count on me.  My brain is totally cross-platform these days, and
writing "rb" or "wb" for files containing binary data is second nature
for me.  I actually *like* it.

Anyway, the Unicode stuff ought to have a wrapper open(filename, mode,
encoding) where the 'b' will be added to the mode if you don't give it
and it's needed.
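
Something along these lines, purely as a sketch (the name and the codec
hookup are assumptions; only the mode fiddling is the point here):

def open_encoded(filename, mode='r', encoding=None):
    # add 'b' when an encoding is given and the caller didn't already ask
    # for binary mode -- encoded data always hits the file as bytes
    if encoding is not None and 'b' not in mode:
        mode = mode + 'b'
    stream = open(filename, mode)
    if encoding is None:
        return stream
    # here the stream would be wrapped in the appropriate StreamWriter or
    # StreamReader for the encoding (e.g. via a lookup() along the lines
    # sketched earlier in this thread); omitted to keep the sketch short
    return stream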

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@CNRI.Reston.VA.US  Thu Nov 18 18:50:20 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 13:50:20 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: Your message of "Thu, 18 Nov 1999 19:32:49 +0100."
 <38344651.960878A2@lemburg.com>
References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us>
 <38344651.960878A2@lemburg.com>
Message-ID: <199911181850.NAA04576@eric.cnri.reston.va.us>

> > Like Java, it interprets \u.... when the number of backslashes is odd,
> > but not when it's even.  So \\u.... returns exactly that, while
> > \\\u.... returns two backslashes and a unicode character.
> > 
> > This is nice and can be done regardless of whether we are going to
> > interpret other \ escapes or not.
> 
> So I'll take that as: this is what we want in Python too :-)

I'll reserve judgement until we've got some experience with it in the
field, but it seems the best compromise.  It also gives a clear
explanation about why we have \uXXXX when we already have \xXXXX.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@CNRI.Reston.VA.US  Thu Nov 18 18:57:36 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 13:57:36 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: Your message of "Thu, 18 Nov 1999 19:38:41 +0100."
 <383447B1.1B7B594C@lemburg.com>
References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us>
 <383447B1.1B7B594C@lemburg.com>
Message-ID: <199911181857.NAA04617@eric.cnri.reston.va.us>

> Would this definition be fine ?
> """
> 
>   u = ur''
> 
> The 'raw-unicode-escape' encoding is defined as follows:
> 
> · \uXXXX sequences represent the U+XXXX Unicode character if and
>   only if the number of leading backslashes is odd
> 
> · all other characters represent themselves as Unicode ordinals
>   (e.g. 'b' -> U+0062)
> 
> """

Yes.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From skip@mojam.com (Skip Montanaro)  Thu Nov 18 19:09:46 1999
From: skip@mojam.com (Skip Montanaro)
Date: Thu, 18 Nov 1999 13:09:46 -0600 (CST)
Subject: [Python-Dev] Unicode Proposal: Version 0.7
In-Reply-To: <3834425A.8E9C3B7E@lemburg.com>
References: <382C0A54.E6E8328D@lemburg.com>
 <382D625B.DC14DBDE@lemburg.com>
 <38316685.7977448D@lemburg.com>
 <3834425A.8E9C3B7E@lemburg.com>
Message-ID: <14388.20218.294814.234327@dolphin.mojam.com>

I haven't been following this discussion closely at all, and have no
previous experience with Unicode, so please pardon a couple stupid questions
from the peanut gallery:

    1. What does U+0061 mean (other than 'a')?  That is, what is U?

    2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter
       description.  Given a Unicode object with encoding e1, how do I write
       it to a file that is to be encoded with encoding e2?  Seems like I
       would do something like

           u1 = unicode(s, encoding=e1)
	   f = open("somefile", "wb")
	   u2 = unicode(u1, encoding=e2)
	   f.write(u2)

       Is that how it would be done?  Does this question even make sense?

    3. What will the impact be on programmers such as myself currently
       living with blinders on (that is, writing in plain old 7-bit ASCII)?

Thx,

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...


From jim@interet.com  Thu Nov 18 19:23:53 1999
From: jim@interet.com (James C. Ahlstrom)
Date: Thu, 18 Nov 1999 14:23:53 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us>
Message-ID: <38345249.4AFD91DA@interet.com>

Guido van Rossum wrote:
>
> Let's first complete the requirements gathering.

Yes.

> Are these
> requirements reasonable?  Will they make an implementation too
> complex?

I think you can get 90% of where you want to be with something
much simpler.  And the simpler implementation will be useful in
the 100% solution, so it is not wasted time.

How about if we just design a Python archive file format; provide
code in the core (in Python or C) to import from it; provide a
Python program to create archive files; and provide a Standard
Directory to put archives in so they can be found quickly.  For
extensibility and control, we add functions to the imp module.
Detailed comments follow:


> Compatibility issues:
> ---------------------
> [list of current features...]

Easily met by keeping the current C code.

> 
> New features:
> -------------
> 
> - Integrated support for Greg Ward's distribution utilities (i.e. a
>   module prepared by the distutil tools should install painlessly)
> 
> - Good support for prospective authors of "all-in-one" packaging tool
>   authors like Gordon McMillan's win32 installer or /F's squish.  (But
>   I *don't* require backwards compatibility for existing tools.)

These tools go well beyond just an archive file format, but hopefully
a file format will help.  Greg and Gordon should be able to control the
format so it meets their needs.  We need a standard format.
 
> - Standard import from zip or jar files, in two ways:
> 
>   (1) an entry on sys.path can be a zip/jar file instead of a directory;
>       its contents will be searched for modules or packages
> 
>   (2) a file in a directory that's on sys.path can be a zip/jar file;
>       its contents will be considered as a package (note that this is
>       different from (1)!)

I don't like sys.path at all.  It is currently part of the problem.
I suggest that archive files MUST be put into a known directory.
On Windows this is the directory of the executable, sys.executable.
On Unix this $PREFIX plus version, namely
  "%s/lib/python%s/" % (sys.prefix, sys.version[0:3]).
Other platforms can have different rules.

We should also have the ability to append archive files to the
executable or a shared library assuming the OS allows this
(Windows and Linux do allow it).  This is the first location
searched, nails the archive to the interpreter, insulates us
from an erroneous sys.path, and enables single-file Python programs.

>   I don't particularly care about supporting all zip compression
>   schemes; if Java gets away with only supporting gzip compression
>   in jar files, so can we.

We don't need compression.  The whole ./Lib is 1.2 Meg, and if we compress
it to zero we save a Meg.  Irrelevant.  Installers provide compression
anyway so when Python programs are shipped, they will be compressed
then.

Problems are that Python does not ship with compression, we will
have to add it, we will have to support it and its current method
of compression forever, and it adds complexity.
 
> - Easy ways to subclass or augment the import mechanism along
>   different dimensions.  For example, while none of the following
>   features should be part of the core implementation, it should be
>   easy to add any or all:
>
>  [ List of new features including hooks...]

Sigh, this proposal does not provide for this.  It seems
like a job for imputil.  But if the file format and import code
is available from the imp module, it can be used as part of the
solution.

>   - support for a new compression scheme to the zip importer

I guess compression should be easy to add if Python ships with
a compression module.
 
>   - a cache for file locations in directories/archives, to improve
>     startup time

If the Python library is available as an archive, I think
startup will be greatly improved anyway.
 
> Implementation:
> ---------------
> 
> - There must clearly be some code in C that can import certain
>   essential modules (to solve the chicken-or-egg problem), but I don't
>   mind if the majority of the implementation is written in Python.
>   Using Python makes it easy to subclass.

Yes.
 
> - In order to support importing from zip/jar files using compression,
>   we'd at least need the zlib extension module and hence libz itself,
>   which may not be available everywhere.

That's a good reason to omit compression.  At least for now.
 
> - I suppose that the bootstrap is solved using a mechanism very
>   similar to what freeze currently used (other solutions seem to be
>   platform dependent).

Yes, except that we need to be careful to preserve the freeze feature
for users.  We don't want to take it over.
 
> - I also want to still support importing *everything* from the
>   filesystem, if only for development.  (It's hard enough to deal with
>   the fact that exceptions.py is needed during Py_Initialize();
>   I want to be able to hack on the import code written in Python
>   without having to rebuild the executable all the time.

Yes, we need a function in imp to turn archives off:
  import imp
  imp.archiveEnable(0)
 
> Finally, to what extent does this impact the desire for dealing
> differently with the Python bytecode compiler (e.g. supporting
> optimizers written in Python)?  And does it affect the desire to
> implement the read-eval-print loop (the >>> prompt) in Python?

I don't think it impacts these at all.

Jim Ahlstrom


From guido@CNRI.Reston.VA.US  Thu Nov 18 19:55:02 1999
From: guido@CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 14:55:02 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
In-Reply-To: Your message of "Thu, 18 Nov 1999 14:23:53 EST."
 <38345249.4AFD91DA@interet.com>
References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us>
 <38345249.4AFD91DA@interet.com>
Message-ID: <199911181955.OAA04830@eric.cnri.reston.va.us>

> I think you can get 90% of where you want to be with something
> much simpler.  And the simpler implementation will be useful in
> the 100% solution, so it is not wasted time.

Agreed, but I'm not sure that it addresses the problems that started
this thread.  I can't really tell, since the message starting the
thread just requested imputil, without saying which parts of it were
needed.  A followup claimed that imputil was a fine prototype but too
slow for real work.

I inferred that flexibility was requested.  But maybe that was
projection since that was on my own list.  (I'm happy with the
performance and find manipulating zip or jar files clumsy, so I'm not
too concerned about all the nice things you can *do* with that
flexibility. :-)

> How about if we just design a Python archive file format; provide
> code in the core (in Python or C) to import from it; provide a
> Python program to create archive files; and provide a Standard
> Directory to put archives in so they can be found quickly.  For
> extensibility and control, we add functions to the imp module.
> Detailed comments follow:

> These tools go well beyond just an archive file format, but hopefully
> a file format will help.  Greg and Gordon should be able to control the
> format so it meets their needs.  We need a standard format.

I think the standard format should be a subclass of zip or jar (which
is itself a subclass of zip).  We have already written (at CNRI, as
yet unreleased) the necessary Python tools to manipulate zip archives;
moreover 3rd party tools are abundantly available, both on Unix and on
Windows (as well as in Java).  Zip files also lend themselves to
self-extracting archives and similar things, because the file index is
at the end, so I think that Greg & Gordon should be happy.
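
Just to illustrate how little Python-side machinery a zip-based archive
needs, a rough sketch that pulls one module out of such an archive (this
uses the zipfile module purely for illustration and is not the proposed
import hook):

import types, zipfile

def load_from_archive(archive, modname):
    zf = zipfile.ZipFile(archive)
    # packages map to subdirectories inside the archive, modules to .py files
    relpath = modname.replace('.', '/') + '.py'
    source = zf.read(relpath)
    mod = types.ModuleType(modname)
    mod.__file__ = archive + '/' + relpath
    code = compile(source, mod.__file__, 'exec')
    exec(code, mod.__dict__)
    return mod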

> I don't like sys.path at all.  It is currently part of the problem.

Eh?  That's the first thing I hear something bad about it.  Maybe
that's because you live on Windows -- on Unix, search paths are
ubiquitous.

> I suggest that archive files MUST be put into a known directory.

Why?  Maybe this works on Windows; on Unix this is asking for trouble
because it prevents users from augmenting the installation provided by
the sysadmin.  Even on newer Windows versions, users without admin
perms may not be allowed to add files to that privileged directory.

> On Windows this is the directory of the executable, sys.executable.
> On Unix this $PREFIX plus version, namely
>   "%s/lib/python%s/" % (sys.prefix, sys.version[0:3]).
> Other platforms can have different rules.
> 
> We should also have the ability to append archive files to the
> executable or a shared library assuming the OS allows this
> (Windows and Linux do allow it).  This is the first location
> searched, nails the archive to the interpreter, insulates us
> from an erroneous sys.path, and enables single-file Python programs.

OK for the executable.  I'm not sure what the point is of appending an
archive to the shared library?  Anyway, does it matter (on Windows) if
you add it to python16.dll or to python.exe?

> We don't need compression.  The whole ./Lib is 1.2 Meg, and if we
> compress
> it to zero we save a Meg.  Irrelevant.  Installers provide compression
> anyway so when Python programs are shipped, they will be compressed
> then.
> 
> Problems are that Python does not ship with compression, we will
> have to add it, we will have to support it and its current method
> of compression forever, and it adds complexity.

OK, OK.  I think most zip tools have a way to turn off the
compression.  (Anyway, it's a matter of more I/O time vs. more CPU
time; hardware for both is getting better faster than we can tweak the
code :-)

> Sigh, this proposal does not provide for this.  It seems
> like a job for imputil.  But if the file format and import code
> is available from the imp module, it can be used as part of the
> solution.

Well, the question is really if we want flexibility or archive files.
I care more about the flexibility.  If we get a clear vote for archive
files, I see no problem with implementing that first.

> If the Python library is available as an archive, I think
> startup will be greatly improved anyway.

Really?  I know about all the system calls it makes, but I don't
really see much of a delay -- I have a prompt in well under 0.1
second.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From gstein@lyra.org  Thu Nov 18 22:03:55 1999
From: gstein@lyra.org (Greg Stein)
Date: Thu, 18 Nov 1999 14:03:55 -0800 (PST)
Subject: [Python-Dev] file modes (was: just say no...)
In-Reply-To: <3833B588.1E31F01B@lemburg.com>
Message-ID: 

On Thu, 18 Nov 1999, M.-A. Lemburg wrote:
> Tim Peters wrote:
> > [MAL]
> > > File objects opened in text mode will use "t#" and binary
> > > ones use "s#".
> > 
> > [Greg Stein]
> > > ...
> > > The real annoying thing would be to assume that opening a file as 'r'
> > > means that I *meant* text mode and to start using "t#".
> > 
> > Isn't that exactly what MAL said would happen?  Note that a "t" flag for
> > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't
> > either; a lone "r" has always meant text mode.
> 
> Em, I think you've got something wrong here: "t#" refers to the
> parsing marker used for writing data to files opened in text mode.

Nope. We've got it right :-)

Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to
refer to the parse marker.

>...
> I guess you won't notice any difference: strings define both
> interfaces ("s#" and "t#") to mean the same thing. Only other
> buffer compatible types may now fail to write to text files
> -- which is not so bad, because it forces the programmer to
> rethink what he really intended when opening the file in text
> mode.

It *is* bad if it breaks my existing programs in subtle ways that are a
bitch to track down.

> Besides, if you are writing portable scripts you should pay
> close attention to "r" vs. "rb" anyway.

I'm not writing portable scripts. I mentioned that once before. I don't
want a difference between 'r' and 'rb' on my Linux box. It was never there
before, I'm lazy, and I don't want to see it added :-).

Honestly, I don't know offhand of any Python types that respond to "s#" and
"t#" in different ways, such that changing file.write would end up writing
something different (and thereby breaking existing code).

I just don't like introducing text/binary to *nix platforms where it didn't
exist before.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From skip@mojam.com (Skip Montanaro)  Thu Nov 18 22:15:43 1999
From: skip@mojam.com (Skip Montanaro)
Date: Thu, 18 Nov 1999 16:15:43 -0600 (CST)
Subject: [Python-Dev] file modes (was: just say no...)
In-Reply-To: 
References: <3833B588.1E31F01B@lemburg.com>
 
Message-ID: <14388.31375.296388.973848@dolphin.mojam.com>

    Greg> I'm not writing portable scripts. I mentioned that once before. I
    Greg> don't want a difference between 'r' and 'rb' on my Linux box. It
    Greg> was never there before, I'm lazy, and I don't want to see it added
    Greg> :-).

    ...

    Greg> I just don't like introduce text/binary to *nix platforms where it
    Greg> didn't exist before.

I'll vote with Greg, Guido's cross-platform conversion not withstanding.  If
I haven't been writing portable scripts up to this point because I only care
about a single target platform, why break my scripts for me?  Forcing me to
use "rb" or "wb" on my open calls isn't going to make them portable anyway.

There are probably many other harder to identify and correct portability
issues than binary file access anyway.  Seems like requiring "b" is just
going to cause gratuitous breakage with no obvious increase in portability.

porta-nanny.py-anyone?-ly y'rs,

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...


From jim@interet.com  Thu Nov 18 22:40:05 1999
From: jim@interet.com (James C. Ahlstrom)
Date: Thu, 18 Nov 1999 17:40:05 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us>
 <38345249.4AFD91DA@interet.com> <199911181955.OAA04830@eric.cnri.reston.va.us>
Message-ID: <38348045.BB95F783@interet.com>

Guido van Rossum wrote:

> I think the standard format should be a subclass of zip or jar (which
> is itself a subclass of zip).  We have already written (at CNRI, as
> yet unreleased) the necessary Python tools to manipulate zip archives;
> moreover 3rd party tools are abundantly available, both on Unix and on
> Windows (as well as in Java).  Zip files also lend themselves to
> self-extracting archives and similar things, because the file index is
> at the end, so I think that Greg & Gordon should be happy.

Think about multiple packages in multiple zip files.  The zip files
store file directories.  That means we would need a sys.zippath to
search the zip files.  I don't want another PYTHONPATH phenomenon.

Greg Stein and I once discussed this (and Gordon I think).  They
argued that the directories should be flattened.  That is, think of
all directories which can be reached on PYTHONPATH.  Throw
away all initial paths.  The resultant archive has *.pyc at the top level,
as well as package directories only.  The search path is "." in every
archive file.  No directory information is stored, only module names,
some with dots.
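
A sketch of building that flattened layout (the layout is as described
above; the script itself is only an illustration using today's stdlib):

import os, zipfile

def build_flat_archive(libdir, archive):
    zf = zipfile.ZipFile(archive, 'w')
    for root, dirs, files in os.walk(libdir):
        for name in files:
            if name.endswith('.py') or name.endswith('.pyc'):
                path = os.path.join(root, name)
                # store relative to libdir: top-level modules land at the
                # root of the archive, packages as plain subdirectories
                arcname = os.path.relpath(path, libdir).replace(os.sep, '/')
                zf.write(path, arcname)
    zf.close()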
 
> > I don't like sys.path at all.  It is currently part of the problem.
> 
> Eh?  That's the first thing I hear something bad about it.  Maybe
> that's because you live on Windows -- on Unix, search paths are
> ubiquitous.

On windows, just print sys.path.  It is junk.  A commercial
distribution has to "just work", and it fails if a second installation
(by someone else) changes PYTHONPATH to suit their app.  I am trying
to get to "just works", no excuses, no complications.
 
> > I suggest that archive files MUST be put into a known directory.
> 
> Why?  Maybe this works on Windows; on Unix this is asking for trouble
> because it prevents users from augmenting the installation provided by
> the sysadmin.  Even on newer Windows versions, users without admin
> perms may not be allowed to add files to that privileged directory.

It works on Windows because programs install themselves in their own
subdirectories, and can put files there instead of /windows/system32.
This holds true for Windows 2000 also.  A Unix-style installation
to /windows/system32 would (may?) require "administrator" privilege.

On Unix you are right.  I didn't think of that because I am the Unix
sysadmin here, so I can put things where I want.  The Windows
solution doesn't fit with Unix, because executables go in a ./bin
directory and putting library files there is a no-no.  Hmmmm...
This needs more thought.  Anyone else have ideas??

> > We should also have the ability to append archive files to the
> > executable or a shared library assuming the OS allows this
>
> OK for the executable.  I'm not sure what the point is of appending an
> archive to the shared library?  Anyway, does it matter (on Windows) if
> you add it to python16.dll or to python.exe?

The point of using python16.dll is to append the Python library to
it, and append to python.exe (or use files) for everything else.
That way, the 1.6 interpreter is linked to the 1.6 Lib, upgrading
to 1.7 means replacing only one file, and there is no wasted storage
in multiple Lib's.  I am thinking of multiple Python programs in
different directories.

But maybe you are right.  On Windows, if python.exe can be put in
/windows/system32 then it really doesn't matter.

> OK, OK.  I think most zip tools have a way to turn off the
> compression.  (Anyway, it's a matter of more I/O time vs. more CPU
> time; hardare for both is getting better faster than we can tweak the
> code :-)

Well, if Python now has its own compression that is built
in and comes with it, then that is different.  Maybe compression
is OK.
 
> Well, the question is really if we want flexibility or archive files.
> I care more about the flexibility.  If we get a clear vote for archive
> files, I see no problem with implementing that first.

I don't like flexibility, I like standardization and simplicity.
Flexibility just encourages users to do the wrong thing.

Everyone vote please.  I don't have a solid feeling about
what people want, only what they don't like.
 
> > If the Python library is available as an archive, I think
> > startup will be greatly improved anyway.
> 
> Really?  I know about all the system calls it makes, but I don't
> really see much of a delay -- I have a prompt in well under 0.1
> second.

So do I.  I guess I was just echoing someone else's complaint.

JimA


From mal@lemburg.com  Thu Nov 18 23:28:31 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 00:28:31 +0100
Subject: [Python-Dev] file modes (was: just say no...)
References: 
Message-ID: <38348B9F.A31B09C4@lemburg.com>

Greg Stein wrote:
> 
> On Thu, 18 Nov 1999, M.-A. Lemburg wrote:
> > Tim Peters wrote:
> > > [MAL]
> > > > File objects opened in text mode will use "t#" and binary
> > > > ones use "s#".
> > >
> > > [Greg Stein]
> > > > ...
> > > > The real annoying thing would be to assume that opening a file as 'r'
> > > > means that I *meant* text mode and to start using "t#".
> > >
> > > Isn't that exactly what MAL said would happen?  Note that a "t" flag for
> > > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't
> > > either; a lone "r" has always meant text mode.
> >
> > Em, I think you've got something wrong here: "t#" refers to the
> > parsing marker used for writing data to files opened in text mode.
> 
> Nope. We've got it right :-)
> 
> Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to
> refer to the parse marker.

Ah, ok. But "t" as file opener is non-portable anyways, so I'll
skip it here :-)
 
> >...
> > I guess you won't notice any difference: strings define both
> > interfaces ("s#" and "t#") to mean the same thing. Only other
> > buffer compatible types may now fail to write to text files
> > -- which is not so bad, because it forces the programmer to
> > rethink what he really intended when opening the file in text
> > mode.
> 
> It *is* bad if it breaks my existing programs in subtle ways that are a
> bitch to track down.
> 
> > Besides, if you are writing portable scripts you should pay
> > close attention to "r" vs. "rb" anyway.
> 
> I'm not writing portable scripts. I mentioned that once before. I don't
> want a difference between 'r' and 'rb' on my Linux box. It was never there
> before, I'm lazy, and I don't want to see it added :-).
> 
> Honestly, I don't know offhand of any Python types that repond to "s#" and
> "t#" in different ways, such that changing file.write would end up writing
> something different (and thereby breaking existing code).
> 
> I just don't like introduce text/binary to *nix platforms where it didn't
> exist before.

Please remember that up until now you were probably only using
strings to write to files. Python strings don't differentiate
between "t#" and "s#" so you won't see any change in function
or find subtle errors being introduced.

If you are already using the buffer feature for e.g. arrays, which
also implement "s#" but don't support "t#" for obvious reasons,
you'll run into trouble, but then: arrays are binary data,
so changing from text mode to binary mode is well worth the
effort even if you just consider it a nuisance.

Since the buffer interface and its consequences haven't been published
yet, there are probably very few users out there who would
actually run into any problems. And even if they do, it's a
good chance to catch subtle bugs which would only have shown
up when trying to port to another platform.

I'll leave the rest for Guido to answer, since it was his idea ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Thu Nov 18 23:41:32 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 00:41:32 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.7
References: <382C0A54.E6E8328D@lemburg.com>
 <382D625B.DC14DBDE@lemburg.com>
 <38316685.7977448D@lemburg.com>
 <3834425A.8E9C3B7E@lemburg.com> <14388.20218.294814.234327@dolphin.mojam.com>
Message-ID: <38348EAC.82B41A4D@lemburg.com>

Skip Montanaro wrote:
> 
> I haven't been following this discussion closely at all, and have no
> previous experience with Unicode, so please pardon a couple stupid questions
> from the peanut gallery:
> 
>     1. What does U+0061 mean (other than 'a')?  That is, what is U?

U+XXXX means the Unicode character with ordinal hex number XXXX. It is
basically just another way to say: hey, I want the Unicode character
at position 0xXXXX in the Unicode spec.
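
For illustration (assuming the u'...' literal notation with \uXXXX
escapes from the proposal; just a sketch):

    u"\u0061" == u"a"   # both denote the Unicode character at position 0x0061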
 
>     2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter
>        description.  Given a Unicode object with encoding e1, how do I write
>        it to a file that is to be encoded with encoding e2?  Seems like I
>        would do something like
> 
>            u1 = unicode(s, encoding=e1)
>            f = open("somefile", "wb")
>            u2 = unicode(u1, encoding=e2)
>            f.write(u2)
> 
>        Is that how it would be done?  Does this question even make sense?

The unicode() constructor converts all input to Unicode as
a basis for other conversions. In the above example, s would be
converted to Unicode using the assumption that the bytes in
s represent characters encoded using the encoding given in e1.
The line with u2 would raise a TypeError, because u1 is not
a string. To convert a Unicode object u1 to another encoding,
you would have to call the .encode() method with the intended
new encoding. The Unicode object will then take care of the
conversion of its internal Unicode data into a string using
the given encoding, e.g. you'd write:

f.write(u1.encode(e2))
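
So a corrected version of the example would look roughly like this
(a minimal sketch assuming the proposed constructor/.encode() API;
s, e1 and e2 are the placeholders from the question above):

    # decode the byte string s, which is encoded in e1, into Unicode
    u1 = unicode(s, encoding=e1)

    # encode the Unicode data using e2 and write the resulting bytes
    f = open("somefile", "wb")
    f.write(u1.encode(e2))
    f.close()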
 
>     3. What will the impact be on programmers such as myself currently
>        living with blinders on (that is, writing in plain old 7-bit ASCII)?

If you don't want your scripts to know about Unicode, nothing
will really change. If you do use e.g. Latin-1 characters
in your string literals, you are asked to include a pragma
in the comment lines at the beginning of the script (so that
programmers viewing your code under another encoding have a chance
to figure out what you've written).

Here's the text from the proposal:
"""
Note that you should provide some hint to the encoding you used to
write your programs as a pragma line in one of the first few comment
lines of the source file (e.g. '# source file encoding: latin-1'). If
you only use 7-bit ASCII then everything is fine and no such notice is
needed, but if you include Latin-1 characters not defined in ASCII, it
may well be worthwhile including a hint since people in other
countries will want to be able to read your source strings too.
"""

Other than that you can continue to use normal strings like
you always have.

Hope that clarifies things at least a bit,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mhammond@skippinet.com.au  Fri Nov 19 00:27:09 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 19 Nov 1999 11:27:09 +1100
Subject: [Python-Dev] file modes (was: just say no...)
In-Reply-To: <38348B9F.A31B09C4@lemburg.com>
Message-ID: <003401bf3224$d231be30$0501a8c0@bobcat>

[MAL]

> If you are already using the buffer feature for types such as
> arrays, which implement "s#" but don't support "t#" for obvious
> reasons, you'll run into trouble; but then arrays are binary data,
> so switching from text mode to binary mode is well worth the
> effort, even if you just consider it a nuisance.

Breaking existing code that works should be considered more than a
nuisance.

However, one answer would be to have "t#" _prefer_ to use the text
buffer, but not insist on it.  E.g., the logic for processing "t#"
could check whether the text buffer is supported and, if not, fall
back to the blob buffer.

This should mean that all existing code still works, except for
objects that support both buffers to mean different things.  AFAIK
there are no objects that qualify today, so it should work fine.

Unix users _will_ need to revisit their thinking about "text mode" vs
"binary mode" when writing these new objects (such as Unicode), but
IMO that is more than reasonable - Unix users don't bother qualifying
the open mode of their files, simply because it has no effect on their
files.  If for certain objects or requirements there _is_ a
distinction, then new code can start to think these issues through.
"Portable File IO" will simply be extended from "portable among
all platforms" to "portable among all platforms and objects".

Mark.



From gmcm@hypernet.com  Fri Nov 19 02:23:44 1999
From: gmcm@hypernet.com (Gordon McMillan)
Date: Thu, 18 Nov 1999 21:23:44 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
In-Reply-To: <38348045.BB95F783@interet.com>
Message-ID: <1269144272-21594530@hypernet.com>

[Guido]
> > I think the standard format should be a subclass of zip or jar
> > (which is itself a subclass of zip).  We have already written
> > (at CNRI, as yet unreleased) the necessary Python tools to
> > manipulate zip archives; moreover 3rd party tools are
> > abundantly available, both on Unix and on Windows (as well as
> > in Java).  Zip files also lend themselves to self-extracting
> > archives and similar things, because the file index is at the
> > end, so I think that Greg & Gordon should be happy.

No problem (I created my own formats for relatively minor 
reasons).
 
[JimA]
> Think about multiple packages in multiple zip files.  The zip
> files store file directories.  That means we would need a
> sys.zippath to search the zip files.  I don't want another
> PYTHONPATH phenomenon.

What if sys.path looked like:
 [DirImporter('.'), ZlibImporter('c:/python/stdlib.pyz'), ...]
 
> Greg Stein and I once discussed this (and Gordon I think).  They
> argued that the directories should be flattened.  That is, think
> of all directories which can be reached on PYTHONPATH.  Throw
> away all initial paths.  The resultant archive has *.pyc at the
> top level, as well as package directories only.  The search path
> is "." in every archive file.  No directory information is
> stored, only module names, some with dots.

While I do flat archives (no dots, but that's a different story), 
there's no reason the archive couldn't be structured. Flat 
archives are definitely simpler.
 
[JimA]
> > > I don't like sys.path at all.  It is currently part of the
> > > problem.
[Guido] 
> > Eh?  That's the first thing I hear something bad about it. 
> > Maybe that's because you live on Windows -- on Unix, search
> > paths are ubiquitous.
> 
> On windows, just print sys.path.  It is junk.  A commercial
> distribution has to "just work", and it fails if a second
> installation (by someone else) changes PYTHONPATH to suit their
> app.  I am trying to get to "just works", no excuses, no
> complications.

		Py_Initialize ();
		PyRun_SimpleString ("import sys; del sys.path[1:]");

Yeah, there's a hole there. Fixable if you could do a little pre- 
Py_Initialize twiddling.
 
> > > I suggest that archive files MUST be put into a known
> > > directory.

No way. Hard code a directory? Overwrite someone else's 
Python "standalone"? Write to a C: partition that is 
deliberately sized to hold nothing but Windows? Make 
network installations impossible?
 
> > Why?  Maybe this works on Windows; on Unix this is asking for
> > trouble because it prevents users from augmenting the
> > installation provided by the sysadmin.  Even on newer Windows
> > versions, users without admin perms may not be allowed to add
> > files to that privileged directory.
> 
> It works on Windows because programs install themselves in their
> own subdirectories, and can put files there instead of
> /windows/system32. This holds true for Windows 2000 also.  A
> Unix-style installation to /windows/system32 would (may?) require
> "administrator" privilege.

There's nothing Unix-style about installing to 
/Windows/system32. 'Course *they* have symbolic links that 
actually work...
 
> On Unix you are right.  I didn't think of that because I am the
> Unix sysadmin here, so I can put things where I want.  The
> Windows solution doesn't fit with Unix, because executables go in
> a ./bin directory and putting library files there is a no-no. 
> Hmmmm... This needs more thought.  Anyone else have ideas??

The official Windows solution is stuff in the registry about app
paths and such. Putting the DLLs in the exe's directory is a
workaround which works and is more manageable than the
official solution.
 
> > > We should also have the ability to append archive files to
> > > the executable or a shared library assuming the OS allows
> > > this

That's a handy trick on Windows, but it's got nothing to do 
with Python.

> > Well, the question is really if we want flexibility or archive
> > files. I care more about the flexibility.  If we get a clear
> > vote for archive files, I see no problem with implementing that
> > first.
> 
> I don't like flexibility, I like standardization and simplicity.
> Flexibility just encourages users to do the wrong thing.

I've noticed that the people who think there should only be one 
way to do things never agree on what it is.
 
> Everyone vote please.  I don't have a solid feeling about
> what people want, only what they don't like.

Flexibility. You can put Christian's favorite Einstein quote here 
too.
 
> > > If the Python library is available as an archive, I think
> > > startup will be greatly improved anyway.
> > 
> > Really?  I know about all the system calls it makes, but I
> > don't really see much of a delay -- I have a prompt in well
> > under 0.1 second.
> 
> So do I.  I guess I was just echoing someone else's complaint.

Install some stuff. Deinstall some of it. Repeat (mixing up the 
order) until your registry and hard drive are shattered into tiny 
little fragments. It doesn't take long (there's lots of stuff a 
defragmenter can't touch once it's there).


- Gordon


From mal@lemburg.com  Fri Nov 19 09:08:44 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 10:08:44 +0100
Subject: [Python-Dev] file modes (was: just say no...)
References: <003401bf3224$d231be30$0501a8c0@bobcat>
Message-ID: <3835139C.344F3EEE@lemburg.com>

Mark Hammond wrote:
> 
> [MAL]
> 
> > If you are already using the buffer feature for types such as
> > arrays, which implement "s#" but don't support "t#" for obvious
> > reasons, you'll run into trouble; but then arrays are binary data,
> > so switching from text mode to binary mode is well worth the
> > effort, even if you just consider it a nuisance.
> 
> Breaking existing code that works should be considered more than a
> nuisance.

It's an error that's pretty easy to fix... that's what I was
referring to with "nuisance". All you have to do is open
the file in binary mode and you're done.

BTW, the change will only affect platforms that don't distinguish
between text and binary mode, e.g. Unix ones.
 
> However, one answer would be to have "t#" _prefer_ to use the text
> buffer, but not insist on it.  E.g., the logic for processing "t#"
> could check whether the text buffer is supported and, if not, fall
> back to the blob buffer.

I doubt that this conforms to what the buffer interface wants
to express: if the getcharbuf slot is not implemented, this means
"I am not text". If you write non-text to a text-mode file,
line breaks may be interpreted in ways that are incompatible
with the binary data, i.e. when you read the data back in, it may
fail to load because e.g. '\n' was converted to '\r\n'.
 
> This should mean that all existing code still works, except for
> objects that support both buffers to mean different things.  AFAIK
> there are no objects that qualify today, so it should work fine.

Well, even though the code would work, it might break badly
someday for the above reasons. Better to fix that now, while there
aren't too many affected cases around, than at some later point
where the user has to figure out the problem for himself because
the system never warned him about it.
 
> Unix users _will_ need to revisit their thinking about "text mode" vs
> "binary mode" when writing these new objects (such as Unicode), but
> IMO that is more than reasonable - Unix users don't bother qualifying
> the open mode of their files, simply because it has no effect on their
> files.  If for certain objects or requirements there _is_ a
> distinction, then new code can start to think these issues through.
> "Portable File IO" will simply be extended from "portable among
> all platforms" to "portable among all platforms and objects".

Right.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    42 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal@lemburg.com  Fri Nov 19 09:56:03 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 10:56:03 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us>
 <383427ED.45A01BBB@lemburg.com> <199911181637.LAA04260@eric.cnri.reston.va.us>
Message-ID: <38351EB3.153FCDFC@lemburg.com>

Guido van Rossum wrote:
> 
> > Like a path of search functions ? Not a bad idea... I will still
> > want the internal dict for caching purposes though. I'm not sure
> > how often these encodings will be used, but even a few hundred
> > function calls will slow down the Unicode implementation quite a bit.
> 
> Of course.  (It's like sys.modules caching the results of an import).

I've fixed the "path of search functions" approach in the latest
version of the spec.
 
> [...]
> >     def flush(self):
> >
> >       """ Flushed the codec buffers used for keeping state.
> >
> >           Returns values are not defined. Implementations are free to
> >           return None, raise an exception (in case there is pending
> >           data in the buffers which could not be decoded) or
> >           return any remaining data from the state buffers used.
> >
> >       """
> 
> I don't know where this came from, but a flush() should work like
> flush() on a file. 

It came from Fredrik's proposal.

> It doesn't return a value, it just sends any
> remaining data to the underlying stream (for output).  For input it
> shouldn't be supported at all.
> 
> The idea is that flush() should do the same to the encoder state that
> close() followed by a reopen() would do.  Well, more or less.  But if
> the process were to be killed right after a flush(), the data written
> to disk should be a complete encoding, and not have a lingering shift
> state.

Ok. I've modified the API as follows:

StreamWriter:
    def flush(self):

	""" Flushes and resets the codec buffers used for keeping state.

	    Calling this method should ensure that the data on the
	    output is put into a clean state, that allows appending
	    of new fresh data without having to rescan the whole
	    stream to recover state.

	"""
	pass

StreamReader:
    def read(self,chunksize=0):

	""" Decodes data from the stream self.stream and returns a tuple 
	    (Unicode object, bytes consumed).

	    chunksize indicates the approximate maximum number of
	    bytes to read from the stream for decoding purposes. The
	    decoder can modify this setting as appropriate. The default
	    value 0 indicates to read and decode as much as possible.
	    The chunksize is intended to prevent having to decode huge
	    files in one step.

	    The method should use a greedy read strategy meaning that
	    it should read as much data as is allowed within the
	    definition of the encoding and the given chunksize, e.g.
            if optional encoding endings or state markers are
	    available on the stream, these should be read too.

        """
	... the base class should provide a default implementation
	    of this method using self.decode ...

    def reset(self):

	""" Resets the codec buffers used for keeping state.

	    Note that no stream repositioning should take place.
	    This method is primarily intended to recover from
	    decoding errors.

	"""
	pass

The .reset() method replaces the .flush() method on StreamReaders.
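
As a usage illustration (just a sketch against the draft API above;
'reader' stands for some concrete StreamReader implementation wrapping
an open, binary-mode file):

    # decode the whole stream in one step
    text, nbytes = reader.read()

    # or ask for roughly the next 4K worth of input only, e.g. to
    # bound memory use when decoding huge files
    chunk, nbytes = reader.read(chunksize=4096)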

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    42 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mal@lemburg.com  Fri Nov 19 09:22:48 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 10:22:48 +0100
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us>
Message-ID: <383516E8.EE66B527@lemburg.com>

Guido van Rossum wrote:
>
> Let's first complete the requirements gathering.  Are these
> requirements reasonable?  Will they make an implementation too
> complex?  Am I missing anything?

Since you were asking: I would like functionality equivalent
to my latest import patch for a slightly different lookup scheme
for module import inside packages to become a core feature.

If it becomes a core feature I promise to never again start
threads about relative imports :-)

Here's the summary again:
"""
[The patch] changes the default import mechanism to work like this:

>>> import d # from directory a/b/c/
try a.b.c.d
try a.b.d
try a.d
try d
fail

instead of just doing the current two-level lookup:

>>> import d # from directory a/b/c/
try a.b.c.d
try d
fail

As a result, relative imports referring to higher level packages
work out of the box without any ugly underscores in the import name.
Plus the whole scheme is pretty simple to explain and straightforward.
"""

You can find the patch attached to the message "Walking up the package
hierarchy" in the python-dev mailing list archive.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    42 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From andy@robanal.demon.co.uk  Fri Nov 19 13:01:04 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Fri, 19 Nov 1999 05:01:04 -0800 (PST)
Subject: [Python-Dev] Codecs and StreamCodecs
Message-ID: <19991119130104.21726.rocketmail@ web605.yahoomail.com>


--- "M.-A. Lemburg"  wrote:
> Guido van Rossum wrote:
> > I don't know where this came from, but a flush()
> should work like
> > flush() on a file. 
> 
> It came from Fredrik's proposal.
> 
> > It doesn't return a value, it just sends any
> > remaining data to the underlying stream (for
> output).  For input it
> > shouldn't be supported at all.
> > 
> > The idea is that flush() should do the same to the
> encoder state that
> > close() followed by a reopen() would do.  Well,
> more or less.  But if
> > the process were to be killed right after a
> flush(), the data written
> > to disk should be a complete encoding, and not
> have a lingering shift
> > state.
> 
This could be useful in real life.  
For example, iso-2022-jp has a 'single-byte-mode'
and a 'double-byte-mode' with shift-sequences to
separate them.  The rule is that each line in the 
text file or email message or whatever must begin
and end in single-byte mode.  So I would take flush()
to mean 'shift back to ASCII now'.

Calling flush and reopen would thus "almost" get the
same data across.

I'm trying to think if it would be dangerous.  Do web
and ftp servers often call flush() in the middle of
transmitting a block of text?

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.



From fredrik@pythonware.com  Fri Nov 19 13:33:50 1999
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 19 Nov 1999 14:33:50 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <19991119130104.21726.rocketmail@ web605.yahoomail.com>
Message-ID: <000701bf3292$b7c49130$f29b12c2@secret.pythonware.com>

Andy Robinson  wrote:
> So I would take flush() to mean 'shift back to
> ASCII now'.

if we're still talking about my "just one
codec, please" proposal, that's exactly
what encoder.flush should do.

while decoder.flush should raise an
exception if you're still in double byte mode
(at least if running in 'strict' mode).

> Calling flush and reopen would thus "almost" get the
> same data across.
> 
> I'm trying to think if it would be dangerous.  Do web
> and ftp servers often call flush() in the middle of
> transmitting a block of text?

again, if we're talking about my proposal,
these flush methods are only called by the
string or stream wrappers, never by the
applications.  see the original post for
details.





From gstein@lyra.org  Fri Nov 19 13:29:50 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 05:29:50 -0800 (PST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <199911181530.KAA03887@eric.cnri.reston.va.us>
Message-ID: 

On Thu, 18 Nov 1999, Guido van Rossum wrote:
> Gordon McMillan wrote:
>...
> > I think imputil's emulation of the builtin importer is more of a 
> > demonstration than a serious implementation. As for speed, it 
> > depends on the test. 
> 
> Agreed.  I like some of imputil's features, but I think the API
> needs to be redesigned.

In what ways? It sounds like you've applied some thought. Do you have any
concrete ideas yet, or "just a feeling"? :-)  I'm working through some
changes from JimA right now, and would welcome other suggestions. I think
there may be some outstanding stuff from MAL, but I'm not sure (Marc?)

>...
> So here's a challenge: redesign the import API from scratch.

I would suggest starting with imputil and altering as necessary. I'll use
that viewpoint below.

> Let me start with some requirements.
> 
> Compatibility issues:
> ---------------------
> 
> - the core API may be incompatible, as long as compatibility layers
> can be provided in pure Python

Which APIs are you referring to? The "imp" module? The C functions? The
__import__ and reload builtins?

I'm guessing some of imp, the two builtins, and only one or two C
functions.

> - support for rexec functionality

No problem. I can think of a number of ways to do this.

> - support for freeze functionality

No problem. A function in "imp" must be exposed to Python to support this
within the imputil framework.

> - load .py/.pyc/.pyo files and shared libraries from files

No problem. Again, a function is needed for platform-specific loading of
shared libraries.

> - support for packages

No problem. Demo's in current imputil.

> - sys.path and sys.modules should still exist; sys.path might
> have a slightly different meaning

I would suggest that both retain their *exact* meaning. We introduce
sys.importers -- a list of importers to check, in sequence. The first
importer on that list uses sys.path to look for and load modules. The
second importer loads builtins and frozen code (i.e. modules not on
sys.path).

Users can insert/append new importers or alter sys.path as before.

sys.modules continues to record name:module mappings.
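
Roughly, the top-level hook would then do something like the following
(a sketch only; sys.importers is the proposed list, and the
import_module() method name is made up for illustration):

    import sys

    def hooked_import(name, globals=None, locals=None, fromlist=None):
        # sys.modules still caches name:module mappings
        if sys.modules.has_key(name):
            return sys.modules[name]
        # ask each installed importer, in sequence
        for importer in sys.importers:
            module = importer.import_module(name, globals, locals, fromlist)
            if module is not None:
                sys.modules[name] = module
                return module
        raise ImportError("no importer could handle " + name)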

> - $PYTHONPATH and $PYTHONHOME should still be supported

No problem.

> (I wouldn't mind a splitting up of importdl.c into several
> platform-specific files, one of which is chosen by the configure
> script; but that's a bit of a separate issue.)

Easy enough. The standard importer can select the appropriate
platform-specific module/function to perform the load, i.e. these can move
to Modules/ and be split into one module per platform.

> New features:
> -------------
> 
> - Integrated support for Greg Ward's distribution utilities (i.e. a
>   module prepared by the distutil tools should install painlessly)

I don't know the specific requirements/functionality that would be
required here (does Greg? :-), but I can't imagine any problem with this.

> - Good support for prospective authors of "all-in-one" packaging tool
>   authors like Gordon McMillan's win32 installer or /F's squish.  (But
>   I *don't* require backwards compatibility for existing tools.)

Um. *No* problem. :-)

> - Standard import from zip or jar files, in two ways:
> 
>   (1) an entry on sys.path can be a zip/jar file instead of a directory;
>       its contents will be searched for modules or packages

While this could easily be done, I might argue against it. Old
apps/modules that process sys.path might get confused.

If compatibility is not an issue, then "No problem."

An alternative would be an Importer instance added to sys.importers that
is configured for a specific archive (in other words, don't add the zip
file to sys.path, add ZipImporter(file) to sys.importers).

Another alternative is an Importer that looks at a "sys.py_archives" list.
Or an Importer that has a py_archives instance attribute.

>   (2) a file in a directory that's on sys.path can be a zip/jar file;
>       its contents will be considered as a package (note that this is
>       different from (1)!)

No problem. This will slow things down, as a stat() for *.zip and/or *.jar
must be done, in addition to *.py, *.pyc, and *.pyo.

>   I don't particularly care about supporting all zip compression
>   schemes; if Java gets away with only supporting gzip compression
>   in jar files, so can we.

I presume we would support whatever zlib gives us, and no more.

> - Easy ways to subclass or augment the import mechanism along
>   different dimensions.  For example, while none of the following
>   features should be part of the core implementation, it should be
>   easy to add any or all:
> 
>   - support for a new compression scheme to the zip importer

Presuming ZipImporter is a class (derived from Importer), then this
ability is wholly dependent upon the author of ZipImporter providing the
hook.

The Importer class is already designed for subclassing (and its interface 
is very narrow, which means delegation is also *very* easy; see
imputil.FuncImporter).

>   - support for a new archive format, e.g. tar

A cakewalk. Gordon, JimA, and myself each have archive formats. :-)

>   - a hook to import from URLs or other data sources (e.g. a
>     "module server" imported in CORBA) (this needn't be supported
>     through $PYTHONPATH though)

No problem at all.

>   - a hook that imports from compressed .py or .pyc/.pyo files

No problem at all.

>   - a hook to auto-generate .py files from other filename
>     extensions (as currently implemented by ILU)

No problem at all.

>   - a cache for file locations in directories/archives, to improve
>     startup time

No problem at all.

>   - a completely different source of imported modules, e.g. for an
>     embedded system or PalmOS (which has no traditional filesystem)

No problem at all.

In each of the above cases, the Importer.get_code() method just needs to
grab the byte codes from the XYZ data source. That data source can be
compressed, across a network, generated on the fly, or whatever. Each
importer can certainly create a cache based on its concept of "location".
In some cases, that would be a mapping from module name to filesystem
path, or to a URL, or to a compiled-in, frozen module.

> - Note that different kinds of hooks should (ideally, and within
>   reason) properly combine, as follows: if I write a hook to recognize
>   .spam files and automatically translate them into .py files, and you
>   write a hook to support a new archive format, then if both hooks are
>   installed together, it should be possible to find a .spam file in an
>   archive and do the right thing, without any extra action.  Right?

Ack. Very, very difficult.

The imputil scheme combines the concept of locating/loading into one step.
There is only one "hook" in the imputil system. Its semantic is "map this
name to a code/module object and return it; if you don't have it, then
return None."

Your compositing example is based on the capabilities of the
find-then-load paradigm of the existing "ihooks.py". One module finds
something (foo.spam) and the other module loads it (by generating a .py).

All is not lost, however. I can easily envision the get_code() hook as
allowing any kind of return type. If it isn't a code or module object,
then another hook is called to transform it.
[ actually, I'd design it similarly: a *series* of hooks would be called
  until somebody transforms the foo.spam into a code/module object. ]

The compositing would be limited only by the (Python-based) Importer
classes. For example, my ZipImporter might expect to zip up .pyc files
*only*. Obviously, you would want to alter this to support zipping any
file, then use the suffix to determine what to do at unzip time.

> - It should be possible to write hooks in C/C++ as well as Python

Use FuncImporter to delegate to an extension module.

This is one of the benefits of imputil's single/narrow interface.

> - Applications embedding Python may supply their own implementations,
>   default search path, etc., but don't have to if they want to piggyback
>   on an existing Python installation (even though the latter is
>   fraught with risk, it's cheaper and easier to understand).

An application would have full control over the contents of sys.importers.

For a restricted execution app, it might install an Importer that loads
files from *one* directory only which is configured from a specific
Win32 Registry entry. That importer could also refuse to load shared
modules. The BuiltinImporter would still be present (although the app
would certainly omit all but the necessary builtins from the build).
Frozen modules could be excluded.

> Implementation:
> ---------------
> 
> - There must clearly be some code in C that can import certain
>   essential modules (to solve the chicken-or-egg problem), but I don't
>   mind if the majority of the implementation is written in Python.
>   Using Python makes it easy to subclass.

I posited once before that the cost of import is mostly I/O rather than
CPU, so using Python should not be an issue. MAL demonstrated that a good
design for the Importer classes is also required. Based on this, I'm a
*strong* advocate of moving as much as possible into Python (to get
Python's ease-of-coding with little relative cost).

The (core) C code should be able to search a path for a module and import
it. It does not require dynamic loading or packages. This will be used to
import exceptions.py, then imputil.py, then site.py.

The platform-specific module that performs dynamic loading must be a
statically linked module (in Modules/ ... it doesn't have to be in the
Python/ directory).

site.py can complete the bootstrap by setting up sys.importers with the
appropriate Importer instances (this is where an application can define
its own policy). sys.path was initially set by the import.c bootstrap code
(from the compiled-in path and environment variables).

Note that imputil.py would not install any hooks when it is loaded. That
is up to site.py. This implies the core C code will import a total of
three modules using its builtin system. After that, the imputil mechanism
would be importing everything (site.py would .install() an Importer which
then takes over the __import__ hook).

Further note that the "import" Python statement could be simplified to use
only the hook. However, this would require the core importer to inject
some module names into the imputil module's namespace (since it couldn't
use an import statement until a hook was installed). While this
simplification is "neat", it complicates the run-time system (the import
statement is broken until a hook is installed).

Therefore, the core C code must also support importing builtins. "sys" and
"imp" are needed by imputil to bootstrap.

The core importer should not need to deal with dynamic-load modules.

To support frozen apps, the core importer would need to support loading
the three modules as frozen modules.

The builtin/frozen importing would be exposed thru "imp" for use by
imputil for future imports. imputil would load and use the (builtin)
platform-specific module to do dynamic-load imports.

> - In order to support importing from zip/jar files using compression,
>   we'd at least need the zlib extension module and hence libz itself,
>   which may not be available everywhere.

Yes. I don't see this as a requirement, though. We wouldn't start to use
these by default, would we? Or insist on zlib being present? I see this as
more along the lines of "we have provided a standardized Importer to do
this, *provided* you have zlib support."

> - I suppose that the bootstrap is solved using a mechanism very
>   similar to what freeze currently used (other solutions seem to be
>   platform dependent).

The bootstrap that I outlined above could be done in C code. The import
code would be stripped down dramatically because you'll drop package
support and dynamic loading.

Alternatively, you could probably do the path-scanning in Python and
freeze that into the interpreter. Personally, I don't like this idea as it
would not buy you much at all (it would still need to return to C for
accessing a number of scanning functions and module importing funcs).

> - I also want to still support importing *everything* from the
>   filesystem, if only for development.  (It's hard enough to deal with
>   the fact that exceptions.py is needed during Py_Initialize();
>   I want to be able to hack on the import code written in Python
>   without having to rebuild the executable all the time.

My outline above does not freeze anything. Everything resides in the
filesystem. The C code merely needs a path-scanning loop and functions to
import .py*, builtin, and frozen types of modules.

If somebody nukes their imputil.py or site.py, then they return to Python
1.4 behavior where the core interpreter uses a path for importing (i.e. no
packages). They lose dynamically-loaded module support.

> Let's first complete the requirements gathering.  Are these
> requirements reasonable?  Will they make an implementation too
> complex?  Am I missing anything?

I'm not a fan of the compositing due to it requiring a change to semantics
that I believe are very useful and very clean. However, I outlined a
possible, clean solution to do that (a secondary set of hooks for
transforming get_code() return values).

The requirements are otherwise reasonable to me, as I see that they can
all be readily solved (i.e. they aren't burdensome).

While this email may be long, I do not believe the resulting system would
be complex. From the user-visible side of things, nothing would be
changed. sys.path is still present and operates as before. They *do* have
new functionality they can grow into, though (sys.importers). The
underlying C code is simplified, and the platform-specific dynamic-load
stuff can be distributed to distinct modules, as needed
(e.g. BeOS/dynloadmodule.c and PC/dynloadmodule.c).

> Finally, to what extent does this impact the desire for dealing
> differently with the Python bytecode compiler (e.g. supporting
> optimizers written in Python)?  And does it affect the desire to
> implement the read-eval-print loop (the >>> prompt) in Python?

If the three startup files require byte-compilation, then you could have
some issues (i.e. the byte-compiler must be present).

Once you hit site.py, you have a "full" environment and can easily detect
and import a read-eval-print loop module (i.e. why return to Python? just 
start things up right there).

site.py can also install new optimizers as desired, a new Python-based
parser or compiler, or whatever...  If Python is built without a parser or
compiler (I hope that's an option!), then the three startup modules would
simply be frozen into the executable.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From bwarsaw@cnri.reston.va.us (Barry A. Warsaw)  Fri Nov 19 16:30:15 1999
From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw)
Date: Fri, 19 Nov 1999 11:30:15 -0500 (EST)
Subject: [Python-Dev] CVS log messages with diffs
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: <14389.31511.706588.20840@anthem.cnri.reston.va.us>

There was a suggestion to start augmenting the checkin emails to
include the diffs of the checkin.  This would let you keep a current
snapshot of the tree without having to do a direct `cvs update'.

I think I can add this without a ton of pain.  It would not be
optional however, and the emails would get larger (and some checkins
could be very large).  There's also the question of whether to
generate unified or context diffs.  Personally, I find context diffs
easier to read; unified diffs are smaller but not by enough to really
matter.

So here's an informal poll.  If you don't care either way, you don't
need to respond.  Otherwise please just respond to me and not to the
list.

1. Would you like to start receiving diffs in the checkin messages?

2. If you answer `yes' to #1 above, would you prefer unified or
   context diffs?

-Barry


From bwarsaw@cnri.reston.va.us (Barry A. Warsaw)  Fri Nov 19 17:04:51 1999
From: bwarsaw@cnri.reston.va.us (Barry A. Warsaw) (Barry A. Warsaw)
Date: Fri, 19 Nov 1999 12:04:51 -0500 (EST)
Subject: [Python-Dev] Another 1.6 wish
Message-ID: <14389.33587.947368.547023@anthem.cnri.reston.va.us>

We had some discussion a while back about enabling thread support by
default, if the underlying OS supports it obviously.  I'd like to see
that happen for 1.6.  IIRC, this shouldn't be too hard -- just a few
tweaks of the configure script (and who knows what for those minority
platforms that don't use configure :).

-Barry


From akuchlin@mems-exchange.org  Fri Nov 19 17:07:07 1999
From: akuchlin@mems-exchange.org (Andrew M. Kuchling)
Date: Fri, 19 Nov 1999 12:07:07 -0500 (EST)
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <14389.33587.947368.547023@anthem.cnri.reston.va.us>
References: <14389.33587.947368.547023@anthem.cnri.reston.va.us>
Message-ID: <14389.33723.270207.374259@amarok.cnri.reston.va.us>

Barry A. Warsaw writes:
>We had some discussion a while back about enabling thread support by
>default, if the underlying OS supports it obviously.  I'd like to see

That reminds me... what about the free threading patches?  Perhaps
they should be added to the list of issues to consider for 1.6.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Oh, my fingers! My arms! My legs! My everything! Argh...
    -- The Doctor, in "Nightmare of Eden"



From petrilli@amber.org  Fri Nov 19 17:23:02 1999
From: petrilli@amber.org (Christopher Petrilli)
Date: Fri, 19 Nov 1999 12:23:02 -0500
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <14389.33723.270207.374259@amarok.cnri.reston.va.us>; from akuchlin@mems-exchange.org on Fri, Nov 19, 1999 at 12:07:07PM -0500
References: <14389.33587.947368.547023@anthem.cnri.reston.va.us> <14389.33723.270207.374259@amarok.cnri.reston.va.us>
Message-ID: <19991119122302.B23400@trump.amber.org>

Andrew M. Kuchling [akuchlin@mems-exchange.org] wrote:
> Barry A. Warsaw writes:
> >We had some discussion a while back about enabling thread support by
> >default, if the underlying OS supports it obviously.  I'd like to see

Yes, pretty please!  One of the biggest problems we have in the Zope world
is that, for some unknown reason, most of the Linux RPMs don't have threading
enabled, so people end up having to compile Python anyway... while this
is a silly thing, it does create problems and means that we deal with
a lot of "dumb" problems.

> That reminds me... what about the free threading patches?  Perhaps
> they should be added to the list of issues to consider for 1.6.

My recollection was that unfortunately, MOST of the time they actually
slowed things down because of the number of locks involved...  Guido
can no doubt shed more light on this, but... there was a reason.

Chris
-- 
| Christopher Petrilli
| petrilli@amber.org


From gmcm@hypernet.com  Fri Nov 19 18:22:37 1999
From: gmcm@hypernet.com (Gordon McMillan)
Date: Fri, 19 Nov 1999 13:22:37 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
In-Reply-To: <199911181530.KAA03887@eric.cnri.reston.va.us>
References: Your message of "Thu, 18 Nov 1999 09:19:48 EST."             <1269187709-18981857@hypernet.com>
Message-ID: <1269086690-25057991@hypernet.com>

[Guido]
> Compatibility issues:
> ---------------------
> 
> - the core API may be incompatible, as long as compatibility
> layers can be provided in pure Python

Good idea. Question: we have keyword import, __import__, 
imp and PyImport_*. Which of those (if any) define the "core 
API"?

[rexec, freeze: yes]

> - load .py/.pyc/.pyo files and shared libraries from files

Shared libraries? Might that not involve some rather shady 
platform-specific magic? If it can be kept kosher, I'm all for it; 
but I'd say no if it involved, um, undocumented features.
 
> support for packages

Absolutely. I'll just comment that the concept of 
package.__path__ is also affected by the next point.
> 
> - sys.path and sys.modules should still exist; sys.path might
> have a slightly different meaning
> 
> - $PYTHONPATH and $PYTHONHOME should still be supported

If sys.path changes meaning, should not $PYTHONPATH 
also?

> New features:
> -------------
> 
> - Integrated support for Greg Ward's distribution utilities (i.e.
> a
>   module prepared by the distutil tools should install
>   painlessly)

I assume that this is mostly a matter of $PYTHONPATH and 
other path manipulation mechanisms?
 
> - Good support for prospective authors of "all-in-one" packaging
> tool
>   authors like Gordon McMillan's win32 installer or /F's squish. 
>   (But I *don't* require backwards compatibility for existing
>   tools.)

I guess you've forgotten: I'm that *really* tall guy .
 
> - Standard import from zip or jar files, in two ways:
> 
>   (1) an entry on sys.path can be a zip/jar file instead of a
>   directory;
>       its contents will be searched for modules or packages

I don't mind this, but it depends on whether sys.path changes 
meaning.
 
>   (2) a file in a directory that's on sys.path can be a zip/jar
>   file;
>       its contents will be considered as a package (note that
>       this is different from (1)!)

But it's affected by the same considerations (e.g., do we start
with filesystem names and wrap them in importers, or do we
just start with importer instances / specifications for importer
instances?).
 
>   I don't particularly care about supporting all zip compression
>   schemes; if Java gets away with only supporting gzip
>   compression in jar files, so can we.

I think this is a matter of what zip compression is officially 
blessed. I don't mind if it's none; providing / creating zipped 
versions for platforms that support it is nearly trivial.

> - Easy ways to subclass or augment the import mechanism along
>   different dimensions.  For example, while none of the following
>   features should be part of the core implementation, it should
>   be easy to add any or all:
> 
>   - support for a new compression scheme to the zip importer
> 
>   - support for a new archive format, e.g. tar
> 
>   - a hook to import from URLs or other data sources (e.g. a
>     "module server" imported in CORBA) (this needn't be supported
>     through $PYTHONPATH though)

Which begs the question of the meaning of sys.path; and if it's 
still filesystem names, how do you get one of these in there?
 
>   - a hook that imports from compressed .py or .pyc/.pyo files
> 
>   - a hook to auto-generate .py files from other filename
>     extensions (as currently implemented by ILU)
> 
>   - a cache for file locations in directories/archives, to
>   improve
>     startup time
> 
>   - a completely different source of imported modules, e.g. for
>   an
>     embedded system or PalmOS (which has no traditional
>     filesystem)
> 
> - Note that different kinds of hooks should (ideally, and within
>   reason) properly combine, as follows: if I write a hook to
>   recognize .spam files and automatically translate them into .py
>   files, and you write a hook to support a new archive format,
>   then if both hooks are installed together, it should be
>   possible to find a .spam file in an archive and do the right
>   thing, without any extra action.  Right?

A bit of discussion: I've got 2 kinds of archives. One can 
contain anything & is much like a zip (and probably should be 
a zip). The other contains only compressed .pyc or .pyo. The 
latter keys contents by logical name, not filesystem name. No 
extensions, and when a package is imported, the code object
returned is the __init__ code object (vs. returning None and
letting the import mechanism come back and ask for
package.__init__).

When you're building an archive, you have to go thru the .py / 
.pyc / .pyo / is it a package / maybe compile logic anyway. 
Why not get it all over with, so that at runtime there are no
choices to be made?

Which means (for this kind of archive) that including 
somebody's .spam in your archive isn't a matter of a hook, but 
a matter of adding to the archive's build smarts.
 
> - It should be possible to write hooks in C/C++ as well as Python
> 
> - Applications embedding Python may supply their own
> implementations,
>   default search path, etc., but don't have to if they want to
>   piggyback on an existing Python installation (even though the
>   latter is fraught with risk, it's cheaper and easier to
>   understand).

A way of tweaking that which will become sys.path before 
Py_Initialize would be *most* welcome.

> Implementation:
> ---------------
> 
> - There must clearly be some code in C that can import certain
>   essential modules (to solve the chicken-or-egg problem), but I
>   don't mind if the majority of the implementation is written in
>   Python. Using Python makes it easy to subclass.
> 
> - In order to support importing from zip/jar files using
> compression,
>   we'd at least need the zlib extension module and hence libz
>   itself, which may not be available everywhere.
> 
> - I suppose that the bootstrap is solved using a mechanism very
>   similar to what freeze currently used (other solutions seem to
>   be platform dependent).

There are other possibilities here, but I have only half-
formulated ideas at the moment. The critical part for
embedding is to be able to *completely* control all path-
related logic.
 
> - I also want to still support importing *everything* from the
>   filesystem, if only for development.  (It's hard enough to deal
>   with the fact that exceptions.py is needed during
>   Py_Initialize(); I want to be able to hack on the import code
>   written in Python without having to rebuild the executable all
>   the time.
> 
> Let's first complete the requirements gathering.  Are these
> requirements reasonable?  Will they make an implementation too
> complex?  Am I missing anything?

I'll summarize as follows:
 1) What "sys.path" means (and how it's construction can be 
manipulated) is critical.
 2) See 1.
 
> Finally, to what extent does this impact the desire for dealing
> differently with the Python bytecode compiler (e.g. supporting
> optimizers written in Python)?  And does it affect the desire to
> implement the read-eval-print loop (the >>> prompt) in Python?

I can assure you that code.py runs fine out of an archive :-).

- Gordon


From gstein@lyra.org  Fri Nov 19 21:06:14 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 13:06:14 -0800 (PST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: 
Message-ID: 

[ taking the liberty to CC: this back to python-dev ]

On Fri, 19 Nov 1999, David Ascher wrote:
> > >   (2) a file in a directory that's on sys.path can be a zip/jar file;
> > >       its contents will be considered as a package (note that this is
> > >       different from (1)!)
> > 
> > No problem. This will slow things down, as a stat() for *.zip and/or *.jar
> > must be done, in addition to *.py, *.pyc, and *.pyo.
> 
> Aside: it strikes me that for Python programs which import lots of files,
> 'front-loading' the stat calls could make sense.  When you first look at a
> directory in sys.path, you read the entire directory in memory, and
> successive imports do a stat on the directory to see if it's changed, and
> if not use the in-memory data.  Or am I completely off my rocker here?

Not at all. I thought of this last night after my email. Since the
Importer can easily retain state, it can hold a cache of the directory
listings. If it doesn't find the file in its cached state, then it can
reload the information from disk. If it finds it in the cache, but not on
disk, then it can remove the item from its cache.

The problem occurs when your path is [A, B], the file is in B, and you add
something to A on the fly. The cache might direct the importer at B,
missing your file.

Of course, with the appropriate caveats/warnings, the system would work
quite well. It really only breaks during development (which is one reason 
why I didn't accept some caching changes to imputil from MAL; but that
was for the Importer in there; Python's new Importer could have a cache).

I'm also not quite sure what the cost of reading a directory is, compared
to issuing a bunch of stat() calls. Each directory read is an
opendir/readdir(s)/closedir. Note that the DBM approach is kind of
similar, but will amortize this cost over many processes.
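
For what it's worth, the cache itself could be as simple as this
sketch (a standalone illustration, not tied to any particular Importer;
it re-reads a directory only when the directory's mtime changes):

    import os

    _dir_cache = {}    # directory path -> (mtime, list of names)

    def cached_listdir(path):
        # return the directory listing, re-reading it only when the
        # directory's modification time has changed since the last read
        mtime = os.stat(path)[8]          # index 8 is st_mtime
        try:
            cached_mtime, names = _dir_cache[path]
            if cached_mtime == mtime:
                return names
        except KeyError:
            pass
        names = os.listdir(path)
        _dir_cache[path] = (mtime, names)
        return names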

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From Jasbahr@origin.EA.com  Fri Nov 19 20:59:11 1999
From: Jasbahr@origin.EA.com (Asbahr, Jason)
Date: Fri, 19 Nov 1999 14:59:11 -0600
Subject: [Python-Dev] Another 1.6 wish
Message-ID: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com>

My first Python-Dev post.  :-)

>We had some discussion a while back about enabling thread support by
>default, if the underlying OS supports it obviously.  

What's the consensus about Python microthreads -- a likely candidate
for incorporation in 1.6 (or later)?

Also, we have a couple of minor convenience functions for Python in an
MSDEV environment: an exposure of OutputDebugString for writing to
the DevStudio log window, and a means of tripping DevStudio C/C++ layer
breakpoints from Python code (currently experimental).  The msvcrt
module seems like a likely candidate for these; would they be
welcome additions?

Thanks,

Jason Asbahr
Origin Systems, Inc.
jasbahr@origin.ea.com


From gstein@lyra.org  Fri Nov 19 21:35:34 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 13:35:34 -0800 (PST)
Subject: [Python-Dev] Re: [Python-checkins] CVS log messages with diffs
In-Reply-To: <14389.31511.706588.20840@anthem.cnri.reston.va.us>
Message-ID: 

On Fri, 19 Nov 1999, Barry A. Warsaw wrote:
> There was a suggestion to start augmenting the checkin emails to
> include the diffs of the checkin.  This would let you keep a current
> snapshot of the tree without having to do a direct `cvs update'.

I've been using diffs-in-checkin for review, rather than to keep a local
snapshot updated. I guess you use the email for this (procmail truly is
frightening), but I think for most people it would be for purposes of
review.

>...context vs unifed...
> So here's an informal poll.  If you don't care either way, you don't
> need to respond.  Otherwise please just respond to me and not to the
> list.
> 
> 1. Would you like to start receiving diffs in the checkin messages?

Absolutely.

> 2. If you answer `yes' to #1 above, would you prefer unified or
>    context diffs?

Don't care.

I've attached an archive of the files that I use in my CVS repository to
do emailed diffs. These came from Ken Coar (an Apache guy) as an
extraction from the Apache repository. Yes, they do use Perl. I'm not a
Perl guy, so I probably would break things if I tried to "fix" the scripts
by converting them to Python (in fact, Greg Ward helped to improve
log_accum.pl for me!). I certainly would not be averse to Python versions
of these files, or other cleanups.

I trimmed down the "avail" file, leaving a few examples. It works with
cvs_acls.pl to provide per-CVS-module read/write access control.

I'm currently running mod_dav, PyOpenGL, XML-SIG, PyWin32, and two other
small projects out of this repository. It has been working quite well.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

[Attachment: cvs-for-barry.tar.gz (base64-encoded tar archive omitted)]


From bwarsaw@python.org  Fri Nov 19 21:45:14 1999
From: bwarsaw@python.org (Barry A. Warsaw)
Date: Fri, 19 Nov 1999 16:45:14 -0500 (EST)
Subject: [Python-Dev] Re: [Python-checkins] CVS log messages with diffs
References: <14389.31511.706588.20840@anthem.cnri.reston.va.us>
 
Message-ID: <14389.50410.358686.637483@anthem.cnri.reston.va.us>

>>>>> "GS" == Greg Stein  writes:

    GS> I've been using diffs-in-checkin for review, rather than to
    GS> keep a local snapshot updated.

Interesting; I hadn't thought about this use for the diffs.

    GS> I've attached an archive of the files that I use in my CVS
    GS> repository to do emailed diffs. These came from Ken Coar (an
    GS> Apache guy) as an extraction from the Apache repository. Yes,
    GS> they do use Perl. I'm not a Perl guy, so I probably would
    GS> break things if I tried to "fix" the scripts by converting
    GS> them to Python (in fact, Greg Ward helped to improve
    GS> log_accum.pl for me!). I certainly would not be adverse to
    GS> Python versions of these files, or other cleanups.

Well, we all know Greg Ward's one of those subversive types, but then
again it's great to have (hopefully now-loyal) defectors in our camp,
just to keep us honest :)

Anyway, thanks for sending the code, it'll come in handy if I get
stuck.  Of course, my P**l skills are so rusted I don't think even an
oilcan-armed Dorothy could lube 'em up, so I'm not sure how much use I
can put them to.  Besides, I already have a huge kludge that gets run
on each commit, and I don't think it'll be too hard to add diff
generation... IF the informal vote goes that way.

-Barry


From gmcm@hypernet.com  Fri Nov 19 21:56:20 1999
From: gmcm@hypernet.com (Gordon McMillan)
Date: Fri, 19 Nov 1999 16:56:20 -0500
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: 
References: 
Message-ID: <1269073918-25826188@hypernet.com>

[David Ascher got involuntarily forwarded]
> > Aside: it strikes me that for Python programs which import lots
> > of files, 'front-loading' the stat calls could make sense. 
> > When you first look at a directory in sys.path, you read the
> > entire directory in memory, and successive imports do a stat on
> > the directory to see if it's changed, and if not use the
> > in-memory data.  Or am I completely off my rocker here?

I posted something here about dircache not too long ago. 
Essentially, I found it completely unreliable on NT and on 
Linux to stat the directory. There was some test code 
attached.
 


- Gordon


From gstein@lyra.org  Fri Nov 19 22:09:36 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 14:09:36 -0800 (PST)
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <19991119122302.B23400@trump.amber.org>
Message-ID: 

On Fri, 19 Nov 1999, Christopher Petrilli wrote:
> Andrew M. Kuchling [akuchlin@mems-exchange.org] wrote:
> > Barry A. Warsaw writes:
> > >We had some discussion a while back about enabling thread support by
> > >default, if the underlying OS supports it obviously.  I'd like to see

Definitely.

I think you still want a --disable-threads option, but the default really
ought to include them.

> Yes pretty please!  One of the biggest problems we have in the Zope world
> is that for some unknown reason, most of hte Linux RPMs don't have threading
> on in them, so people end up having to compile it anyway... while this
> is a silly thing, it does create problems, and means that we deal with
> a lot of "dumb" problems.

Yah. It's a pain. My RedHat 6.1 box has 1.5.2 with threads. I haven't
actually had to build my own Python(!). Man... imagine that. After almost
five years of using Linux/Python, I can actually rely on the OS getting it
right! :-)

> > That reminds me... what about the free threading patches?  Perhaps
> > they should be added to the list of issues to consider for 1.6.
> 
> My recolection was that unfortunately MOST of the time, they actually
> slowed down things because of the number of locks involved...  Guido
> can no doubt shed more light onto this, but... there was a reason.

Yes, there were problems in the first round with locks and lock
contention. The main issue is that a list must always use a lock to keep
itself consistent. Always. There is no way for an application to say "hey,
list object! I've got a higher-level construct here that guarantees there
will be no cross-thread use of this list. Ignore the locking." Another
issue that can't be avoided is using atomic increment/decrement for the
object refcounts.
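
To see where the cost comes from, here is a pure-Python caricature of such
an "integrity lock" (the real patches do this in C inside the list
implementation, so this only illustrates the shape of the overhead, not how
the patches are written):

import threading

class LockedList:
    # Every mutation takes the lock, whether or not any other thread ever
    # touches this particular list -- that is where the slowdown comes from.
    def __init__(self):
        self._lock = threading.Lock()
        self._data = []

    def append(self, item):
        self._lock.acquire()
        try:
            self._data.append(item)
        finally:
            self._lock.release()

    def pop(self, index=-1):
        self._lock.acquire()
        try:
            return self._data.pop(index)
        finally:
            self._lock.release()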

Guido has already asked me about free threading patches for 1.6. I don't
know if his intent was to include them, or simply to have them available
for those who need them.

Certainly, this time around they will be simpler since Guido folded in
some of the support stuff (e.g. PyThreadState and per-thread exceptions).
There are some other supporting changes that could definitely go into the
core interpreter. The slow part comes when you start to add integrity
locks to list, dict, etc. That is when the question on whether to include
free threading comes up.

Design-wise, there is a change or two that I would probably make.

Note that shoving free-threading into the standard interpreter would get
more eyeballs at the thing, and that people may have great ideas for
reducing the overheads.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Fri Nov 19 22:11:02 1999
From: gstein@lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 14:11:02 -0800 (PST)
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com>
Message-ID: 

On Fri, 19 Nov 1999, Asbahr, Jason wrote:
> >We had some discussion a while back about enabling thread support by
> >default, if the underlying OS supports it obviously.  
> 
> What's the consensus about Python microthreads -- a likely candidate
> for incorporation in 1.6 (or later)?

microthreads? eh?

> Also, we have a couple minor convenience functions for Python in an 
> MSDEV environment, an exposure of OutputDebugString for writing to 
> the DevStudio log window and a means of tripping DevStudio C/C++ layer
> breakpoints from Python code (currently experimental).  The msvcrt 
> module seems like a likely candidate for these, would these be 
> welcome additions?

Sure. I don't see why not. I know that I've used OutputDebugString a
bazillion times from the Python layer. The breakpoint thingy... dunno, but
I don't see a reason to exclude it.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From skip@mojam.com (Skip Montanaro)  Fri Nov 19 22:11:38 1999
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Fri, 19 Nov 1999 16:11:38 -0600 (CST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: 
References: 
 
Message-ID: <14389.51994.809130.22062@dolphin.mojam.com>

    Greg> The problem occurs when you path is [A, B], the file is in B, and
    Greg> you add something to A on-the-fly. The cache might direct the
    Greg> importer at B, missing your file.

Typically your path will be relatively short (< 20 directories), right?
Just stat the directories before consulting the cache.  If any changed since
the last time the cache was built, then invalidate the entire cache (or that
portion of the cached information that is downstream from the first modified
directory).  It's still going to be cheaper than performing listdir for each
directory in the path, and like you said, only require flushes during
development or installation actions.
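
A rough sketch of the idea (hypothetical names; this variant just
invalidates per directory rather than everything downstream of the first
change):

import os, stat

_dir_cache = {}    # directory -> (mtime, list of names)

def cached_listdir(directory):
    try:
        mtime = os.stat(directory)[stat.ST_MTIME]
    except os.error:
        return []
    cached = _dir_cache.get(directory)
    if cached and cached[0] == mtime:
        return cached[1]                  # directory unchanged: reuse the listing
    names = os.listdir(directory)         # changed (or first time): rebuild it
    _dir_cache[directory] = (mtime, names)
    return names

def find_module_file(modname, path):
    # One stat per directory instead of one stat per candidate file name.
    for directory in path:
        names = cached_listdir(directory)
        for suffix in ('.py', '.pyc'):
            if modname + suffix in names:
                return os.path.join(directory, modname + suffix)
    return None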

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From skip@mojam.com (Skip Montanaro)  Fri Nov 19 22:15:14 1999
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Fri, 19 Nov 1999 16:15:14 -0600 (CST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <1269073918-25826188@hypernet.com>
References: 
 <1269073918-25826188@hypernet.com>
Message-ID: <14389.52210.833368.249942@dolphin.mojam.com>

    Gordon> I posted something here about dircache not too long ago.
    Gordon> Essentially, I found it completely unreliable on NT and on Linux
    Gordon> to stat the directory. There was some test code attached.

The modtime of the directory's stat info should only change if you add or
delete entries in the directory.  Were you perhaps expecting changes when
other operations took place, like rewriting an existing file? 

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...


From skip@mojam.com (Skip Montanaro)  Fri Nov 19 22:34:42 1999
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Fri, 19 Nov 1999 16:34:42 -0600
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <1269073918-25826188@hypernet.com>
References: 
 <1269073918-25826188@hypernet.com>
Message-ID: <199911192234.QAA24710@dolphin.mojam.com>

Gordon wrote:

    Gordon> I posted something here about dircache not too long ago.
    Gordon> Essentially, I found it completely unreliable on NT and on Linux
    Gordon> to stat the directory. There was some test code attached.

to which I replied:

    Skip> The modtime of the directory's stat info should only change if you
    Skip> add or delete entries in the directory.  Were you perhaps
    Skip> expecting changes when other operations took place, like rewriting
    Skip> an existing file?

I took a couple minutes to write a simple script to check things.  It
created a file, changed its mode, then unlinked it.  I was a bit surprised
that deleting a file didn't appear to change the directory's mod time.  Then
I realized that since file times are only recorded with one-second
precision, you might see no change to the directory's mtime in some
circumstances.  Adding a sleep to the script between directory operations
resolved the apparent inconsistency.  Still, as Gordon stated, you probably
can't count on directory modtimes to tell you when to invalidate the cache.
It's consistent, just not reliable...
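
The script was roughly along these lines (a reconstruction, not the
original; it assumes the current directory is writable):

import os, stat, time

def dir_mtime(d):
    return os.stat(d)[stat.ST_MTIME]

d = '.'

# Back-to-back create and unlink land in the same clock second, so the
# directory's mtime can look unchanged after the unlink.
open('junkfile', 'w').close()
t_create = dir_mtime(d)
os.unlink('junkfile')
print("unlink visible without sleep: %s" % (dir_mtime(d) != t_create))

# With a sleep in between, the second operation gets its own timestamp.
open('junkfile', 'w').close()
t_create = dir_mtime(d)
time.sleep(1.1)
os.unlink('junkfile')
print("unlink visible with sleep:    %s" % (dir_mtime(d) != t_create))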

if-we-slow-import-down-enough-we-can-use-this-trick-though-ly y'rs,

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...


From mhammond@skippinet.com.au  Sat Nov 20 00:04:28 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 20 Nov 1999 11:04:28 +1100
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com>
Message-ID: <005f01bf32ea$d0b82b90$0501a8c0@bobcat>

> Also, we have a couple minor convenience functions for Python in an
> MSDEV environment, an exposure of OutputDebugString for writing to
> the DevStudio log window and a means of tripping DevStudio C/C++
layer
> breakpoints from Python code (currently experimental).  The msvcrt
> module seems like a likely candidate for these, would these be
> welcome additions?

These are both available in the win32api module.  They dont really fit
in the "msvcrt" module, as they are not part of the C runtime library,
but the win32 API itself.

This is really a pointer to the fact that some or all of the win32api
should be moved into the core - registry access is the thing people
most want, but there are plenty of other useful things that people
regularly use...

Guido objects to the coding style, but hopefully that wont be a big
issue.  IMO, the coding style isnt "bad" - it is just more an "MS"
flavour than a "Python" flavour - presumably people reading the code
will have some experience with Windows, so it wont look completely
foreign to them.  The good thing about taking it "as-is" is that it
has been fairly well bashed on over a few years, so is really quite
stable.  The final "coding style" issue is that there are no "doc
strings" - all documentation is embedded in C comments, and extracted
using a tool called "autoduck" (similar to "autodoc").  However, Im
sure we can arrange something there, too.

Mark.



From jcw@equi4.com  Sat Nov 20 00:21:43 1999
From: jcw@equi4.com (Jean-Claude Wippler)
Date: Sat, 20 Nov 1999 01:21:43 +0100
Subject: [Python-Dev] Import redesign [LONG]
References: 
 <1269073918-25826188@hypernet.com> <199911192234.QAA24710@dolphin.mojam.com>
Message-ID: <3835E997.8A4F5BC5@equi4.com>

Skip Montanaro wrote:
>
[dir stat cache times]
> I took a couple minutes to write a simple script to check things.  It
> created a file, changed its mode, then unlinked it.  I was a bit
> surprised that deleting a file didn't appear to change the directory's
> mod time.  Then I realized that since file times are only recorded
> with one-second

Or two, on Windows with older (FAT, as opposed to VFAT) file systems.

> precision, you might see no change to the directory's mtime in some
> circumstances.  Adding a sleep to the script between directory
> operations resolved the apparent inconsistency.  Still, as Gordon
> stated, you probably can't count on directory modtimes to tell you
> when to invalidate the cache. It's consistent, just not reliable...
> 
> if-we-slow-import-down-enough-we-can-use-this-trick-though-ly y'rs,

If the dir stat time is less than 2 seconds ago, flush - always.

If the dir stat time says it hasn't been changed for at least 2 seconds
then you can cache all entries and trust that any change is detected.
In other words: take the *current* time into account, then it can work.

I think.  Maybe.  Until you get into network drives and clock skew...
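
In code the rule would be something like this (just a sketch of the idea,
ignoring the clock-skew problem):

import os, stat, time

SLACK = 2   # worst-case timestamp granularity in seconds (2 on FAT)

def can_trust_cache(directory):
    # A change made within the last SLACK seconds could share the current
    # mtime, so don't trust the cached listing yet; after that, any change
    # is guaranteed to move the mtime.
    mtime = os.stat(directory)[stat.ST_MTIME]
    return (time.time() - mtime) >= SLACK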

-- Jean-Claude


From gmcm@hypernet.com  Sat Nov 20 03:43:32 1999
From: gmcm@hypernet.com (Gordon McMillan)
Date: Fri, 19 Nov 1999 22:43:32 -0500
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <3835E997.8A4F5BC5@equi4.com>
Message-ID: <1269053086-27079185@hypernet.com>

Jean-Claude wrote:
> Skip Montanaro wrote:
> >
> [dir stat cache times]
> > ...  Then I realized that since
> > file times are only recorded with one-second
> 
> Or two, on Windows with older (FAT, as opposed to VFAT) file
> systems.

Oh lordy, it gets worse. 

With a time.sleep(1.0) between new files, Linux detects the 
change in the dir's mtime immediately. Cool.

On NT, I get an average 2.0 sec delay. But sometimes it 
doesn't detect the change within 100 secs (and my script quits). Then 
I added a stat of some file in the directory (not the file I added) 
before the stat of the directory. Now it acts just like Linux - 
no delay (on both FAT and NTFS partitions). OK...

> I think.  Maybe.  Until you get into network drives and clock
> skew...

No success whatsoever in either direction across Samba. In 
fact the mtime of my Linux home directory as seen from NT is 
Jan 1, 1980.

- Gordon


From gstein@lyra.org  Sat Nov 20 12:06:48 1999
From: gstein@lyra.org (Greg Stein)
Date: Sat, 20 Nov 1999 04:06:48 -0800 (PST)
Subject: [Python-Dev] updated imputil
Message-ID: 

I've updated imputil... The main change is that I added SysPathImporter
and BuiltinImporter. I also did some restructuring to help with
bootstrapping the module (removing the dependence on os.py).

For testing a revamped Python import system, you can import the thing
and call imputil._test_revamp() to set it up. This will load normal,
builtin, and frozen modules via imputil. Dynamic modules are still
handled by Python, however.
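
Concretely, trying it out should amount to roughly this (assuming the
updated imputil.py is on the path; only _test_revamp() is named above, the
rest is ordinary import machinery):

import imputil
imputil._test_revamp()      # install the Importer-based hook

import glob                 # subsequent imports now go through imputil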

I ran a timing comparison of importing all modules in /usr/lib/python1.5
(using standard and imputil-based importing). The standard mechanism can
do it in about 8.8 seconds. Through imputil, it does it in about 13.0
seconds. Note that I haven't profiled/optimized any of the Importer stuff
(yet).

The point about dynamic modules actually discovered a basic problem that I
need to resolve now. The current imputil assumes that if a particular
Importer loaded the top-level module in a package, then that Importer is
responsible for loading all other modules within that package. In my
particular test, I tried to import "xml.parsers.pyexpat". The two package
modules were handled by SysPathImporter. The pyexpat module is a dynamic
load module, so it is *not* handled by the Importer -- bam. Failure.

Basically, each part of "xml.parsers.pyexpat" may need to use a different
Importer...

Off to ponder,
-g

-- 
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Sat Nov 20 12:11:37 1999
From: gstein@lyra.org (Greg Stein)
Date: Sat, 20 Nov 1999 04:11:37 -0800 (PST)
Subject: [Python-Dev] updated imputil
In-Reply-To: 
Message-ID: 

oops... forgot:

   http://www.lyra.org/greg/python/imputil.py

-g

On Sat, 20 Nov 1999, Greg Stein wrote:
> I've updated imputil... The main changes is that I added SysPathImporter
> and BuiltinImporter. I also did some restructing to help with
> bootstrapping the module (remove dependence on os.py).
> 
> For testing a revamped Python import system, you can importing the thing
> and call imputil._test_revamp() to set it up. This will load normal,
> builtin, and frozen modules via imputil. Dynamic modules are still
> handled by Python, however.
> 
> I ran a timing comparisons of importing all modules in /usr/lib/python1.5
> (using standard and imputil-based importing). The standard mechanism can
> do it in about 8.8 seconds. Through imputil, it does it in about 13.0
> seconds. Note that I haven't profiled/optimized any of the Importer stuff
> (yet).
> 
> The point about dynamic modules actually discovered a basic problem that I
> need to resolve now. The current imputil assumes that if a particular
> Importer loaded the top-level module in a package, then that Importer is
> responsible for loading all other modules within that package. In my
> particular test, I tried to import "xml.parsers.pyexpat". The two package
> modules were handled by SysPathImporter. The pyexpat module is a dynamic
> load module, so it is *not* handled by the Importer -- bam. Failure.
> 
> Basically, each part of "xml.parsers.pyexpat" may need to use a different
> Importer...
> 
> Off to ponder,
> -g
> 
> -- 
> Greg Stein, http://www.lyra.org/
> 
> 
> _______________________________________________
> Python-Dev maillist  -  Python-Dev@python.org
> http://www.python.org/mailman/listinfo/python-dev
> 

-- 
Greg Stein, http://www.lyra.org/



From skip@mojam.com (Skip Montanaro)  Sat Nov 20 14:16:58 1999
From: skip@mojam.com (Skip Montanaro) (Skip Montanaro)
Date: Sat, 20 Nov 1999 08:16:58 -0600 (CST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <1269053086-27079185@hypernet.com>
References: <3835E997.8A4F5BC5@equi4.com>
 <1269053086-27079185@hypernet.com>
Message-ID: <14390.44378.83128.546732@dolphin.mojam.com>

    Gordon> No success whatsoever in either direction across Samba. In fact
    Gordon> the mtime of my Linux home directory as seen from NT is Jan 1,
    Gordon> 1980.

Ain't life grand? :-(

Ah, well, it was a nice idea...

S


From jim@interet.com  Mon Nov 22 16:43:39 1999
From: jim@interet.com (James C. Ahlstrom)
Date: Mon, 22 Nov 1999 11:43:39 -0500
Subject: [Python-Dev] Import redesign [LONG]
References: 
Message-ID: <383972BB.C65DEB26@interet.com>

Greg Stein wrote:
> 
> I would suggest that both retain their *exact* meaning. We introduce
> sys.importers -- a list of importers to check, in sequence. The first
> importer on that list uses sys.path to look for and load modules. The
> second importer loads builtins and frozen code (i.e. modules not on
> sys.path).

We should retain the current order.  I think it is:
first builtin, next frozen, next sys.path.
I really think frozen modules should be loaded in preference
to sys.path.  After all, they are compiled in.
 
> Users can insert/append new importers or alter sys.path as before.

I agree with Greg that sys.path should remain as it is.  A list
of importers can add the extra functionality.  Users will
probably want to adjust the order of the list.

> > Implementation:
> > ---------------
> >
> > - There must clearly be some code in C that can import certain
> >   essential modules (to solve the chicken-or-egg problem), but I don't
> >   mind if the majority of the implementation is written in Python.
> >   Using Python makes it easy to subclass.
> 
> I posited once before that the cost of import is mostly I/O rather than
> CPU, so using Python should not be an issue. MAL demonstrated that a good
> design for the Importer classes is also required. Based on this, I'm a
> *strong* advocate of moving as much as possible into Python (to get
> Python's ease-of-coding with little relative cost).

Yes, I agree.  And I think the main() should be written in Python.  Lots
of Python should be written in Python.

> The (core) C code should be able to search a path for a module and import
> it. It does not require dynamic loading or packages. This will be used to
> import exceptions.py, then imputil.py, then site.py.

But these can be frozen in (as you mention below).  I dislike depending
on sys.path to load essential modules.  If they are not frozen in,
then we need a command line argument to specify their path, with
sys.path used otherwise.
 
Jim Ahlstrom


From jim@interet.com  Mon Nov 22 17:25:46 1999
From: jim@interet.com (James C. Ahlstrom)
Date: Mon, 22 Nov 1999 12:25:46 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
References: <1269144272-21594530@hypernet.com>
Message-ID: <38397C9A.DF6B7112@interet.com>

Gordon McMillan wrote:

> [JimA]
> > Think about multiple packages in multiple zip files.  The zip
> > files store file directories.  That means we would need a
> > sys.zippath to search the zip files.  I don't want another
> > PYTHONPATH phenomenon.
> 
> What if sys.path looked like:
>  [DirImporter('.'), ZlibImporter('c:/python/stdlib.pyz'), ...]

Well, that changes the current meaning of sys.path.
 
> > > > I suggest that archive files MUST be put into a known
> > > > directory.
> 
> No way. Hard code a directory? Overwrite someone else's
> Python "standalone"? Write to a C: partition that is
> deliberately sized to hold nothing but Windows? Make
> network installations impossible?

Ooops.  I didn't mean a known directory you couldn't change.
But I did mean a directory you shouldn't change.

But you are right.  The directory should be configurable.  But
I would still like to see a highly encouraged directory.  I
don't yet have a good design for this.  Anyone have ideas on an
official way to find library files?

I think a Python library file is a Good Thing, but it is not useful if
the archive can't be found.

I am thinking of a busy SysAdmin with someone nagging him/her to
install Python.  SysAdmin doesn't want another headache.  What if
Python becomes popular and users want it on Unix and PC's?  More
work!  There should be a standard way to do this that just works
and is dumb-stupid-simple.  This is a Python promotion issue.  Yes
everyone here can make sys.path work, but that is not the point.

> The official Windows solution is stuff in registry about app
> paths and such. Putting the dlls in the exe's directory is a
> workaround which works and is more managable than the
> official solution.

I agree completely.
 
> > > > We should also have the ability to append archive files to
> > > > the executable or a shared library assuming the OS allows
> > > > this
> 
> That's a handy trick on Windows, but it's got nothing to do
> with Python.

It also works on Linux.  I don't know about other systems.
 
> Flexibility. You can put Christian's favorite Einstein quote here
> too.

I hope we can still have ease of use with all this flexibility.
As I said, we need to promote Python.
 
Jim Ahlstrom


From mal@lemburg.com  Tue Nov 23 13:32:42 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 23 Nov 1999 14:32:42 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.8
References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> <3834425A.8E9C3B7E@lemburg.com>
Message-ID: <383A977A.C20E6518@lemburg.com>

FYI, I've uploaded a new version of the proposal which includes
the encodings package, definition of the 'raw unicode escape'
encoding (available via e.g. ur""), Unicode format strings and
a new method .breaklines().

The latest version of the proposal is available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

* Stream readers:

  What about .readline(), .readlines() ? These could be implemented
  using .read() as generic functions instead of requiring their
  implementation by all codecs. Also see Line Breaks. (A small sketch
  of this idea follows the list below.)

* Python interface for the Unicode property database

* What other special Unicode formatting characters should be
  enhanced to work with Unicode input ? Currently only the
  following special semantics are defined:

    u"%s %s" % (u"abc", "abc") should return u"abc abc".


Pretty quiet around here lately...
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    38 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From jcw@equi4.com  Tue Nov 23 15:17:36 1999
From: jcw@equi4.com (Jean-Claude Wippler)
Date: Tue, 23 Nov 1999 16:17:36 +0100
Subject: [Python-Dev] New thread ideas in Perl-land
Message-ID: <383AB010.DD46A1FB@equi4.com>

Just got a note about a paper on a new way of dealing with threads, as
presented to the Perl-Porters list.  The idea is described in:
	http://www.cpan.org/modules/by-authors/id/G/GB/GBARTELS/thread_0001.txt

I have no time to dive in, comment, or even judge the relevance of this,
but perhaps someone else on this list wishes to check it out.

The author of this is Greg London .

-- Jean-Claude


From mhammond@skippinet.com.au  Tue Nov 23 22:45:14 1999
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 24 Nov 1999 09:45:14 +1100
Subject: [Python-Dev] Unicode Proposal: Version 0.8
In-Reply-To: <383A977A.C20E6518@lemburg.com>
Message-ID: <002301bf3604$68fd8f00$0501a8c0@bobcat>

> Pretty quiet around here lately...

My guess is that most positions and opinions have been covered.  It is
now probably time for less talk, and more code!

Is it time to start an implementation plan?  Do we start with /F's
Unicode implementation (which /G *smirk* seemed to approve of)?  Who
does what?  When can we start to play with it?

And a key point that seems to have been thrust in our faces at the
start and hardly mentioned recently - does the proposal as it stands
meet our sponsor's (HP) requirements?

Mark.



From gstein@lyra.org  Wed Nov 24 00:40:44 1999
From: gstein@lyra.org (Greg Stein)
Date: Tue, 23 Nov 1999 16:40:44 -0800 (PST)
Subject: [Python-Dev] Re: updated imputil
In-Reply-To: 
Message-ID: 

 :-)

On Sat, 20 Nov 1999, Greg Stein wrote:
>...
> The point about dynamic modules actually discovered a basic problem that I
> need to resolve now. The current imputil assumes that if a particular
> Importer loaded the top-level module in a package, then that Importer is
> responsible for loading all other modules within that package. In my
> particular test, I tried to import "xml.parsers.pyexpat". The two package
> modules were handled by SysPathImporter. The pyexpat module is a dynamic
> load module, so it is *not* handled by the Importer -- bam. Failure.
> 
> Basically, each part of "xml.parsers.pyexpat" may need to use a different
> Importer...

I've thought about this and decided the issue is with my particular
Importer, rather than the imputil design. The PathImporter traverses a set
of paths and establishes a package hierarchy based on a filesystem layout.
It should be able to load dynamic modules from within that filesystem
area.

A couple alternatives, and why I don't believe they work as well:

* A separate importer to just load dynamic libraries: this would need to
  replicate PathImporter's mapping of Python module/package hierarchy onto
  the filesystem. There would also be a sequencing issue because one
  Importer's paths would be searched before the other's paths. Current
  Python import rules establish that a module earlier in sys.path
  (whether a dyn-lib or not) is loaded before one later in the path. This
  behavior could be broken if two Importers were used.

* A design whereby other types of modules can be placed into the
  filesystem and multiple Importers are used to load parts of the path
  (e.g. PathImporter for xml.parsers and DynLibImporter for pyexpat). This
  design doesn't work well because the mapping of Python module/package to
  the filesystem is established by PathImporter -- trying to mix a "private"
  mapping design among Importers creates too much coupling.


There is also an argument that the design is fundamentally incorrect :-).
I would argue against that, however. I'm not sure what form an argument
*against* imputil would take, so I'm not sure how to preempt it :-). But we
can get an idea of various arguments by hypothesizing different scenarios
and requiring that the imputil design satisfies them.

The two alternatives above examined the use of a secondary
Importer to load things out of the filesystem (and explained why two
Importers, in whatever configuration, are not a good thing). Let's state for
argument's sake that files of some type T must be placeable within the
filesystem (i.e. according to the layout defined by PathImporter). We'll
also say that PathImporter doesn't understand T, since the latter was
designed later or is private to some app. The way to solve this is to
allow PathImporter to recognize it through some configuration of the
instance (e.g. self.recognized_types). A set of hooks in the PathImporter
would then understand how to map files of type T to a code or module
object. (alternatively, a generalized set of hooks at the Importer class
level) Note that you could easily have a utility function that scans
sys.importers for a PathImporter instance and adds the data to recognize a
new type -- this would allow for simple installation of new types.
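
As a sketch only -- neither sys.importers nor recognized_types exists yet;
they are just the names used in this discussion -- such a utility function
might look like:

import sys

def add_recognized_type(suffix, hook):
    # Find the PathImporter on the (proposed) sys.importers list and teach
    # it a new file suffix; 'hook' maps a file of type T to a code or
    # module object.
    for importer in sys.importers:
        if importer.__class__.__name__ == 'PathImporter':
            importer.recognized_types[suffix] = hook
            return importer
    raise RuntimeError("no PathImporter installed")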

Note that PathImporter inherently defines a 1:1 mapping from a module to a
file. Archives (zip or jar files) cannot be recognized and handled by
PathImporter. An archive defines an entirely different style of mapping
between a module/package and a file in the filesystem. Of course, an
Importer that uses archives can certainly look for them in sys.path.

The imputil design is derived directly from the "import" statement. "Here
is a module/package name, give me a module."  (this is embodied in the
get_code() method in Importer)
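
For example, a toy Importer whose module source lives in an in-memory dict
only needs to answer that one question (the exact get_code() signature and
return convention shown here are a best guess at the imputil interface and
may differ in detail):

import imputil

class DictImporter(imputil.Importer):
    def __init__(self, sources):
        self.sources = sources          # { 'pkg.mod' : source string }

    def get_code(self, parent, modname, fqname):
        src = self.sources.get(fqname)
        if src is None:
            return None                 # not ours -- let the next Importer try
        code = compile(src, '<DictImporter: %s>' % fqname, 'exec')
        return 0, code, {}              # (is-package flag, code object, extra attrs)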

The find/load design established by ihooks is very filesystem-based. In
many situations, a find/load is very intertwined. If you want to take the
URL case, then just examine the actual network activity -- preferably, you
want a single transaction (e.g. one HTTP GET). Find/load implies two
transactions. With nifty context handling between the two steps, you can
get away with a single transaction. But the point is that the design
requires you to work around its inherent two-step mechanism and
establish a single step. This is weird, of course, because importing is
never *just* a find or a load, but always both.

Well... since I've satisfied myself that PathImporter needs to load
dynamic lib modules, I'm off to code it...

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From gstein@lyra.org  Wed Nov 24 01:45:29 1999
From: gstein@lyra.org (Greg Stein)
Date: Tue, 23 Nov 1999 17:45:29 -0800 (PST)
Subject: [Python-Dev] breaking out code for dynamic loading
Message-ID: 

Guido,

I can't find the message, but it seems that at some point you mentioned
wanting to break out importdl.c into separate files. The configure process
could then select the appropriate one to use for the platform.

Sounded great until I looked at importdl.c. There are 13 variants of
dynamic loading. That would imply 13 separate files/modules.

I'd be happy to break these out, but are you actually interested in that
many resulting modules? If so, then any suggestions for naming?
(e.g. aix_dynload, win32_dynload, mac_dynload)

Here are the variants:

* NeXT, using FVM shlibs             (USE_RLD)
* NeXT, using frameworks             (USE_DYLD)
* dl / GNU dld                       (USE_DL)
* SunOS, IRIX 5 shared libs          (USE_SHLIB)
* AIX dynamic linking                (_AIX)
* Win32 platform                     (MS_WIN32)
* Win16 platform                     (MS_WIN16)
* OS/2 dynamic linking               (PYOS_OS2)
* Mac CFM                            (USE_MAC_DYNAMIC_LOADING)
* HP/UX dyn linking                  (hpux)
* NetBSD shared libs                 (__NetBSD__)
* FreeBSD shared libs                (__FreeBSD__)
* BeOS shared libs                   (__BEOS__)


Could I suggest a new top-level directory in the Python distribution named
"Platform"? Move BeOS, PC, and PCbuild in there (bring back Mac?). Add new
directories for each of the above platforms and move the appropriate
portion of importdl.c into there as a Python C Extension Module. (the
module would still be statically linked into the interpreter!)

./configure could select the module and write a Setup.dynload, much like
it does with Setup.thread.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From gstein@lyra.org  Wed Nov 24 02:43:50 1999
From: gstein@lyra.org (Greg Stein)
Date: Tue, 23 Nov 1999 18:43:50 -0800 (PST)
Subject: [Python-Dev] another round of imputil work completed
In-Reply-To: 
Message-ID: 

On Tue, 23 Nov 1999, Greg Stein wrote:
>...
> Well... since I've satisfied to myself that PathImporter needs to load
> dynamic lib modules, I'm off to code it...

All right. imputil.py now comes with code to emulate the builtin Python
import mechanism. It loads all the same types of files, uses sys.path, and
(pointed out by JimA) loads builtins before looking on the path.

The only "feature" it doesn't support is using package.__path__ to look
for submodules. I never liked that thing, so it isn't in there.
(imputil *does* set the __path__ attribute, tho)

Code is available at:

   http://www.lyra.org/greg/python/imputil.py


Next step is to add a "standard" library/archive format. JimA and I have
been tossing some stuff back and forth on this.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/



From mal@lemburg.com  Wed Nov 24 08:34:52 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 24 Nov 1999 09:34:52 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.8
References: <002301bf3604$68fd8f00$0501a8c0@bobcat>
Message-ID: <383BA32C.2E6F4780@lemburg.com>

Mark Hammond wrote:
> 
> > Pretty quiet around here lately...
> 
> My guess is that most positions and opinions have been covered.  It is
> now probably time for less talk, and more code!

Or that everybody is on holidays... like Guido.
 
> It is time to start an implementation plan?  Do we start with /F's
> Unicode implementation (which /G *smirk* seemed to approve of)?  Who
> does what?  When can we start to play with it?

This depends on whether HP agrees on the current specs. If they
do, there should be code by mid December, I guess.
 
> And a key point that seems to have been thrust in our faces at the
> start and hardly mentioned recently - does the proposal as it stands
> meet our sponsor's (HP) requirements?

Haven't heard anything from them yet (this is probably mainly
due to Guido being offline).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    37 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mal@lemburg.com  Wed Nov 24 09:32:46 1999
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 24 Nov 1999 10:32:46 +0100
Subject: [Python-Dev] Import Design
Message-ID: <383BB0BE.BF116A28@lemburg.com>

Before hooking on to some more PathBuiltinImporters ;-), I'd like
to spawn a thread leading in a different direction...

There has been some discussion on what we really expect of the
import mechanism to be able to do. Here's a summary of what I
think we need:

* compatibility with the existing import mechanism

* imports from library archives (e.g. .pyl or .par-files)

* a modified intra package import lookup scheme (the thingy
  which I call "walk-me-up-Scotty" patch -- see previous posts)

And for some fancy stuff:

* imports from URLs (e.g. these could be put on the path for
  automatic inclusion in the import scan or be passed explicitly
  to __import__)

* a (file based) static lookup cache to enhance lookup
  performance which is enabled via a command line switch
  (rather than being enabled per default), so that the
  user can decide whether to apply this optimization or
  not

The point I want to make is: there aren't all that many features
we are really looking for, so why not incorporate these into
the builtin importer and only *then* start thinking about
schemes for hooks, managers, etc. ?!

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    37 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From andy@robanal.demon.co.uk  Wed Nov 24 11:40:16 1999
From: andy@robanal.demon.co.uk (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 24 Nov 1999 03:40:16 -0800 (PST)
Subject: [Python-Dev] Unicode Proposal: Version 0.8
Message-ID: <19991124114016.7706.rocketmail@web601.mail.yahoo.com>

--- Mark Hammond  wrote:
> > Pretty quiet around here lately...
> 
> My guess is that most positions and opinions have
> been covered.  It is
> now probably time for less talk, and more code!
> 
> It is time to start an implementation plan?  Do we
> start with /F's
> Unicode implementation (which /G *smirk* seemed to
> approve of)?  Who
> does what?  When can we start to play with it?
> 
> And a key point that seems to have been thrust in
> our faces at the
> start and hardly mentioned recently - does the
> proposal as it stands
> meet our sponsor's (HP) requirements?
> 
> Mark.

I had a long chat with them on Friday :-)  They want
it done, but nobody is actively working on it now as
far as I can tell, and they are very busy.

The per-thread thing was a red herring - they just
want to be able to do (for example) web servers
handling different encodings from a central unicode
database, so per-output-stream works just fine.
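
Something like the following, roughly (the helper name is made up, and
getwriter() is assumed as a convenience on top of the proposal's stream
writers):

import codecs

def make_response_writer(binary_stream, client_encoding):
    # Each client connection gets a writer with its own encoding, while the
    # server keeps all of its text as Unicode internally.
    return codecs.getwriter(client_encoding)(binary_stream)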

They will be at IPC8; I'd suggest a round of
prototyping, insisting that they read it and then
discussing it at IPC8, and being prepared to
rework things thereafter.  Hopefully then we'll have a
plan on how to tackle the much larger (but less
interesting to python-dev) job of writing and
verifying all the codecs and utilities.


Andy Robinson



=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Thousands of Stores.  Millions of Products.  All in one place.
Yahoo! Shopping: http://shopping.yahoo.com


From jim@interet.com  Wed Nov 24 14:43:57 1999
From: jim@interet.com (James C. Ahlstrom)
Date: Wed, 24 Nov 1999 09:43:57 -0500
Subject: [Python-Dev] Re: updated imputil
References: 
Message-ID: <383BF9AD.E183FB98@interet.com>

Greg Stein wrote:
> * A separate importer to just load dynamic libraries: this would need to
>   replicate PathImporter's mapping of Python module/package hierarchy onto
>   the filesystem. There would also be a sequencing issue because one
>   Importer's paths would be searched before the other's paths. Current
>   Python import rules establishes that a module earlier in sys.path
>   (whether a dyn-lib or not) is loaded before one later in the path. This
>   behavior could be broken if two Importers were used.

I would like to argue that on Windows, import of dynamic libraries is
broken.  If a file something.pyd is imported, then sys.path is searched
to find the module.  If a file something.dll is imported, the same thing
happens.  But Windows defines its own search order for *.dll files which
Python ignores.  I would suggest that this is wrong for files named
*.dll,
but OK for files named *.pyd.

A SysAdmin should be able to install and maintain *.dll as she has
been trained to do.  This makes maintaining Python installations
simpler and more un-surprising.

I have no solution to the backward compatibility problem.  But the
code is only a couple lines.  A LoadLibrary() call does its own
path searching.

Jim Ahlstrom


From jim@interet.com  Wed Nov 24 15:06:17 1999
From: jim@interet.com (James C. Ahlstrom)
Date: Wed, 24 Nov 1999 10:06:17 -0500
Subject: [Python-Dev] Import Design
References: <383BB0BE.BF116A28@lemburg.com>
Message-ID: <383BFEE9.B4FE1F19@interet.com>

"M.-A. Lemburg" wrote:

> The point I want to make is: there aren't all that many features
> we are really looking for, so why not incorporate these into
> the builtin importer and only *then* start thinking about
> schemes for hooks, managers, etc. ?!

Marc has made this point before, and I think it should be
considered carefully.  It is a lot of work to re-create the
current import logic in Python and it is almost guaranteed
to be slower.  So why do it?

I like imputil.py because it leads
to very simple Python installations.  I view this as
a Python promotion issue.  If we have a boot mechanism plus
archive files, we can have few-file Python installations
with package addition being just adding another file.

But at least some of this code must be in C.  I volunteer to
write the rest of it in C if that is what people want.  But it
would add two hundred more lines of code to import.c.  So
maybe now is the time to switch to imputil, instead of waiting
for later.

But I am indifferent as long as I can tell a Python user
to just put an archive file libpy.pyl in his Python directory
and everything will Just Work.

Jim Ahlstrom


From bwarsaw@python.org (Barry Warsaw)  Tue Nov 30 20:23:40 1999
From: bwarsaw@python.org (Barry Warsaw) (Barry Warsaw)
Date: Tue, 30 Nov 1999 15:23:40 -0500 (EST)
Subject: [Python-Dev] CFP Developers' Day - 8th International Python Conference
Message-ID: <14404.12876.847116.288848@anthem.cnri.reston.va.us>

Hello Python Developers!

Thursday January 27 2000, the final day of the 8th International
Python Conference is Developers' Day, where Python hackers get
together to discuss and reach agreements on the outstanding issues
facing Python.  This is also your once-a-year chance for face-to-face
interactions with Python's creator Guido van Rossum and other
experienced Python developers.

To make Developers' Day a success, we need you!  We're looking for a
few good champions to lead topic sessions.  As a champion, you will
choose a topic that fires you up and write a short position paper for
publication on the web prior to the conference.  You'll also prepare
introductory material for the topic overview session, and lead a 90
minute topic breakout group.

We've had great champions and topics in previous years, and many
features of today's Python had their start at past Developers' Days.
This is your chance to help shape the future of Python for 1.6,
2.0 and beyond.

If you are interested in becoming a topic champion, you must email me
by Wednesday December 15, 1999.  For more information, please visit
the IPC8 Developers' Day web page at

    

This page has more detail on schedule, suggested topics, important
dates, etc.  To volunteer as a champion, or to ask other questions,
you can email me at bwarsaw@python.org.

-Barry


From mal at lemburg.com  Mon Nov  1 00:00:55 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 01 Nov 1999 00:00:55 +0100
Subject: [Python-Dev] Misleading syntax error text
References: <1270838575-13870925@hypernet.com>
Message-ID: <381CCA27.59506CF6@lemburg.com>

[Extracted from the psa-members list...]

Gordon McMillan wrote:
> 
> Chris Fama wrote,
> > And now the rub: the exact same function definition has passed
> > through byte-compilation perfectly OK many times before with no
> > problems... of course, this points rather clearly to the
> > preceding code, but it illustrates a failing in Python's syntax
> > error messages, and IMHO a fairly serious one at that, if this is
> > indeed so.
> 
> My simple experiments refuse to compile a "del getattr(..)" at
> all.

Hmm, it seems to be a fairly generic error:

>>> del f(x,y)
SyntaxError: can't assign to function call

How about changing the com_assign_trailer function in Python/compile.c
to:

static void
com_assign_trailer(c, n, assigning)
        struct compiling *c;
        node *n;
        int assigning;
{
        REQ(n, trailer);
        switch (TYPE(CHILD(n, 0))) {
        case LPAR: /* '(' [exprlist] ')' */
                com_error(c, PyExc_SyntaxError,
                          assigning ? "can't assign to function call":
			              "can't delete expression");
                break;
        case DOT: /* '.' NAME */
                com_assign_attr(c, CHILD(n, 1), assigning);
                break;
        case LSQB: /* '[' subscriptlist ']' */
                com_subscriptlist(c, CHILD(n, 1), assigning);
                break;
        default:
                com_error(c, PyExc_SystemError, "unknown trailer type");
        }
}

or something along those lines...
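
For a quick check of what the compiler reports in the two cases (whatever
the exact wording, they should now differ):

import sys

for stmt in ("del f(x, y)", "f(x, y) = 1"):
    try:
        compile(stmt, "<test>", "exec")
    except SyntaxError:
        print("%-12s -> %s" % (stmt, sys.exc_info()[1]))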

BTW, has anybody tried my import patch recently ? I haven't heard
any criticism since posting it and wonder what made the list fall
asleep over the topic :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    61 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mhammond at skippinet.com.au  Mon Nov  1 02:51:56 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon, 1 Nov 1999 12:51:56 +1100
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
Message-ID: <002301bf240b$ae61fa00$0501a8c0@bobcat>

I have for some time been wondering about the usefulness of this
mailing list.  It seems to have produced staggeringly few results
since inception.

This is not a critisism of any individual, but of the process.  It is
proof in my mind of how effective the benevolent dictator model is,
and how ineffective a language run by committee would be.

This "committee" never seems to be capable of reaching a consensus on
anything.  A number of issues dont seem to provoke any responses.  As
a result, many things seem to die a slow and lingering death.  Often
there is lots of interesting discussion, but still precious few
results.

In the pre python-dev days, the process seemed easier - we mailed
Guido directly, and he either stated "yea" or "nay" - maybe we didn't
get the response we hoped for, but at least we got a response.  Now,
we have the result that even if Guido does enter into a thread, the
noise seems to drown out any hope of getting anything done.  Guido
seems to be faced with the dilemma of asserting his dictatorship in
the face of many dissenting opinions from many people he respects, or
putting it in the too hard basket.  I fear the latter is the easiest
option.  At the end of this mail I list some of the major threads over
the last few months, and can't see a single thread that has resulted
in a CVS checkin, and only one that has resulted in agreement.  This,
to my mind at least, is proof that things are really not working.

I long for the "good old days" - take the replacement of "ni" with
built-in functionality, for example.  I posit that if this was
discussed on python-dev, it would have caused a huge flood of mail,
and nothing remotely resembling a consensus.  Instead, Guido simply
wrote an essay and implemented some code that he personally liked.  No
debate, no discussion.  Still an excellent result.  Maybe not a
perfect result, but a result nonetheless.

However, Guido's time is becoming increasingly limited.  So should we
consider moving to a "benevolent lieutenant" model, in conjunction
with re-ramping up the SIGS?  This would provide 2 ways to get things
done:

* A new SIG.  Take relative imports, for example.  If we really do
need a change in this fairly fundamental area, a SIG would be
justified ("import-sig").  The responsibility of the SIG is to form a
consensus (and code that reflects it), and report back to Guido (and
the main newsgroup) with the result of this.  It worked well for RE,
and allowed those of us not particularly interested to keep out of the
debate.  If the SIG can not form consensus, then tough - it dies - and
should not be mourned.  Presumably Guido would keep a watchful eye
over the SIG, providing direction where necessary, but in general stay
out of the day to day traffic.  New SIGs seem to have stopped since
this list creation, and it seems that issues that should be discussed
in new SIGS are now discussed here.

*  Guido could delegate some of his authority to a single individual
responsible for a certain limited area - a benevolent lieutenant.  We
might have a lieutenant responsible for each of several areas, who could
only exercise their authority over small, trivial changes.  Eg, the "getopt
helper" thread - if a lieutenant was given authority for the "standard
library", they could simply make a yea or nay decision, and present it
to Guido.  Presumably Guido trusts this person he delegated to enough
that the majority of the lieutenant's recommendations would be
accepted.  Presumably there would be a small number of lieutenants,
and they would then become the new "python-dev" - say up to 5 people.
This list then discusses high level strategies and seeks direction from
each other when things get murky.  This select group of people may not
(indeed, probably would not) include me, but I would have no problem
with that - I would prefer to see results achieved than have my own
ego stroked by being included in a select, but ineffective group.

In parting, I repeat this is not a direct criticism, simply an
observation of the last few months.  I am on this list, so I am
definitely as guilty as anyone else - which is "not at all" - ie, no
one is guilty, I simply see it as endemic to a committee with people
of diverse backgrounds, skills and opinions.

Any thoughts?

Long live the dictator! :-)

Mark.

Recent threads, and my take on the results:

* getopt helper?
Too much noise regarding semantic changes.

* Alternative Approach to Relative Imports
* Relative package imports
* Path hacking
* Towards a Python based import scheme
Too much noise - no one could really agree on the semantics.
Implementation thrown in the ring, and promptly forgotten.

* Corporate installations
Very young, but no result at all.

* Embedding Python when using different calling conventions
Quite young, but no result as yet, and I have no reason to believe
there will be.

* Catching "return" and "return expr" at compile time
Seemed to be blessed - yay!  Don't believe I have seen a check-in yet.

* More Python command-line features
Seemed general agreement, but nothing happened?

* Tackling circular dependencies in 2.0?
Lots of noise, but no results other than "GC may be there in 2.0"

* Buffer interface in abstract.c
Determined it could break - no solution proposed.  Lots of noise
regarding whether it is a good idea at all!

* mmapfile module
No result.

* Quick-and-dirty weak references
No result.

* Portable "spawn" module for core?
No result.

* Fake threads
Seemed to spawn stackless Python, but in the face of Guido being "at
best, lukewarm" about this issue, I would again have to conclude "no
result".  An authorative "no" in this area may have saved lots of
effort and heartache.

* add Expat to 1.6
No result.

* I'd like list.pop to accept an optional second argument giving a
default value
No result

* etc
No result.




From jack at oratrix.nl  Mon Nov  1 10:56:48 1999
From: jack at oratrix.nl (Jack Jansen)
Date: Mon, 01 Nov 1999 10:56:48 +0100
Subject: [Python-Dev] Embedding Python when using different calling 
 conventions.
In-Reply-To: Message by "M.-A. Lemburg"  ,
	     Sat, 30 Oct 1999 10:46:30 +0200 , <381AB066.B54A47E0@lemburg.com> 
Message-ID: <19991101095648.DC2E535BB1E@snelboot.oratrix.nl>


> OTOH, we could take chance to reorganize these macros from bottom
> up: when I started coding extensions I found them not very useful
> mostly because I didn't have control over them meaning "export
> this symbol" or "import the symbol". Especially the DL_IMPORT
> macro is strange because it seems to handle both import *and*
> export depending on whether Python is compiled or not.

This would be very nice. The DL_IMPORT/DL_EXPORT stuff is really weird unless 
you're working with it all the time. We were trying to build a plugin DLL for 
PythonWin and first you spend hours finding out that you have to set DL_IMPORT 
(and how to set it), and then you spend another few hours before you realize 
that you can't simply copy the DL_IMPORT and DL_EXPORT from, say, timemodule.c 
because timemodule.c is going to be in the Python core (and hence can use 
DL_IMPORT for its init() routine declaration) while your module is going to be 
a plugin so it can't.

I would opt for a scheme where the define shows where the symbol is expected 
to live (DL_CORE and DL_THISMODULE would be needed at least, but probably one 
or two more for .h files).
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From jack at oratrix.nl  Mon Nov  1 11:12:37 1999
From: jack at oratrix.nl (Jack Jansen)
Date: Mon, 01 Nov 1999 11:12:37 +0100
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic 
 committee?
In-Reply-To: Message by "Mark Hammond"  ,
	     Mon, 1 Nov 1999 12:51:56 +1100 , <002301bf240b$ae61fa00$0501a8c0@bobcat> 
Message-ID: <19991101101238.3D6FA35BB1E@snelboot.oratrix.nl>

I think I agree with Mark's post, although I do see a little more light (the 
relative imports discussion resulted in working code, for instance).

The benevolent lieutenant idea may work, _if_ the lieutenants can be found. I 
myself will quickly join Mark in wishing the new python-dev well and 
abandoning ship (half a :-).

If that doesn't work maybe we should try at the very least to create a 
"memory". If you bring up a subject for discussion and you don't have working 
code that's fine the first time. But if anyone brings it up a second time 
they're supposed to have code. That way at least we won't be rehashing old 
discussions (as happens on the python-list every time, with subjects like GC 
or optimizations).

And maybe we should limit ourselves in our replies: don't speak up too much in 
discussions if you're not going to write code. I know that I'm pretty good at 
answering with my brilliant insights to everything myself:-). It could well be 
that refining and refining the design (as in the getopt discussion) results in 
such a mess of opinions that no-one has the guts to write the code anymore.
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From mal at lemburg.com  Mon Nov  1 12:09:21 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 01 Nov 1999 12:09:21 +0100
Subject: [Python-Dev] dircache.py
References: <1270737688-19939033@hypernet.com>
Message-ID: <381D74E0.1AE3DA6A@lemburg.com>

Gordon McMillan wrote:
> 
> Pursuant to my volunteering to implement Guido's plan to
> combine cmp.py, cmpcache.py, dircmp.py and dircache.py
> into filecmp.py, I did some investigating of dircache.py.
> 
> I find it completely unreliable. On my NT box, the mtime of the
> directory is updated (on average) 2 secs after a file is added,
> but within 10 tries, there's always one in which it takes more
> than 100 secs (and my test script quits). My Linux box hardly
> ever detects a change within 100 secs.
> 
> I've tried a number of ways of testing this ("this" being
> checking for a change in the mtime of the directory), the latest
> of which is below. Even if dircache can be made to work
> reliably and surprise-free on some platforms, I doubt it can be
> done cross-platform. So I'd recommend that it just get dropped.
> 
> Comments?

Note that you'll have to flush and close the tmp file to actually
have it written to the file system. That's why you are not seeing
any new mtimes on Linux.

Still, I'd suggest declaring it obsolete. Filesystem access is
usually cached by the underlying OS anyway, so adding another layer of
caching on top of it doesn't seem worthwhile (plus, the OS knows
better when and what to cache).

Another argument against using stat() time entries for caching
purposes is their resolution of 1 second.  It makes dircache.py
unreliable per se for fast-changing directories.

The problem is most probably even worse for NFS, and on Samba-mounted
WinXX filesystems the mtime trick doesn't work at all (stat()
returns the creation time for atime, mtime and ctime).
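
To make the failure mode concrete, here is a small illustrative sketch
(a hypothetical helper, not dircache.py itself) of an mtime-keyed
directory cache.  On a filesystem with 1-second timestamps, two updates
landing within the same second leave the directory mtime unchanged, so
the second lookup can happily serve the stale listing:

import os, tempfile

_cache = {}   # path -> (mtime, listing)

def cached_listdir(path):
    # naive dircache-style invalidation, keyed on the directory's mtime
    mtime = os.stat(path).st_mtime
    entry = _cache.get(path)
    if entry is None or entry[0] != mtime:
        entry = (mtime, os.listdir(path))
        _cache[path] = entry
    return entry[1]

d = tempfile.mkdtemp()
open(os.path.join(d, "a"), "w").close()
print(cached_listdir(d))                   # primes the cache
open(os.path.join(d, "b"), "w").close()    # same second: mtime may not change...
print(cached_listdir(d))                   # ...so "b" may be missing here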

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    60 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From gward at cnri.reston.va.us  Mon Nov  1 14:28:51 1999
From: gward at cnri.reston.va.us (Greg Ward)
Date: Mon, 1 Nov 1999 08:28:51 -0500
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
In-Reply-To: <002301bf240b$ae61fa00$0501a8c0@bobcat>; from mhammond@skippinet.com.au on Mon, Nov 01, 1999 at 12:51:56PM +1100
References: <002301bf240b$ae61fa00$0501a8c0@bobcat>
Message-ID: <19991101082851.A16952@cnri.reston.va.us>

On 01 November 1999, Mark Hammond said:
> I have for some time been wondering about the usefulness of this
> mailing list.  It seems to have produced staggeringly few results
> since inception.

Perhaps this is an indication of stability rather than stagnation.  Of
course we can't have *total* stability or Python 1.6 will never appear,
but...

> * Portable "spawn" module for core?
> No result.

...I started this little thread to see if there was any interest, and to
find out the easy way if VMS/Unix/DOS-style "spawn sub-process with list
of strings as command-line arguments" makes any sense at all on the Mac
without actually having to go learn about the Mac.

The result: if 'spawn()' is added to the core, it should probably be
'os.spawn()', but it's not really clear if this is necessary or useful
to many people; and, no, it doesn't make sense on the Mac.  That
answered my questions, so I don't really see the thread as a failure.  I
might still turn the distutils.spawn module into an appendage of the os
module, but there doesn't seem to be a compelling reason to do so.

Not every thread has to result in working code.  In other words,
negative results are results too.

        Greg



From skip at mojam.com  Mon Nov  1 17:58:41 1999
From: skip at mojam.com (Skip Montanaro)
Date: Mon, 1 Nov 1999 10:58:41 -0600 (CST)
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
In-Reply-To: <002301bf240b$ae61fa00$0501a8c0@bobcat>
References: <002301bf240b$ae61fa00$0501a8c0@bobcat>
Message-ID: <14365.50881.778143.590205@dolphin.mojam.com>

    Mark> * Catching "return" and "return expr" at compile time
    Mark> Seemed to be blessed - yay!  Dont believe I have seen a check-in
    Mark> yet. 

I did post a patch to compile.c here and to the announce list.  I think the
temporal distance between the furor in the main list and when it appeared
"in print" may have been a problem.  Also, as the author of that code I
surmised that compile.c was the wrong place for it.  I would have preferred
to see it in some Python code somewhere, but there's no obvious place to put
it.  Finally, there is as yet no convention about how to handle warnings.
(Maybe some sort of PyLint needs to be "blessed" and made part of the
distribution.)

Perhaps python-dev would be a good place to generate SIGs, sort of like a hurricane
spinning off tornadoes.

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...




From guido at CNRI.Reston.VA.US  Mon Nov  1 19:41:32 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 01 Nov 1999 13:41:32 -0500
Subject: [Python-Dev] Misleading syntax error text
In-Reply-To: Your message of "Mon, 01 Nov 1999 00:00:55 +0100."
             <381CCA27.59506CF6@lemburg.com> 
References: <1270838575-13870925@hypernet.com>  
            <381CCA27.59506CF6@lemburg.com> 
Message-ID: <199911011841.NAA06233@eric.cnri.reston.va.us>

> How about chainging the com_assign_trailer function in Python/compile.c
> to:

Please don't use the python-dev list for issues like this.  The place
to go is the python-bugs database
(http://www.python.org/search/search_bugs.html) or you could just send
me a patch (please use a context diff and include the standard disclaimer
language).

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Mon Nov  1 20:06:39 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 01 Nov 1999 20:06:39 +0100
Subject: [Python-Dev] Misleading syntax error text
References: <1270838575-13870925@hypernet.com>  
	            <381CCA27.59506CF6@lemburg.com> <199911011841.NAA06233@eric.cnri.reston.va.us>
Message-ID: <381DE4BF.951B03F0@lemburg.com>

Guido van Rossum wrote:
> 
> > How about chainging the com_assign_trailer function in Python/compile.c
> > to:
> 
> Please don't use the python-dev list for issues like this.  The place
> to go is the python-bugs database
> (http://www.python.org/search/search_bugs.html) or you could just send
> me a patch (please use a context diff and include the standard disclaimer
> language).

This wasn't really a bug report... I was actually looking for some
feedback prior to sending a real (context) patch.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    60 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From jim at interet.com  Tue Nov  2 16:43:56 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Tue, 02 Nov 1999 10:43:56 -0500
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
References: <002301bf240b$ae61fa00$0501a8c0@bobcat>
Message-ID: <381F06BC.CC2CBFBD@interet.com>

Mark Hammond wrote:
> 
> I have for some time been wondering about the usefulness of this
> mailing list.  It seems to have produced staggeringly few results
> since inception.

I appreciate the points you made, but I think this list is still
a valuable place to air design issues.  I don't want to see too
many Python core changes anyway.  Just my 2.E-2 worth.

Jim Ahlstrom



From Vladimir.Marangozov at inrialpes.fr  Wed Nov  3 23:34:44 1999
From: Vladimir.Marangozov at inrialpes.fr (Vladimir Marangozov)
Date: Wed, 3 Nov 1999 23:34:44 +0100 (NFT)
Subject: [Python-Dev] paper available
Message-ID: <199911032234.XAA26442@pukapuka.inrialpes.fr>

I've OCR'd Saltzer's paper. It's available temporarily (in MS Word
format) at http://sirac.inrialpes.fr/~marangoz/tmp/Saltzer.zip

Since there may be legal problems with LNCS, I will disable the
link shortly (so those of you who have not received a copy and are
interested in reading it, please grab it quickly).

If prof. Saltzer agrees (and if he can, legally) to put it on his web page,
I guess that the paper will show up at http://mit.edu/saltzer/

Jeremy, could you please check this with prof. Saltzer? (This version
might need some corrections due to the OCR process, despite that I've
made a significant effort to clean it up)

-- 
       Vladimir MARANGOZOV          | Vladimir.Marangozov at inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252



From guido at CNRI.Reston.VA.US  Thu Nov  4 21:58:53 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 04 Nov 1999 15:58:53 -0500
Subject: [Python-Dev] wish list
Message-ID: <199911042058.PAA15437@eric.cnri.reston.va.us>

I got the wish list below.  Anyone care to comment on how close we are
on fulfilling some or all of this?

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date:    Thu, 04 Nov 1999 20:26:54 +0700
From:    "Claudio Ram?n" 
To:      guido at python.org

Hello,
  I'm a Python user (excuse my English, I'm Spanish and...). I think it is a
very complete language and I use it to solve statistics, physics,
mathematics, chemistry and biology problems. I'm not an
experienced programmer, only a scientist with problems to solve.
The purpose of this letter is to explain some needs that I have when
using Python and that I hope future versions will address...
* GNU CC for Win32 compatibility (compilation of the Python interpreter and
the "Freeze" utility). I think MingWin32 (Mumit Khan) is a good alternative
that avoids requiring the cygwin DLL.
* Add low-level programming capabilities for system access and for speeding
up code fragments, avoiding the need to drop into C/C++ or Java. Python, I
think, must be a complete programming language in the "programming for
everybody" philosophy.
* Incorporate WxWindows (wxPython) and/or Gtk+ (a Win32 port now exists) GUIs
in the standard distribution. For example, wxPython provides an HTML browser,
which is very important for document presentations. And WxWindows and Gtk+
are faster than Tk.
* Incorporate a database system in the standard library distribution, if
possible with relational and document capabilities and with import facilities
for DBASE, Paradox and MS Access files.
* Incorporate an XML/HTML/Math-ML editor/browser with graphics capability (if
possible with XML as the internal file format), and if possible with
Microsoft Word import/export facilities. For example, the AbiWord project
could be an alternative, but it lacks a programming language. If we could
make Python the programming language for the AbiWord project...

Thanks.
Ramón Molina.
rmn70 at hotmail.com

______________________________________________________
Get Your Private, Free Email at http://www.hotmail.com

------- End of Forwarded Message




From skip at mojam.com  Thu Nov  4 22:06:53 1999
From: skip at mojam.com (Skip Montanaro)
Date: Thu, 4 Nov 1999 15:06:53 -0600 (CST)
Subject: [Python-Dev] wish list
In-Reply-To: <199911042058.PAA15437@eric.cnri.reston.va.us>
References: <199911042058.PAA15437@eric.cnri.reston.va.us>
Message-ID: <14369.62829.389307.377095@dolphin.mojam.com>

     * Incorporate a database system in the standard library
       distribution. To be possible with relational and documental
       capabilites and with import facility of DBASE, Paradox, MSAccess
       files.

I know Digital Creations has a dbase module knocking around there somewhere.
I hacked on it for them a couple years ago.  You might see if JimF can
scrounge it up and donate it to the cause.

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...




From fdrake at acm.org  Thu Nov  4 22:08:26 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 4 Nov 1999 16:08:26 -0500 (EST)
Subject: [Python-Dev] wish list
In-Reply-To: <199911042058.PAA15437@eric.cnri.reston.va.us>
References: <199911042058.PAA15437@eric.cnri.reston.va.us>
Message-ID: <14369.62922.994300.233350@weyr.cnri.reston.va.us>

Guido van Rossum writes:
 > I got the wish list below.  Anyone care to comment on how close we are
 > on fulfilling some or all of this?

Claudio Ramón wrote:
 > * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI 
 > in the standard distribution. For example, Wxpython permit an html browser. 
 > It is very importan for document presentations. And Wxwindows and Gtk+ are 
 > faster than tk.

  And GTK+ looks better, too.  ;-)
  None the less, I don't think GTK+ is as solid or mature as Tk.
There are still a lot of oddities, and several warnings/errors get
messages printed on stderr/stdout (don't know which) rather than
raising exceptions.  (This is a failing of GTK+, not PyGTK.)  There
isn't an equivalent of the Tk text widget, which is a real shame.
There are people working on something better, but it's not a trivial
project and I don't have any idea how it's going.

 > * Incorporate a database system in the standard library distribution. To be 
 > possible with relational and documental capabilites and with import facility 
 > of DBASE, Paradox, MSAccess files.

  Doesn't sound like part of a core library really, though I could see 
combining the Win32 extensions with the core package to produce a
single installable.  That should at least provide access to MSAccess,
and possibly the others, via ODBC.

 > * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to 
 > be possible with XML how internal file format). And to be possible with 
 > Microsoft Word import export facility. For example, AbiWord project can be 
 > an alternative but if lacks programming language. If we can make python the 
 > programming language for AbiWord project...

  I think this would be great to have.  But I wouldn't put the
editor/browser in the core.  I would stick something like the
XML-SIG's package in, though, once that's better polished.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From jim at interet.com  Fri Nov  5 01:09:40 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Thu, 04 Nov 1999 19:09:40 -0500
Subject: [Python-Dev] wish list
References: <199911042058.PAA15437@eric.cnri.reston.va.us>
Message-ID: <38222044.46CB297E@interet.com>

Guido van Rossum wrote:
> 
> I got the wish list below.  Anyone care to comment on how close we are
> on fulfilling some or all of this?

> * GNU CC for Win32 compatibility (compilation of python interpreter and
> "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative
> eviting the cygwin dll user.

I don't know what this means.

> * Add low level programming capabilities for system access and speed of code
> fragments eviting the C-C++ or Java code use. Python, I think, must be a
> complete programming language in the "programming for every body" philosofy.

I don't know what this means in practical terms either.  I use
the C interface for this.

> * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI
> in the standard distribution. For example, Wxpython permit an html browser.
> It is very importan for document presentations. And Wxwindows and Gtk+ are
> faster than tk.

As a Windows user, I don't feel comfortable publishing GUI code
based on these tools.  Maybe they have progressed and I should
look at them again.  But I doubt the Python world is going to
standardize on a single GUI anyway.

Does anyone out there publish Windows Python code with a Windows
Python GUI?  If so, what GUI toolkit do you use?

Jim Ahlstrom



From rushing at nightmare.com  Fri Nov  5 08:22:22 1999
From: rushing at nightmare.com (Sam Rushing)
Date: Thu, 4 Nov 1999 23:22:22 -0800 (PST)
Subject: [Python-Dev] wish list
In-Reply-To: <668469884@toto.iv>
Message-ID: <14370.34222.884193.260990@seattle.nightmare.com>

James C. Ahlstrom writes:
 > Guido van Rossum wrote:
 > > I got the wish list below.  Anyone care to comment on how close we are
 > > on fulfilling some or all of this?
 > 
 > > * GNU CC for Win32 compatibility (compilation of python interpreter and
 > > "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative
 > > eviting the cygwin dll user.
 > 
 > I don't know what this means.

mingw32: 'minimalist gcc for win32'.  it's gcc on win32 without trying
to be unix. It links against crtdll, so for example it can generate
small executables that run on any win32 platform.  Also, an
alternative to plunking down money every year to keep up with MSVC++.

I used to use mingw32 a lot, and it's even possible to set up egcs to
cross-compile to it.  At one point using egcs on linux I was able to
build a stripped-down python.exe for win32...

  http://agnes.dida.physik.uni-essen.de/~janjaap/mingw32/

-Sam




From jim at interet.com  Fri Nov  5 15:04:59 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Fri, 05 Nov 1999 09:04:59 -0500
Subject: [Python-Dev] wish list
References: <14370.34222.884193.260990@seattle.nightmare.com>
Message-ID: <3822E40B.99BA7CA0@interet.com>

Sam Rushing wrote:

> mingw32: 'minimalist gcc for win32'.  it's gcc on win32 without trying
> to be unix. It links against crtdll, so for example it can generate

OK, thanks.  But I don't believe this is something that
Python should pursue.  Binaries are available for Windows
and Visual C++ is widely available and has a professional
debugger (etc.).

Jim Ahlstrom



From skip at mojam.com  Fri Nov  5 18:17:58 1999
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 5 Nov 1999 11:17:58 -0600 (CST)
Subject: [Python-Dev] paper available
In-Reply-To: <199911032234.XAA26442@pukapuka.inrialpes.fr>
References: <199911032234.XAA26442@pukapuka.inrialpes.fr>
Message-ID: <14371.4422.96832.498067@dolphin.mojam.com>

    Vlad> I've OCR'd Saltzer's paper. It's available temporarily (in MS Word
    Vlad> format) at http://sirac.inrialpes.fr/~marangoz/tmp/Saltzer.zip

I downloaded it and took a very quick peek at it, but its applicability to
Python wasn't immediately obvious to me.  Did you download it in response to
some other thread I missed somewhere?

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From gstein at lyra.org  Fri Nov  5 23:19:49 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 5 Nov 1999 14:19:49 -0800 (PST)
Subject: [Python-Dev] wish list
In-Reply-To: <3822E40B.99BA7CA0@interet.com>
Message-ID: 

On Fri, 5 Nov 1999, James C. Ahlstrom wrote:
> Sam Rushing wrote:
> > mingw32: 'minimalist gcc for win32'.  it's gcc on win32 without trying
> > to be unix. It links against crtdll, so for example it can generate
> 
> OK, thanks.  But I don't believe this is something that
> Python should pursue.  Binaries are available for Windows
> and Visual C++ is widely available and has a professional
> debugger (etc.).

If somebody is willing to submit patches, then I don't see a problem with
it. There are quite a few people who are unable/unwilling to purchase
VC++. People may also need to build their own Python rather than using the
prebuilt binaries.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Sun Nov  7 14:24:24 1999
From: gstein at lyra.org (Greg Stein)
Date: Sun, 7 Nov 1999 05:24:24 -0800 (PST)
Subject: [Python-Dev] updated modules
Message-ID: 

Hi all...

I've updated some of the modules at http://www.lyra.org/greg/python/.

Specifically, there is a new httplib.py, davlib.py, qp_xml.py, and
a new imputil.py. The latter will be updated again RSN with some patches
from Jim Ahlstrom.

Besides some tweaks/fixes/etc, I've also clarified the ownership and
licensing of the things. httplib and davlib are (C) Guido, licensed under
the Python license (well... anything he chooses :-). qp_xml and imputil
are still Public Domain. I also added some comments into the headers to
note where they come from (I've had a few people remark that they ran
across the module but had no idea who wrote it or where to get updated
versions :-), and I inserted a CVS Id to track the versions (yes, I put
them into CVS just now).

Note: as soon as I figure out the paperwork or whatever, I'll also be
skipping the whole "wetsign.txt" thingy and just transfer everything to
Guido. He remarked a while ago that he will finally own some code in the
Python distribution(!) despite not writing it :-)

I might encourage others to consider the same...

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Mon Nov  8 10:33:30 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 08 Nov 1999 10:33:30 +0100
Subject: [Python-Dev] wish list
References: <199911042058.PAA15437@eric.cnri.reston.va.us>
Message-ID: <382698EA.4DBA5E4B@lemburg.com>

Guido van Rossum wrote:
> 
> * GNU CC for Win32 compatibility (compilation of python interpreter and
> "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative
> eviting the cygwin dll user.

I think this would be a good alternative for all those not having MS VC
for one reason or another. Since Mingw32 is free this might be an
appropriate solution for e.g. schools which don't want to spend lots
of money for VC licenses.

> * Add low level programming capabilities for system access and speed of code
> fragments eviting the C-C++ or Java code use. Python, I think, must be a
> complete programming language in the "programming for every body" philosofy.

Don't know what he meant here...

> * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI
> in the standard distribution. For example, Wxpython permit an html browser.
> It is very importan for document presentations. And Wxwindows and Gtk+ are
> faster than tk.

GUIs tend to be fast moving targets, better leave them out of the
main distribution.

> * Incorporate a database system in the standard library distribution. To be
> possible with relational and documental capabilites and with import facility
> of DBASE, Paradox, MSAccess files.

Database interfaces are usually way too complicated and largish for the standard
dist. IMHO, they should always be packaged separately. Note that simple
interfaces such as a standard CSV file import/export module would be
neat extensions to the dist.

> * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to
> be possible with XML how internal file format). And to be possible with
> Microsoft Word import export facility. For example, AbiWord project can be
> an alternative but if lacks programming language. If we can make python the
> programming language for AbiWord project...

I'm getting the feeling that Ramon is looking for a complete
visual programming environment here. XML support in the standard
dist (faster than xmllib.py) would be nice. Before that we'd need solid builtin
Unicode support though...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    53 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From captainrobbo at yahoo.com  Tue Nov  9 14:57:46 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 9 Nov 1999 05:57:46 -0800 (PST)
Subject: [Python-Dev] Internationalisation Case Study
Message-ID: <19991109135746.20446.rocketmail@web608.mail.yahoo.com>

Guido has asked me to get involved in this discussion,
as I've been working practically full-time on i18n for
the last year and a half and have done quite a bit
with Python in this regard.  I thought the most
helpful thing would be to describe the real-world
business problems I have been tackling so people can
understand what one might want from an encoding
toolkit.  In this (long) post I have included:
1. who I am and what I want to do
2. useful sources of info
3. a real world i18n project
4. what I'd like to see in an encoding toolkit


Grab a coffee - this is a long one.

1. Who I am
--------------
Firstly, credentials.  I'm a Python programmer by
night, and when I can involve it in my work which
happens perhaps 20% of the time.  More relevantly, I
did a postgrad course in Japanese Studies and lived in
Japan for about two years; in 1990 when I returned, I
was speaking fairly fluently and could read a
newspaper with regular reference to a dictionary. 
Since then my Japanese has atrophied badly, but it is
good enough for IT purposes.  For the last year and a
half I have been internationalizing a lot of systems -
more on this below.

My main personal interest is that I am hoping to
launch a company using Python for reporting, data
cleaning and transformation.  An encoding library is
sorely needed for this.

2. Sources of Knowledge
------------------------------
We should really go for world class advice on this. 
Some people who could really contribute to this
discussion are:
- Ken Lunde, author of "CJKV Information Processing"
and head of Asian Type Development at Adobe.  
- Jeffrey Friedl, author of "Mastering Regular
Expressions", and a long time Japan resident and
expert on things Japanese
- Maybe some of the Ruby community?

I'll list books, URLs etc. for anyone who needs them
on request.

3. A Real World Project
----------------------------
18 months ago I was offered a contract with one of the
world's largest investment management companies (which
I will nickname HugeCo), who (after many years having
analysts out there) were launching a business in Japan
to attract savers; due to recent legal changes,
Japanese people can now freely buy into mutual funds
run by foreign firms.  Given the 2% they historically
get on their savings, and the 12% that US equities
have returned for most of this century, this is a
business with huge potential.  I've been there for a
while now, 
rotating through many different IT projects.

HugeCo runs its non-US business out of the UK.  The
core deal-processing business runs on IBM AS400s. 
These are kind of a cross between a relational
database and a file system, and speak their own
encoding called EBCDIC.    Five years ago the AS400
had limited
connectivity to everything else, so they also started
deploying Sybase databases on Unix to support some
functions.  This means 'mirroring' data between the
two systems on a regular basis.  IBM has always
included encoding information on the AS400 and it
converts from EBCDIC to ASCII on request with most of
the transfer tools (FTP, database queries etc.)

To make things work for Japan, everyone realised that
a double-byte representation would be needed. 
Japanese has about 7000 characters in most IT-related
character sets, and there are a lot of ways to store
it.  Here's a potted language lesson.  (Apologies to
people who really know this field -- I am not going to
be fully pedantic or this would take forever).

Japanese includes two phonetic alphabets (each with
about 80-90 characters), the thousands of Kanji, and
English characters, often all in the same sentence.  
The first attempt to display something was to
make a single-byte character set which included
ASCII, and a simplified (and very ugly) katakana
alphabet in the upper half of the code page.  So you
could spell out the sounds of Japanese words using
'half width katakana'. 

The basic 'character set' is Japan Industrial Standard
0208 ("JIS"). This was defined in 1978, the first
official Asian character set to be defined by a
government.   This can be thought of as a printed
chart
showing the characters - it does not define their
storage on a computer.   It defined a logical 94 x 94
grid, and each character has an index in this grid.

The "JIS" encoding was a way of mixing ASCII and
Japanese in text files and emails.  Each Japanese
character had a double-byte value. It had 'escape
sequences' to say 'You are now entering ASCII
territory' or the opposite.   In 1978 Microsoft
quickly came up with Shift-JIS, a smarter encoding. 
This basically said "Look at the next byte.  If below
127, it is ASCII; if between A and B, it is a
half-width
katakana; if between B and C, it is the first half of
a double-byte character and the next one is the second
half".  Extended Unix Code (EUC) does similar tricks. 
Both have the property that there are no control
characters, and ASCII is still ASCII.  There are a few
other encodings too.
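
(For the curious, the "look at the next byte" rule is easy to sketch in
Python; the byte ranges below are quoted from memory and are meant to be
illustrative rather than authoritative:

def scan_shift_jis(data):
    # Walk a Shift-JIS byte string and classify each position.
    i = 0
    while i < len(data):
        b = ord(data[i:i+1])
        if b < 0x80:
            kind, width = "ascii", 1
        elif 0xA1 <= b <= 0xDF:
            kind, width = "half-width katakana", 1
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            kind, width = "first byte of a double-byte character", 2
        else:
            kind, width = "not valid Shift-JIS here", 1
        yield kind, data[i:i+width]
        i += width

The same walk is what lets you detect the corruption mentioned later on --
half a kanji at the end of a truncated field is a lead byte with no second
byte to follow.)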

Unfortunately for me and HugeCo, IBM had their own
standard before the Japanese government did, and it
differs; it is most commonly called DBCS (Double-Byte
Character Set).  This involves shift-in and shift-out
sequences (0x16 and 0x17, cannot remember which way
round), so you can mix single and double bytes in a
field.  And we used AS400s for our core processing.

So, back to the problem.  We had a FoxPro system using
ShiftJIS on the desks in Japan which we wanted to
replace in stages, and an AS400 database to replace it
with.  The first stage was to hook them up so names
and addresses could be uploaded to the AS400, and data
files consisting of daily report input could be
downloaded to the PCs.  The AS400 supposedly had a
library which did the conversions, but no one at IBM
knew how it worked.  The people who did all the
evaluations had basically proved that 'Hello World' in
Japanese could be stored on an AS400, but never looked
at the conversion issues until mid-project. Not only
did we need a conversion filter, we had the problem
that the character sets were of different sizes.  So
it was possible - indeed, likely - that some of our
ten thousand customers' names and addresses would
contain characters only on one system or the other,
and fail to
survive a round trip.  (This is the absolute key issue
for me - will a given set of data survive a round trip
through various encoding conversions?)

We figured out how to get the AS400 to do the
conversions during a file transfer in one direction,
and I wrote some Python scripts to make up files with
each official character in JIS on a line; these went
up with conversion, came back binary, and I was able
to build a mapping table and 'reverse engineer' the
IBM encoding.  It was straightforward in theory, "fun"
in practice.  I then wrote a python library which knew
about the AS400 and Shift-JIS encodings, and could
translate a string between them.  It could also detect
corruption and warn us when it occurred.  (This is
another key issue - you will often get badly encoded
data, half a kanji or a couple of random bytes, and
need to be clear on your strategy for handling it in
any library).  It was slow, but it got us our gateway
in both directions, and it warned us of bad input. 360
characters in the DBCS encoding actually appear twice,
so perfect round trips are impossible, but practically
you can survive with some validation of input at both
ends.  The final story was that our names and
addresses were mostly safe, but a few obscure symbols
weren't.

A big issue was that field lengths varied.  An address
field 40 characters long on a PC might grow to 42 or
44 on an AS400 because of the shift characters, so the
software would truncate the address during import, and
cut a kanji in half.  This resulted in a string that
was illegal DBCS, and errors in the database.  To
guard against this, you need really picky input
validation.  You not only ask 'is this string valid
Shift-JIS', you check it will fit on the other system
too.

The next stage was to bring in our Sybase databases. 
Sybase make a Unicode database, which works like the
usual one except that all your SQL code suddenly
becomes case sensitive - more (unrelated) fun when
you have 2000 tables.  Internally it stores data in
UTF8, which is a 'rearrangement' of Unicode which is
much safer to store in conventional systems.
Basically, a UTF8 character is between one and three
bytes, there are no nulls or control characters, and
the ASCII characters are still the same ASCII
characters.  UTF8<->Unicode involves some bit
twiddling but is one-to-one and entirely algorithmic.
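
As a concrete illustration (this is how it looks in a present-day
Unicode-aware Python session; the byte values follow directly from the
UTF8 bit layout):

>>> "\u65e5".encode("utf-8")     # a single kanji, code point U+65E5
b'\xe6\x97\xa5'                  # three bytes, all with the high bit set
>>> "A".encode("utf-8")
b'A'                             # ASCII stays one unchanged byte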

We had a product to 'mirror' data between AS400 and
Sybase, which promptly broke when we fed it Japanese. 
The company bought a library called Unilib to do
conversions, and started rewriting the data mirror
software.  This library (like many) uses Unicode as a
central point in all conversions, and offers most of
the world's encodings.  We wanted to test it, and used
the Python routines to put together a regression
test.  As expected, it was mostly right but had some
differences, which we were at least able to document. 

We also needed to rig up a daily feed from the legacy
FoxPro database into Sybase while it was being
replaced (about six months).  We took the same
library, built a DLL wrapper around it, and I
interfaced to this with DynWin, so we were able to do
the low-level string conversion in compiled code and
the high-level 
control in Python. A FoxPro batch job wrote out
delimited text in shift-JIS; Python read this in, ran
it through the DLL to convert it to UTF8, wrote that
out as UTF8 delimited files, ftp'ed them to an 
'in' directory on the Unix box ready for daily import. 
At this point we had a lot of fun with field widths -
Shift-JIS is much more compact than UTF8 when you have
a lot of kanji (e.g. address fields).

Another issue was half-width katakana.  These were the
earliest attempt to get some form of Japanese out of a
computer, and are single-byte characters above 128 in
Shift-JIS - but are not part of the JIS0208 standard. 

They look ugly and are discouraged; but when you are
entering a long address in a field of a database, and
it won't quite fit, the temptation is to go from
two-bytes-per-character to one (just hit F7 in
windows) to save space.  Unilib rejected these (as
would Java), but has optional modes to preserve them
or 'expand them out' to their full-width equivalents.


The final technical step was our reports package. 
This is a 4GL using a really horrible 1980s Basic-like
language which reads in fixed-width data files and
writes out Postscript; you write programs saying 'go
to x,y' and 'print customer_name', and can build up
anything you want out of that.  It's a monster to
develop in, but when done it really works - 
million page jobs no problem.  We had bought into this
on the promise that it supported Japanese; actually, I
think they had got the equivalent of 'Hello World' out
of it, since we had a lot of problems later.  

The first stage was that the AS400 would send down
fixed width data files in EBCDIC and DBCS.  We ran
these through a C++ conversion utility, again using
Unilib.  We had to filter out and warn about corrupt 
fields, which the conversion utility would reject. 
Surviving records then went into the reports program.

It then turned out that the reports program only
supported some of the Japanese alphabets. 
Specifically, it had a built in font switching system 
whereby when it encountered ASCII text, it would flip
to the most recent single byte text, and when it found
a byte above 127, it would flip to a double byte font.
 This is because many Chinese fonts do (or did) 
not include English characters, or included really
ugly ones.  This was wrong for Japanese, and made the
half-width katakana unprintable.  I found out that I
could control fonts if I printed one character at a
time with a special escape sequence, so wrote my own
bit-scanning code (tough in a language without ord()
or bitwise operations) to examine a string, classify
every byte, and control the fonts the way I wanted. 
So a special subroutine is used for every name or
address field.  This is apparently not unusual in GUI
development (especially web browsers) - you rarely
find a complete Unicode font, so you have to switch
fonts on the fly as you print a string.

After all of this, we had a working system and knew
quite a bit about encodings.  Then the curve ball
arrived:  User Defined Characters!

It is no more possible to say that there are exactly 6879
characters in Japanese than it is to count the
number of languages on the Indian sub-continent or the
types of cheese in France.  There are historical
variations and they evolve.  Some people's names got
missed out, and others like to write a kanji in an
unusual way.   Others arrived from China where they
have more complex variants of the same characters.  
Despite the Japanese government's best attempts, these
people have dug their heels in and want to keep their
names the way they like them.  My first reaction was
'Just Say No' - I basically said that if one of these
customers (14 out of a database of 8000) could show me
a tax form or phone bill with the correct UDC on it,
we would implement it but not otherwise (the usual
workaround is to spell their name phonetically in
katakana).  But our marketing people put their foot
down.  

A key factor is that Microsoft has 'extended the
standard' a few times.  First of all, Microsoft and
IBM include an extra 360 characters in their code page
which are not in the JIS0208 standard.   This is well
understood and most encoding toolkits know that 'Code
Page 932' is Shift-JIS plus a few extra characters. 
Secondly, Shift-JIS has a User-Defined region of a
couple of thousand characters.  They have lately been
taking Chinese variants of Japanese characters (which
are readable but a bit old-fashioned - I can imagine
pipe-smoking professors using these forms as an
affectation) and adding them into their standard
Windows fonts; so users are getting used to these
being available.  These are not in a standard. 
Thirdly, they include something called the 'Gaiji
Editor' in Japanese Win95, which lets you add new
characters to the fonts on your PC within the
user-defined region.  The first step was to review all
the PCs in the Tokyo office, and get one centralized
extension font file on a server.  This was also fun as
people had assigned different code points to
characters on different machines, so what looked
correct on your word processor was a black square on
mine.   Effectively, each company has its own custom
encoding a bit bigger than the standard.

Clearly, none of these extensions would convert
automatically to the other platforms.

Once we actually had an agreed list of code points, we
scanned the database by eye and made sure that the
relevant people were using them.  We decided that
space for 128 User-Defined Characters would  be
allowed.  We thought we would need a wrapper around
Unilib to intercept these values and do a special
conversion; but to our amazement it worked!  Somebody
had already figured out a mapping for at least 1000
characters for all the Japanese encodings, and they did
the round trips from Shift-JIS to Unicode to DBCS and
back.  So the conversion problem needed less code than
we thought.  This mapping is not defined in a standard
AFAIK (certainly not for DBCS anyway).  

We did, however, need some really impressive
validation.  When you input a name or address on any
of the platforms, the system should say 
(a) is it valid for my encoding?
(b) will it fit in the available field space in the
other platforms?
(c) if it contains user-defined characters, are they
the ones we know about, or is this a new guy who will
require updates to our fonts etc.?

Finally, we got back to the display problems.  Our
chosen range had a particular first byte. We built a
miniature font with the characters we needed starting
in the lower half of the code page.  I then
generalized my name-printing routine to say 'if the
first character is XX, throw it away, and print the
subsequent character in our custom font'.  This worked
beautifully - not only could we print everything, we
were using type 1 embedded fonts for the user defined
characters, so we could distill it and also capture it
for our internal document imaging systems.

So, that is roughly what is involved in building a
Japanese client reporting system that spans several
platforms.

I then moved over to the web team to work on our
online trading system for Japan, where I am now -
people will be able to open accounts and invest on the
web.  The first stage was to prove it all worked. 
With HTML, Java and the Web, I had high hopes, which
have mostly been fulfilled - we set an option in the
database connection to say 'this is a UTF8 database',
and Java converts it to Unicode when reading the
results, and we set another option saying 'the output
stream should be Shift-JIS' when we spew out the HTML.
 There is one limitation:  Java sticks to the JIS0208
standard, so the 360 extra IBM/Microsoft Kanji and our
user defined characters won't work on the web.  You
cannot control the fonts on someone else's web
browser; management accepted this because we gave them
no alternative.  Certain customers will need to be
warned, or asked to suggest a standard version of a
character if they want to see their name on the web. 
I really hope the web actually brings character usage
in line with the standard in due course, as it will
save a fortune.

Our system is multi-language - when a customer logs
in, we want to say 'You are a Japanese customer of our
Tokyo Operation, so you see page X in language Y'. 
The language strings are all kept in UTF8 in XML
files, so the same file can hold many languages.  This
and the database are the real-world reasons why you
want to store stuff in UTF8.  There are very few tools
to let you view UTF8, but luckily there is a free Word
Processor that lets you type Japanese and save it in
any encoding; so we can cut and paste between
Shift-JIS and UTF8 as needed.

And that's it.  No climactic endings and a lot of real
world mess, just like life in IT.  But hopefully this
gives you a feel for some of the practical stuff
internationalisation projects have to deal with.  See
my other mail for actual suggestions.

- Andy Robinson







=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From captainrobbo at yahoo.com  Tue Nov  9 14:58:39 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 9 Nov 1999 05:58:39 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>

Here are the features I'd like to see in a Python
Internationalisation Toolkit.  I'm very open to
persuasion about APIs and how to do it, but this is
roughly the functionality I would have wanted for the
last year (see separate post "Internationalization
Case Study"):

Built-in types:
---------------
"Unicode String" and "Normal String".  The normal
string can hold all 256 possible byte values and is
analogous to java's Byte Array - in other words an
ordinary Python string.  

Unicode strings iterate (and are manipulated) per
character, not per byte. You knew that already.  To
manipulate anything in a funny encoding, you convert
it to Unicode, manipulate it there, then convert it
back.

Easy Conversions
----------------------
This is modelled on Java which I think has it right. 
When you construct a Unicode string, you may supply an
optional encoding argument.  I'm not bothered if
conversion happens in a global function, a constructor
method or whatever.

MyUniString = ToUnicode('hello')   # assumes ASCII
MyUniString = ToUnicode('pretend this is Japanese',
'ShiftJIS')  #specified

The converse applies when converting back.

The encoding designators should agree with Java.  If
data is encountered which is not valid for the
encoding, there are several strategies, and it would
be nice if they could be specified explicitly:
1. replace offending characters with a question mark
2. try to recover intelligently (possible in some
cases)
3. raise an exception

A 'Unicode' designator is needed which performs a
dummy conversion.
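
A minimal sketch of the sort of call meant here (the function name, the
defaults and the reliance on a string .decode() method are all
illustrative, not part of the proposal):

def ToUnicode(data, encoding="ascii", errors="strict"):
    # 'errors' selects one of the strategies above:
    #   "strict"  - raise an exception on bad data
    #   "replace" - substitute a marker character and carry on
    return data.decode(encoding, errors)

ToUnicode(b"hello")                             # assumes ASCII
ToUnicode(b"\x82\xb1\x82\xf1", "shift_jis")     # encoding specified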

File Opening:  
---------------
It should be possible to work with files as we do now
- just streams of binary data.  It should also be
possible to read, say, a file of locally encoded
addresses into a Unicode string. e.g. open(myfile,
'r', 'ShiftJIS').  

It should also be possible to open a raw Unicode file
and read the bytes into ordinary Python strings, or
Unicode strings.  In this case one needs to watch out
for the byte-order marks at the beginning of the file.

Not sure of a good API to do this.  We could have
OrdinaryFile objects and UnicodeFile objects, or
proliferate the arguments to 'open'.
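
Purely as an illustration of the idea (class and file names are made up),
a wrapper where the encoding travels with the file object:

class EncodedFile:
    # Read a file of locally encoded text and hand back Unicode strings.
    def __init__(self, path, encoding):
        self._f = open(path, "rb")      # underlying stream of bytes, as today
        self.encoding = encoding
    def read(self):
        return self._f.read().decode(self.encoding)
    def close(self):
        self._f.close()

addresses = EncodedFile("addresses.txt", "shift_jis").read()   # hypothetical file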

Doing the Conversions
----------------------------
All conversions should go through Unicode as the
central point.  

Here is where we can start to define the territory.

Some conversions are algorithmic, some are lookups,
many are a mixture with some simple state transitions
(e.g. shift characters to denote switches from
double-byte to single-byte).  I'd like to see an
'encoding engine' modelled on something like
mxTextTools - a state machine with a few simple
actions, effectively a mini-language for doing simple
operations.  Then a new encoding can be added in a
data-driven way, and still go at C-like speeds. 
Making this open and extensible (and preferably not
needing to code C to do it) is the only way I can see
to get a really good solid encodings library.  Not all
encodings need go in the standard distribution, but
all should be downloadable from www.python.org.

A generalized two-byte-to-two-byte mapping is 128kb. 
But there are compact forms which can reduce these to
a few kb, and also make the data intelligible. It is
obviously desirable to store stuff compactly if we can
unpack it fast.
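
As a toy illustration of "data-driven" (not a proposal for the real table
format): the whole encoding lives in a mapping, one generic driver does
the work, and company X's ten extra characters are just ten more table
entries:

def make_decoder(table):
    # table maps a byte value to a Unicode character
    def decode(data, errors="strict"):
        out = []
        for b in bytearray(data):
            ch = table.get(b)
            if ch is None:
                if errors == "replace":
                    ch = "\ufffd"
                else:
                    raise ValueError("byte 0x%02x not in this encoding" % b)
            out.append(ch)
        return "".join(out)
    return decode

table = {b: chr(b) for b in range(256)}    # start from Latin-1...
table[0x80] = "\u20ac"                     # ...and a vendor adds the euro sign
decode = make_decoder(table)
decode(b"\x80 caf\xe9")                    # -> euro sign, space, 'cafe' with accent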


Typed Strings
----------------
When you are writing data conversion tools to sit in
the middle of a bunch of databases, you could save a
lot of grief with a string that knows its encoding. 
What follows could be done as a Python wrapper around
ordinary strings rather than as a new type,
and thus need not be part of the language.  

This is analogous to Martin Fowler's Quantity pattern
in Analysis Patterns, where a number knows its units
and you cannot add dollars and pounds accidentally.  

These would do implicit conversions; and they would
stop you assigning or confusing differently encoded
strings.  They would also validate when constructed. 
'Typecasting' would be allowed but would require
explicit code.  So maybe something like...

>>> ts1 = TypedString('hello', 'cp932ms')  # specify encoding, it remembers it
>>> ts2 = TypedString('goodbye', 'cp5035')
>>> ts1 + ts2                    # or any of a host of other encoding options
EncodingError
>>> ts3 = TypedString(ts1, 'cp5035')   # converts it implicitly, going via Unicode
>>> ts4 = ts1.cast('ShiftJIS')         # the developer knows that in this case
                                       # the string is compatible
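
A bare-bones sketch of how such a wrapper could hang together
(illustrative only: real codec names stand in for the designators above,
and the error class is just a placeholder):

class EncodingError(Exception):
    pass

class TypedString:
    # A byte string that remembers, and validates against, its encoding.
    def __init__(self, value, encoding):
        if isinstance(value, TypedString):
            value = value.as_unicode()          # implicit conversion goes via Unicode
        if isinstance(value, str):
            self.data = value.encode(encoding)  # validates on construction
        else:
            value.decode(encoding)              # raw bytes: validate them as given
            self.data = value
        self.encoding = encoding

    def as_unicode(self):
        return self.data.decode(self.encoding)

    def __add__(self, other):
        if not isinstance(other, TypedString) or other.encoding != self.encoding:
            raise EncodingError("cannot mix encodings without an explicit conversion")
        return TypedString(self.data + other.data, self.encoding)

    def cast(self, encoding):
        # explicit 'typecast': same bytes, reinterpreted; the developer takes the risk
        return TypedString(self.data, encoding)

ts1 = TypedString("hello", "shift_jis")
ts2 = TypedString(ts1, "utf-8")                 # converted implicitly, via Unicode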


Going Deeper
----------------
The project I describe involved many more issues than
just a straight conversion.  I envisage an encodings
package or module which power users could get at
directly.  

We have to be able to answer the questions:

'is string X a valid instance of encoding Y?'
'is string X nearly a valid instance of encoding Y,
maybe with a little corruption, or is it something
totally different?' - this one might be a task left to
a programmer, but the toolkit should help where it
can.

'can string X be converted from encoding Y to encoding
Z without loss of data?  If not, exactly what will get
trashed' ?  This is a really useful utility.
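
That last question can be sketched in a few lines (illustrative only: it
round-trips through Unicode with a "replace" fallback and reports what
changed):

def round_trip_report(data, enc_from, enc_to):
    # Which characters of 'data' (bytes in enc_from) survive conversion to enc_to?
    text = data.decode(enc_from)
    back = text.encode(enc_to, "replace").decode(enc_to)
    trashed = [ch for ch, got in zip(text, back) if ch != got]
    return (len(trashed) == 0), trashed

ok, lost = round_trip_report(b"caf\xe9", "latin-1", "ascii")
# ok is False; lost holds the accented character that cannot make the trip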

More generally, I want tools to reason about character
sets and encodings.  I have 'Character Set' and
'Character Mapping' classes - very app-specific and
proprietary - which let me express and answer
questions about whether one character set is a
superset of another and reason about round trips.  I'd
like to do these properly for the toolkit.  They would
need some C support for speed, but I think they could
still be data driven.  So we could have an Encoding
object which could be pickled, and we could keep a
directory full of them as our database.  There might
actually be two encoding objects - one for
single-byte, one for multi-byte, with the same API.

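A sketch of the kind of reasoning meant here; the class name echoes
the one above, but the methods and the helper function are invented
for the example:

class CharacterSet:
    def __init__(self, codepoints):
        self.codepoints = frozenset(codepoints)

    def is_superset_of(self, other):
        return self.codepoints >= other.codepoints

def lossless(text, target):
    # 'can this text be converted without loss?  If not, what gets trashed?'
    lost = [ch for ch in text if ord(ch) not in target.codepoints]
    return (not lost, lost)

ascii_set = CharacterSet(range(0x80))
latin1_set = CharacterSet(range(0x100))
assert latin1_set.is_superset_of(ascii_set)
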
There are so many subtle differences between encodings
(even within the Shift-JIS family) - company X has ten
extra characters, and that is technically a new
encoding.  So it would be really useful to reason
about these and say 'find me all JIS-compatible
encodings', or 'report on the differences between
Shift-JIS and cp932ms'.

GUI Issues
-------------
The new Pythonwin breaks somewhat on Japanese - editor
windows are fine but console output is shown as
single-byte garbage.  I will try to evaluate IDLE on a
Japanese test box this week.  I think these two need
to work for double-byte languages for our credibility.

Verifiability and printing
-----------------------------
We will need to prove it all works.  This means
looking at text on a screen or on paper.  A really
wicked demo utility would be a GUI which could open
files and convert encodings in an editor window or
spreadsheet window, and specify conversions on
copy/paste.  If it could save a page as HTML (just an
encoding tag and data between <pre> tags), then we
could use Netscape/IE for verification.  Better still,
a web server demo could convert on python.org and tag
the pages appropriately - browsers support most common
encodings.

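The 'save a page as HTML for eyeballing in a browser' part could be as
small as the sketch below (the helper name and markup are illustrative
only, real code would also escape '<' and '&' in the data, and today's
open() with an encoding argument is used just for brevity):

def write_verification_page(filename, text, encoding):
    page = ('<html><head>\n'
            '<meta http-equiv="Content-Type" '
            'content="text/html; charset=%s">\n'
            '</head><body><pre>\n%s\n</pre></body></html>\n'
            % (encoding, text))
    f = open(filename, 'w', encoding=encoding)
    f.write(page)
    f.close()
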
All the encoding stuff is ultimately a bit meaningless
without a way to display a character.  I am hoping
that PDF and PDFgen may add a lot of value here. 
Adobe (and Ken Lunde) have spent years coming up with
a general architecture for this stuff in PDF. 
Basically, the multi-byte fonts they use are encoding
independent, and come with a whole bunch of mapping
tables.  So I can ask for the same Japanese font in
any of about ten encodings - font name is a
combination of face name and encoding.  The font
itself does the remapping.  They make available
downloadable font packs for Acrobat 4.0 for most
languages now; these are good places to raid for
building encoding databases.  

It also means that I can write a Python script to
crank out beautiful-looking code page charts for all
of our encodings from the database, and input and
output to regression tests.  I've done it for
Shift-JIS at Fidelity, and would have to rewrite it
once I am out of here.  But I think that some good
graphic design here would lead to a product that blows
people away - an encodings library that can print out
its own contents for viewing and thus help demonstrate
its own correctness (or make errors stick out like a
sore thumb).

Am I mad?  Have I put you off forever?  What I outline
above would be a serious project needing months of
work; I'd be really happy to take a role, if we could
find sponsors for the project.  But I believe we could
define the standard for years to come.  Furthermore,
it would go a long way to making Python the corporate
choice for data cleaning and transformation -
territory I think we should own.

Regards,

Andy Robinson
Robinson Analytics Ltd.









=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From guido at CNRI.Reston.VA.US  Tue Nov  9 17:46:41 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 09 Nov 1999 11:46:41 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: Your message of "Tue, 09 Nov 1999 05:58:39 PST."
             <19991109135839.25864.rocketmail@web607.mail.yahoo.com> 
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> 
Message-ID: <199911091646.LAA21467@eric.cnri.reston.va.us>

Andy,

Thanks a bundle for your case study and your toolkit proposal.  It's
interesting that you haven't touched upon internationalization of user
interfaces (dialog text, menus etc.) -- that's a whole nother can of
worms.

Marc-Andre Lemburg has a proposal for work that I'm asking him to do
(under pressure from HP who want Python i18n badly and are willing to
pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt

I think his proposal will go a long way towards your toolkit.  I hope
to hear soon from anybody who disagrees with Marc-Andre's proposal,
because without opposition this is going to be Python 1.6's offering
for i18n...  (Together with a new Unicode regex engine by /F.)

One specific question: in your discussion of typed strings, I'm not
sure why you couldn't convert everything to Unicode and be done with
it.  I have a feeling that the answer is somewhere in your case study
-- maybe you can elaborate?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From akuchlin at mems-exchange.org  Tue Nov  9 18:21:03 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 9 Nov 1999 12:21:03 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <14376.22527.323888.677816@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>I think his proposal will go a long way towards your toolkit.  I hope
>to hear soon from anybody who disagrees with Marc-Andre's proposal,
>because without opposition this is going to be Python 1.6's offering
>for i18n...  

The proposal seems reasonable to me.

>(Together with a new Unicode regex engine by /F.)

This is good news!  Would it be a from-scratch regex implementation,
or would it be an adaptation of an existing engine?  Would it involve
modifications to the existing re module, or a completely new unicodere
module?  (If, unlike re.py, it has POSIX longest-match semantics, that
would pretty much settle the question.)

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
All around me darkness gathers, fading is the sun that shone, we must speak of
other matters, you can be me when I'm gone...
    -- The train's clattering, in SANDMAN #67: "The Kindly Ones:11"




From guido at CNRI.Reston.VA.US  Tue Nov  9 18:26:38 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 09 Nov 1999 12:26:38 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: Your message of "Tue, 09 Nov 1999 12:21:03 EST."
             <14376.22527.323888.677816@amarok.cnri.reston.va.us> 
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us>  
            <14376.22527.323888.677816@amarok.cnri.reston.va.us> 
Message-ID: <199911091726.MAA21754@eric.cnri.reston.va.us>

[AMK]
> The proposal seems reasonable to me.

Thanks.  I really hope that this time we can move forward united...

> >(Together with a new Unicode regex engine by /F.)
> 
> This is good news!  Would it be a from-scratch regex implementation,
> or would it be an adaptation of an existing engine?  Would it involve
> modifications to the existing re module, or a completely new unicodere
> module?  (If, unlike re.py, it has POSIX longest-match semantics, that
> would pretty much settle the question.)

It's from scratch, and I believe it's got Perl style, not POSIX style
semantics -- per Tim Peters' recommendations.  Do we need to open the
discussion again?

It involves a redone re module (supporting Unicode as well as 8-bit),
but its API could be unchanged.  /F does the parsing and compilation
in Python, only the matching engine is in C -- not sure how that
impacts performance, but I imagine with aggressive caching it would be
okay.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From akuchlin at mems-exchange.org  Tue Nov  9 18:40:07 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 9 Nov 1999 12:40:07 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091726.MAA21754@eric.cnri.reston.va.us>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<14376.22527.323888.677816@amarok.cnri.reston.va.us>
	<199911091726.MAA21754@eric.cnri.reston.va.us>
Message-ID: <14376.23671.250752.637144@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>It's from scratch, and I believe it's got Perl style, not POSIX style
>semantics -- per Tim Peters' recommendations.  Do we need to open the
>discussion again?

No, no; I'm actually happier with Perl-style, because it's far better
documented and familiar to people. Worse *is* better, after all.

My concern is simply that I've started translating re.py into C, and
wonder how this affects the translation.  This isn't a pressing issue,
because the C version isn't finished yet.

>It involves a redone re module (supporting Unicode as well as 8-bit),
>but its API could be unchanged.  /F does the parsing and compilation
>in Python, only the matching engine is in C -- not sure how that
>impacts performance, but I imagine with aggressive caching it would be
>okay.

Can I get my paws on a copy of the modified re.py to see what
ramifications it has, or is this all still an unreleased
work-in-progress?

Doing the compilation in Python is a good idea, and will make it
possible to implement alternative syntaxes.  I would have liked to
make it possible to generate PCRE bytecodes from Python, but what
stopped me is the chance of bogus bytecode causing the engine to dump
core, loop forever, or some other nastiness.  (This is particularly
important for code that uses rexec.py, because you'd expect regexes to
be safe.)  Fixing the engine to be stable when faced with bad
bytecodes appears to require many additional checks that would slow
down the common case of correct code, which is unappealing.


-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Anybody else on the list got an opinion? Should I change the language or not?
    -- Guido van Rossum, 28 Dec 91




From ping at lfw.org  Tue Nov  9 19:08:05 1999
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 9 Nov 1999 10:08:05 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <14376.23671.250752.637144@amarok.cnri.reston.va.us>
Message-ID: 

On Tue, 9 Nov 1999, Andrew M. Kuchling wrote:
> Guido van Rossum writes:
> >It's from scratch, and I believe it's got Perl style, not POSIX style
> >semantics -- per Tim Peters' recommendations.  Do we need to open the
> >discussion again?
> 
> No, no; I'm actually happier with Perl-style, because it's far better
> documented and familiar to people. Worse *is* better, after all.

I would concur with the preference for Perl-style semantics.
Aside from the issue of consistency with other scripting
languages, i think it's easier to predict the behaviour of
these semantics.  You can run the algorithm in your head,
and try the backtracking yourself.  It's good for the algorithm
to be predictable and well understood.

> Doing the compilation in Python is a good idea, and will make it
> possible to implement alternative syntaxes.

Also agree.  I still have some vague wishes for a simpler,
more readable (more Pythonian?) way to express patterns --
perhaps not as powerful as full regular expressions, but
useful for many simpler cases (an 80-20 solution).


-- ?!ng




From bwarsaw at cnri.reston.va.us  Tue Nov  9 19:15:04 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Tue, 9 Nov 1999 13:15:04 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<14376.22527.323888.677816@amarok.cnri.reston.va.us>
	<199911091726.MAA21754@eric.cnri.reston.va.us>
	<14376.23671.250752.637144@amarok.cnri.reston.va.us>
Message-ID: <14376.25768.368164.88151@anthem.cnri.reston.va.us>

>>>>> "AMK" == Andrew M Kuchling  writes:

    AMK> No, no; I'm actually happier with Perl-style, because it's
    AMK> far better documented and familiar to people. Worse *is*
    AMK> better, after all.

Plus, you can't change re's semantics and I think it makes sense if
the Unicode engine is as close semantically as possible to the
existing engine.

We need to be careful not to worsen performance for 8bit strings.  I
think we're already on the edge of acceptability w.r.t. P*** and
hopefully we can /improve/ performance here.

MAL's proposal seems quite reasonable.  It would be excellent to see
these things done for Python 1.6.  There's still some discussion on
supporting internationalization of applications, e.g. using gettext
but I think those are smaller in scope.

-Barry



From akuchlin at mems-exchange.org  Tue Nov  9 20:36:28 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 9 Nov 1999 14:36:28 -0500 (EST)
Subject: [Python-Dev] I18N Toolkit
In-Reply-To: <14376.25768.368164.88151@anthem.cnri.reston.va.us>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<14376.22527.323888.677816@amarok.cnri.reston.va.us>
	<199911091726.MAA21754@eric.cnri.reston.va.us>
	<14376.23671.250752.637144@amarok.cnri.reston.va.us>
	<14376.25768.368164.88151@anthem.cnri.reston.va.us>
Message-ID: <14376.30652.201552.116828@amarok.cnri.reston.va.us>

Barry A. Warsaw writes:
(in relation to support for Unicode regexes)
>We need to be careful not to worsen performance for 8bit strings.  I
>think we're already on the edge of acceptability w.r.t. P*** and
>hopefully we can /improve/ performance here.

I don't think that will be a problem, given that the Unicode engine
would be a separate C implementation.  A bit of 'if type(strg) ==
UnicodeType' in re.py isn't going to cost very much speed.

(Speeding up PCRE -- that's another question.  I'm often tempted to
rewrite pcre_compile to generate an easier-to-analyse parse tree,
instead of its current complicated-but-memory-parsimonious compiler,
but I'm very reluctant to introduce a fork like that.)

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
The world does so well without me, that I am moved to wish that I could do
equally well without the world.
    -- Robertson Davies, _The Diary of Samuel Marchbanks_




From mhammond at skippinet.com.au  Tue Nov  9 23:27:45 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 10 Nov 1999 09:27:45 +1100
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <001c01bf2b01$a58d5d50$0501a8c0@bobcat>

> I think his proposal will go a long way towards your toolkit.  I
hope
> to hear soon from anybody who disagrees with Marc-Andre's proposal,

No disagreement as such, but a small hole:


From tim_one at email.msn.com  Wed Nov 10 06:57:14 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 10 Nov 1999 00:57:14 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091726.MAA21754@eric.cnri.reston.va.us>
Message-ID: <000001bf2b40$70183840$d82d153f@tim>

[Guido, on "a new Unicode regex engine by /F"]

> It's from scratch, and I believe it's got Perl style, not POSIX style
> semantics -- per Tim Peters' recommendations.  Do we need to open the
> discussion again?

No, but I get to whine just a little :  I didn't recommend either
approach.  I asked many futile questions about HP's requirements, and
sketched implications either way.  If HP *has* a requirement wrt
POSIX-vs-Perl, it would be good to find that out before it's too late.

I personally prefer POSIX semantics -- but, as Andrew so eloquently said,
worse is better here; all else being equal it's best to follow JPython's
Perl-compatible re lead.

last-time-i-ever-say-what-i-really-think-ly y'rs  - tim





From tim_one at email.msn.com  Wed Nov 10 07:25:07 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 10 Nov 1999 01:25:07 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <000201bf2b44$55b8ad00$d82d153f@tim>

> Marc-Andre Lemburg has a proposal for work that I'm asking him to do
> (under pressure from HP who want Python i18n badly and are willing to
> pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt

I can't make time for a close review now.  Just one thing that hit my eye
early:

    Python should provide a built-in constructor for Unicode strings
    which is available through __builtins__:

    u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])

    u = u'<utf-8 encoded Python string>'

Two points on the Unicode literals (u'abc'):

UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by
hand -- it breaks apart and rearranges bytes at the bit level, and
everything other than 7-bit ASCII requires solid strings of "high-bit"
characters.  This is painful for people to enter manually on both counts --
and no common reference gives the UTF-8 encoding of glyphs directly.  So, as
discussed earlier, we should follow Java's lead and also introduce a \u
escape sequence:

    octet:           hexdigit hexdigit
    unicodecode:     octet octet
    unicode_escape:  "\\u" unicodecode

Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
Unicode character at the unicodecode code position.  For consistency, then,
it should probably expand the same way inside "regular strings" too.  Unlike
Java does, I'd rather not give it a meaning outside string literals.

The other point is a nit:  The vast bulk of UTF-8 encodings encode
characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
those must either be explicitly outlawed, or explicitly defined.  I vote for
outlawed, in the sense of detected error that raises an exception.  That
leaves our future options open.

BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
inverse in the Unicode world?  Both seem essential.

international-in-spite-of-himself-ly y'rs  - tim





From fredrik at pythonware.com  Wed Nov 10 09:08:06 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 09:08:06 +0100
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> http://starship.skyport.net/~lemburg/unicode-proposal.txt

Marc-Andre writes:

    The internal format for Unicode objects should either use a Python
    specific fixed cross-platform format  (e.g. 2-byte
    little endian byte order) or a compiler provided wchar_t format (if
    available). Using the wchar_t format will ease embedding of Python in
    other Unicode aware applications, but will also make internal format
    dumps platform dependent. 

having been there and done that, I strongly suggest
a third option: a 16-bit unsigned integer, in platform
specific byte order (PY_UNICODE_T).  along all other
roads lie code bloat and speed penalties...

(besides, this is exactly how it's already done in
unicode.c and what 'sre' prefers...)






From captainrobbo at yahoo.com  Wed Nov 10 09:09:26 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 00:09:26 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>

In general, I like this proposal a lot, but I think it
only covers half the story.  How we actually build the
encoder/decoder for each encoding is a very big issue.
 Thoughts on this below.

First, a little nit
>  u = u'<utf-8 encoded Python string>'
I don't like using funny prime characters - why not an
explicit function like "utf8()"


On to the important stuff:

>  unicodec.register(<encname>,<encoder>,<decoder>
>  [,<stream_encoder>, <stream_decoder>])

> This registers the codecs under the given encoding
> name in the module global dictionary
> unicodec.codecs. Stream codecs are optional:
> the unicodec module will provide appropriate
> wrappers around <encoder> and
> <decoder> if not given.

I would MUCH prefer a single 'Encoding' class or type
to wrap up these things, rather than up to four
disconnected objects/functions.  Essentially it would
be an interface standard and would offer methods to do
the four things above.  

There are several reasons for this.  
(1) there are quite a lot of things you might want to
do with an encoding object, and we could extend the
interface in future easily.  As a minimum, give it the
four methods implied by the above, two of which can be
defaults.  But I'd like an encoding to be able to tell
me the set of characters to which it applies; validate
a string; and maybe tell me if it is a subset or
superset of another.

(2) especially with double-byte encodings, they will
need to load up some kind of database on startup and
use this for both encoding and decoding - much better
to share it and encapsulate it inside one object

(3) for some languages, there are extra functions
wanted.  For Japanese, you need two or three functions
to expand half-width to full-width katakana, convert
double-byte english to single-byte and vice versa.  A
Japanese encoding object would be a handy place to put
this knowledge.

(4) In the real world you get many encodings which are
subtle variations of the same thing, plus or minus a
few characters.  One bit of code might be able to
share the work of several encodings, by setting a few
flags.  Certainly true of Japanese.

(5) encoding/decoding algorithms can be program or
data or (very often) a bit of both.  We have not yet
discussed where to keep all the mapping tables, but if
data is involved it should be hidden in an object.

(6) See my comments on a state machine for doing the
encodings.  If this is done well, we might have two
different standard objects which conform to the
Encoding interface (a really light one for single-byte
encodings, and a bigger one for multi-byte), and
everything else could be data driven.  

(7) Easy to grow - encodings can be prototyped and
proven in Python, ported to C if needed or when ready.
 

In summary, firm up the concept of an Encoding object
and give it room to grow - that's the key to
real-world usefulness.   If people feel the same way
I'll have a go at an interface for that, and try show
how it would have simplified specific problems I have
faced.

We also need to think about where encoding info will
live.  You cannot avoid mapping tables, although you
can hide them inside code modules or pickled objects
if you want.  Should there be a standard 
"..\Python\Enc" directory?

And we're going to need some kind of testing and
certification procedure when adding new encodings. 
This stuff has to be right.  

Guido asked about TypedString.  This can probably be
done on top of the built-in stuff - it is just a
convenience which would clarify intent, reduce lines
of code and prevent people shooting themselves in the
foot when juggling a lot of strings in different
(non-Unicode) encodings.  I can do a Python module to
implement that on top of whatever is built.


Regards,

Andy








=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From fredrik at pythonware.com  Wed Nov 10 09:14:21 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 09:14:21 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000201bf2b44$55b8ad00$d82d153f@tim>
Message-ID: <00f501bf2b53$9872e610$f29b12c2@secret.pythonware.com>

Tim Peters wrote:
> UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by
> hand -- it breaks apart and rearranges bytes at the bit level, and
> everything other than 7-bit ASCII requires solid strings of "high-bit"
> characters.

unless you're using a UTF-8 aware editor, of course ;-)

(some days, I think we need some way to tell the compiler
what encoding we're using for the source file...)

> This is painful for people to enter manually on both counts --
> and no common reference gives the UTF-8 encoding of glyphs
> directly.  So, as discussed earlier, we should follow Java's lead
> and also introduce a \u escape sequence:
> 
>     octet:           hexdigit hexdigit
>     unicodecode:     octet octet
>     unicode_escape:  "\\u" unicodecode
> 
> Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
> Unicode character at the unicodecode code position.  For consistency, then,
> it should probably expand the same way inside "regular strings" too.  Unlike
> Java does, I'd rather not give it a meaning outside string literals.

good idea.  and for some reason, patches for this are included
in the unicode distribution (see the attached str2utf.c).

> The other point is a nit:  The vast bulk of UTF-8 encodings encode
> characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
> those must either be explicitly outlawed, or explicitly defined.  I vote for
> outlawed, in the sense of detected error that raises an exception.  That
> leaves our future options open.

I vote for 'outlaw'.




/* A small code snippet that translates \uxxxx syntax to UTF-8 text.
   To be cut and pasted into Python/compile.c */

/* Written by Fredrik Lundh, January 1999. */

/* Documentation (for the language reference):

\uxxxx -- Unicode character with hexadecimal value xxxx.  The
character is stored using UTF-8 encoding, which means that this
sequence can result in up to three encoded characters.

Note that the 'u' must be followed by four hexadecimal digits.  If
fewer digits are given, the sequence is left in the resulting string
exactly as given.  If more digits are given, only the first four are
translated to Unicode, and the remaining digits are left in the
resulting string.

*/

#include <stdio.h>
#include <ctype.h>

#define Py_CHARMASK(ch) ch

void
convert(const char *s, char *p)
{
    while (*s) {
        if (*s != '\\') {
            *p++ = *s++;
            continue;
        }
        s++;
        switch (*s++) {

/* -------------------------------------------------------------------- */
/* copy this section to the appropriate place in compile.c... */

        case 'u':
            /* \uxxxx => UTF-8 encoded unicode character */
            if (isxdigit(Py_CHARMASK(s[0])) && isxdigit(Py_CHARMASK(s[1])) &&
                isxdigit(Py_CHARMASK(s[2])) && isxdigit(Py_CHARMASK(s[3]))) {
                /* fetch hexadecimal character value */
                unsigned int n, ch = 0;
                for (n = 0; n < 4; n++) {
                    int c = Py_CHARMASK(*s);
                    s++;
                    ch = (ch << 4) & ~0xF;
                    if (isdigit(c))
                        ch += c - '0';
                    else if (islower(c))
                        ch += 10 + c - 'a';
                    else
                        ch += 10 + c - 'A';
                }
                /* store as UTF-8 */
                if (ch < 0x80)
                    *p++ = (char) ch;
                else {
                    if (ch < 0x800) {
                        *p++ = 0xc0 | (ch >> 6);
                        *p++ = 0x80 | (ch & 0x3f);
                    } else {
                        *p++ = 0xe0 | (ch >> 12);
                        *p++ = 0x80 | ((ch >> 6) & 0x3f);
                        *p++ = 0x80 | (ch & 0x3f);
                    }
                }
                break;
            } else
                goto bogus;

/* -------------------------------------------------------------------- */

        default:

bogus:      *p++ = '\\';
            *p++ = s[-1];
            break;
        }
    }
    *p++ = '\0';
}

int
main()
{
    int i;
    unsigned char buffer[100];

    convert("Link\\u00f6ping", (char *) buffer);

    for (i = 0; buffer[i]; i++)
        if (buffer[i] < 0x20 || buffer[i] >= 0x80)
            printf("\\%03o", buffer[i]);
        else
            printf("%c", buffer[i]);

    return 0;
}





From gstein at lyra.org  Thu Nov 11 10:18:52 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 01:18:52 -0800 (PST)
Subject: [Python-Dev] Re: Internal Format
In-Reply-To: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>
Message-ID: 

On Wed, 10 Nov 1999, Fredrik Lundh wrote:
> Marc-Andre writes:
> 
>     The internal format for Unicode objects should either use a Python
>     specific fixed cross-platform format  (e.g. 2-byte
>     little endian byte order) or a compiler provided wchar_t format (if
>     available). Using the wchar_t format will ease embedding of Python in
>     other Unicode aware applications, but will also make internal format
>     dumps platform dependent. 
> 
> having been there and done that, I strongly suggest
> a third option: a 16-bit unsigned integer, in platform
> specific byte order (PY_UNICODE_T).  along all other
> roads lie code bloat and speed penalties...

I agree 100% !!

wchar_t will introduce portability issues right on up into the Python
level. The byte-order introduces speed issues and OS interoperability
issues, yet solves no portability problems (Byte Order Marks should still
be present and used).

There are two "platforms" out there that use Unicode: Win32 and Java. They
both use UCS-2, AFAIK.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From fredrik at pythonware.com  Wed Nov 10 09:24:16 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 09:24:16 +0100
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> One specific question: in you discussion of typed strings, I'm not
> sure why you couldn't convert everything to Unicode and be done with
> it.  I have a feeling that the answer is somewhere in your case study
> -- maybe you can elaborate?

Marc-Andre writes:

    Unicode objects should have a pointer to a cached (read-only) char
    buffer <defencbuf> holding the object's value using the current
    <default encoding>.  This is needed for performance and internal
    parsing (see below) reasons. The buffer is filled when the first
    conversion request to the <default encoding> is issued on the object.

keeping track of an external encoding is better left
for the application programmers -- I'm pretty sure that
different application builders will want to handle this
in radically different ways, depending on their environ-
ment, underlying user interface toolkit, etc.

besides, this is how Tcl would have done it.  Python's
not Tcl, and I think you need *very* good arguments
for moving in that direction.






From mal at lemburg.com  Wed Nov 10 10:04:39 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 10:04:39 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <001c01bf2b01$a58d5d50$0501a8c0@bobcat>
Message-ID: <38293527.3CF5C7B0@lemburg.com>

Mark Hammond wrote:
> 
> > I think his proposal will go a long way towards your toolkit.  I
> hope
> > to hear soon from anybody who disagrees with Marc-Andre's proposal,
> 
> No disagreement as such, but a small hole:
> 
> >From the proposal:
> 
> Internal Argument Parsing:
> --------------------------
> ...
> 's':    For Unicode objects: auto convert them to the <default encoding>
>         and return a pointer to the object's <defencbuf> buffer.
> 
> --
> Excellent - if someone passes a Unicode object, it can be
> auto-converted to a string.  This will allow "open()" to accept
> Unicode strings.

Well almost... it depends on the current value of <default encoding>.
If it's UTF8 and you only use normal ASCII characters the above is indeed
true, but UTF8 can go far beyond ASCII and have up to 3 bytes per
character (for UCS2, even more for UCS4). With <default encoding> set
to other exotic encodings this is likely to fail though.
 
> However, there doesnt appear to be a reverse.  Eg, if my extension
> module interfaces to a library that uses Unicode natively, how can I
> get a Unicode object when the user passes a string?  If I had to
> explicitely check for a string, then check for a Unicode on failure it
> would get messy pretty quickly...  Is it not possible to have "U" also
> do a conversion?

"U" is meant to simplify checks for Unicode objects, much like "S".
It returns a reference to the object. Auto-conversions are not possible
due to this, because they would create new objects which don't get
properly garbage collected later on.

Another problem is that Unicode types differ between platforms
(MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit
wchar_t). Depending on the internal format of Unicode objects
this could mean calling different conversion APIs.

BTW, I'm still not too sure about the underlying internal format.
The problem here is that Unicode started out as 2-byte fixed length
representation (UCS2) but then shifted towards a 4-byte fixed length
representation known as UCS4. Since having 4 bytes per character
is a hard sell to customers, UTF16 was created to stuff the UCS4
code points (this is how character entities are called in Unicode)
into 2 bytes... with a variable length encoding.

Some platforms that started early into the Unicode business
such as the MS ones use UCS2 as wchar_t, while more recent
ones (e.g. the glibc2 on Linux) use UCS4 for wchar_t. I haven't
yet checked in what ways the two are compatible (I would suspect
the top bytes in UCS4 being 0 for UCS2 codes), but would like
to hear whether it wouldn't be a better idea to use UTF16
as internal format. The latter works in 2 bytes for most
characters and conversion to UCS2|4 should be fast. Still,
conversion to UCS2 could fail.

The downside of using UTF16: it is a variable length format,
so iterations over it will be slower than for UCS4.

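To make the variable-length point concrete: counting characters in a
UTF16 buffer means testing every unit for a surrogate, whereas UCS4 is
a plain element count.  (Illustrative Python only, not proposed API.)

def utf16_length(units):
    # 'units' is a sequence of 16-bit integers (UTF16 code units)
    n = i = 0
    while i < len(units):
        if 0xD800 <= units[i] <= 0xDBFF:   # high surrogate: 2 units, 1 char
            i += 2
        else:
            i += 1
        n += 1
    return n

# one BMP character plus one surrogate pair: 3 code units, 2 characters
assert utf16_length([0x0041, 0xD800, 0xDC00]) == 2
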
Simply sticking to UCS2 is probably out of the question,
since Unicode 3.0 requires UCS4 and we are targeting
Unicode 3.0.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed Nov 10 10:49:01 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 10:49:01 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000201bf2b44$55b8ad00$d82d153f@tim>
Message-ID: <38293F8D.F60AE605@lemburg.com>

Tim Peters wrote:
> 
> > Marc-Andre Lemburg has a proposal for work that I'm asking him to do
> > (under pressure from HP who want Python i18n badly and are willing to
> > pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt
> 
> I can't make time for a close review now.  Just one thing that hit my eye
> early:
> 
>     Python should provide a built-in constructor for Unicode strings
>     which is available through __builtins__:
> 
>     u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
> 
>     u = u'<utf-8 encoded Python string>'
> 
> Two points on the Unicode literals (u'abc'):
> 
> UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by
> hand -- it breaks apart and rearranges bytes at the bit level, and
> everything other than 7-bit ASCII requires solid strings of "high-bit"
> characters.  This is painful for people to enter manually on both counts --
> and no common reference gives the UTF-8 encoding of glyphs directly.  So, as
> discussed earlier, we should follow Java's lead and also introduce a \u
> escape sequence:
> 
>     octet:           hexdigit hexdigit
>     unicodecode:     octet octet
>     unicode_escape:  "\\u" unicodecode
> 
> Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
> Unicode character at the unicodecode code position.  For consistency, then,
> it should probably expand the same way inside "regular strings" too.  Unlike
> Java does, I'd rather not give it a meaning outside string literals.

It would be more consistent to use the Unicode ordinal (instead of
interpreting the number as UTF8 encoding), e.g. \u03C0 for Pi. The
codes are easy to look up in the standard's UnicodeData.txt file or the
Unicode book for that matter.
 
> The other point is a nit:  The vast bulk of UTF-8 encodings encode
> characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
> those must either be explicitly outlawed, or explicitly defined.  I vote for
> outlawed, in the sense of detected error that raises an exception.  That
> leaves our future options open.

See my other post for a discussion of UCS4 vs. UTF16 vs. UCS2.

Perhaps we could add a flag to Unicode objects stating whether the characters
can be treated as UCS4 limited to the lower 16 bits (UCS4 and UTF16 are
the same in most ranges).

This flag could then be used to choose optimized algorithms for scanning
the strings. Fredrik's implementation currently uses UCS2, BTW.

> BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
> inverse in the Unicode world?  Both seem essential.

Good points.

How about 

  uniord(u[:1]) --> Unicode ordinal number (32-bit)

  unichr(i) --> Unicode object for character i (provided it is 32-bit);
                ValueError otherwise

They are inverse of each other, but note that Unicode allows 
private encodings too, which will of course not necessarily make
it across platforms or even from one PC to the next (see Andy Robinson's
interesting case study).

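A pure-Python sketch just to pin the intended semantics down (the range
check below is an assumption for the sketch, and today's chr()/ord()
stand in for the proposed Unicode object):

def unichr_(i):
    if not 0 <= i <= 0x10FFFF:             # assumed bound for the sketch
        raise ValueError('ordinal out of range')
    return chr(i)

def uniord_(u):
    if len(u) != 1:
        raise TypeError('expected a single Unicode character')
    return ord(u)

assert uniord_(unichr_(0x03C0)) == 0x03C0   # round trip, as described above
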
I've uploaded a new version of the proposal (0.3) to the URL:

	http://starship.skyport.net/~lemburg/unicode-proposal.txt

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fredrik at pythonware.com  Wed Nov 10 11:50:05 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 11:50:05 +0100
Subject: regexp performance (Re: [Python-Dev] I18N Toolkit
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com><199911091646.LAA21467@eric.cnri.reston.va.us><14376.22527.323888.677816@amarok.cnri.reston.va.us><199911091726.MAA21754@eric.cnri.reston.va.us><14376.23671.250752.637144@amarok.cnri.reston.va.us><14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us>
Message-ID: <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com>

Andrew M. Kuchling  wrote:
> (Speeding up PCRE -- that's another question.  I'm often tempted to
> rewrite pcre_compile to generate an easier-to-analyse parse tree,
> instead of its current complicated-but-memory-parsimonious compiler,
> but I'm very reluctant to introduce a fork like that.)

any special pattern constructs that are in need of per-
formance improvements?  (compared to Perl, that is).

or maybe anyone has an extensive performance test
suite for perlish regular expressions?  (preferrably based
on how real people use regular expressions, not only on
things that are known to be slow if not optimized)






From gstein at lyra.org  Thu Nov 11 11:46:55 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 02:46:55 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <38293527.3CF5C7B0@lemburg.com>
Message-ID: 

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
>...
> Well almost... it depends on the current value of <default encoding>.

Default encodings are kind of nasty when they can be altered. The same
problem occurred with import hooks. Only one can be present at a time.
This implies that modules, packages, subsystems, whatever, cannot set a
default encoding because something else might depend on it having a
different value. In the end, nobody uses the default encoding because it
is unreliable, so you end up with extra implementation/semantics that
aren't used/needed.

Have you ever noticed how Python modules, packages, tools, etc, never
define an import hook?

I'll bet nobody ever monkeys with the default encoding either...

I say axe it and say "UTF-8" is the fixed, default encoding. If you want
something else, then do that explicitly.

>...
> Another problem is that Unicode types differ between platforms
> (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit
> wchar_t). Depending on the internal format of Unicode objects
> this could mean calling different conversion APIs.

Exactly the reason to avoid wchar_t.

> BTW, I'm still not too sure about the underlying internal format.
> The problem here is that Unicode started out as 2-byte fixed length
> representation (UCS2) but then shifted towards a 4-byte fixed length
> representation known as UCS4. Since having 4 bytes per character
> is a hard sell to customers, UTF16 was created to stuff the UCS4
> code points (this is how character entities are called in Unicode)
> into 2 bytes... with a variable length encoding.

History is basically irrelevant. What is the situation today? What is in
use, and what are people planning for right now?

>...
> The downside of using UTF16: it is a variable length format,
> so iterations over it will be slower than for UCS4.

Bzzt. May as well go with UTF-8 as the internal format, much like Perl is
doing (as I recall).

Why go with a variable length format, when people seem to be doing fine
with UCS-2?

Like I said in the other mail note: two large platforms out there are
UCS-2 based. They seem to be doing quite well with that approach.

If people truly need UCS-4, then they can work with that on their own. One
of the major reasons for putting Unicode into Python is to
increase/simplify its ability to speak to the underlying platform. Hey!
Guess what? That generally means UCS2.

If we didn't need to speak to the OS with these Unicode values, then
people can work with the values entirely in Python,
PyUnicodeType-be-damned.

Are we digging a hole for ourselves? Maybe. But there are two other big
platforms that have the same hole to dig out of *IF* it ever comes to
that. I posit that it won't be necessary; that the people needing UCS-4
can do so entirely in Python.

Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and
vice-versa. But: it only does it from String to String -- you can't use
Unicode objects anywhere in there.

> Simply sticking to UCS2 is probably out of the question,
> since Unicode 3.0 requires UCS4 and we are targeting
> Unicode 3.0.

Oh? Who says?

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From fredrik at pythonware.com  Wed Nov 10 11:52:28 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 11:52:28 +0100
Subject: [Python-Dev] I18N Toolkit
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com><199911091646.LAA21467@eric.cnri.reston.va.us><14376.22527.323888.677816@amarok.cnri.reston.va.us><199911091726.MAA21754@eric.cnri.reston.va.us><14376.23671.250752.637144@amarok.cnri.reston.va.us><14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us>
Message-ID: <029c01bf2b69$af0da250$f29b12c2@secret.pythonware.com>

(a copy was sent to comp.lang.python by mistake;
sorry for that).

Andrew M. Kuchling  wrote:
> I don't think that will be a problem, given that the Unicode engine
> would be a separate C implementation.  A bit of 'if type(strg) ==
> UnicodeType' in re.py isn't going to cost very much speed.

a slightly hairer design issue is what combinations
of pattern and string the new 're' will handle.

the first two are obvious:
 
     ordinary pattern, ordinary string
     unicode pattern, unicode string
 
 but what about these?
 
     ordinary pattern, unicode string
     unicode pattern, ordinary string
 
 "coercing" patterns (i.e. recompiling, on demand)
 seem to be a somewhat risky business ;-)
 
 




From gstein at lyra.org  Thu Nov 11 11:50:56 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 02:50:56 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <38293F8D.F60AE605@lemburg.com>
Message-ID: 

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
> Tim Peters wrote:
> > BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
> > inverse in the Unicode world?  Both seem essential.
> 
> Good points.
> 
> How about 
> 
>   uniord(u[:1]) --> Unicode ordinal number (32-bit)
> 
>   unichr(i) --> Unicode object for character i (provided it is 32-bit);
>                 ValueError otherwise

Why new functions? Why not extend the definition of ord() and chr()?

In terms of backwards compatibility, the only issue could possibly be that
people relied on chr(x) to throw an error when x>=256. They certainly
couldn't pass a Unicode object to ord(), so that function can safely be
extended to accept a Unicode object and return a larger integer.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From jcw at equi4.com  Wed Nov 10 12:14:17 1999
From: jcw at equi4.com (Jean-Claude Wippler)
Date: Wed, 10 Nov 1999 12:14:17 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <38295389.397DDE5E@equi4.com>

Greg Stein wrote:
[MAL:]
> > The downside of using UTF16: it is a variable length format,
> > so iterations over it will be slower than for UCS4.
> 
> Bzzt. May as well go with UTF-8 as the internal format, much like Perl
> is doing (as I recall).

Ehm, pardon me for asking - what is the brief rationale for selecting
UCS2/4, or whatever it ends up being, over UTF8?

I couldn't find a discussion in the last months of the string SIG, was
this decided upon and frozen long ago?

I'm not trying to re-open a can of worms, just to understand.

-- Jean-Claude



From gstein at lyra.org  Thu Nov 11 12:17:56 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 03:17:56 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <38295389.397DDE5E@equi4.com>
Message-ID: 

On Wed, 10 Nov 1999, Jean-Claude Wippler wrote:
> Greg Stein wrote:
> > Bzzt. May as well go with UTF-8 as the internal format, much like Perl
> > is doing (as I recall).
> 
> Ehm, pardon me for asking - what is the brief rationale for selecting
> UCS2/4, or whatever it ends up being, over UTF8?
> 
> I couldn't find a discussion in the last months of the string SIG, was
> this decided upon and frozen long ago?

Try sometime last year :-) ... something like July thru September as I
recall.

Things will be a lot faster if we have a fixed-size character. Variable
length formats like UTF-8 are a lot harder to slice, search, etc. Also,
(IMO) a big reason for this new type is for interaction with the
underlying OS/platform. I don't know of any platforms right now that
really use UTF-8 as their Unicode string representation (meaning we'd
have to convert back/forth from our UTF-8 representation to talk to the
OS).

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Wed Nov 10 10:55:42 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 10:55:42 +0100
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us> <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>
Message-ID: <3829411E.FD32F8CC@lemburg.com>

Fredrik Lundh wrote:
> 
> Guido van Rossum  wrote:
> > One specific question: in you discussion of typed strings, I'm not
> > sure why you couldn't convert everything to Unicode and be done with
> > it.  I have a feeling that the answer is somewhere in your case study
> > -- maybe you can elaborate?
> 
> Marc-Andre writes:
> 
>     Unicode objects should have a pointer to a cached (read-only) char
>     buffer <defencbuf> holding the object's value using the current
>     <default encoding>.  This is needed for performance and internal
>     parsing (see below) reasons. The buffer is filled when the first
>     conversion request to the <default encoding> is issued on the object.
> 
> keeping track of an external encoding is better left
> for the application programmers -- I'm pretty sure that
> different application builders will want to handle this
> in radically different ways, depending on their environ-
> ment, underlying user interface toolkit, etc.

It's not that hard to implement. All you have to do is check
whether the current encoding in <defencbuf> still is the same
as the thread's view of <default encoding>. The <defencbuf>
buffer is needed to implement "s" et al. argument parsing
anyways.
 
> besides, this is how Tcl would have done it.  Python's
> not Tcl, and I think you need *very* good arguments
> for moving in that direction.
> 
> 
> 
> _______________________________________________
> Python-Dev maillist  -  Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed Nov 10 12:42:00 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 12:42:00 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
Message-ID: <38295A08.D3928401@lemburg.com>

Andy Robinson wrote:
> 
> In general, I like this proposal a lot, but I think it
> only covers half the story.  How we actually build the
> encoder/decoder for each encoding is a very big issue.
>  Thoughts on this below.
> 
> First, a little nit
> >  u = u''
> I don't like using funny prime characters - why not an
> explicit function like "utf8()"

u = unicode('...I am UTF8...','utf-8')

will do just that. I've moved to Tim's proposal with the
\uXXXX encoding for u'', BTW.
 
> On to the important stuff:
> 
> >  unicodec.register(<encname>,<encoder>,<decoder>
> >  [,<stream_encoder>, <stream_decoder>])
> 
> > This registers the codecs under the given encoding
> > name in the module global dictionary
> > unicodec.codecs. Stream codecs are optional:
> > the unicodec module will provide appropriate
> > wrappers around <encoder> and
> > <decoder> if not given.
> 
> I would MUCH prefer a single 'Encoding' class or type
> to wrap up these things, rather than up to four
> disconnected objects/functions.  Essentially it would
> be an interface standard and would offer methods to do
> the four things above.
> 
> There are several reasons for this.
>
> ...
>
> In summary, firm up the concept of an Encoding object
> and give it room to grow - that's the key to
> real-world usefulness.   If people feel the same way
> I'll have a go at an interface for that, and try show
> how it would have simplified specific problems I have
> faced.

Ok, you have a point there.

Here's a proposal (note that this only defines an interface,
not a class structure):

Codec Interface Definition:
---------------------------

The following base class should be defined in the module unicodec.

class Codec:

    def encode(self,u):
	
	""" Return the Unicode object u encoded as Python string.

	"""
	...

    def decode(self,s):

	""" Return an equivalent Unicode object for the encoded Python
	    string s.

	""" 
	...
	
    def dump(self,u,stream,slice=None):

	""" Writes the Unicode object's contents encoded to the stream.

	    stream must be a file-like object open for writing binary
	    data.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is written.

        """
	... the base class should provide a default implementation
	    of this method using self.encode ...
	
    def load(self,stream,length=None):

	""" Reads an encoded string (up to  bytes) from the
	    stream and returns an equivalent Unicode object.

	    stream must be a file-like object open for reading binary
	    data.

	    If length is given, only length bytes are read. Note that
	    this can cause the decoding algorithm to fail due to
	    truncations in the encoding.

        """
	... the base class should provide a default implementation
	    of this method using self.decode ...

Codecs should raise an UnicodeError in case the conversion is
not possible.

It is not required by the unicodec.register() API to provide a
subclass of this base class, only the 4 given methods must be present.
This allows writing Codecs as extension types.

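For illustration, a trivial codec written against the interface above
might look like this (pure Python, using today's built-in latin-1
conversion; not part of the proposal):

class Latin1Codec:
    # would derive from the Codec base class above; shown standalone here

    def encode(self, u):
        # Unicode object -> encoded Python (byte) string; failures surface
        # as UnicodeError, as required above
        return u.encode('latin-1')

    def decode(self, s):
        return s.decode('latin-1')

# registration along the lines of the proposed API (commented out, since
# the unicodec module does not exist yet):
#
#     unicodec.register('latin-1', Latin1Codec().encode, Latin1Codec().decode)
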
XXX Still to be discussed: 

    ? support for line breaks (see
      http://www.unicode.org/unicode/reports/tr13/ )

    ? support for case conversion: 

      Problems: string lengths can change due to multiple
      characters being mapped to a single new one, capital letters
      starting a word can be different than ones occurring in the
      middle, there are locale dependent deviations from the standard
      mappings.

    ? support for numbers, digits, whitespace, etc.

    ? support (or no support) for private code point areas


> We also need to think about where encoding info will
> live.  You cannot avoid mapping tables, although you
> can hide them inside code modules or pickled objects
> if you want.  Should there be a standard
> "..\Python\Enc" directory?

Mapping tables should be incorporated into the codec
modules preferably as static C data. That way multiple
processes can share the same data.

> And we're going to need some kind of testing and
> certification procedure when adding new encodings.
> This stuff has to be right.

I will have to rely on your cooperation for the test data.
Roundtrip testing is easy to implement, but I will also have
to verify the output against prechecked data which is probably only
creatable using visual tools to which I don't have access
(e.g. a Japanese Windows installation).
 
> Guido asked about TypedString.  This can probably be
> done on top of the built-in stuff - it is just a
> convenience which would clarify intent, reduce lines
> of code and prevent people shooting themselves in the
> foot when juggling a lot of strings in different
> (non-Unicode) encodings.  I can do a Python module to
> implement that on top of whatever is built.

Ok.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Wed Nov 10 11:03:36 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 11:03:36 +0100
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>
Message-ID: <382942F8.1921158E@lemburg.com>

Fredrik Lundh wrote:
> 
> Guido van Rossum  wrote:
> > http://starship.skyport.net/~lemburg/unicode-proposal.txt
> 
> Marc-Andre writes:
> 
>     The internal format for Unicode objects should either use a Python
>     specific fixed cross-platform format  (e.g. 2-byte
>     little endian byte order) or a compiler provided wchar_t format (if
>     available). Using the wchar_t format will ease embedding of Python in
>     other Unicode aware applications, but will also make internal format
>     dumps platform dependent.
> 
> having been there and done that, I strongly suggest
> a third option: a 16-bit unsigned integer, in platform
> specific byte order (PY_UNICODE_T).  along all other
> roads lie code bloat and speed penalties...
>
> (besides, this is exactly how it's already done in
> unicode.c and what 'sre' prefers...)

Ok, byte order can cause a speed penalty, so it might be
worthwhile introducing sys.bom (or sys.endianness) for this
reason and sticking to 16-bit integers as you have already done
in unicode.h.

What I don't like is using wchar_t if available (and then addressing
it as if it were defined as an unsigned integer). IMO, it's better
to define a Python Unicode representation which then gets converted
to whatever wchar_t represents on the target machine.

Another issue is whether to use UCS2 (as you have done) or UTF16
(which is what Unicode 3.0 requires)... see my other post
for a discussion.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fredrik at pythonware.com  Wed Nov 10 13:32:16 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 13:32:16 +0100
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> <382942F8.1921158E@lemburg.com>
Message-ID: <038501bf2b77$a06046f0$f29b12c2@secret.pythonware.com>

> What I don't like is using wchar_t if available (and then addressing
> it as if it were defined as unsigned integer). IMO, it's better
> to define a Python Unicode representation which then gets converted
> to whatever wchar_t represents on the target machine.

you should read the unicode.h file a bit more carefully:

...

/* Unicode declarations. Tweak these to match your platform */

/* set this flag if the platform has "wchar.h", "wctype.h" and the
   wchar_t type is a 16-bit unsigned type */
#define HAVE_USABLE_WCHAR_H

#if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H)

    (this uses wchar_t, and also iswspace and friends)

...

#else

/* Use if you have a standard ANSI compiler, without wchar_t support.
   If a short is not 16 bits on your platform, you have to fix the
   typedef below, or the module initialization code will complain. */

    (this maps iswspace to isspace, for 8-bit characters).

#endif

...

the plan was to use the second solution (using "configure"
to figure out what integer type to use), and its own unicode
database table for the is/to primitives.

(iirc, the unicode.txt file discussed this, but that one
seems to be missing from the zip archive).






From fredrik at pythonware.com  Wed Nov 10 13:39:56 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 13:39:56 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <039c01bf2b78$b234d520$f29b12c2@secret.pythonware.com>

Greg Stein  wrote:
> Have you ever noticed how Python modules, packages, tools, etc, never
> define an import hook?

hey, didn't MAL use one in one of his mx kits? ;-)

> I say axe it and say "UTF-8" is the fixed, default encoding. If you want
> something else, then do that explicitly.

exactly.

modes are evil.  python is not perl.  etc.

> Are we digging a hole for ourselves? Maybe. But there are two other big
> platforms that have the same hole to dig out of *IF* it ever comes to
> that. I posit that it won't be necessary; that the people needing UCS-4
> can do so entirely in Python.

last time I checked, there were no characters (even in the
ISO standard) outside the 16-bit range.  has that changed?






From mal at lemburg.com  Wed Nov 10 13:44:39 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 13:44:39 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <382968B7.ABFFD4C0@lemburg.com>

Greg Stein wrote:
> 
> On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
> > Tim Peters wrote:
> > > BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
> > > inverse in the Unicode world?  Both seem essential.
> >
> > Good points.
> >
> > How about
> >
> >   uniord(u[:1]) --> Unicode ordinal number (32-bit)
> >
> >   unichr(i) --> Unicode object for character i (provided it is 32-bit);
> >                 ValueError otherwise
> 
> Why new functions? Why not extend the definition of ord() and chr()?
> 
> In terms of backwards compatibility, the only issue could possibly be that
> people relied on chr(x) to throw an error when x>=256. They certainly
> couldn't pass a Unicode object to ord(), so that function can safely be
> extended to accept a Unicode object and return a larger integer.

Because unichr() will always have to return Unicode objects. You don't
want chr(i) to return Unicode for i>255 and strings for i<256.

OTOH, ord() could probably be extended to also work on Unicode objects.
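
To make the intended behaviour concrete (a sketch only -- none of this
is implemented anywhere yet):

    >>> unichr(0x20AC)            # always returns a Unicode object
    u'\u20ac'
    >>> ord(u'\u20ac')            # ord() extended to accept Unicode objects
    8364
    >>> chr(0x20AC)               # chr() itself stays 8-bit only
    Traceback (most recent call last):
      ...
    ValueError: chr() arg not in range(256)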

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed Nov 10 14:08:30 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 14:08:30 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <38296E4E.914C0ED7@lemburg.com>

Greg Stein wrote:
> 
> On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
> >...
> > Well almost... it depends on the current value of <default encoding>.
> 
> Default encodings are kind of nasty when they can be altered. The same
> problem occurred with import hooks. Only one can be present at a time.
> This implies that modules, packages, subsystems, whatever, cannot set a
> default encoding because something else might depend on it having a
> different value. In the end, nobody uses the default encoding because it
> is unreliable, so you end up with extra implementation/semantics that
> aren't used/needed.

I know, but this is a little different: you use strings a lot while
import hooks are rarely used directly by the user.

E.g. people in Europe will probably prefer Latin-1 as default
encoding while people in Asia will use one of the common CJK encodings.

The <default encoding> decides what encoding to use for many typical
tasks: printing, str(u), "s" argument parsing, etc.

Note that setting the <default encoding> is not intended to be
done prior to single operations. It is meant to be settable at
thread creation time.

> [...]
> 
> > BTW, I'm still not too sure about the underlying internal format.
> > The problem here is that Unicode started out as 2-byte fixed length
> > representation (UCS2) but then shifted towards a 4-byte fixed length
> > representation known as UCS4. Since having 4 bytes per character
> > is a hard sell to customers, UTF16 was created to stuff the UCS4
> > code points (this is how character entities are called in Unicode)
> > into 2 bytes... with a variable length encoding.
> 
> History is basically irrelevant. What is the situation today? What is in
> use, and what are people planning for right now?
> 
> >...
> > The downside of using UTF16: it is a variable length format,
> > so iterations over it will be slower than for UCS4.
> 
> Bzzt. May as well go with UTF-8 as the internal format, much like Perl is
> doing (as I recall).
> 
> Why go with a variable length format, when people seem to be doing fine
> with UCS-2?

The reason for UTF-16 is simply that it is identical to UCS-2
over large ranges, which makes optimizations (e.g. the UCS2 flag
I mentioned in an earlier post) feasible and effective. UTF-8
slows things down for CJK encodings, since the APIs will very often
have to scan the string to find the correct logical position in
the data.
 
Here's a quote from the Unicode FAQ (http://www.unicode.org/unicode/faq/ ):
"""
Q: How about using UCS-4 interfaces in my APIs?

Given an internal UTF-16 storage, you can, of course, still index into text
using UCS-4 indices. However, while converting from a UCS-4 index to a
UTF-16 index or vice versa is fairly straightforward, it does involve a
scan through the 16-bit units up to the index point. In a test run, for
example, accessing UTF-16 storage as UCS-4 characters results in a
10X degradation. Of course, the precise differences will depend on the
compiler, and there are some interesting optimizations that can be
performed, but it will always be slower on average. This kind of
performance hit is unacceptable in many environments.

Most Unicode APIs are using UTF-16. The low-level character indexing
are at the common storage level, with higher-level mechanisms for
graphemes or words specifying their boundaries in terms of the storage
units. This provides efficiency at the low levels, and the required
functionality at the high levels.

Convenience APIs can be produced that take parameters in UCS-4
methods for common utilities: e.g. converting UCS-4 indices back and
forth, accessing character properties, etc. Outside of indexing, differences
between UCS-4 and UTF-16 are not as important. For most other APIs
outside of indexing, characters values cannot really be considered
outside of their context--not when you are writing internationalized code.
For such operations as display, input, collation, editing, and even upper
and lowercasing, characters need to be considered in the context of a
string. That means that in any event you end up looking at more than one
character. In our experience, the incremental cost of doing surrogates is
pretty small.
"""

> Like I said in the other mail note: two large platforms out there are
> UCS-2 based. They seem to be doing quite well with that approach.
> 
> If people truly need UCS-4, then they can work with that on their own. One
> of the major reasons for putting Unicode into Python is to
> increase/simplify its ability to speak to the underlying platform. Hey!
> Guess what? That generally means UCS2.

All those formats are upward compatible (within certain ranges) and
the Python Unicode API will provide converters between its internal
format and the few common Unicode implementations, e.g. for MS
compilers (16-bit UCS2 AFAIK), GLIBC (32-bit UCS4).
 
> If we didn't need to speak to the OS with these Unicode values, then
> people can work with the values entirely in Python,
> PyUnicodeType-be-damned.
> 
> Are we digging a hole for ourselves? Maybe. But there are two other big
> platforms that have the same hole to dig out of *IF* it ever comes to
> that. I posit that it won't be necessary; that the people needing UCS-4
> can do so entirely in Python.
> 
> Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and
> vice-versa. But: it only does it from String to String -- you can't use
> Unicode objects anywhere in there.

See above.
 
> > Simply sticking to UCS2 is probably out of the question,
> > since Unicode 3.0 requires UCS4 and we are targetting
> > Unicode 3.0.
> 
> Oh? Who says?

From the FAQ:
"""
Q: What is UTF-16?

Unicode was originally designed as a pure 16-bit encoding, aimed at
representing all modern scripts. (Ancient scripts were to be represented
with private-use characters.) Over time, and especially after the addition
of over 14,500 composite characters for compatibility with legacy sets, it
became clear that 16-bits were not sufficient for the user community. Out
of this arose UTF-16.
"""

Note that there currently are no defined surrogate pairs for
UTF-16, meaning that in practice the difference between UCS-2 and
UTF-16 is probably negligible, e.g. we could define the internal
format to be UTF-16 and raise an exception whenever the border between
UTF-16 and UCS-2 is crossed -- sort of as a political compromise ;-).

But... I think HP has the last word on this one.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed Nov 10 13:36:44 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 13:36:44 +0100
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> <382942F8.1921158E@lemburg.com> <038501bf2b77$a06046f0$f29b12c2@secret.pythonware.com>
Message-ID: <382966DC.F33E340E@lemburg.com>

Fredrik Lundh wrote:
> 
> > What I don't like is using wchar_t if available (and then addressing
> > it as if it were defined as unsigned integer). IMO, it's better
> > to define a Python Unicode representation which then gets converted
> > to whatever wchar_t represents on the target machine.
> 
> you should read the unicode.h file a bit more carefully:
> 
> ...
> 
> /* Unicode declarations. Tweak these to match your platform */
> 
> /* set this flag if the platform has "wchar.h", "wctype.h" and the
>    wchar_t type is a 16-bit unsigned type */
> #define HAVE_USABLE_WCHAR_H
> 
> #if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H)
> 
>     (this uses wchar_t, and also iswspace and friends)
> 
> ...
> 
> #else
> 
> /* Use if you have a standard ANSI compiler, without wchar_t support.
>    If a short is not 16 bits on your platform, you have to fix the
>    typedef below, or the module initialization code will complain. */
> 
>     (this maps iswspace to isspace, for 8-bit characters).
> 
> #endif
> 
> ...
> 
> the plan was to use the second solution (using "configure"
> to figure out what integer type to use), and its own unicode
> database table for the is/to primitives.

Oh, I did read unicode.h, stumbled across the mixed usage
and decided not to like it ;-)

Seriously, I find the second solution where you use the 'unsigned
short' much more portable and straightforward. You never know what
the compiler does for isw*() and it's probably better sticking
to one format for all platforms. Only endianness gets in the way,
but that's easy to handle.

So I opt for 'unsigned short'. The encoding used in these 2 bytes
is a different question though. If HP insists on Unicode 3.0, there's
probably no other way than to use UTF-16.
 
> (iirc, the unicode.txt file discussed this, but that one
> seems to be missing from the zip archive).

It's not in the file I downloaded from your site. Could you post
it here ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed Nov 10 14:13:10 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 14:13:10 +0100
Subject: [Python-Dev] Internationalization Toolkit
References:  <38295389.397DDE5E@equi4.com>
Message-ID: <38296F66.5DF9263E@lemburg.com>

Jean-Claude Wippler wrote:
> 
> Greg Stein wrote:
> [MAL:]
> > > The downside of using UTF16: it is a variable length format,
> > > so iterations over it will be slower than for UCS4.
> >
> > Bzzt. May as well go with UTF-8 as the internal format, much like Perl
> > is doing (as I recall).
> 
> Ehm, pardon me for asking - what is the brief rationale for selecting
> UCS2/4, or whetever it ends up being, over UTF8?

UCS-2 is the native format on major platforms (meaning straight
fixed length encoding using 2 bytes), ie. interfacing between
Python's Unicode object and the platform APIs will be simple and
fast.

UTF-8 is short for ASCII users, but imposes a performance 
hit for the CJK (Asian character sets) world, since UTF8 uses
*variable* length encodings.
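
A quick way to see the difference (sketch, modern spellings of the
encoder names assumed):

    text = u"\u65e5\u672c\u8a9e"              # three CJK characters
    print(len(text.encode("utf-8")))          # 9 bytes: 3 per character
    print(len(text.encode("utf-16-be")))      # 6 bytes: fixed 2 per character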
 
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From akuchlin at mems-exchange.org  Wed Nov 10 15:56:16 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Wed, 10 Nov 1999 09:56:16 -0500 (EST)
Subject: [Python-Dev] Re: regexp performance
In-Reply-To: <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<14376.22527.323888.677816@amarok.cnri.reston.va.us>
	<199911091726.MAA21754@eric.cnri.reston.va.us>
	<14376.23671.250752.637144@amarok.cnri.reston.va.us>
	<14376.25768.368164.88151@anthem.cnri.reston.va.us>
	<14376.30652.201552.116828@amarok.cnri.reston.va.us>
	<027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com>
Message-ID: <14377.34704.639462.794509@amarok.cnri.reston.va.us>

[Cc'ed to the String-SIG; sheesh, what's the point of having SIGs
otherwise?]

Fredrik Lundh writes:
>any special pattern constructs that are in need of per-
>formance improvements?  (compared to Perl, that is).

In the 1.5 source tree, I think one major slowdown is coming from the
malloc'ed failure stack.  This was introduced in order to prevent an
expression like (x)* from filling the stack when applied to a string
containing 50,000 'x' characters (hence 50,000 recursive function
calls).  I'd like to get rid of this stack because it's slow and
requires much tedious patching of the upstream PCRE.

>or maybe anyone has an extensive performance test
>suite for perlish regular expressions?  (preferably based
>on how real people use regular expressions, not only on
>things that are known to be slow if not optimized)

Friedl's book describes several optimizations which aren't implemented
in PCRE.  The problem is that PCRE never builds a parse tree, and
parse trees are easy to analyse recursively.  Instead, PCRE's
functions actually look at the compiled byte codes (for example, look
at find_firstchar or is_anchored in pypcre.c), but this makes analysis
functions hard to write, and rearranging the code near-impossible.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
I didn't say it was my fault. I said it was my responsibility. I know the
difference.
    -- Rose Walker, in SANDMAN #60: "The Kindly Ones:4"



From jack at oratrix.nl  Wed Nov 10 16:04:58 1999
From: jack at oratrix.nl (Jack Jansen)
Date: Wed, 10 Nov 1999 16:04:58 +0100
Subject: [Python-Dev] I18N Toolkit 
In-Reply-To: Message by "Fredrik Lundh"  ,
	     Wed, 10 Nov 1999 11:52:28 +0100 , <029c01bf2b69$af0da250$f29b12c2@secret.pythonware.com> 
Message-ID: <19991110150458.B542735BB1E@snelboot.oratrix.nl>

> a slightly hairer design issue is what combinations
> of pattern and string the new 're' will handle.
> 
> the first two are obvious:
>  
>      ordinary pattern, ordinary string
>      unicode pattern, unicode string
>  
>  but what about these?
>  
>      ordinary pattern, unicode string
>      unicode pattern, ordinary string

I think the logical thing to do would be to "promote" the ordinary pattern or 
string to unicode, in a similar way to what happens if you combine ints and 
floats in a single expression.

The result may be a bit surprising if your pattern is in ascii and you've 
never been aware of unicode and are given such a string from somewhere else, 
but then if you're only aware of integer arithmetic and are suddenly presented 
with a couple of floats you'll also be pretty surprised at the result. At 
least it's easily explained.
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From fdrake at acm.org  Wed Nov 10 16:22:17 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 10 Nov 1999 10:22:17 -0500 (EST)
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>
Message-ID: <14377.36265.315127.788319@weyr.cnri.reston.va.us>

Fredrik Lundh writes:
 > having been there and done that, I strongly suggest
 > a third option: a 16-bit unsigned integer, in platform
 > specific byte order (PY_UNICODE_T).  along all other

  I actually like this best, but I understand that there are reasons
for using wchar_t, especially for interfacing with other code that
uses Unicode.
  Perhaps someone who knows more about the specific issues with
interfacing using wchar_t can summarize them, or point me to whatever
I've already missed.  p-)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From skip at mojam.com  Wed Nov 10 16:54:30 1999
From: skip at mojam.com (Skip Montanaro)
Date: Wed, 10 Nov 1999 09:54:30 -0600 (CST)
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>
Message-ID: <14377.38198.793496.870273@dolphin.mojam.com>

Just a couple observations from the peanut gallery...

1. I'm glad I don't have to do this Unicode/UTF/internationalization stuff.
   Seems like it would be easier to just get the whole world speaking
   Esperanto.

2. Are there plans for an internationalization session at IPC8?  Perhaps a
   few key players could be locked into a room for a couple days, to emerge
   bloodied, but with an implementation in-hand...

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From fdrake at acm.org  Wed Nov 10 16:58:30 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 10 Nov 1999 10:58:30 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <38295A08.D3928401@lemburg.com>
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
	<38295A08.D3928401@lemburg.com>
Message-ID: <14377.38438.615701.231437@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 >     def encode(self,u):
 > 	
 > 	""" Return the Unicode object u encoded as Python string.

  This should accept an optional slice parameter, and use it in the
same way as .dump().

 >     def dump(self,u,stream,slice=None):
...
 >     def load(self,stream,length=None):

  Why not have something like .wrapFile(f) that returns a file-like
object with all the file methods implemented, and doing the "right
thing" regarding encoding/decoding?  That way, the new file-like
object can be used directly with code that works with files and
doesn't care whether it uses 8-bit or unicode strings.
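
  A rough sketch of what I mean (names made up on the spot):

    class WrappedFile:
        # file-like object that encodes on write and decodes on read,
        # delegating everything else to the underlying stream
        def __init__(self, stream, codec):
            self.stream, self.codec = stream, codec
        def write(self, u):
            self.stream.write(self.codec.encode(u))
        def read(self, size=-1):
            return self.codec.decode(self.stream.read(size))
        def __getattr__(self, name):
            return getattr(self.stream, name)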

 > Codecs should raise an UnicodeError in case the conversion is
 > not possible.

  I think that should be ValueError, or UnicodeError should be a
subclass of ValueError.
  (Can the -X interpreter option be removed yet?)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From bwarsaw at cnri.reston.va.us  Wed Nov 10 17:41:29 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Wed, 10 Nov 1999 11:41:29 -0500 (EST)
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>
	<14377.38198.793496.870273@dolphin.mojam.com>
Message-ID: <14377.41017.413515.887236@anthem.cnri.reston.va.us>

>>>>> "SM" == Skip Montanaro  writes:

    SM> 2. Are there plans for an internationalization session at
    SM> IPC8?  Perhaps a few key players could be locked into a room
    SM> for a couple days, to emerge bloodied, but with an
    SM> implementation in-hand...

I'm starting to think about devday topics.  Sounds like an I18n
session would be very useful.  Champions?

-Barry



From mal at lemburg.com  Wed Nov 10 14:31:47 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 14:31:47 +0100
Subject: [Python-Dev] Internationalization Toolkit
References:  <039c01bf2b78$b234d520$f29b12c2@secret.pythonware.com>
Message-ID: <382973C3.DCA77051@lemburg.com>

Fredrik Lundh wrote:
> 
> Greg Stein  wrote:
> > Have you ever noticed how Python modules, packages, tools, etc, never
> > define an import hook?
> 
> hey, didn't MAL use one in one of his mx kits? ;-)

Not yet, but I will unless my last patch ("walk me up, Scotty" - import)
goes into the core interpreter.
 
> > I say axe it and say "UTF-8" is the fixed, default encoding. If you want
> > something else, then do that explicitly.
> 
> exactly.
> 
> modes are evil.  python is not perl.  etc.

But a requirement by the customer... they want to be able to set the locale
on a per thread basis. Not exactly my preference (I think all locale
settings should be passed as parameters, not via globals).
 
> > Are we digging a hole for ourselves? Maybe. But there are two other big
> > platforms that have the same hole to dig out of *IF* it ever comes to
> > that. I posit that it won't be necessary; that the people needing UCS-4
> > can do so entirely in Python.
> 
> last time I checked, there were no characters (even in the
> ISO standard) outside the 16-bit range.  has that changed?

No, but people are already thinking about it and there is
a defined range in the >16-bit area for private encodings
(F0000..FFFFD and 100000..10FFFD).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mhammond at skippinet.com.au  Wed Nov 10 22:36:04 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu, 11 Nov 1999 08:36:04 +1100
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382973C3.DCA77051@lemburg.com>
Message-ID: <005701bf2bc3$980f4d60$0501a8c0@bobcat>

Marc writes:

> > modes are evil.  python is not perl.  etc.
>
> But a requirement by the customer... they want to be able to
> set the locale
> on a per thread basis. Not exactly my preference (I think all locale
> settings should be passed as parameters, not via globals).

Sure - that is what this customer wants, but we need to be clear about
the "best thing" for Python generally versus what this particular
client wants.

For example, if we went with UTF-8 as the only default encoding, then
HP may be forced to use a helper function to perform the conversion,
rather than the built-in functions.  This helper function can use TLS
(in Python) to store the encoding.  At least it is localized.

I agree that having a default encoding that can be changed is a bad
idea.  It may make 3 line scripts that need to print something easier
to work with, but at the cost of reliability in large systems.  Kinda
like the existing "locale" support, which is thread specific, and is
well known to cause these sorts of problems.  The end result is that
in your app, you find _someone_ has changed the default encoding, and
some code no longer works.  So the solution is to change the default
encoding back, so _your_ code works again.  You just know that whoever
it was that changed the default encoding in the first place is now
going to break - but what else can you do?

Having a fixed, default encoding may make life slightly more difficult
when you want to work primarily in a different encoding, but at least
your system is predictable and reliable.

Mark.

>
> > > Are we digging a hole for ourselves? Maybe. But there are
> two other big
> > > platforms that have the same hole to dig out of *IF* it
> ever comes to
> > > that. I posit that it won't be necessary; that the people
> needing UCS-4
> > > can do so entirely in Python.
> >
> > last time I checked, there were no characters (even in the
> > ISO standard) outside the 16-bit range.  has that changed?
>
> No, but people are already thinking about it and there is
> a defined range in the >16-bit area for private encodings
> (F0000..FFFFD and 100000..10FFFD).
>
> --
> Marc-Andre Lemburg
>
______________________________________________________________________
> Y2000:                                                    51 days
left
> Business:
http://www.lemburg.com/
> Python Pages:
http://www.lemburg.com/python/
>
>
> _______________________________________________
> Python-Dev maillist  -  Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev
>




From gstein at lyra.org  Fri Nov 12 00:14:55 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 15:14:55 -0800 (PST)
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
In-Reply-To: <005701bf2bc3$980f4d60$0501a8c0@bobcat>
Message-ID: 

On Thu, 11 Nov 1999, Mark Hammond wrote:
> Marc writes:
> > > modes are evil.  python is not perl.  etc.
> >
> > But a requirement by the customer... they want to be able to
> > set the locale
> > on a per thread basis. Not exactly my preference (I think all locale
> > settings should be passed as parameters, not via globals).
> 
> Sure - that is what this customer wants, but we need to be clear about
> the "best thing" for Python generally versus what this particular
> client wants.

Ha! I was getting ready to say exactly the same thing. Are we building Python
for a particular customer, or are we building it to Do The Right Thing?

I've been getting increasingly annoyed at "well, HP says this" or "HP
wants that." I'm ecstatic that they are a Consortium member and are
helping to fund the development of Python. However, if that means we are
selling Python's soul to corporate wishes rather than programming and
design ideals... well, it reduces my enthusiasm :-)

>...
> I agree that having a default encoding that can be changed is a bad
> idea.  It may make 3 line scripts that need to print something easier
> to work with, but at the cost of reliability in large systems.  Kinda
> like the existing "locale" support, which is thread specific, and is
> well known to cause these sorts of problems.  The end result is that
> in your app, you find _someone_ has changed the default encoding, and
> some code no longer works.  So the solution is to change the default
> encoding back, so _your_ code works again.  You just know that whoever
> it was that changed the default encoding in the first place is now
> going to break - but what else can you do?

Yes! Yes! Example #2.

My first example (import hooks) was shrugged off by some as "well, nobody
uses those." Okay, maybe people don't use them (but I believe that is
*because* of this kind of problem).

In Mark's example, however... this is a definite problem. I ran into this
when I was building some code for Microsoft Site Server. IIS was setting a
different locale on my thread -- one that I definitely was not expecting.
All of a sudden, strlwr() no longer worked as I expected -- certain
characters didn't get lower-cased, so my dictionary lookups failed because
the keys were not all lower-cased.

Solution? Before passing control from C++ into Python, I set the locale to
the default locale. Restored it on the way back out. Extreme measures, and
costly to do, but it had to be done.

I think I'll pick up Fredrik's phrase here...

(chanting) "Modes Are Evil!"  "Modes Are Evil!"  "Down with Modes!"

:-)

> Having a fixed, default encoding may make life slightly more difficult
> when you want to work primarily in a different encoding, but at least
> your system is predictable and reliable.

*bing*

I'm with Mark on this one. Global modes and state are a serious pain when
it comes to developing a system.

Python is very amenable to utility functions and classes. Any "customer"
can use a utility function to manually do the encoding according to a
per-thread setting stashed in some module-global dictionary (map thread-id
to default-encoding). Done. Keep it out of the interpreter...
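
A sketch of such a utility -- all names invented, and the thread API
spelled the modern way:

    import threading

    _encodings = {}                      # thread id -> encoding name

    def set_default_encoding(name):
        _encodings[threading.get_ident()] = name

    def encode(u, encoding=None):
        # fall back to the per-thread setting, then to a fixed default
        enc = encoding or _encodings.get(threading.get_ident(), "utf-8")
        return u.encode(enc)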

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From da at ski.org  Thu Nov 11 00:21:54 1999
From: da at ski.org (David Ascher)
Date: Wed, 10 Nov 1999 15:21:54 -0800 (Pacific Standard Time)
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
In-Reply-To: 
Message-ID: 

On Thu, 11 Nov 1999, Greg Stein wrote:

> Ha! I was getting ready to say exactly the same thing. Are we building Python
> for a particular customer, or are we building it to Do The Right Thing?
> 
> I've been getting increasingly annoyed at "well, HP says this" or "HP
> wants that." I'm ecstatic that they are a Consortium member and are
> helping to fund the development of Python. However, if that means we are
> selling Python's soul to corporate wishes rather than programming and
> design ideals... well, it reduces my enthusiasm :-)

What about just explaining the rationale for the default-less point of
view to whoever is in charge of this at HP and see why they came up with
their rationale in the first place?  They might have a good reason, or
they might be willing to change said requirement.

--david




From gstein at lyra.org  Fri Nov 12 00:31:43 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 15:31:43 -0800 (PST)
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
In-Reply-To: 
Message-ID: 

Damn, you're smooth... maybe you should have run for SF Mayor...

:-)

On Wed, 10 Nov 1999, David Ascher wrote:
> On Thu, 11 Nov 1999, Greg Stein wrote:
> 
> > Ha! I was getting ready to say exactly the same thing. Are we building Python
> > for a particular customer, or are we building it to Do The Right Thing?
> > 
> > I've been getting increasingly annoyed at "well, HP says this" or "HP
> > wants that." I'm ecstatic that they are a Consortium member and are
> > helping to fund the development of Python. However, if that means we are
> > selling Python's soul to corporate wishes rather than programming and
> > design ideals... well, it reduces my enthusiasm :-)
> 
> What about just explaining the rationale for the default-less point of
> view to whoever is in charge of this at HP and see why they came up with
> their rationale in the first place?  They might have a good reason, or
> they might be willing to change said requirement.
> 
> --david
> 

--
Greg Stein, http://www.lyra.org/




From tim_one at email.msn.com  Thu Nov 11 07:25:27 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 01:25:27 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <00f501bf2b53$9872e610$f29b12c2@secret.pythonware.com>
Message-ID: <000201bf2c0d$8b866160$262d153f@tim>

[/F, dripping with code]
> ...
> Note that the 'u' must be followed by four hexadecimal digits.  If
> fewer digits are given, the sequence is left in the resulting string
> exactly as given.

Yuck -- don't let probable error pass without comment.  "must be" == "must
be"!

[moving backwards]
> \uxxxx -- Unicode character with hexadecimal value xxxx.  The
> character is stored using UTF-8 encoding, which means that this
> sequence can result in up to three encoded characters.

The code is fine, but I've gotten confused about what the intent is now.
Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8
literals, but now he's got Unicode-escaped literals instead -- and you favor
an internal 2-byte-per-char Unicode storage format.  In that combination of
worlds, is there any use in the *language* (as opposed to in a runtime
module) for \uxxxx -> UTF-8 conversion?

And MAL, if you're listening, I'm not clear on what a Unicode-escaped
literal means.  When you had UTF-8 literals, the meaning of something like

    u"a\340\341"

was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals
were just a way of specifying a byte stream.  As a Unicode-escaped string, I
assume the "a" maps to the Unicode "a", but what of the rest?  Are the octal
escapes to be taken as two separate Latin-1 characters (in their role as a
Unicode subset), or as an especially clumsy way to specify a single 16-bit
Unicode character?  I'm afraid I'd vote for the former.  Same issue wrt \x
escapes.

One other issue:  are there "raw" Unicode strings too, as in ur"\u20ac"?
There probably should be; and while Guido will hate this, a ur string should
probably *not* leave \uxxxx escapes untouched.  Nasties like this are why
Java defines \uxxxx expansion as occurring in a preprocessing step.

BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or
isn't \uxxxx allowed in a non-Unicode string?  that's what I would do ...).
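
For concreteness, the reading I'd vote for looks like this (a sketch only,
nothing implemented yet; error message abbreviated):

    >>> len(u"a\340\341")     # 'a' plus two Latin-1 chars, not one 16-bit char
    3
    >>> u"\u20ac"             # exactly four hex digits required after \u
    u'\u20ac'
    >>> u"\u20"               # fewer digits: complain, don't pass silently
    SyntaxError: truncated \uXXXX escape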





From tim_one at email.msn.com  Thu Nov 11 07:49:16 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 01:49:16 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: 
Message-ID: <000501bf2c10$df4679e0$262d153f@tim>

[ Greg Stein]
> ...
> Things will be a lot faster if we have a fixed-size character. Variable
> length formats like UTF-8 are a lot harder to slice, search, etc.

The initial byte of any UTF-8 encoded character never appears in a
*non*-initial position of any UTF-8 encoded character.  Which means
searching is not only tractable in UTF-8, but also that whatever optimized
8-bit clean string searching routines you happen to have sitting around
today can be used as-is on UTF-8 encoded strings.  This is not true of UCS-2
encoded strings (in which "the first" byte is not distinguished, so 8-bit
search is vulnerable to finding a hit starting "in the middle" of a
character).  More, to the extent that the bulk of your text is plain ASCII,
the UTF-8 search will run much faster than when using a 2-byte encoding,
simply because it has half as many bytes to chew over.
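
A tiny demonstration of the false-hit problem (a sketch, modern spellings):

    s = u"\u4e2d\u6587"                        # haystack: two CJK characters
    needle = u"\u2d65".encode("utf-16-be")     # one character, UCS-2 encoded
    print(needle in s.encode("utf-16-be"))     # True: hit starts mid-character
    print(u"\u2d65".encode("utf-8") in s.encode("utf-8"))   # False: no false hit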

UTF-8 is certainly slower for random-access indexing, including slicing.

I don't know what "etc" means, but if it follows the pattern so far,
sometimes it's faster and sometimes it's slower.

> (IMO) a big reason for this new type is for interaction with the
> underlying OS/platform. I don't know of any platforms right now that
> really use UTF-8 as their Unicode string representation (meaning we'd
> have to convert back/forth from our UTF-8 representation to talk to the
> OS).

No argument here.





From tim_one at email.msn.com  Thu Nov 11 07:56:35 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 01:56:35 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382968B7.ABFFD4C0@lemburg.com>
Message-ID: <000601bf2c11$e4b07920$262d153f@tim>

[MAL, on Unicode chr() and ord()]
> ...
> Because unichr() will always have to return Unicode objects. You don't
> want chr(i) to return Unicode for i>255 and strings for i<256.

Indeed I do not!

> OTOH, ord() could probably be extended to also work on Unicode objects.

I think it should be -- it's a good & natural use of polymorphism; introducing
a new function *here* would be as odd as introducing a unilen() function to
get the length of a Unicode string.





From tim_one at email.msn.com  Thu Nov 11 08:03:34 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 02:03:34 -0500
Subject: [Python-Dev] RE: [String-SIG] Re: regexp performance
In-Reply-To: <14377.34704.639462.794509@amarok.cnri.reston.va.us>
Message-ID: <000701bf2c12$de8bca80$262d153f@tim>

[Andrew M. Kuchling]
> ...
> Friedl's book describes several optimizations which aren't implemented
> in PCRE.  The problem is that PCRE never builds a parse tree, and
> parse trees are easy to analyse recursively.  Instead, PCRE's
> functions actually look at the compiled byte codes (for example, look
> at find_firstchar or is_anchored in pypcre.c), but this makes analysis
> functions hard to write, and rearranging the code near-impossible.

This is wonderfully & ironically Pythonic.  That is, the Python compiler
itself goes straight to byte code, and the optimization that's done works at
the latter low level.  Luckily, very little optimization is
attempted, and what's there only replaces one bytecode with another of the
same length.  If it tried to do more, it would have to rearrange the code
...

the-more-things-differ-the-more-things-don't-ly y'rs  - tim





From tim_one at email.msn.com  Thu Nov 11 08:27:52 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 02:27:52 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382973C3.DCA77051@lemburg.com>
Message-ID: <000801bf2c16$43f9a4c0$262d153f@tim>

[/F]
> last time I checked, there were no characters (even in the
> ISO standard) outside the 16-bit range.  has that changed?

[MAL]
> No, but people are already thinking about it and there is
> a defined range in the >16-bit area for private encodings
> (F0000..FFFFD and 100000..10FFFD).

Over the decades I've developed a rule of thumb that has never wound up
stuck in my ass:  If I engineer code that I expect to be in use for N
years, I make damn sure that every internal limit is at least 10x larger
than the largest I can conceive of a user making reasonable use of at the
end of those N years.  The invariable result is that the N years pass, and
fewer than half of the users have bumped into the limit <0.5 wink>.

At the risk of offending everyone, I'll suggest that, qualitatively
speaking, Unicode is as Eurocentric as ASCII is Anglocentric.  We've just
replaced "256 characters?!  We'll *never* run out of those!" with 64K.  But
when Asian languages consume them 7K at a pop, 64K isn't even in my 10x
comfort range for some individual languages.  In just a few months, Unicode
3 will already have used up > 56K of the 64K slots.

As I understand it, UTF-16 "only" adds 1M new code points.  That's in my 10x
zone, for about a decade.

predicting-we'll-live-to-regret-it-either-way-ly y'rs  - tim





From captainrobbo at yahoo.com  Thu Nov 11 08:29:05 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 23:29:05 -0800 (PST)
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
Message-ID: <19991111072905.25203.rocketmail@web607.mail.yahoo.com>

> 2. Are there plans for an internationalization
> session at IPC8?  Perhaps a
>    few key players could be locked into a room for a
> couple days, to emerge
>    bloodied, but with an implementation in-hand...

Excellent idea.  

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.




From tim_one at email.msn.com  Thu Nov 11 08:29:50 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 02:29:50 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <005701bf2bc3$980f4d60$0501a8c0@bobcat>
Message-ID: <000901bf2c16$8a107420$262d153f@tim>

[Mark Hammond]
> Sure - that is what this customer wants, but we need to be clear about
> the "best thing" for Python generally versus what this particular
> client wants.
> ...
> Having a fixed, default encoding may make life slightly more difficult
> when you want to work primarily in a different encoding, but at least
> your system is predictable and reliable.

Well said, Mark!  Me too.  It's like HP is suffering from Windows envy.





From captainrobbo at yahoo.com  Thu Nov 11 08:30:53 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 23:30:53 -0800 (PST)
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
Message-ID: <19991111073053.7884.rocketmail@web602.mail.yahoo.com>


--- "Barry A. Warsaw" 
wrote:
> 
> I'm starting to think about devday topics.  Sounds
> like an I18n
> session would be very useful.  Champions?
> 
I'm willing to explain what the fuss is about to
bemused onlookers and give some examples of problems
it should be able to solve - plenty of good slides and
screen shots.  I'll stay well away from the C
implementation issues.

Regards,

Andy

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.




From captainrobbo at yahoo.com  Thu Nov 11 08:33:25 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 23:33:25 -0800 (PST)
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
Message-ID: <19991111073325.8024.rocketmail@web602.mail.yahoo.com>

> 
> What about just explaining the rationale for the
> default-less point of
> view to whoever is in charge of this at HP and see
> why they came up with
> their rationale in the first place?  They might have
> a good reason, or
> they might be willing to change said requirement.
> 
> --david

For that matter (I came into this a bit late), is
there a statement somewhere of what HP actually want
to do?  

- Andy

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.




From captainrobbo at yahoo.com  Thu Nov 11 08:44:50 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 23:44:50 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991111074450.20451.rocketmail@web606.mail.yahoo.com>

> I say axe it and say "UTF-8" is the fixed, default
> encoding. If you want
> something else, then do that explicitly.
> 
Let me tell you why you would want to have an encoding
which can be set:

(1) Say I am on a Japanese Windows box, I have a
string called 'address' and I do 'print address'.  If
I see utf8, I see garbage.  If I see Shift-JIS, I see
the correct Japanese address.  At this point in time,
utf8 is an interchange format but 99% of the world's
data is in various native encodings.  

Analogous problems occur on input.

(2) I'm using htmlgen, which 'prints' objects to
standard output.  My web site is supposed to be
encoded in Shift-JIS (or EUC, or Big 5 for Taiwan,
etc.)  Yes, browsers CAN detect and display UTF8 but
you just don't find UTF8 sites in the real world - and
most users just don't know about the encoding menu,
and will get pissed off if they have to reach for it.

Ditto for streaming output in some protocol.

Java solves this (and we could too by hacking stdout)
using Writer classes which are created as wrappers
around an output stream and can take an encoding, but
you lose the flexibility to 'just print'.  

I think being able to change encoding would be useful.
 What I do not want is to auto-detect it from the
operating system when Python boots - that would be a
portability nightmare. 

Regards,

Andy





=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.




From fredrik at pythonware.com  Thu Nov 11 09:06:04 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 11 Nov 1999 09:06:04 +0100
Subject: [Python-Dev] RE: [String-SIG] Re: regexp performance
References: <000701bf2c12$de8bca80$262d153f@tim>
Message-ID: <009201bf2c1b$9a5c1b90$f29b12c2@secret.pythonware.com>

Tim Peters  wrote:
> > The problem is that PCRE never builds a parse tree, and
> > parse trees are easy to analyse recursively.  Instead, PCRE's
> > functions actually look at the compiled byte codes (for example, look
> > at find_firstchar or is_anchored in pypcre.c), but this makes analysis
> > functions hard to write, and rearranging the code near-impossible.
> 
> This is wonderfully & ironically Pythonic.  That is, the Python compiler
> itself goes straight to byte code, and the optimization that's done works at
> the latter low level.

yeah, but for some reason, people (including GvR) expect the
regular expression machinery to be more optimized than the
language interpreter ;-)






From tim_one at email.msn.com  Thu Nov 11 09:01:58 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 03:01:58 -0500
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
In-Reply-To: <19991111073325.8024.rocketmail@web602.mail.yahoo.com>
Message-ID: <000c01bf2c1b$0734c060$262d153f@tim>

[Andy Robinson]
> For that matter (I came into this a bit late), is
> there a statement somewhere of what HP actually want
> to do?

On this list, the best explanation we got was from Guido:  they want
"internationalization", and "Perl-compatible Unicode regexps".  I'm not sure
they even know the two aren't identical <0.9 wink>.

code-without-requirements-is-like-sex-without-consequences-ly y'rs  - tim





From guido at CNRI.Reston.VA.US  Thu Nov 11 13:03:51 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 11 Nov 1999 07:03:51 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: Your message of "Wed, 10 Nov 1999 23:44:50 PST."
             <19991111074450.20451.rocketmail@web606.mail.yahoo.com> 
References: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> 
Message-ID: <199911111203.HAA24221@eric.cnri.reston.va.us>

> Let me tell you why you would want to have an encoding
> which can be set:
> 
> (1) Say I am on a Japanese Windows box, I have a
> string called 'address' and I do 'print address'.  If
> I see utf8, I see garbage.  If I see Shift-JIS, I see
> the correct Japanese address.  At this point in time,
> utf8 is an interchange format but 99% of the world's
> data is in various native encodings.  
> 
> Analogous problems occur on input.
> 
> (2) I'm using htmlgen, which 'prints' objects to
> standard output.  My web site is supposed to be
> encoded in Shift-JIS (or EUC, or Big 5 for Taiwan,
> etc.)  Yes, browsers CAN detect and display UTF8 but
> you just don't find UTF8 sites in the real world - and
> most users just don't know about the encoding menu,
> and will get pissed off if they have to reach for it.
> 
> Ditto for streaming output in some protocol.
> 
> Java solves this (and we could too by hacking stdout)
> using Writer classes which are created as wrappers
> around an output stream and can take an encoding, but
> you lose the flexibility to 'just print'.  
> 
> I think being able to change encoding would be useful.
>  What I do not want is to auto-detect it from the
> operating system when Python boots - that would be a
> portability nightmare. 

You almost convinced me there, but I think this can still be done
without changing the default encoding: simply reopen stdout with a
different encoding.  This is how Java does it.  I/O streams with an
encoding specified at open() are a very powerful feature.  You can
hide this in your $PYTHONSTARTUP.
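
Concretely, the startup file could be as small as this (a sketch using a
codecs.getwriter-style stream wrapper along the lines discussed in this
thread; the spelling is assumed, not final):

    # $PYTHONSTARTUP (sketch): make 'print' emit Shift-JIS
    import sys, codecs
    sys.stdout = codecs.getwriter("shift_jis")(sys.stdout.buffer)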

François Pinard might not like it though...

BTW, someone asked what HP asked for: I can't reveal what exactly they
asked for, basically because they don't seem to agree amongst
themselves.  The only firm statements I have is that they want i18n
and that they want it fast (before the end of the year).

The desire for Perl-compatible regexps comes from me, and the only
reason is compatibility with re.py.  (HP did ask for regexps, but they
wouldn't know the difference between POSIX and Perl if it poked them in
the eye.)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gstein at lyra.org  Thu Nov 11 13:20:39 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 04:20:39 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit (fwd)
Message-ID: 

Andy originally sent this just to me... I replied in kind, but saw that he
sent another copy to python-dev. Sending my reply there...

---------- Forwarded message ----------
Date: Thu, 11 Nov 1999 04:00:38 -0800 (PST)
From: Greg Stein 
To: andy at robanal.demon.co.uk
Subject: Re: [Python-Dev] Internationalization Toolkit

[ note: you sent direct to me; replying in kind in case that was your
  intent ]

On Wed, 10 Nov 1999, [iso-8859-1] Andy Robinson wrote:
>...
> Let me tell you why you would want to have an encoding
> which can be set:
>...snip: two examples of how "print" fails...

Neither of those examples are solid reasons for having a default encoding
that can be changed. Both can easily be altered at the Python level by
using an encoding function before printing.

You're asking for convenience, *not* providing a reason.

> Java solves this (and we could too) using Writer
> classes which are created as wrappers around an output
> stream and can take an encoding, but you lose the
> flexibility to just print.  

Not flexibility: convenience. You can certainly do:

  print encode(u,'Shift-JIS')

> I think being able to change encoding would be useful.
>  What I do not want is to auto-detect it from the
> operating system when Python boots - that would be a
> portability nightmare. 

Useful, but not a requirement.

Keep the interpreter simple, understandable, and predictable. A module
that changes the default over to 'utf-8' because it is interacting with a
network object is going to screw up your app if you're relying on an
encoding of 'shift-jis' to be present.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/





From captainrobbo at yahoo.com  Thu Nov 11 13:49:10 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Thu, 11 Nov 1999 04:49:10 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991111124910.6373.rocketmail@web603.mail.yahoo.com>

> You almost convinced me there, but I think this can
> still be done
> without changing the default encoding: simply reopen
> stdout with a
> different encoding.  This is how Java does it.  I/O
> streams with an
> encoding specified at open() are a very powerful
> feature.  You can
> hide this in your $PYTHONSTARTUP.

Good point, I'm happy with this.  Make sure we specify
it in the docs as the right way to do it.  In an IDE,
we'd have an Options screen somewhere for the output
encoding.

What the Java code I have seen does is to open a raw
file and construct wrappers (InputStreamReader,
OutputStreamWriter) around it to do an encoding
conversion.  This kind of obfuscates what is going on
- Python just needs the extra argument.  

- Andy








=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.




From mal at lemburg.com  Thu Nov 11 13:42:51 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 13:42:51 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
		<38295A08.D3928401@lemburg.com> <14377.38438.615701.231437@weyr.cnri.reston.va.us>
Message-ID: <382AB9CB.634A9782@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  >     def encode(self,u):
>  >
>  >      """ Return the Unicode object u encoded as Python string.
> 
>   This should accept an optional slice parameter, and use it in the
> same way as .dump().

Ok.
 
>  >     def dump(self,u,stream,slice=None):
> ...
>  >     def load(self,stream,length=None):
> 
>   Why not have something like .wrapFile(f) that returns a file-like
> object with all the file methods implemented, and doing the "right
> thing" regarding encoding/decoding?  That way, the new file-like
> object can be used directly with code that works with files and
> doesn't care whether it uses 8-bit or unicode strings.

See File Output of the latest version:

File/Stream Output:
-------------------

Since file.write(object) and most other stream writers use the 's#'
argument parsing marker, the buffer interface implementation
determines the encoding to use (see Buffer Interface).

For explicit handling of Unicode using files, the unicodec module
could provide stream wrappers which provide transparent
encoding/decoding for any open stream (file-like object):

  import unicodec
  file = open('mytext.txt','rb')
  ufile = unicodec.stream(file,'utf-16')
  u = ufile.read()
  ...
  ufile.close()

XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as
    short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which
    also assures that <mode> contains the 'b' character when needed.
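
The write direction would look the same (still the hypothetical unicodec
wrapper sketched above):

  import unicodec
  file = open('mytext.txt','wb')
  ufile = unicodec.stream(file,'shift-jis')
  ufile.write(u)          # u is a Unicode object; encoded on the way out
  ufile.close()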
 
>  > Codecs should raise an UnicodeError in case the conversion is
>  > not possible.
> 
>   I think that should be ValueError, or UnicodeError should be a
> subclass of ValueError.

Ok.

>   (Can the -X interpreter option be removed yet?)

Doesn't Python convert class exceptions to strings when -X is
used? I would guess that many scripts already rely on the class-based
mechanism (much of my stuff does for sure), so by the time
1.6 is out, I think -X should be considered an option to run
pre-1.5 code rather than using it for performance reasons.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 14:01:40 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 14:01:40 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <005701bf2bc3$980f4d60$0501a8c0@bobcat>
Message-ID: <382ABE34.5D27C701@lemburg.com>

Mark Hammond wrote:
> 
> Marc writes:
> 
> > > modes are evil.  python is not perl.  etc.
> >
> > But a requirement by the customer... they want to be able to
> > set the locale
> > on a per thread basis. Not exactly my preference (I think all locale
> > settings should be passed as parameters, not via globals).
> 
> Sure - that is what this customer wants, but we need to be clear about
> the "best thing" for Python generally versus what this particular
> client wants.
> 
> For example, if we went with UTF-8 as the only default encoding, then
> HP may be forced to use a helper function to perform the conversion,
> rather than the built-in functions.  This helper function can use TLS
> (in Python) to store the encoding.  At least it is localized.
> 
> I agree that having a default encoding that can be changed is a bad
> idea.  It may make 3 line scripts that need to print something easier
> to work with, but at the cost of reliability in large systems.  Kinda
> like the existing "locale" support, which is thread specific, and is
> well known to cause these sorts of problems.  The end result is that
> in your app, you find _someone_ has changed the default encoding, and
> some code no longer works.  So the solution is to change the default
> encoding back, so _your_ code works again.  You just know that whoever
> it was that changed the default encoding in the first place is now
> going to break - but what else can you do?
> 
> Having a fixed, default encoding may make life slightly more difficult
> when you want to work primarily in a different encoding, but at least
> your system is predictable and reliable.

I think the discussion on this is getting a little too hot. The point
is simply that the option of changing the per-thread default encoding
is there. You are not required to use it and if you do you are on
your own when something breaks.

Think of it as a HP specific feature... perhaps I should wrap the code
in #ifdefs and leave it undocumented.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fdrake at acm.org  Thu Nov 11 16:02:32 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 11 Nov 1999 10:02:32 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382AB9CB.634A9782@lemburg.com>
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
	<38295A08.D3928401@lemburg.com>
	<14377.38438.615701.231437@weyr.cnri.reston.va.us>
	<382AB9CB.634A9782@lemburg.com>
Message-ID: <14378.55944.371933.613604@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > For explicit handling of Unicode using files, the unicodec module
 > could provide stream wrappers which provide transparent
 > encoding/decoding for any open stream (file-like object):

  Sounds good to me!  I guess I just missed it, there's been so much
going on lately.

 > XXX unicodec.file(<filename>,<mode>,<encoding>) could be provided as
 >     short-hand for unicodec.file(open(<filename>,<mode>),<encoding>) which
 >     also assures that <mode> contains the 'b' character when needed.

  Actually, I'd call it unicodec.open().

I asked:
 >   (Can the -X interpreter option be removed yet?)

You commented:
 > Doesn't Python convert class exceptions to strings when -X is
 > used ? I would guess that many scripts already rely on the class
 > based mechanism (much of my stuff does for sure), so by the time
 > 1.6 is out, I think -X should be considered an option to run
 > pre 1.5 code rather than using it for performance reasons.

  Gosh, I never thought of it as a performance issue!
  What I'd like to do is avoid code like this:

        try:
            class UnicodeError(ValueError):
                # well, something would probably go here...
                pass
        except TypeError:
            class UnicodeError:
                # something slightly different for this one...
                pass

  Trying to use class exceptions can be really tedious, and often I'd
like to pick up the stuff from Exception.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From mal at lemburg.com  Thu Nov 11 15:21:50 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:21:50 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000201bf2c0d$8b866160$262d153f@tim>
Message-ID: <382AD0FE.B604876A@lemburg.com>

Tim Peters wrote:
> 
> [/F, dripping with code]
> > ...
> > Note that the 'u' must be followed by four hexadecimal digits.  If
> > fewer digits are given, the sequence is left in the resulting string
> > exactly as given.
> 
> Yuck -- don't let probable error pass without comment.  "must be" == "must
> be"!

I second that.
 
> [moving backwards]
> > \uxxxx -- Unicode character with hexadecimal value xxxx.  The
> > character is stored using UTF-8 encoding, which means that this
> > sequence can result in up to three encoded characters.
> 
> The code is fine, but I've gotten confused about what the intent is now.
> Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8
> literals, but now he's got Unicode-escaped literals instead -- and you favor
> an internal 2-byte-per-char Unicode storage format.  In that combination of
> worlds, is there any use in the *language* (as opposed to in a runtime
> module) for \uxxxx -> UTF-8 conversion?

No, no...  :-) 

I think it was a simple misunderstanding... \uXXXX is only to be
used within u'' strings and then gets expanded to *one* character
encoded in the internal Python format (which is heading towards UTF-16
without surrogates).
 
> And MAL, if you're listening, I'm not clear on what a Unicode-escaped
> literal means.  When you had UTF-8 literals, the meaning of something like
> 
>     u"a\340\341"
> 
> was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals
> were just a way of specifying a byte stream.  As a Unicode-escaped string, I
> assume the "a" maps to the Unicode "a", but what of the rest?  Are the octal
> escapes to be taken as two separate Latin-1 characters (in their role as a
> Unicode subset), or as an especially clumsy way to specify a single 16-bit
> Unicode character?  I'm afraid I'd vote for the former.  Same issue wrt \x
> escapes.

Good points.

The conversion goes as follows:
- for single characters (and this includes all \XXX sequences except \uXXXX),
  take the ordinal and interpret it as Unicode ordinal
- for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX
  instead
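
To spell those two rules out with a few concrete cases (assuming u''
literals and the proposed unichr()):

  assert u'\101'   == unichr(0x41)    # octal escape -> Unicode ordinal 65 ('A')
  assert u'\xe9'   == unichr(0xE9)    # Latin-1 e-acute, a single character
  assert u'\u20ac' == unichr(0x20AC)  # one character, not three UTF-8 bytes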
 
> One other issue:  are there "raw" Unicode strings too, as in ur"\u20ac"?
> There probably should be; and while Guido will hate this, a ur string should
> probably *not* leave \uxxxx escapes untouched.  Nasties like this are why
> Java defines \uxxxx expansion as occurring in a preprocessing step.

Not sure whether we really need to make this even more complicated...
The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or filenames
won't hurt much in the context of those \uXXXX monsters :-)

> BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or
> isn't \uxxxx allowed in a non-Unicode string?  that's what I would do ...).

Right. \uXXXX will only be allowed in u'' strings, not in "normal"
strings.

BTW, if you want to type in UTF-8 strings and have them converted
to Unicode, you can use the standard:

u = unicode('...string with UTF-8 encoded characters...','utf-8')

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 15:23:45 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:23:45 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000601bf2c11$e4b07920$262d153f@tim>
Message-ID: <382AD171.D22A1D6E@lemburg.com>

Tim Peters wrote:
> 
> [MAL, on Unicode chr() and ord()
> > ...
> > Because unichr() will always have to return Unicode objects. You don't
> > want chr(i) to return Unicode for i>255 and strings for i<256.
> 
> Indeed I do not!
> 
> > OTOH, ord() could probably be extended to also work on Unicode objects.
> 
> I think should be -- it's a good & natural use of polymorphism; introducing
> a new function *here* would be as odd as introducing a unilen() function to
> get the length of a Unicode string.

Fine. So I'll drop the uniord() API and extend ord() instead.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 15:36:41 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:36:41 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000901bf2c16$8a107420$262d153f@tim>
Message-ID: <382AD479.5261B43B@lemburg.com>

Tim Peters wrote:
> 
> [Mark Hammond]
> > Sure - that is what this customer wants, but we need to be clear about
> > the "best thing" for Python generally versus what this particular
> > client wants.
> > ...
> > Having a fixed, default encoding may make life slightly more difficult
> > when you want to work primarily in a different encoding, but at least
> > your system is predictable and reliable.
> 
> Well said, Mark!  Me too.  It's like HP is suffering from Windows envy
> <wink>.

See my other post on the subject...

Note that if we make UTF-8 the standard encoding, nearly all 
special Latin-1 characters will produce UTF-8 errors on input
and unreadable garbage on output. That will probably be unacceptable
in Europe. To remedy this, one would *always* have to use
u.encode('latin-1') to get readable output for Latin-1 strings
represented in Unicode.

I'd rather see this happen the other way around: *always* explicitly
state the encoding you want in case you rely on it, e.g. write

file.write(u.encode('utf-8'))

instead of

file.write(u) # let's hope this goes out as UTF-8...

Using the <default encoding> as a site dependent setting is useful
for convenience in those cases where the output format should be
readable rather than parseable.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 15:26:59 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:26:59 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000801bf2c16$43f9a4c0$262d153f@tim>
Message-ID: <382AD233.BE6DE888@lemburg.com>

Tim Peters wrote:
> 
> [/F]
> > last time I checked, there were no characters (even in the
> > ISO standard) outside the 16-bit range.  has that changed?
> 
> [MAL]
> > No, but people are already thinking about it and there is
> > a defined range in the >16-bit area for private encodings
> > (F0000..FFFFD and 100000..10FFFD).
> 
> Over the decades I've developed a rule of thumb that has never wound up
> stuck in my ass :  If I engineer code that I expect to be in use for N
> years, I make damn sure that every internal limit is at least 10x larger
> than the largest I can conceive of a user making reasonable use of at the
> end of those N years.  The invariable result is that the N years pass, and
> fewer than half of the users have bumped into the limit <0.5 wink>.
> 
> At the risk of offending everyone, I'll suggest that, qualitatively
> speaking, Unicode is as Eurocentric as ASCII is Anglocentric.  We've just
> replaced "256 characters?!  We'll *never* run out of those!" with 64K.  But
> when Asian languages consume them 7K at a pop, 64K isn't even in my 10x
> comfort range for some individual languages.  In just a few months, Unicode
> 3 will already have used up > 56K of the 64K slots.
> 
> As I understand it, UTF-16 "only" adds 1M new code points.  That's in my 10x
> zone, for about a decade.

If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
signal failure of this assertion at Unicode object construction time
via an exception. That way we are within the standard, can use
reasonably fast code for Unicode manipulation and add those extra 1M
characters at a later stage.
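
(A sketch of what that construction-time check could amount to; the names
here are illustrative only, not proposed API:)

  class UnicodeError(ValueError):
      pass

  def check_ucs2(ordinals):
      # Reject anything that cannot live in a single 16-bit code unit:
      # values beyond 0xFFFF and the surrogate range reserved by UTF-16.
      for o in ordinals:
          if o > 0xFFFF or 0xD800 <= o <= 0xDFFF:
              raise UnicodeError("character U+%X not supported yet" % o)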

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 15:47:49 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:47:49 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> <199911111203.HAA24221@eric.cnri.reston.va.us>
Message-ID: <382AD715.66DBA125@lemburg.com>

Guido van Rossum wrote:
> 
> > Let me tell you why you would want to have an encoding
> > which can be set:
> >
> > (1) Say I am on a Japanese Windows box, I have a
> > string called 'address' and I do 'print address'.  If
> > I see utf8, I see garbage.  If I see Shift-JIS, I see
> > the correct Japanese address.  At this point in time,
> > utf8 is an interchange format but 99% of the world's
> > data is in various native encodings.
> >
> > Analogous problems occur on input.
> >
> > (2) I'm using htmlgen, which 'prints' objects to
> > standard output.  My web site is supposed to be
> > encoded in Shift-JIS (or EUC, or Big 5 for Taiwan,
> > etc.)  Yes, browsers CAN detect and display UTF8 but
> > you just don't find UTF8 sites in the real world - and
> > most users just don't know about the encoding menu,
> > and will get pissed off if they have to reach for it.
> >
> > Ditto for streaming output in some protocol.
> >
> > Java solves this (and we could too by hacking stdout)
> > using Writer classes which are created as wrappers
> > around an output stream and can take an encoding, but
> > you lose the flexibility to 'just print'.
> >
> > I think being able to change encoding would be useful.
> >  What I do not want is to auto-detect it from the
> > operating system when Python boots - that would be a
> > portability nightmare.
> 
> You almost convinced me there, but I think this can still be done
> without changing the default encoding: simply reopen stdout with a
> different encoding.  This is how Java does it.  I/O streams with an
> encoding specified at open() are a very powerful feature.  You can
> hide this in your $PYTHONSTARTUP.

True and it probably covers all cases where setting the
default encoding to something other than UTF-8 makes sense.

I guess you've convinced me there ;-)

The current proposal has wrappers around streams for this purpose:

For explicit handling of Unicode using files, the unicodec module
could provide stream wrappers which provide transparent
encoding/decoding for any open stream (file-like object):

  import unicodec
  file = open('mytext.txt','rb')
  ufile = unicodec.stream(file,'utf-16')
  u = ufile.read()
  ...
  ufile.close()

XXX unicodec.file(<filename>,<mode>,<encoding>) could be provided as
    short-hand for unicodec.file(open(<filename>,<mode>),<encoding>) which
    also assures that <mode> contains the 'b' character when needed.

The above can be done using:

import sys,unicodec
sys.stdin = unicodec.stream(sys.stdin,'jis')
sys.stdout = unicodec.stream(sys.stdout,'jis')

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From jack at oratrix.nl  Thu Nov 11 16:58:39 1999
From: jack at oratrix.nl (Jack Jansen)
Date: Thu, 11 Nov 1999 16:58:39 +0100
Subject: [Python-Dev] Internationalization Toolkit 
In-Reply-To: Message by "M.-A. Lemburg"  ,
	     Thu, 11 Nov 1999 15:23:45 +0100 , <382AD171.D22A1D6E@lemburg.com> 
Message-ID: <19991111155839.BFB0235BB1E@snelboot.oratrix.nl>

> > [MAL, on Unicode chr() and ord()
> > > ...
> > > Because unichr() will always have to return Unicode objects. You don't
> > > want chr(i) to return Unicode for i>255 and strings for i<256.

> > > OTOH, ord() could probably be extended to also work on Unicode objects.

> Fine. So I'll drop the uniord() API and extend ord() instead.

Hmm, then wouldn't it be more logical to drop unichr() too, but add an 
optional parameter to chr() to specify what sort of a string you want? The 
type-object of a unicode string comes to mind...
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From bwarsaw at cnri.reston.va.us  Thu Nov 11 17:04:29 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Thu, 11 Nov 1999 11:04:29 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
	<38295A08.D3928401@lemburg.com>
	<14377.38438.615701.231437@weyr.cnri.reston.va.us>
	<382AB9CB.634A9782@lemburg.com>
Message-ID: <14378.59661.376434.449820@anthem.cnri.reston.va.us>

>>>>> "M" == M   writes:

    M> Doesn't Python convert class exceptions to strings when -X is
    M> used ? I would guess that many scripts already rely on the
    M> class based mechanism (much of my stuff does for sure), so by
    M> the time 1.6 is out, I think -X should be considered an option
    M> to run pre 1.5 code rather than using it for performance
    M> reasons.

This is a little off-topic so I'll be brief.  When using -X Python
never even creates the class exceptions, so it isn't really a
conversion.  It just uses string exceptions and tries to craft tuples
for what would be the superclasses in the class-based exception
hierarchy.  Yes, class-based exceptions are a bit of a performance hit
when you are catching exceptions in Python (because they need to be
instantiated), but they're just so darn *useful*.  I wouldn't mind
seeing the -X option go away for 1.6.

-Barry



From captainrobbo at yahoo.com  Thu Nov 11 17:08:15 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Thu, 11 Nov 1999 08:08:15 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991111160815.5235.rocketmail@web608.mail.yahoo.com>

> See my other post on the subject...
> 
> Note that if we make UTF-8 the standard encoding,
> nearly all 
> special Latin-1 characters will produce UTF-8 errors
> on input
> and unreadable garbage on output. That will probably
> be unacceptable
> in Europe. To remedy this, one would *always* have
> to use
> u.encode('latin-1') to get readable output for
> Latin-1 strings
> repesented in Unicode.

You beat me to it - a colleague and I were just
discussing this verbally.  Specifically we Brits will
get annoyed as soon as we read in a text file with
pound (sterling) signs.

We concluded that the only reasonable default (if you
have one at all) is pure ASCII.  At least that way I
will get a clear and intelligible warning when I load
in such a file, and will remember to specify
ISO-Latin-1.  

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From mal at lemburg.com  Thu Nov 11 16:59:21 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 16:59:21 +0100
Subject: [Python-Dev] Unicode proposal: %-formatting ?
Message-ID: <382AE7D9.147D58CB@lemburg.com>

I wonder how we could add %-formatting to Unicode strings without
duplicating the PyString_Format() logic.

First, do we need Unicode object %-formatting at all ?

Second, here is an emulation using strings and <default encoding>
that should give an idea of how one could work with the different
encodings:

    s = '%s %i abc???' # a Latin-1 encoded string
    t = (u,3)

    # Convert Latin-1 s to a <default encoding> string via Unicode
    s1 = unicode(s,'latin-1').encode()

    # The '%s' will now add u in <default encoding>
    s2 = s1 % t

    # Finally, convert the <default encoding> encoded string to Unicode
    u1 = unicode(s2)

Note that .encode() defaults to the current setting of
<default encoding>.

Provided u maps to Latin-1, an alternative would be:

    u1 = unicode('%s %i abc???' % (u.encode('latin-1'),3), 'latin-1')
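
(Packaged as a helper, the emulation might look like this -- the function
name is invented, and it leans on the proposed unicode()/.encode() API:)

    def uformat(fmt, args, encoding='latin-1'):
        # fmt is an 8-bit string in the given encoding; the result is a
        # Unicode object.  Both conversions go through <default encoding>,
        # exactly as in the step-by-step version above.
        s = unicode(fmt, encoding).encode() % args
        return unicode(s)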

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 18:04:37 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 18:04:37 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991111155839.BFB0235BB1E@snelboot.oratrix.nl>
Message-ID: <382AF725.FC66C9B6@lemburg.com>

Jack Jansen wrote:
> 
> > > [MAL, on Unicode chr() and ord()
> > > > ...
> > > > Because unichr() will always have to return Unicode objects. You don't
> > > > want chr(i) to return Unicode for i>255 and strings for i<256.
> 
> > > > OTOH, ord() could probably be extended to also work on Unicode objects.
> 
> > Fine. So I'll drop the uniord() API and extend ord() instead.
> 
> Hmm, then wouldn't it be more logical to drop unichr() too, but add an
> optional parameter to chr() to specify what sort of a string you want? The
> type-object of a unicode string comes to mind...

Like:

import types
uc = chr(12,types.UnicodeType)

... looks overly complicated, IMHO.

uc = unichr(12)

and

u = unicode('abc')

look pretty intuitive to me.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 18:31:34 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 18:31:34 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991111160815.5235.rocketmail@web608.mail.yahoo.com>
Message-ID: <382AFD76.A0D3FEC4@lemburg.com>

Andy Robinson wrote:
> 
> > See my other post on the subject...
> >
> > Note that if we make UTF-8 the standard encoding,
> > nearly all
> > special Latin-1 characters will produce UTF-8 errors
> > on input
> > and unreadable garbage on output. That will probably
> > be unacceptable
> > in Europe. To remedy this, one would *always* have
> > to use
> > u.encode('latin-1') to get readable output for
> > Latin-1 strings
> > repesented in Unicode.
> 
> You beat me to it - a colleague and I were just
> discussing this verbally.  Specifically we Brits will
> get annoyed as soon as we read in a text file with
> pound (sterling) signs.
> 
> We concluded that the only reasonable default (if you
> have one at all) is pure ASCII.  At least that way I
> will get a clear and intelligible warning when I load
> in such a file, and will remember to specify
> ISO-Latin-1.

Well, Guido's post made me rethink the approach...

1. Setting <default encoding> to any non-UTF encoding
   will result in data lossage due to the encoding limits
   imposed by the other formats -- this is dangerous and
   will result in errors (some of which may not even be
   noticed due to the interpreter ignoring them) in case
   your strings use non-encodable characters.

2. You basically only want to set <default encoding> to
   anything other than UTF-8 for stream input and output.
   This can be done using the unicodec stream wrapper without
   too much inconvenience. (We'll have to extend the wrapper a little,
   though, because it currently only accepts Unicode objects for
   writing and always returns Unicode objects when reading -- see
   the sketch below.)

3. We should leave the issue open until some code is there
   to be tested... I have a feeling that there will be quite
   a few strange effects when APIs expecting strings are fed
   with Unicode objects returning UTF-8.
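
(A rough sketch of the extension mentioned in point 2 -- the helper name
is invented and it only assumes the u'' literals and .encode() method
under discussion:)

    def stream_write(stream, obj, encoding):
        # Accept both a plain string (assumed to be encoded already) and
        # a Unicode object; only the latter needs an .encode() step.
        if type(obj) == type(u''):
            obj = obj.encode(encoding)
        stream.write(obj)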

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mhammond at skippinet.com.au  Fri Nov 12 02:10:09 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 12 Nov 1999 12:10:09 +1100
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382ABE34.5D27C701@lemburg.com>
Message-ID: <007a01bf2caa$aabdef60$0501a8c0@bobcat>

> Mark Hammond wrote:
> > Having a fixed, default encoding may make life slightly
> more difficult
> > when you want to work primarily in a different encoding,
> but at least
> > your system is predictable and reliable.
>
> I think the discussion on this is getting a little too hot.

Really - I see it as moving to a rational consensus that doesn't
support the proposal in this regard.  I see no heat in it at all.  I'm
sorry if you saw my post or any of the followups as "emotional", but I'm
certainly not getting passionate about this.  I don't see any of this
as affecting me personally.  I believe that I can replace my Unicode
implementation with this either way we go.  Just because we are
trying to get it right doesn't mean we are getting heated.

> The point
> is simply that the option of changing the per-thread default
encoding
> is there. You are not required to use it and if you do you are on
> your own when something breaks.

Hrm - I'm having serious trouble following your logic here.  If I make
_any_ assumptions about a default encoding, I am in danger of
breaking.  I may not choose to change the default, but as soon as
_anyone_ does, unrelated code may break.

I agree that I will be "on my own", but I won't necessarily have been
the one that changed it :-(

The only answer I can see is, as you suggest, to ignore the fact that
there is _any_ default.  Always specify the encoding.  But obviously
this is not good enough for HP:

> Think of it as a HP specific feature... perhaps I should wrap the
code
> in #ifdefs and leave it undocumented.

That would work - just ensure that no standard Python has those
#ifdefs turned on :-)  I would be sorely disappointed if the fact that
HP are throwing money for this means they get every whim implemented
in the core language.  Imagine the outcry if it were instead MS'
money, and you were attempting to put an MS spin on all this.

Are you writing a module for HP, or writing a module for Python that
HP are assisting by providing some funding?  Clear difference.  IMO,
it must also be seen that there is a clear difference.

Maybe I'm missing something.  Can you explain why it is good enough
for everyone else to be required to assume there is no default encoding,
but HP get their thread-specific global?  Are their requirements
greater than anyone else's?  Is everyone else not as important?  What
would you, as a consultant, recommend to people who aren't HP, but have
a similar requirement?  It would seem obvious to me that HP's
requirement can be met in "pure Python", thereby keeping this out of
the core altogether...

Mark.




From gmcm at hypernet.com  Fri Nov 12 03:01:23 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Thu, 11 Nov 1999 21:01:23 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <007a01bf2caa$aabdef60$0501a8c0@bobcat>
References: <382ABE34.5D27C701@lemburg.com>
Message-ID: <1269750417-7621469@hypernet.com>

[per-thread defaults]

C'mon guys, hasn't anyone ever played consultant before? The 
idea is obviously brain-dead. OTOH, they asked for it 
specifically, meaning they have some assumptions about how 
they think they're going to use it. If you give them what they 
ask for, you'll only have to fix it when they realize there are 
other ways of doing things that don't work with per-thread 
defaults. So, you find out why they think it's a good thing; you 
make it easy for them to code this way (without actually using 
per-thread defaults) and you don't make a fuss about it. More 
than likely, they won't either.

"requirements"-are-only-useful-as-clues-to-the-objectives-
behind-them-ly y'rs



- Gordon



From tim_one at email.msn.com  Fri Nov 12 06:04:44 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 00:04:44 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382AB9CB.634A9782@lemburg.com>
Message-ID: <000a01bf2ccb$6f59c2c0$fd2d153f@tim>

[MAL]
>>> Codecs should raise an UnicodeError in case the conversion is
>>> not possible.

[Fred L. Drake, Jr.]
>>   I think that should be ValueError, or UnicodeError should be a
>> subclass of ValueError.
>>   (Can the -X interpreter option be removed yet?)

[MAL]
> Doesn't Python convert class exceptions to strings when -X is
> used ? I would guess that many scripts already rely on the class
> based mechanism (much of my stuff does for sure), so by the time
> 1.6 is out, I think -X should be considered an option to run
> pre 1.5 code rather than using it for performance reasons.

-X is a red herring.  That is, do what seems best without regard for -X.  I
already added one subclass exception to the CVS tree (UnboundLocalError as a
subclass of NameError), and in doing that had to figure out how to make it
do the right thing under -X too.  It's a bit clumsy to arrange, but not a
problem.





From tim_one at email.msn.com  Fri Nov 12 06:18:09 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 00:18:09 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <382AD0FE.B604876A@lemburg.com>
Message-ID: <000e01bf2ccd$4f4b0e60$fd2d153f@tim>

[MAL]
> ...
> The conversion goes as follows:
> - for single characters (and this includes all \XXX sequences
>   except \uXXXX), take the ordinal and interpret it as Unicode
>   ordinal
> - for \uXXXX sequences, insert the Unicode character with
>   ordinal 0xXXXX instead

Perfect!

[about "raw" Unicode strings]
> ...
> Not sure whether we really need to make this even more complicated...
> The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or
> filenames won't hurt much in the context of those \uXXXX monsters :-)

Alas, this won't stand over the long term.  Eventually people will write
Python using nothing but Unicode strings -- "regular strings" will
eventually become a backward compatibility headache <0.7 wink>.  IOW,
Unicode regexps and Unicode docstrings and Unicode formatting ops ...
nothing will escape.  Nor should it.

I don't think it all needs to be done at once, though -- existing languages
usually take years to graft in gimmicks to cover all the fine points.  So,
happy to let raw Unicode strings pass for now, as a relatively minor point,
but without agreeing it can be ignored forever.

> ...
> BTW, if you want to type in UTF-8 strings and have them converted
> to Unicode, you can use the standard:
>
> u = unicode('...string with UTF-8 encoded characters...','utf-8')

That's what I figured, and thanks for the confirmation.





From tim_one at email.msn.com  Fri Nov 12 06:42:32 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 00:42:32 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382AD233.BE6DE888@lemburg.com>
Message-ID: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>

[MAL]
> If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
> signal failure of this assertion at Unicode object construction time
> via an exception. That way we are within the standard, can use
> reasonably fast code for Unicode manipulation and add those extra 1M
> character at a later stage.

I think this is reasonable.

Using UTF-8 internally is also reasonable, and if it's being rejected on the
grounds of supposed slowness, that deserves a closer look (it's an ingenious
encoding scheme that works correctly with a surprising number of existing
8-bit string routines as-is).  Indexing UTF-8 strings is greatly speeded by
adding a simple finger (i.e., store along with the string an index+offset
pair identifying the most recent position indexed to -- since string
indexing is overwhelmingly sequential, this makes most indexing
constant-time; and UTF-8 can be scanned either forward or backward from a
random internal point because "the first byte" of each encoding is
recognizable as such).
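
(A sketch of such a finger, assuming 1.5-era Python and an 8-bit string
holding the UTF-8 data; the class name is made up:)

  class UTF8Finger:
      def __init__(self, data):
          self.data = data                  # UTF-8 encoded 8-bit string
          self.char = 0                     # character index of the finger
          self.byte = 0                     # byte offset of that character

      def byte_offset(self, i):
          # Move the finger to character i (0 <= i <= number of characters)
          # and return its byte offset.  Mostly-sequential access makes this
          # near constant time; lead bytes are recognizable, so walking
          # backward works too.
          data = self.data
          byte, char = self.byte, self.char
          while char < i:                   # walk forward
              byte = byte + 1
              while byte < len(data) and (ord(data[byte]) & 0xC0) == 0x80:
                  byte = byte + 1
              char = char + 1
          while char > i:                   # walk backward
              byte = byte - 1
              while (ord(data[byte]) & 0xC0) == 0x80:
                  byte = byte - 1
              char = char - 1
          self.byte, self.char = byte, char
          return byte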

I expect either would work well.  It's at least curious that Perl and Tcl
both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
people here saying UCS-2 is the obviously better choice are all from the
Microsoft camp .  It's not obvious to me, but then neither do I claim
that UTF-8 is obviously better.





From tim_one at email.msn.com  Fri Nov 12 07:02:01 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 01:02:01 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382AD479.5261B43B@lemburg.com>
Message-ID: <001001bf2cd3$6fa57820$fd2d153f@tim>

[MAL]
> Note that if we make UTF-8 the standard encoding, nearly all
> special Latin-1 characters will produce UTF-8 errors on input
> and unreadable garbage on output. That will probably be unacceptable
> in Europe. To remedy this, one would *always* have to use
> u.encode('latin-1') to get readable output for Latin-1 strings
> repesented in Unicode.

I think it's time for the Europeans to pronounce on what's acceptable in
Europe.  To the limited extent that I can pretend I'm European, I'm happy
with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea.

> I'd rather see this happen the other way around: *always* explicitly
> state the encoding you want in case you rely on it, e.g. write
>
> file.write(u.encode('utf-8'))
>
> instead of
>
> file.write(u) # let's hope this goes out as UTF-8...

By the same argument, those pesky Europeans who are relying on Latin-1
should write

file.write(u.encode('latin-1'))

instead of

file.write(u)  # let's hope this goes out as Latin-1

> Using the <default encoding> as site dependent setting is useful
> for convenience in those cases where the output format should be
> readable rather than parseable.

Well, "convenience" is always the argument advanced in favor of modes.
Conflicts and nasty intermittent bugs are always the result.  The latter
will happen under Guido's idea too, as various careless modules rebind stdin
& stdout to their own ideas of what "the proper" encoding should be.  But at
least the blame doesn't fall on the core language then <0.3 wink>.

Since there doesn't appear to be anything (either good or bad) you can do
(or avoid) by using Guido's scheme instead of magical core thread state,
there's no *need* for the latter.  That is, it can be done with a user-level
API without involving the core.





From tim_one at email.msn.com  Fri Nov 12 07:17:08 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 01:17:08 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <007a01bf2caa$aabdef60$0501a8c0@bobcat>
Message-ID: <001501bf2cd5$8c380140$fd2d153f@tim>

[Mark Hammond]
> ...
> Are you writing a module for HP, or writing a module for Python that
> HP are assisting by providing some funding?  Clear difference.  IMO,
> it must also be seen that there is a clear difference.

I can resolve this easily, but only with input from Guido.  Guido, did HP's
check clear yet?  If so, we can ignore them <wink>.





From captainrobbo at yahoo.com  Fri Nov 12 09:15:19 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Fri, 12 Nov 1999 00:15:19 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991112081519.20636.rocketmail@web603.mail.yahoo.com>

--- Gordon McMillan  wrote:
> [per-thread defaults]
> 
> C'mon guys, hasn't anyone ever played consultant
> before? The 
> idea is obviously brain-dead. OTOH, they asked for
> it 
> specifically, meaning they have some assumptions
> about how 
> they think they're going to use it. If you give them
> what they 
> ask for, you'll only have to fix it when they
> realize there are 
> other ways of doing things that don't work with
> per-thread 
> defaults. So, you find out why they think it's a
> good thing; you 
> make it easy for them to code this way (without
> actually using 
> per-thread defaults) and you don't make a fuss about
> it. More 
> than likely, they won't either.
> 

I wrote directly to ask them exactly this last night. 
Let's forget the per-thread thing until we get an
answer.

- Andy




=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From mal at lemburg.com  Fri Nov 12 10:27:29 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:27:29 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000e01bf2ccd$4f4b0e60$fd2d153f@tim>
Message-ID: <382BDD81.458D3125@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > ...
> > The conversion goes as follows:
> > - for single characters (and this includes all \XXX sequences
> >   except \uXXXX), take the ordinal and interpret it as Unicode
> >   ordinal
> > - for \uXXXX sequences, insert the Unicode character with
> >   ordinal 0xXXXX instead
> 
> Perfect!

Thanks :-)
 
> [about "raw" Unicode strings]
> > ...
> > Not sure whether we really need to make this even more complicated...
> > The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or
> > filenames won't hurt much in the context of those \uXXXX monsters :-)
> 
> Alas, this won't stand over the long term.  Eventually people will write
> Python using nothing but Unicode strings -- "regular strings" will
> eventurally become a backward compatibility headache <0.7 wink>.  IOW,
> Unicode regexps and Unicode docstrings and Unicode formatting ops ...
> nothing will escape.  Nor should it.
> 
> I don't think it all needs to be done at once, though -- existing languages
> usually take years to graft in gimmicks to cover all the fine points.  So,
> happy to let raw Unicode strings pass for now, as a relatively minor point,
> but without agreeing it can be ignored forever.

Agreed... note that you could also write your own codec for just this
reason and then use:

u = unicode('....\u1234...\...\...','raw-unicode-escaped')

Put that into a function called 'ur' and you have:

u = ur('...\u4545...\...\...')

which is not that far away from ur'...' w/r to cosmetics.

> > ...
> > BTW, if you want to type in UTF-8 strings and have them converted
> > to Unicode, you can use the standard:
> >
> > u = unicode('...string with UTF-8 encoded characters...','utf-8')
> 
> That's what I figured, and thanks for the confirmation.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 10:00:47 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:00:47 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991112081519.20636.rocketmail@web603.mail.yahoo.com>
Message-ID: <382BD73E.E6729C79@lemburg.com>

Andy Robinson wrote:
> 
> --- Gordon McMillan  wrote:
> > [per-thread defaults]
> >
> > C'mon guys, hasn't anyone ever played consultant
> > before? The
> > idea is obviously brain-dead. OTOH, they asked for
> > it
> > specifically, meaning they have some assumptions
> > about how
> > they think they're going to use it. If you give them
> > what they
> > ask for, you'll only have to fix it when they
> > realize there are
> > other ways of doing things that don't work with
> > per-thread
> > defaults. So, you find out why they think it's a
> > good thing; you
> > make it easy for them to code this way (without
> > actually using
> > per-thread defaults) and you don't make a fuss about
> > it. More
> > than likely, they won't either.
> >
> 
> I wrote directly to ask them exactly this last night.
> Let's forget the per-thread thing until we get an
> answer.

That's the way to go, Andy.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 10:44:14 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:44:14 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <007a01bf2caa$aabdef60$0501a8c0@bobcat>
Message-ID: <382BE16E.D17C80E1@lemburg.com>

Mark Hammond wrote:
> 
> > Mark Hammond wrote:
> > > Having a fixed, default encoding may make life slightly
> > more difficult
> > > when you want to work primarily in a different encoding,
> > but at least
> > > your system is predictable and reliable.
> >
> > I think the discussion on this is getting a little too hot.
> 
> Really - I see it as moving to a rational consensus that doesnt
> support the proposal in this regard.  I see no heat in it at all.  Im
> sorry if you saw my post or any of the followups as "emotional", but I
> certainly not getting passionate about this.  I dont see any of this
> as affecting me personally.  I believe that I can replace my Unicode
> implementation with this either way we go.  Just because a we are
> trying to get it right doesnt mean we are getting heated.

Naa... with "heated" I meant the "HP wants this, HP wants that" side
of things. We'll just have to wait for their answer on this one.

> > The point
> > is simply that the option of changing the per-thread default
> encoding
> > is there. You are not required to use it and if you do you are on
> > your own when something breaks.
> 
> Hrm - Im having serious trouble following your logic here.  If make
> _any_ assumptions about a default encoding, I am in danger of
> breaking.  I may not choose to change the default, but as soon as
> _anyone_ does, unrelated code may break.
> 
> I agree that I will be "on my own", but I wont necessarily have been
> the one that changed it :-(

Sure there are some very subtle dangers in setting the default
to anything other than the default ;-) For some this risk may
be worthwhile taking, for others not. In fact, in large projects
I would never take such a risk... I'm sure we can get this 
message across to them.
 
> The only answer I can see is, as you suggest, to ignore the fact that
> there is _any_ default.  Always specify the encoding.  But obviously
> this is not good enough for HP:
> 
> > Think of it as a HP specific feature... perhaps I should wrap the
> code
> > in #ifdefs and leave it undocumented.
> 
> That would work - just ensure that no standard Python has those
> #ifdefs turned on :-)  I would be sorely dissapointed if the fact that
> HP are throwing money for this means they get every whim implemented
> in the core language.  Imagine the outcry if it were instead MS'
> money, and you were attempting to put an MS spin on all this.
> 
> Are you writing a module for HP, or writing a module for Python that
> HP are assisting by providing some funding?  Clear difference.  IMO,
> it must also be seen that there is a clear difference.
> 
> Maybe Im missing something.  Can you explain why it is good enough
> everyone else to be required to assume there is no default encoding,
> but HP get their thread specific global?  Are their requirements
> greater than anyone elses?  Is everyone else not as important?  What
> would you, as a consultant, recommend to people who arent HP, but have
> a similar requirement?  It would seem obvious to me that HPs
> requirement can be met in "pure Python", thereby keeping this out of
> the core all together...

Again, all I can try is to convince them that they don't really need
settable default encodings.


Since this is the first time a Python Consortium member is
pushing development, I think we can learn a lot here. For one,
it should be clear that money doesn't buy everything, OTOH,
we cannot put the whole thing at risk just because
of some minor disagreement that cannot be solved between the
parties. The standard solution for the latter should be a
customized Python interpreter.


-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 10:04:31 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:04:31 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <001001bf2cd3$6fa57820$fd2d153f@tim>
Message-ID: <382BD81F.B2BC896A@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > Note that if we make UTF-8 the standard encoding, nearly all
> > special Latin-1 characters will produce UTF-8 errors on input
> > and unreadable garbage on output. That will probably be unacceptable
> > in Europe. To remedy this, one would *always* have to use
> > u.encode('latin-1') to get readable output for Latin-1 strings
> > repesented in Unicode.
> 
> I think it's time for the Europeans to pronounce on what's acceptable in
> Europe.  To the limited extent that I can pretend I'm Eurpoean, I'm happy
> with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea.

Agreed.
 
> > I'd rather see this happen the other way around: *always* explicitly
> > state the encoding you want in case you rely on it, e.g. write
> >
> > file.write(u.encode('utf-8'))
> >
> > instead of
> >
> > file.write(u) # let's hope this goes out as UTF-8...
> 
> By the same argument, those pesky Europeans who are relying on Latin-1
> should write
> 
> file.write(u.encode('latin-1'))
> 
> instead of
> 
> file.write(u)  # let's hope this goes out as Latin-1

Right.
 
> > Using the  as site dependent setting is useful
> > for convenience in those cases where the output format should be
> > readable rather than parseable.
> 
> Well, "convenience" is always the argument advanced in favor of modes.
> Conflicts and nasty intermittent bugs are always the result.  The latter
> will happen under Guido's idea too, as various careless modules rebind stdin
> & stdout to their own ideas of what "the proper" encoding should be.  But at
> least the blame doesn't fall on the core language then <0.3 wink>.
> 
> Since there doesn't appear to be anything (either or good or bad) you can do
> (or avoid) by using Guido's scheme instead of magical core thread state,
> there's no *need* for the latter.  That is, it can be done with a user-level
> API without involving the core.

Ditto :-)

I have nothing against telling people to take care about the problem
in user space (meaning: not done by the core interpreter) and I'm
pretty sure that HP will agree on this too, provided we give them
the proper user space tools like file wrappers et al.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 10:16:57 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:16:57 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
Message-ID: <382BDB09.55583F28@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
> > signal failure of this assertion at Unicode object construction time
> > via an exception. That way we are within the standard, can use
> > reasonably fast code for Unicode manipulation and add those extra 1M
> > character at a later stage.
> 
> I think this is reasonable.
> 
> Using UTF-8 internally is also reasonable, and if it's being rejected on the
> grounds of supposed slowness, that deserves a closer look (it's an ingenious
> encoding scheme that works correctly with a surprising number of existing
> 8-bit string routines as-is).  Indexing UTF-8 strings is greatly speeded by
> adding a simple finger (i.e., store along with the string an index+offset
> pair identifying the most recent position indexed to -- since string
> indexing is overwhelmingly sequential, this makes most indexing
> constant-time; and UTF-8 can be scanned either forward or backward from a
> random internal point because "the first byte" of each encoding is
> recognizable as such).

Here are some arguments for using the proposed UTF-16 strategy instead:

- all characters have the same length; indexing is fast
- conversion APIs to the platform dependent wchar_t implementation are fast
  because they can either simply copy the content or extend the 2 bytes
  to 4 bytes
- UTF-8 needs 2 bytes for all the compound Latin-1 characters (e.g. u
  with two dots) which are used in many non-English languages
- from the Unicode Consortium FAQ: "Most Unicode APIs are using UTF-16."

Besides, the Unicode object will have a buffer containing the
<default encoding> representation of the object, which, if all goes
well, will always hold the UTF-8 value. RE engines etc. can then directly
work with this buffer.
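
(Rough illustration of the size point, assuming the proposed unicode()
constructor and .encode() method -- Latin-1 ordinal 0xFC is u-umlaut:)

  u = unicode('M\374ller', 'latin-1')  # "Mueller" with u-umlaut: 6 characters
  len(u)                               # 6 -- six fixed-size units internally
  len(u.encode('utf-8'))               # 7 -- the umlaut alone takes two bytes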
 
> I expect either would work well.  It's at least curious that Perl and Tcl
> both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
> people here saying UCS-2 is the obviously better choice are all from the
> Microsoft camp .  It's not obvious to me, but then neither do I claim
> that UTF-8 is obviously better.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From gstein at lyra.org  Fri Nov 12 11:20:16 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 02:20:16 -0800 (PST)
Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit)
In-Reply-To: <382BE16E.D17C80E1@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> 
> Since this is the first time a Python Consortium member is
> pushing development, I think we can learn a lot here. For one,
> it should be clear that money doesn't buy everything, OTOH,
> we cannot put the whole thing at risk just because
> of some minor disagreement that cannot be solved between the
> parties. The standard solution for the latter should be a
> customized Python interpreter.
> 

hehe... funny you mention this. Go read the Consortium docs. Last time
that I read them, there are no "parties" to reach consensus. *Every*
technical decision regarding the Python language falls to the Technical
Director (Guido, of course). I looked. I found nothing that can override
the T.D.'s decisions and no way to force a particular decision.

Guido is still the Benevolent Dictator :-)

Cheers,
-g

p.s. yes, there is always the caveat that "sure, Guido has final say" but
"Al can fire him at will for being too stubborn" :-) ... but hey, Guido's
title does have the word Benevolent in it, so things are cool...

--
Greg Stein, http://www.lyra.org/





From gstein at lyra.org  Fri Nov 12 11:24:56 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 02:24:56 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382BE16E.D17C80E1@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> Sure there are some very subtile dangers in setting the default
> to anything other than the default ;-) For some this risk may
> be worthwhile taking, for others not. In fact, in large projects
> I would never take such a risk... I'm sure we can get this 
> message across to them.

It's a lot easier to just never provide the rope (per-thread default
encodings) in the first place.

If the feature exists, then it will be used. Period. Try to get the
message across until you're blue in the face, but it would be used.

Anyhow... discussion is pretty moot until somebody can state that it
is/isn't a "real requirement" and/or until The Guido takes a position.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Fri Nov 12 11:30:04 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 02:30:04 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
Message-ID: 

On Fri, 12 Nov 1999, Tim Peters wrote:
>...
> Using UTF-8 internally is also reasonable, and if it's being rejected on the
> grounds of supposed slowness

No... my main point was interaction with the underlying OS. I made a SWAG
(Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower
for various types of operations. As always, your infernal meddling has
dashed that hypothesis, so I must retreat...

>...
> I expect either would work well.  It's at least curious that Perl and Tcl
> both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
> people here saying UCS-2 is the obviously better choice are all from the
> Microsoft camp .  It's not obvious to me, but then neither do I claim
> that UTF-8 is obviously better.

Probably for the exact reason that you stated in your messages: many 8-bit
(7-bit?) functions continue to work quite well when given a UTF-8-encoded
string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter
to deal with a new string type.

I'd guess it is a helluva lot easier for us to add a Python Type than for
Perl or TCL to whack around with new string types (since they use strings
so heavily).

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Fri Nov 12 11:30:28 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 11:30:28 +0100
Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization 
 Toolkit)
References: 
Message-ID: <382BEC44.A2541C7E@lemburg.com>

Greg Stein wrote:
> 
> On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> > 
> > Since this is the first time a Python Consortium member is
> > pushing development, I think we can learn a lot here. For one,
> > it should be clear that money doesn't buy everything, OTOH,
> > we cannot put the whole thing at risk just because
> > of some minor disagreement that cannot be solved between the
> > parties. The standard solution for the latter should be a
> > customized Python interpreter.
> > 
> 
> hehe... funny you mention this. Go read the Consortium docs. Last time
> that I read them, there are no "parties" to reach consensus. *Every*
> technical decision regarding the Python language falls to the Technical
> Director (Guido, of course). I looked. I found nothing that can override
> the T.D.'s decisions and no way to force a particular decision.
> 
> Guido is still the Benevolent Dictator :-)

Sure, but have you considered the option of a member simply bailing
out ? HP could always stop funding Unicode integration. That wouldn't
help us either...
 
> Cheers,
> -g
> 
> p.s. yes, there is always the caveat that "sure, Guido has final say" but
> "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's
> title does have the word Benevolent in it, so things are cool...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From gstein at lyra.org  Fri Nov 12 11:39:45 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 02:39:45 -0800 (PST)
Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization
  Toolkit)
In-Reply-To: <382BEC44.A2541C7E@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
>...
> Sure, but have you considered the option of a member simply bailing
> out ? HP could always stop funding Unicode integration. That wouldn't
> help us either...

I'm not that dumb... come on. That was my whole point about "Benevolent"
below... Guido is a fair and reasonable Dictator... he wouldn't let that
happen.

>...
> > p.s. yes, there is always the caveat that "sure, Guido has final say" but
> > "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's
> > title does have the word Benevolent in it, so things are cool...


Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From Mike.Da.Silva at uk.fid-intl.com  Fri Nov 12 12:00:49 1999
From: Mike.Da.Silva at uk.fid-intl.com (Da Silva, Mike)
Date: Fri, 12 Nov 1999 11:00:49 -0000
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: 

Most of the ASCII string functions do indeed work for UTF-8.  I have made
extensive use of this feature when writing translation logic to harmonize
ASCII text (an SQL statement) with substitution parameters that must be
converted from IBM EBCDIC code pages (5035, 1027) into UTF8.  Since UTF-8 is
a superset of ASCII, this all works fine.

Some of the character classification functions etc can be flaky when used
with UTF8 characters outside the ASCII range, but simple string operations
work fine.

As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an
internal string representation are:

1.	UTF-8 allows all characters to be displayed (in some form or other)
on the users machine, with or without native fonts installed.  Naturally
anything outside the ASCII range will be garbage, but it is an immense
debugging aid when working with character encodings to be able to touch and
feel something recognizable.  Trying to decode a block of raw UTF-16 is a
pain.
2.	UTF-8 works with most existing string manipulation libraries quite
happily.  It is also portable (a char is always 8 bits, regardless of
platform; wchar_t varies between 16 and 32 bits depending on the underlying
operating system, although unsigned short does seem to work across
platforms, in my experience).
3.	UTF-16 has some advantages in providing fixed width characters and
(ignoring surrogate pairs etc.) a modeless encoding space.  This is an
advantage for fast string operations, especially on CPUs that have
efficient operations for handling 16-bit data.
4.	UTF-16 would directly support a tightly coupled character properties
engine, which would enable Unicode compliant case folding and character
decomposition to be performed without an intermediate UTF-8 <----> UTF-16
translation step.
5.	UTF-16 requires string operations that do not make assumptions about
nulls - this means re-implementing most of the C runtime functions to work
with unsigned shorts.

Regards,
Mike da Silva
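
A quick sketch of points 2 and 5 above, using a later Python that ships a
built-in Unicode string type (illustration only, nothing here is 1.5.2 code):

    text = "Abc\u00e9"                # four characters, one outside ASCII

    utf8  = text.encode("utf-8")      # b'Abc\xc3\xa9' -- the ASCII bytes keep
                                      # their meaning, so byte-oriented code
                                      # keeps working
    utf16 = text.encode("utf-16-be")  # b'\x00A\x00b\x00c\x00\xe9'

    assert b"\x00" not in utf8            # safe for NUL-terminated C strings
    assert utf16.count(b"\x00") == 4      # every Latin-1 char embeds a NUL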

	-----Original Message-----
	From:	Greg Stein [SMTP:gstein at lyra.org]
	Sent:	12 November 1999 10:30
	To:	Tim Peters
	Cc:	python-dev at python.org
	Subject:	RE: [Python-Dev] Internationalization Toolkit

	On Fri, 12 Nov 1999, Tim Peters wrote:
	>...
	> Using UTF-8 internally is also reasonable, and if it's being rejected on the
	> grounds of supposed slowness

	No... my main point was interaction with the underlying OS. I made a SWAG
	(Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower
	for various types of operations. As always, your infernal meddling has
	dashed that hypothesis, so I must retreat...

	>...
	> I expect either would work well.  It's at least curious that Perl and Tcl
	> both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
	> people here saying UCS-2 is the obviously better choice are all from the
	> Microsoft camp .  It's not obvious to me, but then neither do I claim
	> that UTF-8 is obviously better.

	Probably for the exact reason that you stated in your messages: many 8-bit
	(7-bit?) functions continue to work quite well when given a UTF-8-encoded
	string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter
	to deal with a new string type.

	I'd guess it is a helluva lot easier for us to add a Python Type than for
	Perl or TCL to whack around with new string types (since they use strings
	so heavily).

	Cheers,
	-g

	--
	Greg Stein, http://www.lyra.org/





From fredrik at pythonware.com  Fri Nov 12 12:23:24 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 12:23:24 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com>
Message-ID: <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>

> Besides, the Unicode object will have a buffer containing the
>  representation of the object, which, if all goes
> well, will always hold the UTF-8 value.



over my dead body, that one...

(fwiw, over the last 20 years, I've implemented about a
dozen image processing libraries, supporting loads of
pixel layouts and file formats.  one important lesson
from that is to stick to a single internal representation,
and let the application programmers build their own
layers if they need to speed things up -- yes, they're
actually happier that way.  and text strings are not
that different from pixel buffers or sound streams or
scientific data sets, after all...)

(and sticks and modes will break your bones, but you
know that...)

> RE engines etc. can then directly work with this buffer.

sidebar: the RE engine that's being developed for this
project can handle 8-bit, 16-bit, and (optionally) 32-bit
text buffers. a single compiled expression can be used
with any character size, and performance is about the
same for all sizes (at least on any decent cpu).

> > I expect either would work well.  It's at least curious that Perl and Tcl
> > both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
> > people here saying UCS-2 is the obviously better choice are all from the
> > Microsoft camp .

(hey, I'm not a microsofter.  but I've been writing "i/o
libraries" for various "object types" all my life, so I do
have strong preferences on what works, and what
doesn't...  I use Python for good reasons, you know ;-)



thanks.  I feel better now.






From fredrik at pythonware.com  Fri Nov 12 12:23:38 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 12:23:38 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <027f01bf2d00$648745e0$f29b12c2@secret.pythonware.com>

> 5. UTF-16 requires string operations that do not make assumptions about
> nulls - this means re-implementing most of the C runtime functions to work
> with unsigned shorts.

footnote: the mad scientist has been there
and done that:

http://www.pythonware.com/madscientist/

(and you can replace "unsigned short" with
"whatever's suitable on this platform")






From fredrik at pythonware.com  Fri Nov 12 12:36:03 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 12:36:03 +0100
Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit)
References: 
Message-ID: <02a701bf2d02$20c66280$f29b12c2@secret.pythonware.com>

> Guido is a fair and reasonable Dictator... he wouldn't let that
> happen.

...but where is he when we need him? ;-)






From Mike.Da.Silva at uk.fid-intl.com  Fri Nov 12 12:43:21 1999
From: Mike.Da.Silva at uk.fid-intl.com (Da Silva, Mike)
Date: Fri, 12 Nov 1999 11:43:21 -0000
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: 

Fredrik Lundh wrote:

> 5. UTF-16 requires string operations that do not make assumptions about
> nulls - this means re-implementing most of the C runtime functions to work
> with unsigned shorts.

footnote: the mad scientist has been there and done that:
http://www.pythonware.com/madscientist/
 
(and you can replace "unsigned short" with "whatever's suitable on this
platform")

Surely using a different type on different platforms means that we throw
away the concept of a platform independent Unicode string?
I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits.
Does this mean that to transfer a file between a Windows box and Solaris, an
implicit conversion has to be done to go from 16 bits to 32 bits (and vice
versa)?  What about byte ordering issues?
Or do you mean whatever 16 bit data type is available on the platform, with
a standard (platform independent) byte ordering maintained?
Mike da S



From fredrik at pythonware.com  Fri Nov 12 13:16:24 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 13:16:24 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>

Mike wrote:
> Surely using a different type on different platforms means that we throw
> away the concept of a platform independent Unicode string?
> I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits.

so?  the interchange format doesn't have to be
the same as the internal format, does it?

> Does this mean that to transfer a file between a Windows box and Solaris, an
> implicit conversion has to be done to go from 16 bits to 32 bits (and vice
> versa)?  What about byte ordering issues?

no problem at all: unicode has special byte order
marks for this purpose (and utf-8 doesn't care, of
course).
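
a byte-level sketch, in a later Python, of what that means (illustration
only -- the codec names are today's spellings, not anything in the proposal):

    s = "A"
    s.encode("utf-16-le")   # b'A\x00'  -- little-endian code units
    s.encode("utf-16-be")   # b'\x00A'  -- big-endian code units
    s.encode("utf-16")      # native order with a BOM prepended, e.g.
                            # b'\xff\xfeA\x00' on a little-endian box
    s.encode("utf-8")       # b'A'      -- byte order simply doesn't apply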

> Or do you mean whatever 16 bit data type is available on the platform, with
> a standard (platform independent) byte ordering maintained?

well, my preference is a 16-bit data type in the plat-
form's native byte order (exactly how it's done in the
unicode module -- for the moment, it can use the
platform's wchar_t, but only if it happens to be a
16-bit unsigned type).  gives you good performance,
compact storage, and cleanest possible code.

...

anyway, I think it would help the discussion a little bit
if people looked at (and played with) the existing code
base.  at least that'll change arguments like "but then
we have to implement that" to "but then we have to
maintain that code" ;-)






From captainrobbo at yahoo.com  Fri Nov 12 13:13:03 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Fri, 12 Nov 1999 04:13:03 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991112121303.27452.rocketmail@ web605.yahoomail.com>

--- "Da Silva, Mike" 
wrote:
> As I see it, the relative pros and cons of UTF-8
> versus UTF-16 for use as an
> internal string representation are:
> [snip]
> Regards,
> Mike da Silva
> 

Note that by going with UTF16, we get both.  We will
certainly have a codec for utf8, just as we will for
ISO-Latin-1, Shift-JIS or whatever.  And a perfectly
ordinary Python string is a great place to hold UTF8;
you can look at it and use most of the ordinary string
algorithms on it.  

I presume no one is actually advocating dropping
ordinary Python strings, or the ability to do
   rawdata = open('myfile.txt', 'rb').read()
without any transformations?


- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From mhammond at skippinet.com.au  Fri Nov 12 13:27:19 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 12 Nov 1999 23:27:19 +1100
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
Message-ID: <007e01bf2d09$44738440$0501a8c0@bobcat>

/F writes
> anyway, I think it would help the discussion a little bit
> if people looked at (and played with) the existing code
> base.  at least that'll change arguments like "but then
> we have to implement that" to "but then we have to
> maintain that code" ;-)

I second that.  It is good enough for me (although my requirements
aren't stringent) - it's been used on CE, so would slot directly into
the win32 stuff.  It is pretty much the consensus of the string-sig of
last year, but as code!

The only "problem" with it is the code that hasn't been written yet,
specifically:
* Encoders as streams, and a concrete proposal for them.
* Decent PyArg_ParseTuple support and Py_BuildValue support.
* The ord(), chr() stuff, and other stuff around the edges no doubt.

Couldn't we start with Fredrik's implementation, and see how the rest
turns out?  Even if we do choose to change the underlying Unicode
implementation to use a different native encoding, the interface to
the PyUnicode_Type would remain pretty similar.  The advantage is that
we have something now to start working with for the rest of the
support we need.

Mark.




From mal at lemburg.com  Fri Nov 12 13:38:44 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 13:38:44 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.4
Message-ID: <382C0A54.E6E8328D@lemburg.com>

I've uploaded a new version of the proposal which incorporates
a lot of what has been discussed on the list.

Thanks to everybody who helped so far. Note that I have extended
the list of references for those who want to join in, but are
in need of more background information.

The latest version of the proposal is available at:

	http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

	http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

    ? support for line breaks (see
      http://www.unicode.org/unicode/reports/tr13/ )

    ? support for case conversion: 

      Problems: string lengths can change due to multiple
      characters being mapped to a single new one, capital letters
      starting a word can be different than ones occurring in the
      middle, there are locale dependent deviations from the standard
      mappings.

    ? support for numbers, digits, whitespace, etc.

    ? support (or no support) for private code point areas

    ? should Unicode objects support %-formatting ?

    One possibility would be to emulate this via strings and the
    <default encoding>:

    s = '%s %i abc???' # a Latin-1 encoded string
    t = (u,3)          # u is some Unicode object

    # Convert Latin-1 s to a <default encoding> string
    s1 = unicode(s,'latin-1').encode()

    # The '%s' will now add u in the <default encoding>
    s2 = s1 % t

    # Finally, convert the <default encoding> string back to Unicode
    u1 = unicode(s2)

    ? specifying file wrappers:

    Open issues: what to do with Python strings
    fed to the .write() method (may need to know the encoding of the
    strings) and when/if to return Python strings through the .read()
    method.

    Perhaps we need more than one type of wrapper here.
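
    As a strawman for the wrapper discussion, something along these
    lines (sketched in a later Python; the class name and behaviour are
    made up here, they are not part of the proposal):

        class EncodingWriter:
            # minimal sketch: encode Unicode on its way to a byte stream
            def __init__(self, stream, encoding="utf-8"):
                self.stream = stream
                self.encoding = encoding
            def write(self, obj):
                if isinstance(obj, bytes):   # already encoded: pass through
                    self.stream.write(obj)
                else:                        # Unicode: encode first
                    self.stream.write(obj.encode(self.encoding))

        # usage sketch:
        # f = EncodingWriter(open("out.txt", "wb"), "utf-8")
        # f.write("some text\n")

    A second wrapper could do the reverse on .read(); whether it should
    return Python strings or Unicode objects is exactly the open issue
    above.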

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 14:11:26 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 14:11:26 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
Message-ID: <382C11FE.D7D9F916@lemburg.com>

Fredrik Lundh wrote:
> 
> > Besides, the Unicode object will have a buffer containing the
> >  representation of the object, which, if all goes
> > well, will always hold the UTF-8 value.
> 
> 
> 
> over my dead body, that one...

Such a buffer is needed to implement "s" and "s#" argument
parsing. It's a simple requirement to support those two
parsing markers -- there's not much to argue about, really...
unless, of course, you want to give up Unicode object support
for all APIs using these parsers.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Fri Nov 12 14:01:28 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 14:01:28 +0100
Subject: [Python-Dev] Internationalization Toolkit
References:  <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
Message-ID: <382C0FA8.ACB6CCD6@lemburg.com>

Fredrik Lundh wrote:
> 
> Mike wrote:
> > Surely using a different type on different platforms means that we throw
> > away the concept of a platform independent Unicode string?
> > I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits.
> 
> so?  the interchange format doesn't have to be
> the same as the internal format, does it?

The interchange format (marshal + pickle) is defined as UTF-8,
so there's no problem with endianness or missing bits w/r to
shipping Unicode data from one platform to another.
 
> > Does this mean that to transfer a file between a Windows box and Solaris, an
> > implicit conversion has to be done to go from 16 bits to 32 bits (and vice
> > versa)?  What about byte ordering issues?
> 
> no problem at all: unicode has special byte order
> marks for this purpose (and utf-8 doesn't care, of
> course).

Access to this mark will go into sys: sys.bom.
 
> > Or do you mean whatever 16 bit data type is available on the platform, with
> > a standard (platform independent) byte ordering maintained?
> 
> well, my preference is a 16-bit data type in the plat-
> form's native byte order (exactly how it's done in the
> unicode module -- for the moment, it can use the
> platform's wchar_t, but only if it happens to be a
> 16-bit unsigned type).  gives you good performance,
> compact storage, and cleanest possible code.

The 0.4 proposal fixes this to 16-bit unsigned short
using UTF-16 encoding with checks for surrogates. This covers
all defined standard Unicode character points, is fast, etc. pp...
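
To make the surrogate checks concrete (sketch in a later Python, purely
for illustration): a character outside the 16-bit range occupies *two*
UTF-16 code units, so code walking unsigned shorts has to recognize the
surrogate ranges:

    ch = "\U00010000"                  # first character beyond the BMP
    units = ch.encode("utf-16-be")     # b'\xd8\x00\xdc\x00' -- two units
    hi = int.from_bytes(units[:2], "big")
    lo = int.from_bytes(units[2:], "big")
    assert 0xD800 <= hi <= 0xDBFF      # high (leading) surrogate
    assert 0xDC00 <= lo <= 0xDFFF      # low (trailing) surrogate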

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 12:15:15 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 12:15:15 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <382BF6C3.D79840EC@lemburg.com>

"Da Silva, Mike" wrote:
> 
> Most of the ASCII string functions do indeed work for UTF-8.  I have made
> extensive use of this feature when writing translation logic to harmonize
> ASCII text (an SQL statement) with substitution parameters that must be
> converted from IBM EBCDIC code pages (5035, 1027) into UTF8.  Since UTF-8 is
> a superset of ASCII, this all works fine.
> 
> Some of the character classification functions etc can be flaky when used
> with UTF8 characters outside the ASCII range, but simple string operations
> work fine.

That's why there's the <default encoding> buffer which holds the UTF-8
encoded value...
 
> As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an
> internal string representation are:
> 
> 1.      UTF-8 allows all characters to be displayed (in some form or other)
> on the users machine, with or without native fonts installed.  Naturally
> anything outside the ASCII range will be garbage, but it is an immense
> debugging aid when working with character encodings to be able to touch and
> feel something recognizable.  Trying to decode a block of raw UTF-16 is a
> pain.

True.

> 2.      UTF-8 works with most existing string manipulation libraries quite
> happily.  It is also portable (a char is always 8 bits, regardless of
> platform; wchar_t varies between 16 and 32 bits depending on the underlying
> operating system (although unsigned short does seems to work across
> platforms, in my experience).

You mean with the compiler applying the needed 16->32 bit extension ?

> 3.      UTF-16 has some advantages in providing fixed width characters and,
> (ignoring surrogate pairs etc) a modeless encoding space.  This is an
> advantage for fast string operations, especially on CPU's that have
> efficient operations for handling 16bit data.

Right, and this is a major argument for using 16-bit encodings without
state internally.

> 4.      UTF-16 would directly support a tightly coupled character properties
> engine, which would enable Unicode compliant case folding and character
> decomposition to be performed without an intermediate UTF-8 <----> UTF-16
> translation step.

Could you elaborate on this one ? It is one of the open issues
in the proposal.

> 5.      UTF-16 requires string operations that do not make assumptions about
> nulls - this means re-implementing most of the C runtime functions to work
> with unsigned shorts.

AFAIK, the RE engines in Python are 8-bit clean...

BTW, wouldn't it be possible to take pcre and have it
use Py_Unicode instead of char ? [Of course, there would have to
be some extensions for character classes etc.]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fredrik at pythonware.com  Fri Nov 12 14:43:12 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 14:43:12 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com>
Message-ID: <005201bf2d13$ddd75ad0$f29b12c2@secret.pythonware.com>

> > > Besides, the Unicode object will have a buffer containing the
> > >  representation of the object, which, if all goes
> > > well, will always hold the UTF-8 value.
> > 
> > 
> > 
> > over my dead body, that one...
> 
> Such a buffer is needed to implement "s" and "s#" argument
> parsing. It's a simple requirement to support those two
> parsing markers -- there's not much to argue about, really...

why?  I don't understand why "s" and "s#" has
to deal with encoding issues at all...

> unless, of course, you want to give up Unicode object support
> for all APIs using these parsers.

hmm.  maybe that's exactly what I want...






From fdrake at acm.org  Fri Nov 12 15:34:56 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 12 Nov 1999 09:34:56 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: <382C11FE.D7D9F916@lemburg.com>
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
	<382BDB09.55583F28@lemburg.com>
	<027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
	<382C11FE.D7D9F916@lemburg.com>
Message-ID: <14380.9616.245419.138261@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > Such a buffer is needed to implement "s" and "s#" argument
 > parsing. It's a simple requirement to support those two
 > parsing markers -- there's not much to argue about, really...
 > unless, of course, you want to give up Unicode object support
 > for all APIs using these parsers.

  Perhaps I missed the agreement that these should always receive
UTF-8 from Unicode strings.  Was this agreed upon, or has it simply
not been argued over in favor of other topics?
  If this has indeed been agreed upon... at least it can be computed
on demand rather than at initialization!  Perhaps there should be two
pointers: one to the UTF-8 buffer and one to a PyObject; if the
PyObject is there it's an "old-style" string that's actually providing
the buffer.  This may or may not be a good idea; there's a lot of
memory expense for long Unicode strings converted from UTF-8 that
aren't ever converted back to UTF-8 or accessed using "s" or "s#".
Ok, I've talked myself out of that.  ;-)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From fdrake at acm.org  Fri Nov 12 15:57:15 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 12 Nov 1999 09:57:15 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382C0FA8.ACB6CCD6@lemburg.com>
References: 
	<02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
	<382C0FA8.ACB6CCD6@lemburg.com>
Message-ID: <14380.10955.420102.327867@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > Access to this mark will go into sys: sys.bom.

  Can the name in sys be a little more descriptive?
sys.byte_order_mark would be reasonable.
  I think that a support module (possibly unicodec) should provide
constants for all four byte order marks as strings (2- & 4-byte,
little- and big-endian).  Names could be short BOM_2_LE, BOM_4_LE,
etc.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From fredrik at pythonware.com  Fri Nov 12 16:00:45 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 16:00:45 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim><382BDB09.55583F28@lemburg.com><027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com><382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us>
Message-ID: <009101bf2d1f$21f5b490$f29b12c2@secret.pythonware.com>

Fred L. Drake, Jr.  wrote:
> M.-A. Lemburg writes:
>  > Such a buffer is needed to implement "s" and "s#" argument
>  > parsing. It's a simple requirement to support those two
>  > parsing markers -- there's not much to argue about, really...
>  > unless, of course, you want to give up Unicode object support
>  > for all APIs using these parsers.
>
>   Perhaps I missed the agreement that these should always receive
> UTF-8 from Unicode strings.

from unicode import *

def getname():
    # hidden in some database engine, or so...
    return unicode("Linköping", "iso-8859-1")

...

name = getname()

# emulate automatic conversion to utf-8
name = str(name)

# print it in uppercase, in the usual way
import string
print string.upper(name)

## LINKÃ¶PING

I don't know, but I think that I think that it
perhaps should raise an exception instead...
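
roughly what the two behaviours look like side by side (sketched in a
later Python, illustration only):

    name = "Link\u00f6ping"

    # silent utf-8 coercion: byte-wise upper() mangles the non-ASCII char
    name.encode("utf-8").upper()       # b'LINK\xc3\xb6PING'

    # strict alternative: refuse the conversion and raise
    # (the ascii codec is used here just to force the error)
    try:
        name.encode("ascii")
    except UnicodeEncodeError:
        print("conversion refused")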






From mal at lemburg.com  Fri Nov 12 16:17:43 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 16:17:43 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <005201bf2d13$ddd75ad0$f29b12c2@secret.pythonware.com>
Message-ID: <382C2F97.8E7D7A4D@lemburg.com>

Fredrik Lundh wrote:
> 
> > > > Besides, the Unicode object will have a buffer containing the
> > > >  representation of the object, which, if all goes
> > > > well, will always hold the UTF-8 value.
> > >
> > > 
> > >
> > > over my dead body, that one...
> >
> > Such a buffer is needed to implement "s" and "s#" argument
> > parsing. It's a simple requirement to support those two
> > parsing markers -- there's not much to argue about, really...
> 
> why?  I don't understand why "s" and "s#" has
> to deal with encoding issues at all...
> 
> > unless, of course, you want to give up Unicode object support
> > for all APIs using these parsers.
> 
> hmm.  maybe that's exactly what I want...

If we don't add that support, lots of existing APIs won't
accept Unicode objects instead of strings. While it could be
argued that automatic conversion to UTF-8 is not transparent
enough for the user, the other solution of using str(u)
everywhere would probably make writing Unicode-aware code a
rather clumsy task and introduce other pitfalls, since str(obj)
calls PyObject_Str() which also works on integers, floats,
etc.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 16:50:33 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 16:50:33 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
		<02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
		<382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us>
Message-ID: <382C3749.198EEBC6@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > Access to this mark will go into sys: sys.bom.
> 
>   Can the name in sys be a little more descriptive?
> sys.byte_order_mark would be reasonable.

The abbreviation BOM is quite common w/r to Unicode.

>   I think that a support module (possibly unicodec) should provide
> constants for all four byte order marks as strings (2- & 4-byte,
> little- and big-endian).  Names could be short BOM_2_LE, BOM_4_LE,
> etc.

Good idea...

sys.bom should return the byte order mark (BOM) for the format used
internally. The unicodec module should provide symbols for all
possible values of this variable:

  BOM_BE: '\376\377' 
    (corresponds to Unicode 0x0000FEFF in UTF-16 
     == ZERO WIDTH NO-BREAK SPACE)

  BOM_LE: '\377\376' 
    (corresponds to Unicode 0x0000FFFE in UTF-16 
     == illegal Unicode character)

  BOM4_BE: '\000\000\376\377'
    (corresponds to Unicode 0x0000FEFF in UCS-4)

  BOM4_LE: '\377\376\000\000'
    (the same mark in swapped, little endian, byte order)

Note that Unicode sees big endian byte order as being "correct". The
swapped order is taken to be an indicator for a "wrong" format, hence
the illegal character definition.
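
A quick cross-check of those byte values (sketch in a later Python; the
codec names are today's, used only to verify the constants):

    assert "\ufeff".encode("utf-16-be") == b"\xfe\xff"          # BOM_BE
    assert "\ufeff".encode("utf-16-le") == b"\xff\xfe"          # BOM_LE
    assert "\ufeff".encode("utf-32-be") == b"\x00\x00\xfe\xff"  # BOM4_BE
    assert "\ufeff".encode("utf-32-le") == b"\xff\xfe\x00\x00"  # BOM4_LE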

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Fri Nov 12 16:24:33 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 16:24:33 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
		<382BDB09.55583F28@lemburg.com>
		<027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
		<382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us>
Message-ID: <382C3131.A8965CA5@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > Such a buffer is needed to implement "s" and "s#" argument
>  > parsing. It's a simple requirement to support those two
>  > parsing markers -- there's not much to argue about, really...
>  > unless, of course, you want to give up Unicode object support
>  > for all APIs using these parsers.
> 
>   Perhaps I missed the agreement that these should always receive
> UTF-8 from Unicode strings.  Was this agreed upon, or has it simply
> not been argued over in favor of other topics?

It's been in the proposal since version 0.1. The idea is to
provide a decent way of making existing scripts Unicode aware.

>   If this has indeed been agreed upon... at least it can be computed
> on demand rather than at initialization!

This is what I intended to implement. The <default encoding> buffer
will be filled upon the first request for the UTF-8 encoding.
"s" and "s#" are examples of such requests. The buffer will
remain intact until the object is destroyed (since other code
could store the pointer received via e.g. "s").

> Perhaps there should be two
> pointers: one to the UTF-8 buffer and one to a PyObject; if the
> PyObject is there it's a "old-style" string that's actually providing
> the buffer.  This may or may not be a good idea; there's a lot of
> memory expense for long Unicode strings converted from UTF-8 that
> aren't ever converted back to UTF-8 or accessed using "s" or "s#".
> Ok, I've talked myself out of that.  ;-)

Note that Unicode objects are a completely different beast ;-)
String objects are not touched in any way by the proposal.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fdrake at acm.org  Fri Nov 12 17:22:24 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 12 Nov 1999 11:22:24 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382C3749.198EEBC6@lemburg.com>
References: 
	<02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
	<382C0FA8.ACB6CCD6@lemburg.com>
	<14380.10955.420102.327867@weyr.cnri.reston.va.us>
	<382C3749.198EEBC6@lemburg.com>
Message-ID: <14380.16064.723277.586881@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > The abbreviation BOM is quite common w/r to Unicode.

  Yes: "w/r to Unicode".  In sys, it's out of context and should
receive a more descriptive name.  I think using BOM in unicodec is
good.

 >   BOM_BE: '\376\377' 
 >     (corresponds to Unicode 0x0000FEFF in UTF-16 
 >      == ZERO WIDTH NO-BREAK SPACE)

  I'd also add BOM to be the same as sys.byte_order_mark.  Perhaps
even instead of sys.byte_order_mark (just to localize the areas of
code that are affected).

 > Note that Unicode sees big endian byte order as being "correct". The

  A lot of us do.  ;-)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From fdrake at acm.org  Fri Nov 12 17:28:37 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 12 Nov 1999 11:28:37 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: <382C3131.A8965CA5@lemburg.com>
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
	<382BDB09.55583F28@lemburg.com>
	<027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
	<382C11FE.D7D9F916@lemburg.com>
	<14380.9616.245419.138261@weyr.cnri.reston.va.us>
	<382C3131.A8965CA5@lemburg.com>
Message-ID: <14380.16437.71847.832880@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > It's been in the proposal since version 0.1. The idea is to
 > provide a decent way of making existing script Unicode aware.

  Ok, so I haven't read closely enough.

 > This is what I intended to implement. The  buffer
 > will be filled upon the first request to the UTF-8 encoding.
 > "s" and "s#" are examples of such requests. The buffer will
 > remain intact until the object is destroyed (since other code
 > could store the pointer received via e.g. "s").

  Right.

 > Note that Unicode object are completely different beast ;-)
 > String object are not touched in any way by the proposal.

  I wasn't suggesting the PyStringObject be changed, only that the
PyUnicodeObject could maintain a reference.  Consider:

        s = fp.read()
        u = unicode(s, 'utf-8')

u would now hold a reference to s, and s/s# would return a pointer
into s instead of re-building the UTF-8 form.  I talked myself out of
this because it would be too easy to keep a lot more string objects
around than were actually needed.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From jack at oratrix.nl  Fri Nov 12 17:33:46 1999
From: jack at oratrix.nl (Jack Jansen)
Date: Fri, 12 Nov 1999 17:33:46 +0100
Subject: [Python-Dev] just say no... 
In-Reply-To: Message by "M.-A. Lemburg"  ,
	     Fri, 12 Nov 1999 16:24:33 +0100 , <382C3131.A8965CA5@lemburg.com> 
Message-ID: <19991112163347.5527635BB1E@snelboot.oratrix.nl>

The problem with "s" and "s#"  is that they're already semantically 
overloaded, and will become more so with support for multiple charsets.

Some modules use "s#" when they mean "give me a pointer to an area of memory 
and its length". Writing to binary files is an example of this.

Some modules use it to mean "give me a pointer to a string". Writing to a text 
file is (probably) an example of this.

Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This 
is the case if we're going to actually look at the contents (think of 
string.upper() and such).

I think that the only real solution is to define what "s" means, come up with 
new getarg-formats for the other two use cases and convert all modules to use 
the new standard. It'll still cause grief to extension modules that aren't 
part of the core, but at least the problem will go away after a while.
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From mal at lemburg.com  Fri Nov 12 19:36:55 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 19:36:55 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
		<382BDB09.55583F28@lemburg.com>
		<027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
		<382C11FE.D7D9F916@lemburg.com>
		<14380.9616.245419.138261@weyr.cnri.reston.va.us>
		<382C3131.A8965CA5@lemburg.com> <14380.16437.71847.832880@weyr.cnri.reston.va.us>
Message-ID: <382C5E47.21FB4DD@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > It's been in the proposal since version 0.1. The idea is to
>  > provide a decent way of making existing script Unicode aware.
> 
>   Ok, so I haven't read closely enough.
> 
>  > This is what I intended to implement. The  buffer
>  > will be filled upon the first request to the UTF-8 encoding.
>  > "s" and "s#" are examples of such requests. The buffer will
>  > remain intact until the object is destroyed (since other code
>  > could store the pointer received via e.g. "s").
> 
>   Right.
> 
>  > Note that Unicode object are completely different beast ;-)
>  > String object are not touched in any way by the proposal.
> 
>   I wasn't suggesting the PyStringObject be changed, only that the
> PyUnicodeObject could maintain a reference.  Consider:
> 
>         s = fp.read()
>         u = unicode(s, 'utf-8')
> 
> u would now hold a reference to s, and s/s# would return a pointer
> into s instead of re-building the UTF-8 form.  I talked myself out of
> this because it would be too easy to keep a lot more string objects
> around than were actually needed.

Agreed. Also, the encoding would always be correct: the
<default encoding> buffer will always hold the <default encoding>
version (which should be UTF-8...).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From gstein at lyra.org  Fri Nov 12 23:19:15 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 14:19:15 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <007e01bf2d09$44738440$0501a8c0@bobcat>
Message-ID: 

On Fri, 12 Nov 1999, Mark Hammond wrote:
> Couldnt we start with Fredriks implementation, and see how the rest
> turns out?  Even if we do choose to change the underlying Unicode
> implementation to use a different native encoding, the interface to
> the PyUnicode_Type would remain pretty similar.  The advantage is that
> we have something now to start working with for the rest of the
> support we need.

I agree with "start with" here, and will go one step further (which Mark
may have implied) -- *check in* Fredrik's code.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Fri Nov 12 23:59:03 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 14:59:03 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <382C11FE.D7D9F916@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> Fredrik Lundh wrote:
> > > Besides, the Unicode object will have a buffer containing the
> > >  representation of the object, which, if all goes
> > > well, will always hold the UTF-8 value.
> > 
> > 
> > 
> > over my dead body, that one...
> 
> Such a buffer is needed to implement "s" and "s#" argument
> parsing. It's a simple requirement to support those two
> parsing markers -- there's not much to argue about, really...
> unless, of course, you want to give up Unicode object support
> for all APIs using these parsers.

Bull!

You can easily support "s#" by returning the pointer to the
Unicode buffer. The *entire* reason for introducing "t#" is to
differentiate between returning a pointer to an 8-bit [character] buffer
and a not-8-bit buffer.

In other words, the work done to introduce "t#" was done *SPECIFICALLY* to
allow "s#" to return a pointer to the Unicode data.

I am with Fredrik on that auxiliary buffer. You'll have two dead bodies
to deal with :-)

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Sat Nov 13 00:05:11 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 15:05:11 -0800 (PST)
Subject: [Python-Dev] just say no... 
In-Reply-To: <19991112163347.5527635BB1E@snelboot.oratrix.nl>
Message-ID: 

This was done last year!! We have "s#" meaning "give me some bytes." We
have "t#" meaning "give me some 8-bit characters." The Python distribution
has been completely updated to use the appropriate format in each call.

This was done *specifically* to support the introduction of a Unicode type.
The intent was that "s#" returns the *raw* bytes of the Unicode string --
NOT a UTF-8 encoding!

As a separate argument, MAL can argue that "t#" should create an internal,
associated buffer to hold a UTF-8 encoding and then return that. But the
"s#" should return the raw bytes!
[ and I'll argue against the response to "t#" anyhow... ]

-g

On Fri, 12 Nov 1999, Jack Jansen wrote:
> The problem with "s" and "s#"  is that they're already semantically 
> overloaded, and will become more so with support for multiple charsets.
> 
> Some modules use "s#" when they mean "give me a pointer to an area of memory 
> and its length". Writing to binary files is an example of this.
> 
> Some modules use it to mean "give me a pointer to a string". Writing to a text 
> file is (probably) an example of this.
> 
> Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This 
> is the case if we're going to actually look at the contents (think of 
> string.upper() and such).
> 
> I think that the only real solution is to define what "s" means, come up with 
> new getarg-formats for the other two use cases and convert all modules to use 
> the new standard. It'll still cause grief to extension modules that aren't 
> part of the core, but at least the problem will go away after a while.
> --
> Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
> Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
> www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 
> 
> 
> 
> 

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Sat Nov 13 00:09:13 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 15:09:13 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <382C2F97.8E7D7A4D@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> Fredrik Lundh wrote:
>...
> > why?  I don't understand why "s" and "s#" has
> > to deal with encoding issues at all...
> > 
> > > unless, of course, you want to give up Unicode object support
> > > for all APIs using these parsers.
> > 
> > hmm.  maybe that's exactly what I want...
> 
> If we don't add that support, lot's of existing APIs won't
> accept Unicode object instead of strings. While it could be
> argued that automatic conversion to UTF-8 is not transparent
> enough for the user, the other solution of using str(u)
> everywhere would probably make writing Unicode-aware code a
> rather clumsy task and introduce other pitfalls, since str(obj)
> calls PyObject_Str() which also works on integers, floats,
> etc.

No no no...

"s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are
supposed to return the raw bytes.

If a caller wants 8-bit characters, then that caller will use "t#".

If you want to argue for that separate, encoded buffer, then argue for it
for support for the "t#" format. But do NOT say that it is needed for "s#"
which simply means "give me some bytes."

-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Sat Nov 13 00:26:08 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 15:26:08 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <14380.16064.723277.586881@weyr.cnri.reston.va.us>
Message-ID: 

On Fri, 12 Nov 1999, Fred L. Drake, Jr. wrote:
> M.-A. Lemburg writes:
>  > The abbreviation BOM is quite common w/r to Unicode.

True.

>   Yes: "w/r to Unicode".  In sys, it's out of context and should
> receive a more descriptive name.  I think using BOM in unicodec is
> good.

I agree and believe that we can avoid putting it into sys altogether.

>  >   BOM_BE: '\376\377' 
>  >     (corresponds to Unicode 0x0000FEFF in UTF-16 
>  >      == ZERO WIDTH NO-BREAK SPACE)

Are you sure about that interpretation? I thought the BOM characters
(0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space.

>   I'd also add BOM to be the same as sys.byte_order_mark.  Perhaps
> even instead of sys.byte_order_mark (just to localize the areas of
> code that are affected).

### unicodec.py ###
import struct

BOM = struct.pack('h', 0x0000FEFF)
BOM_BE = '\376\377'
...


If somebody needs the BOM, then they should go to unicodec.py (or some
other module). I do not believe we need to put that stuff into the sys
module. It is just too easy to create the value in Python.

Cheers,
-g

p.s. to be pedantic, the pack() format could be '@h'

--
Greg Stein, http://www.lyra.org/




From mhammond at skippinet.com.au  Sat Nov 13 00:41:16 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sat, 13 Nov 1999 10:41:16 +1100
Subject: [Python-Dev] just say no... 
In-Reply-To: 
Message-ID: <008601bf2d67$6a9982b0$0501a8c0@bobcat>

[Greg writes]

> As a separate argument, MAL can argue that "t#" should create
> an internal,
> associated buffer to hold a UTF-8 encoding and then return
> that. But the
> "s#" should return the raw bytes!
> [ and I'll argue against the response to "t#" anyhow... ]

Hmm.  Climbing over these dead bodies could get a bit smelly :-)

I'm inclined to agree that holding 2 internal buffers for the unicode
object is not ideal.  However, I _am_ concerned with getting decent
PyArg_ParseTuple and Py_BuildValue support, and if the cost is an
extra buffer I will survive.  So let's look for solutions that don't
require it, rather than holding it up as evil when no other solution
is obvious.

My requirements appear to me to be very simple (for an anglophile):

Let's say I have a platform Unicode value - eg, I got a Unicode value
from some external library (say COM :-)  Let's assume for now that the
Unicode string is fully representable as ASCII - say a file or
directory name that COM gave me.  I simply want to be able to pass
this Unicode object to "open()", and have it work.  This assumes that
open() will not become "native unicode", simply because the underlying C
support is not unicode aware - it needs to be converted to a "char *"
(ie, will use the "t#" format).

The second side of the equation is when I expose a Python function
that talks Unicode - eg, I need to _pass_ a platform Unicode value to
an external library.  The Python programmer should be able to pass a
Unicode object (no problem), or a PyString object.

In code terms:
Prob1:
  name = SomeComObject.GetFileName() # A Unicode object
  f = open(name)
Prob2:
  SomeComObject.SetFileName("foo.txt")

IMO it is important that we have a good strategy for dealing with this
for extensions.  MAL addresses one direction, but not the other.

Maybe if we toss around general solutions for this the implementation
will fall out.  MAL's idea of the additional buffer starts to address
this, but isn't the whole story.

Any ideas on this?




From gstein at lyra.org  Sat Nov 13 01:49:34 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 16:49:34 -0800 (PST)
Subject: [Python-Dev] argument parsing (was: just say no...)
In-Reply-To: <008601bf2d67$6a9982b0$0501a8c0@bobcat>
Message-ID: 

On Sat, 13 Nov 1999, Mark Hammond wrote:
>...
> Im inclined to agree that holding 2 internal buffers for the unicode
> object is not ideal.  However, I _am_ concerned with getting decent
> PyArg_ParseTuple and Py_BuildValue support, and if the cost is an
> extra buffer I will survive.  So lets look for solutions that dont
> require it, rather than holding it up as evil when no other solution
> is obvious.

I believe Py_BuildValue is pretty straight-forward. Simply state that it
is allowed to perform conversions and place the resulting object into the
resulting tuple.
(with appropriate refcounting)

In other words:

  tuple = Py_BuildValue("U", stringOb);

The stringOb will be converted to a Unicode object. The new Unicode object
will go into the tuple (with the tuple holding the only reference!). The
stringOb will NOT acquire any additional references.

[ "U" format may be wrong; it is here for example purposes ]


Okay... now the PyArg_ParseTuple() is the *real* kicker.

>...
> Prob1:
>   name = SomeComObject.GetFileName() # A Unicode object
>   f = open(name)
> Prob2:
>   SomeComObject.SetFileName("foo.txt")

Both of these issues are due to PyArg_ParseTuple. In Prob1, you want a
string-like object which can be passed to the OS as an 8-bit string. In
Prob2, you want a string-like object which can be passed to the OS as a
Unicode string.

I see three options for PyArg_ParseTuple:

1) allow it to return NEW objects which must be DECREF'd.
   [ current policy only loans out references ]

   This option could be difficult in the presence of errors during the
   parse. For example, the current idiom is:

     if (!PyArg_ParseTuple(args, "..."))
        return NULL;

   If an object was produced, but then a later argument caused a failure,
   then who is responsible for freeing the object?

2) like step 1, but PyArg_ParseTuple is smart enough to NOT return any new
   objects when an error occurred.

   This basically answers the last question in option (1) -- ParseTuple is
   responsible.

3) Return loaned-out-references to objects which have been tested for
   convertability. Helper functions perform the conversion and the caller
   will then free the reference.
   [ this is the model used in PyWin32 ]

   Code in PyWin32 typically looks like:

     if (!PyArg_ParseTuple(args, "O", &ob))
       return NULL;
     if ((unicodeOb = GiveMeUnicode(ob)) == NULL)
       return NULL;
     ...
     Py_DECREF(unicodeOb);

   [ GiveMeUnicode is descriptive here; I forget the name used in PyWin32 ]

   In a "real" situation, the ParseTuple format would be "U" and the
   object would be type-tested for PyStringType or PyUnicodeType.

   Note that GiveMeUnicode() would also do a type-test, but it can't
   produce a *specific* error like ParseTuple (e.g. "string/unicode object
   expected" vs "parameter 3 must be a string/unicode object")

Are there more options? Anybody?


All three of these avoid the secondary buffer. The last is cleanest w.r.t.
keeping the existing "loaned references" behavior, but can get a bit
wordy when you need to convert a bunch of string arguments.

Option (2) adds a good amount of complexity to PyArg_ParseTuple -- it
would need to keep a "free list" in case an error occurred.

Option (1) adds DECREF logic to callers to ensure they clean up. The add'l
logic isn't much more than the other two options (the only change is
adding DECREFs before returning NULL from the "if (!PyArg_ParseTuple..."
condition). Note that the caller would probably need to initialize each
object to NULL before calling ParseTuple.


Personally, I prefer (3) as it makes it very clear that a new object has
been created and must be DECREF'd at some point. Also note that
GiveMeUnicode() could also accept a second argument for the type of
decoding to do (or NULL meaning "UTF-8").

Oh: note there are equivalents of all options for going from
unicode-to-string; the above is all about string-to-unicode. However, the
tricky part of unicode-to-string is determining whether backwards
compatibility will be a requirement. i.e. does existing code that uses the
"t" format suddenly achieve the capability to accept a Unicode object?
This obviously causes problems in all three options: since a new reference
must be created to handle the situation, then who DECREF's it? The old
code certainly doesn't.
[  I'm with Fredrik in saying "no, old code *doesn't* suddenly get
  the ability to accept a Unicode object." The Python code must use str() to
  do the encoding manually (until the old code is upgraded to one of the
  above three options).  ]

I think that's it for me. In the several years I've been thinking on this
problem, I haven't come up with anything but the above three. There may be
a whole new paradigm for argument parsing, but I haven't tried to think on
that one (and just fit in around ParseTuple).

Cheers,
-g

--
Greg Stein, http://www.lyra.org/






From mal at lemburg.com  Fri Nov 12 19:49:52 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 19:49:52 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
		<02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
		<382C0FA8.ACB6CCD6@lemburg.com>
		<14380.10955.420102.327867@weyr.cnri.reston.va.us>
		<382C3749.198EEBC6@lemburg.com> <14380.16064.723277.586881@weyr.cnri.reston.va.us>
Message-ID: <382C6150.53BDC803@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > The abbreviation BOM is quite common w/r to Unicode.
> 
>   Yes: "w/r to Unicode".  In sys, it's out of context and should
> receive a more descriptive name.  I think using BOM in unicodec is
> good.

Guido proposed to add it to sys. I originally had it defined in
unicodec.

Perhaps a sys.endian would be more appropriate for sys
with values 'little' and 'big' or '<' and '>' to be conform
to the struct module.

unicodec could then define unicodec.bom depending on the setting
in sys.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Sat Nov 13 10:37:35 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Sat, 13 Nov 1999 10:37:35 +0100
Subject: [Python-Dev] just say no...
References: 
Message-ID: <382D315F.A7ADEC42@lemburg.com>

Greg Stein wrote:
> 
> On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> > Fredrik Lundh wrote:
> >...
> > > why?  I don't understand why "s" and "s#" has
> > > to deal with encoding issues at all...
> > >
> > > > unless, of course, you want to give up Unicode object support
> > > > for all APIs using these parsers.
> > >
> > > hmm.  maybe that's exactly what I want...
> >
> > If we don't add that support, lot's of existing APIs won't
> > accept Unicode object instead of strings. While it could be
> > argued that automatic conversion to UTF-8 is not transparent
> > enough for the user, the other solution of using str(u)
> > everywhere would probably make writing Unicode-aware code a
> > rather clumsy task and introduce other pitfalls, since str(obj)
> > calls PyObject_Str() which also works on integers, floats,
> > etc.
> 
> No no no...
> 
> "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are
> supposed to return the raw bytes.

[I've waited quite some time for you to chime in on this one ;-)]

Let me summarize a bit on the general ideas behind "s", "s#"
and the extra buffer:

First, we have a general design question here: should old code
become Unicode compatible or not. As I recall the original idea
about Unicode integration was to follow Perl's idea to have
scripts become Unicode aware by simply adding a 'use utf8;'.

If this is still the case, then we'll have to come up with a
reasonable approach for integrating classical string based
APIs with the new type.

Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g.
the Latin-1 folks) which has some very nice features (see
http://czyborra.com/utf/ ) and which is a true extension of ASCII,
this encoding seems best fit for the purpose.

However, one should not forget that UTF-8 is in fact a
variable-length encoding of Unicode characters, that is, up to
3 bytes can form a *single* character. This is obviously not compatible
with definitions that explicitly state the data to be using an
8-bit single-character encoding, e.g. indexing in UTF-8 doesn't
work like it does in Latin-1 text.

So if we are to do the integration, we'll have to choose
argument parser markers that allow for multi byte characters.
"t#" does not fall into this category, "s#" certainly does,
"s" is argueable.

Also note that we have to watch out for embedded NULL bytes.
UTF-16 has NULL bytes for every character from the Latin-1
domain. If "s" were to give back a pointer to the internal
buffer which is encoded in UTF-16, you would lose data.
UTF-8 doesn't have this problem, since only NULL bytes
map to (single) NULL bytes.

Now Greg would chime in with the buffer interface and
argue that it should make the underlying internal
format accessible. This is a bad idea, IMHO, since you
shouldn't really have to know what the internal data format
is.

Defining "s#" to return UTF-8 data does not only
make "s" and "s#" return the same data format (which should
always be the case, IMO), but also hides the internal
format from the user and gives him a reliable cross-platform
data representation of Unicode data (note that UTF-8 doesn't
have the byte order problems of UTF-16).

If you are still with me, let's look at what "s" and "s#"
do: they return pointers into data areas which have to
be kept alive until the corresponding object dies.

The only way to support this feature is by allocating
a buffer for just this purpose (on the fly and only if
needed to prevent excessive memory load). The other
options of adding new magic parser markers or switching
to a more generic one all have one downside: you need to
change existing code which is in conflict with the idea
we started out with.

So, again, the question is: do we want this magical
integration or not ? Note that this is a design question,
not one of memory consumption...

--

Ok, the above covered Unicode -> String conversion. Mark
mentioned that he wanted the other way around to also
work in the same fashion, ie. automatic String -> Unicode
conversion. 

This could also be done in the same way by
interpreting the string as UTF-8 encoded Unicode... but we
have the same problem: where to put the data without
generating new intermediate objects. Since only newly
written code will use this feature there is a way to do
this though:

PyArg_ParseTuple(args,"s#",&utf8,&len);

If your C API understands UTF-8 there's nothing more to do,
if not, take Greg's option 3 approach:

PyArg_ParseTuple(args,"O",&obj);
unicode = PyUnicode_FromObject(obj);
...
Py_DECREF(unicode);

Here PyUnicode_FromObject() will return a new
reference if obj is an Unicode object or create a new
Unicode object by interpreting str(obj) as UTF-8 encoded string.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    48 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From guido at CNRI.Reston.VA.US  Sat Nov 13 13:12:41 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Sat, 13 Nov 1999 07:12:41 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Fri, 12 Nov 1999 14:59:03 PST."
              
References:  
Message-ID: <199911131212.HAA25895@eric.cnri.reston.va.us>

> I am with Fredrik on that auxilliary buffer. You'll have two dead bodies
> to deal with :-)

I haven't made up my mind yet (due to a very successful
Python-promoting visit to SD'99 east, I'm about 100 msgs behind in
this thread alone) but let me warn you that I can deal with the
carnage, if necessary. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gstein at lyra.org  Sat Nov 13 13:23:54 1999
From: gstein at lyra.org (Greg Stein)
Date: Sat, 13 Nov 1999 04:23:54 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <199911131212.HAA25895@eric.cnri.reston.va.us>
Message-ID: 

On Sat, 13 Nov 1999, Guido van Rossum wrote:
> > I am with Fredrik on that auxilliary buffer. You'll have two dead bodies
> > to deal with :-)
> 
> I haven't made up my mind yet (due to a very successful
> Python-promoting visit to SD'99 east, I'm about 100 msgs behind in
> this thread alone) but let me warn you that I can deal with the
> carnage, if necessary. :-)

Bring it on, big boy!

:-)

--
Greg Stein, http://www.lyra.org/




From mhammond at skippinet.com.au  Sat Nov 13 13:52:18 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sat, 13 Nov 1999 23:52:18 +1100
Subject: [Python-Dev] argument parsing (was: just say no...)
In-Reply-To: 
Message-ID: <00b301bf2dd5$ec4df840$0501a8c0@bobcat>

[Lamenting about PyArg_ParseTuple and managing memory buffers for
String/Unicode conversions.]

So what is really wrong with Marc's proposal about the extra pointer
on the Unicode object?  And to double the carnage, why not add the
equivalent native Unicode buffer to the PyString object?

These would only ever be filled when requested by the conversion
routines.  They have no effect other than that their memory is managed
by the object itself; they are simply a convenience to avoid having
extension modules manage the conversion buffers.

The only overheads appear to be:
* The conversion buffers may be slightly (or much :-) longer-lived -
i.e., they are not freed until the object itself is freed.
* String objects get slightly bigger, and slightly slower to destroy.

It appears to solve the problems, and the cost doesn't seem too high...
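
(A rough sketch of the layout being suggested -- the field names are
invented and nothing here is from an actual patch; the point is that
the cached buffer hangs off the object and is only freed in the
deallocator.)

    typedef struct {
        PyObject_HEAD
        /* ... the native Unicode (or string) data ... */
        char *convbuf;           /* cached conversion buffer, NULL until a
                                    conversion routine first fills it */
    } SketchObject;

    static void
    sketch_dealloc(SketchObject *self)
    {
        if (self->convbuf != NULL)   /* lives exactly as long as the object */
            free(self->convbuf);
        /* ... then release the native data and the object as usual ... */
    }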

Mark.




From guido at CNRI.Reston.VA.US  Sat Nov 13 14:06:26 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Sat, 13 Nov 1999 08:06:26 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Sat, 13 Nov 1999 10:37:35 +0100."
             <382D315F.A7ADEC42@lemburg.com> 
References:   
            <382D315F.A7ADEC42@lemburg.com> 
Message-ID: <199911131306.IAA26030@eric.cnri.reston.va.us>

I think I have a reasonable grasp of the issues here, even though I
still haven't read about 100 msgs in this thread.  Note that t# and
the charbuffer addition to the buffer API were added by Greg Stein
with my support; I'll attempt to reconstruct our thinking at the
time...

[MAL]
> Let me summarize a bit on the general ideas behind "s", "s#"
> and the extra buffer:

I think you left out t#.

> First, we have a general design question here: should old code
> become Unicode compatible or not. As I recall the original idea
> about Unicode integration was to follow Perl's idea to have
> scripts become Unicode aware by simply adding a 'use utf8;'.

I've never heard of this idea before -- or am I taking it too literally?
It smells of a mode to me :-)  I'd rather live in a world where
Unicode just works as long as you use u'...' literals or whatever
convention we decide.

> If this is still the case, then we'll have to come with a
> resonable approach for integrating classical string based
> APIs with the new type.
> 
> Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g.
> the Latin-1 folks) which has some very nice features (see
> http://czyborra.com/utf/ ) and which is a true extension of ASCII,
> this encoding seems best fit for the purpose.

Yes, especially if we fix the default encoding as UTF-8.  (I'm
expecting feedback from HP on this next week, hopefully when I see the
details, it'll be clear that we don't need a per-thread default encoding
to solve their problems; that's quite a likely outcome.  If not, we
have a real-world argument for allowing a variable default encoding,
without carnage.)

> However, one should not forget that UTF-8 is in fact a
> variable length encoding of Unicode characters, that is up to
> 3 bytes form a *single* character. This is obviously not compatible
> with definitions that explicitly state data to be using a
> 8-bit single character encoding, e.g. indexing in UTF-8 doesn't
> work like it does in Latin-1 text.

Sure, but where in current Python are there such requirements?

> So if we are to do the integration, we'll have to choose
> argument parser markers that allow for multi byte characters.
> "t#" does not fall into this category, "s#" certainly does,
> "s" is argueable.

I disagree.  I grepped through the source for s# and t#.  Here's a bit
of background.  Before t# was introduced, s# was being used for two
distinct purposes: (1) to get an 8-bit text string plus its length, in
situations where the length was needed; (2) to get binary data (e.g.
GIF data read from a file in "rb" mode).  Greg pointed out that if we
ever introduced some form of Unicode support, these two had to be
disambiguated.  We found that the majority of uses was for (2)!
Therefore we decided to change the definition of s# to mean only (2),
and introduced t# to mean (1).  Also, we introduced getcharbuffer
corresponding to t#, while getreadbuffer was meant for s#.

Note that the definition of the 's' format was left alone -- as
before, it means you need an 8-bit text string not containing null
bytes.

Our expectation was that a Unicode string passed to an s# situation
would give a pointer to the internal format plus a byte count (not a
character count!) while t# would get a pointer to some kind of 8-bit
translation/encoding plus a byte count, with the explicit requirement
that the 8-bit translation would have the same lifetime as the
original unicode object.  We decided to leave it up to the next
generation (i.e., Marc-Andre :-) to decide what kind of translation to
use and what to do when there is no reasonable translation.

Any of the following choices is acceptable (from the point of view of
not breaking the intended t# semantics; we can now start deciding
which we like best):

- utf-8
- latin-1
- ascii
- shift-jis
- lower byte of unicode ordinal
- some user- or os-specified multibyte encoding

As far as t# is concerned, for encodings that don't encode all of
Unicode, untranslatable characters could be dealt with in any number
of ways (raise an exception, ignore, replace with '?', make best
effort, etc.).

Given the current context, it should probably be the same as the
default encoding -- i.e., utf-8.  If we end up making the default
user-settable, we'll have to decide what to do with untranslatable
characters -- but that will probably be decided by the user too (it
would be a property of a specific translation specification).

In any case, I feel that t# could receive a multi-byte encoding, 
s# should receive raw binary data, and they should correspond to
getcharbuffer and getreadbuffer, respectively.

(Aside: the symmetry between 's' and 's#' is now lost; 's' matches
't#', there's no match for 's#'.)

> Also note that we have to watch out for embedded NULL bytes.
> UTF-16 has NULL bytes for every character from the Latin-1
> domain. If "s" were to give back a pointer to the internal
> buffer which is encoded in UTF-16, you would loose data.
> UTF-8 doesn't have this problem, since only NULL bytes
> map to (single) NULL bytes.

This is a red herring given my explanation above.

> Now Greg would chime in with the buffer interface and
> argue that it should make the underlying internal
> format accessible. This is a bad idea, IMHO, since you
> shouldn't really have to know what the internal data format
> is.

This is for C code.  Quite likely it *does* know what the internal
data format is!

> Defining "s#" to return UTF-8 data does not only
> make "s" and "s#" return the same data format (which should
> always be the case, IMO),

That was before t# was introduced.  No more, alas.  If you replace s#
with t#, I agree with you completely.

> but also hides the internal
> format from the user and gives him a reliable cross-platform
> data representation of Unicode data (note that UTF-8 doesn't
> have the byte order problems of UTF-16).
> 
> If you are still with, let's look at what "s" and "s#"

(and t#, which is more relevant here)

> do: they return pointers into data areas which have to
> be kept alive until the corresponding object dies.
> 
> The only way to support this feature is by allocating
> a buffer for just this purpose (on the fly and only if
> needed to prevent excessive memory load). The other
> options of adding new magic parser markers or switching
> to more generic one all have one downside: you need to
> change existing code which is in conflict with the idea
> we started out with.

Agreed.  I think this was our thinking when Greg & I introduced t#.
My own preference would be to allocate a whole string object, not
just a buffer; this could then also be used for the .encode() method
using the default encoding.

> So, again, the question is: do we want this magical
> intergration or not ? Note that this is a design question,
> not one of memory consumption...

Yes, I want it.

Note that this doesn't guarantee that all old extensions will work
flawlessly when passed Unicode objects; but I think that it covers
most cases where you could have a reasonable expectation that it
works.

(Hm, unfortunately many reasonable expectations seem to involve
the current user's preferred encoding. :-( )

> --
> 
> Ok, the above covered Unicode -> String conversion. Mark
> mentioned that he wanted the other way around to also
> work in the same fashion, ie. automatic String -> Unicode
> conversion. 
> 
> This could also be done in the same way by
> interpreting the string as UTF-8 encoded Unicode... but we
> have the same problem: where to put the data without
> generating new intermediate objects. Since only newly
> written code will use this feature there is a way to do
> this though:
> 
> PyArg_ParseTuple(args,"s#",&utf8,&len);

No!  That is supposed to give the native representation of the string
object.

I agree that Mark's problem requires a solution too, but it doesn't
have to use existing formatting characters, since there's no backwards
compatibility issue.

> If your C API understands UTF-8 there's nothing more to do,
> if not, take Greg's option 3 approach:
> 
> PyArg_ParseTuple(args,"O",&obj);
> unicode = PyUnicode_FromObject(obj);
> ...
> Py_DECREF(unicode);
> 
> Here PyUnicode_FromObject() will return a new
> reference if obj is an Unicode object or create a new
> Unicode object by interpreting str(obj) as UTF-8 encoded string.

This might work.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Sat Nov 13 14:06:35 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Sat, 13 Nov 1999 14:06:35 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.5
References: <382C0A54.E6E8328D@lemburg.com>
Message-ID: <382D625B.DC14DBDE@lemburg.com>

FYI, I've uploaded a new version of the proposal which incorporates
proposals for line breaks, case mapping, character properties and
private code points support.

The latest version of the proposal is available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

    ? should Unicode objects support %-formatting ?

    One possibility would be to emulate this via strings and the
    default encoding:

    s = '%s %i abc???' # a Latin-1 encoded string
    t = (u,3)

    # Convert Latin-1 s to a default-encoding string
    s1 = unicode(s,'latin-1').encode()

    # The '%s' will now add u in the default encoding
    s2 = s1 % t

    # Finally, convert the default-encoded string to Unicode
    u1 = unicode(s2)

    ? specifying file wrappers:

    Open issues: what to do with Python strings
    fed to the .write() method (may need to know the encoding of the
    strings) and when/if to return Python strings through the .read()
    method.

    Perhaps we need more than one type of wrapper here.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    48 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From jack at oratrix.nl  Sat Nov 13 17:40:34 1999
From: jack at oratrix.nl (Jack Jansen)
Date: Sat, 13 Nov 1999 17:40:34 +0100
Subject: [Python-Dev] just say no... 
In-Reply-To: Message by Greg Stein  ,
	     Fri, 12 Nov 1999 15:05:11 -0800 (PST) ,  
Message-ID: <19991113164039.9B697EA11A@oratrix.oratrix.nl>

Recently, Greg Stein  said:
> This was done last year!! We have "s#" meaning "give me some bytes." We
> have "t#" meaning "give me some 8-bit characters." The Python distribution
> has been completely updated to use the appropriate format in each call.

Oops...

I remember the discussion but I wasn't aware that someone had actually
_implemented_ this:-). Part of my misunderstanding was also caused by
the fact that I inspected what I thought would be the prime candidate
for t#: file.write() to a non-binary file, and it doesn't use the new
format.

I also noted a few inconsistencies at first glance, by the way: most
modules seem to use s# for things like filenames and other
data-that-is-readable-but-shouldn't-be-messed-with, but binascii is an 
exception and it uses t# for uuencoded strings...
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 



From guido at CNRI.Reston.VA.US  Sat Nov 13 20:20:51 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Sat, 13 Nov 1999 14:20:51 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Sat, 13 Nov 1999 17:40:34 +0100."
             <19991113164039.9B697EA11A@oratrix.oratrix.nl> 
References: <19991113164039.9B697EA11A@oratrix.oratrix.nl> 
Message-ID: <199911131920.OAA26165@eric.cnri.reston.va.us>

> I remember the discussion but I wasn't aware that somone had actually
> _implemented_ this:-). Part of my misunderstanding was also caused by
> the fact that I inspected what I thought would be the prime candidate
> for t#: file.write() to a non-binary file, and it doesn't use the new
> format.

I guess that's because file.write() doesn't distinguish between text
and binary files.  Maybe it should: the current implementation
together with my proposed semantics for Unicode strings would mean that
printing a unicode string (to stdout) would dump the internal encoding
to the file.  I guess it should do so only when the file is opened in
binary mode; for files opened in text mode it should use an encoding
(opening a file can specify an encoding; can we change the encoding of
an existing file?).

> I also noted a few inconsistencies at first glance, by the way: most
> modules seem to use s# for things like filenames and other
> data-that-is-readable-but-shouldn't-be-messed-with, but binascii is an 
> exception and it uses t# for uuencoded strings...

Actually, binascii seems to do it right: s# for binary data, t# for
text (uuencoded, hqx, base64).  That is, the b2a variants use s# while
the a2b variants use t#.  The only thing I'm not sure about in that
module are binascii_rledecode_hqx() and binascii_rlecode_hqx() -- I
don't understand where these stand in the complexity of binhex
en/decoding.

--Guido van Rossum (home page: http://www.python.org/~guido/)




From mal at lemburg.com  Sun Nov 14 23:11:54 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Sun, 14 Nov 1999 23:11:54 +0100
Subject: [Python-Dev] just say no...
References:   
	            <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us>
Message-ID: <382F33AA.C3EE825A@lemburg.com>

Guido van Rossum wrote:
> 
> I think I have a reasonable grasp of the issues here, even though I
> still haven't read about 100 msgs in this thread.  Note that t# and
> the charbuffer addition to the buffer API were added by Greg Stein
> with my support; I'll attempt to reconstruct our thinking at the
> time...
>
> [MAL]
> > Let me summarize a bit on the general ideas behind "s", "s#"
> > and the extra buffer:
> 
> I think you left out t#.

On purpose -- according to my thinking. I see "t#" as an interface
to bf_getcharbuf, which I understand as an 8-bit character buffer...
UTF-8 is a multi byte encoding. It still is character data, but
not necessarily 8 bits in length (up to 24 bits are used).

Anyway, I'm not really interested in having an argument about
this. If you say, "t#" fits the purpose, then that's fine with
me. Still, we should clearly define that "t#" returns
text data and "s#" binary data. Encoding, bit length, etc. should
explicitly remain left undefined.

> > First, we have a general design question here: should old code
> > become Unicode compatible or not. As I recall the original idea
> > about Unicode integration was to follow Perl's idea to have
> > scripts become Unicode aware by simply adding a 'use utf8;'.
> 
> I've never heard of this idea before -- or am I taking it too literal?
> It smells of a mode to me :-)  I'd rather live in a world where
> Unicode just works as long as you use u'...' literals or whatever
> convention we decide.
> 
> > If this is still the case, then we'll have to come with a
> > resonable approach for integrating classical string based
> > APIs with the new type.
> >
> > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g.
> > the Latin-1 folks) which has some very nice features (see
> > http://czyborra.com/utf/ ) and which is a true extension of ASCII,
> > this encoding seems best fit for the purpose.
> 
> Yes, especially if we fix the default encoding as UTF-8.  (I'm
> expecting feedback from HP on this next week, hopefully when I see the
> details, it'll be clear that don't need a per-thread default encoding
> to solve their problems; that's quite a likely outcome.  If not, we
> have a real-world argument for allowing a variable default encoding,
> without carnage.)

Fair enough :-)
 
> > However, one should not forget that UTF-8 is in fact a
> > variable length encoding of Unicode characters, that is up to
> > 3 bytes form a *single* character. This is obviously not compatible
> > with definitions that explicitly state data to be using a
> > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't
> > work like it does in Latin-1 text.
> 
> Sure, but where in current Python are there such requirements?

It was my understanding that "t#" refers to single byte character
data. That's where the above arguments were aiming at...
 
> > So if we are to do the integration, we'll have to choose
> > argument parser markers that allow for multi byte characters.
> > "t#" does not fall into this category, "s#" certainly does,
> > "s" is argueable.
> 
> I disagree.  I grepped through the source for s# and t#.  Here's a bit
> of background.  Before t# was introduced, s# was being used for two
> distinct purposes: (1) to get an 8-bit text string plus its length, in
> situations where the length was needed; (2) to get binary data (e.g.
> GIF data read from a file in "rb" mode).  Greg pointed out that if we
> ever introduced some form of Unicode support, these two had to be
> disambiguated.  We found that the majority of uses was for (2)!
> Therefore we decided to change the definition of s# to mean only (2),
> and introduced t# to mean (1).  Also, we introduced getcharbuffer
> corresponding to t#, while getreadbuffer was meant for s#.

I know it's too late now, but I can't really follow the arguments
here: in what ways are (1) and (2) different from the implementation's
point of view ? If "t#" is to return UTF-8, then the size of the UTF-8
buffer will not equal the size of the Unicode string, so both parser
markers return essentially the same information. The only difference
would be on the semantic side: (1) means: give me text data, while (2)
does not specify the data type.

Perhaps I'm missing something...
 
> Note that the definition of the 's' format was left alone -- as
> before, it means you need an 8-bit text string not containing null
> bytes.

This definition should then be changed to "text string without
null bytes" dropping the 8-bit reference.
 
> Our expectation was that a Unicode string passed to an s# situation
> would give a pointer to the internal format plus a byte count (not a
> character count!) while t# would get a pointer to some kind of 8-bit
> translation/encoding plus a byte count, with the explicit requirement
> that the 8-bit translation would have the same lifetime as the
> original unicode object.  We decided to leave it up to the next
> generation (i.e., Marc-Andre :-) to decide what kind of translation to
> use and what to do when there is no reasonable translation.

Hmm, I would strongly object to making "s#" return the internal
format. file.write() would then default to writing UTF-16 data
instead of UTF-8 data. This could result in strange errors
due to the UTF-16 format being endian dependent.

It would also break the symmetry between file.write(u) and
unicode(file.read()), since the default encoding is not used as
internal format for other reasons (see proposal).

> Any of the following choices is acceptable (from the point of view of
> not breaking the intended t# semantics; we can now start deciding
> which we like best):

I think we have already agreed on using UTF-8 for the default
encoding. It has quite a few advantages. See

	http://czyborra.com/utf/

for a good overview of the pros and cons.

> - utf-8
> - latin-1
> - ascii
> - shift-jis
> - lower byte of unicode ordinal
> - some user- or os-specified multibyte encoding
> 
> As far as t# is concerned, for encodings that don't encode all of
> Unicode, untranslatable characters could be dealt with in any number
> of ways (raise an exception, ignore, replace with '?', make best
> effort, etc.).

The usual Python way would be: raise an exception. This is what
the proposal defines for Codecs in case an encoding/decoding
mapping is not possible, BTW. (UTF-8 will always succeed on
output.)
 
> Given the current context, it should probably be the same as the
> default encoding -- i.e., utf-8.  If we end up making the default
> user-settable, we'll have to decide what to do with untranslatable
> characters -- but that will probably be decided by the user too (it
> would be a property of a specific translation specification).
> 
> In any case, I feel that t# could receive a multi-byte encoding,
> s# should receive raw binary data, and they should correspond to
> getcharbuffer and getreadbuffer, respectively.

Why would you want to have "s#" return the raw binary data for
Unicode objects ? 

Note that it is not mentioned anywhere that
"s#" and "t#" necessarily have to return different things
(binary being a superset of text). I'd opt for "s#" and "t#" both
returning UTF-8 data. This can be implemented by delegating the
buffer slots to the default-encoded string object (see below).

> > Now Greg would chime in with the buffer interface and
> > argue that it should make the underlying internal
> > format accessible. This is a bad idea, IMHO, since you
> > shouldn't really have to know what the internal data format
> > is.
> 
> This is for C code.  Quite likely it *does* know what the internal
> data format is!

C code can use the PyUnicode_* APIs to access the data. I
don't think that argument parsing is powerful enough to
provide the C code with enough information about the data
contents, e.g. it can only state the encoding length, not the
string length.
 
> > Defining "s#" to return UTF-8 data does not only
> > make "s" and "s#" return the same data format (which should
> > always be the case, IMO),
> 
> That was before t# was introduced.  No more, alas.  If you replace s#
> with t#, I agree with you completely.

Done :-)
 
> > but also hides the internal
> > format from the user and gives him a reliable cross-platform
> > data representation of Unicode data (note that UTF-8 doesn't
> > have the byte order problems of UTF-16).
> >
> > If you are still with, let's look at what "s" and "s#"
> 
> (and t#, which is more relevant here)
> 
> > do: they return pointers into data areas which have to
> > be kept alive until the corresponding object dies.
> >
> > The only way to support this feature is by allocating
> > a buffer for just this purpose (on the fly and only if
> > needed to prevent excessive memory load). The other
> > options of adding new magic parser markers or switching
> > to more generic one all have one downside: you need to
> > change existing code which is in conflict with the idea
> > we started out with.
> 
> Agreed.  I think this was our thinking when Greg & I introduced t#.
> My own preference would be to allocate a whole string object, not
> just a buffer; this could then also be used for the .encode() method
> using the default encoding.

Good point. I'll change the extra conversion buffer to a Python
string object created on request.
 
> > So, again, the question is: do we want this magical
> > intergration or not ? Note that this is a design question,
> > not one of memory consumption...
> 
> Yes, I want it.
> 
> Note that this doesn't guarantee that all old extensions will work
> flawlessly when passed Unicode objects; but I think that it covers
> most cases where you could have a reasonable expectation that it
> works.
> 
> (Hm, unfortunately many reasonable expectations seem to involve
> the current user's preferred encoding. :-( )

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    47 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From amk1 at erols.com  Mon Nov 15 02:49:08 1999
From: amk1 at erols.com (A.M. Kuchling)
Date: Sun, 14 Nov 1999 20:49:08 -0500
Subject: [Python-Dev] PyErr_Format security note
Message-ID: <199911150149.UAA00408@mira.erols.com>

I noticed this in PyErr_Format(exception, format, va_alist):

	char buffer[500]; /* Caller is responsible for limiting the format */
	...
	vsprintf(buffer, format, vargs);

Making the caller responsible for this is error-prone.  The danger, of
course, is a buffer overflow caused by generating an error string
that's larger than the buffer, possibly letting people execute
arbitrary code.  We could add a test to the configure script for
vsnprintf() and use it when possible, but that only fixes the problem
on platforms which have it.  Can we find an implementation of
vsnprintf() someplace?
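
(The shape of the fix would be something like the sketch below;
HAVE_VSNPRINTF is an assumed configure-provided macro, not something
that exists today.)

    #include <stdarg.h>
    #include <stdio.h>

    static void
    format_error(char *buffer, size_t size, const char *format, ...)
    {
        va_list vargs;
        va_start(vargs, format);
    #ifdef HAVE_VSNPRINTF
        vsnprintf(buffer, size, format, vargs);  /* truncates, never overruns */
    #else
        vsprintf(buffer, format, vargs);         /* caller must bound the format */
    #endif
        va_end(vargs);
    }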

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
One form to rule them all, one form to find them, one form to bring them all
and in the darkness rewrite the hell out of them.
    -- Digital Equipment Corporation, in a comment from SENDMAIL Ruleset 3




From gstein at lyra.org  Mon Nov 15 03:11:39 1999
From: gstein at lyra.org (Greg Stein)
Date: Sun, 14 Nov 1999 18:11:39 -0800 (PST)
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: <199911150149.UAA00408@mira.erols.com>
Message-ID: 

On Sun, 14 Nov 1999, A.M. Kuchling wrote:
> Making the caller responsible for this is error-prone.  The danger, of
> course, is a buffer overflow caused by generating an error string
> that's larger than the buffer, possibly letting people execute
> arbitrary code.  We could add a test to the configure script for
> vsnprintf() and use it when possible, but that only fixes the problem
> on platforms which have it.  Can we find an implementation of
> vsnprintf() someplace?

Apache has a safe implementation (they have reviewed the heck out of it
for obvious reasons :-).

In the Apache source distribution, it is located in src/ap/ap_snprintf.c.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Mon Nov 15 09:09:07 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 09:09:07 +0100
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com>
Message-ID: <382FBFA3.B28B8E1E@lemburg.com>

"A.M. Kuchling" wrote:
> 
> I noticed this in PyErr_Format(exception, format, va_alist):
> 
>         char buffer[500]; /* Caller is responsible for limiting the format */
>         ...
>         vsprintf(buffer, format, vargs);
> 
> Making the caller responsible for this is error-prone.  The danger, of
> course, is a buffer overflow caused by generating an error string
> that's larger than the buffer, possibly letting people execute
> arbitrary code.  We could add a test to the configure script for
> vsnprintf() and use it when possible, but that only fixes the problem
> on platforms which have it.  Can we find an implementation of
> vsnprintf() someplace?

In sysmodule.c, this check is done which should be safe enough
since no "return" is issued (Py_FatalError() does an abort()):

  if (vsprintf(buffer, format, va) >= sizeof(buffer))
    Py_FatalError("PySys_WriteStdout/err: buffer overrun");


-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From gstein at lyra.org  Mon Nov 15 10:28:06 1999
From: gstein at lyra.org (Greg Stein)
Date: Mon, 15 Nov 1999 01:28:06 -0800 (PST)
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: <382FBFA3.B28B8E1E@lemburg.com>
Message-ID: 

On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
>...
> In sysmodule.c, this check is done which should be safe enough
> since no "return" is issued (Py_FatalError() does an abort()):
> 
>   if (vsprintf(buffer, format, va) >= sizeof(buffer))
>     Py_FatalError("PySys_WriteStdout/err: buffer overrun");

I believe the return from vsprintf() itself would be the problem.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Mon Nov 15 10:49:26 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 10:49:26 +0100
Subject: [Python-Dev] PyErr_Format security note
References: 
Message-ID: <382FD726.6ACB912F@lemburg.com>

Greg Stein wrote:
> 
> On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
> >...
> > In sysmodule.c, this check is done which should be safe enough
> > since no "return" is issued (Py_FatalError() does an abort()):
> >
> >   if (vsprintf(buffer, format, va) >= sizeof(buffer))
> >     Py_FatalError("PySys_WriteStdout/err: buffer overrun");
> 
> I believe the return from vsprintf() itself would be the problem.

Ouch, yes, you are right... but who could exploit this security
hole ? Since PyErr_Format() is only reachable from C code, only
bad programming style in extensions could make it exploitable
via user input.

Wouldn't it be possible to assign thread globals for these
functions to use ? These would live on the heap instead of
on the stack and eliminate the buffer overrun possibilities
(I guess -- I don't have any experience with these...).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From akuchlin at mems-exchange.org  Mon Nov 15 16:17:58 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Mon, 15 Nov 1999 10:17:58 -0500 (EST)
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: <382FD726.6ACB912F@lemburg.com>
References: 
	<382FD726.6ACB912F@lemburg.com>
Message-ID: <14384.9254.152604.11688@amarok.cnri.reston.va.us>

M.-A. Lemburg writes:
>Ouch, yes, you are right... but who could exploit this security
>hole ? Since PyErr_Format() is only reachable for C code, only
>bad programming style in extensions could make it exploitable
>via user input.

99% of security holes arise out of carelessness, and besides, this
buffer size doesn't seem to be documented in either api.tex or
ext.tex.  I'll look into borrowing Apache's implementation and
modifying it into a varargs form.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
I can also withstand considerably more G-force than most people, even though I
do say so myself.
    -- The Doctor, in "The Ambassadors of Death"




From guido at CNRI.Reston.VA.US  Mon Nov 15 16:23:57 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 10:23:57 -0500
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: Your message of "Sun, 14 Nov 1999 20:49:08 EST."
             <199911150149.UAA00408@mira.erols.com> 
References: <199911150149.UAA00408@mira.erols.com> 
Message-ID: <199911151523.KAA27163@eric.cnri.reston.va.us>

> I noticed this in PyErr_Format(exception, format, va_alist):
> 
> 	char buffer[500]; /* Caller is responsible for limiting the format */
> 	...
> 	vsprintf(buffer, format, vargs);
> 
> Making the caller responsible for this is error-prone.

Agreed.  The limit of 500 chars, while technically undocumented, is
part of the specs for PyErr_Format (which is currently wholly
undocumented).  The current callers all have explicit precautions, but
of course I agree that this is a potential danger.

> The danger, of
> course, is a buffer overflow caused by generating an error string
> that's larger than the buffer, possibly letting people execute
> arbitrary code.  We could add a test to the configure script for
> vsnprintf() and use it when possible, but that only fixes the problem
> on platforms which have it.  Can we find an implementation of
> vsnprintf() someplace?

Assuming that Linux and Solaris have vsnprintf(), can't we just use
the configure script to detect it, and issue a warning blaming the
platform for those platforms that don't have it?  That seems much
simpler (from a maintenance perspective) than carrying our own
implementation around (even if we can borrow the Apache version).

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fdrake at acm.org  Mon Nov 15 16:24:27 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Mon, 15 Nov 1999 10:24:27 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382C6150.53BDC803@lemburg.com>
References: 
	<02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
	<382C0FA8.ACB6CCD6@lemburg.com>
	<14380.10955.420102.327867@weyr.cnri.reston.va.us>
	<382C3749.198EEBC6@lemburg.com>
	<14380.16064.723277.586881@weyr.cnri.reston.va.us>
	<382C6150.53BDC803@lemburg.com>
Message-ID: <14384.9643.145759.816037@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > Guido proposed to add it to sys. I originally had it defined in
 > unicodec.

  Well, he clearly didn't ask me!  ;-)

 > Perhaps a sys.endian would be more appropriate for sys
 > with values 'little' and 'big' or '<' and '>' to be conform
 > to the struct module.
 > 
 > unicodec could then define unicodec.bom depending on the setting
 > in sys.

  This seems more reasonable, though I'd go with BOM instead of bom.
But that's a style issue, so not so important.  If you write bom,
I'll write bom.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From captainrobbo at yahoo.com  Mon Nov 15 16:30:45 1999
From: captainrobbo at yahoo.com (Andy Robinson)
Date: Mon, 15 Nov 1999 07:30:45 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
Message-ID: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>

Some thoughts on the codecs...

1. Stream interface
At the moment a codec has dump and load methods which
read a (slice of a) stream into a string in memory and
vice versa.  As the proposal notes, this could lead to
errors if you take a slice out of a stream.   This is
not just due to character truncation; some Asian
encodings are modal and have shift-in and shift-out
sequences as they move from Western single-byte
characters to double-byte ones.   It also seems a bit
pointless to me as the source (or target) is still a
Unicode string in memory.

This is a real problem - a filter to convert big files
between two encodings should be possible without
knowledge of the particular encoding, as should one on
the input/output of some server.  We can still give a
default implementation for single-byte encodings.

What's a good API for real stream conversion?   just
Codec.encodeStream(infile, outfile)  ?  or is it more
useful to feed the codec with data a chunk at a time?


2. Data driven codecs
I really like codecs being objects, and believe we
could build support for a lot more encodings, a lot
sooner than is otherwise possible, by making them data
driven rather making each one compiled C code with
static mapping tables.  What do people think about the
approach below?

First of all, the ISO8859-1 series are straight
mappings to Unicode code points.  So one Python script
could parse these files and build the mapping table,
and a very small data file could hold these encodings.
  A compiled helper function analogous to
string.translate() could deal with most of them.
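
(Such a helper is tiny -- a sketch, with the 256-entry table assumed
to come from the generated data file:)

    static void
    decode_single_byte(const unsigned char *in, int len,
                       const unsigned short map[256], unsigned short *out)
    {
        int i;
        /* ISO 8859-X style decoding: one straight table lookup per byte */
        for (i = 0; i < len; i++)
            out[i] = map[in[i]];
    }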

Secondly, the double-byte ones involve a mixture of
algorithms and data.  The worst cases I know are modal
encodings which need a single-byte lookup table, a
double-byte lookup table, and have some very simple
rules about escape sequences in between them.  A
simple state machine could still handle these (and the
single-byte mappings above become extra-simple special
cases); I could imagine feeding it a totally
data-driven set of rules.  

Third, we can massively compress the mapping tables
using a notation which just lists contiguous ranges;
and very often there are relationships between
encodings.  For example, "cpXYZ is just like cpXYY but
with an extra 'smiley' at 0XFE32".  In these cases, a
script can build a family of related codecs in an
auditable manner. 

3. What encodings to distribute?
The only clean answers to this are 'almost none', or
'everything that Unicode 3.0 has a mapping for'.  The
latter is going to add some weight to the
distribution.  What are people's feelings?  Do we ship
any at all apart from the Unicode ones?  Should new
encodings be downloadable from www.python.org?  Should
there be an optional package outside the main
distribution?

Thanks,

Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From akuchlin at mems-exchange.org  Mon Nov 15 16:36:47 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Mon, 15 Nov 1999 10:36:47 -0500 (EST)
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: <199911151523.KAA27163@eric.cnri.reston.va.us>
References: <199911150149.UAA00408@mira.erols.com>
	<199911151523.KAA27163@eric.cnri.reston.va.us>
Message-ID: <14384.10383.718373.432606@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>Assuming that Linux and Solaris have vsnprintf(), can't we just use
>the configure script to detect it, and issue a warning blaming the
>platform for those platforms that don't have it?  That seems much

But people using an already-installed Python binary won't see any such
configure-time warning, and won't find out about the potential
problem.  Plus, how do people fix the problem on platforms that don't
have vsnprintf() -- switch to Solaris or Linux?  Not much of a
solution.  (vsnprintf() isn't ANSI C, though it's a common extension,
so platforms that lack it aren't really deficient.)

Hmm... could we maybe use Python's existing (string % vars) machinery?
No, that seems to be hard, because it would want
PyObjects, and we can't know what Python types to convert the varargs
to, unless we parse the format string (at which point we may as well
get a vsnprintf() implementation).

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
A successful tool is one that was used to do something undreamed of by its
author.
    -- S.C. Johnson




From guido at CNRI.Reston.VA.US  Mon Nov 15 16:50:24 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 10:50:24 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Sun, 14 Nov 1999 23:11:54 +0100."
             <382F33AA.C3EE825A@lemburg.com> 
References:  <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us>  
            <382F33AA.C3EE825A@lemburg.com> 
Message-ID: <199911151550.KAA27188@eric.cnri.reston.va.us>

> On purpose -- according to my thinking. I see "t#" as an interface
> to bf_getcharbuf which I understand as 8-bit character buffer...
> UTF-8 is a multi byte encoding. It still is character data, but
> not necessarily 8 bits in length (up to 24 bits are used).
> 
> Anyway, I'm not really interested in having an argument about
> this. If you say, "t#" fits the purpose, then that's fine with
> me. Still, we should clearly define that "t#" returns
> text data and "s#" binary data. Encoding, bit length, etc. should
> explicitly remain left undefined.

Thanks for not picking an argument.  Multibyte encodings typically
have ASCII as a subset (in such a way that an ASCII string is
represented as itself in bytes).  This is the characteristic that's
needed in my view.

> > > First, we have a general design question here: should old code
> > > become Unicode compatible or not. As I recall the original idea
> > > about Unicode integration was to follow Perl's idea to have
> > > scripts become Unicode aware by simply adding a 'use utf8;'.
> > 
> > I've never heard of this idea before -- or am I taking it too literal?
> > It smells of a mode to me :-)  I'd rather live in a world where
> > Unicode just works as long as you use u'...' literals or whatever
> > convention we decide.
> > 
> > > If this is still the case, then we'll have to come with a
> > > resonable approach for integrating classical string based
> > > APIs with the new type.
> > >
> > > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g.
> > > the Latin-1 folks) which has some very nice features (see
> > > http://czyborra.com/utf/ ) and which is a true extension of ASCII,
> > > this encoding seems best fit for the purpose.
> > 
> > Yes, especially if we fix the default encoding as UTF-8.  (I'm
> > expecting feedback from HP on this next week, hopefully when I see the
> > details, it'll be clear that don't need a per-thread default encoding
> > to solve their problems; that's quite a likely outcome.  If not, we
> > have a real-world argument for allowing a variable default encoding,
> > without carnage.)
> 
> Fair enough :-)
>  
> > > However, one should not forget that UTF-8 is in fact a
> > > variable length encoding of Unicode characters, that is up to
> > > 3 bytes form a *single* character. This is obviously not compatible
> > > with definitions that explicitly state data to be using a
> > > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't
> > > work like it does in Latin-1 text.
> > 
> > Sure, but where in current Python are there such requirements?
> 
> It was my understanding that "t#" refers to single byte character
> data. That's where the above arguments were aiming at...

t# refers to byte-encoded data.  Multibyte encodings are explicitly
designed to be passed cleanly through processing steps that handle
single-byte character data, as long as they are 8-bit clean and don't
do too much processing.

> > > So if we are to do the integration, we'll have to choose
> > > argument parser markers that allow for multi byte characters.
> > > "t#" does not fall into this category, "s#" certainly does,
> > > "s" is argueable.
> > 
> > I disagree.  I grepped through the source for s# and t#.  Here's a bit
> > of background.  Before t# was introduced, s# was being used for two
> > distinct purposes: (1) to get an 8-bit text string plus its length, in
> > situations where the length was needed; (2) to get binary data (e.g.
> > GIF data read from a file in "rb" mode).  Greg pointed out that if we
> > ever introduced some form of Unicode support, these two had to be
> > disambiguated.  We found that the majority of uses was for (2)!
> > Therefore we decided to change the definition of s# to mean only (2),
> > and introduced t# to mean (1).  Also, we introduced getcharbuffer
> > corresponding to t#, while getreadbuffer was meant for s#.
> 
> I know its too late now, but I can't really follow the arguments
> here: in what ways are (1) and (2) different from the implementations
> point of view ? If "t#" is to return UTF-8, then the size of the UTF-8
> buffer will not equal the size of the Unicode string, so both parser markers return
> essentially the same information. The only difference would be
> on the semantic side: (1) means: give me text data, while (2) does
> not specify the data type.
> 
> Perhaps I'm missing something...

The idea is that (1)/s# disallows any translation of the data, while
(2)/t# requires translation of the data to an ASCII superset (possibly
multibyte, such as UTF-8 or shift-JIS).  (2)/t# assumes that the data
contains text and that if the text consists of only ASCII characters
they are represented as themselves.  (1)/s# makes no such assumption.

In terms of implementation, Unicode objects should translate
themselves to the default encoding for t# (if possible), but they
should make the native representation available for s#.

For example, take an encryption engine.  While it is defined in terms
of byte streams, there's no requirement that the bytes represent
characters -- they could be the bytes of a GIF file, an MP3 file, or a
gzipped tar file.  If we pass Unicode to an encryption engine, we want
Unicode to come out at the other end, not UTF-8.  (If we had wanted to
encrypt UTF-8, we should have fed it UTF-8.)
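
(Concretely, such an engine keeps using s# and never sees an encoding
-- a sketch only, with an imaginary cipher step and a placeholder
result:)

    static PyObject *
    engine_encrypt(PyObject *self, PyObject *args)
    {
        char *data;
        int len;

        /* s#: raw bytes, no translation; a Unicode object would hand over
           its internal representation and a byte count */
        if (!PyArg_ParseTuple(args, "s#", &data, &len))
            return NULL;
        /* ... feed the len bytes at data to the cipher ... */
        return PyString_FromStringAndSize(data, len);  /* placeholder result */
    }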

> > Note that the definition of the 's' format was left alone -- as
> > before, it means you need an 8-bit text string not containing null
> > bytes.
> 
> This definition should then be changed to "text string without
> null bytes" dropping the 8-bit reference.

Aha, I think there's a confusion about what "8-bit" means.  For me, a
multibyte encoding like UTF-8 is still 8-bit.  Am I alone in this?
(As far as I know, C uses char* to represent multibyte characters.)
Maybe we should disambiguate it more explicitly?

> > Our expectation was that a Unicode string passed to an s# situation
> > would give a pointer to the internal format plus a byte count (not a
> > character count!) while t# would get a pointer to some kind of 8-bit
> > translation/encoding plus a byte count, with the explicit requirement
> > that the 8-bit translation would have the same lifetime as the
> > original unicode object.  We decided to leave it up to the next
> > generation (i.e., Marc-Andre :-) to decide what kind of translation to
> > use and what to do when there is no reasonable translation.
> 
> Hmm, I would strongly object to making "s#" return the internal
> format. file.write() would then default to writing UTF-16 data
> instead of UTF-8 data. This could result in strange errors
> due to the UTF-16 format being endian dependent.

But this was the whole design.  file.write() needs to be changed to
use s# when the file is open in binary mode and t# when the file is
open in text mode.
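
(i.e. something along these lines -- a sketch only, not the actual
fileobject.c code; the binary flag would come from how the file was
opened:)

    static PyObject *
    write_helper(FILE *fp, int binary, PyObject *args)
    {
        char *s;
        int n;

        /* binary mode: raw bytes via s#; text mode: 8-bit text via t# */
        if (!PyArg_ParseTuple(args, binary ? "s#" : "t#", &s, &n))
            return NULL;
        if (fwrite(s, 1, n, fp) != (size_t)n)
            return PyErr_SetFromErrno(PyExc_IOError);
        Py_INCREF(Py_None);
        return Py_None;
    }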

> It would also break the symmetry between file.write(u) and
> unicode(file.read()), since the default encoding is not used as
> internal format for other reasons (see proposal).

If the file is encoded using UTF-16 or UCS-2, you should open it in
binary mode and use unicode(file.read(), 'utf-16').  (Or perhaps the
app should read the first 2 bytes and check for a BOM and then decide
to choose between 'utf-16-be' and 'utf-16-le'.)

> > Any of the following choices is acceptable (from the point of view of
> > not breaking the intended t# semantics; we can now start deciding
> > which we like best):
> 
> I think we have already agreed on using UTF-8 for the default
> encoding. It has quite a few advantages. See
> 
> 	http://czyborra.com/utf/
> 
> for a good overview of the pros and cons.

Of course.  I was just presenting the list as an argument that if
we changed our mind about the default encoding, t# should follow the
default encoding (and not pick an encoding by other means).

> > - utf-8
> > - latin-1
> > - ascii
> > - shift-jis
> > - lower byte of unicode ordinal
> > - some user- or os-specified multibyte encoding
> > 
> > As far as t# is concerned, for encodings that don't encode all of
> > Unicode, untranslatable characters could be dealt with in any number
> > of ways (raise an exception, ignore, replace with '?', make best
> > effort, etc.).
> 
> The usual Python way would be: raise an exception. This is what
> the proposal defines for Codecs in case an encoding/decoding
> mapping is not possible, BTW. (UTF-8 will always succeed on
> output.)

Did you read Andy Robinson's case study?  He suggested that for
certain encodings there may be other things you can do that are more
user-friendly than raising an exception, depending on the application.
I am proposing to leave this a detail of each specific translation.
There may even be translations that do the same thing except they have
a different behavior for untranslatable cases -- e.g. a strict version
that raises an exception and a non-strict version that replaces bad
characters with '?'.  I think this is one of the powers of having an
extensible set of encodings.

> > Given the current context, it should probably be the same as the
> > default encoding -- i.e., utf-8.  If we end up making the default
> > user-settable, we'll have to decide what to do with untranslatable
> > characters -- but that will probably be decided by the user too (it
> > would be a property of a specific translation specification).
> > 
> > In any case, I feel that t# could receive a multi-byte encoding,
> > s# should receive raw binary data, and they should correspond to
> > getcharbuffer and getreadbuffer, respectively.
> 
> Why would you want to have "s#" return the raw binary data for
> Unicode objects ? 

Because file.write() for a binary file, and other similar things
(e.g. the encryption engine example I mentioned above) must have
*some* way to get at the raw bits.

> Note that it is not mentioned anywhere that
> "s#" and "t#" do have to necessarily return different things
> (binary being a superset of text). I'd opt for "s#" and "t#" both
> returning UTF-8 data. This can be implemented by delegating the
> buffer slots to the  object (see below).

This would defeat the whole purpose of introducing t#.  We might as
well drop t# then altogether if we adopt this.

> > > Now Greg would chime in with the buffer interface and
> > > argue that it should make the underlying internal
> > > format accessible. This is a bad idea, IMHO, since you
> > > shouldn't really have to know what the internal data format
> > > is.
> > 
> > This is for C code.  Quite likely it *does* know what the internal
> > data format is!
> 
> C code can use the PyUnicode_* APIs to access the data. I
> don't think that argument parsing is powerful enough to
> provide the C code with enough information about the data
> contents, e.g. it can only state the encoding length, not the
> string length.

Typically, all the C code does is pass multibyte encoded strings on to
other library routines that know what to do to them, or simply give
them back unchanged at a later time.  It is essential to know the
number of bytes, for memory allocation purposes.  The number of
characters is totally immaterial (and multibyte-handling code knows
how to calculate the number of characters anyway).
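
A two-line illustration of the distinction (assuming the proposed u''
literals and an encode() method that takes an encoding name):

u = u'\u03c0'              # one character: GREEK SMALL LETTER PI
s = u.encode('utf-8')      # two bytes: '\xcf\x80'
# len(u) == 1, len(s) == 2 -- C code allocating storage needs the 2.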

> > > Defining "s#" to return UTF-8 data does not only
> > > make "s" and "s#" return the same data format (which should
> > > always be the case, IMO),
> > 
> > That was before t# was introduced.  No more, alas.  If you replace s#
> > with t#, I agree with you completely.
> 
> Done :-)
>  
> > > but also hides the internal
> > > format from the user and gives him a reliable cross-platform
> > > data representation of Unicode data (note that UTF-8 doesn't
> > > have the byte order problems of UTF-16).
> > >
> > > If you are still with, let's look at what "s" and "s#"
> > 
> > (and t#, which is more relevant here)
> > 
> > > do: they return pointers into data areas which have to
> > > be kept alive until the corresponding object dies.
> > >
> > > The only way to support this feature is by allocating
> > > a buffer for just this purpose (on the fly and only if
> > > needed to prevent excessive memory load). The other
> > > options of adding new magic parser markers or switching
> > > to more generic one all have one downside: you need to
> > > change existing code which is in conflict with the idea
> > > we started out with.
> > 
> > Agreed.  I think this was our thinking when Greg & I introduced t#.
> > My own preference would be to allocate a whole string object, not
> > just a buffer; this could then also be used for the .encode() method
> > using the default encoding.
> 
> Good point. I'll change  to , a Python
> string object created on request.
>  
> > > So, again, the question is: do we want this magical
> > > integration or not ? Note that this is a design question,
> > > not one of memory consumption...
> > 
> > Yes, I want it.
> > 
> > Note that this doesn't guarantee that all old extensions will work
> > flawlessly when passed Unicode objects; but I think that it covers
> > most cases where you could have a reasonable expectation that it
> > works.
> > 
> > (Hm, unfortunately many reasonable expectations seem to involve
> > the current user's preferred encoding. :-( )
> 
> -- 
> Marc-Andre Lemburg

--Guido van Rossum (home page: http://www.python.org/~guido/)



From Mike.Da.Silva at uk.fid-intl.com  Mon Nov 15 17:01:59 1999
From: Mike.Da.Silva at uk.fid-intl.com (Da Silva, Mike)
Date: Mon, 15 Nov 1999 16:01:59 -0000
Subject: [Python-Dev] Some thoughts on the codecs...
Message-ID: 

Andy Robinson wrote:
1.	Stream interface
At the moment a codec has dump and load methods which read a (slice of a)
stream into a string in memory and vice versa.  As the proposal notes, this
could lead to errors if you take a slice out of a stream.   This is not just
due to character truncation; some Asian encodings are modal and have
shift-in and shift-out sequences as they move from Western single-byte
characters to double-byte ones.   It also seems a bit pointless to me as the
source (or target) is still a Unicode string in memory.
This is a real problem - a filter to convert big files between two encodings
should be possible without knowledge of the particular encoding, as should
one on the input/output of some server.  We can still give a default
implementation for single-byte encodings.
What's a good API for real stream conversion?   just
Codec.encodeStream(infile, outfile)  ?  or is it more useful to feed the
codec with data a chunk at a time?

A user defined chunking factor (suitably defaulted) would be useful for
processing large files.

2.	Data driven codecs
I really like codecs being objects, and believe we could build support for a
lot more encodings, a lot sooner than is otherwise possible, by making them
data driven rather than making each one compiled C code with static mapping
tables.  What do people think about the approach below?
First of all, the ISO8859-1 series are straight mappings to Unicode code
points.  So one Python script could parse these files and build the mapping
table, and a very small data file could hold these encodings.  A compiled
helper function analogous to string.translate() could deal with most of
them.
Secondly, the double-byte ones involve a mixture of algorithms and data.
The worst cases I know are modal encodings which need a single-byte lookup
table, a double-byte lookup table, and have some very simple rules about
escape sequences in between them.  A simple state machine could still handle
these (and the single-byte mappings above become extra-simple special
cases); I could imagine feeding it a totally data-driven set of rules.  
Third, we can massively compress the mapping tables using a notation which
just lists contiguous ranges; and very often there are relationships between
encodings.  For example, "cpXYZ is just like cpXYY but with an extra
'smiley' at 0XFE32".  In these cases, a script can build a family of related
codecs in an auditable manner. 

The problem here is that we need to decide whether we are Unicode-centric,
or whether Unicode is just another supported encoding. If we are
Unicode-centric, then all code-page translations will require static mapping
tables between the appropriate Unicode character and the relevant code
points in the other encoding.  This would involve (worst case) 64k static
tables for each supported encoding.  Unfortunately this also precludes the
use of algorithmic conversions and/or sparse conversion tables because most
of these transformations are relative to a source and target non-Unicode
encoding, eg JIS <---->EUCJIS.  If we are taking the IBM approach (see
CDRA), then we can mix and match approaches, and treat Unicode strings as
just Unicode, and normal strings as being any arbitrary MBCS encoding.

To guarantee the utmost interoperability and Unicode 3.0 (and beyond)
compliance, we should probably assume that all core encodings are relative
to Unicode as the pivot encoding.  This should hopefully avoid any gotcha's
with roundtrips between any two arbitrary native encodings.  The downside is
this will probably be slower than an optimised algorithmic transformation.
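
Either way, the single-byte case Andy describes is nearly free; a
256-entry table is all the "data" such a codec needs.  A rough sketch
(every name here is invented, nothing is part of the proposal):

def make_decode_map(table):
    # table maps byte values (0-255) to Unicode ordinals; unmapped
    # bytes fall back to the identity (Latin-1-like) mapping here.
    decode_map = {}
    for byte in range(256):
        decode_map[byte] = unichr(table.get(byte, byte))
    return decode_map

def decode_single_byte(s, decode_map):
    result = u''
    for c in s:
        result = result + decode_map[ord(c)]
    return result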

3.	What encodings to distribute?
The only clean answers to this are 'almost none', or 'everything that
Unicode 3.0 has a mapping for'.  The latter is going to add some weight to
the distribution.  What are people's feelings?  Do we ship any at all apart
from the Unicode ones?  Should new encodings be downloadable from
www.python.org  ?  Should there be an optional
package outside the main distribution?
Ship with Unicode encodings in the core, the rest should be an add on
package.

If we are truly Unicode-centric, this gives us the most value in terms of
accessing a Unicode character properties database, which will provide
language neutral case folding, Hankaku <----> Zenkaku folding (Japan
specific), and composition / normalisation between composed characters and
their component nonspacing characters.

Regards,
Mike da Silva



From captainrobbo at yahoo.com  Mon Nov 15 17:18:13 1999
From: captainrobbo at yahoo.com (Andy Robinson)
Date: Mon, 15 Nov 1999 08:18:13 -0800 (PST)
Subject: [Python-Dev] just say no...
Message-ID: <19991115161813.13111.rocketmail@web606.mail.yahoo.com>

--- Guido van Rossum  wrote:

> Did you read Andy Robinson's case study?  He 
> suggested that for certain encodings there may be 
> other things you can do that are more
> user-friendly than raising an exception, depending
> on the application. I am proposing to leave this a
> detail of each specific translation.
> There may even be translations that do the same
> thing
> except they have a different behavior for 
> untranslatable cases -- e.g. a strict version
> that raises an exception and a non-strict version
> that replaces bad characters with '?'.  I think this
> is one of the powers of having an extensible set of 
> encodings.

This would be a desirable option in almost every case.
 Default is an exception (I want to know my data is
not clean), but an option to specify an error
character.  It is usually a question mark but Mike
tells me that some encodings specify the error
character to use.  

Example - I query a Sybase Unicode database containing
European accents or Japanese.  By default it will give
me question marks.  If I issue the command 'set
char_convert utf8', then I see the lot (as garbage,
but never mind).  If it always errored whenever a
query result contained unexpected data, it would be
almost impossible to maintain the database.

If I wrote my own codec class for a family of
encodings, I'd give it an even wider variety of
error-logging options - maybe a mode where it told me
where in the file the dodgy characters were.

We've already taken the key step by allowing codecs to
be separate objects registered at run-time,
implemented in either C or Python.  This means that
once again Python will have the most flexible solution
around.

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From jim at digicool.com  Mon Nov 15 17:29:13 1999
From: jim at digicool.com (Jim Fulton)
Date: Mon, 15 Nov 1999 11:29:13 -0500
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com>
Message-ID: <383034D9.6E1E74D4@digicool.com>

"A.M. Kuchling" wrote:
> 
> I noticed this in PyErr_Format(exception, format, va_alist):
> 
>         char buffer[500]; /* Caller is responsible for limiting the format */
>         ...
>         vsprintf(buffer, format, vargs);
> 
> Making the caller responsible for this is error-prone.  The danger, of
> course, is a buffer overflow caused by generating an error string
> that's larger than the buffer, possibly letting people execute
> arbitrary code.  We could add a test to the configure script for
> vsnprintf() and use it when possible, but that only fixes the problem
> on platforms which have it.  Can we find an implementation of
> vsnprintf() someplace?

I would prefer to see a different interface altogether:

  PyObject *PyErr_StringFormat(errtype, format, buildformat, ...)

So, you could generate an error like this:

  return PyErr_StringFormat(ErrorObject, 
     "You had too many, %d, foos. The last one was %s", 
     "iO", n, someObject)

I implemented this in cPickle. See cPickle_ErrFormat.
(Note that it always returns NULL.)

Jim

--
Jim Fulton           mailto:jim at digicool.com   Python Powered!        
Technical Director   (888) 344-4332            http://www.python.org  
Digital Creations    http://www.digicool.com   http://www.zope.org    

Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email
address may not be added to any commercial mail list with out my
permission.  Violation of my privacy with advertising or SPAM will
result in a suit for a MINIMUM of $500 damages/incident, $1500 for
repeats.



From bwarsaw at cnri.reston.va.us  Mon Nov 15 17:54:10 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Mon, 15 Nov 1999 11:54:10 -0500 (EST)
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com>
	<199911151523.KAA27163@eric.cnri.reston.va.us>
Message-ID: <14384.15026.392781.151886@anthem.cnri.reston.va.us>

>>>>> "Guido" == Guido van Rossum  writes:

    Guido> Assuming that Linux and Solaris have vsnprintf(), can't we
    Guido> just use the configure script to detect it, and issue a
    Guido> warning blaming the platform for those platforms that don't
    Guido> have it?  That seems much simpler (from a maintenance
    Guido> perspective) than carrying our own implementation around
    Guido> (even if we can borrow the Apache version).

Mailman uses vsnprintf in its C wrapper.  There's a simple configure
test...

# Checks for library functions.
AC_CHECK_FUNCS(vsnprintf)

...and for systems that don't have a vsnprintf, I modified a version
from GNU screen.  It may not have gone through the scrutiny of
Apache's implementation, but for Mailman it was more important that it
be GPL'd (not a Python requirement).

-Barry



From jim at digicool.com  Mon Nov 15 17:56:38 1999
From: jim at digicool.com (Jim Fulton)
Date: Mon, 15 Nov 1999 11:56:38 -0500
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com>
		<199911151523.KAA27163@eric.cnri.reston.va.us> <14384.10383.718373.432606@amarok.cnri.reston.va.us>
Message-ID: <38303B46.F6AEEDF1@digicool.com>

"Andrew M. Kuchling" wrote:
> 
> Guido van Rossum writes:
> >Assuming that Linux and Solaris have vsnprintf(), can't we just use
> >the configure script to detect it, and issue a warning blaming the
> >platform for those platforms that don't have it?  That seems much
> 
> But people using an already-installed Python binary won't see any such
> configure-time warning, and won't find out about the potential
> problem.  Plus, how do people fix the problem on platforms that don't
> have vsnprintf() -- switch to Solaris or Linux?  Not much of a
> solution.  (vsnprintf() isn't ANSI C, though it's a common extension,
> so platforms that lack it aren't really deficient.)
> 
> Hmm... could we maybe use Python's existing (string % vars) machinery?
>  No, that seems to be hard, because it would want
> PyObjects, and we can't know what Python types to convert the varargs
> to, unless we parse the format string (at which point we may as well
> get a vsnprintf() implementation.

It's easy. You use two format strings. One a Python string format, 
and the other a Py_BuildValue format. See my other note.

Jim


--
Jim Fulton           mailto:jim at digicool.com   Python Powered!        
Technical Director   (888) 344-4332            http://www.python.org  
Digital Creations    http://www.digicool.com   http://www.zope.org    

Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email
address may not be added to any commercial mail list with out my
permission.  Violation of my privacy with advertising or SPAM will
result in a suit for a MINIMUM of $500 damages/incident, $1500 for
repeats.



From tismer at appliedbiometrics.com  Mon Nov 15 18:02:20 1999
From: tismer at appliedbiometrics.com (Christian Tismer)
Date: Mon, 15 Nov 1999 18:02:20 +0100
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us>
Message-ID: <38303C9C.42C5C830@appliedbiometrics.com>


Guido van Rossum wrote:
> 
> > I noticed this in PyErr_Format(exception, format, va_alist):
> >
> >       char buffer[500]; /* Caller is responsible for limiting the format */
> >       ...
> >       vsprintf(buffer, format, vargs);
> >
> > Making the caller responsible for this is error-prone.
> 
> Agreed.  The limit of 500 chars, while technically undocumented, is
> part of the specs for PyErr_Format (which is currently wholly
> undocumented).  The current callers all have explicit precautions, but
> of course I agree that this is a potential danger.

All but one (checked them all):
In ceval.c, function call_builtin, there is a possible security hole.
If an extension module happens to create a very long type name
(maybe just via a bug), we will crash.

	}
	PyErr_Format(PyExc_TypeError, "call of non-function (type %s)",
		     func->ob_type->tp_name);
	return NULL;
}

ciao - chris

-- 
Christian Tismer             :^)   
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101   :    *Starship* http://starship.python.net
10553 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home



From guido at CNRI.Reston.VA.US  Mon Nov 15 20:32:00 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 14:32:00 -0500
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: Your message of "Mon, 15 Nov 1999 18:02:20 +0100."
             <38303C9C.42C5C830@appliedbiometrics.com> 
References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us>  
            <38303C9C.42C5C830@appliedbiometrics.com> 
Message-ID: <199911151932.OAA28008@eric.cnri.reston.va.us>

> All but one (checked them all):

Thanks for checking.

> In ceval.c, function call_builtin, there is a possible security hole.
> If an extension module happens to create a very long type name
> (maybe just via a bug), we will crash.
> 
> 	}
> 	PyErr_Format(PyExc_TypeError, "call of non-function (type %s)",
> 		     func->ob_type->tp_name);
> 	return NULL;
> }

I would think that an extension module with a name of nearly 500
characters would draw a lot of attention as being ridiculous.  If
there was a bug through which you could make tp_name point to such a
long string, you could probably exploit that bug without having to use
this particular PyErr_Format() statement.

However, I agree it's better to be safe than sorry, so I've checked in
a fix making it %.400s.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From tismer at appliedbiometrics.com  Mon Nov 15 20:41:14 1999
From: tismer at appliedbiometrics.com (Christian Tismer)
Date: Mon, 15 Nov 1999 20:41:14 +0100
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us>  
	            <38303C9C.42C5C830@appliedbiometrics.com> <199911151932.OAA28008@eric.cnri.reston.va.us>
Message-ID: <383061DA.CA5CB373@appliedbiometrics.com>


Guido van Rossum wrote:
> 
> > All but one (checked them all):

[ceval.c without limits]

> I would think that an extension module with a name of nearly 500
> characters would draw a lot of attention as being ridiculous.  If
> there was a bug through which you could make tp_name point to such a
> long string, you could probably exploit that bug without having to use
> this particular PyErr_Format() statement.

Of course this case is very unlikely.
My primary intent was to create such a mess without
an extension, and ExtensionClass seemed to be a candidate since
it synthesizes a type name at runtime (!).
This would have been dangerous since EC is in the heart of Zope.

But, I could not get at this special case since EC always
passes the class/instance checks and so this case can never happen :(

The above lousy result was just to say *something* after no success.

> However, I agree it's better to be safe than sorry, so I've checked in
> a fix making it %.400s.

cheap, consistent, fine - thanks - chris

-- 
Christian Tismer             :^)   
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101   :    *Starship* http://starship.python.net
10553 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home



From mal at lemburg.com  Mon Nov 15 20:04:59 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 20:04:59 +0100
Subject: [Python-Dev] just say no...
References:  <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us>  
	            <382F33AA.C3EE825A@lemburg.com> <199911151550.KAA27188@eric.cnri.reston.va.us>
Message-ID: <3830595B.348E8CC7@lemburg.com>

Guido van Rossum wrote:
> 
> [Misunderstanding in the reasoning behind "t#" and "s#"]
> 
> Thanks for not picking an argument.  Multibyte encodings typically
> have ASCII as a subset (in such a way that an ASCII string is
> represented as itself in bytes).  This is the characteristic that's
> needed in my view.
> 
> > It was my understanding that "t#" refers to single byte character
> > data. That's where the above arguments were aiming at...
> 
> t# refers to byte-encoded data.  Multibyte encodings are explicitly
> designed to be passed cleanly through processing steps that handle
> single-byte character data, as long as they are 8-bit clean and don't
> do too much processing.

Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
"8-bit clean" as you obviously did.
 
> > Perhaps I'm missing something...
> 
> The idea is that (1)/s# disallows any translation of the data, while
> (2)/t# requires translation of the data to an ASCII superset (possibly
> multibyte, such as UTF-8 or shift-JIS).  (2)/t# assumes that the data
> contains text and that if the text consists of only ASCII characters
> they are represented as themselves.  (1)/s# makes no such assumption.
> 
> In terms of implementation, Unicode objects should translate
> themselves to the default encoding for t# (if possible), but they
> should make the native representation available for s#.
> 
> For example, take an encryption engine.  While it is defined in terms
> of byte streams, there's no requirement that the bytes represent
> characters -- they could be the bytes of a GIF file, an MP3 file, or a
> gzipped tar file.  If we pass Unicode to an encryption engine, we want
> Unicode to come out at the other end, not UTF-8.  (If we had wanted to
> encrypt UTF-8, we should have fed it UTF-8.)
> 
> > > Note that the definition of the 's' format was left alone -- as
> > > before, it means you need an 8-bit text string not containing null
> > > bytes.
> >
> > This definition should then be changed to "text string without
> > null bytes" dropping the 8-bit reference.
> 
> Aha, I think there's a confusion about what "8-bit" means.  For me, a
> multibyte encoding like UTF-8 is still 8-bit.  Am I alone in this?
> (As far as I know, C uses char* to represent multibyte characters.)
> Maybe we should disambiguate it more explicitly?

There should be some definition for the two markers and the
ideas behind them in the API guide, I guess.
 
> > Hmm, I would strongly object to making "s#" return the internal
> > format. file.write() would then default to writing UTF-16 data
> > instead of UTF-8 data. This could result in strange errors
> > due to the UTF-16 format being endian dependent.
> 
> But this was the whole design.  file.write() needs to be changed to
> use s# when the file is open in binary mode and t# when the file is
> open in text mode.

Ok, that would make the situation a little clearer (even though
I expect the two different encodings to produce some FAQs). 

I still don't feel very comfortable about the fact that all
existing APIs using "s#" will suddenly receive UTF-16 data if
being passed Unicode objects: this probably won't get us the
"magical" Unicode integration we envision, since "t#" usage is not
very widespread and character handling code will probably not
work well with UTF-16 encoded strings.

Anyway, we should probably try out both methods...

> > It would also break the symmetry between file.write(u) and
> > unicode(file.read()), since the default encoding is not used as
> > internal format for other reasons (see proposal).
> 
> If the file is encoded using UTF-16 or UCS-2, you should open it in
> binary mode and use unicode(file.read(), 'utf-16').  (Or perhaps the
> app should read the first 2 bytes and check for a BOM and then decide
> to choose between 'utf-16-be' and 'utf-16-le'.)

Right, that's the idea (there is a note on this in the Standard
Codec section of the proposal).
 
> > > Any of the following choices is acceptable (from the point of view of
> > > not breaking the intended t# semantics; we can now start deciding
> > > which we like best):
> >
> > I think we have already agreed on using UTF-8 for the default
> > encoding. It has quite a few advantages. See
> >
> >       http://czyborra.com/utf/
> >
> > for a good overview of the pros and cons.
> 
> Of course.  I was just presenting the list as an argument that if
> we changed our mind about the default encoding, t# should follow the
> default encoding (and not pick an encoding by other means).

Ok.
 
> > > - utf-8
> > > - latin-1
> > > - ascii
> > > - shift-jis
> > > - lower byte of unicode ordinal
> > > - some user- or os-specified multibyte encoding
> > >
> > > As far as t# is concerned, for encodings that don't encode all of
> > > Unicode, untranslatable characters could be dealt with in any number
> > > of ways (raise an exception, ignore, replace with '?', make best
> > > effort, etc.).
> >
> > The usual Python way would be: raise an exception. This is what
> > the proposal defines for Codecs in case an encoding/decoding
> > mapping is not possible, BTW. (UTF-8 will always succeed on
> > output.)
> 
> Did you read Andy Robinson's case study?  He suggested that for
> certain encodings there may be other things you can do that are more
> user-friendly than raising an exception, depending on the application.
> I am proposing to leave this a detail of each specific translation.
> There may even be translations that do the same thing except they have
> a different behavior for untranslatable cases -- e.g. a strict version
> that raises an exception and a non-strict version that replaces bad
> characters with '?'.  I think this is one of the powers of having an
> extensible set of encodings.

Agreed, the Codecs should decide for themselves what to do. I'll
add a note to the next version of the proposal.
 
> > > Given the current context, it should probably be the same as the
> > > default encoding -- i.e., utf-8.  If we end up making the default
> > > user-settable, we'll have to decide what to do with untranslatable
> > > characters -- but that will probably be decided by the user too (it
> > > would be a property of a specific translation specification).
> > >
> > > In any case, I feel that t# could receive a multi-byte encoding,
> > > s# should receive raw binary data, and they should correspond to
> > > getcharbuffer and getreadbuffer, respectively.
> >
> > Why would you want to have "s#" return the raw binary data for
> > Unicode objects ?
> 
> Because file.write() for a binary file, and other similar things
> (e.g. the encryption engine example I mentioned above) must have
> *some* way to get at the raw bits.

What for ? Any lossless encoding should do the trick... UTF-8
is just as good as UTF-16 for binary files; plus it's more compact
for ASCII data. I don't really see a need to get explicitly
at the internal data representation because both encodings are
in fact "internal" w/r to Unicode objects.

The only argument I can come up with is that using UTF-16 for
binary files could (possibly) eliminate the UTF-8 conversion step
which is otherwise always needed.
 
> > Note that it is not mentioned anywhere that
> > "s#" and "t#" do have to necessarily return different things
> > (binary being a superset of text). I'd opt for "s#" and "t#" both
> > returning UTF-8 data. This can be implemented by delegating the
> > buffer slots to the  object (see below).
> 
> This would defeat the whole purpose of introducing t#.  We might as
> well drop t# then altogether if we adopt this.

Well... yes ;-)
 
> > > > Now Greg would chime in with the buffer interface and
> > > > argue that it should make the underlying internal
> > > > format accessible. This is a bad idea, IMHO, since you
> > > > shouldn't really have to know what the internal data format
> > > > is.
> > >
> > > This is for C code.  Quite likely it *does* know what the internal
> > > data format is!
> >
> > C code can use the PyUnicode_* APIs to access the data. I
> > don't think that argument parsing is powerful enough to
> > provide the C code with enough information about the data
> > contents, e.g. it can only state the encoding length, not the
> > string length.
> 
> Typically, all the C code does is pass multibyte encoded strings on to
> other library routines that know what to do to them, or simply give
> them back unchanged at a later time.  It is essential to know the
> number of bytes, for memory allocation purposes.  The number of
> characters is totally immaterial (and multibyte-handling code knows
> how to calculate the number of characters anyway).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Mon Nov 15 20:20:55 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 20:20:55 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>
Message-ID: <38305D17.60EC94D0@lemburg.com>

Andy Robinson wrote:
> 
> Some thoughts on the codecs...
> 
> 1. Stream interface
> At the moment a codec has dump and load methods which
> read a (slice of a) stream into a string in memory and
> vice versa.  As the proposal notes, this could lead to
> errors if you take a slice out of a stream.   This is
> not just due to character truncation; some Asian
> encodings are modal and have shift-in and shift-out
> sequences as they move from Western single-byte
> characters to double-byte ones.   It also seems a bit
> pointless to me as the source (or target) is still a
> Unicode string in memory.
> 
> This is a real problem - a filter to convert big files
> between two encodings should be possible without
> knowledge of the particular encoding, as should one on
> the input/output of some server.  We can still give a
> default implementation for single-byte encodings.
> 
> What's a good API for real stream conversion?   just
> Codec.encodeStream(infile, outfile)  ?  or is it more
> useful to feed the codec with data a chunk at a time?

The idea was to use Unicode as the intermediate for all
encoding conversions. 

What you envision here are stream recoders. They can
easily be implemented as a useful addition to the Codec
subclasses, but I don't think that these have to go
into the core.
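
The unoptimized route is then just decode-then-encode through Unicode
(sketch, assuming the proposed unicode() constructor and an encode()
method that takes an encoding name):

def recode(data, from_enc, to_enc):
    # Everything is piped through Unicode as the pivot; no direct
    # EncodingA -> EncodingB tables are needed in the core.
    return unicode(data, from_enc).encode(to_enc)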
 
> 2. Data driven codecs
> I really like codecs being objects, and believe we
> could build support for a lot more encodings, a lot
> sooner than is otherwise possible, by making them data
> > driven rather than making each one compiled C code with
> static mapping tables.  What do people think about the
> approach below?
> 
> First of all, the ISO8859-1 series are straight
> mappings to Unicode code points.  So one Python script
> could parse these files and build the mapping table,
> and a very small data file could hold these encodings.
>   A compiled helper function analogous to
> string.translate() could deal with most of them.

The problem with these large tables is that currently
Python modules are not shared among processes since
every process builds its own table.

Static C data has the advantage of being shareable at
the OS level.

You can of course implement Python-based lookup tables,
but these would be too large...
 
> Secondly, the double-byte ones involve a mixture of
> algorithms and data.  The worst cases I know are modal
> encodings which need a single-byte lookup table, a
> double-byte lookup table, and have some very simple
> rules about escape sequences in between them.  A
> simple state machine could still handle these (and the
> single-byte mappings above become extra-simple special
> cases); I could imagine feeding it a totally
> data-driven set of rules.
> 
> Third, we can massively compress the mapping tables
> using a notation which just lists contiguous ranges;
> and very often there are relationships between
> encodings.  For example, "cpXYZ is just like cpXYY but
> with an extra 'smiley' at 0XFE32".  In these cases, a
> script can build a family of related codecs in an
> auditable manner.

These are all great ideas, but I think they unnecessarily
complicate the proposal.
 
> 3. What encodings to distribute?
> The only clean answers to this are 'almost none', or
> 'everything that Unicode 3.0 has a mapping for'.  The
> latter is going to add some weight to the
> distribution.  What are people's feelings?  Do we ship
> any at all apart from the Unicode ones?  Should new
> encodings be downloadable from www.python.org?  Should
> there be an optional package outside the main
> distribution?

Since Codecs can be registered at runtime, there is quite
some potential there for extension writers coding their
own fast codecs. E.g. one could use mxTextTools as codec
engine working at C speeds.

I would propose to only add some very basic encodings to
the standard distribution, e.g. the ones mentioned under
Standard Codecs in the proposal:

  'utf-8':		8-bit variable length encoding
  'utf-16':		16-bit variable length encoding (little/big endian)
  'utf-16-le':		utf-16 but explicitly little endian
  'utf-16-be':		utf-16 but explicitly big endian
  'ascii':		7-bit ASCII codepage
  'latin-1':		Latin-1 codepage
  'html-entities':	Latin-1 + HTML entities;
			see htmlentitydefs.py from the standard Python Lib
  'jis' (a popular version XXX):
			Japanese character encoding
  'unicode-escape':	See Unicode Constructors for a definition
  'native':		Dump of the Internal Format used by Python

Perhaps not even 'html-entities' (even though it would make
a cool replacement for cgi.escape()) and maybe we should
also place the JIS encoding into a separate Unicode package.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Mon Nov 15 20:26:16 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 20:26:16 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: 
Message-ID: <38305E58.28B20E24@lemburg.com>

"Da Silva, Mike" wrote:
> 
> Andy Robinson wrote:
> --
> 1.      Stream interface
> At the moment a codec has dump and load methods which read a (slice of a)
> stream into a string in memory and vice versa.  As the proposal notes, this
> could lead to errors if you take a slice out of a stream.   This is not just
> due to character truncation; some Asian encodings are modal and have
> shift-in and shift-out sequences as they move from Western single-byte
> characters to double-byte ones.   It also seems a bit pointless to me as the
> source (or target) is still a Unicode string in memory.
> This is a real problem - a filter to convert big files between two encodings
> should be possible without knowledge of the particular encoding, as should
> one on the input/output of some server.  We can still give a default
> implementation for single-byte encodings.
> What's a good API for real stream conversion?   just
> Codec.encodeStream(infile, outfile)  ?  or is it more useful to feed the
> codec with data a chunk at a time?
> --
> A user defined chunking factor (suitably defaulted) would be useful for
> processing large files.
> --
> 2.      Data driven codecs
> I really like codecs being objects, and believe we could build support for a
> lot more encodings, a lot sooner than is otherwise possible, by making them
> data driven rather than making each one compiled C code with static mapping
> tables.  What do people think about the approach below?
> First of all, the ISO8859-1 series are straight mappings to Unicode code
> points.  So one Python script could parse these files and build the mapping
> table, and a very small data file could hold these encodings.  A compiled
> helper function analogous to string.translate() could deal with most of
> them.
> Secondly, the double-byte ones involve a mixture of algorithms and data.
> The worst cases I know are modal encodings which need a single-byte lookup
> table, a double-byte lookup table, and have some very simple rules about
> escape sequences in between them.  A simple state machine could still handle
> these (and the single-byte mappings above become extra-simple special
> cases); I could imagine feeding it a totally data-driven set of rules.
> Third, we can massively compress the mapping tables using a notation which
> just lists contiguous ranges; and very often there are relationships between
> encodings.  For example, "cpXYZ is just like cpXYY but with an extra
> 'smiley' at 0XFE32".  In these cases, a script can build a family of related
> codecs in an auditable manner.
> --
> The problem here is that we need to decide whether we are Unicode-centric,
> or whether Unicode is just another supported encoding. If we are
> Unicode-centric, then all code-page translations will require static mapping
> tables between the appropriate Unicode character and the relevant code
> points in the other encoding.  This would involve (worst case) 64k static
> tables for each supported encoding.  Unfortunately this also precludes the
> use of algorithmic conversions and/or sparse conversion tables because most
> of these transformations are relative to a source and target non-Unicode
> encoding, eg JIS <---->EUCJIS.  If we are taking the IBM approach (see
> CDRA), then we can mix and match approaches, and treat Unicode strings as
> just Unicode, and normal strings as being any arbitrary MBCS encoding.
> 
> To guarantee the utmost interoperability and Unicode 3.0 (and beyond)
> compliance, we should probably assume that all core encodings are relative
> to Unicode as the pivot encoding.  This should hopefully avoid any gotcha's
> with roundtrips between any two arbitrary native encodings.  The downside is
> this will probably be slower than an optimised algorithmic transformation.

Optimizations should go into separate packages for direct EncodingA
-> EncodingB conversions. I don't think we need them in the core.

> --
> 3.      What encodings to distribute?
> The only clean answers to this are 'almost none', or 'everything that
> Unicode 3.0 has a mapping for'.  The latter is going to add some weight to
> the distribution.  What are people's feelings?  Do we ship any at all apart
> from the Unicode ones?  Should new encodings be downloadable from
> www.python.org  ?  Should there be an optional
> package outside the main distribution?
> --
> Ship with Unicode encodings in the core, the rest should be an add on
> package.
> 
> If we are truly Unicode-centric, this gives us the most value in terms of
> accessing a Unicode character properties database, which will provide
> language neutral case folding, Hankaku <----> Zenkaku folding (Japan
> specific), and composition / normalisation between composed characters and
> their component nonspacing characters.

From the proposal:

"""
Unicode Character Properties:
-----------------------------

A separate module "unicodedata" should provide a compact interface to
all Unicode character properties defined in the standard's
UnicodeData.txt file.

Among other things, these properties provide ways to recognize
numbers, digits, spaces, whitespace, etc.

Since this module will have to provide access to all Unicode
characters, it will eventually have to contain the data from
UnicodeData.txt which takes up around 200kB. For this reason, the data
should be stored in static C data. This enables compilation as shared
module which the underlying OS can share between processes (unlike
normal Python code modules).

XXX Define the interface...

"""

Special CJK packages can then access this data for the purposes
you mentioned above.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From guido at CNRI.Reston.VA.US  Mon Nov 15 22:37:28 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 16:37:28 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: Your message of "Mon, 15 Nov 1999 20:20:55 +0100."
             <38305D17.60EC94D0@lemburg.com> 
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>  
            <38305D17.60EC94D0@lemburg.com> 
Message-ID: <199911152137.QAA28280@eric.cnri.reston.va.us>

> Andy Robinson wrote:
> > 
> > Some thoughts on the codecs...
> > 
> > 1. Stream interface
> > At the moment a codec has dump and load methods which
> > read a (slice of a) stream into a string in memory and
> > vice versa.  As the proposal notes, this could lead to
> > errors if you take a slice out of a stream.   This is
> > not just due to character truncation; some Asian
> > encodings are modal and have shift-in and shift-out
> > sequences as they move from Western single-byte
> > characters to double-byte ones.   It also seems a bit
> > pointless to me as the source (or target) is still a
> > Unicode string in memory.
> > 
> > This is a real problem - a filter to convert big files
> > between two encodings should be possible without
> > knowledge of the particular encoding, as should one on
> > the input/output of some server.  We can still give a
> > default implementation for single-byte encodings.
> > 
> > What's a good API for real stream conversion?   just
> > Codec.encodeStream(infile, outfile)  ?  or is it more
> > useful to feed the codec with data a chunk at a time?

M.-A. Lemburg responds:

> The idea was to use Unicode as the intermediate for all
> encoding conversions.
> 
> What you envision here are stream recoders. They can
> easily be implemented as a useful addition to the Codec
> subclasses, but I don't think that these have to go
> into the core.

What I wanted was a codec API that acts somewhat like a buffered file;
the buffer makes it possible to efficiently handle shift states.  This
is not exactly what Andy shows, but it's not what Marc's current spec
has either.

I had thought something more like what Java does: an output stream
codec's constructor takes a writable file object and the object
returned by the constructor has a write() method, a flush() method and
a close() method.  It acts like a buffering interface to the
underlying file; this allows it to generate the minimal number of
shift sequences.  Similar for input stream codecs.

Andy's file translation example could then be written as follows:

# assuming variables input_file, input_encoding, output_file,
# output_encoding, and constant BUFFER_SIZE

f = open(input_file, "rb")
f1 = unicodec.codecs[input_encoding].stream_reader(f)
g = open(output_file, "wb")
g1 = unicodec.codecs[output_encoding].stream_writer(g)

while 1:
      buffer = f1.read(BUFFER_SIZE)
      if not buffer:
	 break
      g1.write(buffer)

g1.close()
f1.close()

Note that we could possibly make these the only API that a codec needs
to provide; the string object <--> unicode object conversions can be
done using this and the cStringIO module.  (On the other hand it seems
a common case that would be quite useful.)
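
E.g. something like this (sketch only, using the same hypothetical
unicodec registry as above):

import cStringIO

def decode_string(data, encoding):
    # Wrap the byte string in a file-like object and reuse the
    # stream reader to get a Unicode object back.
    f = cStringIO.StringIO(data)
    reader = unicodec.codecs[encoding].stream_reader(f)
    return reader.read()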

> > 2. Data driven codecs
> > I really like codecs being objects, and believe we
> > could build support for a lot more encodings, a lot
> > sooner than is otherwise possible, by making them data
> > driven rather than making each one compiled C code with
> > static mapping tables.  What do people think about the
> > approach below?
> > 
> > First of all, the ISO8859-1 series are straight
> > mappings to Unicode code points.  So one Python script
> > could parse these files and build the mapping table,
> > and a very small data file could hold these encodings.
> >   A compiled helper function analogous to
> > string.translate() could deal with most of them.
> 
> The problem with these large tables is that currently
> Python modules are not shared among processes since
> every process builds its own table.
> 
> Static C data has the advantage of being shareable at
> the OS level.

Don't worry about it.  128K is too small to care, I think...

> You can of course implement Python-based lookup tables,
> but these would be too large...
>  
> > Secondly, the double-byte ones involve a mixture of
> > algorithms and data.  The worst cases I know are modal
> > encodings which need a single-byte lookup table, a
> > double-byte lookup table, and have some very simple
> > rules about escape sequences in between them.  A
> > simple state machine could still handle these (and the
> > single-byte mappings above become extra-simple special
> > cases); I could imagine feeding it a totally
> > data-driven set of rules.
> > 
> > Third, we can massively compress the mapping tables
> > using a notation which just lists contiguous ranges;
> > and very often there are relationships between
> > encodings.  For example, "cpXYZ is just like cpXYY but
> > with an extra 'smiley' at 0XFE32".  In these cases, a
> > script can build a family of related codecs in an
> > auditable manner.
> 
> These are all great ideas, but I think they unnecessarily
> complicate the proposal.

Agreed, let's leave the *implementation* of codecs out of the current
efforts.

However I want to make sure that the *interface* to codecs is defined
right, because changing it will be expensive.  (This is Linus
Torvalds' philosophy on drivers -- he doesn't care about bugs in
drivers, as they will get fixed; however he greatly cares about
defining the driver APIs correctly.)

> > 3. What encodings to distribute?
> > The only clean answers to this are 'almost none', or
> > 'everything that Unicode 3.0 has a mapping for'.  The
> > latter is going to add some weight to the
> > distribution.  What are people's feelings?  Do we ship
> > any at all apart from the Unicode ones?  Should new
> > encodings be downloadable from www.python.org?  Should
> > there be an optional package outside the main
> > distribution?
> 
> Since Codecs can be registered at runtime, there is quite
> some potential there for extension writers coding their
> own fast codecs. E.g. one could use mxTextTools as codec
> engine working at C speeds.

(Do you think you'll be able to extort some money from HP for these? :-)

> I would propose to only add some very basic encodings to
> the standard distribution, e.g. the ones mentioned under
> Standard Codecs in the proposal:
> 
>   'utf-8':		8-bit variable length encoding
>   'utf-16':		16-bit variable length encoding (little/big endian)
>   'utf-16-le':		utf-16 but explicitly little endian
>   'utf-16-be':		utf-16 but explicitly big endian
>   'ascii':		7-bit ASCII codepage
>   'latin-1':		Latin-1 codepage
>   'html-entities':	Latin-1 + HTML entities;
> 			see htmlentitydefs.py from the standard Python Lib
>   'jis' (a popular version XXX):
> 			Japanese character encoding
>   'unicode-escape':	See Unicode Constructors for a definition
>   'native':		Dump of the Internal Format used by Python
> 
> Perhaps not even 'html-entities' (even though it would make
> a cool replacement for cgi.escape()) and maybe we should
> also place the JIS encoding into a separate Unicode package.

I'd drop html-entities, it seems too cutesie.  (And who uses these
anyway, outside browsers?)

For JIS (shift-JIS?) I hope that Andy can help us with some pointers
and validation.

And unicode-escape: now that you mention it, this is a section of
the proposal that I don't understand.  I quote it here:

| Python should provide a built-in constructor for Unicode strings which
| is available through __builtins__:
| 
|   u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

What do you mean by this notation?  Since encoding names are not
always legal Python identifiers (most contain hyphens), I don't
understand what you really meant here.  Do you mean to say that it has
to be a keyword argument?  I would disagree; and then I would have
expected the notation [,encoding=].

| With the 'unicode-escape' encoding being defined as:
| 
|   u = u'<unicode-escape encoded Python string>'
| 
| ? for single characters (and this includes all \XXX sequences except \uXXXX),
|   take the ordinal and interpret it as Unicode ordinal;
| 
| ? for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX 
|   instead, e.g. \u03C0 to represent the character Pi.

I've looked at this several times and I don't see the difference
between the two bullets.  (Ironically, you are using a non-ASCII
character here that doesn't always display, depending on where I look
at your mail :-).

Can you give some examples?

Is u'\u0020' different from u'\x20' (a space)?

Does '\u0020' (no u prefix) have a meaning?

Also, I remember reading Tim Peters who suggested that a "raw unicode"
notation (ur"...") might be necessary, to encode regular expressions.
I tend to agree.

While I'm on the topic, I don't see in your proposal a description of
the source file character encoding.  Currently, this is undefined, and
in fact can be (ab)used to enter non-ASCII in string literals.  For
example, a programmer named François might write a file containing
this statement:

  print "Written by François." # (There's a cedilla in there!)

(He assumes his source character encoding is Latin-1, and he doesn't
want to have to type \347 when he can type a cedilla on his keyboard.)

If his source file (or .pyc file!)  is executed by a Japanese user,
this will probably print some garbage.

Using the new Unicode strings, François could change his program as
follows:

  print unicode("Written by François.", "latin-1")

Assuming that François sets his sys.stdout to use Latin-1, while the
Japanese user sets his to shift-JIS (or whatever his kanjiterm uses).

But when the Japanese user views François' source file, he will again
see garbage.  If he uses a generic tool to translate latin-1 files to
shift-JIS (assuming shift-JIS has a cedilla character) the program
will no longer work correctly -- the string "latin-1" has to be
changed to "shift-jis".

What should we do about this?  The safest and most radical solution is
to disallow non-ASCII source characters; François will then have to
type

  print u"Written by Fran\u00E7ois."

but, knowing François, he probably won't like this solution very much
(since he didn't like the \347 version either).

--Guido van Rossum (home page: http://www.python.org/~guido/)



From andy at robanal.demon.co.uk  Mon Nov 15 22:41:21 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Mon, 15 Nov 1999 21:41:21 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <38305D17.60EC94D0@lemburg.com>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com>
Message-ID: <38307984.12653394@post.demon.co.uk>

On Mon, 15 Nov 1999 20:20:55 +0100, you wrote:

>These are all great ideas, but I think they unnecessarily
>complicate the proposal.

However, to claim that Python is properly internationalized, we will
need a large number of multi-byte encodings to be available.  It's a
large amount of work, it must be provably correct, and someone's going
to have to do it.  So if anyone with more C expertise than me - not
hard :-) - is interested ...

I'm not suggesting putting my points in the Unicode proposal - in
fact, I'm very happy we have a proposal which allows for extension,
and lets us work on the encodings separately (and later).

>Since Codecs can be registered at runtime, there is quite
>some potential there for extension writers coding their
>own fast codecs. E.g. one could use mxTextTools as codec
>engine working at C speeds.
Exactly my thoughts, although I was thinking of a more slimmed down
and specialized one.  The right tool might be usable for things like
compression algorithms too.  Separate project to the Unicode stuff,
but if anyone is interested, talk to me.

>I would propose to only add some very basic encodings to
>the standard distribution, e.g. the ones mentioned under
>Standard Codecs in the proposal:
>
>  'utf-8':		8-bit variable length encoding
>  'utf-16':		16-bit variable length encoding (little/big endian)
>  'utf-16-le':		utf-16 but explicitly little endian
>  'utf-16-be':		utf-16 but explicitly big endian
>  'ascii':		7-bit ASCII codepage
>  'latin-1':		Latin-1 codepage
>  'html-entities':	Latin-1 + HTML entities;
>			see htmlentitydefs.py from the standard Python Lib
>  'jis' (a popular version XXX):
>			Japanese character encoding
>  'unicode-escape':	See Unicode Constructors for a definition
>  'native':		Dump of the Internal Format used by Python
>
Leave JISXXX and the CJK stuff out.  If you get into Japanese, you
really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there
are lots of options about how to do it.  The other ones are
algorithmic and can be small and fast and fit into the core.

Ditto with HTML, and maybe even escaped-unicode too.

In summary, the current discussion is clearly doing the right things,
but is only covering a small percentage of what needs to be done to
internationalize Python fully.

- Andy




From guido at CNRI.Reston.VA.US  Mon Nov 15 22:49:26 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 16:49:26 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: Your message of "Mon, 15 Nov 1999 21:41:21 GMT."
             <38307984.12653394@post.demon.co.uk> 
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com>  
            <38307984.12653394@post.demon.co.uk> 
Message-ID: <199911152149.QAA28345@eric.cnri.reston.va.us>

> In summary, the current discussion is clearly doing the right things,
> but is only covering a small percentage of what needs to be done to
> internationalize Python fully.

Agreed.  So let's focus on defining interfaces that are correct and
convenient so others who want to add codecs won't have to fight our
architecture!

Is the current architecture good enough so that the Japanese codecs
will fit in it?  (I'm particularly worried about the stream codecs,
see my previous message.)

--Guido van Rossum (home page: http://www.python.org/~guido/)




From andy at robanal.demon.co.uk  Mon Nov 15 22:58:34 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Mon, 15 Nov 1999 21:58:34 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <199911152149.QAA28345@eric.cnri.reston.va.us>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com>   <38307984.12653394@post.demon.co.uk> <199911152149.QAA28345@eric.cnri.reston.va.us>
Message-ID: <3831806d.14422147@post.demon.co.uk>

On Mon, 15 Nov 1999 16:49:26 -0500, you wrote:

>> In summary, the current discussion is clearly doing the right things,
>> but is only covering a small percentage of what needs to be done to
>> internationalize Python fully.
>
>Agreed.  So let's focus on defining interfaces that are correct and
>convenient so others who want to add codecs won't have to fight our
>architecture!
>
>Is the current architecture good enough so that the Japanese codecs
>will fit in it?  (I'm particularly worried about the stream codecs,
>see my previous message.)
>
No, I don't think it is good enough.  We need a stream codec, and as
you said the string and file interfaces can be built out of that.  

You guys will know better than me what the best patterns for that
are...

- Andy







From andy at robanal.demon.co.uk  Mon Nov 15 23:30:53 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Mon, 15 Nov 1999 22:30:53 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>   <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: <383086da.16067684@post.demon.co.uk>

On Mon, 15 Nov 1999 16:37:28 -0500, you wrote:

># assuming variables input_file, input_encoding, output_file,
># output_encoding, and constant BUFFER_SIZE
>
>f = open(input_file, "rb")
>f1 = unicodec.codecs[input_encoding].stream_reader(f)
>g = open(output_file, "wb")
>g1 = unicodec.codecs[output_encoding].stream_writer(g)
>
>while 1:
>      buffer = f1.read(BUFFER_SIZE)
>      if not buffer:
>	 break
>      g1.write(buffer)
>
>g1.close()
>f1.close()
>
>Note that we could possibly make these the only API that a codec needs
>to provide; the string object <--> unicode object conversions can be
>done using this and the cStringIO module.  (On the other hand it seems
>a common case that would be quite useful.)
Perfect.  I'd keep the string ones - easy to implement but a big
convenience.

The proposal also says:
>For explicit handling of Unicode using files, the unicodec module
>could provide stream wrappers which provide transparent
>encoding/decoding for any open stream (file-like object):
>
>  import unicodec
>  file = open('mytext.txt','rb')
>  ufile = unicodec.stream(file,'utf-16')
>  u = ufile.read()
>  ...
>  ufile.close()

It seems to me that if we go for stream_reader, it replaces this bit
of the proposal too - no need for unicodec to provide anything.  If
you want to have a convenience function there to save a line or two,
you could have
	unicodec.open(filename, mode, encoding)
which returned a stream_reader.
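
For example, something along these lines (just a sketch -- the unicodec
registry and the stream_reader/stream_writer methods are the ones assumed
in the draft proposal, and the helper name is made up):

    import unicodec

    def uopen(filename, mode='rb', encoding='utf-8'):
        # open the file and wrap it in the codec registered for 'encoding'
        f = open(filename, mode)
        codec = unicodec.codecs[encoding]
        if 'w' in mode or 'a' in mode:
            return codec.stream_writer(f)
        else:
            return codec.stream_reader(f)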


- Andy




From mal at lemburg.com  Mon Nov 15 23:54:38 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 23:54:38 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>  
	            <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: <38308F2E.44B9C6BF@lemburg.com>

[I'll get back on this tomorrow, just some quick notes here...]

Guido van Rossum wrote:
> 
> > Andy Robinson wrote:
> > >
> > > Some thoughts on the codecs...
> > >
> > > 1. Stream interface
> > > At the moment a codec has dump and load methods which
> > > read a (slice of a) stream into a string in memory and
> > > vice versa.  As the proposal notes, this could lead to
> > > errors if you take a slice out of a stream.   This is
> > > not just due to character truncation; some Asian
> > > encodings are modal and have shift-in and shift-out
> > > sequences as they move from Western single-byte
> > > characters to double-byte ones.   It also seems a bit
> > > pointless to me as the source (or target) is still a
> > > Unicode string in memory.
> > >
> > > This is a real problem - a filter to convert big files
> > > between two encodings should be possible without
> > > knowledge of the particular encoding, as should one on
> > > the input/output of some server.  We can still give a
> > > default implementation for single-byte encodings.
> > >
> > > What's a good API for real stream conversion?   just
> > > Codec.encodeStream(infile, outfile)  ?  or is it more
> > > useful to feed the codec with data a chunk at a time?
> 
> M.-A. Lemburg responds:
> 
> > The idea was to use Unicode as intermediate for all
> > encoding conversions.
> >
> > What you envision here are stream recoders. They can
> > easily be implemented as a useful addition to the Codec
> > subclasses, but I don't think that these have to go
> > into the core.
> 
> What I wanted was a codec API that acts somewhat like a buffered file;
> the buffer makes it possible to efficiently handle shift states.  This
> is not exactly what Andy shows, but it's not what Marc's current spec
> has either.
> 
> I had thought something more like what Java does: an output stream
> codec's constructor takes a writable file object and the object
> returned by the constructor has a write() method, a flush() method and
> a close() method.  It acts like a buffering interface to the
> underlying file; this allows it to generate the minimal number of
> shift sequences.  Similar for input stream codecs.
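
Spelled out in Python, I guess what you describe would look roughly like
this (only a sketch; the encoder object and its encode()/reset() methods
are invented here purely to illustrate the buffering and shift-state idea):

    class StreamWriter:
        def __init__(self, file, encoder, bufsize=4096):
            self.file = file            # underlying binary file object
            self.encoder = encoder      # stateful Unicode -> bytes encoder
            self.bufsize = bufsize
            self.buffer = u""
        def write(self, ustr):
            self.buffer = self.buffer + ustr
            if len(self.buffer) >= self.bufsize:
                self.flush()
        def flush(self):
            if self.buffer:
                # the encoder keeps its shift state between calls, so it
                # only emits shift sequences when the input requires them
                self.file.write(self.encoder.encode(self.buffer))
                self.buffer = u""
        def close(self):
            self.flush()
            self.file.write(self.encoder.reset())  # final shift-out, if any
            self.file.close()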

The Codecs provide implementations for encoding and decoding,
they are not intended as complete wrappers for e.g. files or
sockets.

The unicodec module will define a generic stream wrapper
(which is yet to be defined) for dealing with files, sockets,
etc. It will use the codec registry to do the actual codec
work.
 
>From the proposal:
"""
For explicit handling of Unicode using files, the unicodec module
could provide stream wrappers which provide transparent
encoding/decoding for any open stream (file-like object):

  import unicodec
  file = open('mytext.txt','rb')
  ufile = unicodec.stream(file,'utf-16')
  u = ufile.read()
  ...
  ufile.close()

XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as
    short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which
    also assures that <mode> contains the 'b' character when needed.

XXX Specify the wrapper(s)...

    Open issues: what to do with Python strings
    fed to the .write() method (may need to know the encoding of the
    strings) and when/if to return Python strings through the .read()
    method.

    Perhaps we need more than one type of wrapper here.
"""

> Andy's file translation example could then be written as follows:
> 
> # assuming variables input_file, input_encoding, output_file,
> # output_encoding, and constant BUFFER_SIZE
> 
> f = open(input_file, "rb")
> f1 = unicodec.codecs[input_encoding].stream_reader(f)
> g = open(output_file, "wb")
> g1 = unicodec.codecs[output_encoding].stream_writer(g)
> 
> while 1:
>       buffer = f1.read(BUFFER_SIZE)
>       if not buffer:
>          break
>       g1.write(buffer)
> 
> g1.close()
> f1.close()

 
> Note that we could possibly make these the only API that a codec needs
> to provide; the string object <--> unicode object conversions can be
> done using this and the cStringIO module.  (On the other hand it seems
> a common case that would be quite useful.)

You wouldn't want to go via cStringIO for *every* encoding
translation.

The Codec interface defines two pairs of methods
on purpose: one which works internally (ie. directly between
strings and Unicode objects), and one which works externally
(directly between a stream and Unicode objects).

> > > 2. Data driven codecs
> > > I really like codecs being objects, and believe we
> > > could build support for a lot more encodings, a lot
> > > sooner than is otherwise possible, by making them data
> > > driven rather making each one compiled C code with
> > > static mapping tables.  What do people think about the
> > > approach below?
> > >
> > > First of all, the ISO8859-1 series are straight
> > > mappings to Unicode code points.  So one Python script
> > > could parse these files and build the mapping table,
> > > and a very small data file could hold these encodings.
> > >   A compiled helper function analogous to
> > > string.translate() could deal with most of them.
> >
> > The problem with these large tables is that currently
> > Python modules are not shared among processes since
> > every process builds its own table.
> >
> > Static C data has the advantage of being shareable at
> > the OS level.
> 
> Don't worry about it.  128K is too small to care, I think...

Huh ? 128K for every process using Python ? That quickly
adds up to lots of megabytes lying around pretty much unused.

> > You can of course implement Python-based lookup tables,
> > but these would be too large...
> >
> > > Secondly, the double-byte ones involve a mixture of
> > > algorithms and data.  The worst cases I know are modal
> > > encodings which need a single-byte lookup table, a
> > > double-byte lookup table, and have some very simple
> > > rules about escape sequences in between them.  A
> > > simple state machine could still handle these (and the
> > > single-byte mappings above become extra-simple special
> > > cases); I could imagine feeding it a totally
> > > data-driven set of rules.
> > >
> > > Third, we can massively compress the mapping tables
> > > using a notation which just lists contiguous ranges;
> > > and very often there are relationships between
> > > encodings.  For example, "cpXYZ is just like cpXYY but
> > > with an extra 'smiley' at 0XFE32".  In these cases, a
> > > script can build a family of related codecs in an
> > > auditable manner.
> >
> > These are all great ideas, but I think they unnecessarily
> > complicate the proposal.
> 
> Agreed, let's leave the *implementation* of codecs out of the current
> efforts.
> 
> However I want to make sure that the *interface* to codecs is defined
> right, because changing it will be expensive.  (This is Linus
> Torvald's philosophy on drivers -- he doesn't care about bugs in
> drivers, as they will get fixed; however he greatly cares about
> defining the driver APIs correctly.)
> 
> > > 3. What encodings to distribute?
> > > The only clean answers to this are 'almost none', or
> > > 'everything that Unicode 3.0 has a mapping for'.  The
> > > latter is going to add some weight to the
> > > distribution.  What are people's feelings?  Do we ship
> > > any at all apart from the Unicode ones?  Should new
> > > encodings be downloadable from www.python.org?  Should
> > > there be an optional package outside the main
> > > distribution?
> >
> > Since Codecs can be registered at runtime, there is quite
> > some potential there for extension writers coding their
> > own fast codecs. E.g. one could use mxTextTools as codec
> > engine working at C speeds.
> 
> (Do you think you'll be able to extort some money from HP for these? :-)

Don't know, it depends on what their specs look like. I use
mxTextTools for fast HTML file processing. It uses a small
Turing machine with some extra magic and is programmable via
Python tuples.
 
> > I would propose to only add some very basic encodings to
> > the standard distribution, e.g. the ones mentioned under
> > Standard Codecs in the proposal:
> >
> >   'utf-8':            8-bit variable length encoding
> >   'utf-16':           16-bit variable length encoding (little/big endian)
> >   'utf-16-le':                utf-16 but explicitly little endian
> >   'utf-16-be':                utf-16 but explicitly big endian
> >   'ascii':            7-bit ASCII codepage
> >   'latin-1':          Latin-1 codepage
> >   'html-entities':    Latin-1 + HTML entities;
> >                       see htmlentitydefs.py from the standard Python Lib
> >   'jis' (a popular version XXX):
> >                       Japanese character encoding
> >   'unicode-escape':   See Unicode Constructors for a definition
> >   'native':           Dump of the Internal Format used by Python
> >
> > Perhaps not even 'html-entities' (even though it would make
> > a cool replacement for cgi.escape()) and maybe we should
> > also place the JIS encoding into a separate Unicode package.
> 
> I'd drop html-entities, it seems too cutesie.  (And who uses these
> anyway, outside browsers?)

Ok.
 
> For JIS (shift-JIS?) I hope that Andy can help us with some pointers
> and validation.
> 
> And unicode-escape: now that you mention it, this is a section of
> the proposal that I don't understand.  I quote it here:
> 
> | Python should provide a built-in constructor for Unicode strings which
> | is available through __builtins__:
> |
> |   u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
>                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I meant this as an optional second argument defaulting to
whatever we define <default encoding> to mean, e.g. 'utf-8'.

u = unicode("string","utf-8") == unicode("string")

The <encoding name> argument must be a string identifying one
of the registered codecs.
 
> | With the 'unicode-escape' encoding being defined as:
> |
> |   u = u''
> |
> | ? for single characters (and this includes all \XXX sequences except \uXXXX),
> |   take the ordinal and interpret it as Unicode ordinal;
> |
> | ? for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX
> |   instead, e.g. \u03C0 to represent the character Pi.
> 
> I've looked at this several times and I don't see the difference
> between the two bullets.  (Ironically, you are using a non-ASCII
> character here that doesn't always display, depending on where I look
> at your mail :-).

The first bullet covers the normal Python string characters
and escapes, e.g. \n and \267 (the center dot ;-), while the
second explains how \uXXXX is interpreted.
 
> Can you give some examples?
> 
> Is u'\u0020' different from u'\x20' (a space)?

No, they both map to the same Unicode ordinal.

> Does '\u0020' (no u prefix) have a meaning?

No, \uXXXX is only defined for u"" strings or strings that are
used to build Unicode objects with this encoding:

u = u'\u0020' == unicode(r'\u0020','unicode-escape')

Note that writing \uXX is an error, e.g. u"\u12 " will
cause a syntax error.
 
Aside: I just noticed that '\x2010' doesn't give '\x20' + '10'
but instead '\x10' -- is this intended ?

> Also, I remember reading Tim Peters who suggested that a "raw unicode"
> notation (ur"...") might be necessary, to encode regular expressions.
> I tend to agree.

This can be had via unicode():

u = unicode(r'\a\b\c\u0020','unicode-escaped')

If that's too long, define a ur() function which wraps up the
above line in a function.
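
E.g. (a sketch in terms of the proposed unicode() constructor; note that
the argument still has to be a raw string, or the normal string escapes
are applied before the codec ever sees them):

    def ur(s):
        # s should be passed as a raw string (r'...')
        return unicode(s, 'unicode-escaped')

    u = ur(r'\a\b\c\u0020')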

> While I'm on the topic, I don't see in your proposal a description of
> the source file character encoding.  Currently, this is undefined, and
> in fact can be (ab)used to enter non-ASCII in string literals.  For
> example, a programmer named François might write a file containing
> this statement:
> 
>   print "Written by Fran?ois." # (There's a cedilla in there!)
> 
> (He assumes his source character encoding is Latin-1, and he doesn't
> want to have to type \347 when he can type a cedilla on his keyboard.)
> 
> If his source file (or .pyc file!)  is executed by a Japanese user,
> this will probably print some garbage.
> 
> Using the new Unicode strings, François could change his program as
> follows:
> 
>   print unicode("Written by Fran?ois.", "latin-1")
> 
> Assuming that François sets his sys.stdout to use Latin-1, while the
> Japanese user sets his to shift-JIS (or whatever his kanjiterm uses).
> 
> But when the Japanese user views François' source file, he will again
> see garbage.  If he uses a generic tool to translate latin-1 files to
> shift-JIS (assuming shift-JIS has a cedilla character) the program
> will no longer work correctly -- the string "latin-1" has to be
> changed to "shift-jis".
> 
> What should we do about this?  The safest and most radical solution is
> to disallow non-ASCII source characters; François will then have to
> type
> 
>   print u"Written by Fran\u00E7ois."
> 
> but, knowing François, he probably won't like this solution very much
> (since he didn't like the \347 version either).

I think best is to leave it undefined... as with all files,
only the programmer knows what format and encoding it contains,
e.g. a Japanese programmer might want to use a shift-JIS editor
to enter strings directly in shift-JIS via

u = unicode("...shift-JIS encoded text...","shift-jis")

Of course, this is not readable using an ASCII editor, but
Python will continue to produce the intended string.
NLS strings don't belong in program text anyway: l10n usually
takes the gettext() approach to handle these issues.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From andy at robanal.demon.co.uk  Tue Nov 16 01:09:28 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Tue, 16 Nov 1999 00:09:28 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <38308F2E.44B9C6BF@lemburg.com>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>   <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <38308F2E.44B9C6BF@lemburg.com>
Message-ID: <3839a078.22625844@post.demon.co.uk>

On Mon, 15 Nov 1999 23:54:38 +0100, you wrote:

>[I'll get back on this tomorrow, just some quick notes here...]
>The Codecs provide implementations for encoding and decoding,
>they are not intended as complete wrappers for e.g. files or
>sockets.
>
>The unicodec module will define a generic stream wrapper
>(which is yet to be defined) for dealing with files, sockets,
>etc. It will use the codec registry to do the actual codec
>work.
> 
>XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as
>    short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which
>    also assures that <mode> contains the 'b' character when needed.
>
>The Codec interface defines two pairs of methods
>on purpose: one which works internally (ie. directly between
>strings and Unicode objects), and one which works externally
>(directly between a stream and Unicode objects).

That's the problem Guido and I are worried about.  Your present API is
not enough to build stream encoders.  The 'slurp it into a unicode
string in one go' approach fails for big files or for network
connections.  And you just cannot build a generic stream reader/writer
by slicing it into strings.   The solution must be specific to the
codec - only it knows how much to buffer, when to flip states etc.  

So the codec should provide proper stream reading and writing
services.  

Unicodec can then wrap those up in labour-saving ways - I'm not fussy
which but I like the one-line file-open utility.


- Andy








From tim_one at email.msn.com  Tue Nov 16 06:38:32 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:38:32 -0500
Subject: [Python-Dev] Unicode proposal: %-formatting ?
In-Reply-To: <382AE7D9.147D58CB@lemburg.com>
Message-ID: <000001bf2ff4$d36e2540$042d153f@tim>

[MAL]
> I wonder how we could add %-formatting to Unicode strings without
> duplicating the PyString_Format() logic.
>
> First, do we need Unicode object %-formatting at all ?

Sure -- in the end, all the world speaks Unicode natively and encodings
become historical baggage.  Granted I won't live that long, but I may last
long enough to see encodings become almost purely an I/O hassle, with all
computation done in Unicode.

> Second, here is an emulation using strings and <default encoding>
> that should give an idea of how one could work with the different
> encodings:
>
>     s = '%s %i abc???' # a Latin-1 encoded string
>     t = (u,3)

What's u?  A Unicode object?  Another Latin-1 string?  A default-encoded
string?  How does the following know the difference?

>     # Convert Latin-1 s to a <default encoding> string via Unicode
>     s1 = unicode(s,'latin-1').encode()
>
>     # The '%s' will now add u in <default encoding>
>     s2 = s1 % t
>
>     # Finally, convert the <default encoding> encoded string to Unicode
>     u1 = unicode(s2)

I don't expect this actually works:  for example, change %s to %4s.
Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to
know that some (or all) characters in u consume multiple bytes, so can't
extract "the right" number of bytes from u.  I think % formating has to know
the truth of what you're doing.
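
Quick illustration with plain byte strings (nothing Unicode-specific is
needed to see the problem):

    # one character, two bytes: LATIN SMALL LETTER E WITH ACUTE in UTF-8
    e_acute = '\303\251'
    s = '%.1s' % e_acute
    # the precision counts *bytes*, so s is now just the lead byte '\303' --
    # half a character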

> Note that .encode() defaults to the current setting of
> <default encoding>.
>
> Provided u maps to Latin-1, an alternative would be:
>
>     u1 = unicode('%s %i abc???' % (u.encode('latin-1'),3), 'latin-1')

More interesting is fmt % tuple where everything is Unicode; people can muck
with Latin-1 directly today using regular strings, so the example above
mostly shows artificial convolution.





From tim_one at email.msn.com  Tue Nov 16 06:38:40 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:38:40 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <382BDD81.458D3125@lemburg.com>
Message-ID: <000101bf2ff4$d636bb20$042d153f@tim>

[MAL, on raw Unicode strings]
> ...
> Agreed... note that you could also write your own codec for just this
> reason and then use:
>
> u = unicode('....\u1234...\...\...','raw-unicode-escaped')
>
> Put that into a function called 'ur' and you have:
>
> u = ur('...\u4545...\...\...')
>
> which is not that far away from ur'...' w/r to cosmetics.

Well, not quite.  In general you need to pass raw strings:

u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
            ^
u = ur(r'...\u4545...\...\...')
       ^

else Python will replace all the other backslash sequences.  This is a
crucial distinction at times; e.g., else \b in a Unicode regexp will expand
into a backspace character before the regexp processor ever sees it (\b is
supposed to be a word boundary assertion).





From tim_one at email.msn.com  Tue Nov 16 06:44:42 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:44:42 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: 
Message-ID: <000201bf2ff5$ae6aefc0$042d153f@tim>

[Tim, wonders why Perl and Tcl went w/ UTF-8 internally]

[Greg Stein]
> Probably for the exact reason that you stated in your messages: many
> 8-bit (7-bit?) functions continue to work quite well when given a
> UTF-8-encoded string. i.e. they didn't have to rewrite the entire
> Perl/TCL interpreter to deal with a new string type.
>
> I'd guess it is a helluva lot easier for us to add a Python Type than
> for Perl or TCL to whack around with new string types (since they use
> strings so heavily).

Sounds convincing to me!  Bumped into an old thread on c.l.p.m. that
suggested Perl was also worried about UCS-2's 64K code point limit.  But I'm
already on record as predicting we'll regret any decision .





From tim_one at email.msn.com  Tue Nov 16 06:52:12 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:52:12 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: 
Message-ID: <000501bf2ff6$ba943a80$042d153f@tim>

[Da Silva, Mike]
> ...
> 5.	UTF-16 requires string operations that do not make assumptions
> about nulls - this means re-implementing most of the C runtime
> functions to work with unsigned shorts.

Python strings are already null-friendly, so Python has already recoded
everything it needs to get away from the no-null assumption; stropmodule.c
is < 1,500 lines of code, and MAL can turn it into C++ template functions in
his sleep .





From tim_one at email.msn.com  Tue Nov 16 06:56:18 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:56:18 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <19991112121303.27452.rocketmail@ web605.yahoomail.com>
Message-ID: <000601bf2ff7$4d8a4c80$042d153f@tim>

[Andy Robinson]
> ...
> I presume no one is actually advocating dropping
> ordinary Python strings, or the ability to do
>    rawdata = open('myfile.txt', 'rb').read()
> without any transformations?

If anyone has advocated either, they've successfully hidden it from me.
Anyone?





From tim_one at email.msn.com  Tue Nov 16 07:09:04 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 01:09:04 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382BF6C3.D79840EC@lemburg.com>
Message-ID: <000701bf2ff9$15cecda0$042d153f@tim>

[MAL]
> BTW, wouldn't it be possible to take pcre and have it
> use Py_Unicode instead of char ? [Of course, there would have to
> be some extensions for character classes etc.]

No, alas.  The assumption that characters are 8 bits is ubiquitous, in both
obvious and subtle ways.

if ((start_bits[c/8] & (1 << (c&7))) == 0) start_match++; else break;





From tim_one at email.msn.com  Tue Nov 16 07:19:16 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 01:19:16 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382C3749.198EEBC6@lemburg.com>
Message-ID: <000801bf2ffa$82273400$042d153f@tim>

[MAL]
> sys.bom should return the byte order mark (BOM) for the format used
> internally. The unicodec module should provide symbols for all
> possible values of this variable:
>
>   BOM_BE: '\376\377' 
>     (corresponds to Unicode 0x0000FEFF in UTF-16 
>      == ZERO WIDTH NO-BREAK SPACE)
>
>   BOM_LE: '\377\376' 
>     (corresponds to Unicode 0x0000FFFE in UTF-16 
>      == illegal Unicode character)
>
>   BOM4_BE: '\000\000\377\376'
>     (corresponds to Unicode 0x0000FEFF in UCS-4)

Should be
    BOM4_BE: '\000\000\376\377'   
 
>   BOM4_LE: '\376\377\000\000'
>     (corresponds to Unicode 0x0000FFFE in UCS-4)

Should be
    BOM4_LE: '\377\376\000\000'
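
FWIW, the intended use is straightforward enough; a sketch of BOM sniffing
in terms of the proposed unicodec constants (the helper itself is made up):

    import unicodec

    def guess_utf16_byte_order(f):
        # peek at the first two bytes; rewind if they are not a BOM
        bom = f.read(2)
        if bom == unicodec.BOM_BE:
            return 'utf-16-be'
        elif bom == unicodec.BOM_LE:
            return 'utf-16-le'
        f.seek(0)
        return None        # no BOM -- the caller has to decide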





From tim_one at email.msn.com  Tue Nov 16 07:31:39 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 01:31:39 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: <14380.16437.71847.832880@weyr.cnri.reston.va.us>
Message-ID: <000901bf2ffc$3d4bb8e0$042d153f@tim>

[Fred L. Drake, Jr.]
> ...
>   I wasn't suggesting the PyStringObject be changed, only that the
> PyUnicodeObject could maintain a reference.  Consider:
>
>         s = fp.read()
>         u = unicode(s, 'utf-8')
>
> u would now hold a reference to s, and s/s# would return a pointer
> into s instead of re-building the UTF-8 form.  I talked myself out of
> this because it would be too easy to keep a lot more string objects
> around than were actually needed.

Yet another use for a weak reference <0.5 wink>.





From tim_one at email.msn.com  Tue Nov 16 07:41:44 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 01:41:44 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: 
Message-ID: <000b01bf2ffd$a5ad69a0$042d153f@tim>

[MAL]
>   BOM_BE: '\376\377'
>     (corresponds to Unicode 0x0000FEFF in UTF-16
>      == ZERO WIDTH NO-BREAK SPACE)

[Greg Stein]
> Are you sure about that interpretation? I thought the BOM characters
> (0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space.

I can't speak to MAL's degree of certainty , but he's right about this
stuff.  There is only one BOM character, U+FEFF, which is the zero-width
no-break space.  The byte-swapped form is not only reserved, it's guaranteed
never to be assigned to a character.





From tim_one at email.msn.com  Tue Nov 16 08:47:06 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 02:47:06 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: <000d01bf3006$c7823700$042d153f@tim>

[Guido]
> ...
> While I'm on the topic, I don't see in your proposal a description of
> the source file character encoding.  Currently, this is undefined, and
> in fact can be (ab)used to enter non-ASCII in string literals.
> ...
> What should we do about this?  The safest and most radical solution is
> to disallow non-ASCII source characters; François will then have to
> type
>
>   print u"Written by Fran\u00E7ois."
>
> but, knowing François, he probably won't like this solution very much
> (since he didn't like the \347 version either).

So long as Python opens source files using libc text mode, it can't
guarantee more than C does:  the presence of any character other than tab,
newline, and ASCII 32-126 inclusive renders the file contents undefined.

Go beyond that, and you've got the same problem as mailers and browsers, and
so also the same solution:  open source files in binary mode, and add a
pragma specifying the intended charset.

As a practical matter, declare that Python source is Latin-1 for now, and
declare any *system* that doesn't support that non-conforming .

python-is-the-measure-of-all-things-ly y'rs  - tim





From tim_one at email.msn.com  Tue Nov 16 08:47:08 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 02:47:08 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <38308F2E.44B9C6BF@lemburg.com>
Message-ID: <000e01bf3006$c8c11fa0$042d153f@tim>

[Guido]
>> Does '\u0020' (no u prefix) have a meaning?

[MAL]
> No, \uXXXX is only defined for u"" strings or strings that are
> used to build Unicode objects with this encoding:

I believe your intent is that '\u0020' be exactly those 6 characters, just
as today.  That is, it does have a meaning, but its meaning differs between
Unicode string literals and regular string literals.

> Note that writing \uXX is an error, e.g. u"\u12 " will
> cause a syntax error.

Although I believe your intent  is that, just as today, '\u12' is not
an error.

> Aside: I just noticed that '\x2010' doesn't give '\x20' + '10'
> but instead '\x10' -- is this intended ?

Yes; see 2.4.1 ("String literals") of the Lang Ref.  Blame the C committee
for not defining \x in a platform-independent way.  Note that a Python \x
escape consumes *all* following hex characters, no matter how many -- and
ignores all but the last two.

> This [raw Unicode strings] can be had via unicode():
>
> u = unicode(r'\a\b\c\u0020','unicode-escaped')
>
> If that's too long, define a ur() function which wraps up the
> above line in a function.

As before, I think that's fine for now, but won't stand forever.





From fredrik at pythonware.com  Tue Nov 16 09:39:20 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 09:39:20 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>             <38305D17.60EC94D0@lemburg.com>  <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: <010001bf300e$14741310$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> I had thought something more like what Java does: an output stream
> codec's constructor takes a writable file object and the object
> returned by the constructor has a write() method, a flush() method and
> a close() method.  It acts like a buffering interface to the
> underlying file; this allows it to generate the minimal number of
> shift sequeuces.  Similar for input stream codecs.

note that the html/sgml/xml parsers generally
support the feed/close protocol.  to be able
to use these codecs in that context, we need

1) codecs written according to the "data
   consumer model", instead of the "stream"
   model.

        class myDecoder:
            def __init__(self, target):
                self.target = target
                self.state = ...
            def feed(self, data):
                ... extract as much data as possible ...
                self.target.feed(extracted data)
            def close(self):
                ... extract what's left ...
                self.target.feed(additional data)
                self.target.close()

or

2) make threads mandatory, just like in Java.

or

3) add light-weight threads (ala stackless python)
   to the interpreter...

(I vote for alternative 3, but that's another story ;-)
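
to make 1) a bit more concrete, here is a rough sketch of a feed()-style
decoder for utf-16-be -- the only interesting part is that it never splits
a 2-byte unit across feed() calls (the unicode() constructor used here is
the one from the proposal; everything else is invented for illustration):

    class UTF16BEDecoder:
        def __init__(self, target):
            self.target = target
            self.pending = ''
        def feed(self, data):
            data = self.pending + data
            cut = len(data) - (len(data) % 2)   # keep any odd trailing byte
            self.pending = data[cut:]
            if cut:
                self.target.feed(unicode(data[:cut], 'utf-16-be'))
        def close(self):
            if self.pending:
                raise ValueError('truncated utf-16 data')
            self.target.close()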






From fredrik at pythonware.com  Tue Nov 16 09:58:50 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 09:58:50 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf2ff4$d636bb20$042d153f@tim>
Message-ID: <016a01bf3010$cde52620$f29b12c2@secret.pythonware.com>

Tim Peters  wrote:
> (\b is supposed to be a word boundary assertion).

in some places, that is.



    Main Entry: reg·u·lar
    Pronunciation: 're-gy&-l&r, 're-g(&-)l&r

    1 : belonging to a religious order
    2 a : formed, built, arranged, or ordered according
    to some established rule, law, principle, or type ...
    3 a : ORDERLY, METHODICAL  ...
    4 a : constituted, conducted, or done in conformity
    with established or prescribed usages, rules, or
    discipline ...




From jack at oratrix.nl  Tue Nov 16 12:05:55 1999
From: jack at oratrix.nl (Jack Jansen)
Date: Tue, 16 Nov 1999 12:05:55 +0100
Subject: [Python-Dev] Some thoughts on the codecs... 
In-Reply-To: Message by "M.-A. Lemburg"  ,
	     Mon, 15 Nov 1999 20:20:55 +0100 , <38305D17.60EC94D0@lemburg.com> 
Message-ID: <19991116110555.8B43335BB1E@snelboot.oratrix.nl>

> I would propose to only add some very basic encodings to
> the standard distribution, e.g. the ones mentioned under
> Standard Codecs in the proposal:
> 
>   'utf-8':		8-bit variable length encoding
>   'utf-16':		16-bit variable length encoding (little/big endian)
>   'utf-16-le':		utf-16 but explicitly little endian
>   'utf-16-be':		utf-16 but explicitly big endian
>   'ascii':		7-bit ASCII codepage
>   'latin-1':		Latin-1 codepage
>   'html-entities':	Latin-1 + HTML entities;
> 			see htmlentitydefs.py from the standard Python Lib
>   'jis' (a popular version XXX):
> 			Japanese character encoding
>   'unicode-escape':	See Unicode Constructors for a definition
>   'native':		Dump of the Internal Format used by Python

I would suggest adding the DOS, Windows and Macintosh standard 8-bit charsets 
(their equivalents of latin-1) too, as documents in these encodings are pretty 
ubiquitous. But maybe these should only be added on the respective platforms.
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From mal at lemburg.com  Tue Nov 16 09:35:28 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 09:35:28 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <000e01bf3006$c8c11fa0$042d153f@tim>
Message-ID: <38311750.22D17EC1@lemburg.com>

Tim Peters wrote:
> 
> [Guido]
> >> Does '\u0020' (no u prefix) have a meaning?
> 
> [MAL]
> > No, \uXXXX is only defined for u"" strings or strings that are
> > used to build Unicode objects with this encoding:
> 
> I believe your intent is that '\u0020' be exactly those 6 characters, just
> as today.  That is, it does have a meaning, but its meaning differs between
> Unicode string literals and regular string literals.

Right.
 
> > Note that writing \uXX is an error, e.g. u"\u12 " will
> > cause a syntax error.
> 
> Although I believe your intent  is that, just as today, '\u12' is not
> an error.

Right again :-) "\u12" gives a 4 byte string, u"\u12" produces an
exception.
 
> > Aside: I just noticed that '\x2010' doesn't give '\x20' + '10'
> > but instead '\x10' -- is this intended ?
> 
> Yes; see 2.4.1 ("String literals") of the Lang Ref.  Blame the C committee
> for not defining \x in a platform-independent way.  Note that a Python \x
> escape consumes *all* following hex characters, no matter how many -- and
> ignores all but the last two.

Strange definition...
 
> > This [raw Unicode strings] can be had via unicode():
> >
> > u = unicode(r'\a\b\c\u0020','unicode-escaped')
> >
> > If that's too long, define a ur() function which wraps up the
> > above line in a function.
> 
> As before, I think that's fine for now, but won't stand forever.

If Guido agrees to ur"", I can put that into the proposal too
-- it's just that things are starting to get a little crowded
for a strawman proposal ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 11:50:31 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 11:50:31 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <38307984.12653394@post.demon.co.uk>
Message-ID: <383136F7.AB73A90@lemburg.com>

Andy Robinson wrote:
> 
> Leave JISXXX and the CJK stuff out.  If you get into Japanese, you
> really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there
> are lots of options about how to do it.  The other ones are
> algorithmic and can be small and fast and fit into the core.
> 
> Ditto with HTML, and maybe even escaped-unicode too.

So I can drop JIS ? [I won't be able to drop the escaped unicode
codec because this is needed for u"" and ur"".]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 11:42:19 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 11:42:19 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf2ff4$d636bb20$042d153f@tim>
Message-ID: <3831350B.8F69CB6D@lemburg.com>

Tim Peters wrote:
> 
> [MAL, on raw Unicode strings]
> > ...
> > Agreed... note that you could also write your own codec for just this
> > reason and then use:
> >
> > u = unicode('....\u1234...\...\...','raw-unicode-escaped')
> >
> > Put that into a function called 'ur' and you have:
> >
> > u = ur('...\u4545...\...\...')
> >
> > which is not that far away from ur'...' w/r to cosmetics.
> 
> Well, not quite.  In general you need to pass raw strings:
> 
> u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
>             ^
> u = ur(r'...\u4545...\...\...')
>        ^
> 
> else Python will replace all the other backslash sequences.  This is a
> crucial distinction at times; e.g., else \b in a Unicode regexp will expand
> into a backspace character before the regexp processor ever sees it (\b is
> supposed to be a word boundary assertion).

Right.

Here is a sample implementation of what I had in mind:

""" Demo for 'unicode-escape' encoding.
"""
import struct,string,re

pack_format = '>H'

def convert_string(s):

    l = map(None,s)
    for i in range(len(l)):
	l[i] = struct.pack(pack_format,ord(l[i]))
    return l

u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})')

def unicode_unescape(s):

    l = []
    start = 0
    while start < len(s):
	m = u_escape.search(s,start)
	if not m:
	    l[len(l):] = convert_string(s[start:])
	    break
	m_start,m_end = m.span()
	if m_start > start:
	    l[len(l):] = convert_string(s[start:m_start])
	hexcode = m.group(1)
	#print hexcode,start,m_start
	if len(hexcode) != 4:
	    raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode
	ordinal = string.atoi(hexcode,16)
	l.append(struct.pack(pack_format,ordinal))
	start = m_end
    #print l
    return string.join(l,'')
    
def hexstr(s,sep=''):

    return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % ord(x),s),sep)
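
A quick check of the above (the expected output assumes the big-endian
pack format used in the demo):

print hexstr(unicode_unescape(r'abc\u0020xyz'), ' ')
# should print: 00 61 00 62 00 63 00 20 00 78 00 79 00 7a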

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 11:40:42 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 11:40:42 +0100
Subject: [Python-Dev] Unicode proposal: %-formatting ?
References: <000001bf2ff4$d36e2540$042d153f@tim>
Message-ID: <383134AA.4B49D178@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > I wonder how we could add %-formatting to Unicode strings without
> > duplicating the PyString_Format() logic.
> >
> > First, do we need Unicode object %-formatting at all ?
> 
> Sure -- in the end, all the world speaks Unicode natively and encodings
> become historical baggage.  Granted I won't live that long, but I may last
> long enough to see encodings become almost purely an I/O hassle, with all
> computation done in Unicode.
> 
> > Second, here is an emulation using strings and <default encoding>
> > that should give an idea of how one could work with the different
> > encodings:
> >
> >     s = '%s %i abc???' # a Latin-1 encoded string
> >     t = (u,3)
> 
> What's u?  A Unicode object?  Another Latin-1 string?  A default-encoded
> string?  How does the following know the difference?

u refers to a Unicode object in the proposal. Sorry, forgot to
mention that.
 
> >     # Convert Latin-1 s to a <default encoding> string via Unicode
> >     s1 = unicode(s,'latin-1').encode()
> >
> >     # The '%s' will now add u in <default encoding>
> >     s2 = s1 % t
> >
> >     # Finally, convert the <default encoding> encoded string to Unicode
> >     u1 = unicode(s2)
> 
> I don't expect this actually works:  for example, change %s to %4s.
> Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to
> know that some (or all) characters in u consume multiple bytes, so can't
> extract "the right" number of bytes from u.  I think % formating has to know
> the truth of what you're doing.

Hmm, guess you're right... format parameters should indeed refer
to characters rather than number of encoding bytes.

This means a new PyUnicode_Format() implementation mapping
Unicode format objects to Unicode objects.
 
> > Note that .encode() defaults to the current setting of
> > <default encoding>.
> >
> > Provided u maps to Latin-1, an alternative would be:
> >
> >     u1 = unicode('%s %i abc???' % (u.encode('latin-1'),3), 'latin-1')
> 
> More interesting is fmt % tuple where everything is Unicode; people can muck
> with Latin-1 directly today using regular strings, so the example above
> mostly shows artificial convolution.

... hmm, there is a problem there: how should the PyUnicode_Format()
API deal with '%s' when it sees a Unicode object as argument ?

E.g. what would you get in these cases:

u = u"%s %s" % (u"abc", "abc")

Perhaps we need a new marker for "insert Unicode object here".

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 11:48:13 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 11:48:13 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>   <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <38308F2E.44B9C6BF@lemburg.com> <3839a078.22625844@post.demon.co.uk>
Message-ID: <3831366D.8A09E194@lemburg.com>

Andy Robinson wrote:
> 
> On Mon, 15 Nov 1999 23:54:38 +0100, you wrote:
> 
> >[I'll get back on this tomorrow, just some quick notes here...]
> >The Codecs provide implementations for encoding and decoding,
> >they are not intended as complete wrappers for e.g. files or
> >sockets.
> >
> >The unicodec module will define a generic stream wrapper
> >(which is yet to be defined) for dealing with files, sockets,
> >etc. It will use the codec registry to do the actual codec
> >work.
> >
> >XXX unicodec.file(,,) could be provided as
> >    short-hand for unicodec.file(open(,),) which
> >    also assures that  contains the 'b' character when needed.
> >
> >The Codec interface defines two pairs of methods
> >on purpose: one which works internally (ie. directly between
> >strings and Unicode objects), and one which works externally
> >(directly between a stream and Unicode objects).
> 
> That's the problem Guido and I are worried about.  Your present API is
> not enough to build stream encoders.  The 'slurp it into a unicode
> string in one go' approach fails for big files or for network
> connections.  And you just cannot build a generic stream reader/writer
> by slicing it into strings.   The solution must be specific to the
> codec - only it knows how much to buffer, when to flip states etc.
> 
> So the codec should provide proper stream reading and writing
> services.

I guess I'll have to rethink the Codec specs. Some leads:

1. introduce a new StreamCodec class which is designed for
   handling stream encoding and decoding (and supports
   state)

2. give more information to the unicodec registry: 
   one could register classes instead of instances which the Unicode
   implementation would then instantiate whenever it needs to
   apply the conversion; since this is only needed for encodings
   maintaining state, the registry would only have to do the
   instantiation for these codecs and could use cached instances for
   stateless codecs (see the sketch below).
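
A rough sketch of what lead 2 could look like (all names are placeholders,
not the final unicodec API; codec classes are assumed to carry a 'stateful'
flag):

    _registry = {}      # encoding name -> codec class
    _cache = {}         # encoding name -> shared instance (stateless only)

    def register(name, codec_class):
        _registry[name] = codec_class

    def lookup(name):
        codec_class = _registry[name]
        if codec_class.stateful:
            # stream codecs keep state, so hand out a fresh instance
            return codec_class()
        try:
            return _cache[name]
        except KeyError:
            codec = _cache[name] = codec_class()
            return codec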
 
> Unicodec can then wrap those up in labour-saving ways - I'm not fussy
> which but I like the one-line file-open utility.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fredrik at pythonware.com  Tue Nov 16 12:38:31 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 12:38:31 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com>
Message-ID: <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com>

> I would propose to only add some very basic encodings to
> the standard distribution, e.g. the ones mentioned under
> Standard Codecs in the proposal:
> 
>   'utf-8': 8-bit variable length encoding
>   'utf-16': 16-bit variable length encoding (little/big endian)
>   'utf-16-le': utf-16 but explicitly little endian
>   'utf-16-be': utf-16 but explicitly big endian
>   'ascii': 7-bit ASCII codepage
>   'latin-1': Latin-1 codepage
>   'html-entities': Latin-1 + HTML entities;
> see htmlentitydefs.py from the standard Python Lib
>   'jis' (a popular version XXX):
> Japanese character encoding
>   'unicode-escape': See Unicode Constructors for a definition
>   'native': Dump of the Internal Format used by Python

since this is already very close, maybe we could adopt
the naming guidelines from XML:

    In an encoding declaration, the values "UTF-8", "UTF-16",
    "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used
    for the various encodings and transformations of
    Unicode/ISO/IEC 10646, the values "ISO-8859-1",
    "ISO-8859-2", ... "ISO-8859-9" should be used for the parts
    of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS",
    and "EUC-JP" should be used for the various encoded
    forms of JIS X-0208-1997.

    XML processors may recognize other encodings; it is
    recommended that character encodings registered
    (as charsets) with the Internet Assigned Numbers
    Authority [IANA], other than those just listed,
    should be referred to using their registered names.

    Note that these registered names are defined to be
    case-insensitive, so processors wishing to match
    against them should do so in a case-insensitive way.

(ie "iso-8859-1" instead of "latin-1", etc -- at least as
aliases...).
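
one cheap way to get there is to normalize names in the registry and keep
a small alias table -- the table below is only an illustration, not a
complete or authoritative list:

    import string

    _aliases = {
        'latin-1':    'iso-8859-1',
        'latin1':     'iso-8859-1',
        'iso_8859_1': 'iso-8859-1',
        'utf8':       'utf-8',
        'ucs-2':      'iso-10646-ucs-2',
    }

    def normalize_encoding(name):
        name = string.lower(string.strip(name))
        return _aliases.get(name, name)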






From gstein at lyra.org  Tue Nov 16 12:45:48 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 03:45:48 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com>
Message-ID: 

On Tue, 16 Nov 1999, Fredrik Lundh wrote:
>...
> since this is already very close, maybe we could adopt
> the naming guidelines from XML:
> 
>     In an encoding declaration, the values "UTF-8", "UTF-16",
>     "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used
>     for the various encodings and transformations of
>     Unicode/ISO/IEC 10646, the values "ISO-8859-1",
>     "ISO-8859-2", ... "ISO-8859-9" should be used for the parts
>     of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS",
>     and "EUC-JP" should be used for the various encoded
>     forms of JIS X-0208-1997.
> 
>     XML processors may recognize other encodings; it is
>     recommended that character encodings registered
>     (as charsets) with the Internet Assigned Numbers
>     Authority [IANA], other than those just listed,
>     should be referred to using their registered names.
> 
>     Note that these registered names are defined to be
>     case-insensitive, so processors wishing to match
>     against them should do so in a case-insensitive way.
> 
> (ie "iso-8859-1" instead of "latin-1", etc -- at least as
> aliases...).

+1

(as we'd say in Apache-land... :-)

-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Tue Nov 16 13:04:47 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 04:04:47 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <3830595B.348E8CC7@lemburg.com>
Message-ID: 

On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
> Guido van Rossum wrote:
>...
> > t# refers to byte-encoded data.  Multibyte encodings are explicitly
> > designed to be passed cleanly through processing steps that handle
> > single-byte character data, as long as they are 8-bit clean and don't
> > do too much processing.
> 
> Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
> "8-bit clean" as you obviously did.

Hrm. That might be dangerous. Many of the functions that use "t#" assume
that each character is 8 bits long, i.e. the returned length == the number
of characters.

I'm not sure what the implications would be if you interpret the semantics
of "t#" as multi-byte characters.

>...
> > For example, take an encryption engine.  While it is defined in terms
> > of byte streams, there's no requirement that the bytes represent
> > characters -- they could be the bytes of a GIF file, an MP3 file, or a
> > gzipped tar file.  If we pass Unicode to an encryption engine, we want
> > Unicode to come out at the other end, not UTF-8.  (If we had wanted to
> > encrypt UTF-8, we should have fed it UTF-8.)

Heck. I just want to quickly throw the data onto my disk. I'll write a
BOM, followed by the raw data. Done. It's even portable.

>...
> > Aha, I think there's a confusion about what "8-bit" means.  For me, a
> > multibyte encoding like UTF-8 is still 8-bit.  Am I alone in this?

Maybe. I don't see multi-byte characters as 8-bit (in the sense of the "t"
format).

> > (As far as I know, C uses char* to represent multibyte characters.)
> > Maybe we should disambiguate it more explicitly?

We can disambiguate with a new format character, or we can clarify the
semantics of "t" to mean single- *or* multi- byte characters. Again, I
think there may be trouble if the semantics of "t" are defined to allow
multibyte characters.

> There should be some definition for the two markers and the
> ideas behind them in the API guide, I guess.

Certainly.

[ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]

> > > Hmm, I would strongly object to making "s#" return the internal
> > > format. file.write() would then default to writing UTF-16 data
> > > instead of UTF-8 data. This could result in strange errors
> > > due to the UTF-16 format being endian dependent.
> > 
> > But this was the whole design.  file.write() needs to be changed to
> > use s# when the file is open in binary mode and t# when the file is
> > open in text mode.

Interesting idea, but that presumes that "t" will be defined for the
Unicode
object (i.e. it implements the getcharbuffer type slot). Because of the
multi-byte problem, I don't think it will.
[ not to mention, that I don't think the Unicode object should implicitly
  do a UTF-8 conversion and hold a ref to the resulting string ]

>...
> I still don't feel very comfortable about the fact that all
> existing APIs using "s#" will suddenly receive UTF-16 data if
> being passed Unicode objects: this probably won't get us the
> "magical" Unicode integration we invision, since "t#" usage is not
> very wide spread and character handling code will probably not
> work well with UTF-16 encoded strings.

I'm not sure that we should definitely go for "magical." Perl has magic in
it, and that is one of its worst faults. Go for clean and predictable, and
leave as much logic to the Python level as possible. The interpreter
should provide a minimum of functionality, rather than second-guessing and
trying to be neat and sneaky with its operation.

>...
> > Because file.write() for a binary file, and other similar things
> > (e.g. the encryption engine example I mentioned above) must have
> > *some* way to get at the raw bits.
> 
> What for ?

How about: "because I'm the application developer, and I say that I want
the raw bytes in the file."

> Any lossless encoding should do the trick... UTF-8
> is just as good as UTF-16 for binary files; plus it's more compact
> for ASCII data. I don't really see a need to get explicitly
> at the internal data representation because both encodings are
> in fact "internal" w/r to Unicode objects.
> 
> The only argument I can come up with is that using UTF-16 for
> binary files could (possibly) eliminate the UTF-8 conversion step
> which is otherwise always needed.

The argument that I come up with is "don't tell me how to design my
storage format, and don't make Python force me into one."

If I want to write Unicode text to a file, the most natural thing to do
is:

open('file', 'w').write(u)

If you do a conversion on me, then I'm not writing Unicode. I've got to go
and do some nasty conversion which just monkeys up my program.

If I have a Unicode object, but I *want* to write UTF-8 to the file, then
the cleanest thing is:

open('file', 'w').write(encode(u, 'utf-8'))

This is clear that I've got a Unicode object input, but I'm writing UTF-8.

I have a second argument, too: See my first argument. :-)

Really... this is kind of what Fredrik was trying to say: don't get in the
way of the application programmer. Give them tools, but avoid policy and
gimmicks and other "magic".

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Tue Nov 16 13:09:17 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 04:09:17 -0800 (PST)
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: 

On Mon, 15 Nov 1999, Guido van Rossum wrote:
>...
> > The problem with these large tables is that currently
> > Python modules are not shared among processes since
> > every process builds its own table.
> > 
> > Static C data has the advantage of being shareable at
> > the OS level.
> 
> Don't worry about it.  128K is too small to care, I think...

This is the reason Python starts up so slowly and has a large memory
footprint. There hasn't been any concern for moving stuff into shared data
pages. As a result, a process must map in a bunch of vmem pages, for no
other reason than to allocate Python structures in that memory and copy
constants in.

Go start Perl 100 times, then do the same with Python. Python is
significantly slower. I've actually written a web app in PHP because
another one that I did in Python had slow response time.
[ yah: the Real Man Answer is to write a real/good mod_python. ]

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From captainrobbo at yahoo.com  Tue Nov 16 13:18:19 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 16 Nov 1999 04:18:19 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
Message-ID: <19991116121819.21509.rocketmail@web606.mail.yahoo.com>


--- "M.-A. Lemburg"  wrote:
> So I can drop JIS ? [I won't be able to drop the
> escaped unicode
> codec because this is needed for u"" and ur"".]

Drop Japanese from the core language.  

JIS0208 is a big character set with three popular
encodings (Shift-JIS, EUC-JP and JIS), and a host of
slight variations; it has 6879 characters, and there
are a range of options a user might need to set for it
to be useful.  So let's assume for now this is a separate
package.  There's a good chance I'll do it but it is
not a small job.  If you start statically linking in
tables of 7000 characters for one Asian language,
you'll have to do the lot.

As for the single-byte Latin ones, a prototype Python
module could be whipped up in a couple of evenings,
and a tiny C function which does single-byte to
double-byte mappings and vice versa could make it
fast.  We can have an extensible, data driven solution
in no time without having to build it into the core.
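
Something like the following would already cover the
single-byte case; it is only a sketch (unichr() is the
constructor from the Unicode proposal, and the table
shown is the trivial Latin-1 one):

    latin_1_table = tuple(range(256))   # ISO 8859-1 maps 1:1 to Unicode

    def decode_single_byte(s, table):
        # table: sequence of 256 Unicode ordinals, one per byte value
        u = u''
        for c in s:
            u = u + unichr(table[ord(c)])
        return u

    def make_encoding_map(table):
        # invert the table for the encode direction
        m = {}
        for byte in range(256):
            m[table[byte]] = byte
        return m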

The way I see it, to claim that Python has i18n, a
serious effort is needed to ensure every major
encoding in the world is available to Python users.  
But that's separate from the core language.  Your spec
should only cover what is going to be hard-coded into
Python.  

I'd like to see one paragraph in your spec stating
that our architecture separates the encodings
themselves from the core language changes, and that
getting them sorted is a logically separate (but
important) project.  Ideally, we could put together a
separate proposal for the encoding library itself and
run it by some world class experts in that field, but
after yours is done.


- Andy

 



=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From guido at CNRI.Reston.VA.US  Tue Nov 16 14:28:42 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 08:28:42 -0500
Subject: [Python-Dev] Unicode proposal: %-formatting ?
In-Reply-To: Your message of "Tue, 16 Nov 1999 11:40:42 +0100."
             <383134AA.4B49D178@lemburg.com> 
References: <000001bf2ff4$d36e2540$042d153f@tim>  
            <383134AA.4B49D178@lemburg.com> 
Message-ID: <199911161328.IAA29042@eric.cnri.reston.va.us>

> ... hmm, there is a problem there: how should the PyUnicode_Format()
> API deal with '%s' when it sees a Unicode object as argument ?
> 
> E.g. what would you get in these cases:
> 
> u = u"%s %s" % (u"abc", "abc")


From guido at CNRI.Reston.VA.US  Tue Nov 16 14:45:17 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 08:45:17 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Tue, 16 Nov 1999 04:04:47 PST."
              
References:  
Message-ID: <199911161345.IAA29064@eric.cnri.reston.va.us>

> > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
> > "8-bit clean" as you obviously did.
> 
> Hrm. That might be dangerous. Many of the functions that use "t#" assume
> that each character is 8-bits long. i.e. the returned length == the number
> of characters.
> 
> I'm not sure what the implications would be if you interpret the semantics
> of "t#" as multi-byte characters.

Hrm.  Can you quote examples of users of t# who would be confused by
multibyte characters?  I guess that there are quite a few places where
they will be considered illegal, but that's okay -- the string will be
parsed at some point and rejected, e.g. as an illegal filename,
hostname or whatever.  On the other hand, there are quite some places
where I would think that multibyte characters would do just the right
thing.  Many places using t# could just as well be using 's' except
they need to know the length and they don't want to call strlen().
In all cases I've looked at, the reason they need the length is that
they are allocating a buffer (or checking whether it fits in a
statically allocated buffer) -- and there the number of bytes in a
multibyte string is just fine.

Note that I take the same stance on 's' -- it should return multibyte
characters.

> > What for ?
> 
> How about: "because I'm the application developer, and I say that I want
> the raw bytes in the file."

Here I'm with you, man!

> Greg Stein, http://www.lyra.org/

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gward at cnri.reston.va.us  Tue Nov 16 15:10:33 1999
From: gward at cnri.reston.va.us (Greg Ward)
Date: Tue, 16 Nov 1999 09:10:33 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: ; from gstein@lyra.org on Tue, Nov 16, 1999 at 04:09:17AM -0800
References: <199911152137.QAA28280@eric.cnri.reston.va.us> 
Message-ID: <19991116091032.A4063@cnri.reston.va.us>

On 16 November 1999, Greg Stein said:
> This is the reason Python starts up so slow and has a large memory
> footprint. There hasn't been any concern for moving stuff into shared data
> pages. As a result, a process must map in a bunch of vmem pages, for no
> other reason than to allocate Python structures in that memory and copy
> constants in.
> 
> Go start Perl 100 times, then do the same with Python. Python is
> significantly slower. I've actually written a web app in PHP because
> another one that I did in Python had slow response time.
> [ yah: the Real Man Answer is to write a real/good mod_python. ]

I don't think this is the only factor in startup overhead.  Try looking
into the number of system calls for the trivial startup case of each
interpreter:

  $ truss perl -e 1 2> perl.log 
  $ truss python -c 1 2> python.log

(This is on Solaris; I did the same thing on Linux with "strace", and on
IRIX with "par -s -SS".  Dunno about other Unices.)  The results are
interesting, and useful despite the platform and version disparities.

(For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on
Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX.  The Solaris is 2.6,
using the Official CNRI Python Build by Barry, and the ditto Perl build
by me; the Linux system is starship, using whatever Perl and Python the
Starship Masters provide us with; the IRIX box is an elderly but
well-maintained SGI Challenge running IRIX 5.3.)

Also, this is with an empty PYTHONPATH.  The Solaris build of Python has
different prefix and exec_prefix, but on the Linux and IRIX builds, they
are the same.  (I think this will reflect poorly on the Solaris
version.)  PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect
startup of the trivial "1" script, so I haven't paid attention to them.

First, the size of log files (in lines), i.e. number of system calls:

               Solaris     Linux    IRIX[1]
  Perl              88        85      70
  Python           425       316     257

[1] after chopping off the summary counts from the "par" output -- ie.
    these really are the number of system calls, not the number of
    lines in the log files

Next, the number of "open" calls:

               Solaris     Linux    IRIX
  Perl             16         10       9
  Python          107         71      48

(It looks as though *all* of the Perl 'open' calls are due to the
dynamic linker going through /usr/lib and/or /lib.)

And the number of unsuccessful "open" calls:

               Solaris     Linux    IRIX
  Perl              6          1       3
  Python           77         49      32

Number of "mmap" calls:

               Solaris     Linux    IRIX
  Perl              25        25       1
  Python            36        24       1

...nope, guess we can't blame mmap for any Perl/Python startup
disparity.

How about "brk":

               Solaris     Linux    IRIX
  Perl               6        11      12
  Python            47        39      25

...ok, looks like Greg's gripe about memory holds some water.

Rerunning "truss" on Solaris with "python -S -c 1" drastically reduces
the startup overhead as measured by "number of system calls".  Some
quick timing experiments show a drastic speedup (in wall-clock time) by
adding "-S": about 37% faster under Solaris, 56% faster under Linux, and
35% under IRIX.  These figures should be taken with a large grain of
salt, as the Linux and IRIX systems were fairly well loaded at the time,
and the wall-clock results I measured had huge variance.  Still, it gets
the point across.

Oh, also for the record, all timings were done like:

   perl -e 'for $i (1 .. 100) { system "python", "-S", "-c", "1"; }'

because I wanted to guarantee no shell was involved in the Python
startup.

        Greg
-- 
Greg Ward - software developer                    gward at cnri.reston.va.us
Corporation for National Research Initiatives    
1895 Preston White Drive                           voice: +1-703-620-8990
Reston, Virginia, USA  20191-5434                    fax: +1-703-620-0913



From mal at lemburg.com  Tue Nov 16 12:33:07 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 12:33:07 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991116110555.8B43335BB1E@snelboot.oratrix.nl>
Message-ID: <383140F3.EDDB307A@lemburg.com>

Jack Jansen wrote:
> 
> > I would propose to only add some very basic encodings to
> > the standard distribution, e.g. the ones mentioned under
> > Standard Codecs in the proposal:
> >
> >   'utf-8':            8-bit variable length encoding
> >   'utf-16':           16-bit variable length encoding (little/big endian)
> >   'utf-16-le':                utf-16 but explicitly little endian
> >   'utf-16-be':                utf-16 but explicitly big endian
> >   'ascii':            7-bit ASCII codepage
> >   'latin-1':          Latin-1 codepage
> >   'html-entities':    Latin-1 + HTML entities;
> >                       see htmlentitydefs.py from the standard Python Lib
> >   'jis' (a popular version XXX):
> >                       Japanese character encoding
> >   'unicode-escape':   See Unicode Constructors for a definition
> >   'native':           Dump of the Internal Format used by Python
> 
> I would suggest adding the Dos, Windows and Macintosh standard 8-bit charsets
> (their equivalents of latin-1) too, as documents in these encoding are pretty
> ubiquitous. But maybe these should only be added on the respective platforms.

Good idea. What code pages would that be ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 15:13:25 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 15:13:25 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.6
References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com>
Message-ID: <38316685.7977448D@lemburg.com>

FYI, I've uploaded a new version of the proposal which incorporates
many things we have discussed lately, e.g. the buffer interface,
"s#" vs. "t#", etc.

The latest version of the proposal is available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

    ? Unicode objects support for %-formatting

    ? specifying StreamCodecs

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Tue Nov 16 13:54:51 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 13:54:51 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com>
Message-ID: <3831541B.B242FFA9@lemburg.com>

Fredrik Lundh wrote:
> 
> > I would propose to only add some very basic encodings to
> > the standard distribution, e.g. the ones mentioned under
> > Standard Codecs in the proposal:
> >
> >   'utf-8': 8-bit variable length encoding
> >   'utf-16': 16-bit variable length encoding (little/big endian)
> >   'utf-16-le': utf-16 but explicitly little endian
> >   'utf-16-be': utf-16 but explicitly big endian
> >   'ascii': 7-bit ASCII codepage
> >   'latin-1': Latin-1 codepage
> >   'html-entities': Latin-1 + HTML entities;
> > see htmlentitydefs.py from the standard Python Lib
> >   'jis' (a popular version XXX):
> > Japanese character encoding
> >   'unicode-escape': See Unicode Constructors for a definition
> >   'native': Dump of the Internal Format used by Python
> 
> since this is already very close, maybe we could adopt
> the naming guidelines from XML:
> 
>     In an encoding declaration, the values "UTF-8", "UTF-16",
>     "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used
>     for the various encodings and transformations of
>     Unicode/ISO/IEC 10646, the values "ISO-8859-1",
>     "ISO-8859-2", ... "ISO-8859-9" should be used for the parts
>     of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS",
>     and "EUC-JP" should be used for the various encoded
>     forms of JIS X-0208-1997.
> 
>     XML processors may recognize other encodings; it is
>     recommended that character encodings registered
>     (as charsets) with the Internet Assigned Numbers
>     Authority [IANA], other than those just listed,
>     should be referred to using their registered names.
> 
>     Note that these registered names are defined to be
>     case-insensitive, so processors wishing to match
>     against them should do so in a case-insensitive way.
> 
> (ie "iso-8859-1" instead of "latin-1", etc -- at least as
> aliases...).

>From the proposal:
"""
General Remarks:
----------------

? Unicode encoding names should be lower case on output and
  case-insensitive on input (they will be converted to lower case
  by all APIs taking an encoding name as input).

  Encoding names should follow the name conventions as used by the
  Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is
  written as 'utf-16'.
"""

Is there a naming scheme definition for these encoding names?
(The quote you gave above doesn't really sound like a definition
to me.)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 14:15:19 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 14:15:19 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991116121819.21509.rocketmail@web606.mail.yahoo.com>
Message-ID: <383158E7.BC574A1F@lemburg.com>

Andy Robinson wrote:
> 
> --- "M.-A. Lemburg"  wrote:
> > So I can drop JIS ? [I won't be able to drop the
> > escaped unicode
> > codec because this is needed for u"" and ur"".]
> 
> Drop Japanese from the core language.

Done ... that one was easy ;-)
 
> JIS0208 is a big character set with three popular
> encodings (Shift-JIS, EUC-JP and JIS), and a host of
> slight variations; it has 6879 characters, and there
> are a range of options a user might need to set for it
> to be useful.  So let's assume for now this is a separate
> package.  There's a good chance I'll do it but it is
> not a small job.  If you start statically linking in
> tables of 7000 characters for one Asian language,
> you'll have to do the lot.
> 
> As for the single-byte Latin ones, a prototype Python
> module could be whipped up in a couple of evenings,
> and a tiny C function which does single-byte to
> double-byte mappings and vice versa could make it
> fast.  We can have an extensible, data driven solution
> in no time without having to build it into the core.

Perhaps these helper functions could be integrated into
the core to avoid compilation when adding a new codec.

> The way I see it, to claim that python has i18n, a
> serious effort is needed to ensure every major
> encoding in the world is available to Python users.
> But that's separate from the core language.  Your spec
> should only cover what is going to be hard-coded into
> Python.

Right.
 
> I'd like to see one paragraph in your spec stating
> that our architecture separates the encodings
> themselves from the core language changes, and that
> getting them sorted is a logically separate (but
> important) project.  Ideally, we could put together a
> separate proposal for the encoding library itself and
> run it by some world class experts in that field, but
> after yours is done.

I've added:
All other encodings, such as the CJK ones needed to support Asian scripts,
should be implemented in separate packages which do not get included
in the core Python distribution and are not part of this proposal.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 14:06:39 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 14:06:39 +0100
Subject: [Python-Dev] just say no...
References: 
Message-ID: <383156DF.2209053F@lemburg.com>

Greg Stein wrote:
> 
> On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
> > Guido van Rossum wrote:
> >...
> > > t# refers to byte-encoded data.  Multibyte encodings are explicitly
> > > designed to be passed cleanly through processing steps that handle
> > > single-byte character data, as long as they are 8-bit clean and don't
> > > do too much processing.
> >
> > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
> > "8-bit clean" as you obviously did.
> 
> Hrm. That might be dangerous. Many of the functions that use "t#" assume
> that each character is 8-bits long. i.e. the returned length == the number
> of characters.
> 
> I'm not sure what the implications would be if you interpret the semantics
> of "t#" as multi-byte characters.

FYI, the next version of the proposal now says "s#" gives you
UTF-16 and "t#" returns UTF-8. File objects opened in text mode
will use "t#" and binary ones use "s#".

I'll just use explicit u.encode('utf-8') calls if I want to write
UTF-8 to binary files -- perhaps everyone else should too ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From akuchlin at mems-exchange.org  Tue Nov 16 15:35:39 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 16 Nov 1999 09:35:39 -0500 (EST)
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <19991116091032.A4063@cnri.reston.va.us>
References: <199911152137.QAA28280@eric.cnri.reston.va.us>
	
	<19991116091032.A4063@cnri.reston.va.us>
Message-ID: <14385.27579.292173.433577@amarok.cnri.reston.va.us>

Greg Ward writes:
>Next, the number of "open" calls:
>               Solaris     Linux    IRIX
>  Perl             16         10       9
>  Python          107         71      48

Running 'python -v' explains this:

amarok akuchlin>python -v
# /usr/local/lib/python1.5/exceptions.pyc matches /usr/local/lib/python1.5/exceptions.py
import exceptions # precompiled from /usr/local/lib/python1.5/exceptions.pyc
# /usr/local/lib/python1.5/site.pyc matches /usr/local/lib/python1.5/site.py
import site # precompiled from /usr/local/lib/python1.5/site.pyc
# /usr/local/lib/python1.5/os.pyc matches /usr/local/lib/python1.5/os.py
import os # precompiled from /usr/local/lib/python1.5/os.pyc
import posix # builtin
# /usr/local/lib/python1.5/posixpath.pyc matches /usr/local/lib/python1.5/posixpath.py
import posixpath # precompiled from /usr/local/lib/python1.5/posixpath.pyc
# /usr/local/lib/python1.5/stat.pyc matches /usr/local/lib/python1.5/stat.py
import stat # precompiled from /usr/local/lib/python1.5/stat.pyc
# /usr/local/lib/python1.5/UserDict.pyc matches /usr/local/lib/python1.5/UserDict.py
import UserDict # precompiled from /usr/local/lib/python1.5/UserDict.pyc
Python 1.5.2 (#80, May 25 1999, 18:06:07)  [GCC 2.8.1] on sunos5
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
import readline # dynamically loaded from /usr/local/lib/python1.5/lib-dynload/readline.so

And each import tries several different forms of the module name:

stat("/usr/local/lib/python1.5/os", 0xEFFFD5E0) Err#2 ENOENT
open("/usr/local/lib/python1.5/os.so", O_RDONLY) Err#2 ENOENT
open("/usr/local/lib/python1.5/osmodule.so", O_RDONLY) Err#2 ENOENT
open("/usr/local/lib/python1.5/os.py", O_RDONLY) = 4

I don't see how this is fixable, unless we strip down site.py, which
drags in os, which drags in os.path and stat and UserDict. 

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
I'm going stir-crazy, and I've joined the ranks of the walking brain-dead, but
otherwise I'm just peachy.
    -- Lyta Hall on parenthood, in SANDMAN #40: "Parliament of Rooks"




From guido at CNRI.Reston.VA.US  Tue Nov 16 15:43:07 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 09:43:07 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Tue, 16 Nov 1999 14:06:39 +0100."
             <383156DF.2209053F@lemburg.com> 
References:   
            <383156DF.2209053F@lemburg.com> 
Message-ID: <199911161443.JAA29149@eric.cnri.reston.va.us>

> FYI, the next version of the proposal now says "s#" gives you
> UTF-16 and "t#" returns UTF-8. File objects opened in text mode
> will use "t#" and binary ones use "s#".

Good.

> I'll just use explicit u.encode('utf-8') calls if I want to write
> UTF-8 to binary files -- perhaps everyone else should too ;-)

You could write UTF-8 to files opened in text mode too; at least most
actual systems will leave the UTF-8 escapes alone and just do LF ->
CRLF translation, which should be fine.

--Guido van Rossum (home page: http://www.python.org/~guido/)




From fdrake at acm.org  Tue Nov 16 15:50:55 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 09:50:55 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: <000901bf2ffc$3d4bb8e0$042d153f@tim>
References: <14380.16437.71847.832880@weyr.cnri.reston.va.us>
	<000901bf2ffc$3d4bb8e0$042d153f@tim>
Message-ID: <14385.28495.685427.598748@weyr.cnri.reston.va.us>

Tim Peters writes:
 > Yet another use for a weak reference <0.5 wink>.

  Those just keep popping up!  I seem to recall Diane Hackborne
actually implemented these under the name "vref" long ago; perhaps
that's worth revisiting after all?  (Not the implementation so much as 
the idea.)  I think to make it general would cost one PyObject* in
each object's structure, and some code in some constructors (maybe),
and all destructors, but not much.
  Is this worth pursuing, or is it locked out of the core because of
the added space for the PyObject*?  (Note that the concept isn't
necessarily useful for all object types -- numbers in particular --
but it only makes sense to bother if it works for everything, even if
it's not very useful in some cases.)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From fdrake at acm.org  Tue Nov 16 16:12:43 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 10:12:43 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: 
References: <3830595B.348E8CC7@lemburg.com>
	
Message-ID: <14385.29803.459364.456840@weyr.cnri.reston.va.us>

Greg Stein writes:
 > [ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]

  And the sooner I receive them, the sooner they can be integrated!
Any plans to get them to me?  I'll probably want to do another release 
before the IPC8.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From mal at lemburg.com  Tue Nov 16 15:36:54 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 15:36:54 +0100
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
References: <199911152137.QAA28280@eric.cnri.reston.va.us>  <19991116091032.A4063@cnri.reston.va.us>
Message-ID: <38316C06.8B0E1D7B@lemburg.com>

Greg Ward wrote:
> 
> > Go start Perl 100 times, then do the same with Python. Python is
> > significantly slower. I've actually written a web app in PHP because
> > another one that I did in Python had slow response time.
> > [ yah: the Real Man Answer is to write a real/good mod_python. ]
> 
> I don't think this is the only factor in startup overhead.  Try looking
> into the number of system calls for the trivial startup case of each
> interpreter:
> 
>   $ truss perl -e 1 2> perl.log
>   $ truss python -c 1 2> python.log
> 
> (This is on Solaris; I did the same thing on Linux with "strace", and on
> IRIX with "par -s -SS".  Dunno about other Unices.)  The results are
> interesting, and useful despite the platform and version disparities.
> 
> (For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on
> Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX.  The Solaris is 2.6,
> using the Official CNRI Python Build by Barry, and the ditto Perl build
> by me; the Linux system is starship, using whatever Perl and Python the
> Starship Masters provide us with; the IRIX box is an elderly but
> well-maintained SGI Challenge running IRIX 5.3.)
> 
> Also, this is with an empty PYTHONPATH.  The Solaris build of Python has
> different prefix and exec_prefix, but on the Linux and IRIX builds, they
> are the same.  (I think this will reflect poorly on the Solaris
> version.)  PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect
> startup of the trivial "1" script, so I haven't paid attention to them.

For kicks I've done a similar test with cgipython, the 
one file version of Python 1.5.2:
 
> First, the size of log files (in lines), i.e. number of system calls:
> 
>                Solaris     Linux    IRIX[1]
>   Perl              88        85      70
>   Python           425       316     257

    cgipython                  182 
 
> [1] after chopping off the summary counts from the "par" output -- ie.
>     these really are the number of system calls, not the number of
>     lines in the log files
> 
> Next, the number of "open" calls:
> 
>                Solaris     Linux    IRIX
>   Perl             16         10       9
>   Python          107         71      48

    cgipython                   33 

> (It looks as though *all* of the Perl 'open' calls are due to the
> dynamic linker going through /usr/lib and/or /lib.)
> 
> And the number of unsuccessful "open" calls:
> 
>                Solaris     Linux    IRIX
>   Perl              6          1       3
>   Python           77         49      32

    cgipython                   28

Note that cgipython does search for sitecustomize.py.

> 
> Number of "mmap" calls:
> 
>                Solaris     Linux    IRIX
>   Perl              25        25       1
>   Python            36        24       1

    cgipython                   13

> 
> ...nope, guess we can't blame mmap for any Perl/Python startup
> disparity.
> 
> How about "brk":
> 
>                Solaris     Linux    IRIX
>   Perl               6        11      12
>   Python            47        39      25

    cgipython                   41 (?)

So at least in theory, using cgipython for the intended
purpose should gain some performance.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 17:00:58 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 17:00:58 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
Message-ID: <38317FBA.4F3D6B1F@lemburg.com>

Here is a new proposal for the codec interface:

class Codec:

    def encode(self,u,slice=None):
	
	""" Return the Unicode object u encoded as Python string.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is encoded.

	    The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	"""
	...

    def decode(self,s,slice=None):

	""" Return an equivalent Unicode object for the encoded Python
	    string s.

	    If slice is given (as slice object), only the sliced part
	    of the Python string is decoded and returned as Unicode
	    object.  Note that this can cause the decoding algorithm
	    to fail due to truncations in the encoding.

	    The method may not store state in the Codec instance. Use
            StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	""" 
	...
	

class StreamCodec(Codec):

    def __init__(self,stream=None,errors='strict'):

	""" Creates a StreamCodec instance.

	    stream must be a file-like object open for reading and/or
	    writing binary data depending on the intended codec
            action or None.

	    The StreamCodec may implement different error handling
	    schemes by providing the errors argument. These parameters
	    are known (they need not all be supported by StreamCodec
            subclasses): 

             'strict' - raise a UnicodeError (or a subclass)
             'ignore' - ignore the character and continue with the next
             (a single character)
                      - replace erroneous characters with the given
                        character (may also be a Unicode character)

	"""
	self.stream = stream

    def write(self,u,slice=None):

	""" Writes the Unicode object's contents encoded to self.stream.

	    stream must be a file-like object open for writing binary
	    data.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is written.

        """
	... the base class should provide a default implementation
	    of this method using self.encode ...
	
    def read(self,length=None):

	""" Reads an encoded string from the stream and returns
	    an equivalent Unicode object.

	    If length is given, only length Unicode characters are
	    returned (the StreamCodec instance reads as many raw bytes
            as needed to fulfill this requirement). Otherwise, all
	    available data is read and decoded.

        """
	... the base class should provide a default implementation
	    of this method using self.decode ...


It is not required by the unicodec.register() API to provide a
subclass of these base classes; only the given methods must be present.
This allows writing Codecs as extension types.  All Codecs must
provide the .encode()/.decode() methods. Codecs having the .read()
and/or .write() methods are considered to be StreamCodecs.

The Unicode implementation will by itself only use the
stateless .encode() and .decode() methods.

All other conversions have to be done by explicitly instantiating
the appropriate [Stream]Codec.
--

Feel free to beat on this one ;-)
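
To give people something concrete to beat on, here's a rough
Latin-1 example written against the interface above (illustration
only: it assumes the proposed Unicode object and unicode()
constructor, neither of which exists yet):

    import string

    class Latin1Codec:

        def encode(self,u,slice=None):
            # Latin-1: each code point below 256 maps to the byte of
            # the same value; no state needed
            if slice is not None:
                u = u[slice]
            bytes = []
            for ch in u:
                bytes.append(chr(ord(ch)))
            return string.join(bytes, '')

        def decode(self,s,slice=None):
            if slice is not None:
                s = s[slice]
            # assumes the unicode() constructor from the proposal
            return unicode(s,'latin-1')

    class Latin1StreamCodec(Latin1Codec):

        def __init__(self,stream=None,errors='strict'):
            self.stream = stream
            self.errors = errors

        def write(self,u,slice=None):
            self.stream.write(self.encode(u,slice))

        def read(self,length=None):
            # one byte per character for Latin-1; a multi-byte codec
            # would have to read ahead as needed
            if length is None:
                data = self.stream.read()
            else:
                data = self.stream.read(length)
            return self.decode(data)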

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Tue Nov 16 17:08:49 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 17:08:49 +0100
Subject: [Python-Dev] just say no...
References: <14380.16437.71847.832880@weyr.cnri.reston.va.us>
		<000901bf2ffc$3d4bb8e0$042d153f@tim> <14385.28495.685427.598748@weyr.cnri.reston.va.us>
Message-ID: <38318191.11D93903@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> Tim Peters writes:
>  > Yet another use for a weak reference <0.5 wink>.
> 
>   Those just keep popping up!  I seem to recall Diane Hackborne
> actually implemented these under the name "vref" long ago; perhaps
> that's worth revisiting after all?  (Not the implementation so much as
> the idea.)  I think to make it general would cost one PyObject* in
> each object's structure, and some code in some constructors (maybe),
> and all destructors, but not much.
>   Is this worth pursuing, or is it locked out of the core because of
> the added space for the PyObject*?  (Note that the concept isn't
> necessarily useful for all object types -- numbers in particular --
> but it only makes sense to bother if it works for everything, even if
> it's not very useful in some cases.)

FYI, there's mxProxy which implements a flavor of them. Look
in the standard places for mx stuff ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fdrake at acm.org  Tue Nov 16 17:14:06 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 11:14:06 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: <38318191.11D93903@lemburg.com>
References: <14380.16437.71847.832880@weyr.cnri.reston.va.us>
	<000901bf2ffc$3d4bb8e0$042d153f@tim>
	<14385.28495.685427.598748@weyr.cnri.reston.va.us>
	<38318191.11D93903@lemburg.com>
Message-ID: <14385.33486.855802.187739@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > FYI, there's mxProxy which implements a flavor of them. Look
 > in the standard places for mx stuff ;-)

  Yes, but still not in the core.  So we have two general examples
(vrefs and mxProxy) and there's WeakDict (or something like that).  I
think there really needs to be a core facility for this.  There are a
lot of users (including myself) who think that things are far less
useful if they're not in the core.  (No, I'm not saying that
everything should be in the core, or even that it needs a lot more
stuff.  I just don't want to be writing code that requires a lot of
separate packages to be installed.  At least not until we can tell an
installation tool to "install this and everything it depends on." ;)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From bwarsaw at cnri.reston.va.us  Tue Nov 16 17:14:55 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Tue, 16 Nov 1999 11:14:55 -0500 (EST)
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
References: <199911152137.QAA28280@eric.cnri.reston.va.us>
	
	<19991116091032.A4063@cnri.reston.va.us>
	<14385.27579.292173.433577@amarok.cnri.reston.va.us>
Message-ID: <14385.33535.23316.286575@anthem.cnri.reston.va.us>

>>>>> "AMK" == Andrew M Kuchling  writes:

    AMK> I don't see how this is fixable, unless we strip down
    AMK> site.py, which drags in os, which drags in os.path and stat
    AMK> and UserDict.

One approach might be to support loading modules out of jar files (or
whatever) using Greg's imputils.  We could put the bootstrap .pyc files
in this jar and teach Python to import from it first.  Python
installations could even craft their own modules.jar file to include
whatever modules they are willing to "hard code".  This, with -S, might
make Python start up much faster, at the small cost of some
flexibility (which could be regained with a command-line switch or
other mechanism to bypass modules.jar).
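
Just to sketch the idea (this isn't Greg's imputils, and the archive
format is hand-waved -- everything below is made up):

    import __builtin__, imp, marshal, sys

    _archive = {}            # module name -> marshalled code string
    _original_import = __builtin__.__import__

    def _archive_import(name, globals=None, locals=None, fromlist=None):
        if sys.modules.has_key(name):
            return sys.modules[name]
        if _archive.has_key(name):
            # found in the "jar": create the module and run its code
            module = imp.new_module(name)
            sys.modules[name] = module
            code = marshal.loads(_archive[name])
            exec code in module.__dict__   # packages ignored for brevity
            return module
        return _original_import(name, globals, locals, fromlist)

    __builtin__.__import__ = _archive_import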

-Barry



From guido at CNRI.Reston.VA.US  Tue Nov 16 17:20:28 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 11:20:28 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Tue, 16 Nov 1999 17:00:58 +0100."
             <38317FBA.4F3D6B1F@lemburg.com> 
References: <38317FBA.4F3D6B1F@lemburg.com> 
Message-ID: <199911161620.LAA02643@eric.cnri.reston.va.us>

> It is not required by the unicodec.register() API to provide a
> subclass of these base classes; only the given methods must be present.
> This allows writing Codecs as extension types.  All Codecs must
> provide the .encode()/.decode() methods. Codecs having the .read()
> and/or .write() methods are considered to be StreamCodecs.
> 
> The Unicode implementation will by itself only use the
> stateless .encode() and .decode() methods.
> 
> All other conversions have to be done by explicitly instantiating
> the appropriate [Stream]Codec.

Looks okay, although I'd like someone to implement a simple
shift-state-based stream codec to check this out further.

I have some questions about the constructor.  You seem to imply
that instantiating the class without arguments creates a codec without
state.  That's fine.  When given a stream argument, shouldn't the
direction of the stream be given as an additional argument, so the
proper state for encoding or decoding can be set up?  I can see that
for an implementation it might be more convenient to have separate
classes for encoders and decoders -- certainly the state being kept is
very different.

Also, I don't want to ignore the alternative interface that was
suggested by /F.  It uses feed() similar to htmllib c.s.  This has
some advantages (although we might want to define some compatibility
so it can also feed directly into a file).

Perhaps someone should go ahead and implement prototype codecs using
either paradigm and then write some simple apps, so we can make a
better decision.

In any case I think the spec's codec registry API isn't on the
critical path; integration of /F's basic unicode object is the first
thing we need.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Tue Nov 16 17:27:53 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 11:27:53 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: Your message of "Tue, 16 Nov 1999 11:14:55 EST."
             <14385.33535.23316.286575@anthem.cnri.reston.va.us> 
References: <199911152137.QAA28280@eric.cnri.reston.va.us>  <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us>  
            <14385.33535.23316.286575@anthem.cnri.reston.va.us> 
Message-ID: <199911161627.LAA02665@eric.cnri.reston.va.us>

> >>>>> "AMK" == Andrew M Kuchling  writes:
> 
>     AMK> I don't see how this is fixable, unless we strip down
>     AMK> site.py, which drags in os, which drags in os.path and stat
>     AMK> and UserDict.
> 
> One approach might be to support loading modules out of jar files (or
> whatever) using Greg imputils.  We could put the bootstrap .pyc files
> in this jar and teach Python to import from it first.  Python
> installations could even craft their own modules.jar file to include
> whatever modules they are willing to "hard code".  This, with -S might
> make Python start up much faster, at the small cost of some
> flexibility (which could be regained with a c.l. switch or other
> mechanism to bypass modules.jar).

A completely different approach (which, incidentally, HP has lobbied
for before; and which has been implemented by Sjoerd Mullender for one
particular application) would be to cache a mapping from module names
to filenames in a dbm file.  For Sjoerd's app (which imported hundreds
of modules) this made a huge difference.  The problem is that it's
hard to deal with issues like updating the cache while sharing it with
other processes and even other users...  But if those can be solved,
this could greatly reduce the number of stats and unsuccessful opens,
without having to resort to jar files.
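
A minimal sketch of the idea (the cache location is made up, the
invalidation problems mentioned above are simply ignored, and a
cached entry is assumed to be a .py file for brevity):

    import anydbm, imp, os

    _cache = anydbm.open('/tmp/module-locations', 'c')

    def find_module_cached(name):
        # consult the cache first, falling back on the usual
        # sys.path scan via imp.find_module()
        if _cache.has_key(name):
            filename = _cache[name]
            if os.path.exists(filename):
                return open(filename), filename, ('.py', 'r', imp.PY_SOURCE)
        file, filename, description = imp.find_module(name)
        if filename:
            _cache[name] = filename
        return file, filename, description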

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gmcm at hypernet.com  Tue Nov 16 17:56:19 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Tue, 16 Nov 1999 11:56:19 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <14385.33535.23316.286575@anthem.cnri.reston.va.us>
Message-ID: <1269351119-9152905@hypernet.com>

Barry A. Warsaw writes:

> One approach might be to support loading modules out of jar files
> (or whatever) using Greg imputils.  We could put the bootstrap
> .pyc files in this jar and teach Python to import from it first. 
> Python installations could even craft their own modules.jar file
> to include whatever modules they are willing to "hard code". 
> This, with -S might make Python start up much faster, at the
> small cost of some flexibility (which could be regained with a
> c.l. switch or other mechanism to bypass modules.jar).

Couple hundred Windows users have been doing this for 
months (http://starship.python.net/crew/gmcm/install.html). 
The .pyz files are cross-platform, although the "embedding" 
app would have to be redone for *nix, (and all the embedding 
really does is keep Python from hunting all over your disk). 
Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a 
diskette with a little room left over.

but-since-its-WIndows-it-must-be-tainted-ly y'rs


- Gordon



From guido at CNRI.Reston.VA.US  Tue Nov 16 18:00:15 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 12:00:15 -0500
Subject: [Python-Dev] Python 1.6 status
Message-ID: <199911161700.MAA02716@eric.cnri.reston.va.us>

Greg Stein recently reminded me that he was holding off on 1.6 patches
because he was under the impression that I wasn't accepting them yet.

The situation is rather more complicated than that.  There are a great
many things that need to be done, and for many of them I'd be most
happy to receive patches!  For other things, however, I'm still in the
requirements analysis phase, and patches might be premature (e.g., I
want to redesign the import mechanisms, and while I like some of the
prototypes that have been posted, I'm not ready to commit to any
specific implementation).

How do you know for which things I'm ready for patches?  Ask me.  I've
tried to make lists before, and there are probably some hints in the
TODO FAQ wizard as well as in the "requests" section of the Python
Bugs List.

Greg also suggested that I might receive more patches if I opened up
the CVS tree for checkins by certain valued contributors.  On the one
hand I'm reluctant to do that (I feel I have a pretty good track
record of checking in patches that are mailed to me, assuming I agree
with them) but on the other hand there might be something to say for
this, because it gives contributors more of a sense of belonging to
the inner core.  Of course, checkin privileges don't mean you can
check in anything you like -- as in the Apache world, changes must be
discussed and approved by the group, and I would like to have a veto.
However once a change is approved, it's much easier if the contributor
can check the code in without having to go through me all the time.

A drawback may be that some people will make very forceful requests to
be given checkin privileges, only to never use them; just like there
are some members of python-dev who have never contributed.  I
definitely want to limit the number of privileged contributors to a
very small number (e.g. 10-15).

One additional detail is the legal side -- contributors will have to
sign some kind of legal document similar to the current (wetsign.html)
release form, but guiding all future contributions.  I'll have to
discuss this with CNRI's legal team.

Greg, I understand you have checkin privileges for Apache.  What is
the procedure there for handing out those privileges?  What is the
procedure for using them?  (E.g. if you made a bogus change to part of
Apache you're not supposed to work on, what happens?)

I'm hoping for several kind of responses to this email:

- uncontroversial patches

- questions about whether specific issues are sufficiently settled to
start coding a patch

- discussion threads opening up some issues that haven't been settled
yet (like the current, very productive, thread in i18n)

- posts summarizing issues that were settled long ago in the past,
requesting reverification that the issue is still settled

- suggestions for new issues that maybe ought to be settled in 1.6

- requests for checkin privileges, preferably with a specific issue or
area of expertise for which the requestor will take responsibility

--Guido van Rossum (home page: http://www.python.org/~guido/)



From akuchlin at mems-exchange.org  Tue Nov 16 18:11:48 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 16 Nov 1999 12:11:48 -0500 (EST)
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us>
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: <14385.36948.610106.195971@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>I'm hoping for several kind of responses to this email:

My list of things to do for 1.6 is:

   * Translate re.py to C and switch to the latest PCRE 2 codebase
(mostly done, perhaps ready for public review in a week or so).

   * Go through the O'Reilly POSIX book and draw up a list of
POSIX functions that aren't available in the posix module.  This
was sparked by Greg Ward showing me a Perl daemonize() function
he'd written, and I realized that some of the functions it used
weren't available in Python at all.  (setsid() was one of them, I
think.)

   * A while back I got approval to add the mmapfile module to the
core.  The outstanding issue there is that the constructor has a
different interface on Unix and Windows platforms.

On Windows:
mm = mmapfile.mmapfile("filename", "tag name", <size>)

On Unix, it looks like the mmap() function:

mm = mmapfile.mmapfile(<file descriptor>, <size>,
                       <flags> (like MAP_SHARED),
                       <protection> (like PROT_READ, PROT_READWRITE)
                      )

Can we reconcile these interfaces, have two different function names,
or what?

>- suggestions for new issues that maybe ought to be settled in 1.6

Perhaps we should figure out what new capabilities, if any, should be
added in 1.6.  Fred has mentioned weak references, and there are other
possibilities such as ExtensionClass.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Society, my dear, is like salt water, good to swim in but hard to swallow.
    -- Arthur Stringer, _The Silver Poppy_




From beazley at cs.uchicago.edu  Tue Nov 16 18:24:24 1999
From: beazley at cs.uchicago.edu (David Beazley)
Date: Tue, 16 Nov 1999 11:24:24 -0600 (CST)
Subject: [Python-Dev] Python 1.6 status
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
	<14385.36948.610106.195971@amarok.cnri.reston.va.us>
Message-ID: <199911161724.LAA13496@gargoyle.cs.uchicago.edu>

Andrew M. Kuchling writes:
> Guido van Rossum writes:
> >I'm hoping for several kind of responses to this email:
> 
>    * Go through the O'Reilly POSIX book and draw up a list of missing
> POSIX functions that aren't available in the posix module.  This
> was sparked by Greg Ward showing me a Perl daemonize() function
> he'd written, and I realized that some of the functions it used
> weren't available in Python at all.  (setsid() was one of them, I
> think.)
> 

I second this!   This was one of the things I noticed when doing the
Essential Reference Book.   Assuming no one has done it already,
I wouldn't mind volunteering to take a crack at it.

Cheers,

Dave





From fdrake at acm.org  Tue Nov 16 18:25:02 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 12:25:02 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <199911161620.LAA02643@eric.cnri.reston.va.us>
References: <38317FBA.4F3D6B1F@lemburg.com>
	<199911161620.LAA02643@eric.cnri.reston.va.us>
Message-ID: <14385.37742.816993.642515@weyr.cnri.reston.va.us>

Guido van Rossum writes:
 > Also, I don't want to ignore the alternative interface that was
 > suggested by /F.  It uses feed() similar to htmllib c.s.  This has
 > some advantages (although we might want to define some compatibility
 > so it can also feed directly into a file).

  I think either one can be the primary interface, with a wrapper that
converts to the other.  Perhaps the encoders should provide
feed(), and a file-like wrapper can convert write() to feed().  It
could also be done the other way; I'm not sure if it matters which is
"normal."  (Or perhaps feed() was badly named and should be write()?
The general intent was a little different, I think, but an output file 
is very much a stream consumer.)
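
Something like this trivial wrapper is what I have in mind (sketch
only; "consumer" is anything with a feed() method):

    class FeedWriter:
        # adapts the write()-style file interface to a feed()-style
        # consumer (e.g. a codec or parser); close() is passed along
        # if the consumer has one
        def __init__(self, consumer):
            self.consumer = consumer

        def write(self, data):
            self.consumer.feed(data)

        def close(self):
            if hasattr(self.consumer, 'close'):
                self.consumer.close()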


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From akuchlin at mems-exchange.org  Tue Nov 16 18:32:41 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 16 Nov 1999 12:32:41 -0500 (EST)
Subject: [Python-Dev] mmapfile module
In-Reply-To: <199911161720.MAA02764@eric.cnri.reston.va.us>
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
	<14385.36948.610106.195971@amarok.cnri.reston.va.us>
	<199911161720.MAA02764@eric.cnri.reston.va.us>
Message-ID: <14385.38201.301429.786642@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>Hm, this seems to require a higher-level Python module to hide the
>differences.  Maybe the Unix version could also use a filename?  I
>would think that mmap'ed files should always be backed by a file (not
>by a pipe, socket etc.).  Or is there an issue with secure creation of
>temp files?  This is a question for a separate thread.

Hmm... I don't know of any way to use mmap() on non-file things,
either; there are odd special cases, like using MAP_ANONYMOUS on
/dev/zero to allocate memory, but that's still using a file.  On the
other hand, there may be some special case where you need to do that.
We could add a fileno() method to get the file descriptor, but I don't
know if that's useful to Windows.  (Is Sam Rushing, the original
author of the Win32 mmapfile, on this list?)  

What do we do about the tagname, which is a Win32 argument that has no
Unix counterpart -- I'm not even sure what its function is.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
I had it in me to be the Pierce Brosnan of my generation.
    -- Vincent Me's past career plans in EGYPT #1



From mal at lemburg.com  Tue Nov 16 18:53:46 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 18:53:46 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us>
Message-ID: <38319A2A.4385D2E7@lemburg.com>

Guido van Rossum wrote:
> 
> > It is not required by the unicodec.register() API to provide a
> > subclass of these base classes; only the given methods must be present.
> > This allows writing Codecs as extension types.  All Codecs must
> > provide the .encode()/.decode() methods. Codecs having the .read()
> > and/or .write() methods are considered to be StreamCodecs.
> >
> > The Unicode implementation will by itself only use the
> > stateless .encode() and .decode() methods.
> >
> > All other conversions have to be done by explicitly instantiating
> > the appropriate [Stream]Codec.
> 
> Looks okay, although I'd like someone to implement a simple
> shift-state-based stream codec to check this out further.
> 
> I have some questions about the constructor.  You seem to imply
> that instantiating the class without arguments creates a codec without
> state.  That's fine.  When given a stream argument, shouldn't the
> direction of the stream be given as an additional argument, so the
> proper state for encoding or decoding can be set up?  I can see that
> for an implementation it might be more convenient to have separate
> classes for encoders and decoders -- certainly the state being kept is
> very different.

Wouldn't it be possible to have the read/write methods set up
the state when called for the first time ?

Note that I wrote ".read() and/or .write() methods" in the proposal
on purpose: you can of course implement Codecs which only implement
one of them, i.e. Readers and Writers. The registry doesn't care
about them anyway :-)

Then, if you use a Reader for writing, it will result in an
AttributeError...
 
> Also, I don't want to ignore the alternative interface that was
> suggested by /F.  It uses feed() similar to htmllib c.s.  This has
> some advantages (although we might want to define some compatibility
> so it can also feed directly into a file).

AFAIK, .feed() and .finalize() (or .close() etc.) have a different
background: you add data in chunks and then process it at some
final stage rather than for each feed. This is often more
efficient.

With respect to codecs this would mean that you buffer the
output in memory, first doing only preliminary operations on
the feeds and then applying some final logic to the buffer at
the time .finalize() is called.

We could define a StreamCodec subclass for this kind of operation.
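
Something along these lines, perhaps (sketch only; for brevity it
wraps a stateless Codec instance rather than subclassing one):

    import string

    class BufferedStreamCodec:

        def __init__(self,codec,stream=None,errors='strict'):
            self.codec = codec      # stateless Codec doing the real work
            self.stream = stream
            self.errors = errors
            self.buffer = []

        def feed(self,u):
            # just collect the chunks here
            self.buffer.append(u)

        def finalize(self):
            # do the real work once, over the complete input
            # (string.join used as a stand-in for Unicode concatenation)
            data = string.join(self.buffer, '')
            self.buffer = []
            self.stream.write(self.codec.encode(data))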

> Perhaps someone should go ahead and implement prototype codecs using
> either paradigm and then write some simple apps, so we can make a
> better decision.
> 
> In any case I think the spec's codec registry API isn't on the
> critical path; integration of /F's basic unicode object is the first
> thing we need.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From gward at cnri.reston.va.us  Tue Nov 16 18:54:06 1999
From: gward at cnri.reston.va.us (Greg Ward)
Date: Tue, 16 Nov 1999 12:54:06 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <199911161627.LAA02665@eric.cnri.reston.va.us>; from guido@cnri.reston.va.us on Tue, Nov 16, 1999 at 11:27:53AM -0500
References: <199911152137.QAA28280@eric.cnri.reston.va.us>  <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> <199911161627.LAA02665@eric.cnri.reston.va.us>
Message-ID: <19991116125405.B4063@cnri.reston.va.us>

On 16 November 1999, Guido van Rossum said:
> A completely different approach (which, incidentally, HP has lobbied
> for before; and which has been implemented by Sjoerd Mullender for one
> particular application) would be to cache a mapping from module names
> to filenames in a dbm file.  For Sjoerd's app (which imported hundreds
> of modules) this made a huge difference.

Hey, this could be a big win for Zope startup.  Dunno how much of that
20-30 sec startup overhead is due to loading modules, but I'm sure it's
a sizeable percentage.  Any Zope-heads listening?

> The problem is that it's
> hard to deal with issues like updating the cache while sharing it with
> other processes and even other users...

Probably not a concern in the case of Zope: one installation, one
process, only gets started when it's explicitly shut down and
restarted.  HmmmMMMMmmm...

        Greg



From petrilli at amber.org  Tue Nov 16 19:04:46 1999
From: petrilli at amber.org (Christopher Petrilli)
Date: Tue, 16 Nov 1999 13:04:46 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <19991116125405.B4063@cnri.reston.va.us>; from gward@cnri.reston.va.us on Tue, Nov 16, 1999 at 12:54:06PM -0500
References: <199911152137.QAA28280@eric.cnri.reston.va.us>  <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> <199911161627.LAA02665@eric.cnri.reston.va.us> <19991116125405.B4063@cnri.reston.va.us>
Message-ID: <19991116130446.A3068@trump.amber.org>

Greg Ward [gward at cnri.reston.va.us] wrote:
> On 16 November 1999, Guido van Rossum said:
> > A completely different approach (which, incidentally, HP has lobbied
> > for before; and which has been implemented by Sjoerd Mullender for one
> > particular application) would be to cache a mapping from module names
> > to filenames in a dbm file.  For Sjoerd's app (which imported hundreds
> > of modules) this made a huge difference.
> 
> Hey, this could be a big win for Zope startup.  Dunno how much of that
> 20-30 sec startup overhead is due to loading modules, but I'm sure it's
> a sizeable percentage.  Any Zope-heads listening?

Wow, that's a huge startup time that I've personally never seen.  I can't
imagine... even loading the Oracle libraries dynamically, which are HUGE
(2Mb or so), it only takes a couple of seconds.

> > The problem is that it's
> > hard to deal with issues like updating the cache while sharing it with
> > other processes and even other users...
> 
> Probably not a concern in the case of Zope: one installation, one
> process, only gets started when it's explicitly shut down and
> restarted.  HmmmMMMMmmm...

This doesn't resolve it for a lot of other users of Python, however...
and Zope would always benefit, especially when you're running multiple
instances on the same machine... they would perhaps share more code.

Chris
-- 
| Christopher Petrilli
| petrilli at amber.org



From gmcm at hypernet.com  Tue Nov 16 19:04:41 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Tue, 16 Nov 1999 13:04:41 -0500
Subject: [Python-Dev] mmapfile module
In-Reply-To: <14385.38201.301429.786642@amarok.cnri.reston.va.us>
References: <199911161720.MAA02764@eric.cnri.reston.va.us>
Message-ID: <1269347016-9399681@hypernet.com>

Andrew M. Kuchling wrote:

> Hmm... I don't know of any way to use mmap() on non-file things,
> either; there are odd special cases, like using MAP_ANONYMOUS on
> /dev/zero to allocate memory, but that's still using a file.  On
> the other hand, there may be some special case where you need to
> do that. We could add a fileno() method to get the file
> descriptor, but I don't know if that's useful to Windows.  (Is
> Sam Rushing, the original author of the Win32 mmapfile, on this
> list?)  
> 
> What do we do about the tagname, which is a Win32 argument that
> has no Unix counterpart -- I'm not even sure what its function
> is.

On Windows, a mmap is always backed by disk (swap 
space), but is not necessarily associated with a (user-land) 
file. The tagname is like the "name" associated with a 
semaphore; two processes opening the same tagname get 
shared memory.

Fileno (in the c runtime sense) would be useless on Windows. 
As with all Win32 resources, there's a "handle", which is 
analogous. But different enough, it seems to me, to confound 
any attempts at a common API.

Another fundamental difference (IIRC) is that Windows mmap's 
can be resized on the fly.

- Gordon



From guido at CNRI.Reston.VA.US  Tue Nov 16 19:09:43 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 13:09:43 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Tue, 16 Nov 1999 18:53:46 +0100."
             <38319A2A.4385D2E7@lemburg.com> 
References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us>  
            <38319A2A.4385D2E7@lemburg.com> 
Message-ID: <199911161809.NAA02894@eric.cnri.reston.va.us>

> > I have some questions about the constructor.  You seem to imply
> > that instantiating the class without arguments creates a codec without
> > state.  That's fine.  When given a stream argument, shouldn't the
> > direction of the stream be given as an additional argument, so the
> > proper state for encoding or decoding can be set up?  I can see that
> > for an implementation it might be more convenient to have separate
> > classes for encoders and decoders -- certainly the state being kept is
> > very different.
> 
> Wouldn't it be possible to have the read/write methods set up
> the state when called for the first time ?

Hm, I'd rather be explicit.  We don't do this for files either.

> Note that I wrote ".read() and/or .write() methods" in the proposal
> on purpose: you can of course implement Codecs which only implement
> one of them, i.e. Readers and Writers. The registry doesn't care
> about them anyway :-)
> 
> Then, if you use a Reader for writing, it will result in an
> AttributeError...
>  
> > Also, I don't want to ignore the alternative interface that was
> > suggested by /F.  It uses feed() similar to htmllib c.s.  This has
> > some advantages (although we might want to define some compatibility
> > so it can also feed directly into a file).
> 
> AFAIK, .feed() and .finalize() (or .close() etc.) have a different
> background: you add data in chunks and then process it at some
> final stage rather than for each feed. This is often more
> efficient.
> 
> With respect to codecs this would mean that you buffer the
> output in memory, first doing only preliminary operations on
> the feeds and then apply some final logic to the buffer at
> the time .finalize() is called.

This is part of the purpose, yes.

> We could define a StreamCodec subclass for this kind of operation.

The difference is that to decode from a file, your proposed interface
is to call read() on the codec which will in turn call read() on the
stream.  In /F's version, I call read() on the stream (getting multibyte
encoded data), feed() that to the codec, which in turn calls feed() to
some other back end -- perhaps another codec which in turn feed()s its
converted data to another file, perhaps an XML parser.
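
A rough sketch of such a feed() chain, with made-up class names and a
trivial transformation standing in for a real codec (so this only shows
the shape of the thing, not a proposed interface):

class UpperCaser:
    # stand-in "codec" stage: transforms what it is fed and pushes
    # the result on to the next stage in the chain
    def __init__(self, target):
        self.target = target
    def feed(self, data):
        self.target.feed(data.upper())
    def close(self):
        self.target.close()

class FileSink:
    # final stage: writes whatever it is fed to an open file object
    def __init__(self, fp):
        self.fp = fp
    def feed(self, data):
        self.fp.write(data)
    def close(self):
        self.fp.close()

def copy_push_style(infp, outfp, blocksize=8192):
    # pull blocks from the input stream and push them down the chain
    chain = UpperCaser(FileSink(outfp))
    while 1:
        block = infp.read(blocksize)
        if not block:
            break
        chain.feed(block)
    chain.close()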

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fdrake at acm.org  Tue Nov 16 19:16:42 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 13:16:42 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <38319A2A.4385D2E7@lemburg.com>
References: <38317FBA.4F3D6B1F@lemburg.com>
	<199911161620.LAA02643@eric.cnri.reston.va.us>
	<38319A2A.4385D2E7@lemburg.com>
Message-ID: <14385.40842.709711.12141@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > Wouldn't it be possible to have the read/write methods set up
 > the state when called for the first time ?

  That slows things down; the constructor should handle initialization.
Perhaps what gets registered should be:  encoding function, decoding
function, stream encoder factory (can be a class), stream decoder
factory (again, can be a class).  These can be encapsulated either
before or after hitting the registry, and can be None.  The registry
can provide default implementations from what is provided (stream
handlers from the functions, or functions from the stream handlers) as
required.
  Ideally, I should be able to write a module with four well-known
entry points and then provide the module object itself as the
registration entry.  Or I could construct a new object that has the
right interface and register that if it made more sense for the
encoding.
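
For illustration, such a module could be as small as the following
sketch (the entry point names, and the idea that a registry would accept
the module object itself, are assumptions here, not a settled interface):

# sketch of a codec module exposing four well-known entry points

def encode(u):
    # string -> encoded data; identity here, just to keep the sketch runnable
    return u

def decode(s):
    # encoded data -> string
    return s

class StreamWriter:
    def __init__(self, stream):
        self.stream = stream
    def write(self, u):
        self.stream.write(encode(u))

class StreamReader:
    def __init__(self, stream):
        self.stream = stream
    def read(self, size=-1):
        return decode(self.stream.read(size))

# registration could then be a single call passing this module along,
# e.g. something like: unicodec.register("some-encoding", sys.modules[__name__])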

 > AFAIK, .feed() and .finalize() (or .close() etc.) have a different
 > background: you add data in chunks and then process it at some
 > final stage rather than for each feed. This is often more

  Many of the classes that provide feed() do as much work as possible
as data is fed into them (see htmllib.HTMLParser); this structure is
commonly used to support asynchronous operation.

 > With respect to codecs this would mean that you buffer the
 > output in memory, first doing only preliminary operations on
 > the feeds and then apply some final logic to the buffer at
 > the time .finalize() is called.

  That depends on the encoding.  I'd expect it to feed encoded data to 
a sink as quickly as it could and let the target decide what needs to
happen.  If buffering is needed, the target could be a StringIO or
whatever.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From fredrik at pythonware.com  Tue Nov 16 20:32:21 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 20:32:21 +0100
Subject: [Python-Dev] mmapfile module
References: <199911161700.MAA02716@eric.cnri.reston.va.us><14385.36948.610106.195971@amarok.cnri.reston.va.us><199911161720.MAA02764@eric.cnri.reston.va.us> <14385.38201.301429.786642@amarok.cnri.reston.va.us>
Message-ID: <002201bf3069$4e232a50$f29b12c2@secret.pythonware.com>

> Hmm... I don't know of any way to use mmap() on non-file things,
> either; there are odd special cases, like using MAP_ANONYMOUS on
> /dev/zero to allocate memory, but that's still using a file.

but that's not always the case -- OSF/1 supports
truly anonymous mappings, for example.  in fact,
it bombs if you use ANONYMOUS with a file handle:

$ man mmap

    ...

    If MAP_ANONYMOUS is set in the flags parameter:

        +  A new memory region is created and initialized to all zeros.  This
           memory region can be shared only with descendents of the current pro-
           cess.

        +  If the filedes parameter is not -1, the mmap() function fails.

    ...

(btw, doing anonymous maps isn't exactly an odd special
case under this operating system; it's the only memory-
allocation mechanism provided by the kernel...)






From fredrik at pythonware.com  Tue Nov 16 20:33:52 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 20:33:52 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>  <199911161620.LAA02643@eric.cnri.reston.va.us>
Message-ID: <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> Also, I don't want to ignore the alternative interface that was
> suggested by /F.  It uses feed() similar to htmllib c.s.  This has
> some advantages (although we might want to define some
> compatibility so it can also feed directly into a file).

seeing this made me switch on my brain for a moment,
and recall how things are done in PIL (which is, as I've
bragged about before, another library with an internal
format, and many possible external encodings).  among
other things, PIL lets you read and write images to both
ordinary files and arbitrary file objects, but it also lets
you incrementally decode images by feeding it chunks
of data (through ImageFile.Parser).  and it's fast -- it has
to be, since images tend to contain lots of pixels...

anyway, here's what I came up with (code will follow,
if someone's interested).

--------------------------------------------------------------------
A PIL-like Unicode Codec Proposal
--------------------------------------------------------------------

In the PIL model, the codecs are called with a piece of data, and
returns the result to the caller.  The codecs maintain internal state
when needed.

class decoder:

    def decode(self, s, offset=0):
        # decode as much data as we possibly can from the
        # given string.  if there's not enough data in the
        # input string to form a full character, return
        # what we've got this far (this might be an empty
        # string).

    def flush(self):
        # flush the decoding buffers.  this should usually
        # return None, unless knowing that the input
        # stream has ended means that the state can be
        # interpreted in a meaningful way.  however, if the
        # state indicates that the last character was not
        # finished, this method should raise a UnicodeError
        # exception.

class encoder:

    def encode(self, u, offset=0, buffersize=0):
        # encode data from the given offset in the input
        # unicode string into a buffer of the given size
        # (or slightly larger, if required to proceed).
        # if the buffer size is 0, the encoder is free
        # to pick a suitable size itself (if at all
        # possible, it should make it large enough to
        # encode the entire input string).  returns a
        # 2-tuple containing the encoded data, and the
        # number of characters consumed by this call.

    def flush(self):
        # flush the encoding buffers.  returns an ordinary
        # string (which may be empty), or None.

Note that a codec instance can be used for a single string; the codec
registry should hold codec factories, not codec instances.  In
addition, you may use a single type or class to implement both
interfaces at once.

--------------------------------------------------------------------
Use Cases
--------------------------------------------------------------------

A null decoder:

    class decoder:
        def decode(self, s, offset=0):
            return s[offset:]
        def flush(self):
            pass

A null encoder:

    class encoder:
        def encode(self, s, offset=0, buffersize=0):
            if buffersize:
                s = s[offset:offset+buffersize]
            else:
                s = s[offset:]
            return s, len(s)
        def flush(self):
            pass

Decoding a string:

    def decode(s, encoding):
        c = registry.getdecoder(encoding)
        u = c.decode(s)
        t = c.flush()
        if not t:
            return u
        return u + t # not very common

Encoding a string:

    def encode(u, encoding):
        c = registry.getencoder(encoding)
        p = []
        o = 0
        while o < len(u):
            s, n = c.encode(u, o)
            p.append(s)
            o = o + n
        if len(p) == 1:
            return p[0]
        return string.join(p, "") # not very common

Implementing stream codecs is left as an exercise (see the zlib
material in the eff-bot guide for a decoder example).

--- end of proposal
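
(For what it's worth, a stream reader over the decoder interface above
might look like the following sketch; it is not part of the proposal,
and the buffering is kept deliberately simple:)

class StreamReader:
    # wraps a decoder object (decode()/flush(), as sketched above)
    # around an ordinary file-like object
    def __init__(self, stream, decoder, blocksize=8192):
        self.stream = stream
        self.decoder = decoder
        self.blocksize = blocksize
    def read(self):
        # read raw data in chunks, hand each chunk to the decoder
        # (which keeps any partial character in its internal state),
        # and collect whatever decoded text comes back
        result = []
        while 1:
            block = self.stream.read(self.blocksize)
            if not block:
                tail = self.decoder.flush()
                if tail:
                    result.append(tail)
                break
            result.append(self.decoder.decode(block))
        return "".join(result)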




From fredrik at pythonware.com  Tue Nov 16 20:37:40 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 20:37:40 +0100
Subject: [Python-Dev] Python 1.6 status
References: <199911161700.MAA02716@eric.cnri.reston.va.us> <14385.36948.610106.195971@amarok.cnri.reston.va.us>
Message-ID: <003d01bf306a$0bdea330$f29b12c2@secret.pythonware.com>

>    * Go through the O'Reilly POSIX book and draw up a list of missing
> POSIX functions that aren't available in the posix module.  This
> was sparked by Greg Ward showing me a Perl daemonize() function
> he'd written, and I realized that some of the functions it used
> weren't available in Python at all.  (setsid() was one of them, I
> think.)

$ python
Python 1.5.2 (#1, Aug 23 1999, 14:42:39)  [GCC 2.7.2.3] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import os
>>> os.setsid
<built-in function setsid>







From mhammond at skippinet.com.au  Tue Nov 16 22:54:15 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 17 Nov 1999 08:54:15 +1100
Subject: [Python-Dev] Some thoughts on the codecs... 
In-Reply-To: <19991116110555.8B43335BB1E@snelboot.oratrix.nl>
Message-ID: <00f701bf307d$20f0cb00$0501a8c0@bobcat>

[Andy writes:]
> Leave JISXXX and the CJK stuff out.  If you get into Japanese, you
> really need to cover ShiftJIS, EUC-JP and JIS, they are big, and
there

[Then Marc replies:]
> 2. give more information to the unicodec registry:
>    one could register classes instead of instances which the Unicode

[Jack chimes in with:]
> I would suggest adding the Dos, Windows and Macintosh
> standard 8-bit charsets
> (their equivalents of latin-1) too, as documents in these
> encoding are pretty
> ubiquitous. But maybe these should only be added on the
> respective platforms.

[And the conversation twisted around to Greg noting:]
> Next, the number of "open" calls:
>
>               Solaris     Linux    IRIX
>  Perl             16         10       9
>  Python          107         71      48

This is leading me to conclude that our "codec registry" should be the
file system, and Python modules.

Would it be possible to define a "standard package" called
"encodings", and when we need an encoding, we simply attempt to load a
module from that package?  The key benefits I see are:

* No need to load modules simply to register a codec (which would make
the number of open calls even higher, and the startup time even
slower.)  This makes it truly demand-loading of the codecs, rather
than explicit load-and-register.

* Making language specific distributions becomes simple - simply
select a different set of modules from the "encodings" directory.  The
Python source distribution has them all, but (say) the Windows binary
installer selects only a few.  The Japanese binary installer for
Windows installs a few more.

* Installing new codecs becomes trivial - no need to hack site.py
etc - simply copy the new "codec module" to the encodings directory
and you are done.

* No serious problem for GMcM's installer nor for freeze

We would probably need to assume that certain codecs exist for _all_
platforms and languages - but this is no different to assuming that
"exceptions.py" also exists for all platforms.

Is this worthy of consideration?

Mark.




From andy at robanal.demon.co.uk  Wed Nov 17 01:14:06 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Wed, 17 Nov 1999 00:14:06 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <010001bf300e$14741310$f29b12c2@secret.pythonware.com>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>             <38305D17.60EC94D0@lemburg.com>  <199911152137.QAA28280@eric.cnri.reston.va.us> <010001bf300e$14741310$f29b12c2@secret.pythonware.com>
Message-ID: <3836f28c.4929177@post.demon.co.uk>

On Tue, 16 Nov 1999 09:39:20 +0100, you wrote:

>1) codes written according to the "data
>   consumer model", instead of the "stream"
>   model.
>
>        class myDecoder:
>            def __init__(self, target):
>                self.target = target
>                self.state = ...
>            def feed(self, data):
>                ... extract as much data as possible ...
>                self.target.feed(extracted data)
>            def close(self):
>                ... extract what's left ...
>                self.target.feed(additional data)
>                self.target.close()
>
Apart from feed() instead of write(), how is that different from a
Java-like Stream writer as Guido suggested?  He said:

>Andy's file translation example could then be written as follows:
>
># assuming variables input_file, input_encoding, output_file,
># output_encoding, and constant BUFFER_SIZE
>
>f = open(input_file, "rb")
>f1 = unicodec.codecs[input_encoding].stream_reader(f)
>g = open(output_file, "wb")
>g1 = unicodec.codecs[output_encoding].stream_writer(g)
>
>while 1:
>      buffer = f1.read(BUFFER_SIZE)
>      if not buffer:
>	 break
>      g1.write(buffer)
>
>g1.close()
>f1.close()
>
>Note that we could possibly make these the only API that a codec needs
>to provide; the string object <--> unicode object conversions can be
>done using this and the cStringIO module.  (On the other hand it seems
>a common case that would be quite useful.)

- Andy



From gstein at lyra.org  Wed Nov 17 03:03:21 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 18:03:21 -0800 (PST)
Subject: [Python-Dev] shared data
In-Reply-To: <1269351119-9152905@hypernet.com>
Message-ID: 

On Tue, 16 Nov 1999, Gordon McMillan wrote:
> Barry A. Warsaw writes:
> > One approach might be to support loading modules out of jar files
> > (or whatever) using Greg imputils.  We could put the bootstrap
> > .pyc files in this jar and teach Python to import from it first. 
> > Python installations could even craft their own modules.jar file
> > to include whatever modules they are willing to "hard code". 
> > This, with -S might make Python start up much faster, at the
> > small cost of some flexibility (which could be regained with a
> > c.l. switch or other mechanism to bypass modules.jar).
> 
> Couple hundred Windows users have been doing this for 
> months (http://starship.python.net/crew/gmcm/install.html). 
> The .pyz files are cross-platform, although the "embedding" 
> app would have to be redone for *nix, (and all the embedding 
> really does is keep Python from hunting all over your disk). 
> Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a 
> diskette with a little room left over.

I've got a patch from Jim Ahlstrom to provide a "standardized" library
file. I've got to review and fold that thing in (I'll post here when that
is done).

As Gordon states: yes, the startup time is considerably improved.

The DBM approach is interesting. That could definitely be used thru an
imputils Importer; it would be quite interesting to try that out.
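
(For concreteness, the cache itself could be as simple as the sketch
below -- module names mapped to file locations in a dbm file.  The file
name and function names are made up, and this says nothing about how an
importer would actually hook into it:)

try:
    import anydbm as dbm        # older Pythons
except ImportError:
    import dbm                  # newer Pythons
import os, sys

CACHE_FILE = "module-cache"     # made-up name

def build_cache(directories=None):
    # record where every module lives; walk the path in reverse so
    # that entries near the front of sys.path overwrite later ones
    dirs = list(directories or sys.path)
    dirs.reverse()
    db = dbm.open(CACHE_FILE, "n")
    for directory in dirs:
        if not os.path.isdir(directory):
            continue
        for name in os.listdir(directory):
            if name.endswith(".py") or name.endswith(".pyc"):
                db[name.split(".")[0]] = os.path.join(directory, name)
    db.close()

def find_module_file(modname):
    # one dbm lookup instead of a stat() per sys.path entry
    db = dbm.open(CACHE_FILE, "r")
    try:
        return db[modname]
    finally:
        db.close()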

(Note that the library-style approach would make updates even harder to deal
with, relative to what Sjoerd saw with the DBM approach; I would guess 
that the "right" approach is to rebuild the library from scratch and
atomically replace the thing (but that would bust people with open
references...))

Certainly something to look at.

Cheers,
-g

p.s. I also want to try mmap'ing a library and creating code objects that
use PyBufferObjects (rather than PyStringObjects) that refer to portions
of the mmap. Presuming the mmap is shared, there "should" be a large
reduction in heap usage. Question is that I don't know the proportion of
code bytes to other heap usage caused by loading a .pyc.

p.p.s. I also want to try the buffer approach for frozen code.

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Wed Nov 17 03:29:42 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 18:29:42 -0800 (PST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <14385.40842.709711.12141@weyr.cnri.reston.va.us>
Message-ID: 

On Tue, 16 Nov 1999, Fred L. Drake, Jr. wrote:
> M.-A. Lemburg writes:
>  > Wouldn't it be possible to have the read/write methods set up
>  > the state when called for the first time ?
> 
>   That slows things down; the constructor should handle initialization.
> Perhaps what gets registered should be:  encoding function, decoding
> function, stream encoder factory (can be a class), stream decoder
> factory (again, can be a class).  These can be encapsulated either
> before or after hitting the registry, and can be None.  The registry

I'm with Fred here; he beat me to the punch (and his email is better than 
what I'd write anyhow :-).

I'd like to see the API be *functions* rather than a particular class
specification. If the spec is going to say "do not alter/store state",
then a function makes much more sense than a method on an object.

Of course, bound method objects could be registered. This might occur if
you have a general JIS encode/decoder but need to instantiate it a little
differently for each JIS variant.
(Andy also mentioned something about "options" in JIS encoding/decoding)

> can provide default implementations from what is provided (stream
> handlers from the functions, or functions from the stream handlers) as 
> required.

Excellent idea...

"I'll provide the encode/decode functions, but I don't have a spiffy
algorithm for streaming -- please provide a stream wrapper for my
functions."

>   Ideally, I should be able to write a module with four well-known
> entry points and then provide the module object itself as the
> registration entry.  Or I could construct a new object that has the
> right interface and register that if it made more sense for the
> encoding.

Mark's idea about throwing these things into a package for on-demand
registrations is much better than a "register-beforehand" model. When the
module is loaded from the package, it calls a registration function to
insert its 4-tuple of registration data.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Wed Nov 17 03:40:07 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 18:40:07 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs... 
In-Reply-To: <00f701bf307d$20f0cb00$0501a8c0@bobcat>
Message-ID: 

On Wed, 17 Nov 1999, Mark Hammond wrote:
>...
> Would it be possible to define a "standard package" called
> "encodings", and when we need an encoding, we simply attempt to load a
> module from that package?  The key benefits I see are:
>...
> Is this worthy of consideration?

Absolutely!

You will need to provide a way for a module (in the "codec" package) to
state *beforehand* that it should be loaded for the X, Y, and Z encodings.
This might be in terms of little "info" files that get dropped into the
package. The __init__.py module scans the directory for the info files and
loads them to build an encoding => module-name mapping.
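
A sketch of what that scan might look like in the package's __init__.py
(the ".info" file format and the function names are invented for
illustration only):

# encodings/__init__.py -- sketch
# each codec module drops a small "*.info" file into the package
# directory; every non-blank line names one encoding handled by the
# module of the same base name.

import os

_encoding_to_module = {}

def _scan():
    here = os.path.dirname(__file__)
    for name in os.listdir(here):
        if name.endswith(".info"):
            modname = name[:-5]
            f = open(os.path.join(here, name))
            for line in f.readlines():
                encoding = line.strip().lower()
                if encoding:
                    _encoding_to_module[encoding] = modname
            f.close()

def lookup(encoding):
    # import the responsible module on demand; it is expected to
    # register its codecs as a side effect of being imported
    if not _encoding_to_module:
        _scan()
    modname = _encoding_to_module[encoding.lower()]
    return __import__("encodings." + modname, {}, {}, [modname])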

The alternative would be to have stub modules like:

iso-8859-1.py:

import unicodec

def encode_1(...)
  ...
def encode_2(...)
  ...
...

unicodec.register('iso-8859-1', encode_1, decode_1)
unicodec.register('iso-8859-2', encode_2, decode_2)
...


iso-8859-2.py:
import iso-8859-1


I believe that encoding names are legitimate file names, but they aren't
necessarily Python identifiers. That kind of bungs up "import
codec.iso-8859-1". The codec package would need to programmatically import
the modules. Clients should not be directly importing the modules, so I
don't see a difficulty here.
[ if we do decide to allow clients access to the modules, then maybe they
  have to arrive through a "helper" module that has a nice name, or the
  codec package provides a "module = codec.load('iso-8859-1')" idiom. ]

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From mhammond at skippinet.com.au  Wed Nov 17 03:57:48 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 17 Nov 1999 13:57:48 +1100
Subject: [Python-Dev] Some thoughts on the codecs... 
In-Reply-To: 
Message-ID: <010501bf30a7$88c00320$0501a8c0@bobcat>

> You will need to provide a way for a module (in the "codec"
> package) to
> state *beforehand* that it should be loaded for the X, Y, and
...

> The alternative would be to have stub modules like:

Actually, I was thinking even more radically - drop the codec registry
altogether, and use modules with "well-known" names (a slight
precedent, but Python isn't averse to well-known names in general)

eg:
iso-8859-1.py:

import unicodec
def encode(...):
  ...
def decode(...):
  ...

iso-8859-2.py:
from iso-8859-1 import *

The codec registry then is trivial, and effectively does not exist
(cant get much more trivial than something that doesnt exist :-):

def getencoder(encoding):
  mod = __import__( "encodings." + encoding )
  return getattr(mod, "encode")


> I believe that encoding names are legitimate file names, but
> they aren't
> necessarily Python identifiers. That kind of bungs up "import
> codec.iso-8859-1".

Agreed - clients should never need to import them, and codecs that
wish to import other codecs could use "__import__"

Of course, I am not averse to the idea of a registry as well and
having the modules manually register themselves - but it doesnt seem
to buy much, and the logic for getting a codec becomes more complex -
ie, it needs to determine the module to import, then look in the
registry - if it needs to determine the module anyway, why not just
get it from the module and be done with it?

Mark.




From andy at robanal.demon.co.uk  Wed Nov 17 01:18:22 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Wed, 17 Nov 1999 00:18:22 GMT
Subject: [Python-Dev] Some thoughts on the codecs... 
In-Reply-To: <00f701bf307d$20f0cb00$0501a8c0@bobcat>
References: <00f701bf307d$20f0cb00$0501a8c0@bobcat>
Message-ID: <3837f379.5166829@post.demon.co.uk>

On Wed, 17 Nov 1999 08:54:15 +1100, you wrote:

>This is leading me to conclude that our "codec registry" should be the
>file system, and Python modules.
>
>Would it be possible to define a "standard package" called
>"encodings", and when we need an encoding, we simply attempt to load a
>module from that package?  The key benefits I see are:
[snip]
>Is this worthy of consideration?

Exactly what I am aiming for.  The real icing on the cake would be a
small state machine or some helper functions in C which made it
possible to write fast codecs in pure Python, but that can come a bit
later when we have examples up and running.   

- Andy





From andy at robanal.demon.co.uk  Wed Nov 17 01:08:01 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Wed, 17 Nov 1999 00:08:01 GMT
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <000601bf2ff7$4d8a4c80$042d153f@tim>
References: <000601bf2ff7$4d8a4c80$042d153f@tim>
Message-ID: <3834f142.4599884@post.demon.co.uk>

On Tue, 16 Nov 1999 00:56:18 -0500, you wrote:

>[Andy Robinson]
>> ...
>> I presume no one is actually advocating dropping
>> ordinary Python strings, or the ability to do
>>    rawdata = open('myfile.txt', 'rb').read()
>> without any transformations?
>
>If anyone has advocated either, they've successfully hidden it from me.
>Anyone?

Well, I hear statements looking forward to when all string-handling is
done in Unicode internally.  This scares the hell out of me - it is
what VB does and that bit us badly on simple stream operations.  For
encoding work, you will always need raw strings, and often need
Unicode ones.

- Andy



From tim_one at email.msn.com  Wed Nov 17 08:33:06 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 02:33:06 -0500
Subject: [Python-Dev] Unicode proposal: %-formatting ?
In-Reply-To: <383134AA.4B49D178@lemburg.com>
Message-ID: <000001bf30cd$fd6be9c0$a42d153f@tim>

[MAL]
> ...
> This means a new PyUnicode_Format() implementation mapping
> Unicode format objects to Unicode objects.

It's a bitch, isn't it <0.5 wink>?  I hope they're paying you a lot for
this!

> ... hmm, there is a problem there: how should the PyUnicode_Format()
> API deal with '%s' when it sees a Unicode object as argument ?

Anything other than taking the Unicode characters as-is would be
incomprehensible.  I mean, it's a Unicode format string sucking up Unicode
strings -- what else could possibly make *sense*?

> E.g. what would you get in these cases:
>
> u = u"%s %s" % (u"abc", "abc")

That u"abc" gets substituted as-is seems screamingly necessary to me.

I'm more baffled about what "abc" should do.  I didn't understand the t#/s#
etc arguments, and how those do or don't relate to what str() does.  On the
face of it, the idea that a gazillion and one distinct encodings all get
lumped into "a string object" without remembering their nature makes about
as much sense as if Python were to treat all instances of all user-defined
classes as being of a single InstanceType type  -- except in the
latter case you at least get a __class__ attribute to find your way home
again.

As an ignorant user, I would hope that

    u"%s" % string

had enough sense to know what string's encoding is all on its own, and
promote it correctly to Unicode by magic.

> Perhaps we need a new marker for "insert Unicode object here".

%s means string, and at this level a Unicode object *is* "a string".  If
this isn't obvious, it's likely because we're too clever about what
non-Unicode string objects do in this context.





From captainrobbo at yahoo.com  Wed Nov 17 08:53:53 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 16 Nov 1999 23:53:53 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs... 
Message-ID: <19991117075353.16046.rocketmail@web606.mail.yahoo.com>

--- Mark Hammond  wrote:
> Actually, I was thinking even more radically - drop
> the codec registry
> all together, and use modules with "well-known"
> names  (a slight
> precedent, but Python isnt adverse to well-known
> names in general)
> 
> eg:
> iso-8859-1.py:
> 
> import unicodec
> def encode(...):
>   ...
> def decode(...):
>   ...
> 
> iso-8859-2.py:
> from iso-8859-1 import *
> 
This is the simplest if each codec really is likely to
be implemented in a separate module.  But just look at
the data!  All the iso-8859 encodings need identical
functionality, and just have a different mapping table
with 256 elements.  It would be trivial to implement
these in one module.  And the wide variety of Japanese
encodings (mostly corporate or historical variants of
the same character set) are again best treated from
one code base with a bunch of mapping tables and
routines to generate the variants - basically one can
store the deltas.
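
(To make that concrete, a table-driven codec for the 8-bit family could
be built roughly like this sketch; the table below is just the latin-1
identity mapping, while real tables would come from the published
mapping files:)

def make_codec(table):
    # table: a 256-entry list mapping byte values to code points
    # (None for unmapped positions)
    reverse = {}
    for byte, codepoint in enumerate(table):
        if codepoint is not None:
            reverse[codepoint] = byte

    def decode(data):
        # bytes -> text, one table lookup per byte
        return "".join([chr(table[b]) for b in data])

    def encode(text):
        # text -> bytes, via the reversed table
        return bytes([reverse[ord(ch)] for ch in text])

    return encode, decode

# latin-1 is the identity mapping; each variant only needs its own table
latin1_table = list(range(256))
encode_latin1, decode_latin1 = make_codec(latin1_table)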

So the choice is between possibly having a lot of
almost-dummy modules, or having Python modules which
generate and register a logical family of encodings.  

I may have some time next week and will try to code up
a few so we can pound on something.

- Andy



=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From captainrobbo at yahoo.com  Wed Nov 17 08:58:23 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 16 Nov 1999 23:58:23 -0800 (PST)
Subject: [Python-Dev] Unicode proposal: %-formatting ?
Message-ID: <19991117075823.6498.rocketmail@web602.mail.yahoo.com>


--- Tim Peters  wrote:
> I'm more baffled about what "abc" should do.  I
> didn't understand the t#/s#
> etc arguments, and how those do or don't relate to
> what str() does.  On the
> face of it, the idea that a gazillion and one
> distinct encodings all get
> lumped into "a string object" without remembering
> their nature makes about
> as much sense as if Python were to treat all
> instances of all user-defined
> classes as being of a single InstanceType type
>  -- except in the
> latter case you at least get a __class__ attribute
> to find your way home
> again.

Well said.  When the core stuff is done, I'm going to
implement a set of "TypedString" helper routines which
will remember what they are encoded in and won't let
you abuse them by concatenating or otherwise mixing
different encodings.  If you are consciously working
with multi-encoding data, this higher level of
abstraction is really useful.  But I reckon that can
be done in pure Python (just overload '%', '+' etc.
with some encoding checks).
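
A bare-bones sketch of the idea (pure Python, names invented, and only
'+' shown):

class TypedString:
    # a string that remembers its encoding and refuses to be mixed
    # with data in a different encoding
    def __init__(self, data, encoding):
        self.data = data
        self.encoding = encoding
    def __add__(self, other):
        if not isinstance(other, TypedString):
            raise TypeError("can only add another TypedString")
        if other.encoding != self.encoding:
            raise ValueError("cannot mix %r and %r data"
                             % (self.encoding, other.encoding))
        return TypedString(self.data + other.data, self.encoding)
    def __repr__(self):
        return "TypedString(%r, %r)" % (self.data, self.encoding)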

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From mal at lemburg.com  Wed Nov 17 11:03:59 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Nov 1999 11:03:59 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000201bf30d3$cb2cb240$a42d153f@tim>
Message-ID: <38327D8F.7A5352E6@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > ...demo script...
> 
> It looks like
> 
>     r'\\u0000'
> 
> will get translated into a 2-character Unicode string.

Right...

> That's probably not
> good, if for no other reason than that Java would not do this (it would
> create the obvious 7-character Unicode string), and having something that
> looks like a Java escape that doesn't *work* like the Java escape will be
> confusing as heck for JPython users.  Keeping track of even-vs-odd number of
> backslashes can't be done with a regexp search, but is easy if the code is
> simple :
> ...Tim's version of the demo...

Guido and I have decided to turn \uXXXX into a standard
escape sequence with no further magic applied. \uXXXX will
only be expanded in u"" strings.

Here's the new scheme:

With the 'unicode-escape' encoding being defined as:

- all non-escape characters represent themselves as a Unicode ordinal
  (e.g. 'a' -> U+0061).

- all existing defined Python escape sequences are interpreted as
  Unicode ordinals; note that \xXXXX can represent all Unicode
  ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

- a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
  error to have fewer than 4 digits after \u.

Examples:

u'abc'          -> U+0061 U+0062 U+0063
u'\u1234'       -> U+1234
u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000A

Now how should we define ur"abc\u1234\n"  ... ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    44 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From tim_one at email.msn.com  Wed Nov 17 10:31:27 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 04:31:27 -0500
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: <000801bf30de$85bea500$a42d153f@tim>

[Guido]
> ...
> I'm hoping for several kind of responses to this email:
> ...
> - requests for checkin privileges, preferably with a specific issue
> or area of expertise for which the requestor will take responsibility.

I'm specifically requesting not to have checkin privileges.  So there.

I see two problems:

1. When patches go thru you, you at least eyeball them.  This catches bugs
and design errors early.

2. For a multi-platform app, few people have adequate resources for testing;
e.g., I can test under an obsolete version of Win95, and NT if I have to,
but that's it.  You may not actually do better testing than that, but having
patches go thru you allows me the comfort of believing you do .





From mal at lemburg.com  Wed Nov 17 11:11:05 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Nov 1999 11:11:05 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <00f701bf307d$20f0cb00$0501a8c0@bobcat>
Message-ID: <38327F39.AA381647@lemburg.com>

Mark Hammond wrote:
> 
> This is leading me to conclude that our "codec registry" should be the
> file system, and Python modules.
> 
> Would it be possible to define a "standard package" called
> "encodings", and when we need an encoding, we simply attempt to load a
> module from that package?  The key benefits I see are:
> 
> * No need to load modules simply to register a codec (which would make
> the number of open calls even higher, and the startup time even
> slower.)  This makes it truly demand-loading of the codecs, rather
> than explicit load-and-register.
> 
> * Making language specific distributions becomes simple - simply
> select a different set of modules from the "encodings" directory.  The
> Python source distribution has them all, but (say) the Windows binary
> installer selects only a few.  The Japanese binary installer for
> Windows installs a few more.
> 
> * Installing new codecs becomes trivial - no need to hack site.py
> etc - simply copy the new "codec module" to the encodings directory
> and you are done.
> 
> * No serious problem for GMcM's installer nor for freeze
> 
> We would probably need to assume that certain codecs exist for _all_
> platforms and languages - but this is no different to assuming that
> "exceptions.py" also exists for all platforms.
> 
> Is this worthy of consideration?

Why not... using the new registry scheme I proposed in the
thread "Codecs and StreamCodecs" you could implement this
via factory_functions and lazy imports (with the encoding
name folded to make up a proper Python identifier, e.g.
hyphens get converted to '' and spaces to '_').

I'd suggest grouping encodings:

[encodings]
	[iso]
		[iso88591]
		[iso88592]
	[jis]
		...
	[cyrillic]
		...
	[misc]

The unicodec registry could then query encodings.get(encoding,action)
and the package would take care of the rest.

Note that the "walk-me-up-scotty" import patch would probably
be nice in this situation too, e.g. to reach the modules in
[misc] or in higher levels such as the ones in [iso] from
[iso88591].

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    44 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Wed Nov 17 10:29:34 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Nov 1999 10:29:34 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>  <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com>
Message-ID: <3832757E.B9503606@lemburg.com>

Fredrik Lundh wrote:
> 
> --------------------------------------------------------------------
> A PIL-like Unicode Codec Proposal
> --------------------------------------------------------------------
> 
> In the PIL model, the codecs are called with a piece of data, and
> returns the result to the caller.  The codecs maintain internal state
> when needed.
> 
> class decoder:
> 
>     def decode(self, s, offset=0):
>         # decode as much data as we possibly can from the
>         # given string.  if there's not enough data in the
>         # input string to form a full character, return
>         # what we've got this far (this might be an empty
>         # string).
> 
>     def flush(self):
>         # flush the decoding buffers.  this should usually
>         # return None, unless knowing that the input
>         # stream has ended means that the state can be
>         # interpreted in a meaningful way.  however, if the
>         # state indicates that the last character was not
>         # finished, this method should raise a UnicodeError
>         # exception.

Could you explain for reason for having a .flush() method
and what it should return.

Note that the .decode method is not so much different
from my Codec.decode method except that it uses a single
offset where my version uses a slice (the offset is probably
the better variant, because it avoids data truncation).
 
> class encoder:
> 
>     def encode(self, u, offset=0, buffersize=0):
>         # encode data from the given offset in the input
>         # unicode string into a buffer of the given size
>         # (or slightly larger, if required to proceed).
>         # if the buffer size is 0, the encoder is free
>         # to pick a suitable size itself (if at all
>         # possible, it should make it large enough to
>         # encode the entire input string).  returns a
>         # 2-tuple containing the encoded data, and the
>         # number of characters consumed by this call.

Dito.
 
>     def flush(self):
>         # flush the encoding buffers.  returns an ordinary
>         # string (which may be empty), or None.
> 
> Note that a codec instance can be used for a single string; the codec
> registry should hold codec factories, not codec instances.  In
> addition, you may use a single type or class to implement both
> interfaces at once.

Perhaps I'm missing something, but how would you define
stream codecs using this interface ? 

> Implementing stream codecs is left as an exercise (see the zlib
> material in the eff-bot guide for a decoder example).

...?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    44 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed Nov 17 10:55:05 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Nov 1999 10:55:05 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>
		<199911161620.LAA02643@eric.cnri.reston.va.us>
		<38319A2A.4385D2E7@lemburg.com> <14385.40842.709711.12141@weyr.cnri.reston.va.us>
Message-ID: <38327B79.2415786B@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > Wouldn't it be possible to have the read/write methods set up
>  > the state when called for the first time ?
> 
>   That slows things down; the constructor should handle initialization.
> Perhaps what gets registered should be:  encoding function, decoding
> function, stream encoder factory (can be a class), stream decoder
> factory (again, can be a class).

Guido proposed the factory approach too, though not separated
into these 4 APIs (note that your proposal looks very much like
what I had in the early version of my proposal).

Anyway, I think that factory functions are the way to go,
because they offer more flexibility w/r to reusing already
instantiated codecs, importing modules on-the-fly as was
suggested in another thread (thereby making codec module
import lazy) or mapping encoder and decoder requests all
to one class.

So here's a new registry approach:

unicodec.register(encoding,factory_function,action)

with 
	encoding - name of the supported encoding, e.g. Shift_JIS
	factory_function - a function that returns an object
                   or function ready to be used for action
	action - a string stating the supported action:
			'encode'
			'decode'
			'stream write'
			'stream read'

The factory_function API depends on the implementation of
the codec. The returned object's interface on the value of action:

Codecs:
-------

obj = factory_function_for_<encoding>(errors='strict')

'encode': obj(u,slice=None) -> Python string
'decode': obj(s,offset=0,chunksize=0) -> (Unicode object, bytes consumed)

factory_functions are free to return simple function objects
for stateless encodings.

StreamCodecs:
-------------

obj = factory_function_for_<encoding>(stream,errors='strict')

obj should provide access to all methods defined for the stream
object, overriding these:

'stream write': obj.write(u,slice=None) -> bytes written to stream
		obj.flush() -> ???
'stream read':  obj.read(chunksize=0) -> (Unicode object, bytes read)
		obj.flush() -> ???

errors is defined like in my Codec spec. The codecs are
expected to use this argument to handle error conditions.

I'm not sure what Fredrik intended with the .flush() methods,
so the definition is still open. I would expect it to do some
finalization of state.

Perhaps we need another set of actions for the .feed()/.close()
approach...

As in earlier version of the proposal:
The registry should provide default implementations for
missing action factory_functions using the other registered
functions, e.g. 'stream write' can be emulated using
'encode' and 'stream read' using 'decode'. The same probably
holds for the feed approach.
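
For concreteness, the registry half of this could be as small as the
following sketch; the names follow the proposal above, and the emulation
of a missing action is only hinted at:

# one table keyed by (encoding, action), holding factory functions
_registry = {}

def register(encoding, factory_function, action):
    _registry[(encoding.lower(), action)] = factory_function

def lookup(encoding, action):
    try:
        return _registry[(encoding.lower(), action)]
    except KeyError:
        raise LookupError("no %s factory for encoding %r" % (action, encoding))

def emulated_stream_writer(encoding):
    # build a 'stream write' object out of a registered 'encode' factory
    encode = lookup(encoding, 'encode')(errors='strict')
    class Writer:
        def __init__(self, stream):
            self.stream = stream
        def write(self, u):
            self.stream.write(encode(u))
    return Writer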

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    44 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From tim_one at email.msn.com  Wed Nov 17 09:14:38 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 03:14:38 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <3831350B.8F69CB6D@lemburg.com>
Message-ID: <000201bf30d3$cb2cb240$a42d153f@tim>

[MAL]
> ...
> Here is a sample implementation of what I had in mind:
>
> """ Demo for 'unicode-escape' encoding.
> """
> import struct,string,re
>
> pack_format = '>H'
>
> def convert_string(s):
>
>     l = map(None,s)
>     for i in range(len(l)):
> 	l[i] = struct.pack(pack_format,ord(l[i]))
>     return l
>
> u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})')
>
> def unicode_unescape(s):
>
>     l = []
>     start = 0
>     while start < len(s):
> 	m = u_escape.search(s,start)
> 	if not m:
> 	    l[len(l):] = convert_string(s[start:])
> 	    break
> 	m_start,m_end = m.span()
> 	if m_start > start:
> 	    l[len(l):] = convert_string(s[start:m_start])
> 	hexcode = m.group(1)
> 	#print hexcode,start,m_start
> 	if len(hexcode) != 4:
> 	    raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode
> 	ordinal = string.atoi(hexcode,16)
> 	l.append(struct.pack(pack_format,ordinal))
> 	start = m_end
>     #print l
>     return string.join(l,'')
>
> def hexstr(s,sep=''):
>
>     return string.join(map(lambda x,hex=hex,ord=ord: '%02x' %
> ord(x),s),sep)

It looks like

    r'\\u0000'

will get translated into a 2-character Unicode string.  That's probably not
good, if for no other reason than that Java would not do this (it would
create the obvious 7-character Unicode string), and having something that
looks like a Java escape that doesn't *work* like the Java escape will be
confusing as heck for JPython users.  Keeping track of even-vs-odd number of
backslashes can't be done with a regexp search, but is easy if the code is
simple :

def unicode_unescape(s):
    from string import atoi
    import array
    i, n = 0, len(s)
    result = array.array('H') # unsigned short, native order
    while i < n:
        ch = s[i]
        i = i+1
        if ch != "\\":
            result.append(ord(ch))
            continue
        if i == n:
            raise ValueError("string ends with lone backslash")
        ch = s[i]
        i = i+1
        if ch != "u":
            result.append(ord("\\"))
            result.append(ord(ch))
            continue
        hexchars = s[i:i+4]
        if len(hexchars) != 4:
            raise ValueError("\\u escape at end not followed by "
                             "at least 4 characters")
        i = i+4
        for ch in hexchars:
            if ch not in "01234567890abcdefABCDEF":
                raise ValueError("\\u" + hexchars + " contains "
                                 "non-hex characters")
        result.append(atoi(hexchars, 16))

    # print result
    return result.tostring()





From tim_one at email.msn.com  Wed Nov 17 09:47:48 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 03:47:48 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: <383156DF.2209053F@lemburg.com>
Message-ID: <000401bf30d8$6cf30bc0$a42d153f@tim>

[MAL]
> FYI, the next version of the proposal ...
> File objects opened in text mode will use "t#" and binary ones use "s#".

Am I the only one who sees magical distinctions between text and binary mode
as a Really Bad Idea?  I wouldn't have guessed the Unix natives here would
quietly acquiesce to importing a bit of Windows madness .





From tim_one at email.msn.com  Wed Nov 17 09:47:46 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 03:47:46 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <383140F3.EDDB307A@lemburg.com>
Message-ID: <000301bf30d8$6bbd4ae0$a42d153f@tim>

[Jack Jansen]
> I would suggest adding the Dos, Windows and Macintosh standard
> 8-bit charsets (their equivalents of latin-1) too, as documents
> in these encoding are pretty ubiquitous. But maybe these should
> only be added on the respective platforms.

[MAL]
> Good idea. What code pages would that be ?

I'm not clear on what's being suggested; e.g., Windows supports *many*
different "code pages".  CP 1252 is default in the U.S., and is an extension
of Latin-1.  See e.g.

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

which appears to be up-to-date (has 0x80 as the euro symbol, Unicode
U+20AC -- although whether your version of U.S. Windows actually has this
depends on whether you installed the service pack that added it!).

See

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP850.TXT

for the closest DOS got.





From tim_one at email.msn.com  Wed Nov 17 10:05:21 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 04:05:21 -0500
Subject: Weak refs (was [Python-Dev] just say no...)
In-Reply-To: <14385.33486.855802.187739@weyr.cnri.reston.va.us>
Message-ID: <000601bf30da$e069d820$a42d153f@tim>

[Fred L. Drake, Jr., pines for some flavor of weak refs; MAL reminds us
 of his work; & back to Fred]

>   Yes, but still not in the core.  So we have two general examples
> (vrefs and mxProxy) and there's WeakDict (or something like that).  I
> think there really needs to be a core facility for this.

This kind of thing certainly belongs in the core (for efficiency and smooth
integration) -- if it belongs in the language at all.  This was discussed at
length here some months ago; that's what prompted MAL to "do something"
about it.  Guido hasn't shown visible interest, and nobody has been willing
to fight him to the death over it.  So it languishes.  Buy him lunch
tomorrow and get him excited .





From tim_one at email.msn.com  Wed Nov 17 10:10:24 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 04:10:24 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <1269351119-9152905@hypernet.com>
Message-ID: <000701bf30db$94d4ac40$a42d153f@tim>

[Gordon McMillan]
> ...
> Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a
> diskette with a little room left over.

That's truly remarkable (he says while waiting for the Inbox Repair Tool to
finish repairing his 50Mb Outlook mail file ...)!

> but-since-its-WIndows-it-must-be-tainted-ly y'rs

Indeed -- if it runs on Windows, it's a worthless piece o' crap .





From fredrik at pythonware.com  Wed Nov 17 12:00:10 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 17 Nov 1999 12:00:10 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>  <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com>
Message-ID: <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com>

M.-A. Lemburg  wrote:
> >     def flush(self):
> >         # flush the decoding buffers.  this should usually
> >         # return None, unless knowing that the input
> >         # stream has ended means that the state can be
> >         # interpreted in a meaningful way.  however, if the
> >         # state indicates that the last character was not
> >         # finished, this method should raise a UnicodeError
> >         # exception.
>
> Could you explain for reason for having a .flush() method
> and what it should return.

in most cases, it should either return None, or
raise a UnicodeError exception:

    >>> u = unicode("? i ?a ? e ?", "iso-latin-1")
    >>> # yes, that's a valid Swedish sentence ;-)
    >>> s = u.encode("utf-8")
    >>> d = decoder("utf-8")
    >>> d.decode(s[:-1])
    "? i ?a ? e "
    >>> d.flush()
    UnicodeError: last character not complete

on the other hand, there are situations where it
might actually return a string.  consider a "HTML
entity decoder" which uses the following pattern
to match a character entity: "&\w+;?" (note that
the trailing semicolon is optional).

    >>> u = unicode("? i ?a ? e ?", "iso-latin-1")
    >>> s = u.encode("html-entities")
    >>> d = decoder("html-entities")
    >>> d.decode(s[:-1])
    "? i ?a ? e "
    >>> d.flush()
    "?"

> Perhaps I'm missing something, but how would you define
> stream codecs using this interface ?

input: read chunks of data, decode, and
keep extra data in a local buffer.

output: encode data into suitable chunks,
and write to the output stream (that's why
there's a buffersize argument to encode --
if someone writes a 10mb unicode string to
an encoded stream, python shouldn't allocate
an extra 10-30 megabytes just to be able to
encode the darn thing...)
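
as a sketch, the writing side could look roughly like this
(the encoder interface is the one from the proposal; the default
buffer size is picked out of thin air):

class StreamWriter:
    # wrap an encoder (encode()/flush()) around a file-like object,
    # encoding in bounded chunks so a huge unicode string never has
    # to be encoded in one go
    def __init__(self, stream, encoder, buffersize=16384):
        self.stream = stream
        self.encoder = encoder
        self.buffersize = buffersize
    def write(self, u):
        offset = 0
        while offset < len(u):
            data, consumed = self.encoder.encode(u, offset, self.buffersize)
            self.stream.write(data)
            offset = offset + consumed
    def close(self):
        tail = self.encoder.flush()
        if tail:
            self.stream.write(tail)
        self.stream.close()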

> > Implementing stream codecs is left as an exercise (see the zlib
> > material in the eff-bot guide for a decoder example).

everybody should have a copy of the eff-bot guide ;-)

(but alright, I plan to post a complete utf-8 implementation
in a not too distant future).






From gstein at lyra.org  Wed Nov 17 11:57:36 1999
From: gstein at lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 02:57:36 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <38327F39.AA381647@lemburg.com>
Message-ID: 

On Wed, 17 Nov 1999, M.-A. Lemburg wrote:
>...
> I'd suggest grouping encodings:
> 
> [encodings]
> 	[iso}
> 		[iso88591]
> 		[iso88592]
> 	[jis]
> 		...
> 	[cyrillic]
> 		...
> 	[misc]

WHY?!?!

This is taking a simple solution and making it complicated. I see no
benefit to creating yet-another-level-of-hierarchy. Why should they be
grouped?

Leave the modules just under "encodings" and be done with it.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Wed Nov 17 12:14:01 1999
From: gstein at lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 03:14:01 -0800 (PST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <38327B79.2415786B@lemburg.com>
Message-ID: 

On Wed, 17 Nov 1999, M.-A. Lemburg wrote:
>...
> Anyway, I think that factory functions are the way to go,
> because they offer more flexibility w/r to reusing already
> instantiated codecs, importing modules on-the-fly as was
> suggested in another thread (thereby making codec module
> import lazy) or mapping encoder and decoder requests all
> to one class.

Why a factory? I've got a simple encode() function. I don't need a
factory. "flexibility" at the cost of complexity (IMO).

> So here's a new registry approach:
> 
> unicodec.register(encoding,factory_function,action)
> 
> with 
> 	encoding - name of the supported encoding, e.g. Shift_JIS
> 	factory_function - a function that returns an object
>                    or function ready to be used for action
> 	action - a string stating the supported action:
> 			'encode'
> 			'decode'
> 			'stream write'
> 			'stream read'

This action thing is subject to error. *if* you're wanting to go this
route, then have:

unicodec.register_encode(...)
unicodec.register_decode(...)
unicodec.register_stream_write(...)
unicodec.register_stream_read(...)

They are equivalent. Guido has also told me in the past that he dislikes
parameters that alter semantics -- preferring different functions instead.
(this is why there are a good number of PyBufferObject interfaces; I had
fewer to start with)

This suggested approach is also quite a bit more wordy/annoying than
Fred's alternative:

unicodec.register('iso-8859-1', encoder, decoder, None, None)

And don't say "future compatibility allows us to add new actions." Well,
those same future changes can add new registration functions or additional
parameters to the single register() function.

Not that I'm advocating it, but register() could also take a single
parameter: if a class, then instantiate it and call methods for each
action; if an instance, then just call methods for each action.

[ and the third/original variety: a function object as the first param is
  the actual hook, and params 2 thru 4 (each are optional, or just the
  stream funcs?) are the other hook functions ]

> The factory_function API depends on the implementation of
> the codec. The returned object's interface on the value of action:
> 
> Codecs:
> -------
> 
> obj = factory_function_for_<encoding>(errors='strict')

Where does this "errors" value come from? How does a user alter that
value? Without an ability to change this, I see no reason for a factory.
[ and no: don't tell me it is a thread-state value :-) ]

On the other hand: presuming the "errors" thing is valid, *then* I see a
need for a factory.

Truly... I dislike factories. IMO, they just add code/complexity in many
cases where the functionality isn't needed. But that's just me :-)

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From captainrobbo at yahoo.com  Wed Nov 17 12:17:00 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 17 Nov 1999 03:17:00 -0800 (PST)
Subject: [Python-Dev] Rosette i18n API
Message-ID: <19991117111700.8831.rocketmail@web603.mail.yahoo.com>

There is a very capable C++ library at

http://rosette.basistech.com/

It is well worth looking at the things this API
actually lets you do for ideas on patterns.

- Andy

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.





From gstein at lyra.org  Wed Nov 17 12:21:18 1999
From: gstein at lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 03:21:18 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <000401bf30d8$6cf30bc0$a42d153f@tim>
Message-ID: 

On Wed, 17 Nov 1999, Tim Peters wrote:
> [MAL]
> > FYI, the next version of the proposal ...
> > File objects opened in text mode will use "t#" and binary ones use "s#".
> 
> Am I the only one who sees magical distinctions between text and binary mode
> as a Really Bad Idea?  I wouldn't have guessed the Unix natives here would
> quietly acquiesce to importing a bit of Windows madness .

It's a seductive idea... yes, it feels wrong, but then... it seems kind of
right, too...

:-)

Yes. It is a mode. Is it bad? Not sure. You've already told the system
that you want to treat the file differently. Much like you're treating it
differently when you specify 'r' vs. 'w'.

The real annoying thing would be to assume that opening a file as 'r'
means that I *meant* text mode and to start using "t#". In actuality, I
typically open files that way since I do most of my coding on Linux. If
I now have to pay attention to things and open it as 'rb', then I'll be
pissed.

And the change in behavior and bugs that interpreting 'r' as text would
introduce? Ack!

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From fredrik at pythonware.com  Wed Nov 17 12:36:32 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 17 Nov 1999 12:36:32 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: 
Message-ID: <001b01bf30ef$ffb08a20$f29b12c2@secret.pythonware.com>

Greg Stein  wrote:
> Why a factory? I've got a simple encode() function. I don't need a
> factory. "flexibility" at the cost of complexity (IMO).

so where do you put the state?

how do you reset the state between
strings?

how do you handle incremental
decoding/encoding?

etc.

(I suggest taking another look at PIL's codec
design.  it solves all these problems with a
minimum of code, and it works -- people
have been hammering on PIL for years...)
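
to make the point concrete, here is a rough sketch of
the kind of stateful object I mean -- my own illustration,
not PIL's actual codec API:

    class IncrementalUTF8Decoder:
        # state lives on the instance; feed() handles partial input,
        # flush() and reset() deal with stream boundaries
        def __init__(self):
            self.buffer = b""                   # undecoded trailing bytes

        def feed(self, data):
            data = self.buffer + data
            # a UTF-8 character is at most 4 bytes long, so try
            # progressively shorter prefixes until the head decodes cleanly
            for cut in range(len(data), max(len(data) - 4, -1), -1):
                try:
                    out = data[:cut].decode("utf-8")
                except UnicodeDecodeError:
                    continue
                self.buffer = data[cut:]
                return out
            raise UnicodeError("undecodable input")

        def flush(self):
            if self.buffer:
                raise UnicodeError("last character not complete")

        def reset(self):
            self.buffer = b""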






From gstein at lyra.org  Wed Nov 17 12:34:30 1999
From: gstein at lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 03:34:30 -0800 (PST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <001b01bf30ef$ffb08a20$f29b12c2@secret.pythonware.com>
Message-ID: 

On Wed, 17 Nov 1999, Fredrik Lundh wrote:
> Greg Stein  wrote:
> > Why a factory? I've got a simple encode() function. I don't need a
> > factory. "flexibility" at the cost of complexity (IMO).
> 
> so where do you put the state?

encode() is not supposed to retain state. It is supposed to do a complete
translation. It is not a stream thingy, which may have received partial
characters.

> how do you reset the state between
> strings?

There is none :-)

> how do you handle incremental
> decoding/encoding?

Streams.

-g

--
Greg Stein, http://www.lyra.org/




From fredrik at pythonware.com  Wed Nov 17 12:46:01 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 17 Nov 1999 12:46:01 +0100
Subject: [Python-Dev] Python 1.6 status
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: <004c01bf30f1$537102b0$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> - suggestions for new issues that maybe ought to be settled in 1.6

three things: imputil, imputil, imputil






From fredrik at pythonware.com  Wed Nov 17 12:51:33 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 17 Nov 1999 12:51:33 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: 
Message-ID: <006201bf30f2$194626f0$f29b12c2@secret.pythonware.com>

Greg Stein  wrote:
> > so where do you put the state?
>
> encode() is not supposed to retain state. It is supposed to do a complete
> translation. It is not a stream thingy, which may have received partial
> characters.
>
> > how do you handle incremental
> > decoding/encoding?
> 
> Streams.

hmm.  why have two different mechanisms when
you can do the same thing with one?






From gstein at lyra.org  Wed Nov 17 14:01:47 1999
From: gstein at lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 05:01:47 -0800 (PST)
Subject: [Python-Dev] Apache process (was: Python 1.6 status)
In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: 

On Tue, 16 Nov 1999, Guido van Rossum wrote:
>...
> Greg, I understand you have checkin privileges for Apache.  What is
> the procedure there for handing out those privileges?  What is the
> procedure for using them?  (E.g. if you made a bogus change to part of
> Apache you're not supposed to work on, what happens?)

Somebody proposes that a person is added to the list of people with
checkin privileges. If nobody else in the group vetoes that, then they're
in (their system doesn't require continual participation by each member,
so it can only operate at a veto level, rather than a unanimous assent).
It is basically determined on the basis of merit -- has the person been
active (on the Apache developer's mailing list) and has the person
contributed something significant? Further, by providing commit access,
will they further the goals of Apache? And, of course, does their
temperament seem to fit in with the other group members?

I can make any change that I'd like. However, there are about 20 other
people who can easily revert or alter my changes if they're bogus.
There are no programmatic restrictions.... You could say it is based on
mutual respect and a social contract of behavior. Large changes should be
discussed before committing to CVS. Bug fixes, doc enhancements, minor
functional improvements, etc, all follow a commit-then-review process. I
just check the thing in. Others see the diff (emailed to the checkins
mailing list (this is different from Python-checkins which only says what
files are changed, rather than providing the diff)) and can comment on the
change, make their own changes, etc.

To be concrete: I added the Expat code that now appears in Apache 1.3.9.
Before doing so, I queried the group. There were some issues that I dealt
with before finally committing Expat to the CVS repository. On another
occasion, I added a new API to Apache; again, I proposed it first, got an
"all OK" and committed it. I've done a couple bug fixes which I just
checked in.
[ "all OK" means three +1 votes and no vetoes. everybody has veto
  ability (but the responsibility to explain why and to remove their veto 
  when their concerns are addressed). ]

On many occasions, I've reviewed the diffs that were posted to the
checkins list, and made comments back to the author. I've caught a few
problems this way.

For Apache 2.0, even large changes are commit-then-review at this point.
At some point, it will switch over to review-then-commit and the project
will start moving towards stabilization/release. (bug fixes and stuff will
always remain commit-then-review)

I'll note that the process works very well given that diffs are emailed. I
doubt that it would be effective if people had to fetch CVS diffs
themselves.

Your note also implies "areas of ownership". This doesn't really exist
within Apache. There aren't even "primary authors" or things like that. I
have the ability/rights to change any portions: from the low-level
networking, to the documentation, to the server-side include processing.
Of course, if I'm going to make a big change, then I'll be posting a patch
for review first, and whoever has worked in that area in the past
may/will/should comment.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From guido at CNRI.Reston.VA.US  Wed Nov 17 14:32:05 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:32:05 -0500
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: Your message of "Wed, 17 Nov 1999 04:31:27 EST."
             <000801bf30de$85bea500$a42d153f@tim> 
References: <000801bf30de$85bea500$a42d153f@tim> 
Message-ID: <199911171332.IAA03266@kaluha.cnri.reston.va.us>

> I'm specifically requesting not to have checkin privileges.  So there.

I will force nobody to use checkin privileges.  However I see that
for some contributors, checkin privileges will save me and them time.

> I see two problems:
> 
> 1. When patches go thru you, you at least eyeball them.  This catches bugs
> and design errors early.

I will still eyeball them -- only after the fact.  Since checkins are
pretty public, being slapped on the wrist for a bad checkin is a
pretty big embarrassment, so few contributors will check in buggy code
more than once.  Moreover, there will be more eyeballs.

> 2. For a multi-platform app, few people have adequate resources for testing;
> e.g., I can test under an obsolete version of Win95, and NT if I have to,
> but that's it.  You may not actually do better testing than that, but having
> patches go thru you allows me the comfort of believing you do .

I expect that the same mechanisms will apply.  I have access to
Solaris, Linux and Windows (NT + 98) but it's actually a lot easier to
check portability after things have been checked in.  And again, there
will be more testers.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Wed Nov 17 14:34:23 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:34:23 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: Your message of "Tue, 16 Nov 1999 23:53:53 PST."
             <19991117075353.16046.rocketmail@web606.mail.yahoo.com> 
References: <19991117075353.16046.rocketmail@web606.mail.yahoo.com> 
Message-ID: <199911171334.IAA03374@kaluha.cnri.reston.va.us>

> This is the simplest if each codec really is likely to
> be implemented in a separate module.  But just look at
> the data!  All the iso-8859 encodings need identical
> functionality, and just have a different mapping table
> with 256 elements.  It would be trivial to implement
> these in one module.  And the wide variety of Japanese
> encodings (mostly corporate or historical variants of
> the same character set) are again best treated from
> one code base with a bunch of mapping tables and
> routines to generate the variants - basically one can
> store the deltas.
> 
> So the choice is between possibly having a lot of
> almost-dummy modules, or having Python modules which
> generate and register a logical family of encodings.  
> 
> I may have some time next week and will try to code up
> a few so we can pound on something.

I see no problem with having a lot of near-dummy modules if it
simplifies the architecture.  You can still do code sharing.  Files
are cheap; APIs are expensive.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Wed Nov 17 14:38:35 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:38:35 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: Your message of "Wed, 17 Nov 1999 02:57:36 PST."
              
References:  
Message-ID: <199911171338.IAA03511@kaluha.cnri.reston.va.us>

> This is taking a simple solution and making it complicated. I see no
> benefit to the creating yet-another-level-of-hierarchy. Why should they be
> grouped?
> 
> Leave the modules just under "encodings" and be done with it.

Agreed.  Tim Peters once remarked that Python likes shallow encodings
(or perhaps that *I* like them :-).  This is one such case where I
would strongly urge for the simplicity of a shallow hierarchy.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Wed Nov 17 14:43:44 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:43:44 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Wed, 17 Nov 1999 03:14:01 PST."
              
References:  
Message-ID: <199911171343.IAA03636@kaluha.cnri.reston.va.us>

> Why a factory? I've got a simple encode() function. I don't need a
> factory. "flexibility" at the cost of complexity (IMO).

Unless there are certain cases where factories are useful.  But let's
read on...

> > 	action - a string stating the supported action:
> > 			'encode'
> > 			'decode'
> > 			'stream write'
> > 			'stream read'
> 
> This action thing is subject to error. *if* you're wanting to go this
> route, then have:
> 
> unicodec.register_encode(...)
> unicodec.register_decode(...)
> unicodec.register_stream_write(...)
> unicodec.register_stream_read(...)
> 
> They are equivalent. Guido has also told me in the past that he dislikes
> parameters that alter semantics -- preferring different functions instead.

Yes, indeed!  (But weren't we going to do away with the whole registry
idea in favor of an encodings package?)

> Not that I'm advocating it, but register() could also take a single
> parameter: if a class, then instantiate it and call methods for each
> action; if an instance, then just call methods for each action.

Nah, that's bad -- a class is just a factory, and once you are
allowing classes it's really good to also allow factory functions.

> [ and the third/original variety: a function object as the first param is
>   the actual hook, and params 2 thru 4 (each are optional, or just the
>   stream funcs?) are the other hook functions ]

Fine too.  They should all be optional.

> > obj = factory_function_for_(errors='strict')
> 
> Where does this "errors" value come from? How does a user alter that
> value? Without an ability to change this, I see no reason for a factory.
> [ and no: don't tell me it is a thread-state value :-) ]
> 
> On the other hand: presuming the "errors" thing is valid, *then* I see a
> need for a factory.

The idea is that various places that take an encoding name can also
take a codec instance.  So the user can call the factory function /
class constructor.
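
Purely as illustration (these names are made up, not part of the proposal),
the "name or instance" idea might look like this:

    class AsciiCodec:
        def __init__(self, errors='strict'):
            self.errors = errors
        def encode(self, u):
            return u.encode('ascii', self.errors)

    _factories = {'ascii': AsciiCodec}

    def encode(u, encoding):
        if isinstance(encoding, str):
            codec = _factories[encoding]()       # name: default error handling
        else:
            codec = encoding                     # instance built by the caller
        return codec.encode(u)

    encode(u'abc', 'ascii')                              # default behaviour
    encode(u'abc\xe9', AsciiCodec(errors='replace'))     # user-chosen errors value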

> Truly... I dislike factories. IMO, they just add code/complexity in many
> cases where the functionality isn't needed. But that's just me :-)

Get over it...  In a sense, every Python class is a factory for its
own instances!  I think you must be confusing Python with Java or
C++. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Wed Nov 17 14:56:56 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:56:56 -0500
Subject: [Python-Dev] Apache process (was: Python 1.6 status)
In-Reply-To: Your message of "Wed, 17 Nov 1999 05:01:47 PST."
              
References:  
Message-ID: <199911171356.IAA04005@kaluha.cnri.reston.va.us>

> Somebody proposes that a person is added to the list of people with
> checkin privileges. If nobody else in the group vetoes that, then they're
> in (their system doesn't require continual participation by each member,
> so it can only operate at a veto level, rather than a unanimous assent).
> It is basically determined on the basis of merit -- has the person been
> active (on the Apache developer's mailing list) and has the person
> contributed something significant? Further, by providing commit access,
> will they further the goals of Apache? And, of course, does their
> temperament seem to fit in with the other group members?

This makes sense, but I have one concern: if somebody who isn't liked
very much (say a capable hacker who is a real troublemaker) asks for
privileges, would people veto this?  I'd be reluctant to go on record
as veto'ing a particular person.  (E.g. there are a few troublemakers
in c.l.py, and I would never want them to join python-dev let alone
give them commit privileges, but I'm not sure if I would want to
discuss this on a publicly archived mailing list -- or even on a
privately archived mailing list, given that the number of members
might be in the hundreds.)

[...stuff I like...]

> I'll note that the process works very well given that diffs are emailed. I
> doubt that it would be effective if people had to fetch CVS diffs
> themselves.

That's a great idea; I'll see if we can do that to our checkin email,
regardless of whether we hand out commit privileges.

> Your note also implies "areas of ownership". This doesn't really exist
> within Apache. There aren't even "primary authors" or things like that. I
> have the ability/rights to change any portions: from the low-level
> networking, to the documentation, to the server-side include processing.

But that's Apache, which is explicitly run as a collective.  In
Python, I definitely want to have ownership of certain sections of the
code.  But I agree that this doesn't need to be formalized by access
control lists; the social process you describe sounds like it will
work just fine.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fdrake at acm.org  Wed Nov 17 15:44:25 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 17 Nov 1999 09:44:25 -0500 (EST)
Subject: Weak refs (was [Python-Dev] just say no...)
In-Reply-To: <000601bf30da$e069d820$a42d153f@tim>
References: <14385.33486.855802.187739@weyr.cnri.reston.va.us>
	<000601bf30da$e069d820$a42d153f@tim>
Message-ID: <14386.48969.630893.119344@weyr.cnri.reston.va.us>

Tim Peters writes:
 > about it.  Guido hasn't shown visible interest, and nobody has been willing
 > to fight him to the death over it.  So it languishes.  Buy him lunch
 > tomorrow and get him excited .

  Guido has asked me to pursue this topic, so I'll be checking out
available implementations and seeing if any are adoptable or if
something different is needed to be fully general and
well-integrated.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From tim_one at email.msn.com  Thu Nov 18 04:21:16 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:21:16 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <38327D8F.7A5352E6@lemburg.com>
Message-ID: <000101bf3173$f9805340$c0a0143f@tim>

[MAL]
> Guido and I have decided to turn \uXXXX into a standard
> escape sequence with no further magic applied. \uXXXX will
> only be expanded in u"" strings.

Does that exclude ur"" strings?  Not arguing either way, just don't know
what all this means.

> Here's the new scheme:
>
> With the 'unicode-escape' encoding being defined as:
>
> ? all non-escape characters represent themselves as a Unicode ordinal
>   (e.g. 'a' -> U+0061).

Same as before (scream if that's wrong).

> ? all existing defined Python escape sequences are interpreted as
>   Unicode ordinals;

Same as before (ditto).

> note that \xXXXX can represent all Unicode ordinals,

This means that the definition of \xXXXX has changed, then -- as you pointed
out just yesterday , \xABCDq currently acts like \xCDq.  Does the new
\x definition apply only in u"" strings, or in "" strings too?  What is the
new \x definition?

> and \OOO (octal) can represent Unicode ordinals up to U+01FF.

Same as before (ditto).

> ? a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
>   error to have fewer than 4 digits after \u.

Same as before (ditto).

IOW, I don't see anything that's changed other than an unspecified new
treatment of \x escapes, and possibly that ur"" strings don't expand \u
escapes.

> Examples:
>
> u'abc'          -> U+0061 U+0062 U+0063
> u'\u1234'       -> U+1234
> u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+05c

The last example is damaged (U+05c isn't legit).  Other than that, these
look the same as before.

> Now how should we define ur"abc\u1234\n"  ... ?

If strings carried an encoding tag with them, the obvious answer is that
this acts exactly like r"abc\u1234\n" acts today except gets a
"unicode-escaped" encoding tag instead of a "[whatever the default is
today]" encoding tag.

If strings don't carry an encoding tag with them, you're in a bit of a
pickle:  you'll have to convert it to a regular string or a Unicode string,
but in either case have no way to communicate that it may need further
processing; i.e., no way to distinguish it from a regular or Unicode string
produced by any other mechanism.  The code I posted yesterday remains my
best answer to that unpleasant puzzle (i.e., produce a Unicode string,
fiddling with backslashes just enough to get the \u escapes expanded, in the
same way Java's (conceptual) preprocessor does it).
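
For illustration only -- this is not the code I posted, and it skips the
even/odd-backslash rule a real Java-style preprocessor honours -- the
"fiddle with backslashes just enough" step could be sketched as:

    import re

    def expand_u_escapes(raw):
        # expand \uXXXX escapes in an otherwise raw string; every other
        # backslash sequence is left untouched
        return re.sub(r'\\u([0-9a-fA-F]{4})',
                      lambda m: chr(int(m.group(1), 16)),
                      raw)

    expand_u_escapes(r"abc\u1234\n")   # U+1234 expanded, backslash-n kept literal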





From tim_one at email.msn.com  Thu Nov 18 04:21:19 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:21:19 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: 
Message-ID: <000201bf3173$fb7f7ea0$c0a0143f@tim>

[MAL]
> File objects opened in text mode will use "t#" and binary
> ones use "s#".

[Greg Stein]
> ...
> The real annoying thing would be to assume that opening a file as 'r'
> means that I *meant* text mode and to start using "t#".

Isn't that exactly what MAL said would happen?  Note that a "t" flag for
"text mode" is an MS extension -- C doesn't define "t", and Python doesn't
either; a lone "r" has always meant text mode.

> In actuality, I typically open files that way since I do most of my
> coding on Linux. If I now have to pay attention to things and open it
> as 'rb', then I'll be pissed.
>
> And the change in behavior and bugs that interpreting 'r' as text would
> introduce? Ack!

'r' is already interpreted as text mode, but so far, on Unix-like systems,
there's been no difference between text and binary modes.  Introducing a
distinction will certainly cause problems.  I don't know what the
compensating advantages are thought to be.





From tim_one at email.msn.com  Thu Nov 18 04:23:00 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:23:00 -0500
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: <199911171332.IAA03266@kaluha.cnri.reston.va.us>
Message-ID: <000301bf3174$37b465c0$c0a0143f@tim>

[Guido]
> I will force nobody to use checkin privileges.

That almost went without saying .

> However I see that for some contributors, checkin privileges will
> save me and them time.

Then it's Good!  Provided it doesn't hurt language stability.  I agree that
changing the system to mail out diffs addresses what I was worried about
there.





From tim_one at email.msn.com  Thu Nov 18 04:31:38 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:31:38 -0500
Subject: [Python-Dev] Apache process (was: Python 1.6 status)
In-Reply-To: <199911171356.IAA04005@kaluha.cnri.reston.va.us>
Message-ID: <000401bf3175$6c089660$c0a0143f@tim>

[Greg]
> ...
> Somebody proposes that a person is added to the list of people with
> checkin privileges. If nobody else in the group vetoes that, then
> they're in ...

[Guido]
> This makes sense, but I have one concern: if somebody who isn't liked
> very much (say a capable hacker who is a real troublemaker) asks for
> privileges, would people veto this?

It seems that a key point in Greg's description is that people don't propose
*themselves* for checkin.  They have to talk someone else into proposing
them.  That should keep Endang out of the running for a few years .

After that, I care more about their code than their personalities.  If the
stuff they check in is good, fine; if it's not, lock 'em out for direct
cause.

> I'd be reluctant to go on record as veto'ing a particular person.

Secret Ballot run off a web page -- although not so secret you can't see who
voted for what .





From tim_one at email.msn.com  Thu Nov 18 04:37:18 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:37:18 -0500
Subject: Weak refs (was [Python-Dev] just say no...)
In-Reply-To: <14386.48969.630893.119344@weyr.cnri.reston.va.us>
Message-ID: <000501bf3176$36a5ca00$c0a0143f@tim>

[Fred L. Drake, Jr.]
> Guido has asked me to pursue this topic [weak refs], so I'll be
> checking out available implementations and seeing if any are
> adoptable or if something different is needed to be fully general
> and well-integrated.

Just don't let "fully general" stop anything for its sake alone; e.g., if
there's a slick trick that *could* exempt numbers, that's all to the good!
Adding a pointer to every object is really unattractive, while adding a flag
or two to type objects is dirt cheap.

Note in passing that current Java addresses weak refs too (several flavors
of 'em! -- very elaborate).





From gstein at lyra.org  Thu Nov 18 09:09:24 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 18 Nov 1999 00:09:24 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <000201bf3173$fb7f7ea0$c0a0143f@tim>
Message-ID: 

On Wed, 17 Nov 1999, Tim Peters wrote:
>...
> 'r' is already interpreted as text mode, but so far, on Unix-like systems,
> there's been no difference between text and binary modes.  Introducing a
> distinction will certainly cause problems.  I don't know what the
> compensating advantages are thought to be.

Wow. "compensating advantages" ... Excellent "power phrase" there.

hehe...

-g

--
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Thu Nov 18 09:15:04 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:15:04 +0100
Subject: [Python-Dev] just say no...
References: <000201bf3173$fb7f7ea0$c0a0143f@tim>
Message-ID: <3833B588.1E31F01B@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > File objects opened in text mode will use "t#" and binary
> > ones use "s#".
> 
> [Greg Stein]
> > ...
> > The real annoying thing would be to assume that opening a file as 'r'
> > means that I *meant* text mode and to start using "t#".
> 
> Isn't that exactly what MAL said would happen?  Note that a "t" flag for
> "text mode" is an MS extension -- C doesn't define "t", and Python doesn't
> either; a lone "r" has always meant text mode.

Em, I think you've got something wrong here: "t#" refers to the
parsing marker used for writing data to files opened in text mode.

Until now, all files used the "s#" parsing marker for writing
data, regardless of being opened in text or binary mode. The
new interpretation (new, because there previously was none ;-)
of the buffer interface forces this to be changed to regain
conformance.

> > In actuality, I typically open files that way since I do most of my
> > coding on Linux. If I now have to pay attention to things and open it
> > as 'rb', then I'll be pissed.
> >
> > And the change in behavior and bugs that interpreting 'r' as text would
> > introduce? Ack!
> 
> 'r' is already intepreted as text mode, but so far, on Unix-like systems,
> there's been no difference between text and binary modes.  Introducing a
> distinction will certainly cause problems.  I don't know what the
> compensating advantages are thought to be.

I guess you won't notice any difference: strings define both
interfaces ("s#" and "t#") to mean the same thing. Only other
buffer compatible types may now fail to write to text files
-- which is not so bad, because it forces the programmer to
rethink what he really intended when opening the file in text
mode.

Besides, if you are writing portable scripts you should pay
close attention to "r" vs. "rb" anyway.
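
A small, standard-behaviour example of why the distinction matters for
portable code (nothing here is specific to the proposal):

    # in text mode the platform's C library may translate line endings
    # (e.g. "\n" <-> "\r\n" on Windows); binary mode hands the bytes over
    # untouched
    f = open("sample.txt", "w")
    f.write("one\ntwo\n")        # on Windows this puts \r\n on disk
    f.close()

    f = open("sample.txt", "rb")
    data = f.read()              # b"one\r\ntwo\r\n" on Windows, b"one\ntwo\n" on Unix
    f.close()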

[Strange, I find myself arguing for a feature that I don't
like myself ;-)]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 18 09:59:21 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:59:21 +0100
Subject: [Python-Dev] Python 1.6 status
References: <199911161700.MAA02716@eric.cnri.reston.va.us> <004c01bf30f1$537102b0$f29b12c2@secret.pythonware.com>
Message-ID: <3833BFE9.6FD118B1@lemburg.com>

Fredrik Lundh wrote:
> 
> Guido van Rossum  wrote:
> > - suggestions for new issues that maybe ought to be settled in 1.6
> 
> three things: imputil, imputil, imputil

But please don't add the current version as default importer...
its strategy is way too slow for real life apps (yes, I've tested
this: imports typically take twice as long as with the builtin
importer).

I'd opt for an import manager which provides a useful API for
import hooks to register themselves with. What we really need
is not yet another complete reimplementation of what the
builtin importer does, but rather a more detailed exposure of
the various import aspects: finding modules and loading modules.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 18 09:50:36 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:50:36 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>  <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com>
Message-ID: <3833BDDC.7CD2CC1F@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg  wrote:
> > >     def flush(self):
> > >         # flush the decoding buffers.  this should usually
> > >         # return None, unless the fact that knowing that the
> > >         # input stream has ended means that the state can be
> > >         # interpreted in a meaningful way.  however, if the
> > >         # state indicates that there last character was not
> > >         # finished, this method should raise a UnicodeError
> > >         # exception.
> >
> > Could you explain for reason for having a .flush() method
> > and what it should return.
> 
> in most cases, it should either return None, or
> raise a UnicodeError exception:
> 
>     >>> u = unicode("? i ?a ? e ?", "iso-latin-1")
>     >>> # yes, that's a valid Swedish sentence ;-)
>     >>> s = u.encode("utf-8")
>     >>> d = decoder("utf-8")
>     >>> d.decode(s[:-1])
>     "? i ?a ? e "
>     >>> d.flush()
>     UnicodeError: last character not complete
> 
> on the other hand, there are situations where it
> might actually return a string.  consider a "HTML
> entity decoder" which uses the following pattern
> to match a character entity: "&\w+;?" (note that
> the trailing semicolon is optional).
> 
>     >>> u = unicode("? i ?a ? e ?", "iso-latin-1")
>     >>> s = u.encode("html-entities")
>     >>> d = decoder("html-entities")
>     >>> d.decode(s[:-1])
>     "? i ?a ? e "
>     >>> d.flush()
>     "?"

Ah, ok. So the .flush() method checks for proper
string endings and then either returns the remaining
input or raises an error.
 
> > Perhaps I'm missing something, but how would you define
> > stream codecs using this interface ?
> 
> input: read chunks of data, decode, and
> keep extra data in a local buffer.
> 
> output: encode data into suitable chunks,
> and write to the output stream (that's why
> there's a buffersize argument to encode --
> if someone writes a 10mb unicode string to
> an encoded stream, python shouldn't allocate
> an extra 10-30 megabytes just to be able to
> encode the darn thing...)

So the stream codecs would be wrappers around the
string codecs.

Have you read my latest version of the Codec interface ?
Wouldn't that be a reasonable approach ? Note that I have
integrated your ideas into the new API -- it's basically
only missing the .flush() methods, which I can add now
that I know what you meant.
 
> > > Implementing stream codecs is left as an exercise (see the zlib
> > > material in the eff-bot guide for a decoder example).
> 
> everybody should have a copy of the eff-bot guide ;-)

Sure, but the format, the format... make it printed and add
a CD and you would probably have a good selling book
there ;-)
 
> (but alright, I plan to post a complete utf-8 implementation
> in a not too distant future).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 18 09:16:48 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:16:48 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: 
Message-ID: <3833B5F0.FA4620AD@lemburg.com>

Greg Stein wrote:
> 
> On Wed, 17 Nov 1999, M.-A. Lemburg wrote:
> >...
> > I'd suggest grouping encodings:
> >
> > [encodings]
> >       [iso]
> >               [iso88591]
> >               [iso88592]
> >       [jis]
> >               ...
> >       [cyrillic]
> >               ...
> >       [misc]
> 
> WHY?!?!
> 
> This is taking a simple solution and making it complicated. I see no
> benefit to the creating yet-another-level-of-hierarchy. Why should they be
> grouped?
> 
> Leave the modules just under "encodings" and be done with it.

Nevermind, was just an idea...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 18 09:43:31 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:43:31 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References:  <199911171343.IAA03636@kaluha.cnri.reston.va.us>
Message-ID: <3833BC33.66E134F@lemburg.com>

Guido van Rossum wrote:
> 
> > Why a factory? I've got a simple encode() function. I don't need a
> > factory. "flexibility" at the cost of complexity (IMO).
> 
> Unless there are certain cases where factories are useful.  But let's
> read on...
>
> > >     action - a string stating the supported action:
> > >                     'encode'
> > >                     'decode'
> > >                     'stream write'
> > >                     'stream read'
> >
> > This action thing is subject to error. *if* you're wanting to go this
> > route, then have:
> >
> > unicodec.register_encode(...)
> > unicodec.register_decode(...)
> > unicodec.register_stream_write(...)
> > unicodec.register_stream_read(...)
> >
> > They are equivalent. Guido has also told me in the past that he dislikes
> > parameters that alter semantics -- preferring different functions instead.
> 
> Yes, indeed!

Ok.

> (But weren't we going to do away with the whole registry
> idea in favor of an encodings package?)

One way or another, the Unicode implementation will have to
access a dictionary containing references to the codecs for
a particular encoding. You won't get around registering these
at some point... be it in a lazy way, on-the-fly or by some
other means.

What we could do is implement the lookup like this:

1. call encodings.lookup_<action>(encoding) and use the
   return value for the conversion
2. if all fails, cop out with an error

Step 1. would do all the import magic and then register
the found codecs in some dictionary for faster access
(perhaps this could be done in a way that is directly
available to the Unicode implementation, e.g. in a
global internal dictionary -- the one I originally had in
mind for the unicodec registry).
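
A sketch of that lookup-plus-cache step (all names hypothetical; the
interface each encoding module exposes is exactly what is still being
discussed):

    _cache = {}

    def lookup(encoding, _package="encodings"):
        # the first request imports the codec module on the fly; later
        # requests are a single dictionary lookup, so startup cost stays
        # independent of how many encodings are installed
        key = encoding.lower()
        if key in _cache:
            return _cache[key]
        modname = key.replace("-", "_")              # "iso-8859-1" -> "iso_8859_1"
        module = __import__(_package + "." + modname, None, None, [modname])
        _cache[key] = module                         # register for fast access
        return module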

> > Not that I'm advocating it, but register() could also take a single
> > parameter: if a class, then instantiate it and call methods for each
> > action; if an instance, then just call methods for each action.
> 
> Nah, that's bad -- a class is just a factory, and once you are
> > allowing classes it's really good to also allow factory functions.
> 
> > [ and the third/original variety: a function object as the first param is
> >   the actual hook, and params 2 thru 4 (each are optional, or just the
> >   stream funcs?) are the other hook functions ]
> 
> Fine too.  They should all be optional.

Ok.
 
> > > obj = factory_function_for_(errors='strict')
> >
> > Where does this "errors" value come from? How does a user alter that
> > value? Without an ability to change this, I see no reason for a factory.
> > [ and no: don't tell me it is a thread-state value :-) ]
> >
> > On the other hand: presuming the "errors" thing is valid, *then* I see a
> > need for a factory.
> 
> The idea is that various places that take an encoding name can also
> take a codec instance.  So the user can call the factory function /
> class constructor.

Right. The argument is reachable via:

Codec = encodings.lookup_encode('utf-8')
codec = Codec(errors='?')
s = codec(u"abc????")

s would then equal 'abc??'.

--

Should I go ahead then and change the registry business to
the new strategy (via the encodings package in the above
sense) ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mhammond at skippinet.com.au  Thu Nov 18 11:57:44 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu, 18 Nov 1999 21:57:44 +1100
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <3833BC33.66E134F@lemburg.com>
Message-ID: <002401bf31b3$bf16c230$0501a8c0@bobcat>

[Guido]
> > (But weren't we going to do away with the whole registry
> > idea in favor of an encodings package?)
>
[MAL]
> One way or another, the Unicode implementation will have to
> access a dictionary containing references to the codecs for
> a particular encoding. You won't get around registering these
> at some point... be it in a lazy way, on-the-fly or by some
> other means.

What is wrong with my idea of using well-known names from the encoding
module?  The dict then is "encodings.<encoding>.__dict__".  All
encodings "just work" because they leverage the Python module
system.  Unless I'm missing something, there is no need for any extra
registry at all.  I guess it would actually resolve to 2 dict lookups,
but that's OK surely?
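
For example (hypothetical sketch only, not an agreed interface), if each
encodings.<name> module simply defined its hooks at module level, the
lookup would be:

    def lookup(encoding, action):
        modname = "encodings." + encoding.lower().replace("-", "_")
        module = __import__(modname, None, None, ["*"])
        return module.__dict__[action]    # e.g. action = "encode" or "decode"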

Mark.




From mal at lemburg.com  Thu Nov 18 10:39:30 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 10:39:30 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf3173$f9805340$c0a0143f@tim>
Message-ID: <3833C952.C6F154B1@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > Guido and I have decided to turn \uXXXX into a standard
> > escape sequence with no further magic applied. \uXXXX will
> > only be expanded in u"" strings.
> 
> Does that exclude ur"" strings?  Not arguing either way, just don't know
> what all this means.
> 
> > Here's the new scheme:
> >
> > With the 'unicode-escape' encoding being defined as:
> >
> > ? all non-escape characters represent themselves as a Unicode ordinal
> >   (e.g. 'a' -> U+0061).
> 
> Same as before (scream if that's wrong).
> 
> > ? all existing defined Python escape sequences are interpreted as
> >   Unicode ordinals;
> 
> Same as before (ditto).
> 
> > note that \xXXXX can represent all Unicode ordinals,
> 
> This means that the definition of \xXXXX has changed, then -- as you pointed
> out just yesterday , \xABCDq currently acts like \xCDq.  Does the new
> \x definition apply only in u"" strings, or in "" strings too?  What is the
> new \x definition?

Guido decided to make \xYYXX return U+YYXX *only* within u""
strings. In  "" (Python strings) the same sequence will result
in chr(0xXX).
 
> > and \OOO (octal) can represent Unicode ordinals up to U+01FF.
> 
> Same as before (ditto).
> 
> > ? a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
> >   error to have fewer than 4 digits after \u.
> 
> Same as before (ditto).
> 
> IOW, I don't see anything that's changed other than an unspecified new
> treatment of \x escapes, and possibly that ur"" strings don't expand \u
> escapes.

The difference is that we no longer take the two step approach.
\uXXXX is treated at the same time all other escape sequences
are decoded (the previous version first scanned and decoded
all standard Python sequences and then turned to the \uXXXX
sequences in a second scan).
 
> > Examples:
> >
> > u'abc'          -> U+0061 U+0062 U+0063
> > u'\u1234'       -> U+1234
> > u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+05c
> 
> The last example is damaged (U+05c isn't legit).  Other than that, these
> look the same as before.

Corrected; thanks.
 
> > Now how should we define ur"abc\u1234\n"  ... ?
> 
> If strings carried an encoding tag with them, the obvious answer is that
> this acts exactly like r"abc\u1234\n" acts today except gets a
> "unicode-escaped" encoding tag instead of a "[whatever the default is
> today]" encoding tag.
> 
> If strings don't carry an encoding tag with them, you're in a bit of a
> pickle:  you'll have to convert it to a regular string or a Unicode string,
> but in either case have no way to communicate that it may need further
> processing; i.e., no way to distinguish it from a regular or Unicode string
> produced by any other mechanism.  The code I posted yesterday remains my
> best answer to that unpleasant puzzle (i.e., produce a Unicode string,
> fiddling with backslashes just enough to get the \u escapes expanded, in the
> same way Java's (conceptual) preprocessor does it).

They don't have such tags... so I guess we're in trouble ;-)

I guess to make ur"" have a meaning at all, we'd need to go
the Java preprocessor way here, i.e. scan the string *only*
for \uXXXX sequences, decode these and convert the rest as-is
to Unicode ordinals.

Would that be ok ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 18 12:41:32 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 12:41:32 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
Message-ID: <3833E5EC.AAFE5016@lemburg.com>

Mark Hammond wrote:
> 
> [Guido]
> > > (But weren't we going to do away with the whole registry
> > > idea in favor of an encodings package?)
> >
> [MAL]
> > One way or another, the Unicode implementation will have to
> > access a dictionary containing references to the codecs for
> > a particular encoding. You won't get around registering these
> > at some point... be it in a lazy way, on-the-fly or by some
> > other means.
> 
> What is wrong with my idea of using well-known names from the encoding
> module?  The dict then is "encodings.<encoding>.__dict__".  All
> encodings "just work" because they leverage the Python module
> system.  Unless I'm missing something, there is no need for any extra
> registry at all.  I guess it would actually resolve to 2 dict lookups,
> but that's OK surely?

The problem is that the encoding names are not Python identifiers,
e.g. iso-8859-1 is not allowed as an identifier. This and
the fact that applications may want to ship their own codecs (which
do not get installed under the system-wide encodings package)
make the registry necessary.

I don't see a problem with the registry though -- the encodings
package can take care of the registration process without any
user interaction. There would only have to be an API for
looking up an encoding published by the encodings package for
the Unicode implementation to use. The magic behind that API
is left to the encodings package...

BTW, nothing's wrong with your idea :-) In fact, I like it
a lot because it keeps the encoding modules out of the
top-level scope which is good.

PS: we could probably even take the whole codec idea one step
further and also allow other input/output formats to be registered,
e.g. stream ciphers or pickle mechanisms. The step in that
direction is not a big one: we'd only have to drop the specification
of the Unicode object in the spec and replace it with an arbitrary
object. Of course, this will still have to be a Unicode object
for use by the Unicode implementation.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From gmcm at hypernet.com  Thu Nov 18 15:19:48 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Thu, 18 Nov 1999 09:19:48 -0500
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: <3833BFE9.6FD118B1@lemburg.com>
Message-ID: <1269187709-18981857@hypernet.com>

Marc-Andre wrote:

> Fredrik Lundh wrote:
> > 
> > Guido van Rossum  wrote:
> > > - suggestions for new issues that maybe ought to be settled in 1.6
> > 
> > three things: imputil, imputil, imputil
> 
> But please don't add the current version as default importer...
> its strategy is way too slow for real life apps (yes, I've tested
> this: imports typically take twice as long as with the builtin
> importer).

I think imputil's emulation of the builtin importer is more of a 
demonstration than a serious implementation. As for speed, it 
depends on the test. 
 
> I'd opt for an import manager which provides a useful API for
> import hooks to register themselves with. 

I think that rather than having hooks blindly chain themselves together,
there should be a simple-minded manager. This could let the
programmer prioritize them.

> What we really need
> is not yet another complete reimplementation of what the
> builtin importer does, but rather a more detailed exposure of
> the various import aspects: finding modules and loading modules.

The first clause I sort of agree with - the current 
implementation is a fine implementation of a filesystem 
directory based importer.

I strongly disagree with the second clause. The current import 
hooks are just such a detailed exposure; and they are 
incomprehensible and unmanageable.

I guess you want to tweak the "finding" part of the builtin 
import mechanism. But that's no reason to ask all importers 
to break themselves up into "find" and "load" pieces. It's a 
reason to ask that the standard importer be, in some sense, 
"subclassable" (ie, expose hooks, or perhaps be an extension 
class like thingie).
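
Something along these lines, say -- a toy sketch (invented here, not the
API under discussion) of a standard importer whose finding step can be
overridden without touching the rest:

    import os, sys, types

    class Importer:
        def import_module(self, name, path=None):
            filename = self.find(name, path or sys.path)
            if filename is None:
                raise ImportError("no module named " + name)
            return self.load(name, filename)

        def find(self, name, path):
            # subclasses override this to change where modules are found
            for directory in path:
                candidate = os.path.join(directory, name + ".py")
                if os.path.exists(candidate):
                    return candidate
            return None

        def load(self, name, filename):
            # usually inherited unchanged by subclasses
            module = types.ModuleType(name)
            module.__file__ = filename
            f = open(filename)
            exec(compile(f.read(), filename, "exec"), module.__dict__)
            f.close()
            sys.modules[name] = module
            return module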

- Gordon



From jim at interet.com  Thu Nov 18 15:39:20 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Thu, 18 Nov 1999 09:39:20 -0500
Subject: [Python-Dev] Python 1.6 status
References: <1269187709-18981857@hypernet.com>
Message-ID: <38340F98.212F61@interet.com>

Gordon McMillan wrote:
> 
> Marc-Andre wrote:
> 
> > Fredrik Lundh wrote:
> > >
> > > Guido van Rossum  wrote:
> > > > - suggestions for new issues that maybe ought to be settled in 1.6
> > >
> > > three things: imputil, imputil, imputil
> >
> > But please don't add the current version as default importer...
> > its strategy is way too slow for real life apps (yes, I've tested
> > this: imports typically take twice as long as with the builtin
> > importer).
> 
> I think imputil's emulation of the builtin importer is more of a
> demonstration than a serious implementation. As for speed, it
> depends on the test.

IMHO the current import mechanism is good for developers who must
work on the library code in the directory tree, but a disaster
for sysadmins who must distribute Python applications either
internally to a number of machines or commercially.  What we
need is a standard Python library file like a Java "Jar" file.
Imputil can support this as 130 lines of Python.  I have also
written one in C.  I like the imputil approach, but if we want
to add a library importer to import.c, I volunteer to write it.

I don't want to just add more complicated and unmanageable hooks
which people will all use different ways and just add to the
confusion.

It is easy to install packages by just making them into a library
file and throwing it into a directory.  So why aren't we doing it?

Jim Ahlstrom



From guido at CNRI.Reston.VA.US  Thu Nov 18 16:30:28 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 10:30:28 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
In-Reply-To: Your message of "Thu, 18 Nov 1999 09:19:48 EST."
             <1269187709-18981857@hypernet.com> 
References: <1269187709-18981857@hypernet.com> 
Message-ID: <199911181530.KAA03887@eric.cnri.reston.va.us>

Gordon McMillan wrote:

> Marc-Andre wrote:
> 
> > Fredrik Lundh wrote:
> >
> > > Guido van Rossum  wrote:
> > > > - suggestions for new issues that maybe ought to be settled in 1.6
> > > 
> > > three things: imputil, imputil, imputil
> > 
> > But please don't add the current version as default importer...
> > its strategy is way too slow for real life apps (yes, I've tested
> > this: imports typically take twice as long as with the builtin
> > importer).
> 
> I think imputil's emulation of the builtin importer is more of a 
> demonstration than a serious implementation. As for speed, it 
> depends on the test. 

Agreed.  I like some of imputil's features, but I think the API
needs to be redesigned.

> > I'd opt for an import manager which provides a useful API for
> > import hooks to register themselves with. 
> 
> I think that rather than blindly chain themselves together, there 
> should be a simple minded manager. This could let the 
> programmer prioritize them.

Indeed.  (A list of importers has been suggested, to replace the list
of directories currently used.)

> > What we really need
> > is not yet another complete reimplementation of what the
> > builtin importer does, but rather a more detailed exposure of
> > the various import aspects: finding modules and loading modules.
> 
> The first clause I sort of agree with - the current 
> implementation is a fine implementation of a filesystem 
> directory based importer.
> 
> I strongly disagree with the second clause. The current import 
> hooks are just such a detailed exposure; and they are 
> incomprehensible and unmanagable.

Based on how many people have successfully written import hooks, I
have to agree. :-(

> I guess you want to tweak the "finding" part of the builtin 
> import mechanism. But that's no reason to ask all importers 
> to break themselves up into "find" and "load" pieces. It's a 
> reason to ask that the standard importer be, in some sense, 
> "subclassable" (ie, expose hooks, or perhaps be an extension 
> class like thingie).

Agreed.  Subclassing is a good way towards flexibility.

And Jim Ahlstrom writes:

> IMHO the current import mechanism is good for developers who must
> work on the library code in the directory tree, but a disaster
> for sysadmins who must distribute Python applications either
> internally to a number of machines or commercially.

Unfortunately, you're right. :-(

> What we need is a standard Python library file like a Java "Jar"
> file.  Imputil can support this as 130 lines of Python.  I have also
> written one in C.  I like the imputil approach, but if we want to
> add a library importer to import.c, I volunteer to write it.

Please volunteer to design or at least review the grand architecture
-- see below.

> I don't want to just add more complicated and unmanageable hooks
> which people will all use different ways and just add to the
> confusion.

You're so right!

> It is easy to install packages by just making them into a library
> file and throwing it into a directory.  So why aren't we doing it?

Rhetorical question. :-)

So here's a challenge: redesign the import API from scratch.

Let me start with some requirements.

Compatibility issues:
---------------------

- the core API may be incompatible, as long as compatibility layers
can be provided in pure Python

- support for rexec functionality

- support for freeze functionality

- load .py/.pyc/.pyo files and shared libraries from files

- support for packages

- sys.path and sys.modules should still exist; sys.path might
have a slightly different meaning

- $PYTHONPATH and $PYTHONHOME should still be supported

(I wouldn't mind a splitting up of importdl.c into several
platform-specific files, one of which is chosen by the configure
script; but that's a bit of a separate issue.)

New features:
-------------

- Integrated support for Greg Ward's distribution utilities (i.e. a
  module prepared by the distutil tools should install painlessly)

- Good support for prospective authors of "all-in-one" packaging tools
  like Gordon McMillan's win32 installer or /F's squish.  (But
  I *don't* require backwards compatibility for existing tools.)

- Standard import from zip or jar files, in two ways:

  (1) an entry on sys.path can be a zip/jar file instead of a directory;
      its contents will be searched for modules or packages

  (2) a file in a directory that's on sys.path can be a zip/jar file;
      its contents will be considered as a package (note that this is
      different from (1)!)

  I don't particularly care about supporting all zip compression
  schemes; if Java gets away with only supporting gzip compression
  in jar files, so can we.

- Easy ways to subclass or augment the import mechanism along
  different dimensions.  For example, while none of the following
  features should be part of the core implementation, it should be
  easy to add any or all:

  - support for a new compression scheme to the zip importer

  - support for a new archive format, e.g. tar

  - a hook to import from URLs or other data sources (e.g. a
    "module server" imported in CORBA) (this needn't be supported
    through $PYTHONPATH though)

  - a hook that imports from compressed .py or .pyc/.pyo files

  - a hook to auto-generate .py files from other filename
    extensions (as currently implemented by ILU)

  - a cache for file locations in directories/archives, to improve
    startup time

  - a completely different source of imported modules, e.g. for an
    embedded system or PalmOS (which has no traditional filesystem)

- Note that different kinds of hooks should (ideally, and within
  reason) properly combine, as follows: if I write a hook to recognize
  .spam files and automatically translate them into .py files, and you
  write a hook to support a new archive format, then if both hooks are
  installed together, it should be possible to find a .spam file in an
  archive and do the right thing, without any extra action.  Right?

- It should be possible to write hooks in C/C++ as well as Python

- Applications embedding Python may supply their own implementations,
  default search path, etc., but don't have to if they want to piggyback
  on an existing Python installation (even though the latter is
  fraught with risk, it's cheaper and easier to understand).

Implementation:
---------------

- There must clearly be some code in C that can import certain
  essential modules (to solve the chicken-or-egg problem), but I don't
  mind if the majority of the implementation is written in Python.
  Using Python makes it easy to subclass.

- In order to support importing from zip/jar files using compression,
  we'd at least need the zlib extension module and hence libz itself,
  which may not be available everywhere.

- I suppose that the bootstrap is solved using a mechanism very
  similar to what freeze currently used (other solutions seem to be
  platform dependent).

- I also want to still support importing *everything* from the
  filesystem, if only for development.  (It's hard enough to deal with
  the fact that exceptions.py is needed during Py_Initialize();
  I want to be able to hack on the import code written in Python
  without having to rebuild the executable all the time.)

Let's first complete the requirements gathering.  Are these
requirements reasonable?  Will they make an implementation too
complex?  Am I missing anything?

Finally, to what extent does this impact the desire for dealing
differently with the Python bytecode compiler (e.g. supporting
optimizers written in Python)?  And does it affect the desire to
implement the read-eval-print loop (the >>> prompt) in Python?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Thu Nov 18 16:37:49 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 10:37:49 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Thu, 18 Nov 1999 12:41:32 +0100."
             <3833E5EC.AAFE5016@lemburg.com> 
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>  
            <3833E5EC.AAFE5016@lemburg.com> 
Message-ID: <199911181537.KAA03911@eric.cnri.reston.va.us>

> The problem is that the encoding names are not Python identifiers,
> e.g. iso-8859-1 is not allowed as an identifier.

This is easily taken care of by translating each string of consecutive
non-identifier-characters to an underscore, so this would import the
iso_8859_1.py module.  (I also noticed in an earlier post that the
official name for Shift_JIS has an underscore, while most other
encodings use hyphens.)
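
For illustration, the translation could be as simple as this (the
helper name is invented here, not part of any proposal):

import re

def encoding_to_module(encoding):
    # Collapse each run of non-identifier characters into a single
    # underscore, so "ISO-8859-1" maps to the module name "iso_8859_1".
    return re.sub(r"[^0-9A-Za-z_]+", "_", encoding).lower()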

> This and
> the fact that applications may want to ship their own codecs (which
> do not get installed under the system wide encodings package)
> make the registry necessary.

But it could be enough to register a package in which to look for
encodings (in addition to the system package).

Or there could be a registry for encoding search functions.  (See the
import discussion.)

> I don't see a problem with the registry though -- the encodings
> package can take care of the registration process without any
> user interaction. There would only have to be an API for
> looking up an encoding published by the encodings package for
> the Unicode implementation to use. The magic behind that API
> is left to the encodings package...

I think that the collection of encodings will eventually grow large
enough to make it a requirement to avoid doing work proportional to
the number of supported encodings at startup (or even when an encoding
is referenced for the first time).  Any "lazy" mechanism (of which
module search is an example) will do.

> BTW, nothing's wrong with your idea :-) In fact, I like it
> a lot because it keeps the encoding modules out of the
> top-level scope which is good.

Yes.

> PS: we could probably even take the whole codec idea one step
> further and also allow other input/output formats to be registered,
> e.g. stream ciphers or pickle mechanisms. The step in that
> direction is not a big one: we'd only have to drop the specification
> of the Unicode object in the spec and replace it with an arbitrary
> object. Of course, this will still have to be a Unicode object
> for use by the Unicode implementation.

This is a step towards Java's architecture of stackable streams.

But I'm always in favor of tackling what we know we need before
tackling the most generalized version of the problem.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Thu Nov 18 16:52:26 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 16:52:26 +0100
Subject: [Python-Dev] Python 1.6 status
References: <1269187709-18981857@hypernet.com> <38340F98.212F61@interet.com>
Message-ID: <383420BA.EF8A6AC5@lemburg.com>

[imputil and friends]

"James C. Ahlstrom" wrote:
> 
> IMHO the current import mechanism is good for developers who must
> work on the library code in the directory tree, but a disaster
> for sysadmins who must distribute Python applications either
> internally to a number of machines or commercially.  What we
> need is a standard Python library file like a Java "Jar" file.
> Imputil can support this as 130 lines of Python.  I have also
> written one in C.  I like the imputil approach, but if we want
> to add a library importer to import.c, I volunteer to write it.
> 
> I don't want to just add more complicated and unmanageable hooks
> which people will all use different ways and just add to the
> confusion.
> 
> It is easy to install packages by just making them into a library
> file and throwing it into a directory.  So why aren't we doing it?

Perhaps we ought to rethink the strategy under a different
light: what are the real requirements we have for Python imports ?

Perhaps the outcome is only the addition of, say, one or two features,
and those can probably easily be added to the builtin system...
then we can just forget about the whole import hook dilemma
for quite a while (AFAIK, this is how we got packages into the
core -- people weren't happy with the import hook).

Well, just an idea... I have other threads to follow :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fdrake at acm.org  Thu Nov 18 17:01:47 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 18 Nov 1999 11:01:47 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <3833E5EC.AAFE5016@lemburg.com>
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
	<3833E5EC.AAFE5016@lemburg.com>
Message-ID: <14388.8939.911928.41746@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > The problem is that the encoding names are not Python identifiers,
 > e.g. iso-8859-1 is not allowed as an identifier. This and
 > the fact that applications may want to ship their own codecs (which
 > do not get installed under the system wide encodings package)
 > make the registry necessary.

  This isn't a substantial problem.  Try this on for size (probably
not too different from what everyone is already thinking, but let's
make it clear).  This could be in encodings/__init__.py; I've tried to 
be really clear on the names.  (No testing, only partially complete.)

------------------------------------------------------------------------
import string
import sys

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO


class EncodingError(Exception):
    def __init__(self, encoding, error):
        self.encoding = encoding
        self.strerror = "%s %s" % (error, `encoding`)
        self.error = error
        Exception.__init__(self, encoding, error)


_registry = {}

def registerEncoding(encoding, encode=None, decode=None,
                     make_stream_encoder=None, make_stream_decoder=None):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        info = _registry[encoding]
    else:
        info = _registry[encoding] = Codec(encoding)
    info._update(encode, decode,
                 make_stream_encoder, make_stream_decoder)


def getCodec(encoding):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        return _registry[encoding]

    # load the module
    modname = "encodings." + encoding.replace("-", "_")
    try:
        __import__(modname)
    except ImportError:
        raise EncodingError("unknown uncoding " + `encoding`)

    # if the module registered, use the codec as-is:
    if _registry.has_key(encoding):
        return _registry[encoding]

    # nothing registered, use well-known names
    module = sys.modules[modname]
    codec = _registry[encoding] = Codec(encoding)
    encode = getattr(module, "encode", None)
    decode = getattr(module, "decode", None)
    make_stream_encoder = getattr(module, "make_stream_encoder", None)
    make_stream_decoder = getattr(module, "make_stream_decoder", None)
    codec._update(encode, decode,
                  make_stream_encoder, make_stream_decoder)
    return codec


class Codec:
    __encode = None
    __decode = None
    __stream_encoder_factory = None
    __stream_decoder_factory = None

    def __init__(self, name):
        self.name = name

    def encode(self, u):
        if self.__stream_encoder_factory:
            sio = StringIO()
            encoder = self.__stream_encoder_factory(sio)
            encoder.write(u)
            encoder.flush()
            return sio.getvalue()
        else:
            raise EncodingError("no encoder available for " + `self.name`)

    # similar for decode()...

    def make_stream_encoder(self, target):
        if self.__stream_encoder_factory:
            return self.__stream_encoder_factory(target)
        elif self.__encode:
            return DefaultStreamEncoder(target, self.__encode)
        else:
            raise EncodingError("no encoder available for " + `self.name`)

    # similar for make_stream_decoder()...

    def _update(self, encode, decode,
                make_stream_encoder, make_stream_decoder):
        self.__encode = encode or self.__encode
        self.__decode = decode or self.__decode
        self.__stream_encoder_factory = (
            make_stream_encoder or self.__stream_encoder_factory)
        self.__stream_decoder_factory = (
            make_stream_decoder or self.__stream_decoder_factory)
------------------------------------------------------------------------
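
Usage of the above might then look roughly like this (assuming an
encodings.iso_8859_1 module exists and defines the well-known entry
points, and that u is a Unicode object):

codec = getCodec("ISO-8859-1")    # imports encodings.iso_8859_1 on demand
data = codec.encode(u)
writer = codec.make_stream_encoder(open("out.txt", "wb"))
writer.write(u)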

 > I don't see a problem with the registry though -- the encodings
 > package can take care of the registration process without any

  No problem at all; we just need to make sure the right magic is
there for the "normal" case.

 > PS: we could probably even take the whole codec idea one step
 > further and also allow other input/output formats to be registered,

  File formats are different from text encodings, so let's keep them
separate.  Yes, a registry can be a good approach whenever the various 
things being registered are sufficiently similar semantically, but the 
behavior of the registry/lookup can be very different for each type of 
thing.  Let's not over-generalize.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From fdrake at acm.org  Thu Nov 18 17:02:45 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 18 Nov 1999 11:02:45 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <3833E5EC.AAFE5016@lemburg.com>
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
	<3833E5EC.AAFE5016@lemburg.com>
Message-ID: <14388.8997.703108.401808@weyr.cnri.reston.va.us>

  Er, I should note that the sample code I just sent makes use of
string methods.  ;)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From mal at lemburg.com  Thu Nov 18 17:23:09 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 17:23:09 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>  
	            <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us>
Message-ID: <383427ED.45A01BBB@lemburg.com>

Guido van Rossum wrote:
> 
> > The problem is that the encoding names are not Python identifiers,
> > e.g. iso-8859-1 is not allowed as an identifier.
> 
> This is easily taken care of by translating each string of consecutive
> non-identifier-characters to an underscore, so this would import the
> iso_8859_1.py module.  (I also noticed in an earlier post that the
> official name for Shift_JIS has an underscore, while most other
> encodings use hyphens.)

Right. That's one way of doing it.

> > This and
> > the fact that applications may want to ship their own codecs (which
> > do not get installed under the system wide encodings package)
> > make the registry necessary.
> 
> But it could be enough to register a package where to look for
> encodings (in addition to the system package).
> 
> Or there could be a registry for encoding search functions.  (See the
> import discussion.)

Like a path of search functions ? Not a bad idea... I will still
want the internal dict for caching purposes though. I'm not sure
how often these encodings will be looked up, but even a few hundred
function calls will slow down the Unicode implementation quite a bit.

The implementation could proceed as follows:

def lookup(encoding):

    codecs = _internal_dict.get(encoding,None)
    if codecs:
	return codecs
    for query in sys.encoders:
	codecs = query(encoding)
	if codecs:
	    break
    else:
	raise UnicodeError,'unknown encoding: %s' % encoding
    _internal_dict[encoding] = codecs
    return codecs

For simplicity, codecs should be a tuple (encoder,decoder,
stream_writer,stream_reader) of factory functions.

...that is if we can agree on these 4 APIs :-) Here are my
current versions:
-----------------------------------------------------------------------
class Codec:

    """ Defines the interface for stateless encoders/decoders.
    """

    def __init__(self,errors='strict'):

	""" Creates a Codec instance.

	    The Codec may implement different error handling
	    schemes by providing the errors argument. These parameters
	    are defined:

	     'strict'           - raise a UnicodeError (or a subclass)
	     'ignore'           - ignore the character and continue with
	                          the next one
	     a single character - replace erroneous characters with the
	                          given character (may also be a Unicode
	                          character)

	"""
	self.errors = errors

    def encode(self,u,slice=None):
	
	""" Return the Unicode object u encoded as Python string.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is encoded.

	    The method may not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	"""
	...

    def decode(self,s,offset=0):

	""" Decodes data from the Python string s and returns a tuple 
	    (Unicode object, bytes consumed).
	
	    If offset is given, the decoding process starts at
	    s[offset]. It defaults to 0.

	    The method may not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	""" 
	...


StreamWriter and StreamReader define the interface for stateful
encoders/decoders:

class StreamWriter(Codec):

    def __init__(self,stream,errors='strict'):

	""" Creates a StreamWriter instance.

	    stream must be a file-like object open for writing
	    (binary) data.

	    The StreamWriter may implement different error handling
	    schemes by providing the errors argument. These parameters
	    are defined:

	     'strict'           - raise a UnicodeError (or a subclass)
	     'ignore'           - ignore the character and continue with
	                          the next one
	     a single character - replace erroneous characters with the
	                          given character (may also be a Unicode
	                          character)

	"""
	self.stream = stream

    def write(self,u,slice=None):

	""" Writes the Unicode object's contents encoded to self.stream
	    and returns the number of bytes written.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is written.

        """
	... the base class should provide a default implementation
	    of this method using self.encode ...
	
    def flush(self):

	""" Flushed the codec buffers used for keeping state.

	    Returns values are not defined. Implementations are free to
	    return None, raise an exception (in case there is pending
	    data in the buffers which could not be decoded) or
	    return any remaining data from the state buffers used.

	"""
	pass

class StreamReader(Codec):

    def __init__(self,stream,errors='strict'):

	""" Creates a StreamReader instance.

	    stream must be a file-like object open for reading
	    (binary) data.

	    The StreamReader may implement different error handling
	    schemes by providing the errors argument. These parameters
	    are defined:

	     'strict'           - raise a UnicodeError (or a subclass)
	     'ignore'           - ignore the character and continue with
	                          the next one
	     a single character - replace erroneous characters with the
	                          given character (may also be a Unicode
	                          character)

	"""
	self.stream = stream

    def read(self,chunksize=0):

	""" Decodes data from the stream self.stream and returns a tuple 
	    (Unicode object, bytes consumed).

	    chunksize indicates the approximate maximum number of
	    bytes to read from the stream for decoding purposes. The
	    decoder can modify this setting as appropriate. The default
	    value 0 indicates to read and decode as much as possible.
	    The chunksize is intended to prevent having to decode huge
	    files in one step.

        """
	... the base class should provide a default implementation
	    of this method using self.decode ...

    def flush(self):

	""" Flushed the codec buffers used for keeping state.

	    Returns values are not defined. Implementations are free to
	    return None, raise an exception (in case there is pending
	    data in the buffers which could not be decoded) or
	    return any remaining data from the state buffers used.

	"""

In addition to the above methods, the StreamWriter and StreamReader
instances should also provide access to all other methods defined for
the stream object.

Stream codecs are free to combine the StreamWriter and StreamReader
interfaces into one class.
-----------------------------------------------------------------------
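
To make the intended behaviour of the errors argument concrete, here
is a toy, self-contained writer following the interface above (ASCII
only, purely illustrative -- not a proposed codec):

class ToyASCIIStreamWriter:

    def __init__(self, stream, errors='strict'):
        self.stream = stream
        self.errors = errors

    def write(self, u):
        out = []
        for ch in u:
            if ord(ch) < 128:
                out.append(ch)
            elif self.errors == 'strict':
                raise UnicodeError('character not encodable: %r' % ch)
            elif self.errors == 'ignore':
                pass
            else:
                # errors was given as a single replacement character
                out.append(self.errors)
        data = ''.join(out)
        self.stream.write(data)
        return len(data)

    def flush(self):
        # stateless toy codec: nothing is buffered
        pass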

> > I don't see a problem with the registry though -- the encodings
> > package can take care of the registration process without any
> > user interaction. There would only have to be an API for
> > looking up an encoding published by the encodings package for
> > the Unicode implementation to use. The magic behind that API
> > is left to the encodings package...
> 
> I think that the collection of encodings will eventually grow large
> enough to make it a requirement to avoid doing work proportional to
> the number of supported encodings at startup (or even when an encoding
> is referenced for the first time).  Any "lazy" mechanism (of which
> module search is an example) will do.

Right. The list of search functions should provide this kind
of laziness. It also provides ways to implement other strategies
to look for codecs, e.g. PIL could provide such a search function
for its codecs, mxCrypto for the included ciphers, etc.
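
Such a search function might look roughly like this (the encoding name
is made up; sys.encoders is the search path from the lookup() sketch
above, and everything else is invented for illustration):

def demo_search(encoding):
    if encoding != 'upper-demo':
        return None                     # let the next search function try

    def encode(u):
        return u.upper()                # stand-in for a real encoder

    def decode(s):
        return s.lower(), len(s)        # (decoded object, bytes consumed)

    # no stream support in this toy codec
    return (encode, decode, None, None)

# registration would then simply be:
#     sys.encoders.append(demo_search)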
 
> > BTW, nothing's wrong with your idea :-) In fact, I like it
> > a lot because it keeps the encoding modules out of the
> > top-level scope which is good.
> 
> Yes.
> 
> > PS: we could probably even take the whole codec idea one step
> > further and also allow other input/output formats to be registered,
> > e.g. stream ciphers or pickle mechanisms. The step in that
> > direction is not a big one: we'd only have to drop the specification
> > of the Unicode object in the spec and replace it with an arbitrary
> > object. Of course, this will still have to be a Unicode object
> > for use by the Unicode implementation.
> 
> This is a step towards Java's architecture of stackable streams.
> 
> But I'm always in favor of tackling what we know we need before
> tackling the most generalized version of the problem.

Well, I just wanted to mention the possibility... might be
something to look into next year. I find it rather thrilling
to be able to create encrypted streams by just hooking together
a few stream codecs...

f = open('myfile.txt','w')

# [2] is the stream_writer factory in the proposed
# (encoder, decoder, stream_writer, stream_reader) tuple
CipherWriter = sys.codec('rc5-cipher')[2]
sf = CipherWriter(f,key='xxxxxxxx')

UTF8Writer = sys.codec('utf-8')[2]
sfx = UTF8Writer(sf)

sfx.write('asdfasdfasdfasdf')
sfx.close()

Hmm, we should probably define the additional constructor
arguments to be keyword arguments... writers/readers other
than Unicode ones will probably need different kinds of
parameters (such as the key in the above example).

Ahem, ...I'm getting distracted here :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From bwarsaw at cnri.reston.va.us  Thu Nov 18 17:23:41 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Thu, 18 Nov 1999 11:23:41 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
	<3833E5EC.AAFE5016@lemburg.com>
	<14388.8997.703108.401808@weyr.cnri.reston.va.us>
Message-ID: <14388.10253.902424.904199@anthem.cnri.reston.va.us>

>>>>> "Fred" == Fred L Drake, Jr  writes:

    Fred>   Er, I should note that the sample code I just sent makes
    Fred> use of string methods.  ;)

Yay!



From guido at CNRI.Reston.VA.US  Thu Nov 18 17:37:08 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 11:37:08 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Thu, 18 Nov 1999 17:23:09 +0100."
             <383427ED.45A01BBB@lemburg.com> 
References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us>  
            <383427ED.45A01BBB@lemburg.com> 
Message-ID: <199911181637.LAA04260@eric.cnri.reston.va.us>

> Like a path of search functions ? Not a bad idea... I will still
> want the internal dict for caching purposes though. I'm not sure
> how often these encodings will be, but even a few hundred function
> call will slow down the Unicode implementation quite a bit.

Of course.  (It's like sys.modules caching the results of an import).

[...]
>     def flush(self):
> 
> 	""" Flushed the codec buffers used for keeping state.
> 
> 	    Returns values are not defined. Implementations are free to
> 	    return None, raise an exception (in case there is pending
> 	    data in the buffers which could not be decoded) or
> 	    return any remaining data from the state buffers used.
> 
> 	"""

I don't know where this came from, but a flush() should work like
flush() on a file.  It doesn't return a value, it just sends any
remaining data to the underlying stream (for output).  For input it
shouldn't be supported at all.

The idea is that flush() should do the same to the encoder state that
close() followed by a reopen() would do.  Well, more or less.  But if
the process were to be killed right after a flush(), the data written
to disk should be a complete encoding, and not have a lingering shift
state.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Thu Nov 18 17:59:06 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 11:59:06 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Thu, 18 Nov 1999 09:50:36 +0100."
             <3833BDDC.7CD2CC1F@lemburg.com> 
References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com>  
            <3833BDDC.7CD2CC1F@lemburg.com> 
Message-ID: <199911181659.LAA04303@eric.cnri.reston.va.us>

[Responding to some lingering mails]

[/F]
> >     >>> u = unicode("? i ?a ? e ?", "iso-latin-1")
> >     >>> s = u.encode("html-entities")
> >     >>> d = decoder("html-entities")
> >     >>> d.decode(s[:-1])
> >     "? i ?a ? e "
> >     >>> d.flush()
> >     "?"

[MAL]
> Ah, ok. So the .flush() method checks for proper
> string endings and then either returns the remaining
> input or raises an error.

No, please.  See my previous post on flush().

> > input: read chunks of data, decode, and
> > keep extra data in a local buffer.
> > 
> > output: encode data into suitable chunks,
> > and write to the output stream (that's why
> > there's a buffersize argument to encode --
> > if someone writes a 10mb unicode string to
> > an encoded stream, python shouldn't allocate
> > an extra 10-30 megabytes just to be able to
> > encode the darn thing...)
> 
> So the stream codecs would be wrappers around the
> string codecs.

No -- the other way around.  Think of the stream encoder as a little
FSM engine that you feed with unicode characters and which sends bytes
to the backend stream.  When a unicode character comes in that
requires a particular shift state, and the FSM isn't in that shift
state, it emits the escape sequence to enter that shift state first.
It should use standard buffered writes to the output stream; i.e. one
call to feed the encoder could cause several calls to write() on the
output stream, or vice versa (if you fed the encoder a single
character it might keep it in its own buffer).  That's all up to the
codec implementation.

The flush() forces the FSM into the "neutral" shift state, possibly
writing an escape sequence to leave the current shift state, and
empties the internal buffer.

The string codec CONCEPTUALLY uses the stream codec to a cStringIO
object, using flush() to force the final output.  However the
implementation may take a shortcut.  For stateless encodings the
stream codec may call on the string codec, but that's all an
implementation issue.

For input, things are slightly different (you don't know how much
encoded data you must read to give you N Unicode characters, so you
may have to make a guess and hold on to some data that you read
unnecessarily -- either in encoded form or in Unicode form, at the
discretion of the implementation).  Using seek() on the input stream
is forbidden (it could be a pipe or socket).
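
A toy illustration of such an FSM (the escape bytes and the class are
invented here, not part of any proposal):

class DemoShiftEncoder:

    SHIFT_IN  = '\x0e'     # invented escape: enter "high" mode
    SHIFT_OUT = '\x0f'     # invented escape: back to neutral mode

    def __init__(self, stream):
        self.stream = stream
        self.shifted = 0

    def write(self, text):
        for ch in text:
            high = ord(ch) > 127
            if high and not self.shifted:
                self.stream.write(self.SHIFT_IN)
                self.shifted = 1
            elif not high and self.shifted:
                self.stream.write(self.SHIFT_OUT)
                self.shifted = 0
            self.stream.write(chr(ord(ch) & 0x7f))

    def flush(self):
        # force the FSM back into the neutral shift state, so the bytes
        # written so far form a complete encoding on their own
        if self.shifted:
            self.stream.write(self.SHIFT_OUT)
            self.shifted = 0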

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Thu Nov 18 18:11:51 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 12:11:51 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: Your message of "Thu, 18 Nov 1999 10:39:30 +0100."
             <3833C952.C6F154B1@lemburg.com> 
References: <000101bf3173$f9805340$c0a0143f@tim>  
            <3833C952.C6F154B1@lemburg.com> 
Message-ID: <199911181711.MAA04342@eric.cnri.reston.va.us>

> > > Now how should we define ur"abc\u1234\n"  ... ?
> > 
> > If strings carried an encoding tag with them, the obvious answer is that
> > this acts exactly like r"abc\u1234\n" acts today except gets a
> > "unicode-escaped" encoding tag instead of a "[whatever the default is
> > today]" encoding tag.
> > 
> > If strings don't carry an encoding tag with them, you're in a bit of a
> > pickle:  you'll have to convert it to a regular string or a Unicode string,
> > but in either case have no way to communicate that it may need further
> > processing; i.e., no way to distinguish it from a regular or Unicode string
> > produced by any other mechanism.  The code I posted yesterday remains my
> > best answer to that unpleasant puzzle (i.e., produce a Unicode string,
> > fiddling with backslashes just enough to get the \u escapes expanded, in the
> > same way Java's (conceptual) preprocessor does it).
> 
> They don't have such tags... so I guess we're in trouble ;-)
> 
> I guess to make ur"" have a meaning at all, we'd need to go
> the Java preprocessor way here, i.e. scan the string *only*
> for \uXXXX sequences, decode these and convert the rest as-is
> to Unicode ordinals.
> 
> Would that be ok ?

Read Tim's code (posted about 40 messages ago in this list).

Like Java, it interprets \u.... when the number of backslashes is odd,
but not when it's even.  So \\u.... returns exactly that, while
\\\u.... returns two backslashes and a unicode character.

This is nice and can be done regardless of whether we are going to
interpret other \ escapes or not.
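
A rough sketch of that parity rule -- this is not Tim's code, and it
uses the unichr() builtin proposed as part of the Unicode support:

import re

def expand_u_escapes(text):
    def repl(m):
        backslashes, hexdigits = m.group(1), m.group(2)
        if len(backslashes) % 2:     # odd: the last backslash starts \uXXXX
            return backslashes[:-1] + unichr(int(hexdigits, 16))
        return m.group(0)            # even: leave the text untouched
    return re.sub(r'(\\+)u([0-9a-fA-F]{4})', repl, text)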

--Guido van Rossum (home page: http://www.python.org/~guido/)



From skip at mojam.com  Thu Nov 18 18:34:51 1999
From: skip at mojam.com (Skip Montanaro)
Date: Thu, 18 Nov 1999 11:34:51 -0600 (CST)
Subject: [Python-Dev] just say no...
In-Reply-To: <000401bf30d8$6cf30bc0$a42d153f@tim>
References: <383156DF.2209053F@lemburg.com>
	<000401bf30d8$6cf30bc0$a42d153f@tim>
Message-ID: <14388.14523.158050.594595@dolphin.mojam.com>

    >> FYI, the next version of the proposal ...  File objects opened in
    >> text mode will use "t#" and binary ones use "s#".

    Tim> Am I the only one who sees magical distinctions between text and
    Tim> binary mode as a Really Bad Idea? 

No.

    Tim> I wouldn't have guessed the Unix natives here would quietly
    Tim> acquiesce to importing a bit of Windows madness.

We figured you and Guido would come to our rescue... ;-)

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From mal at lemburg.com  Thu Nov 18 19:15:54 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 19:15:54 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.7
References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com>
Message-ID: <3834425A.8E9C3B7E@lemburg.com>

FYI, I've uploaded a new version of the proposal which includes
new codec APIs, a new codec search mechanism and some minor
fixes here and there.

The latest version of the proposal is available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

    ? Unicode objects support for %-formatting

    ? Design of the internal C API and the Python API for
      the Unicode character properties database

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 18 19:32:49 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 19:32:49 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf3173$f9805340$c0a0143f@tim>  
	            <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us>
Message-ID: <38344651.960878A2@lemburg.com>

Guido van Rossum wrote:
> 
> > I guess to make ur"" have a meaning at all, we'd need to go
> > the Java preprocessor way here, i.e. scan the string *only*
> > for \uXXXX sequences, decode these and convert the rest as-is
> > to Unicode ordinals.
> >
> > Would that be ok ?
> 
> Read Tim's code (posted about 40 messages ago in this list).

I did, but wasn't sure whether he was arguing for going the
Java way...
 
> Like Java, it interprets \u.... when the number of backslashes is odd,
> but not when it's even.  So \\u.... returns exactly that, while
> \\\u.... returns two backslashes and a unicode character.
> 
> This is nice and can be done regardless of whether we are going to
> interpret other \ escapes or not.

So I'll take that as: this is what we want in Python too :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mal at lemburg.com  Thu Nov 18 19:38:41 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 19:38:41 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf3173$f9805340$c0a0143f@tim>  
	            <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us>
Message-ID: <383447B1.1B7B594C@lemburg.com>

Would this definition be fine ?
"""

  u = ur''

The 'raw-unicode-escape' encoding is defined as follows:

* \uXXXX sequences represent the U+XXXX Unicode character if and
  only if the number of leading backslashes is odd

* all other characters represent themselves as Unicode ordinals
  (e.g. 'b' -> U+0062)

"""

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From guido at CNRI.Reston.VA.US  Thu Nov 18 19:46:35 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 13:46:35 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Thu, 18 Nov 1999 11:34:51 CST."
             <14388.14523.158050.594595@dolphin.mojam.com> 
References: <383156DF.2209053F@lemburg.com> <000401bf30d8$6cf30bc0$a42d153f@tim>  
            <14388.14523.158050.594595@dolphin.mojam.com> 
Message-ID: <199911181846.NAA04547@eric.cnri.reston.va.us>

>     >> FYI, the next version of the proposal ...  File objects opened in
>     >> text mode will use "t#" and binary ones use "s#".
> 
>     Tim> Am I the only one who sees magical distinctions between text and
>     Tim> binary mode as a Really Bad Idea? 
> 
> No.
> 
>     Tim> I wouldn't have guessed the Unix natives here would quietly
>     Tim> acquiesce to importing a bit of Windows madness.
> 
> We figured you and Guido would come to our rescue... ;-)

Don't count on me.  My brain is totally cross-platform these days, and
writing "rb" or "wb" for files containing binary data is second nature
for me.  I actually *like* it.

Anyway, the Unicode stuff ought to have a wrapper open(filename, mode,
encoding) where the 'b' will be added to the mode if you don't give it
and it's needed.
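
Such a wrapper could be as simple as this sketch (the name and the
details are illustrative only; lookup() stands for the codec lookup
discussed elsewhere in this thread):

def open_encoded(filename, mode='r', encoding=None, lookup=None):
    # add 'b' automatically when an encoding is given -- the codec
    # takes care of any line-ending issues itself
    if encoding is not None and 'b' not in mode:
        mode = mode + 'b'
    stream = open(filename, mode)
    if encoding is None or lookup is None:
        return stream
    encode, decode, stream_writer, stream_reader = lookup(encoding)
    if 'r' in mode:
        return stream_reader(stream)
    return stream_writer(stream)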

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Thu Nov 18 19:50:20 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 13:50:20 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: Your message of "Thu, 18 Nov 1999 19:32:49 +0100."
             <38344651.960878A2@lemburg.com> 
References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us>  
            <38344651.960878A2@lemburg.com> 
Message-ID: <199911181850.NAA04576@eric.cnri.reston.va.us>

> > Like Java, it interprets \u.... when the number of backslashes is odd,
> > but not when it's even.  So \\u.... returns exactly that, while
> > \\\u.... returns two backslashes and a unicode character.
> > 
> > This is nice and can be done regardless of whether we are going to
> > interpret other \ escapes or not.
> 
> So I'll take that as: this is what we want in Python too :-)

I'll reserve judgement until we've got some experience with it in the
field, but it seems the best compromise.  It also gives a clear
explanation about why we have \uXXXX when we already have \xXXXX.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Thu Nov 18 19:57:36 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 13:57:36 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: Your message of "Thu, 18 Nov 1999 19:38:41 +0100."
             <383447B1.1B7B594C@lemburg.com> 
References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us>  
            <383447B1.1B7B594C@lemburg.com> 
Message-ID: <199911181857.NAA04617@eric.cnri.reston.va.us>

> Would this definition be fine ?
> """
> 
>   u = ur''
> 
> The 'raw-unicode-escape' encoding is defined as follows:
> 
> * \uXXXX sequences represent the U+XXXX Unicode character if and
>   only if the number of leading backslashes is odd
> 
> * all other characters represent themselves as Unicode ordinals
>   (e.g. 'b' -> U+0062)
> 
> """

Yes.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From skip at mojam.com  Thu Nov 18 20:09:46 1999
From: skip at mojam.com (Skip Montanaro)
Date: Thu, 18 Nov 1999 13:09:46 -0600 (CST)
Subject: [Python-Dev] Unicode Proposal: Version 0.7
In-Reply-To: <3834425A.8E9C3B7E@lemburg.com>
References: <382C0A54.E6E8328D@lemburg.com>
	<382D625B.DC14DBDE@lemburg.com>
	<38316685.7977448D@lemburg.com>
	<3834425A.8E9C3B7E@lemburg.com>
Message-ID: <14388.20218.294814.234327@dolphin.mojam.com>

I haven't been following this discussion closely at all, and have no
previous experience with Unicode, so please pardon a couple stupid questions
from the peanut gallery:

    1. What does U+0061 mean (other than 'a')?  That is, what is U?

    2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter
       description.  Given a Unicode object with encoding e1, how do I write
       it to a file that is to be encoded with encoding e2?  Seems like I
       would do something like

           u1 = unicode(s, encoding=e1)
           f = open("somefile", "wb")
           u2 = unicode(u1, encoding=e2)
           f.write(u2)

       Is that how it would be done?  Does this question even make sense?

    3. What will the impact be on programmers such as myself currently
       living with blinders on (that is, writing in plain old 7-bit ASCII)?

Thx,

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From jim at interet.com  Thu Nov 18 20:23:53 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Thu, 18 Nov 1999 14:23:53 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us>
Message-ID: <38345249.4AFD91DA@interet.com>

Guido van Rossum wrote:
>
> Let's first complete the requirements gathering.

Yes.

> Are these
> requirements reasonable?  Will they make an implementation too
> complex?

I think you can get 90% of where you want to be with something
much simpler.  And the simpler implementation will be useful in
the 100% solution, so it is not wasted time.

How about if we just design a Python archive file format; provide
code in the core (in Python or C) to import from it; provide a
Python program to create archive files; and provide a Standard
Directory to put archives in so they can be found quickly.  For
extensibility and control, we add functions to the imp module.
Detailed comments follow:


> Compatibility issues:
> ---------------------
> [list of current features...]

Easily met by keeping the current C code.

> 
> New features:
> -------------
> 
> - Integrated support for Greg Ward's distribution utilities (i.e. a
>   module prepared by the distutil tools should install painlessly)
> 
> - Good support for prospective authors of "all-in-one" packaging tool
>   authors like Gordon McMillan's win32 installer or /F's squish.  (But
>   I *don't* require backwards compatibility for existing tools.)

These tools go well beyond just an archive file format, but hopefully
a file format will help.  Greg and Gordon should be able to control the
format so it meets their needs.  We need a standard format.
 
> - Standard import from zip or jar files, in two ways:
> 
>   (1) an entry on sys.path can be a zip/jar file instead of a directory;
>       its contents will be searched for modules or packages
> 
>   (2) a file in a directory that's on sys.path can be a zip/jar file;
>       its contents will be considered as a package (note that this is
>       different from (1)!)

I don't like sys.path at all.  It is currently part of the problem.
I suggest that archive files MUST be put into a known directory.
On Windows this is the directory of the executable, sys.executable.
On Unix this is $PREFIX plus version, namely
  "%s/lib/python%s/" % (sys.prefix, sys.version[0:3]).
Other platforms can have different rules.

We should also have the ability to append archive files to the
executable or a shared library assuming the OS allows this
(Windows and Linux do allow it).  This is the first location
searched, nails the archive to the interpreter, insulates us
from an erroneous sys.path, and enables single-file Python programs.
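
The "known directory" rule above could be computed along these lines
(sketch only):

import os, sys

def standard_archive_dir():
    if sys.platform == 'win32':
        # Windows: next to the interpreter executable
        return os.path.dirname(sys.executable)
    # Unix: $PREFIX plus version
    return "%s/lib/python%s/" % (sys.prefix, sys.version[:3])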

>   I don't particularly care about supporting all zip compression
>   schemes; if Java gets away with only supporting gzip compression
>   in jar files, so can we.

We don't need compression.  The whole ./Lib is 1.2 Meg, and if we
compress it to zero we save a Meg.  Irrelevant.  Installers provide
compression anyway, so when Python programs are shipped, they will be
compressed then.

Problems are that Python does not ship with compression, we will
have to add it, we will have to support it and its current method
of compression forever, and it adds complexity.
 
> - Easy ways to subclass or augment the import mechanism along
>   different dimensions.  For example, while none of the following
>   features should be part of the core implementation, it should be
>   easy to add any or all:
>
>  [ List of new features including hooks...]

Sigh, this proposal does not provide for this.  It seems
like a job for imputil.  But if the file format and import code
is available from the imp module, it can be used as part of the
solution.

>   - support for a new compression scheme to the zip importer

I guess compression should be easy to add if Python ships with
a compression module.
 
>   - a cache for file locations in directories/archives, to improve
>     startup time

If the Python library is available as an archive, I think
startup will be greatly improved anyway.
 
> Implementation:
> ---------------
> 
> - There must clearly be some code in C that can import certain
>   essential modules (to solve the chicken-or-egg problem), but I don't
>   mind if the majority of the implementation is written in Python.
>   Using Python makes it easy to subclass.

Yes.
 
> - In order to support importing from zip/jar files using compression,
>   we'd at least need the zlib extension module and hence libz itself,
>   which may not be available everywhere.

That's a good reason to omit compression.  At least for now.
 
> - I suppose that the bootstrap is solved using a mechanism very
>   similar to what freeze currently used (other solutions seem to be
>   platform dependent).

Yes, except that we need to be careful to preserve the freeze feature
for users.  We don't want to take it over.
 
> - I also want to still support importing *everything* from the
>   filesystem, if only for development.  (It's hard enough to deal with
>   the fact that exceptions.py is needed during Py_Initialize();
>   I want to be able to hack on the import code written in Python
>   without having to rebuild the executable all the time.

Yes, we need a function in imp to turn archives off:
  import imp
  imp.archiveEnable(0)
 
> Finally, to what extent does this impact the desire for dealing
> differently with the Python bytecode compiler (e.g. supporting
> optimizers written in Python)?  And does it affect the desire to
> implement the read-eval-print loop (the >>> prompt) in Python?

I don't think it impacts these at all.

Jim Ahlstrom



From guido at CNRI.Reston.VA.US  Thu Nov 18 20:55:02 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 14:55:02 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
In-Reply-To: Your message of "Thu, 18 Nov 1999 14:23:53 EST."
             <38345249.4AFD91DA@interet.com> 
References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us>  
            <38345249.4AFD91DA@interet.com> 
Message-ID: <199911181955.OAA04830@eric.cnri.reston.va.us>

> I think you can get 90% of where you want to be with something
> much simpler.  And the simpler implementation will be useful in
> the 100% solution, so it is not wasted time.

Agreed, but I'm not sure that it addresses the problems that started
this thread.  I can't really tell, since the message starting the
thread just requested imputil, without saying which parts of it were
needed.  A followup claimed that imputil was a fine prototype but too
slow for real work.

I inferred that flexibility was requested.  But maybe that was
projection since that was on my own list.  (I'm happy with the
performance and find manipulating zip or jar files clumsy, so I'm not
too concerned about all the nice things you can *do* with that
flexibility. :-)

> How about if we just design a Python archive file format; provide
> code in the core (in Python or C) to import from it; provide a
> Python program to create archive files; and provide a Standard
> Directory to put archives in so they can be found quickly.  For
> extensibility and control, we add functions to the imp module.
> Detailed comments follow:

> These tools go well beyond just an archive file format, but hopefully
> a file format will help.  Greg and Gordon should be able to control the
> format so it meets their needs.  We need a standard format.

I think the standard format should be a subclass of zip or jar (which
is itself a subclass of zip).  We have already written (at CNRI, as
yet unreleased) the necessary Python tools to manipulate zip archives;
moreover 3rd party tools are abundantly available, both on Unix and on
Windows (as well as in Java).  Zip files also lend themselves to
self-extracting archives and similar things, because the file index is
at the end, so I think that Greg & Gordon should be happy.

> I don't like sys.path at all.  It is currently part of the problem.

Eh?  That's the first thing I hear something bad about it.  Maybe
that's because you live on Windows -- on Unix, search paths are
ubiquitous.

> I suggest that archive files MUST be put into a known directory.

Why?  Maybe this works on Windows; on Unix this is asking for trouble
because it prevents users from augmenting the installation provided by
the sysadmin.  Even on newer Windows versions, users without admin
perms may not be allowed to add files to that privileged directory.

> On Windows this is the directory of the executable, sys.executable.
> On Unix this $PREFIX plus version, namely
>   "%s/lib/python%s/" % (sys.prefix, sys.version[0:3]).
> Other platforms can have different rules.
> 
> We should also have the ability to append archive files to the
> executable or a shared library assuming the OS allows this
> (Windows and Linux do allow it).  This is the first location
> searched, nails the archive to the interpreter, insulates us
> from an erroneous sys.path, and enables single-file Python programs.

OK for the executable.  I'm not sure what the point is of appending an
archive to the shared library?  Anyway, does it matter (on Windows) if
you add it to python16.dll or to python.exe?

> We don't need compression.  The whole ./Lib is 1.2 Meg, and if we
> compress
> it to zero we save a Meg.  Irrelevant.  Installers provide compression
> anyway so when Python programs are shipped, they will be compressed
> then.
> 
> Problems are that Python does not ship with compression, we will
> have to add it, we will have to support it and its current method
> of compression forever, and it adds complexity.

OK, OK.  I think most zip tools have a way to turn off the
compression.  (Anyway, it's a matter of more I/O time vs. more CPU
time; hardware for both is getting better faster than we can tweak the
code :-)

> Sigh, this proposal does not provide for this.  It seems
> like a job for imputil.  But if the file format and import code
> is available from the imp module, it can be used as part of the
> solution.

Well, the question is really if we want flexibility or archive files.
I care more about the flexibility.  If we get a clear vote for archive
files, I see no problem with implementing that first.

> If the Python library is available as an archive, I think
> startup will be greatly improved anyway.

Really?  I know about all the system calls it makes, but I don't
really see much of a delay -- I have a prompt in well under 0.1
second.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gstein at lyra.org  Thu Nov 18 23:03:55 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 18 Nov 1999 14:03:55 -0800 (PST)
Subject: [Python-Dev] file modes (was: just say no...)
In-Reply-To: <3833B588.1E31F01B@lemburg.com>
Message-ID: 

On Thu, 18 Nov 1999, M.-A. Lemburg wrote:
> Tim Peters wrote:
> > [MAL]
> > > File objects opened in text mode will use "t#" and binary
> > > ones use "s#".
> > 
> > [Greg Stein]
> > > ...
> > > The real annoying thing would be to assume that opening a file as 'r'
> > > means that I *meant* text mode and to start using "t#".
> > 
> > Isn't that exactly what MAL said would happen?  Note that a "t" flag for
> > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't
> > either; a lone "r" has always meant text mode.
> 
> Em, I think you've got something wrong here: "t#" refers to the
> parsing marker used for writing data to files opened in text mode.

Nope. We've got it right :-)

Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to
refer to the parse marker.

>...
> I guess you won't notice any difference: strings define both
> interfaces ("s#" and "t#") to mean the same thing. Only other
> buffer compatible types may now fail to write to text files
> -- which is not so bad, because it forces the programmer to
> rethink what he really intended when opening the file in text
> mode.

It *is* bad if it breaks my existing programs in subtle ways that are a
bitch to track down.

> Besides, if you are writing portable scripts you should pay
> close attention to "r" vs. "rb" anyway.

I'm not writing portable scripts. I mentioned that once before. I don't
want a difference between 'r' and 'rb' on my Linux box. It was never there
before, I'm lazy, and I don't want to see it added :-).

Honestly, I don't know offhand of any Python types that respond to "s#"
and "t#" in different ways, such that changing file.write would end up
writing something different (and thereby breaking existing code).

I just don't like introducing text/binary to *nix platforms where it
didn't exist before.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From skip at mojam.com  Thu Nov 18 23:15:43 1999
From: skip at mojam.com (Skip Montanaro)
Date: Thu, 18 Nov 1999 16:15:43 -0600 (CST)
Subject: [Python-Dev] file modes (was: just say no...)
In-Reply-To: 
References: <3833B588.1E31F01B@lemburg.com>
	
Message-ID: <14388.31375.296388.973848@dolphin.mojam.com>

    Greg> I'm not writing portable scripts. I mentioned that once before. I
    Greg> don't want a difference between 'r' and 'rb' on my Linux box. It
    Greg> was never there before, I'm lazy, and I don't want to see it added
    Greg> :-).

    ...

    Greg> I just don't like introducing text/binary to *nix platforms where it
    Greg> didn't exist before.

I'll vote with Greg, Guido's cross-platform conversion notwithstanding.  If
I haven't been writing portable scripts up to this point because I only care
about a single target platform, why break my scripts for me?  Forcing me to
use "rb" or "wb" on my open calls isn't going to make them portable anyway.

There are probably many other harder to identify and correct portability
issues than binary file access anyway.  Seems like requiring "b" is just
going to cause gratuitous breakage with no obvious increase in portability.

porta-nanny.py-anyone?-ly y'rs,

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From jim at interet.com  Thu Nov 18 23:40:05 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Thu, 18 Nov 1999 17:40:05 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us>  
	            <38345249.4AFD91DA@interet.com> <199911181955.OAA04830@eric.cnri.reston.va.us>
Message-ID: <38348045.BB95F783@interet.com>

Guido van Rossum wrote:

> I think the standard format should be a subclass of zip or jar (which
> is itself a subclass of zip).  We have already written (at CNRI, as
> yet unreleased) the necessary Python tools to manipulate zip archives;
> moreover 3rd party tools are abundantly available, both on Unix and on
> Windows (as well as in Java).  Zip files also lend themselves to
> self-extracting archives and similar things, because the file index is
> at the end, so I think that Greg & Gordon should be happy.

Think about multiple packages in multiple zip files.  The zip files
store file directories.  That means we would need a sys.zippath to
search the zip files.  I don't want another PYTHONPATH phenomenon.

Greg Stein and I once discussed this (and Gordon I think).  They
argued that the directories should be flattened.  That is, think of
all directories which can be reached on PYTHONPATH.  Throw
away all initial paths.  The resultant archive has *.pyc at the top
level,
as well as package directories only.  The search path is "." in every
archive file.  No directory information is stored, only module names,
some with dots.
 
> > I don't like sys.path at all.  It is currently part of the problem.
> 
> Eh?  That's the first thing I hear something bad about it.  Maybe
> that's because you live on Windows -- on Unix, search paths are
> ubiquitous.

On Windows, just print sys.path.  It is junk.  A commercial
distribution has to "just work", and it fails if a second installation
(by someone else) changes PYTHONPATH to suit their app.  I am trying
to get to "just works", no excuses, no complications.
 
> > I suggest that archive files MUST be put into a known directory.
> 
> Why?  Maybe this works on Windows; on Unix this is asking for trouble
> because it prevents users from augmenting the installation provided by
> the sysadmin.  Even on newer Windows versions, users without admin
> perms may not be allowed to add files to that privileged directory.

It works on Windows because programs install themselves in their own
subdirectories, and can put files there instead of /windows/system32.
This holds true for Windows 2000 also.  A Unix-style installation
to /windows/system32 would (may?) require "administrator" privilege.

On Unix you are right.  I didn't think of that because I am the Unix
sysadmin here, so I can put things where I want.  The Windows
solution doesn't fit with Unix, because executables go in a ./bin
directory and putting library files there is a no-no.  Hmmmm...
This needs more thought.  Anyone else have ideas??

> > We should also have the ability to append archive files to the
> > executable or a shared library assuming the OS allows this
>
> OK for the executable.  I'm not sure what the point is of appending an
> archive to the shared library?  Anyway, does it matter (on Windows) if
> you add it to python16.dll or to python.exe?

The point of using python16.dll is to append the Python library to
it, and append to python.exe (or use files) for everything else.
That way, the 1.6 interpreter is linked to the 1.6 Lib, upgrading
to 1.7 means replacing only one file, and there is no wasted storage
in multiple Lib's.  I am thinking of multiple Python programs in
different directories.

But maybe you are right.  On Windows, if python.exe can be put in
/windows/system32 then it really doesn't matter.

> OK, OK.  I think most zip tools have a way to turn off the
> compression.  (Anyway, it's a matter of more I/O time vs. more CPU
> time; hardare for both is getting better faster than we can tweak the
> code :-)

Well, if Python now has its own compression that is built
in and comes with it, then that is different.  Maybe compression
is OK.
 
> Well, the question is really if we want flexibility or archive files.
> I care more about the flexibility.  If we get a clear vote for archive
> files, I see no problem with implementing that first.

I don't like flexibility, I like standardization and simplicity.
Flexibility just encourages users to do the wrong thing.

Everyone vote please.  I don't have a solid feeling about
what people want, only what they don't like.
 
> > If the Python library is available as an archive, I think
> > startup will be greatly improved anyway.
> 
> Really?  I know about all the system calls it makes, but I don't
> really see much of a delay -- I have a prompt in well under 0.1
> second.

So do I.  I guess I was just echoing someone else's complaint.

JimA



From mal at lemburg.com  Fri Nov 19 00:28:31 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 00:28:31 +0100
Subject: [Python-Dev] file modes (was: just say no...)
References: 
Message-ID: <38348B9F.A31B09C4@lemburg.com>

Greg Stein wrote:
> 
> On Thu, 18 Nov 1999, M.-A. Lemburg wrote:
> > Tim Peters wrote:
> > > [MAL]
> > > > File objects opened in text mode will use "t#" and binary
> > > > ones use "s#".
> > >
> > > [Greg Stein]
> > > > ...
> > > > The real annoying thing would be to assume that opening a file as 'r'
> > > > means that I *meant* text mode and to start using "t#".
> > >
> > > Isn't that exactly what MAL said would happen?  Note that a "t" flag for
> > > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't
> > > either; a lone "r" has always meant text mode.
> >
> > Em, I think you've got something wrong here: "t#" refers to the
> > parsing marker used for writing data to files opened in text mode.
> 
> Nope. We've got it right :-)
> 
> Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to
> refer to the parse marker.

Ah, ok. But "t" as a file-open mode is non-portable anyway, so I'll
skip it here :-)
 
> >...
> > I guess you won't notice any difference: strings define both
> > interfaces ("s#" and "t#") to mean the same thing. Only other
> > buffer compatible types may now fail to write to text files
> > -- which is not so bad, because it forces the programmer to
> > rethink what he really intended when opening the file in text
> > mode.
> 
> It *is* bad if it breaks my existing programs in subtle ways that are a
> bitch to track down.
> 
> > Besides, if you are writing portable scripts you should pay
> > close attention to "r" vs. "rb" anyway.
> 
> I'm not writing portable scripts. I mentioned that once before. I don't
> want a difference between 'r' and 'rb' on my Linux box. It was never there
> before, I'm lazy, and I don't want to see it added :-).
> 
> Honestly, I don't know offhand of any Python types that repond to "s#" and
> "t#" in different ways, such that changing file.write would end up writing
> something different (and thereby breaking existing code).
> 
> I just don't like introduce text/binary to *nix platforms where it didn't
> exist before.

Please remember that up until now you were probably only using
strings to write to files. Python strings don't differentiate
between "t#" and "s#" so you wont see any change in function
or find subtle errors being introduced.

If you are already using the buffer feature for e.g. array which 
also implement "s#" but don't support "t#" for obvious reasons
you'll run into trouble, but then: arrays are binary data,
so changing from text mode to binary mode is well worth the
effort even if you just consider it a nuisance.

Since the buffer interface and its consequences haven't been published
yet, there are probably very few users out there who would
actually run into any problems. And even if they do, it's a
good chance to catch subtle bugs which would only have shown
up when trying to port to another platform.

I'll leave the rest for Guido to answer, since it was his idea ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 19 00:41:32 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 00:41:32 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.7
References: <382C0A54.E6E8328D@lemburg.com>
		<382D625B.DC14DBDE@lemburg.com>
		<38316685.7977448D@lemburg.com>
		<3834425A.8E9C3B7E@lemburg.com> <14388.20218.294814.234327@dolphin.mojam.com>
Message-ID: <38348EAC.82B41A4D@lemburg.com>

Skip Montanaro wrote:
> 
> I haven't been following this discussion closely at all, and have no
> previous experience with Unicode, so please pardon a couple stupid questions
> from the peanut gallery:
> 
>     1. What does U+0061 mean (other than 'a')?  That is, what is U?

U+XXXX means Unicode character with ordinal hex number XXXX. It is
basically just another way to say, hey I want the Unicode character
at position 0xXXXX in the Unicode spec.
 
>     2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter
>        description.  Given a Unicode object with encoding e1, how do I write
>        it to a file that is to be encoded with encoding e2?  Seems like I
>        would do something like
> 
>            u1 = unicode(s, encoding=e1)
>            f = open("somefile", "wb")
>            u2 = unicode(u1, encoding=e2)
>            f.write(u2)
> 
>        Is that how it would be done?  Does this question even make sense?

The unicode() constructor converts all input to Unicode as
basis for other conversions. In the above example, s would be
converted to Unicode using the assumption that the bytes in
s represent characters encoded using the encoding given in e1.
The line with u2 would raise a TypeError, because u1 is not
a string. To convert a Unicode object u1 to another encoding,
you would have to call the .encode() method with the intended
new encoding. The Unicode object will then take care of the
conversion of its internal Unicode data into a string using
the given encoding, e.g. you'd write:

f.write(u1.encode(e2))
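
Spelled out (and hedged: this assumes the unicode() constructor and the
.encode() method from the proposal; the encodings and file name are just
placeholders), the whole round trip would look something like:

    e1, e2 = "latin-1", "utf-8"
    s = "abc"                     # byte string assumed to be encoded in e1
    u1 = unicode(s, e1)           # decode the bytes into a Unicode object
    f = open("somefile", "wb")    # binary mode, since we write encoded bytes
    f.write(u1.encode(e2))        # encode into e2 on the way out
    f.close()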
 
>     3. What will the impact be on programmers such as myself currently
>        living with blinders on (that is, writing in plain old 7-bit ASCII)?

If you don't want your scripts to know about Unicode, nothing
will really change. In case you do use e.g. Latin-1 characters
in your scripts for strings, you are asked to include a pragma
in the comment lines at the beginning of the script (so that
programmers viewing your code using another encoding have a chance
to figure out what you've written).

Here's the text from the proposal:
"""
Note that you should provide some hint to the encoding you used to
write your programs as a pragma line in one of the first few comment lines
of the source file (e.g. '# source file encoding: latin-1'). If you
only use 7-bit ASCII then everything is fine and no such notice is
needed, but if you include Latin-1 characters not defined in ASCII, it
may well be worthwhile including a hint since people in other
countries will want to be able to read your source strings too.
"""

Other than that you can continue to use normal strings like
you always have.

Hope that clarifies things at least a bit,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mhammond at skippinet.com.au  Fri Nov 19 01:27:09 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 19 Nov 1999 11:27:09 +1100
Subject: [Python-Dev] file modes (was: just say no...)
In-Reply-To: <38348B9F.A31B09C4@lemburg.com>
Message-ID: <003401bf3224$d231be30$0501a8c0@bobcat>

[MAL]

> If you are already using the buffer feature for e.g. array which
> also implement "s#" but don't support "t#" for obvious reasons
> you'll run into trouble, but then: arrays are binary data,
> so changing from text mode to binary mode is well worth the
> effort even if you just consider it a nuisance.

Breaking existing code that works should be considered more than a
nuisance.

However, one answer would be to have "t#" _prefer_ to use the text
buffer, but not insist on it.  eg, the logic for processing "t#" could
check if the text buffer is supported, and if not move back to the
blob buffer.

This should mean that all existing code still works, except for
objects that support both buffers to mean different things.  AFAIK
there are no objects that qualify today, so it should work fine.
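
To make the intent concrete, here is a rough Python model of that fallback
(the real logic would live in the C argument-parsing code; the
getcharbuffer/getreadbuffer names below merely stand in for the C-level
buffer slots and are not an actual Python API):

    def write_via_t(obj, stream):
        # prefer the text buffer, fall back to the plain "blob" buffer
        getcharbuffer = getattr(obj, "getcharbuffer", None)
        if getcharbuffer is not None:
            data = getcharbuffer()        # object says "I am text"
        else:
            getreadbuffer = getattr(obj, "getreadbuffer", None)
            if getreadbuffer is None:
                raise TypeError("object does not support the buffer interface")
            data = getreadbuffer()        # binary fallback
        stream.write(data)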

Unix users _will_ need to revisit their thinking about "text mode" vs
"binary mode" when writing these new objects (such as Unicode), but
IMO that is more than reasonable - Unix users don't bother qualifying
the open mode of their files, simply because it has no effect on their
files.  If for certain objects or requirements there _is_ a
distinction, then new code can start to think these issues through.
"Portable File IO" will simply be extended from simply "portable among
all platforms" to "portable among all platforms and objects".

Mark.




From gmcm at hypernet.com  Fri Nov 19 03:23:44 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Thu, 18 Nov 1999 21:23:44 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
In-Reply-To: <38348045.BB95F783@interet.com>
Message-ID: <1269144272-21594530@hypernet.com>

[Guido]
> > I think the standard format should be a subclass of zip or jar
> > (which is itself a subclass of zip).  We have already written
> > (at CNRI, as yet unreleased) the necessary Python tools to
> > manipulate zip archives; moreover 3rd party tools are
> > abundantly available, both on Unix and on Windows (as well as
> > in Java).  Zip files also lend themselves to self-extracting
> > archives and similar things, because the file index is at the
> > end, so I think that Greg & Gordon should be happy.

No problem (I created my own formats for relatively minor 
reasons).
 
[JimA]
> Think about multiple packages in multiple zip files.  The zip
> files store file directories.  That means we would need a
> sys.zippath to search the zip files.  I don't want another
> PYTHONPATH phenomenon.

What if sys.path looked like:
 [DirImporter('.'), ZlibImporter('c:/python/stdlib.pyz'), ...]
 
> Greg Stein and I once discussed this (and Gordon I think).  They
> argued that the directories should be flattened.  That is, think
> of all directories which can be reached on PYTHONPATH.  Throw
> away all initial paths.  The resultant archive has *.pyc at the
> top level, as well as package directories only.  The search path
> is "." in every archive file.  No directory information is
> stored, only module names, some with dots.

While I do flat archives (no dots, but that's a different story), 
there's no reason the archive couldn't be structured. Flat 
archives are definitely simpler.
 
[JimA]
> > > I don't like sys.path at all.  It is currently part of the
> > > problem.
[Guido] 
> > Eh?  That's the first thing I hear something bad about it. 
> > Maybe that's because you live on Windows -- on Unix, search
> > paths are ubiquitous.
> 
> On windows, just print sys.path.  It is junk.  A commercial
> distribution has to "just work", and it fails if a second
> installation (by someone else) changes PYTHONPATH to suit their
> app.  I am trying to get to "just works", no excuses, no
> complications.

		Py_Initialize ();
		PyRun_SimpleString ("import sys; del sys.path[1:]");

Yeah, there's a hole there. Fixable if you could do a little
pre-Py_Initialize twiddling.
 
> > > I suggest that archive files MUST be put into a known
> > > directory.

No way. Hard code a directory? Overwrite someone else's 
Python "standalone"? Write to a C: partition that is 
deliberately sized to hold nothing but Windows? Make 
network installations impossible?
 
> > Why?  Maybe this works on Windows; on Unix this is asking for
> > trouble because it prevents users from augmenting the
> > installation provided by the sysadmin.  Even on newer Windows
> > versions, users without admin perms may not be allowed to add
> > files to that privileged directory.
> 
> It works on Windows because programs install themselves in their
> own subdirectories, and can put files there instead of
> /windows/system32. This holds true for Windows 2000 also.  A
> Unix-style installation to /windows/system32 would (may?) require
> "administrator" privilege.

There's nothing Unix-style about installing to 
/Windows/system32. 'Course *they* have symbolic links that 
actually work...
 
> On Unix you are right.  I didn't think of that because I am the
> Unix sysadmin here, so I can put things where I want.  The
> Windows solution doesn't fit with Unix, because executables go in
> a ./bin directory and putting library files there is a no-no. 
> Hmmmm... This needs more thought.  Anyone else have ideas??

The official Windows solution is stuff in registry about app 
paths and such. Putting the dlls in the exe's directory is a 
workaround which works and is more manageable than the 
official solution.
 
> > > We should also have the ability to append archive files to
> > > the executable or a shared library assuming the OS allows
> > > this

That's a handy trick on Windows, but it's got nothing to do 
with Python.

> > Well, the question is really if we want flexibility or archive
> > files. I care more about the flexibility.  If we get a clear
> > vote for archive files, I see no problem with implementing that
> > first.
> 
> I don't like flexibility, I like standardization and simplicity.
> Flexibility just encourages users to do the wrong thing.

I've noticed that the people who think there should only be one 
way to do things never agree on what it is.
 
> Everyone vote please.  I don't have a solid feeling about
> what people want, only what they don't like.

Flexibility. You can put Christian's favorite Einstein quote here 
too.
 
> > > If the Python library is available as an archive, I think
> > > startup will be greatly improved anyway.
> > 
> > Really?  I know about all the system calls it makes, but I
> > don't really see much of a delay -- I have a prompt in well
> > under 0.1 second.
> 
> So do I.  I guess I was just echoing someone else's complaint.

Install some stuff. Deinstall some of it. Repeat (mixing up the 
order) until your registry and hard drive are shattered into tiny 
little fragments. It doesn't take long (there's lots of stuff a 
defragmenter can't touch once it's there).


- Gordon



From mal at lemburg.com  Fri Nov 19 10:08:44 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 10:08:44 +0100
Subject: [Python-Dev] file modes (was: just say no...)
References: <003401bf3224$d231be30$0501a8c0@bobcat>
Message-ID: <3835139C.344F3EEE@lemburg.com>

Mark Hammond wrote:
> 
> [MAL]
> 
> > If you are already using the buffer feature for e.g. array which
> > also implement "s#" but don't support "t#" for obvious reasons
> > you'll run into trouble, but then: arrays are binary data,
> > so changing from text mode to binary mode is well worth the
> > effort even if you just consider it a nuisance.
> 
> Breaking existing code that works should be considered more than a
> nuisance.

It's an error that's pretty easy to fix... that's what I was
referring to with "nuisance". All you have to do is open
the file in binary mode and you're done.

BTW, the change will only affect platforms that don't differentiate
between text and binary mode, e.g. Unix ones.
 
> However, one answer would be to have "t#" _prefer_ to use the text
> buffer, but not insist on it.  eg, the logic for processing "t#" could
> check if the text buffer is supported, and if not move back to the
> blob buffer.

I doubt that this conforms to what the buffer interface wants
to reflect: if the getcharbuf slot is not implemented, this means
"I am not text". If you write non-text to a text file,
line breaks may be translated in ways that are
incompatible with the binary data, i.e. when you read the data
back in, it may fail to load because e.g. '\n' was converted to
'\r\n'.
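
A tiny demonstration of the kind of corruption this means, on a platform
that translates newlines in text mode (the file name is arbitrary; on Unix
nothing changes):

    data = "\x00\x01\n\x02"                   # binary data containing '\n'
    f = open("blob.dat", "w")                 # text mode by mistake
    f.write(data)
    f.close()
    print len(open("blob.dat", "rb").read())  # 5 on Windows: '\n' became '\r\n'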
 
> This should mean that all existing code still works, except for
> objects that support both buffers to mean different things.  AFAIK
> there are no objects that qualify today, so it should work fine.

Well, even though the code would work, it might break badly
someday for the above reasons. Better to fix it now, while there
aren't too many affected cases around, than at some later
point where the user has to figure out the problem for himself
because the system never warned him about it.
 
> Unix users _will_ need to revisit their thinking about "text mode" vs
> "binary mode" when writing these new objects (such as Unicode), but
> IMO that is more than reasonable - Unix users don't bother qualifying
> the open mode of their files, simply because it has no effect on their
> files.  If for certain objects or requirements there _is_ a
> distinction, then new code can start to think these issues through.
> "Portable File IO" will simply be extended from simply "portable among
> all platforms" to "portable among all platforms and objects".

Right.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    42 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 19 10:56:03 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 10:56:03 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us>  
	            <383427ED.45A01BBB@lemburg.com> <199911181637.LAA04260@eric.cnri.reston.va.us>
Message-ID: <38351EB3.153FCDFC@lemburg.com>

Guido van Rossum wrote:
> 
> > Like a path of search functions ? Not a bad idea... I will still
> > want the internal dict for caching purposes though. I'm not sure
> > how often these encodings will be, but even a few hundred function
> > call will slow down the Unicode implementation quite a bit.
> 
> Of course.  (It's like sys.modules caching the results of an import).

I've fixed the "path of search functions" approach in the latest
version of the spec.
 
> [...]
> >     def flush(self):
> >
> >       """ Flushed the codec buffers used for keeping state.
> >
> >           Returns values are not defined. Implementations are free to
> >           return None, raise an exception (in case there is pending
> >           data in the buffers which could not be decoded) or
> >           return any remaining data from the state buffers used.
> >
> >       """
> 
> I don't know where this came from, but a flush() should work like
> flush() on a file. 

It came from Fredrik's proposal.

> It doesn't return a value, it just sends any
> remaining data to the underlying stream (for output).  For input it
> shouldn't be supported at all.
> 
> The idea is that flush() should do the same to the encoder state that
> close() followed by a reopen() would do.  Well, more or less.  But if
> the process were to be killed right after a flush(), the data written
> to disk should be a complete encoding, and not have a lingering shift
> state.

Ok. I've modified the API as follows:

StreamWriter:
    def flush(self):

	""" Flushes and resets the codec buffers used for keeping state.

	    Calling this method should ensure that the data on the
	    output is put into a clean state, that allows appending
	    of new fresh data without having to rescan the whole
	    stream to recover state.

	"""
	pass

StreamReader:
    def read(self,chunksize=0):

	""" Decodes data from the stream self.stream and returns a tuple 
	    (Unicode object, bytes consumed).

	    chunksize indicates the approximate maximum number of
	    bytes to read from the stream for decoding purposes. The
	    decoder can modify this setting as appropriate. The default
	    value 0 indicates to read and decode as much as possible.
	    The chunksize is intended to prevent having to decode huge
	    files in one step.

	    The method should use a greedy read strategy meaning that
	    it should read as much data as is allowed within the
	    definition of the encoding and the given chunksize, e.g.
            if optional encoding endings or state markers are
	    available on the stream, these should be read too.

        """
	... the base class should provide a default implementation
	    of this method using self.decode ...

    def reset(self):

	""" Resets the codec buffers used for keeping state.

	    Note that no stream repositioning should take place.
	    This method is primarily intended to recover from
	    decoding errors.

	"""
	pass

The .reset() method replaces the .flush() method on StreamReaders.
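
For completeness, a rough usage sketch of how an application might drive
these objects; the reader_factory/writer_factory callables and the
StreamWriter.write() method are assumptions on top of the spec text above,
not part of it:

    def recode(infile, outfile, reader_factory, writer_factory):
        reader = reader_factory(infile)       # some codec's StreamReader
        writer = writer_factory(outfile)      # another codec's StreamWriter
        while 1:
            data, consumed = reader.read(4096)    # decode in ~4K chunks
            if not data:
                break
            writer.write(data)                # assumes StreamWriter.write(u)
        writer.flush()                        # leave no pending shift state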

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    42 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Fri Nov 19 10:22:48 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 10:22:48 +0100
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us>
Message-ID: <383516E8.EE66B527@lemburg.com>

Guido van Rossum wrote:
>
> Let's first complete the requirements gathering.  Are these
> requirements reasonable?  Will they make an implementation too
> complex?  Am I missing anything?

Since you were asking: I would like functionality equivalent to my
latest import patch -- a slightly different lookup scheme for module
imports inside packages -- to become a core feature.

If it becomes a core feature I promise to never again start
threads about relative imports :-)

Here's the summary again:
"""
[The patch] changes the default import mechanism to work like this:

>>> import d # from directory a/b/c/
try a.b.c.d
try a.b.d
try a.d
try d
fail

instead of just doing the current two-level lookup:

>>> import d # from directory a/b/c/
try a.b.c.d
try d
fail

As a result, relative imports referring to higher level packages
work out of the box without any ugly underscores in the import name.
Plus the whole scheme is pretty simple to explain and straightforward.
"""

You can find the patch attached to the message "Walking up the package
hierarchy" in the python-dev mailing list archive.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    42 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From captainrobbo at yahoo.com  Fri Nov 19 14:01:04 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Fri, 19 Nov 1999 05:01:04 -0800 (PST)
Subject: [Python-Dev] Codecs and StreamCodecs
Message-ID: <19991119130104.21726.rocketmail@ web605.yahoomail.com>


--- "M.-A. Lemburg"  wrote:
> Guido van Rossum wrote:
> > I don't know where this came from, but a flush()
> should work like
> > flush() on a file. 
> 
> It came from Fredrik's proposal.
> 
> > It doesn't return a value, it just sends any
> > remaining data to the underlying stream (for
> output).  For input it
> > shouldn't be supported at all.
> > 
> > The idea is that flush() should do the same to the
> encoder state that
> > close() followed by a reopen() would do.  Well,
> more or less.  But if
> > the process were to be killed right after a
> flush(), the data written
> > to disk should be a complete encoding, and not
> have a lingering shift
> > state.
> 
This could be useful in real life.  
For example, iso-2022-jp has a 'single-byte-mode'
and a 'double-byte-mode' with shift-sequences to
separate them.  The rule is that each line in the 
text file or email message or whatever must begin
and end in single-byte mode.  So I would take flush()
to mean 'shift back to ASCII now'.
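
Purely as an illustration (the class and its names are invented, not part
of any proposal): a stateful encoder's flush() could simply emit the
ISO-2022-JP escape back to ASCII, ESC ( B, whenever the stream was left in
double-byte mode:

    class Iso2022JpEncoderSketch:
        def __init__(self, stream):
            self.stream = stream
            self.double_byte = 0          # each line starts in single-byte mode

        def shift_out(self):
            if not self.double_byte:
                self.stream.write("\033$B")   # ESC $ B: switch to JIS X 0208
                self.double_byte = 1

        def flush(self):
            # "shift back to ASCII now"
            if self.double_byte:
                self.stream.write("\033(B")   # ESC ( B: back to ASCII
                self.double_byte = 0
            self.stream.flush()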

Calling flush and reopen would thus "almost" get the
same data across.

I'm trying to think if it would be dangerous.  Do web
and ftp servers often call flush() in the middle of
transmitting a block of text?

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From fredrik at pythonware.com  Fri Nov 19 14:33:50 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 19 Nov 1999 14:33:50 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <19991119130104.21726.rocketmail@ web605.yahoomail.com>
Message-ID: <000701bf3292$b7c49130$f29b12c2@secret.pythonware.com>

Andy Robinson  wrote:
> So I would take flush() to mean 'shift back to
> ASCII now'.

if we're still talking about my "just one
codec, please" proposal, that's exactly
what encoder.flush should do.

while decoder.flush should raise an
exception if you're still in double byte mode
(at least if running in 'strict' mode).

> Calling flush and reopen would thus "almost" get the
> same data across.
> 
> I'm trying to think if it would be dangerous.  Do web
> and ftp servers often call flush() in the middle of
> transmitting a block of text?

again, if we're talking about my proposal,
these flush methods are only called by the
string or stream wrappers, never by the
applications.  see the original post for
details.






From gstein at lyra.org  Fri Nov 19 14:29:50 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 05:29:50 -0800 (PST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <199911181530.KAA03887@eric.cnri.reston.va.us>
Message-ID: 

On Thu, 18 Nov 1999, Guido van Rossum wrote:
> Gordon McMillan wrote:
>...
> > I think imputil's emulation of the builtin importer is more of a 
> > demonstration than a serious implementation. As for speed, it 
> > depends on the test. 
> 
> Agreed.  I like some of imputil's features, but I think the API
> needs to be redesigned.

In what ways? It sounds like you've applied some thought. Do you have any
concrete ideas yet, or "just a feeling" :-)  I'm working through some
changes from JimA right now, and would welcome other suggestions. I think
there may be some outstanding stuff from MAL, but I'm not sure (Marc?)

>...
> So here's a challenge: redesign the import API from scratch.

I would suggest starting with imputil and altering as necessary. I'll use
that viewpoint below.

> Let me start with some requirements.
> 
> Compatibility issues:
> ---------------------
> 
> - the core API may be incompatible, as long as compatibility layers
> can be provided in pure Python

Which APIs are you referring to? The "imp" module? The C functions? The
__import__ and reload builtins?

I'm guessing some of imp, the two builtins, and only one or two C
functions.

> - support for rexec functionality

No problem. I can think of a number of ways to do this.

> - support for freeze functionality

No problem. A function in "imp" must be exposed to Python to support this
within the imputil framework.

> - load .py/.pyc/.pyo files and shared libraries from files

No problem. Again, a function is needed for platform-specific loading of
shared libraries.

> - support for packages

No problem. Demo's in current imputil.

> - sys.path and sys.modules should still exist; sys.path might
> have a slightly different meaning

I would suggest that both retain their *exact* meaning. We introduce
sys.importers -- a list of importers to check, in sequence. The first
importer on that list uses sys.path to look for and load modules. The
second importer loads builtins and frozen code (i.e. modules not on
sys.path).

Users can insert/append new importers or alter sys.path as before.

sys.modules continues to record name:module mappings.
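
As a sketch of the control flow only (the import_module() method name is
an assumption here, not imputil's actual interface, and packages/fromlist
handling is ignored):

    import sys

    def import_hook(name, globals=None, locals=None, fromlist=None):
        try:
            return sys.modules[name]          # sys.modules keeps its meaning
        except KeyError:
            pass
        for importer in sys.importers:        # first one scans sys.path
            module = importer.import_module(name)
            if module is not None:
                sys.modules[name] = module
                return module
        raise ImportError, "no module named " + name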

> - $PYTHONPATH and $PYTHONHOME should still be supported

No problem.

> (I wouldn't mind a splitting up of importdl.c into several
> platform-specific files, one of which is chosen by the configure
> script; but that's a bit of a separate issue.)

Easy enough. The standard importer can select the appropriate
platform-specific module/function to perform the load. i.e. these can move
to Modules/ and be split into a module-per-platform.

> New features:
> -------------
> 
> - Integrated support for Greg Ward's distribution utilities (i.e. a
>   module prepared by the distutil tools should install painlessly)

I don't know the specific requirements/functionality that would be
required here (does Greg? :-), but I can't imagine any problem with this.

> - Good support for prospective authors of "all-in-one" packaging tool
>   authors like Gordon McMillan's win32 installer or /F's squish.  (But
>   I *don't* require backwards compatibility for existing tools.)

Um. *No* problem. :-)

> - Standard import from zip or jar files, in two ways:
> 
>   (1) an entry on sys.path can be a zip/jar file instead of a directory;
>       its contents will be searched for modules or packages

While this could easily be done, I might argue against it. Old
apps/modules that process sys.path might get confused.

If compatibility is not an issue, then "No problem."

An alternative would be an Importer instance added to sys.importers that
is configured for a specific archive (in other words, don't add the zip
file to sys.path, add ZipImporter(file) to sys.importers).

Another alternative is an Importer that looks at a "sys.py_archives" list.
Or an Importer that has a py_archives instance attribute.

>   (2) a file in a directory that's on sys.path can be a zip/jar file;
>       its contents will be considered as a package (note that this is
>       different from (1)!)

No problem. This will slow things down, as a stat() for *.zip and/or *.jar
must be done, in addition to *.py, *.pyc, and *.pyo.

>   I don't particularly care about supporting all zip compression
>   schemes; if Java gets away with only supporting gzip compression
>   in jar files, so can we.

I presume we would support whatever zlib gives us, and no more.

> - Easy ways to subclass or augment the import mechanism along
>   different dimensions.  For example, while none of the following
>   features should be part of the core implementation, it should be
>   easy to add any or all:
> 
>   - support for a new compression scheme to the zip importer

Presuming ZipImporter is a class (derived from Importer), then this
ability is wholly dependent upon the author of ZipImporter providing the
hook.

The Importer class is already designed for subclassing (and its interface 
is very narrow, which means delegation is also *very* easy; see
imputil.FuncImporter).

>   - support for a new archive format, e.g. tar

A cakewalk. Gordon, JimA, and myself each have archive formats. :-)

>   - a hook to import from URLs or other data sources (e.g. a
>     "module server" imported in CORBA) (this needn't be supported
>     through $PYTHONPATH though)

No problem at all.

>   - a hook that imports from compressed .py or .pyc/.pyo files

No problem at all.

>   - a hook to auto-generate .py files from other filename
>     extensions (as currently implemented by ILU)

No problem at all.

>   - a cache for file locations in directories/archives, to improve
>     startup time

No problem at all.

>   - a completely different source of imported modules, e.g. for an
>     embedded system or PalmOS (which has no traditional filesystem)

No problem at all.

In each of the above cases, the Importer.get_code() method just needs to
grab the byte codes from the XYZ data source. That data source can be
compressed, across a network, generated on the fly, or whatever. Each
importer can certainly create a cache based on its concept of "location".
In some cases, that would be a mapping from module name to filesystem
path, or to a URL, or to a compiled-in, frozen module.
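
For instance (a sketch only: the get_code() signature and the
(ispkg, code, extras) return convention are modelled on imputil but are
assumptions here, and the in-memory mapping is invented):

    import imputil, marshal

    class DictImporter(imputil.Importer):
        def __init__(self, code_map):
            self.code_map = code_map      # maps fqname -> marshalled code string

        def get_code(self, parent, modname, fqname):
            try:
                raw = self.code_map[fqname]
            except KeyError:
                return None               # not ours -- let the next importer try
            # not a package, the unmarshalled code object, no extra values
            return 0, marshal.loads(raw), {}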

> - Note that different kinds of hooks should (ideally, and within
>   reason) properly combine, as follows: if I write a hook to recognize
>   .spam files and automatically translate them into .py files, and you
>   write a hook to support a new archive format, then if both hooks are
>   installed together, it should be possible to find a .spam file in an
>   archive and do the right thing, without any extra action.  Right?

Ack. Very, very difficult.

The imputil scheme combines the concept of locating/loading into one step.
There is only one "hook" in the imputil system. Its semantic is "map this
name to a code/module object and return it; if you don't have it, then
return None."

Your compositing example is based on the capabilities of the
find-then-load paradigm of the existing "ihooks.py". One module finds
something (foo.spam) and the other module loads it (by generating a .py).

All is not lost, however. I can easily envision the get_code() hook as
allowing any kind of return type. If it isn't a code or module object,
then another hook is called to transform it.
[ actually, I'd design it similarly: a *series* of hooks would be called
  until somebody transforms the foo.spam into a code/module object. ]

The compositing would be limited only by the (Python-based) Importer
classes. For example, my ZipImporter might expect to zip up .pyc files
*only*. Obviously, you would want to alter this to support zipping any
file, then use the suffix to determine what to do at unzip time.

> - It should be possible to write hooks in C/C++ as well as Python

Use FuncImporter to delegate to an extension module.

This is one of the benefits of imputil's single/narrow interface.

> - Applications embedding Python may supply their own implementations,
>   default search path, etc., but don't have to if they want to piggyback
>   on an existing Python installation (even though the latter is
>   fraught with risk, it's cheaper and easier to understand).

An application would have full control over the contents of sys.importers.

For a restricted execution app, it might install an Importer that loads
files from *one* directory only which is configured from a specific
Win32 Registry entry. That importer could also refuse to load shared
modules. The BuiltinImporter would still be present (although the app
would certainly omit all but the necessary builtins from the build).
Frozen modules could be excluded.

> Implementation:
> ---------------
> 
> - There must clearly be some code in C that can import certain
>   essential modules (to solve the chicken-or-egg problem), but I don't
>   mind if the majority of the implementation is written in Python.
>   Using Python makes it easy to subclass.

I posited once before that the cost of import is mostly I/O rather than
CPU, so using Python should not be an issue. MAL demonstrated that a good
design for the Importer classes is also required. Based on this, I'm a
*strong* advocate of moving as much as possible into Python (to get
Python's ease-of-coding with little relative cost).

The (core) C code should be able to search a path for a module and import
it. It does not require dynamic loading or packages. This will be used to
import exceptions.py, then imputil.py, then site.py.

The platform-specific module that performs dynamic loading must be a
statically linked module (in Modules/ ... it doesn't have to be in the
Python/ directory).

site.py can complete the bootstrap by setting up sys.importers with the
appropriate Importer instances (this is where an application can define
its own policy). sys.path was initially set by the import.c bootstrap code
(from the compiled-in path and environment variables).

Note that imputil.py would not install any hooks when it is loaded. That
is up to site.py. This implies the core C code will import a total of
three modules using its builtin system. After that, the imputil mechanism
would be importing everything (site.py would .install() an Importer which
then takes over the __import__ hook).

Further note that the "import" Python statement could be simplified to use
only the hook. However, this would require the core importer to inject
some module names into the imputil module's namespace (since it couldn't
use an import statement until a hook was installed). While this
simplification is "neat", it complicates the run-time system (the import
statement is broken until a hook is installed).

Therefore, the core C code must also support importing builtins. "sys" and
"imp" are needed by imputil to bootstrap.

The core importer should not need to deal with dynamic-load modules.

To support frozen apps, the core importer would need to support loading
the three modules as frozen modules.

The builtin/frozen importing would be exposed thru "imp" for use by
imputil for future imports. imputil would load and use the (builtin)
platform-specific module to do dynamic-load imports.

> - In order to support importing from zip/jar files using compression,
>   we'd at least need the zlib extension module and hence libz itself,
>   which may not be available everywhere.

Yes. I don't see this as a requirement, though. We wouldn't start to use
these by default, would we? Or insist on zlib being present? I see this as
more along the lines of "we have provided a standardized Importer to do
this, *provided* you have zlib support."

> - I suppose that the bootstrap is solved using a mechanism very
>   similar to what freeze currently used (other solutions seem to be
>   platform dependent).

The bootstrap that I outlined above could be done in C code. The import
code would be stripped down dramatically because you'll drop package
support and dynamic loading.

Alternatively, you could probably do the path-scanning in Python and
freeze that into the interpreter. Personally, I don't like this idea as it
would not buy you much at all (it would still need to return to C for
accessing a number of scanning functions and module importing funcs).

> - I also want to still support importing *everything* from the
>   filesystem, if only for development.  (It's hard enough to deal with
>   the fact that exceptions.py is needed during Py_Initialize();
>   I want to be able to hack on the import code written in Python
>   without having to rebuild the executable all the time.

My outline above does not freeze anything. Everything resides in the
filesystem. The C code merely needs a path-scanning loop and functions to
import .py*, builtin, and frozen types of modules.

If somebody nukes their imputil.py or site.py, then they return to Python
1.4 behavior where the core interpreter uses a path for importing (i.e. no
packages). They lose dynamically-loaded module support.

> Let's first complete the requirements gathering.  Are these
> requirements reasonable?  Will they make an implementation too
> complex?  Am I missing anything?

I'm not a fan of the compositing due to it requiring a change to semantics
that I believe are very useful and very clean. However, I outlined a
possible, clean solution to do that (a secondary set of hooks for
transforming get_code() return values).

The requirements are otherwise reasonable to me, as I see that they can
all be readily solved (i.e. they aren't burdensome).

While this email may be long, I do not believe the resulting system would
be complex. From the user-visible side of things, nothing would be
changed. sys.path is still present and operates as before. They *do* have
new functionality they can grow into, though (sys.importers). The
underlying C code is simplified, and the platform-specific dynamic-load
stuff can be distributed to distinct modules, as needed
(e.g. BeOS/dynloadmodule.c and PC/dynloadmodule.c).

> Finally, to what extent does this impact the desire for dealing
> differently with the Python bytecode compiler (e.g. supporting
> optimizers written in Python)?  And does it affect the desire to
> implement the read-eval-print loop (the >>> prompt) in Python?

If the three startup files require byte-compilation, then you could have
some issues (i.e. the byte-compiler must be present).

Once you hit site.py, you have a "full" environment and can easily detect
and import a read-eval-print loop module (i.e. why return to Python? just 
start things up right there).

site.py can also install new optimizers as desired, a new Python-based
parser or compiler, or whatever...  If Python is built without a parser or
compiler (I hope that's an option!), then the three startup modules would
simply be frozen into the executable.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/





From bwarsaw at cnri.reston.va.us  Fri Nov 19 17:30:15 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Fri, 19 Nov 1999 11:30:15 -0500 (EST)
Subject: [Python-Dev] CVS log messages with diffs
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: <14389.31511.706588.20840@anthem.cnri.reston.va.us>

There was a suggestion to start augmenting the checkin emails to
include the diffs of the checkin.  This would let you keep a current
snapshot of the tree without having to do a direct `cvs update'.

I think I can add this without a ton of pain.  It would not be
optional however, and the emails would get larger (and some checkins
could be very large).  There's also the question of whether to
generate unified or context diffs.  Personally, I find context diffs
easier to read; unified diffs are smaller but not by enough to really
matter.

So here's an informal poll.  If you don't care either way, you don't
need to respond.  Otherwise please just respond to me and not to the
list.

1. Would you like to start receiving diffs in the checkin messages?

2. If you answer `yes' to #1 above, would you prefer unified or
   context diffs?

-Barry



From bwarsaw at cnri.reston.va.us  Fri Nov 19 18:04:51 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Fri, 19 Nov 1999 12:04:51 -0500 (EST)
Subject: [Python-Dev] Another 1.6 wish
Message-ID: <14389.33587.947368.547023@anthem.cnri.reston.va.us>

We had some discussion a while back about enabling thread support by
default, if the underlying OS supports it obviously.  I'd like to see
that happen for 1.6.  IIRC, this shouldn't be too hard -- just a few
tweaks of the configure script (and who knows what for those minority
platforms that don't use configure :).

-Barry



From akuchlin at mems-exchange.org  Fri Nov 19 18:07:07 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Fri, 19 Nov 1999 12:07:07 -0500 (EST)
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <14389.33587.947368.547023@anthem.cnri.reston.va.us>
References: <14389.33587.947368.547023@anthem.cnri.reston.va.us>
Message-ID: <14389.33723.270207.374259@amarok.cnri.reston.va.us>

Barry A. Warsaw writes:
>We had some discussion a while back about enabling thread support by
>default, if the underlying OS supports it obviously.  I'd like to see

That reminds me... what about the free threading patches?  Perhaps
they should be added to the list of issues to consider for 1.6.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Oh, my fingers! My arms! My legs! My everything! Argh...
    -- The Doctor, in "Nightmare of Eden"




From petrilli at amber.org  Fri Nov 19 18:23:02 1999
From: petrilli at amber.org (Christopher Petrilli)
Date: Fri, 19 Nov 1999 12:23:02 -0500
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <14389.33723.270207.374259@amarok.cnri.reston.va.us>; from akuchlin@mems-exchange.org on Fri, Nov 19, 1999 at 12:07:07PM -0500
References: <14389.33587.947368.547023@anthem.cnri.reston.va.us> <14389.33723.270207.374259@amarok.cnri.reston.va.us>
Message-ID: <19991119122302.B23400@trump.amber.org>

Andrew M. Kuchling [akuchlin at mems-exchange.org] wrote:
> Barry A. Warsaw writes:
> >We had some discussion a while back about enabling thread support by
> >default, if the underlying OS supports it obviously.  I'd like to see

Yes pretty please!  One of the biggest problems we have in the Zope world
is that for some unknown reason, most of the Linux RPMs don't have threading
enabled, so people end up having to compile Python anyway... while this
is a silly thing, it does create problems, and means that we deal with
a lot of "dumb" support issues.

> That reminds me... what about the free threading patches?  Perhaps
> they should be added to the list of issues to consider for 1.6.

My recollection was that unfortunately MOST of the time, they actually
slowed down things because of the number of locks involved...  Guido
can no doubt shed more light onto this, but... there was a reason.

Chris
-- 
| Christopher Petrilli
| petrilli at amber.org



From gmcm at hypernet.com  Fri Nov 19 19:22:37 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 19 Nov 1999 13:22:37 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
In-Reply-To: <199911181530.KAA03887@eric.cnri.reston.va.us>
References: Your message of "Thu, 18 Nov 1999 09:19:48 EST."             <1269187709-18981857@hypernet.com> 
Message-ID: <1269086690-25057991@hypernet.com>

[Guido]
> Compatibility issues:
> ---------------------
> 
> - the core API may be incompatible, as long as compatibility
> layers can be provided in pure Python

Good idea. Question: we have keyword import, __import__, 
imp and PyImport_*. Which of those (if any) define the "core 
API"?

[rexec, freeze: yes]

> - load .py/.pyc/.pyo files and shared libraries from files

Shared libraries? Might that not involve some rather shady 
platform-specific magic? If it can be kept kosher, I'm all for it; 
but I'd say no if it involved, um, undocumented features.
 
> support for packages

Absolutely. I'll just comment that the concept of 
package.__path__ is also affected by the next point.
> 
> - sys.path and sys.modules should still exist; sys.path might
> have a slightly different meaning
> 
> - $PYTHONPATH and $PYTHONHOME should still be supported

If sys.path changes meaning, should not $PYTHONPATH 
also?

> New features:
> -------------
> 
> - Integrated support for Greg Ward's distribution utilities (i.e.
> a
>   module prepared by the distutil tools should install
>   painlessly)

I assume that this is mostly a matter of $PYTHONPATH and 
other path manipulation mechanisms?
 
> - Good support for prospective authors of "all-in-one" packaging
> tool
>   authors like Gordon McMillan's win32 installer or /F's squish. 
>   (But I *don't* require backwards compatibility for existing
>   tools.)

I guess you've forgotten: I'm that *really* tall guy .
 
> - Standard import from zip or jar files, in two ways:
> 
>   (1) an entry on sys.path can be a zip/jar file instead of a
>   directory;
>       its contents will be searched for modules or packages

I don't mind this, but it depends on whether sys.path changes 
meaning.
 
>   (2) a file in a directory that's on sys.path can be a zip/jar
>   file;
>       its contents will be considered as a package (note that
>       this is different from (1)!)

But it's affected by the same considerations (eg, do we start 
with filesystem names and wrap them in importers, or do we 
just start with importer instances / specifications for importer 
instances).
 
>   I don't particularly care about supporting all zip compression
>   schemes; if Java gets away with only supporting gzip
>   compression in jar files, so can we.

I think this is a matter of what zip compression is officially 
blessed. I don't mind if it's none; providing / creating zipped 
versions for platforms that support it is nearly trivial.

> - Easy ways to subclass or augment the import mechanism along
>   different dimensions.  For example, while none of the following
>   features should be part of the core implementation, it should
>   be easy to add any or all:
> 
>   - support for a new compression scheme to the zip importer
> 
>   - support for a new archive format, e.g. tar
> 
>   - a hook to import from URLs or other data sources (e.g. a
>     "module server" imported in CORBA) (this needn't be supported
>     through $PYTHONPATH though)

Which begs the question of the meaning of sys.path; and if it's 
still filesystem names, how do you get one of these in there?
 
>   - a hook that imports from compressed .py or .pyc/.pyo files
> 
>   - a hook to auto-generate .py files from other filename
>     extensions (as currently implemented by ILU)
> 
>   - a cache for file locations in directories/archives, to
>   improve
>     startup time
> 
>   - a completely different source of imported modules, e.g. for
>   an
>     embedded system or PalmOS (which has no traditional
>     filesystem)
> 
> - Note that different kinds of hooks should (ideally, and within
>   reason) properly combine, as follows: if I write a hook to
>   recognize .spam files and automatically translate them into .py
>   files, and you write a hook to support a new archive format,
>   then if both hooks are installed together, it should be
>   possible to find a .spam file in an archive and do the right
>   thing, without any extra action.  Right?

A bit of discussion: I've got 2 kinds of archives. One can 
contain anything & is much like a zip (and probably should be 
a zip). The other contains only compressed .pyc or .pyo. The 
latter keys contents by logical name, not filesystem name. No 
extensions, and when a package is imported, the code object 
returned is the __init__ code object (vs returning None and 
letting the import mechanism come back and ask for 
package.__init__).

When you're building an archive, you have to go thru the .py / 
.pyc / .pyo / is it a package / maybe compile logic anyway. 
Why not get it all over with, so that at runtime there are no 
choices to be made.

Which means (for this kind of archive) that including 
somebody's .spam in your archive isn't a matter of a hook, but 
a matter of adding to the archive's build smarts.
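
A toy version of those build smarts (illustrative only; "archive" here is
just a dict from logical name to bytes, which is not the real format):

    import os, py_compile

    def add_module(archive, name, path):
        if os.path.isdir(path):
            # a package: store its __init__ code under the package's own name
            init = os.path.join(path, "__init__.py")
            if os.path.isfile(init):
                add_module(archive, name, init)
            return
        if path[-3:] == ".py":
            py_compile.compile(path)      # writes path + "c"
            path = path + "c"
        archive[name] = open(path, "rb").read()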
 
> - It should be possible to write hooks in C/C++ as well as Python
> 
> - Applications embedding Python may supply their own
> implementations,
>   default search path, etc., but don't have to if they want to
>   piggyback on an existing Python installation (even though the
>   latter is fraught with risk, it's cheaper and easier to
>   understand).

A way of tweaking that which will become sys.path before 
Py_Initialize would be *most* welcome.

> Implementation:
> ---------------
> 
> - There must clearly be some code in C that can import certain
>   essential modules (to solve the chicken-or-egg problem), but I
>   don't mind if the majority of the implementation is written in
>   Python. Using Python makes it easy to subclass.
> 
> - In order to support importing from zip/jar files using
> compression,
>   we'd at least need the zlib extension module and hence libz
>   itself, which may not be available everywhere.
> 
> - I suppose that the bootstrap is solved using a mechanism very
>   similar to what freeze currently used (other solutions seem to
>   be platform dependent).

There are other possibilites here, but I have only half-
formulated ideas at the moment. The critical part for 
embedding is to be able to *completely* control all path 
related logic.
 
> - I also want to still support importing *everything* from the
>   filesystem, if only for development.  (It's hard enough to deal
>   with the fact that exceptions.py is needed during
>   Py_Initialize(); I want to be able to hack on the import code
>   written in Python without having to rebuild the executable all
>   the time.
> 
> Let's first complete the requirements gathering.  Are these
> requirements reasonable?  Will they make an implementation too
> complex?  Am I missing anything?

I'll summarize as follows:
 1) What "sys.path" means (and how it's construction can be 
manipulated) is critical.
 2) See 1.
 
> Finally, to what extent does this impact the desire for dealing
> differently with the Python bytecode compiler (e.g. supporting
> optimizers written in Python)?  And does it affect the desire to
> implement the read-eval-print loop (the >>> prompt) in 
Python?

I can assure you that code.py runs fine out of an archive :-).

- Gordon



From gstein at lyra.org  Fri Nov 19 22:06:14 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 13:06:14 -0800 (PST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: 
Message-ID: 

[ taking the liberty to CC: this back to python-dev ]

On Fri, 19 Nov 1999, David Ascher wrote:
> > >   (2) a file in a directory that's on sys.path can be a zip/jar file;
> > >       its contents will be considered as a package (note that this is
> > >       different from (1)!)
> > 
> > No problem. This will slow things down, as a stat() for *.zip and/or *.jar
> > must be done, in addition to *.py, *.pyc, and *.pyo.
> 
> Aside: it strikes me that for Python programs which import lots of files,
> 'front-loading' the stat calls could make sense.  When you first look at a
> directory in sys.path, you read the entire directory in memory, and
> successive imports do a stat on the directory to see if it's changed, and
> if not use the in-memory data.  Or am I completely off my rocker here?

Not at all. I thought of this last night after my email. Since the
Importer can easily retain state, it can hold a cache of the directory
listings. If it doesn't find the file in its cached state, then it can
reload the information from disk. If it finds it in the cache, but not on
disk, then it can remove the item from its cache.

The problem occurs when your path is [A, B], the file is in B, and you add
something to A on-the-fly. The cache might direct the importer at B,
missing your file.

Of course, with the appropriate caveats/warnings, the system would work
quite well. It really only breaks during development (which is one reason 
why I didn't accept some caching changes to imputil from MAL; but that
was for the Importer in there; Python's new Importer could have a cache).

I'm also not quite sure what the cost of reading a directory is, compared
to issuing a bunch of stat() calls. Each directory read is an
opendir/readdir(s)/closedir. Note that the DBM approach is kind of
similar, but will amortize this cost over many processes.
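
A minimal sketch of such a per-Importer cache (class and method names are
invented; as noted, it can go stale during development):

    import os

    class DirListingCache:
        def __init__(self):
            self.listings = {}            # directory -> list of filenames

        def refresh(self, dirname):
            try:
                files = os.listdir(dirname)
            except os.error:
                files = []
            self.listings[dirname] = files
            return files

        def contains(self, dirname, filename):
            files = self.listings.get(dirname)
            if files is None:
                files = self.refresh(dirname)
            if filename in files:
                return 1
            # miss: the directory may have changed on us -- rescan once
            return filename in self.refresh(dirname)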

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From Jasbahr at origin.EA.com  Fri Nov 19 21:59:11 1999
From: Jasbahr at origin.EA.com (Asbahr, Jason)
Date: Fri, 19 Nov 1999 14:59:11 -0600
Subject: [Python-Dev] Another 1.6 wish
Message-ID: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com>

My first Python-Dev post.  :-)

>We had some discussion a while back about enabling thread support by
>default, if the underlying OS supports it obviously.  

What's the consensus about Python microthreads -- a likely candidate
for incorporation in 1.6 (or later)?

Also, we have a couple minor convenience functions for Python in an 
MSDEV environment, an exposure of OutputDebugString for writing to 
the DevStudio log window and a means of tripping DevStudio C/C++ layer
breakpoints from Python code (currently experimental).  The msvcrt 
module seems like a likely candidate for these, would these be 
welcome additions?

Thanks,

Jason Asbahr
Origin Systems, Inc.
jasbahr at origin.ea.com



From gstein at lyra.org  Fri Nov 19 22:35:34 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 13:35:34 -0800 (PST)
Subject: [Python-Dev] Re: [Python-checkins] CVS log messages with diffs
In-Reply-To: <14389.31511.706588.20840@anthem.cnri.reston.va.us>
Message-ID: 

On Fri, 19 Nov 1999, Barry A. Warsaw wrote:
> There was a suggestion to start augmenting the checkin emails to
> include the diffs of the checkin.  This would let you keep a current
> snapshot of the tree without having to do a direct `cvs update'.

I've been using diffs-in-checkin for review, rather than to keep a local
snapshot updated. I guess you use the email for this (procmail truly is
frightening), but I think for most people it would be for purposes of
review.

>...context vs unifed...
> So here's an informal poll.  If you don't care either way, you don't
> need to respond.  Otherwise please just respond to me and not to the
> list.
> 
> 1. Would you like to start receiving diffs in the checkin messages?

Absolutely.

> 2. If you answer `yes' to #1 above, would you prefer unified or
>    context diffs?

Don't care.

I've attached an archive of the files that I use in my CVS repository to
do emailed diffs. These came from Ken Coar (an Apache guy) as an
extraction from the Apache repository. Yes, they do use Perl. I'm not a
Perl guy, so I probably would break things if I tried to "fix" the scripts
by converting them to Python (in fact, Greg Ward helped to improve
log_accum.pl for me!). I certainly would not be adverse to Python versions
of these files, or other cleanups.

I trimmed down the "avail" file, leaving a few examples. It works with
cvs_acls.pl to provide per-CVS-module read/write access control.

I'm currently running mod_dav, PyOpenGL, XML-SIG, PyWin32, and two other
small projects out of this repository. It has been working quite well.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cvs-for-barry.tar.gz
Type: application/octet-stream
Size: 9668 bytes
Desc: 
URL: 

From bwarsaw at cnri.reston.va.us  Fri Nov 19 22:45:14 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Fri, 19 Nov 1999 16:45:14 -0500 (EST)
Subject: [Python-Dev] Re: [Python-checkins] CVS log messages with diffs
References: <14389.31511.706588.20840@anthem.cnri.reston.va.us>
	
Message-ID: <14389.50410.358686.637483@anthem.cnri.reston.va.us>

>>>>> "GS" == Greg Stein  writes:

    GS> I've been using diffs-in-checkin for review, rather than to
    GS> keep a local snapshot updated.

Interesting; I hadn't though about this use for the diffs.

    GS> I've attached an archive of the files that I use in my CVS
    GS> repository to do emailed diffs. These came from Ken Coar (an
    GS> Apache guy) as an extraction from the Apache repository. Yes,
    GS> they do use Perl. I'm not a Perl guy, so I probably would
    GS> break things if I tried to "fix" the scripts by converting
    GS> them to Python (in fact, Greg Ward helped to improve
    GS> log_accum.pl for me!). I certainly would not be adverse to
    GS> Python versions of these files, or other cleanups.

Well, we all know Greg Ward's one of those subversive types, but then
again it's great to have (hopefully now-loyal) defectors in our camp,
just to keep us honest :)

Anyway, thanks for sending the code, it'll come in handy if I get
stuck.  Of course, my P**l skills are so rusted I don't think even an
oilcan-armed Dorothy could lube 'em up, so I'm not sure how much use I
can put them to.  Besides, I already have a huge kludge that gets run
on each commit, and I don't think it'll be too hard to add diff
generation... IF the informal vote goes that way.

-Barry



From gmcm at hypernet.com  Fri Nov 19 22:56:20 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 19 Nov 1999 16:56:20 -0500
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: 
References: 
Message-ID: <1269073918-25826188@hypernet.com>

[David Ascher got involuntarily forwarded]
> > Aside: it strikes me that for Python programs which import lots
> > of files, 'front-loading' the stat calls could make sense. 
> > When you first look at a directory in sys.path, you read the
> > entire directory in memory, and successive imports do a stat on
> > the directory to see if it's changed, and if not use the
> > in-memory data.  Or am I completely off my rocker here?

I posted something here about dircache not too long ago. 
Essentially, I found it completely unreliable on NT and on 
Linux to stat the directory. There was some test code 
attached.
 


- Gordon



From gstein at lyra.org  Fri Nov 19 23:09:36 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 14:09:36 -0800 (PST)
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <19991119122302.B23400@trump.amber.org>
Message-ID: 

On Fri, 19 Nov 1999, Christopher Petrilli wrote:
> Andrew M. Kuchling [akuchlin at mems-exchange.org] wrote:
> > Barry A. Warsaw writes:
> > >We had some discussion a while back about enabling thread support by
> > >default, if the underlying OS supports it obviously.  I'd like to see

Definitely.

I think you still want a --disable-threads option, but the default really
ought to include them.

> Yes pretty please!  One of the biggest problems we have in the Zope world
> is that for some unknown reason, most of the Linux RPMs don't have threading
> on in them, so people end up having to compile it anyway... while this
> is a silly thing, it does create problems, and means that we deal with
> a lot of "dumb" problems.

Yah. It's a pain. My RedHat 6.1 box has 1.5.2 with threads. I haven't
actually had to build my own Python(!). Man... imagine that. After almost
five years of using Linux/Python, I can actually rely on the OS getting it
right! :-)

> > That reminds me... what about the free threading patches?  Perhaps
> > they should be added to the list of issues to consider for 1.6.
> 
> My recolection was that unfortunately MOST of the time, they actually
> slowed down things because of the number of locks involved...  Guido
> can no doubt shed more light onto this, but... there was a reason.

Yes, there were problems in the first round with locks and lock
contention. The main issue is that a list must always use a lock to keep
itself consistent. Always. There is no way for an application to say "hey,
list object! I've got a higher-level construct here that guarantees there
will be no cross-thread use of this list. Ignore the locking." Another
issue that can't be avoided is using atomic increment/decrement for the
object refcounts.

Guido has already asked me about free threading patches for 1.6. I don't
know if his intent was to include them, or simply to have them available
for those who need them.

Certainly, this time around they will be simpler since Guido folded in
some of the support stuff (e.g. PyThreadState and per-thread exceptions).
There are some other supporting changes that could definitely go into the
core interpreter. The slow part comes when you start to add integrity
locks to list, dict, etc. That is when the question on whether to include
free threading comes up.

Design-wise, there is a change or two that I would probably make.

Note that shoving free-threading into the standard interpreter would get
more eyeballs at the thing, and that people may have great ideas for
reducing the overheads.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Fri Nov 19 23:11:02 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 14:11:02 -0800 (PST)
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com>
Message-ID: 

On Fri, 19 Nov 1999, Asbahr, Jason wrote:
> >We had some discussion a while back about enabling thread support by
> >default, if the underlying OS supports it obviously.  
> 
> What's the consensus about Python microthreads -- a likely candidate
> for incorporation in 1.6 (or later)?

microthreads? eh?

> Also, we have a couple minor convenience functions for Python in an 
> MSDEV environment, an exposure of OutputDebugString for writing to 
> the DevStudio log window and a means of tripping DevStudio C/C++ layer
> breakpoints from Python code (currently experimental).  The msvcrt 
> module seems like a likely candidate for these, would these be 
> welcome additions?

Sure. I don't see why not. I know that I've use OutputDebugString a
bazillion times from the Python layer. The breakpoint thingy... dunno, but
I don't see a reason to exclude it.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From skip at mojam.com  Fri Nov 19 23:11:38 1999
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 19 Nov 1999 16:11:38 -0600 (CST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: 
References: 
	
Message-ID: <14389.51994.809130.22062@dolphin.mojam.com>

    Greg> The problem occurs when your path is [A, B], the file is in B, and
    Greg> you add something to A on-the-fly. The cache might direct the
    Greg> importer at B, missing your file.

Typically your path will be relatively short (< 20 directories), right?
Just stat the directories before consulting the cache.  If any changed since
the last time the cache was built, then invalidate the entire cache (or that
portion of the cached information that is downstream from the first modified
directory).  It's still going to be cheaper than performing listdir for each
directory in the path, and, like you said, it would only require flushes
during development or installation actions.
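
In rough (untested) code, the check I have in mind is something like:

    import os

    def cache_still_valid(path, cache):
        # cache maps directory -> mtime recorded when its listing was read
        for i in range(len(path)):
            d = path[i]
            try:
                mtime = os.stat(d).st_mtime
            except OSError:
                mtime = None
            if cache.get(d) != mtime:
                # invalidate this directory and everything downstream of it
                for stale in path[i:]:
                    if stale in cache:
                        del cache[stale]
                return 0
        return 1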

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...




From skip at mojam.com  Fri Nov 19 23:15:14 1999
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 19 Nov 1999 16:15:14 -0600 (CST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <1269073918-25826188@hypernet.com>
References: 
	<1269073918-25826188@hypernet.com>
Message-ID: <14389.52210.833368.249942@dolphin.mojam.com>

    Gordon> I posted something here about dircache not too long ago.
    Gordon> Essentially, I found it completely unreliable on NT and on Linux
    Gordon> to stat the directory. There was some test code attached.

The modtime of the directory's stat info should only change if you add or
delete entries in the directory.  Were you perhaps expecting changes when
other operations took place, like rewriting an existing file? 

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From skip at mojam.com  Fri Nov 19 23:34:42 1999
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 19 Nov 1999 16:34:42 -0600
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <1269073918-25826188@hypernet.com>
References: 
	<1269073918-25826188@hypernet.com>
Message-ID: <199911192234.QAA24710@dolphin.mojam.com>

Gordon wrote:

    Gordon> I posted something here about dircache not too long ago.
    Gordon> Essentially, I found it completely unreliable on NT and on Linux
    Gordon> to stat the directory. There was some test code attached.

to which I replied:

    Skip> The modtime of the directory's stat info should only change if you
    Skip> add or delete entries in the directory.  Were you perhaps
    Skip> expecting changes when other operations took place, like rewriting
    Skip> an existing file?

I took a couple minutes to write a simple script to check things.  It
created a file, changed its mode, then unlinked it.  I was a bit surprised
that deleting a file didn't appear to change the directory's mod time.  Then
I realized that since file times are only recorded with one-second
precision, you might see no change to the directory's mtime in some
circumstances.  Adding a sleep to the script between directory operations
resolved the apparent inconsistency.  Still, as Gordon stated, you probably
can't count on directory modtimes to tell you when to invalidate the cache.
It's consistent, just not reliable...
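
For concreteness, the check was essentially of this form (a rough
reconstruction, not the exact script):

    import os, time

    def dir_mtime(d="."):
        return os.stat(d).st_mtime

    print("start:        ", dir_mtime())
    open("junkfile", "w").close()        # add an entry to the directory
    time.sleep(1)                        # mtimes have (at best) 1-second resolution
    print("after create: ", dir_mtime())
    os.chmod("junkfile", 0o644)          # mode change: directory mtime should not move
    time.sleep(1)
    print("after chmod:  ", dir_mtime())
    os.unlink("junkfile")                # remove the entry again
    time.sleep(1)
    print("after unlink: ", dir_mtime())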

if-we-slow-import-down-enough-we-can-use-this-trick-though-ly y'rs,

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From mhammond at skippinet.com.au  Sat Nov 20 01:04:28 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sat, 20 Nov 1999 11:04:28 +1100
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com>
Message-ID: <005f01bf32ea$d0b82b90$0501a8c0@bobcat>

> Also, we have a couple minor convenience functions for Python in an
> MSDEV environment, an exposure of OutputDebugString for writing to
> the DevStudio log window and a means of tripping DevStudio C/C++
layer
> breakpoints from Python code (currently experimental).  The msvcrt
> module seems like a likely candidate for these, would these be
> welcome additions?

These are both available in the win32api module.  They don't really fit
in the "msvcrt" module, as they are not part of the C runtime library,
but the win32 API itself.
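
For example (names from memory, so check the exact spelling in your
win32api build):

    import win32api

    win32api.OutputDebugString("hello from Python\n")  # shows up in the DevStudio output window
    win32api.DebugBreak()                               # drops into the attached debugger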

This is really a pointer to the fact that some or all of the win32api
should be moved into the core - registry access is the thing people
most want, but there are plenty of other useful things that people
regularly use...

Guido objects to the coding style, but hopefully that won't be a big
issue.  IMO, the coding style isn't "bad" - it is just more of an "MS"
flavour than a "Python" flavour - presumably people reading the code
will have some experience with Windows, so it won't look completely
foreign to them.  The good thing about taking it "as-is" is that it
has been fairly well bashed on over a few years, so it is really quite
stable.  The final "coding style" issue is that there are no "doc
strings" - all documentation is embedded in C comments, and extracted
using a tool called "autoduck" (similar to "autodoc").  However, I'm
sure we can arrange something there, too.

Mark.




From jcw at equi4.com  Sat Nov 20 01:21:43 1999
From: jcw at equi4.com (Jean-Claude Wippler)
Date: Sat, 20 Nov 1999 01:21:43 +0100
Subject: [Python-Dev] Import redesign [LONG]
References: 
		<1269073918-25826188@hypernet.com> <199911192234.QAA24710@dolphin.mojam.com>
Message-ID: <3835E997.8A4F5BC5@equi4.com>

Skip Montanaro wrote:
>
[dir stat cache times]
> I took a couple minutes to write a simple script to check things.  It
> created a file, changed its mode, then unlinked it.  I was a bit
> surprised that deleting a file didn't appear to change the directory's
> mod time.  Then I realized that since file times are only recorded
> with one-second

Or two, on Windows with older (FAT, as opposed to VFAT) file systems.

> precision, you might see no change to the directory's mtime in some
> circumstances.  Adding a sleep to the script between directory
> operations resolved the apparent inconsistency.  Still, as Gordon
> stated, you probably can't count on directory modtimes to tell you
> when to invalidate the cache. It's consistent, just not reliable...
> 
> if-we-slow-import-down-enough-we-can-use-this-trick-though-ly y'rs,

If the directory's mtime is less than 2 seconds old, flush - always.

If the mtime says the directory hasn't changed for at least 2 seconds,
then you can cache all entries and trust that any change will be detected.
In other words: take the *current* time into account, and then it can work.
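
In code, the rule would be roughly (untested):

    import os, time

    def can_trust_cache(d):
        # Only trust a cached listing for a directory whose mtime is at
        # least 2 seconds in the past; anything younger may still be
        # changing within the timestamp resolution, so flush it.
        age = time.time() - os.stat(d).st_mtime
        return age >= 2.0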

I think.  Maybe.  Until you get into network drives and clock skew...

-- Jean-Claude



From gmcm at hypernet.com  Sat Nov 20 04:43:32 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 19 Nov 1999 22:43:32 -0500
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <3835E997.8A4F5BC5@equi4.com>
Message-ID: <1269053086-27079185@hypernet.com>

Jean-Claude wrote:
> Skip Montanaro wrote:
> >
> [dir stat cache times]
> > ...  Then I realized that since
> > file times are only recorded with one-second
> 
> Or two, on Windows with older (FAT, as opposed to VFAT) file
> systems.

Oh lordy, it gets worse. 

With a time.sleep(1.0) between new files, Linux detects the 
change in the dir's mtime immediately. Cool.

On NT, I get an average 2.0 sec delay. But sometimes it 
doesn't detect the change within 100 secs (and my script quits). Then 
I added a stat of some file in the directory before the stat of 
the directory (not the file I added). Now it acts just like Linux - 
no delay (on both FAT and NTFS partitions). OK...

> I think.  Maybe.  Until you get into network drives and clock
> skew...

No success whatsoever in either direction across Samba. In 
fact the mtime of my Linux home directory as seen from NT is 
Jan 1, 1980.

- Gordon



From gstein at lyra.org  Sat Nov 20 13:06:48 1999
From: gstein at lyra.org (Greg Stein)
Date: Sat, 20 Nov 1999 04:06:48 -0800 (PST)
Subject: [Python-Dev] updated imputil
Message-ID: 

I've updated imputil... The main changes are that I added SysPathImporter
and BuiltinImporter. I also did some restructuring to help with
bootstrapping the module (remove dependence on os.py).

For testing a revamped Python import system, you can import the module
and call imputil._test_revamp() to set it up. This will load normal,
builtin, and frozen modules via imputil. Dynamic modules are still
handled by Python, however.
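
That is, the quick way to try it is something like:

    import imputil
    imputil._test_revamp()   # normal, builtin and frozen modules now load via imputil

    import copy              # for example; dynamic modules still use the normal mechanism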

I ran a timing comparison of importing all modules in /usr/lib/python1.5
(using standard and imputil-based importing). The standard mechanism can
do it in about 8.8 seconds. Through imputil, it does it in about 13.0
seconds. Note that I haven't profiled/optimized any of the Importer stuff
(yet).

The point about dynamic modules actually uncovered a basic problem that I
need to resolve now. The current imputil assumes that if a particular
Importer loaded the top-level module in a package, then that Importer is
responsible for loading all other modules within that package. In my
particular test, I tried to import "xml.parsers.pyexpat". The two package
modules were handled by SysPathImporter. The pyexpat module is a dynamic
load module, so it is *not* handled by the Importer -- bam. Failure.

Basically, each part of "xml.parsers.pyexpat" may need to use a different
Importer...

Off to ponder,
-g

-- 
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Sat Nov 20 13:11:37 1999
From: gstein at lyra.org (Greg Stein)
Date: Sat, 20 Nov 1999 04:11:37 -0800 (PST)
Subject: [Python-Dev] updated imputil
In-Reply-To: 
Message-ID: 

oops... forgot:

   http://www.lyra.org/greg/python/imputil.py

-g

On Sat, 20 Nov 1999, Greg Stein wrote:
> I've updated imputil... The main changes is that I added SysPathImporter
> and BuiltinImporter. I also did some restructing to help with
> bootstrapping the module (remove dependence on os.py).
> 
> For testing a revamped Python import system, you can importing the thing
> and call imputil._test_revamp() to set it up. This will load normal,
> builtin, and frozen modules via imputil. Dynamic modules are still
> handled by Python, however.
> 
> I ran a timing comparisons of importing all modules in /usr/lib/python1.5
> (using standard and imputil-based importing). The standard mechanism can
> do it in about 8.8 seconds. Through imputil, it does it in about 13.0
> seconds. Note that I haven't profiled/optimized any of the Importer stuff
> (yet).
> 
> The point about dynamic modules actually discovered a basic problem that I
> need to resolve now. The current imputil assumes that if a particular
> Importer loaded the top-level module in a package, then that Importer is
> responsible for loading all other modules within that package. In my
> particular test, I tried to import "xml.parsers.pyexpat". The two package
> modules were handled by SysPathImporter. The pyexpat module is a dynamic
> load module, so it is *not* handled by the Importer -- bam. Failure.
> 
> Basically, each part of "xml.parsers.pyexpat" may need to use a different
> Importer...
> 
> Off to ponder,
> -g
> 
> -- 
> Greg Stein, http://www.lyra.org/
> 
> 

-- 
Greg Stein, http://www.lyra.org/




From skip at mojam.com  Sat Nov 20 15:16:58 1999
From: skip at mojam.com (Skip Montanaro)
Date: Sat, 20 Nov 1999 08:16:58 -0600 (CST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <1269053086-27079185@hypernet.com>
References: <3835E997.8A4F5BC5@equi4.com>
	<1269053086-27079185@hypernet.com>
Message-ID: <14390.44378.83128.546732@dolphin.mojam.com>

    Gordon> No success whatsoever in either direction across Samba. In fact
    Gordon> the mtime of my Linux home directory as seen from NT is Jan 1,
    Gordon> 1980.

Ain't life grand? :-(

Ah, well, it was a nice idea...

S



From jim at interet.com  Mon Nov 22 17:43:39 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Mon, 22 Nov 1999 11:43:39 -0500
Subject: [Python-Dev] Import redesign [LONG]
References: 
Message-ID: <383972BB.C65DEB26@interet.com>

Greg Stein wrote:
> 
> I would suggest that both retain their *exact* meaning. We introduce
> sys.importers -- a list of importers to check, in sequence. The first
> importer on that list uses sys.path to look for and load modules. The
> second importer loads builtins and frozen code (i.e. modules not on
> sys.path).

We should retain the current order.  I think it is:
first builtin, next frozen, next sys.path.
I really think frozen modules should be loaded in preference
to sys.path.  After all, they are compiled in.
 
> Users can insert/append new importers or alter sys.path as before.

I agree with Greg that sys.path should remain as it is.  A list
of importers can add the extra functionality.  Users will
probably want to adjust the order of the list.
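
Concretely, I picture the default setup looking something like this
(constructor details glossed over; ArchiveImporter is hypothetical):

    import sys
    from imputil import BuiltinImporter, SysPathImporter

    # keep today's order: builtin/frozen modules first, then sys.path
    sys.importers = [BuiltinImporter(), SysPathImporter()]

    # an archive importer could be slotted in wherever it should be
    # searched, e.g. ahead of the filesystem:
    #   sys.importers.insert(1, ArchiveImporter("libpy.pyl"))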

> > Implementation:
> > ---------------
> >
> > - There must clearly be some code in C that can import certain
> >   essential modules (to solve the chicken-or-egg problem), but I don't
> >   mind if the majority of the implementation is written in Python.
> >   Using Python makes it easy to subclass.
> 
> I posited once before that the cost of import is mostly I/O rather than
> CPU, so using Python should not be an issue. MAL demonstrated that a good
> design for the Importer classes is also required. Based on this, I'm a
> *strong* advocate of moving as much as possible into Python (to get
> Python's ease-of-coding with little relative cost).

Yes, I agree.  And I think the main() should be written in Python.  Lots
of Python should be written in Python.

> The (core) C code should be able to search a path for a module and import
> it. It does not require dynamic loading or packages. This will be used to
> import exceptions.py, then imputil.py, then site.py.

But these can be frozen in (as you mention below).  I dislike depending
on sys.path to load essential modules.  If they are not frozen in,
then we need a command line argument to specify their path, with
sys.path used otherwise.
 
Jim Ahlstrom



From jim at interet.com  Mon Nov 22 18:25:46 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Mon, 22 Nov 1999 12:25:46 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
References: <1269144272-21594530@hypernet.com>
Message-ID: <38397C9A.DF6B7112@interet.com>

Gordon McMillan wrote:

> [JimA]
> > Think about multiple packages in multiple zip files.  The zip
> > files store file directories.  That means we would need a
> > sys.zippath to search the zip files.  I don't want another
> > PYTHONPATH phenomenon.
> 
> What if sys.path looked like:
>  [DirImporter('.'), ZlibImporter('c:/python/stdlib.pyz'), ...]

Well, that changes the current meaning of sys.path.
 
> > > > I suggest that archive files MUST be put into a known
> > > > directory.
> 
> No way. Hard code a directory? Overwrite someone else's
> Python "standalone"? Write to a C: partition that is
> deliberately sized to hold nothing but Windows? Make
> network installations impossible?

Ooops.  I didn't mean a known directory you couldn't change.
But I did mean a directory you shouldn't change.

But you are right.  The directory should be configurable.  But
I would still like to see a highly encouraged directory.  I
don't yet have a good design for this.  Anyone have ideas on an
official way to find library files?

I think a Python library file is a Good Thing, but it is not useful if
the archive can't be found.

I am thinking of a busy SysAdmin with someone nagging him/her to
install Python.  SysAdmin doesn't want another headache.  What if
Python becomes popular and users want it on Unix and PC's?  More
work!  There should be a standard way to do this that just works
and is dumb-stupid-simple.  This is a Python promotion issue.  Yes
everyone here can make sys.path work, but that is not the point.

> The official Windows solution is stuff in registry about app
> paths and such. Putting the dlls in the exe's directory is a
> workaround which works and is more managable than the
> official solution.

I agree completely.
 
> > > > We should also have the ability to append archive files to
> > > > the executable or a shared library assuming the OS allows
> > > > this
> 
> That's a handy trick on Windows, but it's got nothing to do
> with Python.

It also works on Linux.  I don't know about other systems.
 
> Flexibility. You can put Christian's favorite Einstein quote here
> too.

I hope we can still have ease of use with all this flexibility.
As I said, we need to promote Python.
 
Jim Ahlstrom



From mal at lemburg.com  Tue Nov 23 14:32:42 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 Nov 1999 14:32:42 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.8
References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> <3834425A.8E9C3B7E@lemburg.com>
Message-ID: <383A977A.C20E6518@lemburg.com>

FYI, I've uploaded a new version of the proposal which includes
the encodings package, definition of the 'raw unicode escape'
encoding (available via e.g. ur""), Unicode format strings and
a new method .breaklines().

The latest version of the proposal is available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

? Stream readers:

  What about .readline(), .readlines() ? These could be implemented
  using .read() as generic functions instead of requiring their
  implementation by all codecs. Also see Line Breaks.

? Python interface for the Unicode property database

? What other special Unicode formatting characters should be
  enhanced to work with Unicode input ? Currently only the
  following special semantics are defined:

    u"%s %s" % (u"abc", "abc") should return u"abc abc".


Pretty quiet around here lately...
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    38 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From jcw at equi4.com  Tue Nov 23 16:17:36 1999
From: jcw at equi4.com (Jean-Claude Wippler)
Date: Tue, 23 Nov 1999 16:17:36 +0100
Subject: [Python-Dev] New thread ideas in Perl-land
Message-ID: <383AB010.DD46A1FB@equi4.com>

Just got a note about a paper on a new way of dealing with threads, as
presented to the Perl-Porters list.  The idea is described in:
	http://www.cpan.org/modules/by-authors/id/G/GB/GBARTELS/thread_0001.txt

I have no time to dive in, comment, or even judge the relevance of this,
but perhaps someone else on this list wishes to check it out.

The author of this is Greg London .

-- Jean-Claude



From mhammond at skippinet.com.au  Tue Nov 23 23:45:14 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 24 Nov 1999 09:45:14 +1100
Subject: [Python-Dev] Unicode Proposal: Version 0.8
In-Reply-To: <383A977A.C20E6518@lemburg.com>
Message-ID: <002301bf3604$68fd8f00$0501a8c0@bobcat>

> Pretty quiet around here lately...

My guess is that most positions and opinions have been covered.  It is
now probably time for less talk, and more code!

Is it time to start an implementation plan?  Do we start with /F's
Unicode implementation (which /G *smirk* seemed to approve of)?  Who
does what?  When can we start to play with it?

And a key point that seems to have been thrust in our faces at the
start and hardly mentioned recently - does the proposal as it stands
meet our sponsor's (HP) requirements?

Mark.




From gstein at lyra.org  Wed Nov 24 01:40:44 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 23 Nov 1999 16:40:44 -0800 (PST)
Subject: [Python-Dev] Re: updated imputil
In-Reply-To: 
Message-ID: 

 :-)

On Sat, 20 Nov 1999, Greg Stein wrote:
>...
> The point about dynamic modules actually discovered a basic problem that I
> need to resolve now. The current imputil assumes that if a particular
> Importer loaded the top-level module in a package, then that Importer is
> responsible for loading all other modules within that package. In my
> particular test, I tried to import "xml.parsers.pyexpat". The two package
> modules were handled by SysPathImporter. The pyexpat module is a dynamic
> load module, so it is *not* handled by the Importer -- bam. Failure.
> 
> Basically, each part of "xml.parsers.pyexpat" may need to use a different
> Importer...

I've thought about this and decided the issue is with my particular
Importer, rather than the imputil design. The PathImporter traverses a set
of paths and establishes a package hierarchy based on a filesystem layout.
It should be able to load dynamic modules from within that filesystem
area.

A couple alternatives, and why I don't believe they work as well:

* A separate importer to just load dynamic libraries: this would need to
  replicate PathImporter's mapping of Python module/package hierarchy onto
  the filesystem. There would also be a sequencing issue because one
  Importer's paths would be searched before the other's paths. Current
  Python import rules establish that a module earlier in sys.path
  (whether a dyn-lib or not) is loaded before one later in the path. This
  behavior could be broken if two Importers were used.

* A design whereby other types of modules can be placed into the
  filesystem and multiple Importers are used to load parts of the path
  (e.g. PathImporter for xml.parsers and DynLibImporter for pyexpat). This
  design doesn't work well because the mapping of Python module/package to
  the filesystem is established by PathImporter -- trying to mix a "private"
  mapping design among Importers creates too much coupling.


There is also an argument that the design is fundamentally incorrect :-).
I would argue against that, however. I'm not sure what form an argument
*against* imputil would be, so I'm not sure how to preempt it :-). But we
can get an idea of various arguments by hypothesizing different scenarios
and requiring that the imputil design satisfies them.

The above two alternatives examined the use of a secondary Importer to
load things out of the filesystem (and explained why two Importers, in
whatever configuration, are not a good thing). Let's state for
argument's sake that files of some type T must be placeable within the
filesystem (i.e. according to the layout defined by PathImporter). We'll
also say that PathImporter doesn't understand T, since the latter was
designed later or is private to some app. The way to solve this is to
allow PathImporter to recognize it through some configuration of the
instance (e.g. self.recognized_types). A set of hooks in the PathImporter
would then understand how to map files of type T to a code or module
object. (alternatively, a generalized set of hooks at the Importer class
level) Note that you could easily have a utility function that scans
sys.importers for a PathImporter instance and adds the data to recognize a
new type -- this would allow for simple installation of new types.
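
In rough code -- the names recognized_types, add_suffix and sys.importers
are illustrative only, not the current imputil.py API:

    class TypeAwarePathImporter:
        def __init__(self, path):
            self.path = path
            # suffix -> hook(filename, fqname) returning a code or module object
            self.recognized_types = {}

        def add_suffix(self, suffix, hook):
            self.recognized_types[suffix] = hook

    def install_suffix(suffix, hook):
        # utility: teach the path importer on sys.importers about a new type T
        import sys
        for imp in sys.importers:
            if isinstance(imp, TypeAwarePathImporter):
                imp.add_suffix(suffix, hook)
                return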

Note that PathImporter inherently defines a 1:1 mapping from a module to a
file. Archives (zip or jar files) cannot be recognized and handled by
PathImporter. An archive defines an entirely different style of mapping
between a module/package and a file in the filesystem. Of course, an
Importer that uses archives can certainly look for them in sys.path.

The imputil design is derived directly from the "import" statement. "Here
is a module/package name, give me a module."  (this is embodied in the
get_code() method in Importer)

The find/load design established by ihooks is very filesystem-based. In
many situations, a find/load is very intertwined. If you want to take the
URL case, then just examine the actual network activity -- preferably, you
want a single transaction (e.g. one HTTP GET). Find/load implies two
transactions. With nifty context handling between the two steps, you can
get away with a single transaction. But the point is that the design
requires you to work around its inherent two-step mechanism and
establish a single step. This is weird, of course, because importing is
never *just* a find or a load, but always both.

Well... since I've satisfied myself that PathImporter needs to load
dynamic lib modules, I'm off to code it...

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Wed Nov 24 02:45:29 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 23 Nov 1999 17:45:29 -0800 (PST)
Subject: [Python-Dev] breaking out code for dynamic loading
Message-ID: 

Guido,

I can't find the message, but it seems that at some point you mentioned
wanting to break out importdl.c into separate files. The configure process
could then select the appropriate one to use for the platform.

Sounded great until I looked at importdl.c. There are 13 variants of
dynamic loading. That would imply 13 separate files/modules.

I'd be happy to break these out, but are you actually interested in that
many resulting modules? If so, then any suggestions for naming?
(e.g. aix_dynload, win32_dynload, mac_dynload)

Here are the variants:

* NeXT, using FVM shlibs             (USE_RLD)
* NeXT, using frameworks             (USE_DYLD)
* dl / GNU dld                       (USE_DL)
* SunOS, IRIX 5 shared libs          (USE_SHLIB)
* AIX dynamic linking                (_AIX)
* Win32 platform                     (MS_WIN32)
* Win16 platform                     (MS_WIN16)
* OS/2 dynamic linking               (PYOS_OS2)
* Mac CFM                            (USE_MAC_DYNAMIC_LOADING)
* HP/UX dyn linking                  (hpux)
* NetBSD shared libs                 (__NetBSD__)
* FreeBSD shared libs                (__FreeBSD__)
* BeOS shared libs                   (__BEOS__)


Could I suggest a new top-level directory in the Python distribution named
"Platform"? Move BeOS, PC, and PCbuild in there (bring back Mac?). Add new
directories for each of the above platforms and move the appropriate
portion of importdl.c into there as a Python C Extension Module. (the
module would still be statically linked into the interpreter!)

./configure could select the module and write a Setup.dynload, much like
it does with Setup.thread.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/





From gstein at lyra.org  Wed Nov 24 03:43:50 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 23 Nov 1999 18:43:50 -0800 (PST)
Subject: [Python-Dev] another round of imputil work completed
In-Reply-To: 
Message-ID: 

On Tue, 23 Nov 1999, Greg Stein wrote:
>...
> Well... since I've satisfied to myself that PathImporter needs to load
> dynamic lib modules, I'm off to code it...

All right. imputil.py now comes with code to emulate the builtin Python
import mechanism. It loads all the same types of files, uses sys.path, and
(pointed out by JimA) loads builtins before looking on the path.

The only "feature" it doesn't support is using package.__path__ to look
for submodules. I never liked that thing, so it isn't in there.
(imputil *does* set the __path__ attribute, tho)

Code is available at:

   http://www.lyra.org/greg/python/imputil.py


Next step is to add a "standard" library/archive format. JimA and I have
been tossing some stuff back and forth on this.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Wed Nov 24 09:34:52 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 24 Nov 1999 09:34:52 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.8
References: <002301bf3604$68fd8f00$0501a8c0@bobcat>
Message-ID: <383BA32C.2E6F4780@lemburg.com>

Mark Hammond wrote:
> 
> > Pretty quiet around here lately...
> 
> My guess is that most positions and opinions have been covered.  It is
> now probably time for less talk, and more code!

Or that everybody is on holidays... like Guido.
 
> It is time to start an implementation plan?  Do we start with /F's
> Unicode implementation (which /G *smirk* seemed to approve of)?  Who
> does what?  When can we start to play with it?

This depends on whether HP agrees on the current specs. If they
do, there should be code by mid December, I guess.
 
> And a key point that seems to have been thrust in our faces at the
> start and hardly mentioned recently - does the proposal as it stands
> meet our sponsor's (HP) requirements?

Haven't heard anything from them yet (this is probably mainly
due to Guido being offline).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    37 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Wed Nov 24 10:32:46 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 24 Nov 1999 10:32:46 +0100
Subject: [Python-Dev] Import Design
Message-ID: <383BB0BE.BF116A28@lemburg.com>

Before hooking on to some more PathBuiltinImporters ;-), I'd like
to spawn a thread leading in a different direction...

There has been some discussion on what we really expect of the
import mechanism to be able to do. Here's a summary of what I
think we need:

* compatibility with the existing import mechanism

* imports from library archives (e.g. .pyl or .par-files)

* a modified intra package import lookup scheme (the thingy
  which I call "walk-me-up-Scotty" patch -- see previous posts)

And for some fancy stuff:

* imports from URLs (e.g. these could be put on the path for
  automatic inclusion in the import scan or be passed explicitly
  to __import__)

* a (file based) static lookup cache to enhance lookup
  performance which is enabled via a command line switch
  (rather than being enabled per default), so that the
  user can decide whether to apply this optimization or
  not

The point I want to make is: there aren't all that many features
we are really looking for, so why not incorporate these into
the builtin importer and only *then* start thinking about
schemes for hooks, managers, etc. ?!

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    37 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From captainrobbo at yahoo.com  Wed Nov 24 12:40:16 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 24 Nov 1999 03:40:16 -0800 (PST)
Subject: [Python-Dev] Unicode Proposal: Version 0.8
Message-ID: <19991124114016.7706.rocketmail@web601.mail.yahoo.com>

--- Mark Hammond  wrote:
> > Pretty quiet around here lately...
> 
> My guess is that most positions and opinions have
> been covered.  It is
> now probably time for less talk, and more code!
> 
> It is time to start an implementation plan?  Do we
> start with /F's
> Unicode implementation (which /G *smirk* seemed to
> approve of)?  Who
> does what?  When can we start to play with it?
> 
> And a key point that seems to have been thrust in
> our faces at the
> start and hardly mentioned recently - does the
> proposal as it stands
> meet our sponsor's (HP) requirements?
> 
> Mark.

I had a long chat with them on Friday :-)  They want
it done, but nobody is actively working on it now as
far as I can tell, and they are very busy.

The per-thread thing was a red herring - they just
want to be able to do (for example) web servers
handling different encodings from a central unicode
database, so per-output-stream works just fine.

They will be at IPC8; I'd suggest that a round of
prototyping, insisting that they read it and then
discuss it at IPC8, and being prepared to rework
things thereafter are all important.  Hopefully
then we'll have a
plan on how to tackle the much larger (but less
interesting to python-dev) job of writing and
verifying all the codecs and utilities.


Andy Robinson



=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.




From jim at interet.com  Wed Nov 24 15:43:57 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Wed, 24 Nov 1999 09:43:57 -0500
Subject: [Python-Dev] Re: updated imputil
References: 
Message-ID: <383BF9AD.E183FB98@interet.com>

Greg Stein wrote:
> * A separate importer to just load dynamic libraries: this would need to
>   replicate PathImporter's mapping of Python module/package hierarchy onto
>   the filesystem. There would also be a sequencing issue because one
>   Importer's paths would be searched before the other's paths. Current
>   Python import rules establishes that a module earlier in sys.path
>   (whether a dyn-lib or not) is loaded before one later in the path. This
>   behavior could be broken if two Importers were used.

I would like to argue that on Windows, import of dynamic libraries is
broken.  If a file something.pyd is imported, then sys.path is searched
to find the module.  If a file something.dll is imported, the same thing
happens.  But Windows defines its own search order for *.dll files which
Python ignores.  I would suggest that this is wrong for files named
*.dll, but OK for files named *.pyd.

A SysAdmin should be able to install and maintain *.dll as she has
been trained to do.  This makes maintaining Python installations
simpler and more un-surprising.

I have no solution to the backward compatibility problem.  But the
code is only a couple of lines.  A LoadLibrary() call does its own
path searching.

Jim Ahlstrom



From jim at interet.com  Wed Nov 24 16:06:17 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Wed, 24 Nov 1999 10:06:17 -0500
Subject: [Python-Dev] Import Design
References: <383BB0BE.BF116A28@lemburg.com>
Message-ID: <383BFEE9.B4FE1F19@interet.com>

"M.-A. Lemburg" wrote:

> The point I want to make is: there aren't all that many features
> we are really looking for, so why not incorporate these into
> the builtin importer and only *then* start thinking about
> schemes for hooks, managers, etc. ?!

Marc has made this point before, and I think it should be
considered carefully.  It is a lot of work to re-create the
current import logic in Python and it is almost guaranteed
to be slower.  So why do it?

I like imputil.py because it leads
to very simple Python installations.  I view this as
a Python promotion issue.  If we have a boot mechanism plus
archive files, we can have few-file Python installations
with package addition being just adding another file.

But at least some of this code must be in C.  I volunteer to
write the rest of it in C if that is what people want.  But it
would add two hundred more lines of code to import.c.  So
maybe now is the time to switch to imputil, instead of waiting
for later.

But I am indifferent as long as I can tell a Python user
to just put an archive file libpy.pyl in his Python directory
and everything will Just Work.

Jim Ahlstrom



From bwarsaw at python.org  Tue Nov 30 21:23:40 1999
From: bwarsaw at python.org (Barry Warsaw)
Date: Tue, 30 Nov 1999 15:23:40 -0500 (EST)
Subject: [Python-Dev] CFP Developers' Day - 8th International Python Conference
Message-ID: <14404.12876.847116.288848@anthem.cnri.reston.va.us>

Hello Python Developers!

Thursday January 27 2000, the final day of the 8th International
Python Conference, is Developers' Day, where Python hackers get
together to discuss and reach agreements on the outstanding issues
facing Python.  This is also your once-a-year chance for face-to-face
interactions with Python's creator Guido van Rossum and other
experienced Python developers.

To make Developers' Day a success, we need you!  We're looking for a
few good champions to lead topic sessions.  As a champion, you will
choose a topic that fires you up and write a short position paper for
publication on the web prior to the conference.  You'll also prepare
introductory material for the topic overview session, and lead a 90
minute topic breakout group.

We've had great champions and topics in previous years, and many
features of today's Python had their start at past Developers' Days.
This is your chance to help shape the future of Python for 1.6,
2.0 and beyond.

If you are interested in becoming a topic champion, you must email me
by Wednesday December 15, 1999.  For more information, please visit
the IPC8 Developers' Day web page at

    

This page has more detail on schedule, suggested topics, important
dates, etc.  To volunteer as a champion, or to ask other questions,
you can email me at bwarsaw at python.org.

-Barry



From mal at lemburg.com  Mon Nov  1 00:00:55 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 01 Nov 1999 00:00:55 +0100
Subject: [Python-Dev] Misleading syntax error text
References: <1270838575-13870925@hypernet.com>
Message-ID: <381CCA27.59506CF6@lemburg.com>

[Extracted from the psa-members list...]

Gordon McMillan wrote:
> 
> Chris Fama wrote,
> > And now the rub: the exact same function definition has passed
> > through byte-compilation perfectly OK many times before with no
> > problems... of course, this points rather clearly to the
> > preceding code, but it illustrates a failing in Python's syntax
> > error messages, and IMHO a fairly serious one at that, if this is
> > indeed so.
> 
> My simple experiments refuse to compile a "del getattr(..)" at
> all.

Hmm, it seems to be a fairly generic error:

>>> del f(x,y)
SyntaxError: can't assign to function call

How about changing the com_assign_trailer function in Python/compile.c
to:

static void
com_assign_trailer(c, n, assigning)
        struct compiling *c;
        node *n;
        int assigning;
{
        REQ(n, trailer);
        switch (TYPE(CHILD(n, 0))) {
        case LPAR: /* '(' [exprlist] ')' */
                com_error(c, PyExc_SyntaxError,
                          assigning ? "can't assign to function call":
			              "can't delete expression");
                break;
        case DOT: /* '.' NAME */
                com_assign_attr(c, CHILD(n, 1), assigning);
                break;
        case LSQB: /* '[' subscriptlist ']' */
                com_subscriptlist(c, CHILD(n, 1), assigning);
                break;
        default:
                com_error(c, PyExc_SystemError, "unknown trailer type");
        }
}

or something along those lines...
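
That should then give something like:

    >>> f(x, y) = 1
    SyntaxError: can't assign to function call
    >>> del f(x, y)
    SyntaxError: can't delete expression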

BTW, has anybody tried my import patch recently ? I haven't heard
any criticism since posting it and wonder what made the list fall
asleep over the topic :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    61 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From jack at oratrix.nl  Mon Nov  1 11:12:37 1999
From: jack at oratrix.nl (Jack Jansen)
Date: Mon, 01 Nov 1999 11:12:37 +0100
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic 
 committee?
In-Reply-To: Message by "Mark Hammond"  ,
	     Mon, 1 Nov 1999 12:51:56 +1100 , <002301bf240b$ae61fa00$0501a8c0@bobcat> 
Message-ID: <19991101101238.3D6FA35BB1E@snelboot.oratrix.nl>

I think I agree with Mark's post, although I do see a little more light (the 
relative imports discussion resulted in working code, for instance).

The benevolent lieutenant idea may work, _if_ the lieutenants can be found. I 
myself will quickly join Mark in wishing the new python-dev well and 
abandoning ship (half a :-).

If that doesn't work maybe we should try at the very least to create a 
"memory". If you bring up a subject for discussion and you don't have working 
code that's fine the first time. But if anyone brings it up a second time 
they're supposed to have code. That way at least we won't be rehashing old 
discussions (as happens on the python-list every time, with subjects like GC 
or optimizations).

And maybe we should limit ourselves in our replies: don't speak up too much in 
discussions if you're not going to write code. I know that I'm pretty good at 
answering with my brilliant insights to everything myself:-). It could well be 
that refining and refining the design (as in the getopt discussion) results in 
such a mess of opinions that no-one has the guts to write the code anymore.
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From mal at lemburg.com  Mon Nov  1 12:09:21 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 01 Nov 1999 12:09:21 +0100
Subject: [Python-Dev] dircache.py
References: <1270737688-19939033@hypernet.com>
Message-ID: <381D74E0.1AE3DA6A@lemburg.com>

Gordon McMillan wrote:
> 
> Pursuant to my volunteering to implement Guido's plan to
> combine cmp.py, cmpcache.py, dircmp.py and dircache.py
> into filecmp.py, I did some investigating of dircache.py.
> 
> I find it completely unreliable. On my NT box, the mtime of the
> directory is updated (on average) 2 secs after a file is added,
> but within 10 tries, there's always one in which it takes more
> than 100 secs (and my test script quits). My Linux box hardly
> ever detects a change within 100 secs.
> 
> I've tried a number of ways of testing this ("this" being
> checking for a change in the mtime of the directory), the latest
> of which is below. Even if dircache can be made to work
> reliably and surprise-free on some platforms, I doubt it can be
> done cross-platform. So I'd recommend that it just get dropped.
> 
> Comments?

Note that you'll have to flush and close the tmp file to actually
have it written to the file system. That's why you are not seeing
any new mtimes on Linux.

Still, I'd suggest declaring it obsolete. Filesystem access is
usually cached by the underlying OS anyway, so adding another layer of
caching on top of it seems not worthwhile (plus, the OS knows
better when and what to cache).

Another argument against using stat() time entries for caching
purposes is the resolution of 1 second. It makes dircache.py
unreliable per se for fast-changing directories.

The problem is most probably even worse for NFS, and on Samba-mounted
WinXX filesystems the mtime trick doesn't work at all (stat()
returns the creation time for atime, mtime and ctime).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    60 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From gward at cnri.reston.va.us  Mon Nov  1 14:28:51 1999
From: gward at cnri.reston.va.us (Greg Ward)
Date: Mon, 1 Nov 1999 08:28:51 -0500
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
In-Reply-To: <002301bf240b$ae61fa00$0501a8c0@bobcat>; from mhammond@skippinet.com.au on Mon, Nov 01, 1999 at 12:51:56PM +1100
References: <002301bf240b$ae61fa00$0501a8c0@bobcat>
Message-ID: <19991101082851.A16952@cnri.reston.va.us>

On 01 November 1999, Mark Hammond said:
> I have for some time been wondering about the usefulness of this
> mailing list.  It seems to have produced staggeringly few results
> since inception.

Perhaps this is an indication of stability rather than stagnation.  Of
course we can't have *total* stability or Python 1.6 will never appear,
but...

> * Portable "spawn" module for core?
> No result.

...I started this little thread to see if there was any interest, and to
find out the easy way if VMS/Unix/DOS-style "spawn sub-process with list
of strings as command-line arguments" makes any sense at all on the Mac
without actually having to go learn about the Mac.

The result: if 'spawn()' is added to the core, it should probably be
'os.spawn()', but it's not really clear if this is necessary or useful
to many people; and, no, it doesn't make sense on the Mac.  That
answered my questions, so I don't really see the thread as a failure.  I
might still turn the distutils.spawn module into an appendage of the os
module, but there doesn't seem to be a compelling reason to do so.
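
For the record, the kind of helper I had in mind was roughly the following --
a sketch only, not the distutils.spawn implementation, using os.spawnv where
the platform provides it and fork/exec elsewhere (and with no Mac story,
which was exactly the question):

    import os

    def spawn(program, args):
        # Run 'program' with a list of argument strings and wait for it to
        # finish; returns the exit status.
        argv = [program] + list(args)
        if hasattr(os, 'spawnv'):            # e.g. Windows
            return os.spawnv(os.P_WAIT, program, argv)
        pid = os.fork()                      # Unix
        if pid == 0:
            os.execv(program, argv)
            os._exit(127)                    # exec failed
        return os.waitpid(pid, 0)[1]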

Not every thread has to result in working code.  In other words,
negative results are results too.

        Greg



From skip at mojam.com  Mon Nov  1 17:58:41 1999
From: skip at mojam.com (Skip Montanaro)
Date: Mon, 1 Nov 1999 10:58:41 -0600 (CST)
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
In-Reply-To: <002301bf240b$ae61fa00$0501a8c0@bobcat>
References: <002301bf240b$ae61fa00$0501a8c0@bobcat>
Message-ID: <14365.50881.778143.590205@dolphin.mojam.com>

    Mark> * Catching "return" and "return expr" at compile time
    Mark> Seemed to be blessed - yay!  Dont believe I have seen a check-in
    Mark> yet. 

I did post a patch to compile.c here and to the announce list.  I think the
temporal distance between the furor in the main list and when it appeared
"in print" may have been a problem.  Also, as the author of that code I
surmised that compile.c was the wrong place for it.  I would have preferred
to see it in some Python code somewhere, but there's no obvious place to put
it.  Finally, there is as yet no convention about how to handle warnings.
(Maybe some sort of PyLint needs to be "blessed" and made part of the
distribution.)
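
To make the idea concrete, here is a rough sketch of such a Python-level
check, written against the ast module of later Python versions purely to
illustrate where the warning could live outside compile.c:

    import ast, sys

    class ReturnChecker(ast.NodeVisitor):
        # Warn about functions that mix a bare 'return' with 'return expr'.
        # Returns inside nested functions are not excluded in this simple
        # sketch, so it can over-report.
        def __init__(self, filename):
            self.filename = filename

        def visit_FunctionDef(self, node):
            bare = valued = 0
            for child in ast.walk(node):
                if isinstance(child, ast.Return):
                    if child.value is None:
                        bare += 1
                    else:
                        valued += 1
            if bare and valued:
                print("%s:%d: %s mixes 'return' and 'return expr'"
                      % (self.filename, node.lineno, node.name))
            self.generic_visit(node)

    if __name__ == '__main__':
        fname = sys.argv[1]
        ReturnChecker(fname).visit(ast.parse(open(fname).read(), fname))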

Perhaps python-dev would be a good place to generate SIGs, sort of like a
hurricane spinning off tornadoes.

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...




From guido at CNRI.Reston.VA.US  Mon Nov  1 19:41:32 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 01 Nov 1999 13:41:32 -0500
Subject: [Python-Dev] Misleading syntax error text
In-Reply-To: Your message of "Mon, 01 Nov 1999 00:00:55 +0100."
             <381CCA27.59506CF6@lemburg.com> 
References: <1270838575-13870925@hypernet.com>  
            <381CCA27.59506CF6@lemburg.com> 
Message-ID: <199911011841.NAA06233@eric.cnri.reston.va.us>

> How about chainging the com_assign_trailer function in Python/compile.c
> to:

Please don't use the python-dev list for issues like this.  The place
to go is the python-bugs database
(http://www.python.org/search/search_bugs.html) or you could just send
me a patch (please use a context diff and include the standard disclaimer
language).

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Mon Nov  1 20:06:39 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 01 Nov 1999 20:06:39 +0100
Subject: [Python-Dev] Misleading syntax error text
References: <1270838575-13870925@hypernet.com>  
	            <381CCA27.59506CF6@lemburg.com> <199911011841.NAA06233@eric.cnri.reston.va.us>
Message-ID: <381DE4BF.951B03F0@lemburg.com>

Guido van Rossum wrote:
> 
> > How about chainging the com_assign_trailer function in Python/compile.c
> > to:
> 
> Please don't use the python-dev list for issues like this.  The place
> to go is the python-bugs database
> (http://www.python.org/search/search_bugs.html) or you could just send
> me a patch (please use a context diff and include the standard disclaimer
> language).

This wasn't really a bug report... I was actually looking for some
feedback prior to sending a real (context) patch.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    60 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From jim at interet.com  Tue Nov  2 16:43:56 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Tue, 02 Nov 1999 10:43:56 -0500
Subject: [Python-Dev] Benevolent dictator versus the bureaucratic committee?
References: <002301bf240b$ae61fa00$0501a8c0@bobcat>
Message-ID: <381F06BC.CC2CBFBD@interet.com>

Mark Hammond wrote:
> 
> I have for some time been wondering about the usefulness of this
> mailing list.  It seems to have produced staggeringly few results
> since inception.

I appreciate the points you made, but I think this list is still
a valuable place to air design issues.  I don't want to see too
many Python core changes anyway.  Just my 2.E-2 worth.

Jim Ahlstrom



From Vladimir.Marangozov at inrialpes.fr  Wed Nov  3 23:34:44 1999
From: Vladimir.Marangozov at inrialpes.fr (Vladimir Marangozov)
Date: Wed, 3 Nov 1999 23:34:44 +0100 (NFT)
Subject: [Python-Dev] paper available
Message-ID: <199911032234.XAA26442@pukapuka.inrialpes.fr>

I've OCR'd Saltzer's paper. It's available temporarily (in MS Word
format) at http://sirac.inrialpes.fr/~marangoz/tmp/Saltzer.zip

Since there may be legal problems with LNCS, I will disable the
link shortly (so those of you who have not received a copy and are
interested in reading it, please grab it quickly)

If prof. Saltzer agrees (and if he can, legally) put it on his web page,
I guess that the paper will show up at http://mit.edu/saltzer/

Jeremy, could you please check this with prof. Saltzer? (This version
might need some corrections due to the OCR process, despite that I've
made a significant effort to clean it up)

-- 
       Vladimir MARANGOZOV          | Vladimir.Marangozov at inrialpes.fr
http://sirac.inrialpes.fr/~marangoz | tel:(+33-4)76615277 fax:76615252



From guido at CNRI.Reston.VA.US  Thu Nov  4 21:58:53 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 04 Nov 1999 15:58:53 -0500
Subject: [Python-Dev] wish list
Message-ID: <199911042058.PAA15437@eric.cnri.reston.va.us>

I got the wish list below.  Anyone care to comment on how close we are
on fulfilling some or all of this?

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date:    Thu, 04 Nov 1999 20:26:54 +0700
From:    "Claudio Ram?n" 
To:      guido at python.org

Hello,
  I'm a python user (excuse my english, I'm spanish and...). I think it is a 
very complete language and I use it in solve statistics, phisics, 
mathematics, chemistry and biology problemns. I'm not an
experienced programmer, only a scientific with problems to solve.
The motive of this letter is explain to you a needs that I have in
the python use and I think in the next versions...
* GNU CC for Win32 compatibility (compilation of python interpreter and 
"Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative 
eviting the cygwin dll user.
* Add low level programming capabilities for system access and speed of code 
fragments eviting the C-C++ or Java code use. Python, I think, must be a 
complete programming language in the "programming for every body" philosofy.
* Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI 
in the standard distribution. For example, Wxpython permit an html browser. 
It is very importan for document presentations. And Wxwindows and Gtk+ are 
faster than tk.
* Incorporate a database system in the standard library distribution. To be 
possible with relational and documental capabilites and with import facility 
of DBASE, Paradox, MSAccess files.
* Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to 
be possible with XML how internal file format). And to be possible with 
Microsoft Word import export facility. For example, AbiWord project can be 
an alternative but if lacks programming language. If we can make python the 
programming language for AbiWord project...

Thanks.
Ramón Molina.
rmn70 at hotmail.com

______________________________________________________
Get Your Private, Free Email at http://www.hotmail.com

------- End of Forwarded Message




From skip at mojam.com  Thu Nov  4 22:06:53 1999
From: skip at mojam.com (Skip Montanaro)
Date: Thu, 4 Nov 1999 15:06:53 -0600 (CST)
Subject: [Python-Dev] wish list
In-Reply-To: <199911042058.PAA15437@eric.cnri.reston.va.us>
References: <199911042058.PAA15437@eric.cnri.reston.va.us>
Message-ID: <14369.62829.389307.377095@dolphin.mojam.com>

     * Incorporate a database system in the standard library
       distribution. To be possible with relational and documental
       capabilites and with import facility of DBASE, Paradox, MSAccess
       files.

I know Digital Creations has a dbase module knocking around there somewhere.
I hacked on it for them a couple years ago.  You might see if JimF can
scrounge it up and donate it to the cause.

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...




From fdrake at acm.org  Thu Nov  4 22:08:26 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 4 Nov 1999 16:08:26 -0500 (EST)
Subject: [Python-Dev] wish list
In-Reply-To: <199911042058.PAA15437@eric.cnri.reston.va.us>
References: <199911042058.PAA15437@eric.cnri.reston.va.us>
Message-ID: <14369.62922.994300.233350@weyr.cnri.reston.va.us>

Guido van Rossum writes:
 > I got the wish list below.  Anyone care to comment on how close we are
 > on fulfilling some or all of this?

Claudio Ramón wrote:
 > * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI 
 > in the standard distribution. For example, Wxpython permit an html browser. 
 > It is very importan for document presentations. And Wxwindows and Gtk+ are 
 > faster than tk.

  And GTK+ looks better, too.  ;-)
  None the less, I don't think GTK+ is as solid or mature as Tk.
There are still a lot of oddities, and several warnings/errors get
printed on stderr/stdout (don't know which) rather than raising
exceptions.  (This is a failing of GTK+, not PyGTK.)  There isn't an
equivalent of the Tk text widget, which is a real shame.  There are
people working on something better, but it's not a trivial project
and I don't have any idea how it's going.

 > * Incorporate a database system in the standard library distribution. To be 
 > possible with relational and documental capabilites and with import facility 
 > of DBASE, Paradox, MSAccess files.

  Doesn't sound like part of a core library really, though I could see 
combining the Win32 extensions with the core package to produce a
single installable.  That should at least provide access to MSAccess,
and possibly the others, via ODBC.

 > * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to 
 > be possible with XML how internal file format). And to be possible with 
 > Microsoft Word import export facility. For example, AbiWord project can be 
 > an alternative but if lacks programming language. If we can make python the 
 > programming language for AbiWord project...

  I think this would be great to have.  But I wouldn't put the
editor/browser in the core.  I would stick something like the
XML-SIG's package in, though, once that's better polished.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From jim at interet.com  Fri Nov  5 01:09:40 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Thu, 04 Nov 1999 19:09:40 -0500
Subject: [Python-Dev] wish list
References: <199911042058.PAA15437@eric.cnri.reston.va.us>
Message-ID: <38222044.46CB297E@interet.com>

Guido van Rossum wrote:
> 
> I got the wish list below.  Anyone care to comment on how close we are
> on fulfilling some or all of this?

> * GNU CC for Win32 compatibility (compilation of python interpreter and
> "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative
> eviting the cygwin dll user.

I don't know what this means.

> * Add low level programming capabilities for system access and speed of code
> fragments eviting the C-C++ or Java code use. Python, I think, must be a
> complete programming language in the "programming for every body" philosofy.

I don't know what this means in practical terms either.  I use
the C interface for this.

> * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI
> in the standard distribution. For example, Wxpython permit an html browser.
> It is very importan for document presentations. And Wxwindows and Gtk+ are
> faster than tk.

As a Windows user, I don't feel comfortable publishing GUI code
based on these tools.  Maybe they have progressed and I should
look at them again.  But I doubt the Python world is going to
standardize on a single GUI anyway.

Does anyone out there publish Windows Python code with a Windows
Python GUI?  If so, what GUI toolkit do you use?

Jim Ahlstrom



From rushing at nightmare.com  Fri Nov  5 08:22:22 1999
From: rushing at nightmare.com (Sam Rushing)
Date: Thu, 4 Nov 1999 23:22:22 -0800 (PST)
Subject: [Python-Dev] wish list
In-Reply-To: <668469884@toto.iv>
Message-ID: <14370.34222.884193.260990@seattle.nightmare.com>

James C. Ahlstrom writes:
 > Guido van Rossum wrote:
 > > I got the wish list below.  Anyone care to comment on how close we are
 > > on fulfilling some or all of this?
 > 
 > > * GNU CC for Win32 compatibility (compilation of python interpreter and
 > > "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative
 > > eviting the cygwin dll user.
 > 
 > I don't know what this means.

mingw32: 'minimalist gcc for win32'.  it's gcc on win32 without trying
to be unix. It links against crtdll, so for example it can generate
small executables that run on any win32 platform.  Also, an
alternative to plunking down money ever year to keep up with MSVC++

I used to use mingw32 a lot, and it's even possible to set up egcs to
cross-compile to it.  At one point using egcs on linux I was able to
build a stripped-down python.exe for win32...

  http://agnes.dida.physik.uni-essen.de/~janjaap/mingw32/

-Sam




From jim at interet.com  Fri Nov  5 15:04:59 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Fri, 05 Nov 1999 09:04:59 -0500
Subject: [Python-Dev] wish list
References: <14370.34222.884193.260990@seattle.nightmare.com>
Message-ID: <3822E40B.99BA7CA0@interet.com>

Sam Rushing wrote:

> mingw32: 'minimalist gcc for win32'.  it's gcc on win32 without trying
> to be unix. It links against crtdll, so for example it can generate

OK, thanks.  But I don't believe this is something that
Python should pursue.  Binaries are available for Windows
and Visual C++ is widely available and has a professional
debugger (etc.).

Jim Ahlstrom



From skip at mojam.com  Fri Nov  5 18:17:58 1999
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 5 Nov 1999 11:17:58 -0600 (CST)
Subject: [Python-Dev] paper available
In-Reply-To: <199911032234.XAA26442@pukapuka.inrialpes.fr>
References: <199911032234.XAA26442@pukapuka.inrialpes.fr>
Message-ID: <14371.4422.96832.498067@dolphin.mojam.com>

    Vlad> I've OCR'd Saltzer's paper. It's available temporarily (in MS Word
    Vlad> format) at http://sirac.inrialpes.fr/~marangoz/tmp/Saltzer.zip

I downloaded it and took a very quick peek at it, but it's applicability to
Python wasn't immediately obvious to me.  Did you download it in response to
some other thread I missed somewhere?

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From gstein at lyra.org  Fri Nov  5 23:19:49 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 5 Nov 1999 14:19:49 -0800 (PST)
Subject: [Python-Dev] wish list
In-Reply-To: <3822E40B.99BA7CA0@interet.com>
Message-ID: 

On Fri, 5 Nov 1999, James C. Ahlstrom wrote:
> Sam Rushing wrote:
> > mingw32: 'minimalist gcc for win32'.  it's gcc on win32 without trying
> > to be unix. It links against crtdll, so for example it can generate
> 
> OK, thanks.  But I don't believe this is something that
> Python should pursue.  Binaries are available for Windows
> and Visual C++ is widely available and has a professional
> debugger (etc.).

If somebody is willing to submit patches, then I don't see a problem with
it. There are quite a few people who are unable/unwilling to purchase
VC++. People may also need to build their own Python rather than using the
prebuilt binaries.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Sun Nov  7 14:24:24 1999
From: gstein at lyra.org (Greg Stein)
Date: Sun, 7 Nov 1999 05:24:24 -0800 (PST)
Subject: [Python-Dev] updated modules
Message-ID: 

Hi all...

I've updated some of the modules at http://www.lyra.org/greg/python/.

Specifically, there is a new httplib.py, davlib.py, qp_xml.py, and
a new imputil.py. The latter will be updated again RSN with some patches
from Jim Ahlstrom.

Besides some tweaks/fixes/etc, I've also clarified the ownership and
licensing of the things. httplib and davlib are (C) Guido, licensed under
the Python license (well... anything he chooses :-). qp_xml and imputil
are still Public Domain. I also added some comments into the headers to
note where they come from (I've had a few people remark that they ran
across the module but had no idea who wrote it or where to get updated
versions :-), and I inserted a CVS Id to track the versions (yes, I put
them into CVS just now).

Note: as soon as I figure out the paperwork or whatever, I'll also be
skipping the whole "wetsign.txt" thingy and just transfer everything to
Guido. He remarked a while ago that he will finally own some code in the
Python distribution(!) despite not writing it :-)

I might encourage others to consider the same...

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Mon Nov  8 10:33:30 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 08 Nov 1999 10:33:30 +0100
Subject: [Python-Dev] wish list
References: <199911042058.PAA15437@eric.cnri.reston.va.us>
Message-ID: <382698EA.4DBA5E4B@lemburg.com>

Guido van Rossum wrote:
> 
> * GNU CC for Win32 compatibility (compilation of python interpreter and
> "Freeze" utility). I think MingWin32 (Mummint Khan) is a good alternative
> eviting the cygwin dll user.

I think this would be a good alternative for all those not having MS VC
for one reason or another. Since Mingw32 is free this might be an
appropriate solution for e.g. schools which don't want to spend lots
of money for VC licenses.

> * Add low level programming capabilities for system access and speed of code
> fragments eviting the C-C++ or Java code use. Python, I think, must be a
> complete programming language in the "programming for every body" philosofy.

Don't know what he meant here...

> * Incorporate WxWindows (wxpython) and/or Gtk+ (now exist a win32 port) GUI
> in the standard distribution. For example, Wxpython permit an html browser.
> It is very importan for document presentations. And Wxwindows and Gtk+ are
> faster than tk.

GUIs tend to be fast moving targets, better leave them out of the
main distribution.

> * Incorporate a database system in the standard library distribution. To be
> possible with relational and documental capabilites and with import facility
> of DBASE, Paradox, MSAccess files.

Database interfaces are usually way too complicated and largish for the standard
dist. IMHO, they should always be packaged separately. Note that simple
interfaces such as a standard CSV file import/export module would be
neat extensions to the dist.

> * Incorporate a XML/HTML/Math-ML editor/browser with graphics capability (to
> be possible with XML how internal file format). And to be possible with
> Microsoft Word import export facility. For example, AbiWord project can be
> an alternative but if lacks programming language. If we can make python the
> programming language for AbiWord project...

I'm getting the feeling that Ramon is looking for a complete
visual programming environment here. XML support in the standard
dist (faster than xmllib.py) would be nice. Before that we'd need solid builtin
Unicode support though...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    53 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From captainrobbo at yahoo.com  Tue Nov  9 14:57:46 1999
From: captainrobbo at yahoo.com (Andy Robinson)
Date: Tue, 9 Nov 1999 05:57:46 -0800 (PST)
Subject: [Python-Dev] Internationalisation Case Study
Message-ID: <19991109135746.20446.rocketmail@web608.mail.yahoo.com>

Guido has asked me to get involved in this discussion,
as I've been working practically full-time on i18n for
the last year and a half and have done quite a bit
with Python in this regard.  I thought the most
helpful thing would be to describe the real-world
business problems I have been tackling so people can
understand what one might want from an encoding
toolkit.  In this (long) post I have included:
1. who I am and what I want to do
2. useful sources of info
3. a real world i18n project
4. what I'd like to see in an encoding toolkit


Grab a coffee - this is a long one.

1. Who I am
--------------
Firstly, credentials.  I'm a Python programmer by
night, and when I can involve it in my work which
happens perhaps 20% of the time.  More relevantly, I
did a postgrad course in Japanese Studies and lived in
Japan for about two years; in 1990 when I returned, I
was speaking fairly fluently and could read a
newspaper with regular reference to a dictionary. 
Since then my Japanese has atrophied badly, but it is
good enough for IT purposes.  For the last year and a
half I have been internationalizing a lot of systems -
more on this below.

My main personal interest is that I am hoping to
launch a company using Python for reporting, data
cleaning and transformation.  An encoding library is
sorely needed for this.

2. Sources of Knowledge
------------------------------
We should really go for world class advice on this. 
Some people who could really contribute to this
discussion are:
- Ken Lunde, author of "CJKV Information Processing"
and head of Asian Type Development at Adobe.  
- Jeffrey Friedl, author of "Mastering Regular
Expressions", and a long time Japan resident and
expert on things Japanese
- Maybe some of the Ruby community?

I'll list books, URLs etc. for anyone who needs them,
on request.

3. A Real World Project
----------------------------
18 months ago I was offered a contract with one of the
world's largest investment management companies (which
I will nickname HugeCo), who (after many years having
analysts out there) were launching a business in Japan
to attract savers; due to recent legal changes,
Japanese people can now freely buy into mutual funds
run by foreign firms.  Given the 2% they historically
get on their savings, and the 12% that US equities
have returned for most of this century, this is a
business with huge potential.  I've been there for a
while now, 
rotating through many different IT projects.

HugeCo runs its non-US business out of the UK.  The
core deal-processing business runs on IBM AS400s. 
These are kind of a cross between a relational
database and a file system, and speak their own
encoding called EBCDIC.    Five years ago the AS400
had limited
connectivity to everything else, so they also started
deploying Sybase databases on Unix to support some
functions.  This means 'mirroring' data between the
two systems on a regular basis.  IBM has always
included encoding information on the AS400 and it
converts from EBCDIC to ASCII on request with most of
the transfer tools (FTP, database queries etc.)

To make things work for Japan, everyone realised that
a double-byte representation would be needed. 
Japanese has about 7000 characters in most IT-related
character sets, and there are a lot of ways to store
it.  Here's a potted language lesson.  (Apologies to
people who really know this field -- I am not going to
be fully pedantic or this would take forever).

Japanese includes two phonetic alphabets (each with
about 80-90 characters), the thousands of Kanji, and
English characters, often all in the same sentence.  
The first attempt to display something was to
make a single -byte character set which included
ASCII, and a simplified (and very ugly) katakana
alphabet in the upper half of the code page.  So you
could spell out the sounds of Japanese words using
'half width katakana'. 

The basic 'character set' is Japan Industrial Standard
0208 ("JIS"). This was defined in 1978, the first
official Asian character set to be defined by a
government.   This can be thought of as a printed
chart
showing the characters - it does not define their
storage on a computer.   It defined a logical 94 x 94
grid, and each character has an index in this grid.

The "JIS" encoding was a way of mixing ASCII and
Japanese in text files and emails.  Each Japanese
character had a double-byte value. It had 'escape
sequences' to say 'You are now entering ASCII
territory' or the opposite.   In 1978 Microsoft
quickly came up with Shift-JIS, a smarter encoding. 
This basically said "Look at the next byte.  If below
127, it is ASCII; if between A and B, it is a
half-width
katakana; if between B and C, it is the first half of
a double-byte character and the next one is the second
half".  Extended Unix Code (EUC) does similar tricks. 
Both have the property that there are no control
characters, and ASCII is still ASCII.  There are a few
other encodings too.
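
In code, the byte-classification rule is simple
enough to sketch; the lead-byte ranges below are the
commonly quoted Shift-JIS ones, filled in here for
illustration (exact upper bounds vary by vendor
extension):

    def classify_shift_jis(data):
        # Walk a Shift-JIS byte string and label each unit.
        result, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:
                kind, width = 'ascii', 1
            elif 0xA1 <= b <= 0xDF:
                kind, width = 'half-width katakana', 1
            elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
                kind, width = 'double-byte character', 2  # next byte is the second half
            else:
                kind, width = 'invalid lead byte', 1
            result.append((kind, data[i:i + width]))
            i += width
        return result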

Unfortunately for me and HugeCo, IBM had their own
standard before the Japanese government did, and it
differs; it is most commonly called DBCS (Double-Byte
Character Set).  This involves shift-in and shift-out
sequences (0x16 and 0x17, cannot remember which way
round), so you can mix single and double bytes in a
field.  And we used AS400s for our core processing.

So, back to the problem.  We had a FoxPro system using
ShiftJIS on the desks in Japan which we wanted to
replace in stages, and an AS400 database to replace it
with.  The first stage was to hook them up so names
and addresses could be uploaded to the AS400, and data
files consisting of daily report input could be
downloaded to the PCs.  The AS400 supposedly had a
library which did the conversions, but no one at IBM
knew how it worked.  The people who did all the
evaluations had basically proved that 'Hello World' in
Japanese could be stored on an AS400, but never looked
at the conversion issues until mid-project. Not only
did we need a conversion filter, we had the problem
that the character sets were of different sizes.  So
it was possible - indeed, likely - that some of our
ten thousand customers' names and addresses would
contain characters only on one system or the other,
and fail to
survive a round trip.  (This is the absolute key issue
for me - will a given set of data survive a round trip
through various encoding conversions?)

We figured out how to get the AS400 to do the
conversions during a file transfer in one direction,
and I wrote some Python scripts to make up files with
each official character in JIS on a line; these went
up with conversion, came back binary, and I was able
to build a mapping table and 'reverse engineer' the
IBM encoding.  It was straightforward in theory, "fun"
in practice.  I then wrote a python library which knew
about the AS400 and Shift-JIS encodings, and could
translate a string between them.  It could also detect
corruption and warn us when it occurred.  (This is
another key issue - you will often get badly encoded
data, half a kanji or a couple of random bytes, and
need to be clear on your strategy for handling it in
any library).  It was slow, but it got us our gateway
in both directions, and it warned us of bad input. 360
characters in the DBCS encoding actually appear twice,
so perfect round trips are impossible, but practically
you can survive with some validation of input at both
ends.  The final story was that our names and
addresses were mostly safe, but a few obscure symbols
weren't.
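
The table-building step itself was nothing fancy --
conceptually just pairing up the probe file with what
came back from the host, something like the sketch
below (file names and framing invented for
illustration):

    def build_mapping(sjis_probe_file, dbcs_result_file):
        # One character per line in each file: line N of the probe file is
        # the Shift-JIS bytes we sent up, line N of the result file is the
        # DBCS bytes the host handed back after conversion.
        mapping = {}
        probe = open(sjis_probe_file, 'rb')
        result = open(dbcs_result_file, 'rb')
        for sjis, dbcs in zip(probe, result):
            mapping[sjis.rstrip(b'\r\n')] = dbcs.rstrip(b'\r\n')
        probe.close()
        result.close()
        return mapping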

A big issue was that field lengths varied.  An address
field 40 characters long on a PC might grow to 42 or
44 on an AS400 because of the shift characters, so the
software would truncate the address during import, and
cut a kanji in half.  This resulted in a string that
was illegal DBCS, and errors in the database.  To
guard against this, you need really picky input
validation.  You not only ask 'is this string valid
Shift-JIS', you check it will fit on the other system
too.

The next stage was to bring in our Sybase databases. 
Sybase make a Unicode database, which works like the
usual one except that all your SQL code suddenly
becomes case sensitive - more (unrelated) fun when
you have 2000 tables.  Internally it stores data in
UTF8, which is a 'rearrangement' of Unicode which is
much safer to store in conventional systems.
Basically, a UTF8 character is between one and three
bytes, there are no nulls or control characters, and
the ASCII characters are still the same ASCII
characters.  UTF8<->Unicode involves some bit
twiddling but is one-to-one and entirely algorithmic.
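
The bit twiddling really is as mechanical as it
sounds; for the 16-bit range we cared about, a
from-scratch encoder fits in a few lines -- a sketch
only, not what Sybase or Unilib actually do:

    def utf8_encode(codepoint):
        # Hand-rolled UTF-8 for code points up to 0xFFFF (1 to 3 bytes).
        if codepoint < 0x80:                      # 0xxxxxxx
            return bytes([codepoint])
        elif codepoint < 0x800:                   # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (codepoint >> 6),
                          0x80 | (codepoint & 0x3F)])
        elif codepoint <= 0xFFFF:                 # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (codepoint >> 12),
                          0x80 | ((codepoint >> 6) & 0x3F),
                          0x80 | (codepoint & 0x3F)])
        raise ValueError('beyond the 16-bit range; 4-byte forms not shown')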

We had a product to 'mirror' data between AS400 and
Sybase, which promptly broke when we fed it Japanese. 
The company bought a library called Unilib to do
conversions, and started rewriting the data mirror
software.  This library (like many) uses Unicode as a
central point in all conversions, and offers most of
the world's encodings.  We wanted to test it, and used
the Python routines to put together a regression
test.  As expected, it was mostly right but had some
differences, which we were at least able to document. 

We also needed to rig up a daily feed from the legacy
FoxPro database into Sybase while it was being
replaced (about six months).  We took the same
library, built a DLL wrapper around it, and I
interfaced to this with DynWin, so we were able to do
the low-level string conversion in compiled code and
the high-level 
control in Python. A FoxPro batch job wrote out
delimited text in shift-JIS; Python read this in, ran
it through the DLL to convert it to UTF8, wrote that
out as UTF8 delimited files, ftp'ed them to an 'in'
directory on the Unix box ready for daily import. 
At this point we had a lot of fun with field widths -
Shift-JIS is much more compact than UTF8 when you have
a lot of kanji (e.g. address fields).

Another issue was half-width katakana.  These were the
earliest attempt to get some form of Japanese out of a
computer, and are single-byte characters above 128 in
Shift-JIS - but are not part of the JIS0208 standard. 

They look ugly and are discouraged; but when you are
entering a long address in a field of a database, and
it won't quite fit, the temptation is to go from
two-bytes-per-character to one (just hit F7 in
Windows) to save space.  Unilib rejected these (as
would Java), but has optional modes to preserve them
or 'expand them out' to their full-width equivalents.


The final technical step was our reports package. 
This is a 4GL using a really horrible 1980s Basic-like
language which reads in fixed-width data files and
writes out Postscript; you write programs saying 'go
to x,y' and 'print customer_name', and can build up
anything you want out of that.  It's a monster to
develop in, but when done it really works - 
million page jobs no problem.  We had bought into this
on the promise that it supported Japanese; actually, I
think they had got the equivalent of 'Hello World' out
of it, since we had a lot of problems later.  

The first stage was that the AS400 would send down
fixed width data files in EBCDIC and DBCS.  We ran
these through a C++ conversion utility, again using
Unilib.  We had to filter out and warn about corrupt 
fields, which the conversion utility would reject. 
Surviving records then went into the reports program.

It then turned out that the reports program only
supported some of the Japanese alphabets. 
Specifically, it had a built-in font switching system
whereby when it encountered ASCII text, it would flip
to the most recent single-byte font, and when it found
a byte above 127, it would flip to a double-byte font.
 This is because many Chinese fonts do (or did) 
not include English characters, or included really
ugly ones.  This was wrong for Japanese, and made the
half-width katakana unprintable.  I found out that I
could control fonts if I printed one character at a
time with a special escape sequence, so wrote my own
bit-scanning code (tough in a language without ord()
or bitwise operations) to examine a string, classify
every byte, and control the fonts the way I wanted. 
So a special subroutine is used for every name or
address field.  This is apparently not unusual in GUI
development (especially web browsers) - you rarely
find a complete Unicode font, so you have to switch
fonts on the fly as you print a string.

After all of this, we had a working system and knew
quite a bit about encodings.  Then the curve ball
arrived:  User Defined Characters!

It is not true to say that there are exactly 6879
characters in Japanese, any more than you can count
the number of languages on the Indian sub-continent or
the types of cheese in France.  There are historical
variations and they evolve.  Some people's names got
missed out, and others like to write a kanji in an
unusual way.   Others arrived from China where they
have more complex variants of the same characters.  
Despite the Japanese government's best attempts, these
people have dug their heels in and want to keep their
names the way they like them.  My first reaction was
'Just Say No' - I basically said that if one of these
customers (14 out of a database of 8000) could show me
a tax form or phone bill with the correct UDC on it,
we would implement it but not otherwise (the usual
workaround is to spell their name phonetically in
katakana).  But our marketing people put their foot
down.  

A key factor is that Microsoft has 'extended the
standard' a few times.  First of all, Microsoft and
IBM include an extra 360 characters in their code page
which are not in the JIS0208 standard.   This is well
understood and most encoding toolkits know that 'Code
Page 932' is Shift-JIS plus a few extra characters. 
Secondly, Shift-JIS has a User-Defined region of a
couple of thousand characters.  They have lately been
taking Chinese variants of Japanese characters (which
are readable but a bit old-fashioned - I can imagine
pipe-smoking professors using these forms as an
affectation) and adding them into their standard
Windows fonts; so users are getting used to these
being available.  These are not in a standard. 
Thirdly, they include something called the 'Gaiji
Editor' in Japanese Win95, which lets you add new
characters to the fonts on your PC within the
user-defined region.  The first step was to review all
the PCs in the Tokyo office, and get one centralized
extension font file on a server.  This was also fun as
people had assigned different code points to
characters on different machines, so what looked
correct on your word processor was a black square on
mine.   Effectively, each company has its own custom
encoding a bit bigger than the standard.

Clearly, none of these extensions would convert
automatically to the other platforms.

Once we actually had an agreed list of code points, we
scanned the database by eye and made sure that the
relevant people were using them.  We decided that
space for 128 User-Defined Characters would  be
allowed.  We thought we would need a wrapper around
Unilib to intercept these values and do a special
conversion; but to our amazement it worked!  Somebody
had already figured out a mapping for at least 1000
characters for all the Japanese encodings, and they did
the round trips from Shift-JIS to Unicode to DBCS and
back.  So the conversion problem needed less code than
we thought.  This mapping is not defined in a standard
AFAIK (certainly not for DBCS anyway).  

We did, however, need some really impressive
validation.  When you input a name or address on any
of the platforms, the system should say 
(a) is it valid for my encoding?
(b) will it fit in the available field space in the
other platforms?
(c) if it contains user-defined characters, are they
the ones we know about, or is this a new guy who will
require updates to our fonts etc.?

Finally, we got back to the display problems.  Our
chosen range had a particular first byte. We built a
miniature font with the characters we needed starting
in the lower half of the code page.  I then
generalized my name-printing routine to say 'if the
first character is XX, throw it away, and print the
subsequent character in our custom font'.  This worked
beautifully - not only could we print everything, we
were using type 1 embedded fonts for the user defined
characters, so we could distill it and also capture it
for our internal document imaging systems.

So, that is roughly what is involved in building a
Japanese client reporting system that spans several
platforms.

I then moved over to the web team to work on our
online trading system for Japan, where I am now -
people will be able to open accounts and invest on the
web.  The first stage was to prove it all worked. 
With HTML, Java and the Web, I had high hopes, which
have mostly been fulfilled - we set an option in the
database connection to say 'this is a UTF8 database',
and Java converts it to Unicode when reading the
results, and we set another option saying 'the output
stream should be Shift-JIS' when we spew out the HTML.
 There is one limitation:  Java sticks to the JIS0208
standard, so the 360 extra IBM/Microsoft Kanji and our
user defined characters won't work on the web.  You
cannot control the fonts on someone else's web
browser; management accepted this because we gave them
no alternative.  Certain customers will need to be
warned, or asked to suggest a standard version of a
character if they want to see their name on the web. 
I really hope the web actually brings character usage
in line with the standard in due course, as it will
save a fortune.

Our system is multi-language - when a customer logs
in, we want to say 'You are a Japanese customer of our
Tokyo Operation, so you see page X in language Y'. 
The language strings are all kept in UTF8 in XML
files, so the same file can hold many languages.  This
and the database are the real-world reasons why you
want to store stuff in UTF8.  There are very few tools
to let you view UTF8, but luckily there is a free Word
Processor that lets you type Japanese and save it in
any encoding; so we can cut and paste between
Shift-JIS and UTF8 as needed.

And that's it.  No climactic endings and a lot of real
world mess, just like life in IT.  But hopefully this
gives you a feel for some of the practical stuff
internationalisation projects have to deal with.  See
my other mail for actual suggestions.

- Andy Robinson







=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.




From captainrobbo at yahoo.com  Tue Nov  9 14:58:39 1999
From: captainrobbo at yahoo.com (Andy Robinson)
Date: Tue, 9 Nov 1999 05:58:39 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>

Here are the features I'd like to see in a Python
Internationalisation Toolkit.  I'm very open to
persuasion about APIs and how to do it, but this is
roughly the functionality I would have wanted for the
last year (see separate post "Internationalization
Case Study"):

Built-in types:
---------------
"Unicode String" and "Normal String".  The normal
string can hold all 256 possible byte values and is
analogous to java's Byte Array - in other words an
ordinary Python string.  

Unicode strings iterate (and are manipulated) per
character, not per byte. You knew that already.  To
manipulate anything in a funny encoding, you convert
it to Unicode, manipulate it there, then convert it
back.

Easy Conversions
----------------------
This is modelled on Java which I think has it right. 
When you construct a Unicode string, you may supply an
optional encoding argument.  I'm not bothered if
conversion happens in a global function, a constructor
method or whatever.

MyUniString = ToUnicode('hello')                                  # assumes ASCII
MyUniString = ToUnicode('pretend this is Japanese', 'ShiftJIS')   # encoding specified

The converse applies when converting back.

The encoding designators should agree with Java.  If
data is encountered which is not valid for the
encoding, there are several strategies, and it would
be nice if they could be specified explicitly:
1. replace offending characters with a question mark
2. try to recover intelligently (possible in some
cases)
3. raise an exception

A 'Unicode' designator is needed which performs a
dummy conversion.
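
As a sketch of the API -- the function name and the
way the strategy is passed are placeholders; in later
Pythons the decode method's error arguments 'strict',
'replace' and 'ignore' cover exactly these cases:

    def ToUnicode(data, encoding='ascii', on_error='strict'):
        # data is an ordinary (byte) string; the result is a Unicode string.
        # on_error: 'strict' raises an exception, 'replace' substitutes a
        # marker character, 'ignore' silently drops the offending bytes.
        return data.decode(encoding, on_error)

    # MyUniString = ToUnicode(b'hello')                      # assumes ASCII
    # MyUniString = ToUnicode(japanese_bytes, 'shift_jis')   # explicit
    # MyUniString = ToUnicode(corrupt_bytes, 'shift_jis', 'replace')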

File Opening:  
---------------
It should be possible to work with files as we do now
- just streams of binary data.  It should also be
possible to read, say, a file of locally encoded
addresses into a Unicode string, e.g. open(myfile,
'r', 'ShiftJIS').

It should also be possible to open a raw Unicode file
and read the bytes into ordinary Python strings, or
Unicode strings.  In this case one needs to watch out
for the byte-order marks at the beginning of the file.

Not sure of a good API to do this.  We could have
OrdinaryFile objects and UnicodeFile objects, or
proliferate the arguments to 'open'.
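
One possible shape for the API, sketched with the
codecs-style reader that later Pythons grew (the
function name and file name are placeholders):

    import codecs

    def open_encoded(path, mode='r', encoding=None):
        # With no encoding you get the ordinary file object; with one, a
        # wrapper whose read()/readline() return Unicode strings.
        if encoding is None:
            return open(path, mode)
        return codecs.open(path, mode, encoding=encoding)

    # for line in open_encoded('addresses.txt', 'r', 'shift_jis'):
    #     ...   # each line is already a Unicode string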

Doing the Conversions
----------------------------
All conversions should go through Unicode as the
central point.  

Here is where we can start to define the territory.

Some conversions are algorithmic, some are lookups,
many are a mixture with some simple state transitions
(e.g. shift characters to denote switches from
double-byte to single-byte).  I'd like to see an
'encoding engine' modelled on something like
mxTextTools - a state machine with a few simple
actions, effectively a mini-language for doing simple
operations.  Then a new encoding can be added in a
data-driven way, and still go at C-like speeds. 
Making this open and extensible (and preferably not
needing to code C to do it) is the only way I can see
to get a really good solid encodings library.  Not all
encodings need go in the standard distribution, but
all should be downloadable from www.python.org.

A generalized two-byte-to-two-byte mapping is 128kb. 
But there are compact forms which can reduce these to
a few kb, and also make the data intelligible. It is
obviously desirable to store stuff compactly if we can
unpack it fast.
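
A toy illustration of the data-driven idea -- the
engine below knows nothing about any particular
encoding, it just walks a table, so adding an encoding
means shipping data rather than new C code (the single
table entry is made up for the example):

    # Map (lead byte, trail byte) -> Unicode code point.
    SAMPLE_TABLE = {
        (0x88, 0x9F): 0x4E9C,      # illustrative entry only
    }

    def decode_double_byte(data, table):
        out, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:                       # ASCII passes through
                out.append(chr(b))
                i += 1
            else:
                if i + 1 >= len(data):
                    raise ValueError('truncated character at offset %d' % i)
                codepoint = table.get((b, data[i + 1]))
                if codepoint is None:
                    raise ValueError('unmapped pair at offset %d' % i)
                out.append(chr(codepoint))
                i += 2
        return ''.join(out)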


Typed Strings
----------------
When you are writing data conversion tools to sit in
the middle of a bunch of databases, you could save a
lot of grief with a string that knows its encoding. 
What follows could be done as a Python wrapper around
ordinary strings rather than as a new type,
and thus need not be part of the language.  

This is analogous to Martin Fowler's Quantity pattern
in Analysis Patterns, where a number knows its units
and you cannot add dollars and pounds accidentally.  

These would do implicit conversions; and they would
stop you assigning or confusing differently encoded
strings.  They would also validate when constructed. 
'Typecasting' would be allowed but would require
explicit code.  So maybe something like...

>>> ts1 = TypedString('hello', 'cp932ms')   # specify encoding; it remembers it
>>> ts2 = TypedString('goodbye', 'cp5035')
>>> ts1 + ts2                    # or any of a host of other encoding options
EncodingError
>>> ts3 = TypedString(ts1, 'cp5035')   # converts it implicitly, going via Unicode
>>> ts4 = ts1.cast('ShiftJIS')   # the developer knows that in this case the string is compatible
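
A minimal sketch of the wrapper -- not a proposed
final API, it just shows that the bookkeeping is
cheap:

    class TypedString:
        # A byte string that remembers its encoding.  Construction
        # validates; mixing encodings in '+' is refused; converting a
        # TypedString to another encoding goes via Unicode; cast()
        # reinterprets the raw bytes without converting them.
        def __init__(self, data, encoding):
            if isinstance(data, TypedString):
                data = data.text.encode(encoding)     # implicit conversion
            if not isinstance(data, bytes):
                data = data.encode(encoding)
            self.raw, self.encoding = data, encoding
            self.text = data.decode(encoding)         # validates on construction

        def __add__(self, other):
            if other.encoding != self.encoding:
                raise ValueError('cannot mix %s and %s'
                                 % (self.encoding, other.encoding))
            return TypedString(self.raw + other.raw, self.encoding)

        def cast(self, encoding):
            return TypedString(self.raw, encoding)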


Going Deeper
----------------
The project I describe involved many more issues than
just a straight conversion.  I envisage an encodings
package or module which power users could get at
directly.  

We have be able to answer the questions:

'is string X a valid instance of encoding Y?'
'is string X nearly a valid instance of encoding Y,
maybe with a little corruption, or is it something
totally different?' - this one might be a task left to
a programmer, but the toolkit should help where it
can.

'can string X be converted from encoding Y to encoding
Z without loss of data?  If not, exactly what will get
trashed' ?  This is a really useful utility.
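
That last utility is easy to sketch on top of any
codec set (a sketch only; a real version would also
want to report exactly which characters get trashed):

    def survives_round_trip(data, enc_from, enc_to):
        # Does this byte string survive enc_from -> Unicode -> enc_to ->
        # Unicode -> enc_from unchanged?
        try:
            text = data.decode(enc_from)
            back = text.encode(enc_to).decode(enc_to).encode(enc_from)
        except (UnicodeError, LookupError):
            return False
        return back == data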

More generally, I want tools to reason about character
sets and encodings.  I have 'Character Set' and
'Character Mapping' classes - very app-specific and
proprietary - which let me express and answer
questions about whether one character set is a
superset of another and reason about round trips.  I'd
like to do these properly for the toolkit.  They would
need some C support for speed, but I think they could
still be data driven.   So we could have an Encoding
object which could be pickled, and we could keep a
directory full of them as our database.  There might
actually be two encoding objects - one for
single-byte, one for multi-byte, with the same API.

There are so many subtle differences between encodings
(even within the Shift-JIS family) - company X has ten
extra characters, and that is technically a new
encoding.  So it would be really useful to reason
about these and say 'find me all JIS-compatible
encodings', or 'report on the differences between
Shift-JIS and 'cp932ms'.

GUI Issues
-------------
The new Pythonwin breaks somewhat on Japanese - editor
windows are fine but console output is shown as
single-byte garbage.  I will try to evaluate IDLE on a
Japanese test box this week.  I think these two need
to work for double-byte languages for our credibility.

Verifiability and printing
-----------------------------
We will need to prove it all works.  This means
looking at text on a screen or on paper.  A really
wicked demo utility would be a GUI which could open
files and convert encodings in an editor window or
spreadsheet window, and specify conversions on
copy/paste.  If it could save a page as HTML (just an
encoding tag and the data between tags), then we
could use Netscape/IE for verification.  Better still,
a web server demo could convert on python.org and tag
the pages appropriately - browsers support most common
encodings.

All the encoding stuff is ultimately a bit meaningless
without a way to display a character.  I am hoping
that PDF and PDFgen may add a lot of value here. 
Adobe (and Ken Lunde) have spent years coming up with
a general architecture for this stuff in PDF. 
Basically, the multi-byte fonts they use are encoding
independent, and come with a whole bunch of mapping
tables.  So I can ask for the same Japanese font in
any of about ten encodings - font name is a
combination of face name and encoding.  The font
itself does the remapping.  They make available
downloadable font packs for Acrobat 4.0 for most
languages now; these are good places to raid for
building encoding databases.  

It also means that I can write a Python script to
crank out beautiful-looking code page charts for all
of our encodings from the database, and input and
output to regression tests.  I've done it for
Shift-JIS at Fidelity, and would have to rewrite it
once I am out of here.  But I think that some good
graphic design here would lead to a product that blows
people away - an encodings library that can print out
its own contents for viewing and thus help demonstrate
its own correctness (or make errors stick out like a
sore thumb).

Am I mad?  Have I put you off forever?  What I outline
above would be a serious project needing months of
work; I'd be really happy to take a role, if we could
find sponsors for the project.  But I believe we could
define the standard for years to come.  Furthermore,
it would go a long way to making Python the corporate
choice for data cleaning and transformation -
territory I think we should own.

Regards,

Andy Robinson
Robinson Analytics Ltd.









=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.




From guido at CNRI.Reston.VA.US  Tue Nov  9 17:46:41 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 09 Nov 1999 11:46:41 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: Your message of "Tue, 09 Nov 1999 05:58:39 PST."
             <19991109135839.25864.rocketmail@web607.mail.yahoo.com> 
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> 
Message-ID: <199911091646.LAA21467@eric.cnri.reston.va.us>

Andy,

Thanks a bundle for your case study and your toolkit proposal.  It's
interesting that you haven't touched upon internationalization of user
interfaces (dialog text, menus etc.) -- that's a whole nother can of
worms.

Marc-Andre Lemburg has a proposal for work that I'm asking him to do
(under pressure from HP who want Python i18n badly and are willing to
pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt

I think his proposal will go a long way towards your toolkit.  I hope
to hear soon from anybody who disagrees with Marc-Andre's proposal,
because without opposition this is going to be Python 1.6's offering
for i18n...  (Together with a new Unicode regex engine by /F.)

One specific question: in your discussion of typed strings, I'm not
sure why you couldn't convert everything to Unicode and be done with
it.  I have a feeling that the answer is somewhere in your case study
-- maybe you can elaborate?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From akuchlin at mems-exchange.org  Tue Nov  9 18:21:03 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 9 Nov 1999 12:21:03 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <14376.22527.323888.677816@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>I think his proposal will go a long way towards your toolkit.  I hope
>to hear soon from anybody who disagrees with Marc-Andre's proposal,
>because without opposition this is going to be Python 1.6's offering
>for i18n...  

The proposal seems reasonable to me.

>(Together with a new Unicode regex engine by /F.)

This is good news!  Would it be a from-scratch regex implementation,
or would it be an adaptation of an existing engine?  Would it involve
modifications to the existing re module, or a completely new unicodere
module?  (If, unlike re.py, it has POSIX longest-match semantics, that
would pretty much settle the question.)

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
All around me darkness gathers, fading is the sun that shone, we must speak of
other matters, you can be me when I'm gone...
    -- The train's clattering, in SANDMAN #67: "The Kindly Ones:11"




From guido at CNRI.Reston.VA.US  Tue Nov  9 18:26:38 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 09 Nov 1999 12:26:38 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: Your message of "Tue, 09 Nov 1999 12:21:03 EST."
             <14376.22527.323888.677816@amarok.cnri.reston.va.us> 
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com> <199911091646.LAA21467@eric.cnri.reston.va.us>  
            <14376.22527.323888.677816@amarok.cnri.reston.va.us> 
Message-ID: <199911091726.MAA21754@eric.cnri.reston.va.us>

[AMK]
> The proposal seems reasonable to me.

Thanks.  I really hope that this time we can move forward united...

> >(Together with a new Unicode regex engine by /F.)
> 
> This is good news!  Would it be a from-scratch regex implementation,
> or would it be an adaptation of an existing engine?  Would it involve
> modifications to the existing re module, or a completely new unicodere
> module?  (If, unlike re.py, it has POSIX longest-match semantics, that
> would pretty much settle the question.)

It's from scratch, and I believe it's got Perl style, not POSIX style
semantics -- per Tim Peters' recommendations.  Do we need to open the
discussion again?

It involves a redone re module (supporting Unicode as well as 8-bit),
but its API could be unchanged.  /F does the parsing and compilation
in Python, only the matching engine is in C -- not sure how that
impacts performance, but I imagine with aggressive caching it would be
okay.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From akuchlin at mems-exchange.org  Tue Nov  9 18:40:07 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 9 Nov 1999 12:40:07 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091726.MAA21754@eric.cnri.reston.va.us>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<14376.22527.323888.677816@amarok.cnri.reston.va.us>
	<199911091726.MAA21754@eric.cnri.reston.va.us>
Message-ID: <14376.23671.250752.637144@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>It's from scratch, and I believe it's got Perl style, not POSIX style
>semantics -- per Tim Peters' recommendations.  Do we need to open the
>discussion again?

No, no; I'm actually happier with Perl-style, because it's far better
documented and familiar to people. Worse *is* better, after all.

My concern is simply that I've started translating re.py into C, and
wonder how this affects the translation.  This isn't a pressing issue,
because the C version isn't finished yet.

>It involves a redone re module (supporting Unicode as well as 8-bit),
>but its API could be unchanged.  /F does the parsing and compilation
>in Python, only the matching engine is in C -- not sure how that
>impacts performance, but I imagine with aggressive caching it would be
>okay.

Can I get my paws on a copy of the modified re.py to see what
ramifications it has, or is this all still an unreleased
work-in-progress?

Doing the compilation in Python is a good idea, and will make it
possible to implement alternative syntaxes.  I would have liked to
make it possible to generate PCRE bytecodes from Python, but what
stopped me is the chance of bogus bytecode causing the engine to dump
core, loop forever, or some other nastiness.  (This is particularly
important for code that uses rexec.py, because you'd expect regexes to
be safe.)  Fixing the engine to be stable when faced with bad
bytecodes appears to require many additional checks that would slow
down the common case of correct code, which is unappealing.


-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Anybody else on the list got an opinion? Should I change the language or not?
    -- Guido van Rossum, 28 Dec 91




From ping at lfw.org  Tue Nov  9 19:08:05 1999
From: ping at lfw.org (Ka-Ping Yee)
Date: Tue, 9 Nov 1999 10:08:05 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <14376.23671.250752.637144@amarok.cnri.reston.va.us>
Message-ID: 

On Tue, 9 Nov 1999, Andrew M. Kuchling wrote:
> Guido van Rossum writes:
> >It's from scratch, and I believe it's got Perl style, not POSIX style
> >semantics -- per Tim Peters' recommendations.  Do we need to open the
> >discussion again?
> 
> No, no; I'm actually happier with Perl-style, because it's far better
> documented and familiar to people. Worse *is* better, after all.

I would concur with the preference for Perl-style semantics.
Aside from the issue of consistency with other scripting
languages, i think it's easier to predict the behaviour of
these semantics.  You can run the algorithm in your head,
and try the backtracking yourself.  It's good for the algorithm
to be predictable and well understood.

> Doing the compilation in Python is a good idea, and will make it
> possible to implement alternative syntaxes.

Also agree.  I still have some vague wishes for a simpler,
more readable (more Pythonian?) way to express patterns --
perhaps not as powerful as full regular expressions, but
useful for many simpler cases (an 80-20 solution).


-- ?!ng




From bwarsaw at cnri.reston.va.us  Tue Nov  9 19:15:04 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Tue, 9 Nov 1999 13:15:04 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<14376.22527.323888.677816@amarok.cnri.reston.va.us>
	<199911091726.MAA21754@eric.cnri.reston.va.us>
	<14376.23671.250752.637144@amarok.cnri.reston.va.us>
Message-ID: <14376.25768.368164.88151@anthem.cnri.reston.va.us>

>>>>> "AMK" == Andrew M Kuchling  writes:

    AMK> No, no; I'm actually happier with Perl-style, because it's
    AMK> far better documented and familiar to people. Worse *is*
    AMK> better, after all.

Plus, you can't change re's semantics and I think it makes sense if
the Unicode engine is as close semantically as possible to the
existing engine.

We need to be careful not to worsen performance for 8bit strings.  I
think we're already on the edge of acceptability w.r.t. P*** and
hopefully we can /improve/ performance here.

MAL's proposal seems quite reasonable.  It would be excellent to see
these things done for Python 1.6.  There's still some discussion on
supporting internationalization of applications, e.g. using gettext,
but I think those are smaller in scope.

-Barry



From akuchlin at mems-exchange.org  Tue Nov  9 20:36:28 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 9 Nov 1999 14:36:28 -0500 (EST)
Subject: [Python-Dev] I18N Toolkit
In-Reply-To: <14376.25768.368164.88151@anthem.cnri.reston.va.us>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<14376.22527.323888.677816@amarok.cnri.reston.va.us>
	<199911091726.MAA21754@eric.cnri.reston.va.us>
	<14376.23671.250752.637144@amarok.cnri.reston.va.us>
	<14376.25768.368164.88151@anthem.cnri.reston.va.us>
Message-ID: <14376.30652.201552.116828@amarok.cnri.reston.va.us>

Barry A. Warsaw writes:
(in relation to support for Unicode regexes)
>We need to be careful not to worsen performance for 8bit strings.  I
>think we're already on the edge of acceptability w.r.t. P*** and
>hopefully we can /improve/ performance here.

I don't think that will be a problem, given that the Unicode engine
would be a separate C implementation.  A bit of 'if type(strg) ==
UnicodeType' in re.py isn't going to cost very much speed.
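
For illustration, the kind of cheap type test meant here (modern spelling:
today's Unicode type is plain str and 8-bit strings are bytes; re stands in
for both engines):

    import re

    def search(pattern, string):
        # route 8-bit strings and Unicode strings to separately
        # compiled patterns, as re.py might do internally
        if isinstance(string, bytes):
            return re.search(pattern.encode("ascii"), string)
        return re.search(pattern, string)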

(Speeding up PCRE -- that's another question.  I'm often tempted to
rewrite pcre_compile to generate an easier-to-analyse parse tree,
instead of its current complicated-but-memory-parsimonious compiler,
but I'm very reluctant to introduce a fork like that.)

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
The world does so well without me, that I am moved to wish that I could do
equally well without the world.
    -- Robertson Davies, _The Diary of Samuel Marchbanks_




From mhammond at skippinet.com.au  Tue Nov  9 23:27:45 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 10 Nov 1999 09:27:45 +1100
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <001c01bf2b01$a58d5d50$0501a8c0@bobcat>

> I think his proposal will go a long way towards your toolkit.  I
hope
> to hear soon from anybody who disagrees with Marc-Andre's proposal,

No disagreement as such, but a small hole:


From tim_one at email.msn.com  Wed Nov 10 06:57:14 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 10 Nov 1999 00:57:14 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091726.MAA21754@eric.cnri.reston.va.us>
Message-ID: <000001bf2b40$70183840$d82d153f@tim>

[Guido, on "a new Unicode regex engine by /F"]

> It's from scratch, and I believe it's got Perl style, not POSIX style
> semantics -- per Tim Peters' recommendations.  Do we need to open the
> discussion again?

No, but I get to whine just a little:  I didn't recommend either
approach.  I asked many futile questions about HP's requirements, and
sketched implications either way.  If HP *has* a requirement wrt
POSIX-vs-Perl, it would be good to find that out before it's too late.

I personally prefer POSIX semantics -- but, as Andrew so eloquently said,
worse is better here; all else being equal it's best to follow JPython's
Perl-compatible re lead.

last-time-i-ever-say-what-i-really-think-ly y'rs  - tim





From tim_one at email.msn.com  Wed Nov 10 07:25:07 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 10 Nov 1999 01:25:07 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <000201bf2b44$55b8ad00$d82d153f@tim>

> Marc-Andre Lemburg has a proposal for work that I'm asking him to do
> (under pressure from HP who want Python i18n badly and are willing to
> pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt

I can't make time for a close review now.  Just one thing that hit my eye
early:

    Python should provide a built-in constructor for Unicode strings
    which is available through __builtins__:

    u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])

    u = u'<utf-8 encoded Python string>'

Two points on the Unicode literals (u'abc'):

UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by
hand -- it breaks apart and rearranges bytes at the bit level, and
everything other than 7-bit ASCII requires solid strings of "high-bit"
characters.  This is painful for people to enter manually on both counts --
and no common reference gives the UTF-8 encoding of glyphs directly.  So, as
discussed earlier, we should follow Java's lead and also introduce a \u
escape sequence:

    octet:           hexdigit hexdigit
    unicodecode:     octet octet
    unicode_escape:  "\\u" unicodecode

Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
Unicode character at the unicodecode code position.  For consistency, then,
it should probably expand the same way inside "regular strings" too.  Unlike
Java, I'd rather not give it a meaning outside string literals.
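
For reference, the expansion for a basic-plane code point works out to
this (a sketch in today's Python, not part of the proposal):

    def utf8_of(codepoint):
        # pack one \uxxxx code point (0..0xFFFF) into UTF-8 bytes
        if codepoint < 0x80:
            return bytes([codepoint])
        if codepoint < 0x800:
            return bytes([0xC0 | (codepoint >> 6),
                          0x80 | (codepoint & 0x3F)])
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])

    assert utf8_of(0x00F6) == b'\xc3\xb6'    # o-umlaut: two "high-bit" bytes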

The other point is a nit:  The vast bulk of UTF-8 encodings encode
characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
those must either be explicitly outlawed, or explicitly defined.  I vote for
outlawed, in the sense of detected error that raises an exception.  That
leaves our future options open.

BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
inverse in the Unicode world?  Both seem essential.

international-in-spite-of-himself-ly y'rs  - tim





From fredrik at pythonware.com  Wed Nov 10 09:08:06 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 09:08:06 +0100
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> http://starship.skyport.net/~lemburg/unicode-proposal.txt

Marc-Andre writes:

    The internal format for Unicode objects should either use a Python
    specific fixed cross-platform format  (e.g. 2-byte
    little endian byte order) or a compiler provided wchar_t format (if
    available). Using the wchar_t format will ease embedding of Python in
    other Unicode aware applications, but will also make internal format
    dumps platform dependent. 

having been there and done that, I strongly suggest
a third option: a 16-bit unsigned integer, in platform
specific byte order (PY_UNICODE_T).  along all other
roads lie code bloat and speed penalties...

(besides, this is exactly how it's already done in
unicode.c and what 'sre' prefers...)






From captainrobbo at yahoo.com  Wed Nov 10 09:09:26 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 00:09:26 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>

In general, I like this proposal a lot, but I think it
only covers half the story.  How we actually build the
encoder/decoder for each encoding is a very big issue.
 Thoughts on this below.

First, a little nit:
>  u = u'<utf-8 encoded Python string>'
I don't like using funny prime characters - why not an
explicit function like "utf8()"?


On to the important stuff:
>  unicodec.register(<encname>,<encoder>,<decoder>
>  [,<stream_encoder>, <stream_decoder>])

> This registers the codecs under the given encoding
> name in the module global dictionary
> unicodec.codecs. Stream codecs are optional:
> the unicodec module will provide appropriate
> wrappers around <encoder> and
> <decoder> if not given.

I would MUCH prefer a single 'Encoding' class or type
to wrap up these things, rather than up to four
disconnected objects/functions.  Essentially it would
be an interface standard and would offer methods to do
the four things above.  

There are several reasons for this.  
(1) there are quite a lot of things you might want to
do with an encoding object, and we could extend the
interface in future easily.  As a minimum, give it the
four methods implied by the above, two of which can be
defaults.  But I'd like an encoding to be able to tell
me the set of characters to which it applies; validate
a string; and maybe tell me if it is a subset or
superset of another.

(2) especially with double-byte encodings, they will
need to load up some kind of database on startup and
use this for both encoding and decoding - much better
to share it and encapsulate it inside one object

(3) for some languages, there are extra functions
wanted.  For Japanese, you need two or three functions
to expand half-width to full-width katakana, convert
double-byte english to single-byte and vice versa.  A
Japanese encoding object would be a handy place to put
this knowledge.

(4) In the real world you get many encodings which are
subtle variations of the same thing, plus or minus a
few characters.  One bit of code might be able to
share the work of several encodings, by setting a few
flags.  Certainly true of Japanese.

(5) encoding/decoding algorithms can be program or
data or (very often) a bit of both.  We have not yet
discussed where to keep all the mapping tables, but if
data is involved it should be hidden in an object.

(6) See my comments on a state machine for doing the
encodings.  If this is done well, we might have two
different standard objects which conform to the
Encoding interface (a really light one for single-byte
encodings, and a bigger one for multi-byte), and
everything else could be data driven.  

(7) Easy to grow - encodings can be prototyped and
proven in Python, ported to C if needed or when ready.
 

In summary, firm up the concept of an Encoding object
and give it room to grow - that's the key to
real-world usefulness.   If people feel the same way
I'll have a go at an interface for that, and try show
how it would have simplified specific problems I have
faced.

We also need to think about where encoding info will
live.  You cannot avoid mapping tables, although you
can hide them inside code modules or pickled objects
if you want.  Should there be a standard 
"..\Python\Enc" directory?

And we're going to need some kind of testing and
certification procedure when adding new encodings. 
This stuff has to be right.  

Guido asked about TypedString.  This can probably be
done on top of the built-in stuff - it is just a
convenience which would clarify intent, reduce lines
of code and prevent people shooting themselves in the
foot when juggling a lot of strings in different
(non-Unicode) encodings.  I can do a Python module to
implement that on top of whatever is built.


Regards,

Andy








=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.




From fredrik at pythonware.com  Wed Nov 10 09:14:21 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 09:14:21 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000201bf2b44$55b8ad00$d82d153f@tim>
Message-ID: <00f501bf2b53$9872e610$f29b12c2@secret.pythonware.com>

Tim Peters wrote:
> UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by
> hand -- it breaks apart and rearranges bytes at the bit level, and
> everything other than 7-bit ASCII requires solid strings of "high-bit"
> characters.

unless you're using a UTF-8 aware editor, of course ;-)

(some days, I think we need some way to tell the compiler
what encoding we're using for the source file...)

> This is painful for people to enter manually on both counts --
> and no common reference gives the UTF-8 encoding of glyphs
> directly.  So, as discussed earlier, we should follow Java's lead
> and also introduce a \u escape sequence:
> 
>     octet:           hexdigit hexdigit
>     unicodecode:     octet octet
>     unicode_escape:  "\\u" unicodecode
> 
> Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
> Unicode character at the unicodecode code position.  For consistency, then,
> it should probably expand the same way inside "regular strings" too.  Unlike
> Java does, I'd rather not give it a meaning outside string literals.

good idea.  and for some reason, a patch for this is already included
in the unicode distribution (see the attached str2utf.c).

> The other point is a nit:  The vast bulk of UTF-8 encodings encode
> characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
> those must either be explicitly outlawed, or explicitly defined.  I vote for
> outlawed, in the sense of detected error that raises an exception.  That
> leaves our future options open.

I vote for 'outlaw'.




/* A small code snippet that translates \uxxxx syntax to UTF-8 text.
   To be cut and pasted into Python/compile.c */

/* Written by Fredrik Lundh, January 1999. */

/* Documentation (for the language reference):

\uxxxx -- Unicode character with hexadecimal value xxxx.  The
character is stored using UTF-8 encoding, which means that this
sequence can result in up to three encoded characters.

Note that the 'u' must be followed by four hexadecimal digits.  If
fewer digits are given, the sequence is left in the resulting string
exactly as given.  If more digits are given, only the first four are
translated to Unicode, and the remaining digits are left in the
resulting string.

*/

#include <ctype.h>
#include <stdio.h>

#define Py_CHARMASK(ch) ch

void
convert(const char *s, char *p)
{
    while (*s) {
        if (*s != '\\') {
            *p++ = *s++;
            continue;
        }
        s++;
        switch (*s++) {

/* -------------------------------------------------------------------- */
/* copy this section to the appropriate place in compile.c... */

        case 'u':
            /* \uxxxx => UTF-8 encoded unicode character */
            if (isxdigit(Py_CHARMASK(s[0])) && isxdigit(Py_CHARMASK(s[1])) &&
                isxdigit(Py_CHARMASK(s[2])) && isxdigit(Py_CHARMASK(s[3]))) {
                /* fetch hexadecimal character value */
                unsigned int n, ch = 0;
                for (n = 0; n < 4; n++) {
                    int c = Py_CHARMASK(*s);
                    s++;
                    ch = (ch << 4) & ~0xF;
                    if (isdigit(c))
                        ch += c - '0';
                    else if (islower(c))
                        ch += 10 + c - 'a';
                    else
                        ch += 10 + c - 'A';
                }
                /* store as UTF-8 */
                if (ch < 0x80)
                    *p++ = (char) ch;
                else {
                    if (ch < 0x800) {
                        *p++ = 0xc0 | (ch >> 6);
                        *p++ = 0x80 | (ch & 0x3f);
                    } else {
                        *p++ = 0xe0 | (ch >> 12);
                        *p++ = 0x80 | ((ch >> 6) & 0x3f);
                        *p++ = 0x80 | (ch & 0x3f);
                    }
                }
                break;
            } else
                goto bogus;

/* -------------------------------------------------------------------- */

        default:

bogus:      *p++ = '\\';
            *p++ = s[-1];
            break;
        }
    }
    *p++ = '\0';
}

int main(void)
{
    int i;
    unsigned char buffer[100];

    convert("Link\\u00f6ping", (char *) buffer);

    for (i = 0; buffer[i]; i++)
        if (buffer[i] < 0x20 || buffer[i] >= 0x80)
            printf("\\%03o", buffer[i]);
        else
            printf("%c", buffer[i]);

    return 0;
}





From gstein at lyra.org  Thu Nov 11 10:18:52 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 01:18:52 -0800 (PST)
Subject: [Python-Dev] Re: Internal Format
In-Reply-To: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>
Message-ID: 

On Wed, 10 Nov 1999, Fredrik Lundh wrote:
> Marc-Andre writes:
> 
>     The internal format for Unicode objects should either use a Python
>     specific fixed cross-platform format  (e.g. 2-byte
>     little endian byte order) or a compiler provided wchar_t format (if
>     available). Using the wchar_t format will ease embedding of Python in
>     other Unicode aware applications, but will also make internal format
>     dumps platform dependent. 
> 
> having been there and done that, I strongly suggest
> a third option: a 16-bit unsigned integer, in platform
> specific byte order (PY_UNICODE_T).  along all other
> roads lie code bloat and speed penalties...

I agree 100% !!

wchar_t will introduce portability issues right on up into the Python
level. A fixed cross-platform byte order introduces speed issues and OS
interoperability issues, yet solves no portability problems (Byte Order
Marks should still be present and used).

There are two "platforms" out there that use Unicode: Win32 and Java. They
both use UCS-2, AFAIK.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From fredrik at pythonware.com  Wed Nov 10 09:24:16 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 09:24:16 +0100
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us>
Message-ID: <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> One specific question: in you discussion of typed strings, I'm not
> sure why you couldn't convert everything to Unicode and be done with
> it.  I have a feeling that the answer is somewhere in your case study
> -- maybe you can elaborate?

Marc-Andre writes:

    Unicode objects should have a pointer to a cached (read-only) char
    buffer holding the object's value using the current
    <default encoding>.  This is needed for performance and internal
    parsing (see below) reasons. The buffer is filled when the first
    conversion request to the <default encoding> is issued on the object.

keeping track of an external encoding is better left
for the application programmers -- I'm pretty sure that
different application builders will want to handle this
in radically different ways, depending on their
environment, underlying user interface toolkit, etc.

besides, this is how Tcl would have done it.  Python's
not Tcl, and I think you need *very* good arguments
for moving in that direction.






From mal at lemburg.com  Wed Nov 10 10:04:39 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 10:04:39 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <001c01bf2b01$a58d5d50$0501a8c0@bobcat>
Message-ID: <38293527.3CF5C7B0@lemburg.com>

Mark Hammond wrote:
> 
> > I think his proposal will go a long way towards your toolkit.  I
> hope
> > to hear soon from anybody who disagrees with Marc-Andre's proposal,
> 
> No disagreement as such, but a small hole:
> 
> From the proposal:
> 
> Internal Argument Parsing:
> --------------------------
> ...
> 's':    For Unicode objects: auto convert them to the <default encoding>
>         and return a pointer to the object's <default encoding> buffer.
> 
> --
> Excellent - if someone passes a Unicode object, it can be
> auto-converted to a string.  This will allow "open()" to accept
> Unicode strings.

Well almost... it depends on the current value of the <default encoding>.
If it's UTF-8 and you only use plain ASCII characters, the above is indeed
true, but UTF-8 can go far beyond ASCII and use up to 3 bytes per
character (for UCS2 code points, even more for UCS4). With the
<default encoding> set to other exotic encodings this is likely to fail though.
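
Concretely, in today's spelling (just to illustrate the byte counts, not
proposal code):

    assert len("A".encode("utf-8")) == 1
    assert len("\u00e9".encode("utf-8")) == 2     # e-acute
    assert len("\u4e2d".encode("utf-8")) == 3     # a CJK ideograph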
 
> However, there doesnt appear to be a reverse.  Eg, if my extension
> module interfaces to a library that uses Unicode natively, how can I
> get a Unicode object when the user passes a string?  If I had to
> explicitely check for a string, then check for a Unicode on failure it
> would get messy pretty quickly...  Is it not possible to have "U" also
> do a conversion?

"U" is meant to simplify checks for Unicode objects, much like "S".
It returns a reference to the object. Auto-conversions are not possible
due to this, because they would create new objects which don't get
properly garbage collected later on.

Another problem is that Unicode types differ between platforms
(MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit
wchar_t). Depending on the internal format of Unicode objects
this could mean calling different conversion APIs.

BTW, I'm still not too sure about the underlying internal format.
The problem here is that Unicode started out as a 2-byte fixed length
representation (UCS2) but then shifted towards a 4-byte fixed length
representation known as UCS4. Since having 4 bytes per character
is a hard sell to customers, UTF16 was created to stuff the UCS4
code points (this is what character entities are called in Unicode)
into 2 bytes... with a variable length encoding.

Some platforms that started early into the Unicode business
such as the MS ones use UCS2 as wchar_t, while more recent
ones (e.g. the glibc2 on Linux) use UCS4 for wchar_t. I haven't
yet checked in what ways the two are compatible (I would suspect
the top bytes in UCS4 being 0 for UCS2 codes), but would like
to hear whether it wouldn't be a better idea to use UTF16
as internal format. The latter works in 2 bytes for most
characters and conversion to UCS2|4 should be fast. Still,
conversion to UCS2 could fail.

The downside of using UTF16: it is a variable length format,
so iterations over it will be slower than for UCS4.

Simply sticking to UCS2 is probably out of the question,
since Unicode 3.0 requires UCS4 and we are targetting
Unicode 3.0.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed Nov 10 10:49:01 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 10:49:01 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000201bf2b44$55b8ad00$d82d153f@tim>
Message-ID: <38293F8D.F60AE605@lemburg.com>

Tim Peters wrote:
> 
> > Marc-Andre Lemburg has a proposal for work that I'm asking him to do
> > (under pressure from HP who want Python i18n badly and are willing to
> > pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt
> 
> I can't make time for a close review now.  Just one thing that hit my eye
> early:
> 
>     Python should provide a built-in constructor for Unicode strings
>     which is available through __builtins__:
> 
>     u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
> 
>     u = u'<utf-8 encoded Python string>'
> 
> Two points on the Unicode literals (u'abc'):
> 
> UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by
> hand -- it breaks apart and rearranges bytes at the bit level, and
> everything other than 7-bit ASCII requires solid strings of "high-bit"
> characters.  This is painful for people to enter manually on both counts --
> and no common reference gives the UTF-8 encoding of glyphs directly.  So, as
> discussed earlier, we should follow Java's lead and also introduce a \u
> escape sequence:
> 
>     octet:           hexdigit hexdigit
>     unicodecode:     octet octet
>     unicode_escape:  "\\u" unicodecode
> 
> Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
> Unicode character at the unicodecode code position.  For consistency, then,
> it should probably expand the same way inside "regular strings" too.  Unlike
> Java does, I'd rather not give it a meaning outside string literals.

It would be more consistent to use the Unicode ordinal (instead of
interpreting the number as a UTF-8 encoding), e.g. \u03C0 for Pi. The
codes are easy to look up in the standard's UnicodeData.txt file or the
Unicode book for that matter.
 
> The other point is a nit:  The vast bulk of UTF-8 encodings encode
> characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
> those must either be explicitly outlawed, or explicitly defined.  I vote for
> outlawed, in the sense of detected error that raises an exception.  That
> leaves our future options open.

See my other post for a discussion of UCS4 vs. UTF16 vs. UCS2.

Perhaps we could add a flag to Unicode objects stating whether the characters
can be treated as UCS4 limited to the lower 16 bits (UCS4 and UTF16 are
the same in most ranges).

This flag could then be used to choose optimized algorithms for scanning
the strings. Fredrik's implementation currently uses UCS2, BTW.

> BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
> inverse in the Unicode world?  Both seem essential.

Good points.

How about 

  uniord(u[:1]) --> Unicode ordinal number (32-bit)

  unichr(i) --> Unicode object for character i (provided it is 32-bit);
                ValueError otherwise

They are inverses of each other, but note that Unicode allows 
private encodings too, which will of course not necessarily make
it across platforms or even from one PC to the next (see Andy Robinson's
interesting case study).
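
A sketch of the intended behaviour in today's Python, where chr()/ord()
already cover the full code point range (the names mirror the proposal;
the range check uses the modern Unicode limit rather than a full 32 bits):

    def uniord(u):
        # ordinal of a single Unicode character
        if len(u) != 1:
            raise TypeError("expected a single character")
        return ord(u)

    def unichr(i):
        # Unicode character for ordinal i; ValueError if out of range
        if not 0 <= i <= 0x10FFFF:
            raise ValueError("ordinal out of range")
        return chr(i)

    assert uniord(unichr(0x03C0)) == 0x03C0    # GREEK SMALL LETTER PI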

I've uploaded a new version of the proposal (0.3) to the URL:

	http://starship.skyport.net/~lemburg/unicode-proposal.txt

Thanks,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fredrik at pythonware.com  Wed Nov 10 11:50:05 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 11:50:05 +0100
Subject: regexp performance (Re: [Python-Dev] I18N Toolkit
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com><199911091646.LAA21467@eric.cnri.reston.va.us><14376.22527.323888.677816@amarok.cnri.reston.va.us><199911091726.MAA21754@eric.cnri.reston.va.us><14376.23671.250752.637144@amarok.cnri.reston.va.us><14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us>
Message-ID: <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com>

Andrew M. Kuchling  wrote:
> (Speeding up PCRE -- that's another question.  I'm often tempted to
> rewrite pcre_compile to generate an easier-to-analyse parse tree,
> instead of its current complicated-but-memory-parsimonious compiler,
> but I'm very reluctant to introduce a fork like that.)

any special pattern constructs that are in need of
performance improvements?  (compared to Perl, that is).

or maybe anyone has an extensive performance test
suite for perlish regular expressions?  (preferably based
on how real people use regular expressions, not only on
things that are known to be slow if not optimized)






From gstein at lyra.org  Thu Nov 11 11:46:55 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 02:46:55 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <38293527.3CF5C7B0@lemburg.com>
Message-ID: 

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
>...
> Well almost... it depends on the current value of .

Default encodings are kind of nasty when they can be altered. The same
problem occurred with import hooks. Only one can be present at a time.
This implies that modules, packages, subsystems, whatever, cannot set a
default encoding because something else might depend on it having a
different value. In the end, nobody uses the default encoding because it
is unreliable, so you end up with extra implementation/semantics that
aren't used/needed.

Have you ever noticed how Python modules, packages, tools, etc, never
define an import hook?

I'll bet nobody ever monkeys with the default encoding either...

I say axe it and say "UTF-8" is the fixed, default encoding. If you want
something else, then do that explicitly.

>...
> Another problem is that Unicode types differ between platforms
> (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit
> wchar_t). Depending on the internal format of Unicode objects
> this could mean calling different conversion APIs.

Exactly the reason to avoid wchar_t.

> BTW, I'm still not too sure about the underlying internal format.
> The problem here is that Unicode started out as 2-byte fixed length
> representation (UCS2) but then shifted towards a 4-byte fixed length
> reprensetation known as UCS4. Since having 4 bytes per character
> is hard sell to customers, UTF16 was created to stuff the UCS4
> code points (this is how character entities are called in Unicode)
> into 2 bytes... with a variable length encoding.

History is basically irrelevant. What is the situation today? What is in
use, and what are people planning for right now?

>...
> The downside of using UTF16: it is a variable length format,
> so iterations over it will be slower than for UCS4.

Bzzt. May as well go with UTF-8 as the internal format, much like Perl is
doing (as I recall).

Why go with a variable length format, when people seem to be doing fine
with UCS-2?

Like I said in the other mail note: two large platforms out there are
UCS-2 based. They seem to be doing quite well with that approach.

If people truly need UCS-4, then they can work with that on their own. One
of the major reasons for putting Unicode into Python is to
increase/simplify its ability to speak to the underlying platform. Hey!
Guess what? That generally means UCS2.

If we didn't need to speak to the OS with these Unicode values, then
people can work with the values entirely in Python,
PyUnicodeType-be-damned.

Are we digging a hole for ourselves? Maybe. But there are two other big
platforms that have the same hole to dig out of *IF* it ever comes to
that. I posit that it won't be necessary; that the people needing UCS-4
can do so entirely in Python.

Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and
vice-versa. But: it only does it from String to String -- you can't use
Unicode objects anywhere in there.

> Simply sticking to UCS2 is probably out of the question,
> since Unicode 3.0 requires UCS4 and we are targetting
> Unicode 3.0.

Oh? Who says?

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From fredrik at pythonware.com  Wed Nov 10 11:52:28 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 11:52:28 +0100
Subject: [Python-Dev] I18N Toolkit
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com><199911091646.LAA21467@eric.cnri.reston.va.us><14376.22527.323888.677816@amarok.cnri.reston.va.us><199911091726.MAA21754@eric.cnri.reston.va.us><14376.23671.250752.637144@amarok.cnri.reston.va.us><14376.25768.368164.88151@anthem.cnri.reston.va.us> <14376.30652.201552.116828@amarok.cnri.reston.va.us>
Message-ID: <029c01bf2b69$af0da250$f29b12c2@secret.pythonware.com>

(a copy was sent to comp.lang.python by mistake;
sorry for that).

Andrew M. Kuchling  wrote:
> I don't think that will be a problem, given that the Unicode engine
> would be a separate C implementation.  A bit of 'if type(strg) ==
> UnicodeType' in re.py isn't going to cost very much speed.

a slightly hairer design issue is what combinations
of pattern and string the new 're' will handle.

the first two are obvious:

     ordinary pattern, ordinary string
     unicode pattern, unicode string

but what about these?

     ordinary pattern, unicode string
     unicode pattern, ordinary string

"coercing" patterns (i.e. recompiling, on demand)
seem to be a somewhat risky business ;-)
 




From gstein at lyra.org  Thu Nov 11 11:50:56 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 02:50:56 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <38293F8D.F60AE605@lemburg.com>
Message-ID: 

On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
> Tim Peters wrote:
> > BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
> > inverse in the Unicode world?  Both seem essential.
> 
> Good points.
> 
> How about 
> 
>   uniord(u[:1]) --> Unicode ordinal number (32-bit)
> 
>   unichr(i) --> Unicode object for character i (provided it is 32-bit);
>                 ValueError otherwise

Why new functions? Why not extend the definition of ord() and chr()?

In terms of backwards compatibility, the only issue could possibly be that
people relied on chr(x) to throw an error when x>=256. They certainly
couldn't pass a Unicode object to ord(), so that function can safely be
extended to accept a Unicode object and return a larger integer.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From jcw at equi4.com  Wed Nov 10 12:14:17 1999
From: jcw at equi4.com (Jean-Claude Wippler)
Date: Wed, 10 Nov 1999 12:14:17 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <38295389.397DDE5E@equi4.com>

Greg Stein wrote:
[MAL:]
> > The downside of using UTF16: it is a variable length format,
> > so iterations over it will be slower than for UCS4.
> 
> Bzzt. May as well go with UTF-8 as the internal format, much like Perl
> is doing (as I recall).

Ehm, pardon me for asking - what is the brief rationale for selecting
UCS2/4, or whatever it ends up being, over UTF8?

I couldn't find a discussion in the last months of the string SIG, was
this decided upon and frozen long ago?

I'm not trying to re-open a can of worms, just to understand.

-- Jean-Claude



From gstein at lyra.org  Thu Nov 11 12:17:56 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 03:17:56 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <38295389.397DDE5E@equi4.com>
Message-ID: 

On Wed, 10 Nov 1999, Jean-Claude Wippler wrote:
> Greg Stein wrote:
> > Bzzt. May as well go with UTF-8 as the internal format, much like Perl
> > is doing (as I recall).
> 
> Ehm, pardon me for asking - what is the brief rationale for selecting
> UCS2/4, or whetever it ends up being, over UTF8?
> 
> I couldn't find a discussion in the last months of the string SIG, was
> this decided upon and frozen long ago?

Try sometime last year :-) ... something like July thru September as I
recall.

Things will be a lot faster if we have a fixed-size character. Variable
length formats like UTF-8 are a lot harder to slice, search, etc. Also,
(IMO) a big reason for this new type is for interaction with the
underlying OS/platform. I don't know of any platforms right now that
really use UTF-8 as their Unicode string representation (meaning we'd
have to convert back/forth from our UTF-8 representation to talk to the
OS).
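
A two-line illustration, in today's Python, of why a variable-length
format makes indexing awkward -- byte offsets and character offsets stop
lining up as soon as a non-ASCII character appears:

    s = "h\u00e9llo"                        # 5 characters
    assert len(s.encode("utf-8")) == 6      # but 6 UTF-8 bytes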

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Wed Nov 10 10:55:42 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 10:55:42 +0100
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us> <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>
Message-ID: <3829411E.FD32F8CC@lemburg.com>

Fredrik Lundh wrote:
> 
> Guido van Rossum  wrote:
> > One specific question: in you discussion of typed strings, I'm not
> > sure why you couldn't convert everything to Unicode and be done with
> > it.  I have a feeling that the answer is somewhere in your case study
> > -- maybe you can elaborate?
> 
> Marc-Andre writes:
> 
>     Unicode objects should have a pointer to a cached (read-only) char
>     buffer holding the object's value using the current
>     <default encoding>.  This is needed for performance and internal
>     parsing (see below) reasons. The buffer is filled when the first
>     conversion request to the <default encoding> is issued on the object.
> 
> keeping track of an external encoding is better left
> for the application programmers -- I'm pretty sure that
> different application builders will want to handle this
> in radically different ways, depending on their environ-
> ment, underlying user interface toolkit, etc.

It's not that hard to implement. All you have to do is check
whether the current encoding of the cached buffer is still the same
as the thread's view of the <default encoding>. The cached
buffer is needed to implement "s" et al. argument parsing
anyway.
 
> besides, this is how Tcl would have done it.  Python's
> not Tcl, and I think you need *very* good arguments
> for moving in that direction.
> 
> 
> 

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed Nov 10 12:42:00 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 12:42:00 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
Message-ID: <38295A08.D3928401@lemburg.com>

Andy Robinson wrote:
> 
> In general, I like this proposal a lot, but I think it
> only covers half the story.  How we actually build the
> encoder/decoder for each encoding is a very big issue.
>  Thoughts on this below.
> 
> First, a little nit
> >  u = u'<utf-8 encoded Python string>'
> I don't like using funny prime characters - why not an
> explicit function like "utf8()"

u = unicode('...I am UTF8...','utf-8')

will do just that. I've moved to Tim's proposal with the
\uXXXX encoding for u'', BTW.
 
> On to the important stuff:
> >  unicodec.register(<encname>,<encoder>,<decoder>
> >  [,<stream_encoder>, <stream_decoder>])
> 
> > This registers the codecs under the given encoding
> > name in the module global dictionary
> > unicodec.codecs. Stream codecs are optional:
> > the unicodec module will provide appropriate
> > wrappers around <encoder> and
> > <decoder> if not given.
> 
> I would MUCH prefer a single 'Encoding' class or type
> to wrap up these things, rather than up to four
> disconnected objects/functions.  Essentially it would
> be an interface standard and would offer methods to do
> the four things above.
> 
> There are several reasons for this.
>
> ...
>
> In summary, firm up the concept of an Encoding object
> and give it room to grow - that's the key to
> real-world usefulness.   If people feel the same way
> I'll have a go at an interface for that, and try show
> how it would have simplified specific problems I have
> faced.

Ok, you have a point there.

Here's a proposal (note that this only defines an interface,
not a class structure):

Codec Interface Definition:
---------------------------

The following base class should be defined in the module unicodec.

class Codec:

    def encode(self,u):
	
	""" Return the Unicode object u encoded as Python string.

	"""
	...

    def decode(self,s):

	""" Return an equivalent Unicode object for the encoded Python
	    string s.

	""" 
	...
	
    def dump(self,u,stream,slice=None):

	""" Writes the Unicode object's contents encoded to the stream.

	    stream must be a file-like object open for writing binary
	    data.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is written.

        """
	... the base class should provide a default implementation
	    of this method using self.encode ...
	
    def load(self,stream,length=None):

	""" Reads an encoded string (up to  bytes) from the
	    stream and returns an equivalent Unicode object.

	    stream must be a file-like object open for reading binary
	    data.

	    If length is given, only length bytes are read. Note that
	    this can cause the decoding algorithm to fail due to
	    truncations in the encoding.

        """
	... the base class should provide a default implementation
	    of this method using self.decode ...

Codecs should raise a UnicodeError in case the conversion is
not possible.

It is not required by the unicodec.register() API to provide a
subclass of this base class, only the 4 given methods must be present.
This allows writing Codecs as extension types.
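
To give the interface some shape, here is roughly what a trivial codec
could look like (Latin-1 picked arbitrarily; today's built-in
encode/decode methods only stand in for the real conversion work):

    class Latin1Codec:

        def encode(self, u):
            # Unicode object -> encoded Python string
            return u.encode("latin-1")

        def decode(self, s):
            # encoded Python string -> Unicode object
            return s.decode("latin-1")

        def dump(self, u, stream, slice=None):
            if slice is not None:
                u = u[slice]
            stream.write(self.encode(u))

        def load(self, stream, length=None):
            data = stream.read() if length is None else stream.read(length)
            return self.decode(data)

Registration would presumably just hand such an instance (or its bound
methods) to unicodec.register().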

XXX Still to be discussed: 

    ? support for line breaks (see
      http://www.unicode.org/unicode/reports/tr13/ )

    ? support for case conversion: 

      Problems: string lengths can change due to multiple
      characters being mapped to a single new one, capital letters
      starting a word can be different than ones occurring in the
      middle, there are locale dependent deviations from the standard
      mappings.

    ? support for numbers, digits, whitespace, etc.

    ? support (or no support) for private code point areas


> We also need to think about where encoding info will
> live.  You cannot avoid mapping tables, although you
> can hide them inside code modules or pickled objects
> if you want.  Should there be a standard
> "..\Python\Enc" directory?

Mapping tables should be incorporated into the codec
modules preferably as static C data. That way multiple
processes can share the same data.

> And we're going to need some kind of testing and
> certification procedure when adding new encodings.
> This stuff has to be right.

I will have to rely on your cooperation for the test data.
Roundtrip testing is easy to implement, but I will also have
to verify the output against prechecked data which is probably only
creatable using visual tools to which I don't have access
(e.g. a Japanese Windows installation).
 
> Guido asked about TypedString.  This can probably be
> done on top of the built-in stuff - it is just a
> convenience which would clarify intent, reduce lines
> of code and prevent people shooting themselves in the
> foot when juggling a lot of strings in different
> (non-Unicode) encodings.  I can do a Python module to
> implement that on top of whatever is built.

Ok.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Wed Nov 10 11:03:36 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 11:03:36 +0100
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>
Message-ID: <382942F8.1921158E@lemburg.com>

Fredrik Lundh wrote:
> 
> Guido van Rossum  wrote:
> > http://starship.skyport.net/~lemburg/unicode-proposal.txt
> 
> Marc-Andre writes:
> 
>     The internal format for Unicode objects should either use a Python
>     specific fixed cross-platform format  (e.g. 2-byte
>     little endian byte order) or a compiler provided wchar_t format (if
>     available). Using the wchar_t format will ease embedding of Python in
>     other Unicode aware applications, but will also make internal format
>     dumps platform dependent.
> 
> having been there and done that, I strongly suggest
> a third option: a 16-bit unsigned integer, in platform
> specific byte order (PY_UNICODE_T).  along all other
> roads lie code bloat and speed penalties...
>
> (besides, this is exactly how it's already done in
> unicode.c and what 'sre' prefers...)

Ok, byte order can cause a speed penalty, so it might be
worthwhile introducing sys.bom (or sys.endianness) for this
reason and sticking to 16-bit integers as you have already done
in unicode.h.
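
For what it's worth, the platform byte order is cheap to detect at
runtime -- a one-liner along the lines of what a sys.bom / sys.endianness
attribute would expose:

    import struct

    # '=' is native order, '<' is little-endian
    IS_LITTLE_ENDIAN = struct.pack("=H", 1) == struct.pack("<H", 1)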

What I don't like is using wchar_t if available (and then addressing
it as if it were defined as unsigned integer). IMO, it's better
to define a Python Unicode representation which then gets converted
to whatever wchar_t represents on the target machine.

Another issue is whether to use UCS2 (as you have done) or UTF16
(which is what Unicode 3.0 requires)... see my other post
for a discussion.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fredrik at pythonware.com  Wed Nov 10 13:32:16 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 13:32:16 +0100
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> <382942F8.1921158E@lemburg.com>
Message-ID: <038501bf2b77$a06046f0$f29b12c2@secret.pythonware.com>

> What I don't like is using wchar_t if available (and then addressing
> it as if it were defined as unsigned integer). IMO, it's better
> to define a Python Unicode representation which then gets converted
> to whatever wchar_t represents on the target machine.

you should read the unicode.h file a bit more carefully:

...

/* Unicode declarations. Tweak these to match your platform */

/* set this flag if the platform has "wchar.h", "wctype.h" and the
   wchar_t type is a 16-bit unsigned type */
#define HAVE_USABLE_WCHAR_H

#if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H)

    (this uses wchar_t, and also iswspace and friends)

...

#else

/* Use if you have a standard ANSI compiler, without wchar_t support.
   If a short is not 16 bits on your platform, you have to fix the
   typedef below, or the module initialization code will complain. */

    (this maps iswspace to isspace, for 8-bit characters).

#endif

...

the plan was to use the second solution (using "configure"
to figure out what integer type to use), and its own
unicode database table for the is/to primitives
(iirc, the unicode.txt file discussed this, but that one
seems to be missing from the zip archive).






From fredrik at pythonware.com  Wed Nov 10 13:39:56 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 10 Nov 1999 13:39:56 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <039c01bf2b78$b234d520$f29b12c2@secret.pythonware.com>

Greg Stein  wrote:
> Have you ever noticed how Python modules, packages, tools, etc, never
> define an import hook?

hey, didn't MAL use one in one of his mx kits? ;-)

> I say axe it and say "UTF-8" is the fixed, default encoding. If you want
> something else, then do that explicitly.

exactly.

modes are evil.  python is not perl.  etc.

> Are we digging a hole for ourselves? Maybe. But there are two other big
> platforms that have the same hole to dig out of *IF* it ever comes to
> that. I posit that it won't be necessary; that the people needing UCS-4
> can do so entirely in Python.

last time I checked, there were no characters (even in the
ISO standard) outside the 16-bit range.  has that changed?






From mal at lemburg.com  Wed Nov 10 13:44:39 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 13:44:39 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <382968B7.ABFFD4C0@lemburg.com>

Greg Stein wrote:
> 
> On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
> > Tim Peters wrote:
> > > BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
> > > inverse in the Unicode world?  Both seem essential.
> >
> > Good points.
> >
> > How about
> >
> >   uniord(u[:1]) --> Unicode ordinal number (32-bit)
> >
> >   unichr(i) --> Unicode object for character i (provided it is 32-bit);
> >                 ValueError otherwise
> 
> Why new functions? Why not extend the definition of ord() and chr()?
> 
> In terms of backwards compatibility, the only issue could possibly be that
> people relied on chr(x) to throw an error when x>=256. They certainly
> couldn't pass a Unicode object to ord(), so that function can safely be
> extended to accept a Unicode object and return a larger integer.

Because unichr() will always have to return Unicode objects. You don't
want chr(i) to return Unicode for i>255 and strings for i<256.

OTOH, ord() could probably be extended to also work on Unicode objects.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed Nov 10 14:08:30 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 14:08:30 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <38296E4E.914C0ED7@lemburg.com>

Greg Stein wrote:
> 
> On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
> >...
> > Well almost... it depends on the current value of .
> 
> Default encodings are kind of nasty when they can be altered. The same
> problem occurred with import hooks. Only one can be present at a time.
> This implies that modules, packages, subsystems, whatever, cannot set a
> default encoding because something else might depend on it having a
> different value. In the end, nobody uses the default encoding because it
> is unreliable, so you end up with extra implementation/semantics that
> aren't used/needed.

I know, but this is a little different: you use strings a lot while
import hooks are rarely used directly by the user.

E.g. people in Europe will probably prefer Latin-1 as default
encoding while people in Asia will use one of the common CJK encodings.

The <default encoding> decides what encoding to use for many typical
tasks: printing, str(u), "s" argument parsing, etc.

Note that setting the <default encoding> is not intended to be
done prior to single operations. It is meant to be settable at
thread creation time.
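
A rough sketch of the per-thread idea (threading.local is today's
spelling; the real thing would live in the interpreter's thread state,
not in Python code):

    import threading

    _state = threading.local()

    def set_default_encoding(name):
        # meant to be called around thread creation time, not per operation
        _state.encoding = name

    def get_default_encoding():
        # fall back to UTF-8 when a thread never set one
        return getattr(_state, "encoding", "utf-8")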

> [...]
> 
> > BTW, I'm still not too sure about the underlying internal format.
> > The problem here is that Unicode started out as 2-byte fixed length
> > representation (UCS2) but then shifted towards a 4-byte fixed length
> > reprensetation known as UCS4. Since having 4 bytes per character
> > is hard sell to customers, UTF16 was created to stuff the UCS4
> > code points (this is how character entities are called in Unicode)
> > into 2 bytes... with a variable length encoding.
> 
> History is basically irrelevant. What is the situation today? What is in
> use, and what are people planning for right now?
> 
> >...
> > The downside of using UTF16: it is a variable length format,
> > so iterations over it will be slower than for UCS4.
> 
> Bzzt. May as well go with UTF-8 as the internal format, much like Perl is
> doing (as I recall).
> 
> Why go with a variable length format, when people seem to be doing fine
> with UCS-2?

The reason for UTF-16 is simply that it is identical to UCS-2
over large ranges, which makes optimizations (e.g. the UCS2 flag
I mentioned in an earlier post) feasible and effective. UTF-8
slows things down for CJK encodings, since the APIs will very often
have to scan the string to find the correct logical position in
the data.
 
Here's a quote from the Unicode FAQ (http://www.unicode.org/unicode/faq/ ):
"""
Q: How about using UCS-4 interfaces in my APIs?

Given an internal UTF-16 storage, you can, of course, still index into text
using UCS-4 indices. However, while converting from a UCS-4 index to a
UTF-16 index or vice versa is fairly straightforward, it does involve a
scan through the 16-bit units up to the index point. In a test run, for
example, accessing UTF-16 storage as UCS-4 characters results in a
10X degradation. Of course, the precise differences will depend on the
compiler, and there are some interesting optimizations that can be
performed, but it will always be slower on average. This kind of
performance hit is unacceptable in many environments.

Most Unicode APIs are using UTF-16. The low-level character indexing
are at the common storage level, with higher-level mechanisms for
graphemes or words specifying their boundaries in terms of the storage
units. This provides efficiency at the low levels, and the required
functionality at the high levels.

Convenience APIs can be produced that take parameters in UCS-4
methods for common utilities: e.g. converting UCS-4 indices back and
forth, accessing character properties, etc. Outside of indexing, differences
between UCS-4 and UTF-16 are not as important. For most other APIs
outside of indexing, characters values cannot really be considered
outside of their context--not when you are writing internationalized code.
For such operations as display, input, collation, editing, and even upper
and lowercasing, characters need to be considered in the context of a
string. That means that in any event you end up looking at more than one
character. In our experience, the incremental cost of doing surrogates is
pretty small.
"""

> Like I said in the other mail note: two large platforms out there are
> UCS-2 based. They seem to be doing quite well with that approach.
> 
> If people truly need UCS-4, then they can work with that on their own. One
> of the major reasons for putting Unicode into Python is to
> increase/simplify its ability to speak to the underlying platform. Hey!
> Guess what? That generally means UCS2.

All those formats are upward compatible (within certain ranges) and
the Python Unicode API will provide converters between its internal
format and the few common Unicode implementations, e.g. for MS
compilers (16-bit UCS2 AFAIK), GLIBC (32-bit UCS4).
 
> If we didn't need to speak to the OS with these Unicode values, then
> people can work with the values entirely in Python,
> PyUnicodeType-be-damned.
> 
> Are we digging a hole for ourselves? Maybe. But there are two other big
> platforms that have the same hole to dig out of *IF* it ever comes to
> that. I posit that it won't be necessary; that the people needing UCS-4
> can do so entirely in Python.
> 
> Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and
> vice-versa. But: it only does it from String to String -- you can't use
> Unicode objects anywhere in there.

See above.
 
> > Simply sticking to UCS2 is probably out of the question,
> > since Unicode 3.0 requires UCS4 and we are targetting
> > Unicode 3.0.
> 
> Oh? Who says?

From the FAQ:
"""
Q: What is UTF-16?

Unicode was originally designed as a pure 16-bit encoding, aimed at
representing all modern scripts. (Ancient scripts were to be represented
with private-use characters.) Over time, and especially after the addition
of over 14,500 composite characters for compatibility with legacy sets, it
became clear that 16-bits were not sufficient for the user community. Out
of this arose UTF-16.
"""

Note that there currently are no defined surrogate pairs for
UTF-16, meaning that in practice the difference between UCS-2 and
UTF-16 is probably negligible, e.g. we could define the internal
format to be UTF-16 and raise an exception whenever the boundary between
UTF-16 and UCS-2 is crossed -- sort of as a political compromise ;-).
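
Just to spell out where the two differ, the UTF-16 surrogate arithmetic
(a sketch, not code from the proposal):

    def utf16_units(codepoint):
        # code points above 0xFFFF need a surrogate pair; below that,
        # UTF-16 and UCS-2 store the same single 16-bit unit
        if codepoint < 0x10000:
            return [codepoint]
        codepoint -= 0x10000
        return [0xD800 | (codepoint >> 10),     # high surrogate
                0xDC00 | (codepoint & 0x3FF)]   # low surrogate

    assert utf16_units(0x0041) == [0x41]
    assert utf16_units(0x10400) == [0xD801, 0xDC00]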

But... I think HP has the last word on this one.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed Nov 10 13:36:44 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 13:36:44 +0100
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>  <199911091646.LAA21467@eric.cnri.reston.va.us> <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com> <382942F8.1921158E@lemburg.com> <038501bf2b77$a06046f0$f29b12c2@secret.pythonware.com>
Message-ID: <382966DC.F33E340E@lemburg.com>

Fredrik Lundh wrote:
> 
> > What I don't like is using wchar_t if available (and then addressing
> > it as if it were defined as unsigned integer). IMO, it's better
> > to define a Python Unicode representation which then gets converted
> > to whatever wchar_t represents on the target machine.
> 
> you should read the unicode.h file a bit more carefully:
> 
> ...
> 
> /* Unicode declarations. Tweak these to match your platform */
> 
> /* set this flag if the platform has "wchar.h", "wctype.h" and the
>    wchar_t type is a 16-bit unsigned type */
> #define HAVE_USABLE_WCHAR_H
> 
> #if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H)
> 
>     (this uses wchar_t, and also iswspace and friends)
> 
> ...
> 
> #else
> 
> /* Use if you have a standard ANSI compiler, without wchar_t support.
>    If a short is not 16 bits on your platform, you have to fix the
>    typedef below, or the module initialization code will complain. */
> 
>     (this maps iswspace to isspace, for 8-bit characters).
> 
> #endif
> 
> ...
> 
> the plan was to use the second solution (using "configure"
> to figure out what integer type to use), and its own uni-
> code database table for the is/to primitives

Oh, I did read unicode.h, stumbled across the mixed usage
and decided not to like it ;-)

Seriously, I find the second solution, where you use the 'unsigned
short', much more portable and straightforward. You never know what
the compiler does for isw*(), and it's probably better to stick
to one format for all platforms. Only endianness gets in the way,
but that's easy to handle.
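
For instance, fixing up byte order on the way in can be as small as this
(a sketch using the existing array module; nothing Unicode-specific is
assumed here):

    import array

    def swap16(data):
        # 'data' is an 8-bit string holding 16-bit units in the foreign order
        a = array.array('H', data)
        a.byteswap()
        return a.tostring()

    print repr(swap16('\x00A\x00B'))    # -> 'A\x00B\x00'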

So I opt for 'unsigned short'. The encoding used in these 2 bytes
is a different question though. If HP insists on Unicode 3.0, there's
probably no other way than to use UTF-16.
 
> (iirc, the unicode.txt file discussed this, but that one
> seems to be missing from the zip archive).

It's not in the file I downloaded from your site. Could you post
it here ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed Nov 10 14:13:10 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 14:13:10 +0100
Subject: [Python-Dev] Internationalization Toolkit
References:  <38295389.397DDE5E@equi4.com>
Message-ID: <38296F66.5DF9263E@lemburg.com>

Jean-Claude Wippler wrote:
> 
> Greg Stein wrote:
> [MAL:]
> > > The downside of using UTF16: it is a variable length format,
> > > so iterations over it will be slower than for UCS4.
> >
> > Bzzt. May as well go with UTF-8 as the internal format, much like Perl
> > is doing (as I recall).
> 
> Ehm, pardon me for asking - what is the brief rationale for selecting
> UCS2/4, or whetever it ends up being, over UTF8?

UCS-2 is the native format on major platforms (meaning a straight
fixed-length encoding using 2 bytes per character), i.e. interfacing
between Python's Unicode object and the platform APIs will be simple
and fast.

UTF-8 is short for ASCII users, but imposes a performance hit for
the CJK (Asian character sets) world, since UTF-8 uses a *variable*
length encoding.
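
To put a number on that, a typical CJK character needs three bytes in UTF-8
versus one 2-byte unit in UCS-2 (byte values written out by hand here, since
none of the Unicode machinery exists yet):

    ucs2 = '\x65\xe5'            # U+65E5 as one big-endian UCS-2 unit
    utf8 = '\xe6\x97\xa5'        # the same character in UTF-8
    print len(ucs2), len(utf8)   # -> 2 3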
 
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From akuchlin at mems-exchange.org  Wed Nov 10 15:56:16 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Wed, 10 Nov 1999 09:56:16 -0500 (EST)
Subject: [Python-Dev] Re: regexp performance
In-Reply-To: <027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<14376.22527.323888.677816@amarok.cnri.reston.va.us>
	<199911091726.MAA21754@eric.cnri.reston.va.us>
	<14376.23671.250752.637144@amarok.cnri.reston.va.us>
	<14376.25768.368164.88151@anthem.cnri.reston.va.us>
	<14376.30652.201552.116828@amarok.cnri.reston.va.us>
	<027c01bf2b69$59e60330$f29b12c2@secret.pythonware.com>
Message-ID: <14377.34704.639462.794509@amarok.cnri.reston.va.us>

[Cc'ed to the String-SIG; sheesh, what's the point of having SIGs
otherwise?]

Fredrik Lundh writes:
>any special pattern constructs that are in need of per-
>formance improvements?  (compared to Perl, that is).

In the 1.5 source tree, I think one major slowdown is coming from the
malloc'ed failure stack.  This was introduced in order to prevent an
expression like (x)* from filling the stack when applied to a string
containing 50,000 'x' characters (hence 50,000 recursive function
calls).  I'd like to get rid of this stack because it's slow and
requires much tedious patching of the upstream PCRE.
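
The pathological case being guarded against is easy to write down (a sketch,
not a benchmark; under the 1.5 PCRE-based engine each 'x' costs a recursive
call or a failure-stack slot):

    import re

    p = re.compile('(x)*')
    m = p.match('x' * 50000)    # 50,000 iterations of the group
    print m.end()               # -> 50000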

>or maybe anyone has an extensive performance test
>suite for perlish regular expressions?  (preferrably based
>on how real people use regular expressions, not only on
>things that are known to be slow if not optimized)

Friedl's book describes several optimizations which aren't implemented
in PCRE.  The problem is that PCRE never builds a parse tree, and
parse trees are easy to analyse recursively.  Instead, PCRE's
functions actually look at the compiled byte codes (for example, look
at find_firstchar or is_anchored in pypcre.c), but this makes analysis
functions hard to write, and rearranging the code near-impossible.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
I didn't say it was my fault. I said it was my responsibility. I know the
difference.
    -- Rose Walker, in SANDMAN #60: "The Kindly Ones:4"



From jack at oratrix.nl  Wed Nov 10 16:04:58 1999
From: jack at oratrix.nl (Jack Jansen)
Date: Wed, 10 Nov 1999 16:04:58 +0100
Subject: [Python-Dev] I18N Toolkit 
In-Reply-To: Message by "Fredrik Lundh"  ,
	     Wed, 10 Nov 1999 11:52:28 +0100 , <029c01bf2b69$af0da250$f29b12c2@secret.pythonware.com> 
Message-ID: <19991110150458.B542735BB1E@snelboot.oratrix.nl>

> a slightly hairer design issue is what combinations
> of pattern and string the new 're' will handle.
> 
> the first two are obvious:
>  
>      ordinary pattern, ordinary string
>      unicode pattern, unicode string
>  
>  but what about these?
>  
>      ordinary pattern, unicode string
>      unicode pattern, ordinary string

I think the logical thing to do would be to "promote" the ordinary pattern or 
string to unicode, in a similar way to what happens if you combine ints and 
floats in a single expression.

The result may be a bit surprising if your pattern is in ASCII, you've 
never been aware of Unicode, and you are handed such a string from somewhere 
else; but then, if you're only aware of integer arithmetic and are suddenly 
presented with a couple of floats, you'll also be pretty surprised at the 
result. At least it's easily explained.
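
In other words (a small sketch of the proposed behaviour; the Unicode half
of it is not implemented anywhere yet):

    print 1 + 1.5                  # the int is promoted to float -> 2.5
    # by analogy, something like
    #   re.search(pattern, u"some unicode text")
    # would first promote an 8-bit 'pattern' to Unicode, then match.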
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From fdrake at acm.org  Wed Nov 10 16:22:17 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 10 Nov 1999 10:22:17 -0500 (EST)
Subject: Internal Format (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<00f401bf2b53$3c234e90$f29b12c2@secret.pythonware.com>
Message-ID: <14377.36265.315127.788319@weyr.cnri.reston.va.us>

Fredrik Lundh writes:
 > having been there and done that, I strongly suggest
 > a third option: a 16-bit unsigned integer, in platform
 > specific byte order (PY_UNICODE_T).  along all other

  I actually like this best, but I understand that there are reasons
for using wchar_t, especially for interfacing with other code that
uses Unicode.
  Perhaps someone who knows more about the specific issues with
interfacing using wchar_t can summarize them, or point me to whatever
I've already missed.  p-)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From skip at mojam.com  Wed Nov 10 16:54:30 1999
From: skip at mojam.com (Skip Montanaro)
Date: Wed, 10 Nov 1999 09:54:30 -0600 (CST)
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>
Message-ID: <14377.38198.793496.870273@dolphin.mojam.com>

Just a couple observations from the peanut gallery...

1. I'm glad I don't have to do this Unicode/UTF/internationalization stuff.
   Seems like it would be easier to just get the whole world speaking
   Esperanto.

2. Are there plans for an internationalization session at IPC8?  Perhaps a
   few key players could be locked into a room for a couple days, to emerge
   bloodied, but with an implementation in-hand...

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From fdrake at acm.org  Wed Nov 10 16:58:30 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 10 Nov 1999 10:58:30 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <38295A08.D3928401@lemburg.com>
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
	<38295A08.D3928401@lemburg.com>
Message-ID: <14377.38438.615701.231437@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 >     def encode(self,u):
 > 	
 > 	""" Return the Unicode object u encoded as Python string.

  This should accept an optional slice parameter, and use it in the
same way as .dump().

 >     def dump(self,u,stream,slice=None):
...
 >     def load(self,stream,length=None):

  Why not have something like .wrapFile(f) that returns a file-like
object with all the file methods implemented, and doing the "right
thing" regarding encoding/decoding?  That way, the new file-like
object can be used directly with code that works with files and
doesn't care whether it uses 8-bit or unicode strings.
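
Something along these lines, perhaps (a rough sketch; the unicode()/encode()
calls follow the API proposed in this thread, so treat the whole thing as
tentative, and a real implementation would also have to cope with multi-byte
sequences split across read() boundaries):

    class EncodedFile:
        """Wrap an 8-bit stream, exposing it as a Unicode stream."""
        def __init__(self, stream, encoding):
            self.stream = stream
            self.encoding = encoding
        def write(self, u):
            self.stream.write(u.encode(self.encoding))
        def read(self, size=-1):
            return unicode(self.stream.read(size), self.encoding)
        # readline(), close(), flush(), etc. would be forwarded similarly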

 > Codecs should raise an UnicodeError in case the conversion is
 > not possible.

  I think that should be ValueError, or UnicodeError should be a
subclass of ValueError.
  (Can the -X interpreter option be removed yet?)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From bwarsaw at cnri.reston.va.us  Wed Nov 10 17:41:29 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Wed, 10 Nov 1999 11:41:29 -0500 (EST)
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
References: <19991109135839.25864.rocketmail@web607.mail.yahoo.com>
	<199911091646.LAA21467@eric.cnri.reston.va.us>
	<010b01bf2b54$fb107430$f29b12c2@secret.pythonware.com>
	<14377.38198.793496.870273@dolphin.mojam.com>
Message-ID: <14377.41017.413515.887236@anthem.cnri.reston.va.us>

>>>>> "SM" == Skip Montanaro  writes:

    SM> 2. Are there plans for an internationalization session at
    SM> IPC8?  Perhaps a few key players could be locked into a room
    SM> for a couple days, to emerge bloodied, but with an
    SM> implementation in-hand...

I'm starting to think about devday topics.  Sounds like an I18n
session would be very useful.  Champions?

-Barry



From mal at lemburg.com  Wed Nov 10 14:31:47 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 10 Nov 1999 14:31:47 +0100
Subject: [Python-Dev] Internationalization Toolkit
References:  <039c01bf2b78$b234d520$f29b12c2@secret.pythonware.com>
Message-ID: <382973C3.DCA77051@lemburg.com>

Fredrik Lundh wrote:
> 
> Greg Stein  wrote:
> > Have you ever noticed how Python modules, packages, tools, etc, never
> > define an import hook?
> 
> hey, didn't MAL use one in one of his mx kits? ;-)

Not yet, but I will unless my last patch ("walk me up, Scotty" - import)
goes into the core interpreter.
 
> > I say axe it and say "UTF-8" is the fixed, default encoding. If you want
> > something else, then do that explicitly.
> 
> exactly.
> 
> modes are evil.  python is not perl.  etc.

But a requirement by the customer... they want to be able to set the locale
on a per thread basis. Not exactly my preference (I think all locale
settings should be passed as parameters, not via globals).
 
> > Are we digging a hole for ourselves? Maybe. But there are two other big
> > platforms that have the same hole to dig out of *IF* it ever comes to
> > that. I posit that it won't be necessary; that the people needing UCS-4
> > can do so entirely in Python.
> 
> last time I checked, there were no characters (even in the
> ISO standard) outside the 16-bit range.  has that changed?

No, but people are already thinking about it and there is
a defined range in the >16-bit area for private encodings
(F0000..FFFFD and 100000..10FFFD).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mhammond at skippinet.com.au  Wed Nov 10 22:36:04 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu, 11 Nov 1999 08:36:04 +1100
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382973C3.DCA77051@lemburg.com>
Message-ID: <005701bf2bc3$980f4d60$0501a8c0@bobcat>

Marc writes:

> > modes are evil.  python is not perl.  etc.
>
> But a requirement by the customer... they want to be able to
> set the locale
> on a per thread basis. Not exactly my preference (I think all locale
> settings should be passed as parameters, not via globals).

Sure - that is what this customer wants, but we need to be clear about
the "best thing" for Python generally versus what this particular
client wants.

For example, if we went with UTF-8 as the only default encoding, then
HP may be forced to use a helper function to perform the conversion,
rather than the built-in functions.  This helper function can use TLS
(in Python) to store the encoding.  At least it is localized.

I agree that having a default encoding that can be changed is a bad
idea.  It may make 3 line scripts that need to print something easier
to work with, but at the cost of reliability in large systems.  Kinda
like the existing "locale" support, which is thread specific, and is
well known to cause these sorts of problems.  The end result is that
in your app, you find _someone_ has changed the default encoding, and
some code no longer works.  So the solution is to change the default
encoding back, so _your_ code works again.  You just know that whoever
it was that changed the default encoding in the first place is now
going to break - but what else can you do?

Having a fixed, default encoding may make life slightly more difficult
when you want to work primarily in a different encoding, but at least
your system is predictable and reliable.

Mark.

>
> > > Are we digging a hole for ourselves? Maybe. But there are
> two other big
> > > platforms that have the same hole to dig out of *IF* it
> ever comes to
> > > that. I posit that it won't be necessary; that the people
> needing UCS-4
> > > can do so entirely in Python.
> >
> > last time I checked, there were no characters (even in the
> > ISO standard) outside the 16-bit range.  has that changed?
>
> No, but people are already thinking about it and there is
> a defined range in the >16-bit area for private encodings
> (F0000..FFFFD and 100000..10FFFD).
>
> --
> Marc-Andre Lemburg
>
> ______________________________________________________________________
> Y2000:                                                    51 days left
> Business:                                      http://www.lemburg.com/
> Python Pages:                           http://www.lemburg.com/python/
>
>
> _______________________________________________
> Python-Dev maillist  -  Python-Dev at python.org
> http://www.python.org/mailman/listinfo/python-dev
>




From gstein at lyra.org  Fri Nov 12 00:14:55 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 15:14:55 -0800 (PST)
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
In-Reply-To: <005701bf2bc3$980f4d60$0501a8c0@bobcat>
Message-ID: 

On Thu, 11 Nov 1999, Mark Hammond wrote:
> Marc writes:
> > > modes are evil.  python is not perl.  etc.
> >
> > But a requirement by the customer... they want to be able to
> > set the locale
> > on a per thread basis. Not exactly my preference (I think all locale
> > settings should be passed as parameters, not via globals).
> 
> Sure - that is what this customer wants, but we need to be clear about
> the "best thing" for Python generally versus what this particular
> client wants.

Ha! I was getting ready to say exactly the same thing. Are we building Python
for a particular customer, or are we building it to Do The Right Thing?

I've been getting increasingly annoyed at "well, HP says this" or "HP
wants that." I'm ecstatic that they are a Consortium member and are
helping to fund the development of Python. However, if that means we are
selling Python's soul to corporate wishes rather than programming and
design ideals... well, it reduces my enthusiasm :-)

>...
> I agree that having a default encoding that can be changed is a bad
> idea.  It may make 3 line scripts that need to print something easier
> to work with, but at the cost of reliability in large systems.  Kinda
> like the existing "locale" support, which is thread specific, and is
> well known to cause these sorts of problems.  The end result is that
> in your app, you find _someone_ has changed the default encoding, and
> some code no longer works.  So the solution is to change the default
> encoding back, so _your_ code works again.  You just know that whoever
> it was that changed the default encoding in the first place is now
> going to break - but what else can you do?

Yes! Yes! Example #2.

My first example (import hooks) was shrugged off by some as "well, nobody
uses those." Okay, maybe people don't use them (but I believe that is
*because* of this kind of problem).

In Mark's example, however... this is a definite problem. I ran into this
when I was building some code for Microsoft Site Server. IIS was setting a
different locale on my thread -- one that I definitely was not expecting.
All of a sudden, strlwr() no longer worked as I expected -- certain
characters didn't get lower-cased, so my dictionary lookups failed because
the keys were not all lower-cased.

Solution? Before passing control from C++ into Python, I set the locale to
the default locale. Restored it on the way back out. Extreme measures, and
costly to do, but it had to be done.

I think I'll pick up Fredrik's phrase here...

(chanting) "Modes Are Evil!"  "Modes Are Evil!"  "Down with Modes!"

:-)

> Having a fixed, default encoding may make life slightly more difficult
> when you want to work primarily in a different encoding, but at least
> your system is predictable and reliable.

*bing*

I'm with Mark on this one. Global modes and state are a serious pain when
it comes to developing a system.

Python is very amenable to utility functions and classes. Any "customer"
can use a utility function to manually do the encoding according to a
per-thread setting stashed in some module-global dictionary (map thread-id
to default-encoding). Done. Keep it out of the interpreter...
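
A sketch of that utility-function approach (all names here are made up; only
the thread module is real):

    import thread

    _encoding_map = {}                      # thread id -> default encoding

    def set_thread_encoding(name):
        _encoding_map[thread.get_ident()] = name

    def encode(u):
        name = _encoding_map.get(thread.get_ident(), 'utf-8')
        return u.encode(name)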

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From da at ski.org  Thu Nov 11 00:21:54 1999
From: da at ski.org (David Ascher)
Date: Wed, 10 Nov 1999 15:21:54 -0800 (Pacific Standard Time)
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
In-Reply-To: 
Message-ID: 

On Thu, 11 Nov 1999, Greg Stein wrote:

> Ha! I was getting ready to say exactly the same thing. Are we building Python
> for a particular customer, or are we building it to Do The Right Thing?
> 
> I've been getting increasingly annoyed at "well, HP says this" or "HP
> wants that." I'm ecstatic that they are a Consortium member and are
> helping to fund the development of Python. However, if that means we are
> selling Python's soul to corporate wishes rather than programming and
> design ideals... well, it reduces my enthusiasm :-)

What about just explaining the rationale for the default-less point of
view to whoever is in charge of this at HP and see why they came up with
their rationale in the first place?  They might have a good reason, or
they might be willing to change said requirement.

--david




From gstein at lyra.org  Fri Nov 12 00:31:43 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 15:31:43 -0800 (PST)
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
In-Reply-To: 
Message-ID: 

Damn, you're smooth... maybe you should have run for SF Mayor...

:-)

On Wed, 10 Nov 1999, David Ascher wrote:
> On Thu, 11 Nov 1999, Greg Stein wrote:
> 
> > Ha! I was getting ready to say exactly the same thing. Are we building Python
> > for a particular customer, or are we building it to Do The Right Thing?
> > 
> > I've been getting increasingly annoyed at "well, HP says this" or "HP
> > wants that." I'm ecstatic that they are a Consortium member and are
> > helping to fund the development of Python. However, if that means we are
> > selling Python's soul to corporate wishes rather than programming and
> > design ideals... well, it reduces my enthusiasm :-)
> 
> What about just explaining the rationale for the default-less point of
> view to whoever is in charge of this at HP and see why they came up with
> their rationale in the first place?  They might have a good reason, or
> they might be willing to change said requirement.
> 
> --david
> 

--
Greg Stein, http://www.lyra.org/




From tim_one at email.msn.com  Thu Nov 11 07:25:27 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 01:25:27 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <00f501bf2b53$9872e610$f29b12c2@secret.pythonware.com>
Message-ID: <000201bf2c0d$8b866160$262d153f@tim>

[/F, dripping with code]
> ...
> Note that the 'u' must be followed by four hexadecimal digits.  If
> fewer digits are given, the sequence is left in the resulting string
> exactly as given.

Yuck -- don't let probable error pass without comment.  "must be" == "must
be"!

[moving backwards]
> \uxxxx -- Unicode character with hexadecimal value xxxx.  The
> character is stored using UTF-8 encoding, which means that this
> sequence can result in up to three encoded characters.

The code is fine, but I've gotten confused about what the intent is now.
Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8
literals, but now he's got Unicode-escaped literals instead -- and you favor
an internal 2-byte-per-char Unicode storage format.  In that combination of
worlds, is there any use in the *language* (as opposed to in a runtime
module) for \uxxxx -> UTF-8 conversion?

And MAL, if you're listening, I'm not clear on what a Unicode-escaped
literal means.  When you had UTF-8 literals, the meaning of something like

    u"a\340\341"

was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals
were just a way of specifying a byte stream.  As a Unicode-escaped string, I
assume the "a" maps to the Unicode "a", but what of the rest?  Are the octal
escapes to be taken as two separate Latin-1 characters (in their role as a
Unicode subset), or as an especially clumsy way to specify a single 16-bit
Unicode character?  I'm afraid I'd vote for the former.  Same issue wrt \x
escapes.

One other issue:  are there "raw" Unicode strings too, as in ur"\u20ac"?
There probably should be; and while Guido will hate this, a ur string should
probably *not* leave \uxxxx escapes untouched.  Nasties like this are why
Java defines \uxxxx expansion as occurring in a preprocessing step.

BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or
isn't \uxxxx allowed in a non-Unicode string?  that's what I would do ...).





From tim_one at email.msn.com  Thu Nov 11 07:49:16 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 01:49:16 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: 
Message-ID: <000501bf2c10$df4679e0$262d153f@tim>

[ Greg Stein]
> ...
> Things will be a lot faster if we have a fixed-size character. Variable
> length formats like UTF-8 are a lot harder to slice, search, etc.

The initial byte of any UTF-8 encoded character never appears in a
*non*-initial position of any UTF-8 encoded character.  Which means
searching is not only tractable in UTF-8, but also that whatever optimized
8-bit clean string searching routines you happen to have sitting around
today can be used as-is on UTF-8 encoded strings.  This is not true of UCS-2
encoded strings (in which "the first" byte is not distinguished, so 8-bit
search is vulnerable to finding a hit starting "in the middle" of a
character).  More, to the extent that the bulk of your text is plain ASCII,
the UTF-8 search will run much faster than when using a 2-byte encoding,
simply because it has half as many bytes to chew over.
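
A tiny demonstration of the difference (byte strings written out by hand;
string.find is just the plain 8-bit substring search):

    import string

    ucs2 = '\x41\x00\x42\x00'            # UCS-2BE for u'\u4100\u4200'
    print string.find(ucs2, '\x00\x42')  # -> 1: a false hit, starting mid-character

    utf8 = '\xe4\x84\x80\xe4\x88\x80'    # UTF-8 for the same two characters
    print string.find(utf8, '\x42')      # -> -1: no such false hit is possible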

UTF-8 is certainly slower for random-access indexing, including slicing.

I don't know what "etc" means, but if it follows the pattern so far,
sometimes it's faster and sometimes it's slower .

> (IMO) a big reason for this new type is for interaction with the
> underlying OS/platform. I don't know of any platforms right now that
> really use UTF-8 as their Unicode string representation (meaning we'd
> have to convert back/forth from our UTF-8 representation to talk to the
> OS).

No argument here.





From tim_one at email.msn.com  Thu Nov 11 07:56:35 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 01:56:35 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382968B7.ABFFD4C0@lemburg.com>
Message-ID: <000601bf2c11$e4b07920$262d153f@tim>

[MAL, on Unicode chr() and ord()
> ...
> Because unichr() will always have to return Unicode objects. You don't
> want chr(i) to return Unicode for i>255 and strings for i<256.

Indeed I do not!

> OTOH, ord() could probably be extended to also work on Unicode objects.

I think should be -- it's a good & natural use of polymorphism; introducing
a new function *here* would be as odd as introducing a unilen() function to
get the length of a Unicode string.





From tim_one at email.msn.com  Thu Nov 11 08:03:34 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 02:03:34 -0500
Subject: [Python-Dev] RE: [String-SIG] Re: regexp performance
In-Reply-To: <14377.34704.639462.794509@amarok.cnri.reston.va.us>
Message-ID: <000701bf2c12$de8bca80$262d153f@tim>

[Andrew M. Kuchling]
> ...
> Friedl's book describes several optimizations which aren't implemented
> in PCRE.  The problem is that PCRE never builds a parse tree, and
> parse trees are easy to analyse recursively.  Instead, PCRE's
> functions actually look at the compiled byte codes (for example, look
> at find_firstchar or is_anchored in pypcre.c), but this makes analysis
> functions hard to write, and rearranging the code near-impossible.

This is wonderfully & ironically Pythonic.  That is, the Python compiler
itself goes straight to byte code, and the optimization that's done works at
the latter low level.  Luckily , very little optimization is
attempted, and what's there only replaces one bytecode with another of the
same length.  If it tried to do more, it would have to rearrange the code
...

the-more-things-differ-the-more-things-don't-ly y'rs  - tim





From tim_one at email.msn.com  Thu Nov 11 08:27:52 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 02:27:52 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382973C3.DCA77051@lemburg.com>
Message-ID: <000801bf2c16$43f9a4c0$262d153f@tim>

[/F]
> last time I checked, there were no characters (even in the
> ISO standard) outside the 16-bit range.  has that changed?

[MAL]
> No, but people are already thinking about it and there is
> a defined range in the >16-bit area for private encodings
> (F0000..FFFFD and 100000..10FFFD).

Over the decades I've developed a rule of thumb that has never wound up
stuck in my ass :  If I engineer code that I expect to be in use for N
years, I make damn sure that every internal limit is at least 10x larger
than the largest I can conceive of a user making reasonable use of at the
end of those N years.  The invariable result is that the N years pass, and
fewer than half of the users have bumped into the limit <0.5 wink>.

At the risk of offending everyone, I'll suggest that, qualitatively
speaking, Unicode is as Eurocentric as ASCII is Anglocentric.  We've just
replaced "256 characters?!  We'll *never* run out of those!" with 64K.  But
when Asian languages consume them 7K at a pop, 64K isn't even in my 10x
comfort range for some individual languages.  In just a few months, Unicode
3 will already have used up > 56K of the 64K slots.

As I understand it, UTF-16 "only" adds 1M new code points.  That's in my 10x
zone, for about a decade.

predicting-we'll-live-to-regret-it-either-way-ly y'rs  - tim





From captainrobbo at yahoo.com  Thu Nov 11 08:29:05 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 23:29:05 -0800 (PST)
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
Message-ID: <19991111072905.25203.rocketmail@web607.mail.yahoo.com>

> 2. Are there plans for an internationalization
> session at IPC8?  Perhaps a
>    few key players could be locked into a room for a
> couple days, to emerge
>    bloodied, but with an implementation in-hand...

Excellent idea.  

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From tim_one at email.msn.com  Thu Nov 11 08:29:50 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 02:29:50 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <005701bf2bc3$980f4d60$0501a8c0@bobcat>
Message-ID: <000901bf2c16$8a107420$262d153f@tim>

[Mark Hammond]
> Sure - that is what this customer wants, but we need to be clear about
> the "best thing" for Python generally versus what this particular
> client wants.
> ...
> Having a fixed, default encoding may make life slightly more difficult
> when you want to work primarily in a different encoding, but at least
> your system is predictable and reliable.

Well said, Mark!  Me too.  It's like HP is suffering from Windows envy
.





From captainrobbo at yahoo.com  Thu Nov 11 08:30:53 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 23:30:53 -0800 (PST)
Subject: cached encoding (Re: [Python-Dev] Internationalization Toolkit)
Message-ID: <19991111073053.7884.rocketmail@web602.mail.yahoo.com>


--- "Barry A. Warsaw" 
wrote:
> 
> I'm starting to think about devday topics.  Sounds
> like an I18n
> session would be very useful.  Champions?
> 
I'm willing to explain what the fuss is about to
bemused onlookers and give some examples of problems
it should be able to solve - plenty of good slides and
screen shots.  I'll stay well away from the C
implementation issues.

Regards,

Andy

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From captainrobbo at yahoo.com  Thu Nov 11 08:33:25 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 23:33:25 -0800 (PST)
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
Message-ID: <19991111073325.8024.rocketmail@web602.mail.yahoo.com>

> 
> What about just explaining the rationale for the
> default-less point of
> view to whoever is in charge of this at HP and see
> why they came up with
> their rationale in the first place?  They might have
> a good reason, or
> they might be willing to change said requirement.
> 
> --david

For that matter (I came into this a bit late), is
there a statement somewhere of what HP actually want
to do?  

- Andy

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From captainrobbo at yahoo.com  Thu Nov 11 08:44:50 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 10 Nov 1999 23:44:50 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991111074450.20451.rocketmail@web606.mail.yahoo.com>

> I say axe it and say "UTF-8" is the fixed, default
> encoding. If you want
> something else, then do that explicitly.
> 
Let me tell you why you would want to have an encoding
which can be set:

(1) Say I am on a Japanese Windows box, I have a
string called 'address' and I do 'print address'.  If
I see utf8, I see garbage.  If I see Shift-JIS, I see
the correct Japanese address.  At this point in time,
utf8 is an interchange format but 99% of the world's
data is in various native encodings.  

Analogous problems occur on input.

(2) I'm using htmlgen, which 'prints' objects to
standard output.  My web site is supposed to be
encoded in Shift-JIS (or EUC, or Big 5 for Taiwan,
etc.)  Yes, browsers CAN detect and display UTF8 but
you just don't find UTF8 sites in the real world - and
most users just don't know about the encoding menu,
and will get pissed off if they have to reach for it.

Ditto for streaming output in some protocol.

Java solves this (and we could too by hacking stdout)
using Writer classes which are created as wrappers
around an output stream and can take an encoding, but
you lose the flexibility to 'just print'.  

I think being able to change encoding would be useful.
 What I do not want is to auto-detect it from the
operating system when Python boots - that would be a
portability nightmare. 

Regards,

Andy





=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From fredrik at pythonware.com  Thu Nov 11 09:06:04 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Thu, 11 Nov 1999 09:06:04 +0100
Subject: [Python-Dev] RE: [String-SIG] Re: regexp performance
References: <000701bf2c12$de8bca80$262d153f@tim>
Message-ID: <009201bf2c1b$9a5c1b90$f29b12c2@secret.pythonware.com>

Tim Peters  wrote:
> > The problem is that PCRE never builds a parse tree, and
> > parse trees are easy to analyse recursively.  Instead, PCRE's
> > functions actually look at the compiled byte codes (for example, look
> > at find_firstchar or is_anchored in pypcre.c), but this makes analysis
> > functions hard to write, and rearranging the code near-impossible.
> 
> This is wonderfully & ironically Pythonic.  That is, the Python compiler
> itself goes straight to byte code, and the optimization that's done works at
> the latter low level.

yeah, but for some reason, people (including GvR) expect the
regular expression machinery to be more optimized than the
language interpreter ;-)






From tim_one at email.msn.com  Thu Nov 11 09:01:58 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Thu, 11 Nov 1999 03:01:58 -0500
Subject: [Python-Dev] default encodings (was: Internationalization Toolkit)
In-Reply-To: <19991111073325.8024.rocketmail@web602.mail.yahoo.com>
Message-ID: <000c01bf2c1b$0734c060$262d153f@tim>

[Andy Robinson]
> For that matter (I came into this a bit late), is
> there a statement somewhere of what HP actually want
> to do?

On this list, the best explanation we got was from Guido:  they want
"internationalization", and "Perl-compatible Unicode regexps".  I'm not sure
they even know the two aren't identical <0.9 wink>.

code-without-requirements-is-like-sex-without-consequences-ly y'rs  - tim





From guido at CNRI.Reston.VA.US  Thu Nov 11 13:03:51 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 11 Nov 1999 07:03:51 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: Your message of "Wed, 10 Nov 1999 23:44:50 PST."
             <19991111074450.20451.rocketmail@web606.mail.yahoo.com> 
References: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> 
Message-ID: <199911111203.HAA24221@eric.cnri.reston.va.us>

> Let me tell you why you would want to have an encoding
> which can be set:
> 
> (1) Say I am on a Japanese Windows box, I have a
> string called 'address' and I do 'print address'.  If
> I see utf8, I see garbage.  If I see Shift-JIS, I see
> the correct Japanese address.  At this point in time,
> utf8 is an interchange format but 99% of the world's
> data is in various native encodings.  
> 
> Analogous problems occur on input.
> 
> (2) I'm using htmlgen, which 'prints' objects to
> standard output.  My web site is supposed to be
> encoded in Shift-JIS (or EUC, or Big 5 for Taiwan,
> etc.)  Yes, browsers CAN detect and display UTF8 but
> you just don't find UTF8 sites in the real world - and
> most users just don't know about the encoding menu,
> and will get pissed off if they have to reach for it.
> 
> Ditto for streaming output in some protocol.
> 
> Java solves this (and we could too by hacking stdout)
> using Writer classes which are created as wrappers
> around an output stream and can take an encoding, but
> you lose the flexibility to 'just print'.  
> 
> I think being able to change encoding would be useful.
>  What I do not want is to auto-detect it from the
> operating system when Python boots - that would be a
> portability nightmare. 

You almost convinced me there, but I think this can still be done
without changing the default encoding: simply reopen stdout with a
different encoding.  This is how Java does it.  I/O streams with an
encoding specified at open() are a very powerful feature.  You can
hide this in your $PYTHONSTARTUP.
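
For example, the startup file could do something like this (sketch only;
unicodec.stream is the stream-wrapper API sketched elsewhere in this thread,
not an existing module):

    # in the file named by $PYTHONSTARTUP
    import sys, unicodec
    sys.stdout = unicodec.stream(sys.stdout, 'shift-jis')
    # after this, a plain  print address  writes Shift-JIS to the console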

François Pinard might not like it though...

BTW, someone asked what HP asked for: I can't reveal what exactly they
asked for, basically because they don't seem to agree amongst
themselves.  The only firm statements I have is that they want i18n
and that they want it fast (before the end of the year).

The desire for Perl-compatible regexps comes from me, and the only
reason is compatibility with re.py.  (HP did ask for regexps, but they
wouldn't know the difference between POSIX and Perl if it poked them in
the eye.)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gstein at lyra.org  Thu Nov 11 13:20:39 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 11 Nov 1999 04:20:39 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit (fwd)
Message-ID: 

Andy originally sent this just to me... I replied in kind, but saw that he
sent another copy to python-dev. Sending my reply there...

---------- Forwarded message ----------
Date: Thu, 11 Nov 1999 04:00:38 -0800 (PST)
From: Greg Stein 
To: andy at robanal.demon.co.uk
Subject: Re: [Python-Dev] Internationalization Toolkit

[ note: you sent direct to me; replying in kind in case that was your
  intent ]

On Wed, 10 Nov 1999, [iso-8859-1] Andy Robinson wrote:
>...
> Let me tell you why you would want to have an encoding
> which can be set:
>...snip: two examples of how "print" fails...

Neither of those examples is a solid reason for having a default encoding
that can be changed. Both can easily be handled at the Python level by
using an encoding function before printing.

You're asking for convenience, *not* providing a reason.

> Java solves this (and we could too) using Writer
> classes which are created as wrappers around an output
> stream and can take an encoding, but you lose the
> flexibility to just print.  

Not flexibility: convenience. You can certainly do:

  print encode(u,'Shift-JIS')

> I think being able to change encoding would be useful.
>  What I do not want is to auto-detect it from the
> operating system when Python boots - that would be a
> portability nightmare. 

Useful, but not a requirement.

Keep the interpreter simple, understandable, and predictable. A module
that changes the default over to 'utf-8' because it is interacting with a
network object is going to screw up your app if you're relying on an
encoding of 'shift-jis' to be present.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/





From captainrobbo at yahoo.com  Thu Nov 11 13:49:10 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Thu, 11 Nov 1999 04:49:10 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991111124910.6373.rocketmail@web603.mail.yahoo.com>

> You almost convinced me there, but I think this can
> still be done
> without changing the default encoding: simply reopen
> stdout with a
> different encoding.  This is how Java does it.  I/O
> streams with an
> encoding specified at open() are a very powerful
> feature.  You can
> hide this in your $PYTHONSTARTUP.

Good point, I'm happy with this.  Make sure we specify
it in the docs as the right way to do it.  In an IDE,
we'd have an Options screen somewhere for the output
encoding.

What the Java code I have seen does is to open a raw
file and construct wrappers (InputStreamReader,
OutputStreamWriter) around it to do an encoding
conversion.  This kind of obfuscates what is going on
- Python just needs the extra argument.  

- Andy








=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From mal at lemburg.com  Thu Nov 11 13:42:51 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 13:42:51 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
		<38295A08.D3928401@lemburg.com> <14377.38438.615701.231437@weyr.cnri.reston.va.us>
Message-ID: <382AB9CB.634A9782@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  >     def encode(self,u):
>  >
>  >      """ Return the Unicode object u encoded as Python string.
> 
>   This should accept an optional slice parameter, and use it in the
> same way as .dump().

Ok.
 
>  >     def dump(self,u,stream,slice=None):
> ...
>  >     def load(self,stream,length=None):
> 
>   Why not have something like .wrapFile(f) that returns a file-like
> object with all the file methods implemented, and doing to "right
> thing" regarding encoding/decoding?  That way, the new file-like
> object can be used directly with code that works with files and
> doesn't care whether it uses 8-bit or unicode strings.

See File Output of the latest version:

File/Stream Output:
-------------------

Since file.write(object) and most other stream writers use the 's#'
argument parsing marker, the buffer interface implementation
determines the encoding to use (see Buffer Interface).

For explicit handling of Unicode using files, the unicodec module
could provide stream wrappers which provide transparent
encoding/decoding for any open stream (file-like object):

  import unicodec
  file = open('mytext.txt','rb')
  ufile = unicodec.stream(file,'utf-16')
  u = ufile.read()
  ...
  ufile.close()

XXX unicodec.file(<filename>,<mode>,<encoding>) could be provided as
    short-hand for unicodec.file(open(<filename>,<mode>),<encoding>) which
    also assures that <mode> contains the 'b' character when needed.
 
>  > Codecs should raise an UnicodeError in case the conversion is
>  > not possible.
> 
>   I think that should be ValueError, or UnicodeError should be a
> subclass of ValueError.

Ok.

>   (Can the -X interpreter option be removed yet?)

Doesn't Python convert class exceptions to strings when -X is
used ? I would guess that many scripts already rely on the class
based mechanism (much of my stuff does for sure), so by the time
1.6 is out, I think -X should be considered an option to run
pre 1.5 code rather than using it for performance reasons.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 14:01:40 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 14:01:40 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <005701bf2bc3$980f4d60$0501a8c0@bobcat>
Message-ID: <382ABE34.5D27C701@lemburg.com>

Mark Hammond wrote:
> 
> Marc writes:
> 
> > > modes are evil.  python is not perl.  etc.
> >
> > But a requirement by the customer... they want to be able to
> > set the locale
> > on a per thread basis. Not exactly my preference (I think all locale
> > settings should be passed as parameters, not via globals).
> 
> Sure - that is what this customer wants, but we need to be clear about
> the "best thing" for Python generally versus what this particular
> client wants.
> 
> For example, if we went with UTF-8 as the only default encoding, then
> HP may be forced to use a helper function to perform the conversion,
> rather than the built-in functions.  This helper function can use TLS
> (in Python) to store the encoding.  At least it is localized.
> 
> I agree that having a default encoding that can be changed is a bad
> idea.  It may make 3 line scripts that need to print something easier
> to work with, but at the cost of reliability in large systems.  Kinda
> like the existing "locale" support, which is thread specific, and is
> well known to cause these sorts of problems.  The end result is that
> in your app, you find _someone_ has changed the default encoding, and
> some code no longer works.  So the solution is to change the default
> encoding back, so _your_ code works again.  You just know that whoever
> it was that changed the default encoding in the first place is now
> going to break - but what else can you do?
> 
> Having a fixed, default encoding may make life slightly more difficult
> when you want to work primarily in a different encoding, but at least
> your system is predictable and reliable.

I think the discussion on this is getting a little too hot. The point
is simply that the option of changing the per-thread default encoding
is there. You are not required to use it and if you do you are on
your own when something breaks.

Think of it as an HP-specific feature... perhaps I should wrap the code
in #ifdefs and leave it undocumented.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fdrake at acm.org  Thu Nov 11 16:02:32 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 11 Nov 1999 10:02:32 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382AB9CB.634A9782@lemburg.com>
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
	<38295A08.D3928401@lemburg.com>
	<14377.38438.615701.231437@weyr.cnri.reston.va.us>
	<382AB9CB.634A9782@lemburg.com>
Message-ID: <14378.55944.371933.613604@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > For explicit handling of Unicode using files, the unicodec module
 > could provide stream wrappers which provide transparent
 > encoding/decoding for any open stream (file-like object):

  Sounds good to me!  I guess I just missed, there's been so much
going on lately.

 > XXX unicodec.file(<filename>,<mode>,<encoding>) could be provided as
 >     short-hand for unicodec.file(open(<filename>,<mode>),<encoding>) which
 >     also assures that <mode> contains the 'b' character when needed.

  Actually, I'd call it unicodec.open().

I asked:
 >   (Can the -X interpreter option be removed yet?)

You commented:
 > Doesn't Python convert class exceptions to strings when -X is
 > used ? I would guess that many scripts already rely on the class
 > based mechanism (much of my stuff does for sure), so by the time
 > 1.6 is out, I think -X should be considered an option to run
 > pre 1.5 code rather than using it for performance reasons.

  Gosh, I never thought of it as a performance issue!
  What I'd like to do is avoid code like this:

        try:
            class UnicodeError(ValueError):
                # well, something would probably go here...
                pass
        except TypeError:
            class UnicodeError:
                # something slightly different for this one...
                pass

  Trying to use class exceptions can be really tedious, and often I'd
like to pick up the stuff from Exception.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From mal at lemburg.com  Thu Nov 11 15:21:50 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:21:50 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000201bf2c0d$8b866160$262d153f@tim>
Message-ID: <382AD0FE.B604876A@lemburg.com>

Tim Peters wrote:
> 
> [/F, dripping with code]
> > ...
> > Note that the 'u' must be followed by four hexadecimal digits.  If
> > fewer digits are given, the sequence is left in the resulting string
> > exactly as given.
> 
> Yuck -- don't let probable error pass without comment.  "must be" == "must
> be"!

I second that.
 
> [moving backwards]
> > \uxxxx -- Unicode character with hexadecimal value xxxx.  The
> > character is stored using UTF-8 encoding, which means that this
> > sequence can result in up to three encoded characters.
> 
> The code is fine, but I've gotten confused about what the intent is now.
> Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8
> literals, but now he's got Unicode-escaped literals instead -- and you favor
> an internal 2-byte-per-char Unicode storage format.  In that combination of
> worlds, is there any use in the *language* (as opposed to in a runtime
> module) for \uxxxx -> UTF-8 conversion?

No, no...  :-) 

I think it was a simple misunderstanding... \uXXXX is only to be
used within u'' strings and then gets expanded to *one* character
encoded in the internal Python format (which is heading towards UTF-16
without surrogates).
 
> And MAL, if you're listening, I'm not clear on what a Unicode-escaped
> literal means.  When you had UTF-8 literals, the meaning of something like
> 
>     u"a\340\341"
> 
> was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals
> were just a way of specifying a byte stream.  As a Unicode-escaped string, I
> assume the "a" maps to the Unicode "a", but what of the rest?  Are the octal
> escapes to be taken as two separate Latin-1 characters (in their role as a
> Unicode subset), or as an especially clumsy way to specify a single 16-bit
> Unicode character?  I'm afraid I'd vote for the former.  Same issue wrt \x
> escapes.

Good points.

The conversion goes as follows:
- for single characters (and this includes all \XXX sequences except \uXXXX),
  take the ordinal and interpret it as Unicode ordinal
- for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX
  instead
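
Spelled out with a couple of (still hypothetical) u'' literals, those rules
would give:

    u"a\340\341"      # -> three characters: U+0061, U+00E0, U+00E1
    u"a\u20ac"        # -> two characters:   U+0061, U+20AC
    u"\u0041\101"     # -> two characters as well, both U+0041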
 
> One other issue:  are there "raw" Unicode strings too, as in ur"\u20ac"?
> There probably should be; and while Guido will hate this, a ur string should
> probably *not* leave \uxxxx escapes untouched.  Nasties like this are why
> Java defines \uxxxx expansion as occurring in a preprocessing step.

Not sure whether we really need to make this even more complicated...
The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or filenames
won't hurt much in the context of those \uXXXX monsters :-)

> BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or
> isn't \uxxxx allowed in a non-Unicode string?  that's what I would do ...).

Right. \uXXXX will only be allowed in u'' strings, not in "normal"
strings.

BTW, if you want to type in UTF-8 strings and have them converted
to Unicode, you can use the standard:

u = unicode('...string with UTF-8 encoded characters...','utf-8')

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 15:23:45 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:23:45 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000601bf2c11$e4b07920$262d153f@tim>
Message-ID: <382AD171.D22A1D6E@lemburg.com>

Tim Peters wrote:
> 
> [MAL, on Unicode chr() and ord()
> > ...
> > Because unichr() will always have to return Unicode objects. You don't
> > want chr(i) to return Unicode for i>255 and strings for i<256.
> 
> Indeed I do not!
> 
> > OTOH, ord() could probably be extended to also work on Unicode objects.
> 
> I think should be -- it's a good & natural use of polymorphism; introducing
> a new function *here* would be as odd as introducing a unilen() function to
> get the length of a Unicode string.

Fine. So I'll drop the uniord() API and extend ord() instead.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 15:36:41 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:36:41 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000901bf2c16$8a107420$262d153f@tim>
Message-ID: <382AD479.5261B43B@lemburg.com>

Tim Peters wrote:
> 
> [Mark Hammond]
> > Sure - that is what this customer wants, but we need to be clear about
> > the "best thing" for Python generally versus what this particular
> > client wants.
> > ...
> > Having a fixed, default encoding may make life slightly more difficult
> > when you want to work primarily in a different encoding, but at least
> > your system is predictable and reliable.
> 
> Well said, Mark!  Me too.  It's like HP is suffering from Windows envy
> .

See my other post on the subject...

Note that if we make UTF-8 the standard encoding, nearly all 
special Latin-1 characters will produce UTF-8 errors on input
and unreadable garbage on output. That will probably be unacceptable
in Europe. To remedy this, one would *always* have to use
u.encode('latin-1') to get readable output for Latin-1 strings
represented in Unicode.

I'd rather see this happen the other way around: *always* explicitly
state the encoding you want in case you rely on it, e.g. write

file.write(u.encode('utf-8'))

instead of

file.write(u) # let's hope this goes out as UTF-8...

Using the <default encoding> as a site-dependent setting is useful
for convenience in those cases where the output format should be
readable rather than parseable.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 15:26:59 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:26:59 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000801bf2c16$43f9a4c0$262d153f@tim>
Message-ID: <382AD233.BE6DE888@lemburg.com>

Tim Peters wrote:
> 
> [/F]
> > last time I checked, there were no characters (even in the
> > ISO standard) outside the 16-bit range.  has that changed?
> 
> [MAL]
> > No, but people are already thinking about it and there is
> > a defined range in the >16-bit area for private encodings
> > (F0000..FFFFD and 100000..10FFFD).
> 
> Over the decades I've developed a rule of thumb that has never wound up
> stuck in my ass :  If I engineer code that I expect to be in use for N
> years, I make damn sure that every internal limit is at least 10x larger
> than the largest I can conceive of a user making reasonable use of at the
> end of those N years.  The invariable result is that the N years pass, and
> fewer than half of the users have bumped into the limit <0.5 wink>.
> 
> At the risk of offending everyone, I'll suggest that, qualitatively
> speaking, Unicode is as Eurocentric as ASCII is Anglocentric.  We've just
> replaced "256 characters?!  We'll *never* run out of those!" with 64K.  But
> when Asian languages consume them 7K at a pop, 64K isn't even in my 10x
> comfort range for some individual languages.  In just a few months, Unicode
> 3 will already have used up > 56K of the 64K slots.
> 
> As I understand it, UTF-16 "only" adds 1M new code points.  That's in my 10x
> zone, for about a decade.

If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
signal failure of this assertion at Unicode object construction time
via an exception. That way we are within the standard, can use
reasonably fast code for Unicode manipulation and add those extra 1M
characters at a later stage.
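
Just to make that concrete, here is a rough sketch of the intended
construction-time check (the names are made up for illustration; the real
thing would be C code inside the Unicode constructor):

    # Hedged sketch: treat UTF-16 input as if it were UCS-2 by rejecting
    # surrogate code units at construction time.  Function name and
    # exception choice are placeholders, not part of the proposal.
    def check_ucs2(code_units):
        for u in code_units:
            if 0xD800 <= u <= 0xDFFF:      # half of a surrogate pair
                raise ValueError("character outside the UCS-2 range")
        return code_units

    check_ucs2([0x0041, 0x20AC])           # fine: 'A' and the euro sign
    # check_ucs2([0xD800, 0xDC00])         # would raise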

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 15:47:49 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 15:47:49 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991111074450.20451.rocketmail@web606.mail.yahoo.com> <199911111203.HAA24221@eric.cnri.reston.va.us>
Message-ID: <382AD715.66DBA125@lemburg.com>

Guido van Rossum wrote:
> 
> > Let me tell you why you would want to have an encoding
> > which can be set:
> >
> > (1) sday I am on a Japanese Windows box, I have a
> > string called 'address' and I do 'print address'.  If
> > I see utf8, I see garbage.  If I see Shift-JIS, I see
> > the correct Japanese address.  At this point in time,
> > utf8 is an interchange format but 99% of the world's
> > data is in various native encodings.
> >
> > Analogous problems occur on input.
> >
> > (2) I'm using htmlgen, which 'prints' objects to
> > standard output.  My web site is supposed to be
> > encoded in Shift-JIS (or EUC, or Big 5 for Taiwan,
> > etc.)  Yes, browsers CAN detect and display UTF8 but
> > you just don't find UTF8 sites in the real world - and
> > most users just don't know about the encoding menu,
> > and will get pissed off if they have to reach for it.
> >
> > Ditto for streaming output in some protocol.
> >
> > Java solves this (and we could too by hacking stdout)
> > using Writer classes which are created as wrappers
> > around an output stream and can take an encoding, but
> > you lose the flexibility to 'just print'.
> >
> > I think being able to change encoding would be useful.
> >  What I do not want is to auto-detect it from the
> > operating system when Python boots - that would be a
> > portability nightmare.
> 
> You almost convinced me there, but I think this can still be done
> without changing the default encoding: simply reopen stdout with a
> different encoding.  This is how Java does it.  I/O streams with an
> encoding specified at open() are a very powerful feature.  You can
> hide this in your $PYTHONSTARTUP.

True and it probably covers all cases where setting the
default encoding to something other than UTF-8 makes sense.

I guess you've convinced me there ;-)

The current proposal has wrappers around stream for this purpose:

For explicit handling of Unicode using files, the unicodec module
could provide stream wrappers which provide transparent
encoding/decoding for any open stream (file-like object):

  import unicodec
  file = open('mytext.txt','rb')
  ufile = unicodec.stream(file,'utf-16')
  u = ufile.read()
  ...
  ufile.close()

XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as
    short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which
    also assures that <mode> contains the 'b' character when needed.

The above can be done using:

import sys,unicodec
sys.stdin = unicodec.stream(sys.stdin,'jis')
sys.stdout = unicodec.stream(sys.stdout,'jis')
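
To give an idea of what such a wrapper could boil down to, here is a
minimal sketch in plain Python (this is not the unicodec module itself,
just an illustration; error handling and buffering are left out):

    class EncodedStream:
        # wrap an open binary stream; read() returns Unicode, write()
        # accepts Unicode, using the given encoding in both directions
        def __init__(self, stream, encoding):
            self.stream = stream
            self.encoding = encoding
        def read(self, *args):
            return self.stream.read(*args).decode(self.encoding)
        def write(self, u):
            self.stream.write(u.encode(self.encoding))
        def close(self):
            self.stream.close()

    # usage, mirroring the example above:
    # ufile = EncodedStream(open('mytext.txt', 'rb'), 'utf-16')
    # u = ufile.read()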

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From jack at oratrix.nl  Thu Nov 11 16:58:39 1999
From: jack at oratrix.nl (Jack Jansen)
Date: Thu, 11 Nov 1999 16:58:39 +0100
Subject: [Python-Dev] Internationalization Toolkit 
In-Reply-To: Message by "M.-A. Lemburg"  ,
	     Thu, 11 Nov 1999 15:23:45 +0100 , <382AD171.D22A1D6E@lemburg.com> 
Message-ID: <19991111155839.BFB0235BB1E@snelboot.oratrix.nl>

> > [MAL, on Unicode chr() and ord()
> > > ...
> > > Because unichr() will always have to return Unicode objects. You don't
> > > want chr(i) to return Unicode for i>255 and strings for i<256.

> > > OTOH, ord() could probably be extended to also work on Unicode objects.

> Fine. So I'll drop the uniord() API and extend ord() instead.

Hmm, then wouldn't it be more logical to drop unichr() too, but add an 
optional parameter to chr() to specify what sort of a string you want? The 
type-object of a unicode string comes to mind...
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From bwarsaw at cnri.reston.va.us  Thu Nov 11 17:04:29 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Thu, 11 Nov 1999 11:04:29 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
References: <19991110080926.2400.rocketmail@web602.mail.yahoo.com>
	<38295A08.D3928401@lemburg.com>
	<14377.38438.615701.231437@weyr.cnri.reston.va.us>
	<382AB9CB.634A9782@lemburg.com>
Message-ID: <14378.59661.376434.449820@anthem.cnri.reston.va.us>

>>>>> "M" == M   writes:

    M> Doesn't Python convert class exceptions to strings when -X is
    M> used ? I would guess that many scripts already rely on the
    M> class based mechanism (much of my stuff does for sure), so by
    M> the time 1.6 is out, I think -X should be considered an option
    M> to run pre 1.5 code rather than using it for performance
    M> reasons.

This is a little off-topic so I'll be brief.  When using -X Python
never even creates the class exceptions, so it isn't really a
conversion.  It just uses string exceptions and tries to craft tuples
for what would be the superclasses in the class-based exception
hierarchy.  Yes, class-based exceptions are a bit of a performance hit
when you are catching exceptions in Python (because they need to be
instantiated), but they're just so darn *useful*.  I wouldn't mind
seeing the -X option go away for 1.6.

-Barry



From captainrobbo at yahoo.com  Thu Nov 11 17:08:15 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Thu, 11 Nov 1999 08:08:15 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991111160815.5235.rocketmail@web608.mail.yahoo.com>

> See my other post on the subject...
> 
> Note that if we make UTF-8 the standard encoding,
> nearly all 
> special Latin-1 characters will produce UTF-8 errors
> on input
> and unreadable garbage on output. That will probably
> be unacceptable
> in Europe. To remedy this, one would *always* have
> to use
> u.encode('latin-1') to get readable output for
> Latin-1 strings
> repesented in Unicode.

You beat me to it - a colleague and I were just
discussing this verbally.  Specifically we Brits will
get annoyed as soon as we read in a text file with
pound (sterling) signs.

We concluded that the only reasonable default (if you
have one at all) is pure ASCII.  At least that way I
will get a clear and intelligible warning when I load
in such a file, and will remember to specify
ISO-Latin-1.  

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From mal at lemburg.com  Thu Nov 11 16:59:21 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 16:59:21 +0100
Subject: [Python-Dev] Unicode proposal: %-formatting ?
Message-ID: <382AE7D9.147D58CB@lemburg.com>

I wonder how we could add %-formatting to Unicode strings without
duplicating the PyString_Format() logic.

First, do we need Unicode object %-formatting at all ?

Second, here is an emulation using strings and the <default encoding>
that should give an idea of how one could work with the different
encodings:

    s = '%s %i abc???' # a Latin-1 encoded string
    t = (u,3)

    # Convert Latin-1 s to a <default encoding> string via Unicode
    s1 = unicode(s,'latin-1').encode()

    # The '%s' will now add u in <default encoding>
    s2 = s1 % t

    # Finally, convert the <default encoding> encoded string to Unicode
    u1 = unicode(s2)

Note that .encode() defaults to the current setting of
<default encoding>.

Provided u maps to Latin-1, an alternative would be:

    u1 = unicode('%s %i abc???' % (u.encode('latin-1'),3), 'latin-1')
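
The alternative could also be wrapped up in a small helper; a rough
sketch (the name uformat is made up, and it assumes string objects with
encode()/decode() methods rather than the final API):

    def uformat(fmt, args, encoding='latin-1'):
        # fmt is a byte string in the given encoding; Unicode arguments
        # are encoded into that encoding, plain %-formatting is applied,
        # and the result is decoded back to Unicode
        def enc(a):
            return a.encode(encoding) if isinstance(a, str) else a
        return (fmt % tuple(enc(a) for a in args)).decode(encoding)

    # uformat(b'%s %i abc', (u, 3)) would return a Unicode string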

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 18:04:37 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 18:04:37 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991111155839.BFB0235BB1E@snelboot.oratrix.nl>
Message-ID: <382AF725.FC66C9B6@lemburg.com>

Jack Jansen wrote:
> 
> > > [MAL, on Unicode chr() and ord()
> > > > ...
> > > > Because unichr() will always have to return Unicode objects. You don't
> > > > want chr(i) to return Unicode for i>255 and strings for i<256.
> 
> > > > OTOH, ord() could probably be extended to also work on Unicode objects.
> 
> > Fine. So I'll drop the uniord() API and extend ord() instead.
> 
> Hmm, then wouldn't it be more logical to drop unichr() too, but add an
> optional parameter to chr() to specify what sort of a string you want? The
> type-object of a unicode string comes to mind...

Like:

import types
uc = chr(12,types.UnicodeType)

... looks overly complicated, IMHO.

uc = unichr(12)

and

u = unicode('abc')

look pretty intuitive to me.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 11 18:31:34 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 11 Nov 1999 18:31:34 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991111160815.5235.rocketmail@web608.mail.yahoo.com>
Message-ID: <382AFD76.A0D3FEC4@lemburg.com>

Andy Robinson wrote:
> 
> > See my other post on the subject...
> >
> > Note that if we make UTF-8 the standard encoding,
> > nearly all
> > special Latin-1 characters will produce UTF-8 errors
> > on input
> > and unreadable garbage on output. That will probably
> > be unacceptable
> > in Europe. To remedy this, one would *always* have
> > to use
> > u.encode('latin-1') to get readable output for
> > Latin-1 strings
> > repesented in Unicode.
> 
> You beat me to it - a colleague and I were just
> discussing this verbally.  Specifically we Brits will
> get annoyed as soon as we read in a text file with
> pound (sterling) signs.
> 
> We concluded that the only reasonable default (if you
> have one at all) is pure ASCII.  At least that way I
> will get a clear and intelligible warning when I load
> in such a file, and will remember to specify
> ISO-Latin-1.

Well, Guido's post made me rethink the approach...

1. Setting <default encoding> to any non-UTF encoding
   will result in data lossage due to the encoding limits
   imposed by the other formats -- this is dangerous and
   will result in errors (some of which may not even be
   noticed due to the interpreter ignoring them) in case
   your strings use non-encodable characters.

2. You basically only want to set <default encoding> to
   anything other than UTF-8 for stream input and output.
   This can be done using the unicodec stream wrapper without
   too much inconvenience. (We'll have to extend the wrapper a little,
   though, because it currently only accepts Unicode objects for
   writing and always returns Unicode objects when reading.)

3. We should leave the issue open until some code is there
   to be tested... I have a feeling that there will be quite
   a few strange effects when APIs expecting strings are fed
   with Unicode objects returning UTF-8.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    50 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mhammond at skippinet.com.au  Fri Nov 12 02:10:09 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 12 Nov 1999 12:10:09 +1100
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382ABE34.5D27C701@lemburg.com>
Message-ID: <007a01bf2caa$aabdef60$0501a8c0@bobcat>

> Mark Hammond wrote:
> > Having a fixed, default encoding may make life slightly
> more difficult
> > when you want to work primarily in a different encoding,
> but at least
> > your system is predictable and reliable.
>
> I think the discussion on this is getting a little too hot.

Really - I see it as moving to a rational consensus that doesnt
support the proposal in this regard.  I see no heat in it at all.  Im
sorry if you saw my post or any of the followups as "emotional", but Im
certainly not getting passionate about this.  I dont see any of this
as affecting me personally.  I believe that I can replace my Unicode
implementation with this either way we go.  Just because we are
trying to get it right doesnt mean we are getting heated.

> The point
> is simply that the option of changing the per-thread default
encoding
> is there. You are not required to use it and if you do you are on
> your own when something breaks.

Hrm - Im having serious trouble following your logic here.  If I make
_any_ assumptions about a default encoding, I am in danger of
breaking.  I may not choose to change the default, but as soon as
_anyone_ does, unrelated code may break.

I agree that I will be "on my own", but I wont necessarily have been
the one that changed it :-(

The only answer I can see is, as you suggest, to ignore the fact that
there is _any_ default.  Always specify the encoding.  But obviously
this is not good enough for HP:

> Think of it as a HP specific feature... perhaps I should wrap the
code
> in #ifdefs and leave it undocumented.

That would work - just ensure that no standard Python has those
#ifdefs turned on :-)  I would be sorely disappointed if the fact that
HP are throwing money for this means they get every whim implemented
in the core language.  Imagine the outcry if it were instead MS'
money, and you were attempting to put an MS spin on all this.

Are you writing a module for HP, or writing a module for Python that
HP are assisting by providing some funding?  Clear difference.  IMO,
it must also be seen that there is a clear difference.

Maybe Im missing something.  Can you explain why it is good enough for
everyone else to be required to assume there is no default encoding,
but HP get their thread specific global?  Are their requirements
greater than anyone elses?  Is everyone else not as important?  What
would you, as a consultant, recommend to people who arent HP, but have
a similar requirement?  It would seem obvious to me that HPs
requirement can be met in "pure Python", thereby keeping this out of
the core altogether...

Mark.




From gmcm at hypernet.com  Fri Nov 12 03:01:23 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Thu, 11 Nov 1999 21:01:23 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <007a01bf2caa$aabdef60$0501a8c0@bobcat>
References: <382ABE34.5D27C701@lemburg.com>
Message-ID: <1269750417-7621469@hypernet.com>

[per-thread defaults]

C'mon guys, hasn't anyone ever played consultant before? The 
idea is obviously brain-dead. OTOH, they asked for it 
specifically, meaning they have some assumptions about how 
they think they're going to use it. If you give them what they 
ask for, you'll only have to fix it when they realize there are 
other ways of doing things that don't work with per-thread 
defaults. So, you find out why they think it's a good thing; you 
make it easy for them to code this way (without actually using 
per-thread defaults) and you don't make a fuss about it. More 
than likely, they won't either.

"requirements"-are-only-useful-as-clues-to-the-objectives-
behind-them-ly y'rs



- Gordon



From tim_one at email.msn.com  Fri Nov 12 06:04:44 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 00:04:44 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382AB9CB.634A9782@lemburg.com>
Message-ID: <000a01bf2ccb$6f59c2c0$fd2d153f@tim>

[MAL]
>>> Codecs should raise an UnicodeError in case the conversion is
>>> not possible.

[Fred L. Drake, Jr.]
>>   I think that should be ValueError, or UnicodeError should be a
>> subclass of ValueError.
>>   (Can the -X interpreter option be removed yet?)

[MAL]
> Doesn't Python convert class exceptions to strings when -X is
> used ? I would guess that many scripts already rely on the class
> based mechanism (much of my stuff does for sure), so by the time
> 1.6 is out, I think -X should be considered an option to run
> pre 1.5 code rather than using it for performance reasons.

-X is a red herring.  That is, do what seems best without regard for -X.  I
already added one subclass exception to the CVS tree (UnboundLocalError as a
subclass of NameError), and in doing that had to figure out how to make it
do the right thing under -X too.  It's a bit clumsy to arrange, but not a
problem.





From tim_one at email.msn.com  Fri Nov 12 06:18:09 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 00:18:09 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <382AD0FE.B604876A@lemburg.com>
Message-ID: <000e01bf2ccd$4f4b0e60$fd2d153f@tim>

[MAL]
> ...
> The conversion goes as follows:
> - for single characters (and this includes all \XXX sequences
>   except \uXXXX), take the ordinal and interpret it as Unicode ordinal
> - for \uXXXX sequences, insert the Unicode character
>   with ordinal 0xXXXX instead

Perfect!

[about "raw" Unicode strings]
> ...
> Not sure whether we really need to make this even more complicated...
> The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or
> filenames won't hurt much in the context of those \uXXXX monsters :-)

Alas, this won't stand over the long term.  Eventually people will write
Python using nothing but Unicode strings -- "regular strings" will
eventually become a backward compatibility headache <0.7 wink>.  IOW,
Unicode regexps and Unicode docstrings and Unicode formatting ops ...
nothing will escape.  Nor should it.

I don't think it all needs to be done at once, though -- existing languages
usually take years to graft in gimmicks to cover all the fine points.  So,
happy to let raw Unicode strings pass for now, as a relatively minor point,
but without agreeing it can be ignored forever.

> ...
> BTW, if you want to type in UTF-8 strings and have them converted
> to Unicode, you can use the standard:
>
> u = unicode('...string with UTF-8 encoded characters...','utf-8')

That's what I figured, and thanks for the confirmation.





From tim_one at email.msn.com  Fri Nov 12 06:42:32 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 00:42:32 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382AD233.BE6DE888@lemburg.com>
Message-ID: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>

[MAL]
> If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
> signal failure of this assertion at Unicode object construction time
> via an exception. That way we are within the standard, can use
> reasonably fast code for Unicode manipulation and add those extra 1M
> character at a later stage.

I think this is reasonable.

Using UTF-8 internally is also reasonable, and if it's being rejected on the
grounds of supposed slowness, that deserves a closer look (it's an ingenious
encoding scheme that works correctly with a surprising number of existing
8-bit string routines as-is).  Indexing UTF-8 strings is greatly speeded by
adding a simple finger (i.e., store along with the string an index+offset
pair identifying the most recent position indexed to -- since string
indexing is overwhelmingly sequential, this makes most indexing
constant-time; and UTF-8 can be scanned either forward or backward from a
random internal point because "the first byte" of each encoding is
recognizable as such).
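
To make the finger idea concrete, here is a rough sketch in Python
(purely illustrative, assuming the string is held as UTF-8 bytes; a real
implementation would live in C and could also scan backward):

    class UTF8String:
        def __init__(self, data):
            self.data = data           # UTF-8 encoded bytes
            self.finger = (0, 0)       # (character index, byte offset)
        def _advance(self, offset, nchars):
            # step forward over nchars characters starting at byte offset
            for _ in range(nchars):
                offset += 1
                while offset < len(self.data) and \
                      (self.data[offset] & 0xC0) == 0x80:
                    offset += 1        # skip continuation bytes (10xxxxxx)
            return offset
        def __getitem__(self, i):
            ci, off = self.finger
            if i < ci:                 # going backward: restart from 0
                ci, off = 0, 0
            off = self._advance(off, i - ci)
            self.finger = (i, off)     # remember position for next time
            return self.data[off:self._advance(off, 1)].decode('utf-8')

    s = UTF8String(u'h\u00e9llo'.encode('utf-8'))
    assert s[1] == u'\u00e9'           # sequential access reuses the finger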

I expect either would work well.  It's at least curious that Perl and Tcl
both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
people here saying UCS-2 is the obviously better choice are all from the
Microsoft camp .  It's not obvious to me, but then neither do I claim
that UTF-8 is obviously better.





From tim_one at email.msn.com  Fri Nov 12 07:02:01 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 01:02:01 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382AD479.5261B43B@lemburg.com>
Message-ID: <001001bf2cd3$6fa57820$fd2d153f@tim>

[MAL]
> Note that if we make UTF-8 the standard encoding, nearly all
> special Latin-1 characters will produce UTF-8 errors on input
> and unreadable garbage on output. That will probably be unacceptable
> in Europe. To remedy this, one would *always* have to use
> u.encode('latin-1') to get readable output for Latin-1 strings
> repesented in Unicode.

I think it's time for the Europeans to pronounce on what's acceptable in
Europe.  To the limited extent that I can pretend I'm European, I'm happy
with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea.

> I'd rather see this happen the other way around: *always* explicitly
> state the encoding you want in case you rely on it, e.g. write
>
> file.write(u.encode('utf-8'))
>
> instead of
>
> file.write(u) # let's hope this goes out as UTF-8...

By the same argument, those pesky Europeans who are relying on Latin-1
should write

file.write(u.encode('latin-1'))

instead of

file.write(u)  # let's hope this goes out as Latin-1

> Using the <default encoding> as a site dependent setting is useful
> for convenience in those cases where the output format should be
> readable rather than parseable.

Well, "convenience" is always the argument advanced in favor of modes.
Conflicts and nasty intermittent bugs are always the result.  The latter
will happen under Guido's idea too, as various careless modules rebind stdin
& stdout to their own ideas of what "the proper" encoding should be.  But at
least the blame doesn't fall on the core language then <0.3 wink>.

Since there doesn't appear to be anything (either good or bad) you can do
(or avoid) by using Guido's scheme instead of magical core thread state,
there's no *need* for the latter.  That is, it can be done with a user-level
API without involving the core.





From tim_one at email.msn.com  Fri Nov 12 07:17:08 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Fri, 12 Nov 1999 01:17:08 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <007a01bf2caa$aabdef60$0501a8c0@bobcat>
Message-ID: <001501bf2cd5$8c380140$fd2d153f@tim>

[Mark Hammond]
> ...
> Are you writing a module for HP, or writing a module for Python that
> HP are assisting by providing some funding?  Clear difference.  IMO,
> it must also be seen that there is a clear difference.

I can resolve this easily, but only with input from Guido.  Guido, did HP's
check clear yet?  If so, we can ignore them .





From captainrobbo at yahoo.com  Fri Nov 12 09:15:19 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Fri, 12 Nov 1999 00:15:19 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991112081519.20636.rocketmail@web603.mail.yahoo.com>

--- Gordon McMillan  wrote:
> [per-thread defaults]
> 
> C'mon guys, hasn't anyone ever played consultant
> before? The 
> idea is obviously brain-dead. OTOH, they asked for
> it 
> specifically, meaning they have some assumptions
> about how 
> they think they're going to use it. If you give them
> what they 
> ask for, you'll only have to fix it when they
> realize there are 
> other ways of doing things that don't work with
> per-thread 
> defaults. So, you find out why they think it's a
> good thing; you 
> make it easy for them to code this way (without
> actually using 
> per-thread defaults) and you don't make a fuss about
> it. More 
> than likely, they won't either.
> 

I wrote directly to ask them exactly this last night. 
Let's forget the per-thread thing until we get an
answer.

- Andy




=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From mal at lemburg.com  Fri Nov 12 10:27:29 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:27:29 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000e01bf2ccd$4f4b0e60$fd2d153f@tim>
Message-ID: <382BDD81.458D3125@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > ...
> > The conversion goes as follows:
> > - for single characters (and this includes all \XXX sequences
> >   except \uXXXX), take the ordinal and interpret it as Unicode ordinal
> > - for \uXXXX sequences, insert the Unicode character
> >   with ordinal 0xXXXX instead
> 
> Perfect!

Thanks :-)
 
> [about "raw" Unicode strings]
> > ...
> > Not sure whether we really need to make this even more complicated...
> > The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or
> > filenames won't hurt much in the context of those \uXXXX monsters :-)
> 
> Alas, this won't stand over the long term.  Eventually people will write
> Python using nothing but Unicode strings -- "regular strings" will
> eventurally become a backward compatibility headache <0.7 wink>.  IOW,
> Unicode regexps and Unicode docstrings and Unicode formatting ops ...
> nothing will escape.  Nor should it.
> 
> I don't think it all needs to be done at once, though -- existing languages
> usually take years to graft in gimmicks to cover all the fine points.  So,
> happy to let raw Unicode strings pass for now, as a relatively minor point,
> but without agreeing it can be ignored forever.

Agreed... note that you could also write your own codec for just this
reason and then use:

u = unicode('....\u1234...\...\...','raw-unicode-escaped')

Put that into a function called 'ur' and you have:

u = ur('...\u4545...\...\...')

which is not that far away from ur'...' w/r to cosmetics.

> > ...
> > BTW, if you want to type in UTF-8 strings and have them converted
> > to Unicode, you can use the standard:
> >
> > u = unicode('...string with UTF-8 encoded characters...','utf-8')
> 
> That's what I figured, and thanks for the confirmation.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 10:00:47 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:00:47 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <19991112081519.20636.rocketmail@web603.mail.yahoo.com>
Message-ID: <382BD73E.E6729C79@lemburg.com>

Andy Robinson wrote:
> 
> --- Gordon McMillan  wrote:
> > [per-thread defaults]
> >
> > C'mon guys, hasn't anyone ever played consultant
> > before? The
> > idea is obviously brain-dead. OTOH, they asked for
> > it
> > specifically, meaning they have some assumptions
> > about how
> > they think they're going to use it. If you give them
> > what they
> > ask for, you'll only have to fix it when they
> > realize there are
> > other ways of doing things that don't work with
> > per-thread
> > defaults. So, you find out why they think it's a
> > good thing; you
> > make it easy for them to code this way (without
> > actually using
> > per-thread defaults) and you don't make a fuss about
> > it. More
> > than likely, they won't either.
> >
> 
> I wrote directly to ask them exactly this last night.
> Let's forget the per-thread thing until we get an
> answer.

That's the way to go, Andy.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 10:44:14 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:44:14 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <007a01bf2caa$aabdef60$0501a8c0@bobcat>
Message-ID: <382BE16E.D17C80E1@lemburg.com>

Mark Hammond wrote:
> 
> > Mark Hammond wrote:
> > > Having a fixed, default encoding may make life slightly
> > more difficult
> > > when you want to work primarily in a different encoding,
> > but at least
> > > your system is predictable and reliable.
> >
> > I think the discussion on this is getting a little too hot.
> 
> Really - I see it as moving to a rational consensus that doesnt
> support the proposal in this regard.  I see no heat in it at all.  Im
> sorry if you saw my post or any of the followups as "emotional", but I
> certainly not getting passionate about this.  I dont see any of this
> as affecting me personally.  I believe that I can replace my Unicode
> implementation with this either way we go.  Just because a we are
> trying to get it right doesnt mean we are getting heated.

Naa... with "heated" I meant the "HP wants this, HP wants that" side
of things. We'll just have to wait for their answer on this one.

> > The point
> > is simply that the option of changing the per-thread default
> encoding
> > is there. You are not required to use it and if you do you are on
> > your own when something breaks.
> 
> Hrm - Im having serious trouble following your logic here.  If make
> _any_ assumptions about a default encoding, I am in danger of
> breaking.  I may not choose to change the default, but as soon as
> _anyone_ does, unrelated code may break.
> 
> I agree that I will be "on my own", but I wont necessarily have been
> the one that changed it :-(

Sure there are some very subtle dangers in setting the default
to anything other than the default ;-) For some this risk may
be worth taking, for others not. In fact, in large projects
I would never take such a risk... I'm sure we can get this 
message across to them.
 
> The only answer I can see is, as you suggest, to ignore the fact that
> there is _any_ default.  Always specify the encoding.  But obviously
> this is not good enough for HP:
> 
> > Think of it as a HP specific feature... perhaps I should wrap the
> code
> > in #ifdefs and leave it undocumented.
> 
> That would work - just ensure that no standard Python has those
> #ifdefs turned on :-)  I would be sorely dissapointed if the fact that
> HP are throwing money for this means they get every whim implemented
> in the core language.  Imagine the outcry if it were instead MS'
> money, and you were attempting to put an MS spin on all this.
> 
> Are you writing a module for HP, or writing a module for Python that
> HP are assisting by providing some funding?  Clear difference.  IMO,
> it must also be seen that there is a clear difference.
> 
> Maybe Im missing something.  Can you explain why it is good enough
> everyone else to be required to assume there is no default encoding,
> but HP get their thread specific global?  Are their requirements
> greater than anyone elses?  Is everyone else not as important?  What
> would you, as a consultant, recommend to people who arent HP, but have
> a similar requirement?  It would seem obvious to me that HPs
> requirement can be met in "pure Python", thereby keeping this out of
> the core all together...

Again, all I can try is to convince them that they don't really need
settable default encodings.


Since this is the first time a Python Consortium member is
pushing development, I think we can learn a lot here. For one,
it should be clear that money doesn't buy everything, OTOH,
we cannot put the whole thing at risk just because
of some minor disagreement that cannot be solved between the
parties. The standard solution for the latter should be a
customized Python interpreter.


-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 10:04:31 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:04:31 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <001001bf2cd3$6fa57820$fd2d153f@tim>
Message-ID: <382BD81F.B2BC896A@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > Note that if we make UTF-8 the standard encoding, nearly all
> > special Latin-1 characters will produce UTF-8 errors on input
> > and unreadable garbage on output. That will probably be unacceptable
> > in Europe. To remedy this, one would *always* have to use
> > u.encode('latin-1') to get readable output for Latin-1 strings
> > repesented in Unicode.
> 
> I think it's time for the Europeans to pronounce on what's acceptable in
> Europe.  To the limited extent that I can pretend I'm Eurpoean, I'm happy
> with Guido's rebind-stdin/stdout-in-PYTHONSTARTUP idea.

Agreed.
 
> > I'd rather see this happen the other way around: *always* explicitly
> > state the encoding you want in case you rely on it, e.g. write
> >
> > file.write(u.encode('utf-8'))
> >
> > instead of
> >
> > file.write(u) # let's hope this goes out as UTF-8...
> 
> By the same argument, those pesky Europeans who are relying on Latin-1
> should write
> 
> file.write(u.encode('latin-1'))
> 
> instead of
> 
> file.write(u)  # let's hope this goes out as Latin-1

Right.
 
> > Using the <default encoding> as a site dependent setting is useful
> > for convenience in those cases where the output format should be
> > readable rather than parseable.
> 
> Well, "convenience" is always the argument advanced in favor of modes.
> Conflicts and nasty intermittent bugs are always the result.  The latter
> will happen under Guido's idea too, as various careless modules rebind stdin
> & stdout to their own ideas of what "the proper" encoding should be.  But at
> least the blame doesn't fall on the core language then <0.3 wink>.
> 
> Since there doesn't appear to be anything (either or good or bad) you can do
> (or avoid) by using Guido's scheme instead of magical core thread state,
> there's no *need* for the latter.  That is, it can be done with a user-level
> API without involving the core.

Dito :-)

I have nothing against telling people to take care about the problem
in user space (meaning: not done by the core interpreter) and I'm
pretty sure that HP will agree on this too, provided we give them
the proper user space tools like file wrappers et al.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 10:16:57 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 10:16:57 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
Message-ID: <382BDB09.55583F28@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > If HP approves, I'd propose to use UTF-16 as if it were UCS-2 and
> > signal failure of this assertion at Unicode object construction time
> > via an exception. That way we are within the standard, can use
> > reasonably fast code for Unicode manipulation and add those extra 1M
> > character at a later stage.
> 
> I think this is reasonable.
> 
> Using UTF-8 internally is also reasonable, and if it's being rejected on the
> grounds of supposed slowness, that deserves a closer look (it's an ingenious
> encoding scheme that works correctly with a surprising number of existing
> 8-bit string routines as-is).  Indexing UTF-8 strings is greatly speeded by
> adding a simple finger (i.e., store along with the string an index+offset
> pair identifying the most recent position indexed to -- since string
> indexing is overwhelmingly sequential, this makes most indexing
> constant-time; and UTF-8 can be scanned either forward or backward from a
> random internal point because "the first byte" of each encoding is
> recognizable as such).

Here are some arguments for using the proposed UTF-16 strategy instead:

- all characters have the same length; indexing is fast
- conversion APIs to platform dependent wchar_t implementation are fast
  because they either can simply copy the content or extend the 2-byte
  values to 4 bytes
- UTF-8 needs 2 bytes for all the compound Latin-1 characters (e.g. u
  with two dots) which are used in many non-English languages
- from the Unicode Consortium FAQ: "Most Unicode APIs are using UTF-16."

Besides, the Unicode object will have a buffer containing the
<default encoding> representation of the object, which, if all goes
well, will always hold the UTF-8 value. RE engines etc. can then directly
work with this buffer.
 
> I expect either would work well.  It's at least curious that Perl and Tcl
> both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
> people here saying UCS-2 is the obviously better choice are all from the
> Microsoft camp .  It's not obvious to me, but then neither do I claim
> that UTF-8 is obviously better.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From gstein at lyra.org  Fri Nov 12 11:20:16 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 02:20:16 -0800 (PST)
Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit)
In-Reply-To: <382BE16E.D17C80E1@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> 
> Since this is the first time a Python Consortium member is
> pushing development, I think we can learn a lot here. For one,
> it should be clear that money doesn't buy everything, OTOH,
> we cannot put the whole thing at risk just because
> of some minor disagreement that cannot be solved between the
> parties. The standard solution for the latter should be a
> customized Python interpreter.
> 

hehe... funny you mention this. Go read the Consortium docs. Last time
that I read them, there are no "parties" to reach consensus. *Every*
technical decision regarding the Python language falls to the Technical
Director (Guido, of course). I looked. I found nothing that can override
the T.D.'s decisions and no way to force a particular decision.

Guido is still the Benevolent Dictator :-)

Cheers,
-g

p.s. yes, there is always the caveat that "sure, Guido has final say" but
"Al can fire him at will for being too stubborn" :-) ... but hey, Guido's
title does have the word Benevolent in it, so things are cool...

--
Greg Stein, http://www.lyra.org/





From gstein at lyra.org  Fri Nov 12 11:24:56 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 02:24:56 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382BE16E.D17C80E1@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> Sure there are some very subtile dangers in setting the default
> to anything other than the default ;-) For some this risk may
> be worthwhile taking, for others not. In fact, in large projects
> I would never take such a risk... I'm sure we can get this 
> message across to them.

It's a lot easier to just never provide the rope (per-thread default
encodings) in the first place.

If the feature exists, then it will be used. Period. Try to get the
message across until you're blue in the face, but it would be used.

Anyhow... discussion is pretty moot until somebody can state that it
is/isn't a "real requirement" and/or until The Guido takes a position.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Fri Nov 12 11:30:04 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 02:30:04 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
Message-ID: 

On Fri, 12 Nov 1999, Tim Peters wrote:
>...
> Using UTF-8 internally is also reasonable, and if it's being rejected on the
> grounds of supposed slowness

No... my main point was interaction with the underlying OS. I made a SWAG
(Scientific Wild Ass Guess :-) and stated that UTF-8 is probably slower
for various types of operations. As always, your infernal meddling has
dashed that hypothesis, so I must retreat...

>...
> I expect either would work well.  It's at least curious that Perl and Tcl
> both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
> people here saying UCS-2 is the obviously better choice are all from the
> Microsoft camp .  It's not obvious to me, but then neither do I claim
> that UTF-8 is obviously better.

Probably for the exact reason that you stated in your messages: many 8-bit
(7-bit?) functions continue to work quite well when given a UTF-8-encoded
string. i.e. they didn't have to rewrite the entire Perl/TCL interpreter
to deal with a new string type.

I'd guess it is a helluva lot easier for us to add a Python Type than for
Perl or TCL to whack around with new string types (since they use strings
so heavily).

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Fri Nov 12 11:30:28 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 11:30:28 +0100
Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization 
 Toolkit)
References: 
Message-ID: <382BEC44.A2541C7E@lemburg.com>

Greg Stein wrote:
> 
> On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> > 
> > Since this is the first time a Python Consortium member is
> > pushing development, I think we can learn a lot here. For one,
> > it should be clear that money doesn't buy everything, OTOH,
> > we cannot put the whole thing at risk just because
> > of some minor disagreement that cannot be solved between the
> > parties. The standard solution for the latter should be a
> > customized Python interpreter.
> > 
> 
> hehe... funny you mention this. Go read the Consortium docs. Last time
> that I read them, there are no "parties" to reach consensus. *Every*
> technical decision regarding the Python language falls to the Technical
> Director (Guido, of course). I looked. I found nothing that can override
> the T.D.'s decisions and no way to force a particular decision.
> 
> Guido is still the Benevolent Dictator :-)

Sure, but have you considered the option of a member simply bailing
out ? HP could always stop funding Unicode integration. That wouldn't
help us either...
 
> Cheers,
> -g
> 
> p.s. yes, there is always the caveat that "sure, Guido has final say" but
> "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's
> title does have the word Benevolent in it, so things are cool...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From gstein at lyra.org  Fri Nov 12 11:39:45 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 02:39:45 -0800 (PST)
Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization
  Toolkit)
In-Reply-To: <382BEC44.A2541C7E@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
>...
> Sure, but have you considered the option of a member simply bailing
> out ? HP could always stop funding Unicode integration. That wouldn't
> help us either...

I'm not that dumb... come on. That was my whole point about "Benevolent"
below... Guido is a fair and reasonable Dictator... he wouldn't let that
happen.

>...
> > p.s. yes, there is always the caveat that "sure, Guido has final say" but
> > "Al can fire him at will for being too stubborn" :-) ... but hey, Guido's
> > title does have the word Benevolent in it, so things are cool...


Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From Mike.Da.Silva at uk.fid-intl.com  Fri Nov 12 12:00:49 1999
From: Mike.Da.Silva at uk.fid-intl.com (Da Silva, Mike)
Date: Fri, 12 Nov 1999 11:00:49 -0000
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: 

Most of the ASCII string functions do indeed work for UTF-8.  I have made
extensive use of this feature when writing translation logic to harmonize
ASCII text (an SQL statement) with substitution parameters that must be
converted from IBM EBCDIC code pages (5035, 1027) into UTF8.  Since UTF-8 is
a superset of ASCII, this all works fine.
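
(A small sketch of why this works: no byte of a multi-byte UTF-8 sequence
ever falls in the ASCII range, so splitting or searching on ASCII
delimiters stays safe.)

    row = u'name;\u00a3 price;caf\u00e9'.encode('utf-8')
    fields = row.split(b';')           # b';' can never occur mid-character
    assert [f.decode('utf-8') for f in fields] == \
           [u'name', u'\u00a3 price', u'caf\u00e9']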

Some of the character classification functions etc can be flaky when used
with UTF8 characters outside the ASCII range, but simple string operations
work fine.

As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an
internal string representation are:

1.	UTF-8 allows all characters to be displayed (in some form or other)
on the users machine, with or without native fonts installed.  Naturally
anything outside the ASCII range will be garbage, but it is an immense
debugging aid when working with character encodings to be able to touch and
feel something recognizable.  Trying to decode a block of raw UTF-16 is a
pain.
2.	UTF-8 works with most existing string manipulation libraries quite
happily.  It is also portable (a char is always 8 bits, regardless of
platform; wchar_t varies between 16 and 32 bits depending on the underlying
operating system, although unsigned short does seem to work across
platforms, in my experience).
3.	UTF-16 has some advantages in providing fixed width characters and,
(ignoring surrogate pairs etc) a modeless encoding space.  This is an
advantage for fast string operations, especially on CPU's that have
efficient operations for handling 16bit data.
4.	UTF-16 would directly support a tightly coupled character properties
engine, which would enable Unicode compliant case folding and character
decomposition to be performed without an intermediate UTF-8 <----> UTF-16
translation step.
5.	UTF-16 requires string operations that do not make assumptions about
nulls - this means re-implementing most of the C runtime functions to work
with unsigned shorts.

Regards,
Mike da Silva

	-----Original Message-----
	From:	Greg Stein [SMTP:gstein at lyra.org]
	Sent:	12 November 1999 10:30
	To:	Tim Peters
	Cc:	python-dev at python.org
	Subject:	RE: [Python-Dev] Internationalization Toolkit

	On Fri, 12 Nov 1999, Tim Peters wrote:
	>...
	> Using UTF-8 internally is also reasonable, and if it's being
rejected on the
	> grounds of supposed slowness

	No... my main point was interaction with the underlying OS. I made a
SWAG
	(Scientific Wild Ass Guess :-) and stated that UTF-8 is probably
slower
	for various types of operations. As always, your infernal meddling
has
	dashed that hypothesis, so I must retreat...

	>...
	> I expect either would work well.  It's at least curious that Perl
and Tcl
	> both went with UTF-8 -- does anyone think they know *why*?  I
don't.  The
	> people here saying UCS-2 is the obviously better choice are all
from the
	> Microsoft camp .  It's not obvious to me, but then neither
do I claim
	> that UTF-8 is obviously better.

	Probably for the exact reason that you stated in your messages: many
8-bit
	(7-bit?) functions continue to work quite well when given a
UTF-8-encoded
	string. i.e. they didn't have to rewrite the entire Perl/TCL
interpreter
	to deal with a new string type.

	I'd guess it is a helluva lot easier for us to add a Python Type
than for
	Perl or TCL to whack around with new string types (since they use
strings
	so heavily).

	Cheers,
	-g

	--
	Greg Stein, http://www.lyra.org/


	_______________________________________________
	Python-Dev maillist  -  Python-Dev at python.org
	http://www.python.org/mailman/listinfo/python-dev



From fredrik at pythonware.com  Fri Nov 12 12:23:24 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 12:23:24 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com>
Message-ID: <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>

> Besides, the Unicode object will have a buffer containing the
> <default encoding> representation of the object, which, if all goes
> well, will always hold the UTF-8 value.



over my dead body, that one...

(fwiw, over the last 20 years, I've implemented about a
dozen image processing libraries, supporting loads of
pixel layouts and file formats.  one important lesson
from that is to stick to a single internal representation,
and let the application programmers build their own
layers if they need to speed things up -- yes, they're
actually happier that way.  and text strings are not
that different from pixel buffers or sound streams or
scientific data sets, after all...)

(and sticks and modes will break your bones, but you
know that...)

> RE engines etc. can then directly work with this buffer.

sidebar: the RE engine that's being developed for this
project can handle 8-bit, 16-bit, and (optionally) 32-bit
text buffers. a single compiled expression can be used
with any character size, and performance is about the
same for all sizes (at least on any decent cpu).

> > I expect either would work well.  It's at least curious that Perl and Tcl
> > both went with UTF-8 -- does anyone think they know *why*?  I don't.  The
> > people here saying UCS-2 is the obviously better choice are all from the
> > Microsoft camp .

(hey, I'm not a microsofter.  but I've been writing "i/o
libraries" for various "object types" all my life, so I do
have strong preferences on what works, and what
doesn't...  I use Python for good reasons, you know ;-)



thanks.  I feel better now.






From fredrik at pythonware.com  Fri Nov 12 12:23:38 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 12:23:38 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <027f01bf2d00$648745e0$f29b12c2@secret.pythonware.com>

> 5. UTF-16 requires string operations that do not make assumptions about
> nulls - this means re-implementing most of the C runtime functions to work
> with unsigned shorts.

footnote: the mad scientist has been there
and done that:

http://www.pythonware.com/madscientist/

(and you can replace "unsigned short" with
"whatever's suitable on this platform")






From fredrik at pythonware.com  Fri Nov 12 12:36:03 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 12:36:03 +0100
Subject: [Python-Dev] the Benevolent Dictator (was: Internationalization Toolkit)
References: 
Message-ID: <02a701bf2d02$20c66280$f29b12c2@secret.pythonware.com>

> Guido is a fair and reasonable Dictator... he wouldn't let that
> happen.

...but where is he when we need him? ;-)






From Mike.Da.Silva at uk.fid-intl.com  Fri Nov 12 12:43:21 1999
From: Mike.Da.Silva at uk.fid-intl.com (Da Silva, Mike)
Date: Fri, 12 Nov 1999 11:43:21 -0000
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: 

Fredrik Lundh wrote:

> 5. UTF-16 requires string operations that do not make assumptions about
> nulls - this means re-implementing most of the C runtime functions to work
> with unsigned shorts.

footnote: the mad scientist has been there and done that:
http://www.pythonware.com/madscientist/
 
(and you can replace "unsigned short" with "whatever's suitable on this
platform")

Surely using a different type on different platforms means that we throw
away the concept of a platform independent Unicode string?
I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits.
Does this mean that to transfer a file between a Windows box and Solaris, an
implicit conversion has to be done to go from 16 bits to 32 bits (and vice
versa)?  What about byte ordering issues?
Or do you mean whatever 16 bit data type is available on the platform, with
a standard (platform independent) byte ordering maintained?
Mike da S



From fredrik at pythonware.com  Fri Nov 12 13:16:24 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 13:16:24 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>

Mike wrote:
> Surely using a different type on different platforms means that we throw
> away the concept of a platform independent Unicode string?
> I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits.

so?  the interchange format doesn't have to be
the same as the internal format, does it?

> Does this mean that to transfer a file between a Windows box and Solaris, an
> implicit conversion has to be done to go from 16 bits to 32 bits (and vice
> versa)?  What about byte ordering issues?

no problem at all: unicode has special byte order
marks for this purpose (and utf-8 doesn't care, of
course).

> Or do you mean whatever 16 bit data type is available on the platform, with
> a standard (platform independent) byte ordering maintained?

well, my preference is a 16-bit data type in the plat-
form's native byte order (exactly how it's done in the
unicode module -- for the moment, it can use the
platform's wchar_t, but only if it happens to be a
16-bit unsigned type).  gives you good performance,
compact storage, and cleanest possible code.

...

anyway, I think it would help the discussion a little bit
if people looked at (and played with) the existing code
base.  at least that'll change arguments like "but then
we have to implement that" to "but then we have to
maintain that code" ;-)






From captainrobbo at yahoo.com  Fri Nov 12 13:13:03 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Fri, 12 Nov 1999 04:13:03 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
Message-ID: <19991112121303.27452.rocketmail@web605.yahoomail.com>

--- "Da Silva, Mike" 
wrote:
> As I see it, the relative pros and cons of UTF-8
> versus UTF-16 for use as an
> internal string representation are:
> [snip]
> Regards,
> Mike da Silva
> 

Note that by going with UTF16, we get both.  We will
certainly have a codec for utf8, just as we will for
ISO-Latin-1, Shift-JIS or whatever.  And a perfectly
ordinary Python string is a great place to hold UTF8;
you can look at it and use most of the ordinary string
algorithms on it.  

I presume no one is actually advocating dropping
ordinary Python strings, or the ability to do
   rawdata = open('myfile.txt', 'rb').read()
without any transformations?
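As a rough sketch of that point (the file name and the sample bytes
below are invented; only 1.5.2-era calls are used), a plain string can
carry UTF-8 data and ordinary byte-level operations still apply:

    import string

    rawdata = open('myfile.txt', 'rb').read()  # raw bytes, no transformation
                                               # (assumes the file exists)
    utf8 = 'Link\303\266ping'                  # UTF-8 data held in a plain string
    string.find(utf8, 'ping')                  # byte-level search works: 6
    len(utf8)                                  # 10 bytes for 9 characters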


- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.




From mhammond at skippinet.com.au  Fri Nov 12 13:27:19 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 12 Nov 1999 23:27:19 +1100
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
Message-ID: <007e01bf2d09$44738440$0501a8c0@bobcat>

/F writes
> anyway, I think it would help the discussion a little bit
> if people looked at (and played with) the existing code
> base.  at least that'll change arguments like "but then
> we have to implement that" to "but then we have to
> maintain that code" ;-)

I second that.  It is good enough for me (although my requirements
aren't stringent) - it's been used on CE, so would slot directly into
the win32 stuff.  It is pretty much the consensus of the string-sig of
last year, but as code!

The only "problem" with it is the code that hasnt been written yet,
specifically:
* Encoders as streams, and a concrete proposal for them.
* Decent PyArg_ParseTuple support and Py_BuildValue support.
* The ord(), chr() stuff, and other stuff around the edges no doubt.

Couldn't we start with Fredrik's implementation, and see how the rest
turns out?  Even if we do choose to change the underlying Unicode
implementation to use a different native encoding, the interface to
the PyUnicode_Type would remain pretty similar.  The advantage is that
we have something now to start working with for the rest of the
support we need.

Mark.




From mal at lemburg.com  Fri Nov 12 13:38:44 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 13:38:44 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.4
Message-ID: <382C0A54.E6E8328D@lemburg.com>

I've uploaded a new version of the proposal which incorporates
a lot of what has been discussed on the list.

Thanks to everybody who helped so far. Note that I have extended
the list of references for those who want to join in, but are
in need of more background information.

The latest version of the proposal is available at:

	http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

	http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

    ? support for line breaks (see
      http://www.unicode.org/unicode/reports/tr13/ )

    ? support for case conversion: 

      Problems: string lengths can change due to multiple
      characters being mapped to a single new one, capital letters
      starting a word can be different than ones occurring in the
      middle, there are locale dependent deviations from the standard
      mappings.

    ? support for numbers, digits, whitespace, etc.

    ? support (or no support) for private code point areas

    ? should Unicode objects support %-formatting ?

    One possibility would be to emulate this via strings and the
    <default encoding>:

    s = '%s %i abc???' # a Latin-1 encoded string
    t = (u,3)

    # Convert Latin-1 s to a <default encoding> string
    s1 = unicode(s,'latin-1').encode()

    # The '%s' will now add u in <default encoding>
    s2 = s1 % t

    # Finally, convert the <default encoding> encoded string to Unicode
    u1 = unicode(s2)

    ? specifying file wrappers:

    Open issues: what to do with Python strings
    fed to the .write() method (may need to know the encoding of the
    strings) and when/if to return Python strings through the .read()
    method.

    Perhaps we need more than one type of wrapper here.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 14:11:26 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 14:11:26 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
Message-ID: <382C11FE.D7D9F916@lemburg.com>

Fredrik Lundh wrote:
> 
> > Besides, the Unicode object will have a buffer containing the
> > <default encoding> representation of the object, which, if all goes
> > well, will always hold the UTF-8 value.
> 
> 
> 
> over my dead body, that one...

Such a buffer is needed to implement "s" and "s#" argument
parsing. It's a simple requirement to support those two
parsing markers -- there's not much to argue about, really...
unless, of course, you want to give up Unicode object support
for all APIs using these parsers.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Fri Nov 12 14:01:28 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 14:01:28 +0100
Subject: [Python-Dev] Internationalization Toolkit
References:  <02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
Message-ID: <382C0FA8.ACB6CCD6@lemburg.com>

Fredrik Lundh wrote:
> 
> Mike wrote:
> > Surely using a different type on different platforms means that we throw
> > away the concept of a platform independent Unicode string?
> > I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits.
> 
> so?  the interchange format doesn't have to be
> the same as the internal format, does it?

The interchange format (marshal + pickle) is defined as UTF-8,
so there's no problem with endianness or missing bits w/r to
shipping Unicode data from one platform to another.
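For example (just a sketch, assuming the unicode()/encode() API from the
proposal), a UTF-8 round-trip produces the same bytes no matter which
internal representation or byte order the sender uses:

    u = unicode("Link\366ping", "latin-1")   # build a Unicode string
    wire = u.encode("utf-8")                 # platform independent bytes
    v = unicode(wire, "utf-8")               # reconstruct on the receiving end
    assert u == v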
 
> > Does this mean that to transfer a file between a Windows box and Solaris, an
> > implicit conversion has to be done to go from 16 bits to 32 bits (and vice
> > versa)?  What about byte ordering issues?
> 
> no problem at all: unicode has special byte order
> marks for this purpose (and utf-8 doesn't care, of
> course).

Access to this mark will go into sys: sys.bom.
 
> > Or do you mean whatever 16 bit data type is available on the platform, with
> > a standard (platform independent) byte ordering maintained?
> 
> well, my preference is a 16-bit data type in the plat-
> form's native byte order (exactly how it's done in the
> unicode module -- for the moment, it can use the
> platform's wchar_t, but only if it happens to be a
> 16-bit unsigned type).  gives you good performance,
> compact storage, and cleanest possible code.

The 0.4 proposal fixes this to 16-bit unsigned short
using UTF-16 encoding with checks for surrogates. This covers
all defined standard Unicode character points, is fast, etc. pp...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 12:15:15 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 12:15:15 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
Message-ID: <382BF6C3.D79840EC@lemburg.com>

"Da Silva, Mike" wrote:
> 
> Most of the ASCII string functions do indeed work for UTF-8.  I have made
> extensive use of this feature when writing translation logic to harmonize
> ASCII text (an SQL statement) with substitution parameters that must be
> converted from IBM EBCDIC code pages (5035, 1027) into UTF8.  Since UTF-8 is
> a superset of ASCII, this all works fine.
> 
> Some of the character classification functions etc can be flaky when used
> with UTF8 characters outside the ASCII range, but simple string operations
> work fine.

That's why there's the <default encoding> buffer which holds the UTF-8
encoded value...
 
> As I see it, the relative pros and cons of UTF-8 versus UTF-16 for use as an
> internal string representation are:
> 
> 1.      UTF-8 allows all characters to be displayed (in some form or other)
> on the users machine, with or without native fonts installed.  Naturally
> anything outside the ASCII range will be garbage, but it is an immense
> debugging aid when working with character encodings to be able to touch and
> feel something recognizable.  Trying to decode a block of raw UTF-16 is a
> pain.

True.

> 2.      UTF-8 works with most existing string manipulation libraries quite
> happily.  It is also portable (a char is always 8 bits, regardless of
> platform; wchar_t varies between 16 and 32 bits depending on the underlying
> operating system (although unsigned short does seems to work across
> platforms, in my experience).

You mean with the compiler applying the needed 16->32 bit extension ?

> 3.      UTF-16 has some advantages in providing fixed width characters and,
> (ignoring surrogate pairs etc) a modeless encoding space.  This is an
> advantage for fast string operations, especially on CPU's that have
> efficient operations for handling 16bit data.

Right, and this is a major argument for using 16 bit encodings without
state internally.

> 4.      UTF-16 would directly support a tightly coupled character properties
> engine, which would enable Unicode compliant case folding and character
> decomposition to be performed without an intermediate UTF-8 <----> UTF-16
> translation step.

Could you elaborate on this one ? It is one of the open issues
in the proposal.

> 5.      UTF-16 requires string operations that do not make assumptions about
> nulls - this means re-implementing most of the C runtime functions to work
> with unsigned shorts.

AFAIK, the RE engines in Python are 8-bit clean...

BTW, wouldn't it be possible to take pcre and have it
use Py_Unicode instead of char ? [Of course, there would have to
be some extensions for character classes etc.]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fredrik at pythonware.com  Fri Nov 12 14:43:12 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 14:43:12 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com>
Message-ID: <005201bf2d13$ddd75ad0$f29b12c2@secret.pythonware.com>

> > > Besides, the Unicode object will have a buffer containing the
> > > <default encoding> representation of the object, which, if all goes
> > > well, will always hold the UTF-8 value.
> > 
> > 
> > 
> > over my dead body, that one...
> 
> Such a buffer is needed to implement "s" and "s#" argument
> parsing. It's a simple requirement to support those two
> parsing markers -- there's not much to argue about, really...

why?  I don't understand why "s" and "s#" has
to deal with encoding issues at all...

> unless, of course, you want to give up Unicode object support
> for all APIs using these parsers.

hmm.  maybe that's exactly what I want...






From fdrake at acm.org  Fri Nov 12 15:34:56 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 12 Nov 1999 09:34:56 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: <382C11FE.D7D9F916@lemburg.com>
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
	<382BDB09.55583F28@lemburg.com>
	<027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
	<382C11FE.D7D9F916@lemburg.com>
Message-ID: <14380.9616.245419.138261@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > Such a buffer is needed to implement "s" and "s#" argument
 > parsing. It's a simple requirement to support those two
 > parsing markers -- there's not much to argue about, really...
 > unless, of course, you want to give up Unicode object support
 > for all APIs using these parsers.

  Perhaps I missed the agreement that these should always receive
UTF-8 from Unicode strings.  Was this agreed upon, or has it simply
not been argued over in favor of other topics?
  If this has indeed been agreed upon... at least it can be computed
on demand rather than at initialization!  Perhaps there should be two
pointers: one to the UTF-8 buffer and one to a PyObject; if the
PyObject is there it's an "old-style" string that's actually providing
the buffer.  This may or may not be a good idea; there's a lot of
memory expense for long Unicode strings converted from UTF-8 that
aren't ever converted back to UTF-8 or accessed using "s" or "s#".
Ok, I've talked myself out of that.  ;-)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From fdrake at acm.org  Fri Nov 12 15:57:15 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 12 Nov 1999 09:57:15 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382C0FA8.ACB6CCD6@lemburg.com>
References: 
	<02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
	<382C0FA8.ACB6CCD6@lemburg.com>
Message-ID: <14380.10955.420102.327867@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > Access to this mark will go into sys: sys.bom.

  Can the name in sys be a little more descriptive?
sys.byte_order_mark would be reasonable.
  I think that a support module (possibly unicodec) should provide
constants for all four byte order marks as strings (2- & 4-byte,
little- and big-endian).  Names could be short BOM_2_LE, BOM_4_LE,
etc.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From fredrik at pythonware.com  Fri Nov 12 16:00:45 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 12 Nov 1999 16:00:45 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim><382BDB09.55583F28@lemburg.com><027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com><382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us>
Message-ID: <009101bf2d1f$21f5b490$f29b12c2@secret.pythonware.com>

Fred L. Drake, Jr.  wrote:
> M.-A. Lemburg writes:
>  > Such a buffer is needed to implement "s" and "s#" argument
>  > parsing. It's a simple requirement to support those two
>  > parsing markers -- there's not much to argue about, really...
>  > unless, of course, you want to give up Unicode object support
>  > for all APIs using these parsers.
>
>   Perhaps I missed the agreement that these should always receive
> UTF-8 from Unicode strings.

from unicode import *

def getname():
    # hidden in some database engine, or so...
    return unicode("Link?ping", "iso-8859-1")

...

name = getname()

# emulate automatic conversion to utf-8
name = str(name)

# print it in uppercase, in the usual way
import string
print string.upper(name)

## LINK??PING

I don't know, but I think that I think that it
perhaps should raise an exception instead...






From mal at lemburg.com  Fri Nov 12 16:17:43 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 16:17:43 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim> <382BDB09.55583F28@lemburg.com> <027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com> <382C11FE.D7D9F916@lemburg.com> <005201bf2d13$ddd75ad0$f29b12c2@secret.pythonware.com>
Message-ID: <382C2F97.8E7D7A4D@lemburg.com>

Fredrik Lundh wrote:
> 
> > > > Besides, the Unicode object will have a buffer containing the
> > > > <default encoding> representation of the object, which, if all goes
> > > > well, will always hold the UTF-8 value.
> > >
> > > 
> > >
> > > over my dead body, that one...
> >
> > Such a buffer is needed to implement "s" and "s#" argument
> > parsing. It's a simple requirement to support those two
> > parsing markers -- there's not much to argue about, really...
> 
> why?  I don't understand why "s" and "s#" has
> to deal with encoding issues at all...
> 
> > unless, of course, you want to give up Unicode object support
> > for all APIs using these parsers.
> 
> hmm.  maybe that's exactly what I want...

If we don't add that support, lots of existing APIs won't
accept Unicode objects instead of strings. While it could be
argued that automatic conversion to UTF-8 is not transparent
enough for the user, the other solution of using str(u)
everywhere would probably make writing Unicode-aware code a
rather clumsy task and introduce other pitfalls, since str(obj)
calls PyObject_Str() which also works on integers, floats,
etc.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 12 16:50:33 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 16:50:33 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
		<02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
		<382C0FA8.ACB6CCD6@lemburg.com> <14380.10955.420102.327867@weyr.cnri.reston.va.us>
Message-ID: <382C3749.198EEBC6@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > Access to this mark will go into sys: sys.bom.
> 
>   Can the name in sys be a little more descriptive?
> sys.byte_order_mark would be reasonable.

The abbreviation BOM is quite common w/r to Unicode.

>   I think that a support module (possibly unicodec) should provide
> constants for all four byte order marks as strings (2- & 4-byte,
> little- and big-endian).  Names could be short BOM_2_LE, BOM_4_LE,
> etc.

Good idea...

sys.bom should return the byte order mark (BOM) for the format used
internally. The unicodec module should provide symbols for all
possible values of this variable:

  BOM_BE: '\376\377' 
    (corresponds to Unicode 0x0000FEFF in UTF-16 
     == ZERO WIDTH NO-BREAK SPACE)

  BOM_LE: '\377\376' 
    (corresponds to Unicode 0x0000FFFE in UTF-16 
     == illegal Unicode character)

  BOM4_BE: '\000\000\376\377'
    (corresponds to Unicode 0x0000FEFF in UCS-4)

  BOM4_LE: '\377\376\000\000'
    (corresponds to Unicode 0x0000FFFE in UCS-4)

Note that Unicode sees big endian byte order as being "correct". The
swapped order is taken to be an indicator for a "wrong" format, hence
the illegal character definition.
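A decoder could use these constants to sniff the byte order of an
incoming UTF-16 stream; a minimal sketch (the helper function is
hypothetical, not part of the proposal):

    BOM_BE = '\376\377'
    BOM_LE = '\377\376'

    def guess_byte_order(data):
        # return 'big', 'little' or None, depending on a leading BOM
        if data[:2] == BOM_BE:
            return 'big'
        elif data[:2] == BOM_LE:
            return 'little'
        return None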

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Fri Nov 12 16:24:33 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 16:24:33 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
		<382BDB09.55583F28@lemburg.com>
		<027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
		<382C11FE.D7D9F916@lemburg.com> <14380.9616.245419.138261@weyr.cnri.reston.va.us>
Message-ID: <382C3131.A8965CA5@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > Such a buffer is needed to implement "s" and "s#" argument
>  > parsing. It's a simple requirement to support those two
>  > parsing markers -- there's not much to argue about, really...
>  > unless, of course, you want to give up Unicode object support
>  > for all APIs using these parsers.
> 
>   Perhaps I missed the agreement that these should always receive
> UTF-8 from Unicode strings.  Was this agreed upon, or has it simply
> not been argued over in favor of other topics?

It's been in the proposal since version 0.1. The idea is to
provide a decent way of making existing scripts Unicode aware.

>   If this has indeed been agreed upon... at least it can be computed
> on demand rather than at initialization!

This is what I intended to implement. The <default encoding> buffer
will be filled upon the first request to the UTF-8 encoding.
"s" and "s#" are examples of such requests. The buffer will
remain intact until the object is destroyed (since other code
could store the pointer received via e.g. "s").

> Perhaps there should be two
> pointers: one to the UTF-8 buffer and one to a PyObject; if the
> PyObject is there it's a "old-style" string that's actually providing
> the buffer.  This may or may not be a good idea; there's a lot of
> memory expense for long Unicode strings converted from UTF-8 that
> aren't ever converted back to UTF-8 or accessed using "s" or "s#".
> Ok, I've talked myself out of that.  ;-)

Note that Unicode objects are a completely different beast ;-)
String objects are not touched in any way by the proposal.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fdrake at acm.org  Fri Nov 12 17:22:24 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 12 Nov 1999 11:22:24 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382C3749.198EEBC6@lemburg.com>
References: 
	<02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
	<382C0FA8.ACB6CCD6@lemburg.com>
	<14380.10955.420102.327867@weyr.cnri.reston.va.us>
	<382C3749.198EEBC6@lemburg.com>
Message-ID: <14380.16064.723277.586881@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > The abbreviation BOM is quite common w/r to Unicode.

  Yes: "w/r to Unicode".  In sys, it's out of context and should
receive a more descriptive name.  I think using BOM in unicodec is
good.

 >   BOM_BE: '\376\377' 
 >     (corresponds to Unicode 0x0000FEFF in UTF-16 
 >      == ZERO WIDTH NO-BREAK SPACE)

  I'd also add BOM to be the same as sys.byte_order_mark.  Perhaps
even instead of sys.byte_order_mark (just to localize the areas of
code that are affected).

 > Note that Unicode sees big endian byte order as being "correct". The

  A lot of us do.  ;-)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From fdrake at acm.org  Fri Nov 12 17:28:37 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Fri, 12 Nov 1999 11:28:37 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: <382C3131.A8965CA5@lemburg.com>
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
	<382BDB09.55583F28@lemburg.com>
	<027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
	<382C11FE.D7D9F916@lemburg.com>
	<14380.9616.245419.138261@weyr.cnri.reston.va.us>
	<382C3131.A8965CA5@lemburg.com>
Message-ID: <14380.16437.71847.832880@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > It's been in the proposal since version 0.1. The idea is to
 > provide a decent way of making existing script Unicode aware.

  Ok, so I haven't read closely enough.

 > This is what I intended to implement. The <default encoding> buffer
 > will be filled upon the first request to the UTF-8 encoding.
 > "s" and "s#" are examples of such requests. The buffer will
 > remain intact until the object is destroyed (since other code
 > could store the pointer received via e.g. "s").

  Right.

 > Note that Unicode object are completely different beast ;-)
 > String object are not touched in any way by the proposal.

  I wasn't suggesting the PyStringObject be changed, only that the
PyUnicodeObject could maintain a reference.  Consider:

        s = fp.read()
        u = unicode(s, 'utf-8')

u would now hold a reference to s, and s/s# would return a pointer
into s instead of re-building the UTF-8 form.  I talked myself out of
this because it would be too easy to keep a lot more string objects
around than were actually needed.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From jack at oratrix.nl  Fri Nov 12 17:33:46 1999
From: jack at oratrix.nl (Jack Jansen)
Date: Fri, 12 Nov 1999 17:33:46 +0100
Subject: [Python-Dev] just say no... 
In-Reply-To: Message by "M.-A. Lemburg"  ,
	     Fri, 12 Nov 1999 16:24:33 +0100 , <382C3131.A8965CA5@lemburg.com> 
Message-ID: <19991112163347.5527635BB1E@snelboot.oratrix.nl>

The problem with "s" and "s#"  is that they're already semantically 
overloaded, and will become more so with support for multiple charsets.

Some modules use "s#" when they mean "give me a pointer to an area of memory 
and its length". Writing to binary files is an example of this.

Some modules use it to mean "give me a pointer to a string". Writing to a text 
file is (probably) an example of this.

Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This 
is the case if we're going to actually look at the contents (think of 
string.upper() and such).

I think that the only real solution is to define what "s" means, come up with 
new getarg-formats for the other two use cases and convert all modules to use 
the new standard. It'll still cause grief to extension modules that aren't 
part of the core, but at least the problem will go away after a while.
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From mal at lemburg.com  Fri Nov 12 19:36:55 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 19:36:55 +0100
Subject: [Python-Dev] just say no...
References: <000f01bf2cd0$b6d9a5c0$fd2d153f@tim>
		<382BDB09.55583F28@lemburg.com>
		<027e01bf2d00$56a1fdd0$f29b12c2@secret.pythonware.com>
		<382C11FE.D7D9F916@lemburg.com>
		<14380.9616.245419.138261@weyr.cnri.reston.va.us>
		<382C3131.A8965CA5@lemburg.com> <14380.16437.71847.832880@weyr.cnri.reston.va.us>
Message-ID: <382C5E47.21FB4DD@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > It's been in the proposal since version 0.1. The idea is to
>  > provide a decent way of making existing script Unicode aware.
> 
>   Ok, so I haven't read closely enough.
> 
>  > This is what I intended to implement. The <default encoding> buffer
>  > will be filled upon the first request to the UTF-8 encoding.
>  > "s" and "s#" are examples of such requests. The buffer will
>  > remain intact until the object is destroyed (since other code
>  > could store the pointer received via e.g. "s").
> 
>   Right.
> 
>  > Note that Unicode object are completely different beast ;-)
>  > String object are not touched in any way by the proposal.
> 
>   I wasn't suggesting the PyStringObject be changed, only that the
> PyUnicodeObject could maintain a reference.  Consider:
> 
>         s = fp.read()
>         u = unicode(s, 'utf-8')
> 
> u would now hold a reference to s, and s/s# would return a pointer
> into s instead of re-building the UTF-8 form.  I talked myself out of
> this because it would be too easy to keep a lot more string objects
> around than were actually needed.

Agreed. Also, the encoding would always be correct. The <default
encoding> buffer will always hold the <default encoding> version (which
should be UTF-8...).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From gstein at lyra.org  Fri Nov 12 23:19:15 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 14:19:15 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <007e01bf2d09$44738440$0501a8c0@bobcat>
Message-ID: 

On Fri, 12 Nov 1999, Mark Hammond wrote:
> Couldnt we start with Fredriks implementation, and see how the rest
> turns out?  Even if we do choose to change the underlying Unicode
> implementation to use a different native encoding, the interface to
> the PyUnicode_Type would remain pretty similar.  The advantage is that
> we have something now to start working with for the rest of the
> support we need.

I agree with "start with" here, and will go one step further (which Mark
may have implied) -- *check in* Fredrik's code.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Fri Nov 12 23:59:03 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 14:59:03 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <382C11FE.D7D9F916@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> Fredrik Lundh wrote:
> > > Besides, the Unicode object will have a buffer containing the
> > > <default encoding> representation of the object, which, if all goes
> > > well, will always hold the UTF-8 value.
> > 
> > 
> > 
> > over my dead body, that one...
> 
> Such a buffer is needed to implement "s" and "s#" argument
> parsing. It's a simple requirement to support those two
> parsing markers -- there's not much to argue about, really...
> unless, of course, you want to give up Unicode object support
> for all APIs using these parsers.

Bull!

You can easily support "s#" support by returning the pointer to the
Unicode buffer. The *entire* reason for introducing "t#" is to
differentiate between returning a pointer to an 8-bit [character] buffer
and a not-8-bit buffer.

In other words, the work done to introduce "t#" was done *SPECIFICALLY* to
allow "s#" to return a pointer to the Unicode data.

I am with Fredrik on that auxiliary buffer. You'll have two dead bodies
to deal with :-)

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Sat Nov 13 00:05:11 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 15:05:11 -0800 (PST)
Subject: [Python-Dev] just say no... 
In-Reply-To: <19991112163347.5527635BB1E@snelboot.oratrix.nl>
Message-ID: 

This was done last year!! We have "s#" meaning "give me some bytes." We
have "t#" meaning "give me some 8-bit characters." The Python distribution
has been completely updated to use the appropriate format in each call.

This was done *specifically* to support the introduction of a Unicode type.
The intent was that "s#" returns the *raw* bytes of the Unicode string --
NOT a UTF-8 encoding!

As a separate argument, MAL can argue that "t#" should create an internal,
associated buffer to hold a UTF-8 encoding and then return that. But the
"s#" should return the raw bytes!
[ and I'll argue against the response to "t#" anyhow... ]

-g

On Fri, 12 Nov 1999, Jack Jansen wrote:
> The problem with "s" and "s#"  is that they're already semantically 
> overloaded, and will become more so with support for multiple charsets.
> 
> Some modules use "s#" when they mean "give me a pointer to an area of memory 
> and its length". Writing to binary files is an example of this.
> 
> Some modules use it to mean "give me a pointer to a string". Writing to a text 
> file is (probably) an example of this.
> 
> Some modules use it to mean "give me a pointer to an 8-bit ASCII string". This 
> is the case if we're going to actually look at the contents (think of 
> string.upper() and such).
> 
> I think that the only real solution is to define what "s" means, come up with 
> new getarg-formats for the other two use cases and convert all modules to use 
> the new standard. It'll still cause grief to extension modules that aren't 
> part of the core, but at least the problem will go away after a while.
> --
> Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
> Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
> www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 
> 
> 
> 
> 

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Sat Nov 13 00:09:13 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 15:09:13 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <382C2F97.8E7D7A4D@lemburg.com>
Message-ID: 

On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> Fredrik Lundh wrote:
>...
> > why?  I don't understand why "s" and "s#" has
> > to deal with encoding issues at all...
> > 
> > > unless, of course, you want to give up Unicode object support
> > > for all APIs using these parsers.
> > 
> > hmm.  maybe that's exactly what I want...
> 
> If we don't add that support, lot's of existing APIs won't
> accept Unicode object instead of strings. While it could be
> argued that automatic conversion to UTF-8 is not transparent
> enough for the user, the other solution of using str(u)
> everywhere would probably make writing Unicode-aware code a
> rather clumsy task and introduce other pitfalls, since str(obj)
> calls PyObject_Str() which also works on integers, floats,
> etc.

No no no...

"s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are
supposed to return the raw bytes.

If a caller wants 8-bit characters, then that caller will use "t#".

If you want to argue for that separate, encoded buffer, then argue for it
for support for the "t#" format. But do NOT say that it is needed for "s#"
which simply means "give me some bytes."

-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Sat Nov 13 00:26:08 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 15:26:08 -0800 (PST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <14380.16064.723277.586881@weyr.cnri.reston.va.us>
Message-ID: 

On Fri, 12 Nov 1999, Fred L. Drake, Jr. wrote:
> M.-A. Lemburg writes:
>  > The abbreviation BOM is quite common w/r to Unicode.

True.

>   Yes: "w/r to Unicode".  In sys, it's out of context and should
> receive a more descriptive name.  I think using BOM in unicodec is
> good.

I agree and believe that we can avoid putting it into sys altogether.

>  >   BOM_BE: '\376\377' 
>  >     (corresponds to Unicode 0x0000FEFF in UTF-16 
>  >      == ZERO WIDTH NO-BREAK SPACE)

Are you sure about that interpretation? I thought the BOM characters
(0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space.

>   I'd also add BOM to be the same as sys.byte_order_mark.  Perhaps
> even instead of sys.byte_order_mark (just to localize the areas of
> code that are affected).

### unicodec.py ###
import struct

BOM = struct.pack('h', 0x0000FEFF)
BOM_BE = '\376\377'
...


If somebody needs the BOM, then they should go to unicodec.py (or some
other module). I do not believe we need to put that stuff into the sys
module. It is just too easy to create the value in Python.

Cheers,
-g

p.s. to be pedantic, the pack() format could be '@h'

--
Greg Stein, http://www.lyra.org/




From mhammond at skippinet.com.au  Sat Nov 13 00:41:16 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sat, 13 Nov 1999 10:41:16 +1100
Subject: [Python-Dev] just say no... 
In-Reply-To: 
Message-ID: <008601bf2d67$6a9982b0$0501a8c0@bobcat>

[Greg writes]

> As a separate argument, MAL can argue that "t#" should create
> an internal,
> associated buffer to hold a UTF-8 encoding and then return
> that. But the
> "s#" should return the raw bytes!
> [ and I'll argue against the response to "t#" anyhow... ]

Hmm.  Climbing over these dead bodies could get a bit smelly :-)

I'm inclined to agree that holding 2 internal buffers for the unicode
object is not ideal.  However, I _am_ concerned with getting decent
PyArg_ParseTuple and Py_BuildValue support, and if the cost is an
extra buffer I will survive.  So let's look for solutions that don't
require it, rather than holding it up as evil when no other solution
is obvious.

My requirements appear to me to be very simple (for an anglophile):

Let's say I have a platform Unicode value - e.g., I got a Unicode value
from some external library (say COM :-)  Let's assume for now that the
Unicode string is fully representable as ASCII  - say a file or
directory name that COM gave me.  I simply want to be able to pass
this Unicode object to "open()", and have it work.  This assumes that
open() will not become "native unicode", simply as the underlying C
support is not unicode aware - it needs to be converted to a "char *"
(ie, will use the "t#" format)

The second side of the equation is when I expose a Python function
that talks Unicode - eg, I need to _pass_ a platform Unicode value to
an external library.  The Python programmer should be able to pass a
Unicode object (no problem), or a PyString object.

In code terms:
Prob1:
  name = SomeComObject.GetFileName() # A Unicode object
  f = open(name)
Prob2:
  SomeComObject.SetFileName("foo.txt")

IMO it is important that we have a good strategy for dealing with this
for extensions.  MAL addresses one direction, but not the other.

Maybe if we toss around general solutions for this the implementation
will fall out.  MAL's idea of the additional buffer starts to address
this, but isn't the whole story.

Any ideas on this?




From gstein at lyra.org  Sat Nov 13 01:49:34 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 12 Nov 1999 16:49:34 -0800 (PST)
Subject: [Python-Dev] argument parsing (was: just say no...)
In-Reply-To: <008601bf2d67$6a9982b0$0501a8c0@bobcat>
Message-ID: 

On Sat, 13 Nov 1999, Mark Hammond wrote:
>...
> Im inclined to agree that holding 2 internal buffers for the unicode
> object is not ideal.  However, I _am_ concerned with getting decent
> PyArg_ParseTuple and Py_BuildValue support, and if the cost is an
> extra buffer I will survive.  So lets look for solutions that dont
> require it, rather than holding it up as evil when no other solution
> is obvious.

I believe Py_BuildValue is pretty straight-forward. Simply state that it
is allowed to perform conversions and place the resulting object into the
resulting tuple.
(with appropriate refcounting)

In other words:

  tuple = Py_BuildValue("U", stringOb);

The stringOb will be converted to a Unicode object. The new Unicode object
will go into the tuple (with the tuple holding the only reference!). The
stringOb will NOT acquire any additional references.

[ "U" format may be wrong; it is here for example purposes ]


Okay... now the PyArg_ParseTuple() is the *real* kicker.

>...
> Prob1:
>   name = SomeComObject.GetFileName() # A Unicode object
>   f = open(name)
> Prob2:
>   SomeComObject.SetFileName("foo.txt")

Both of these issues are due to PyArg_ParseTuple. In Prob1, you want a
string-like object which can be passed to the OS as an 8-bit string. In
Prob2, you want a string-like object which can be passed to the OS as a
Unicode string.

I see three options for PyArg_ParseTuple:

1) allow it to return NEW objects which must be DECREF'd.
   [ current policy only loans out references ]

   This option could be difficult in the presence of errors during the
   parse. For example, the current idiom is:

     if (!PyArg_ParseTuple(args, "..."))
        return NULL;

   If an object was produced, but then a later argument caused a failure,
   then who is responsible for freeing the object?

2) like step 1, but PyArg_ParseTuple is smart enough to NOT return any new
   objects when an error occurred.

   This basically answers the last question in option (1) -- ParseTuple is
   responsible.

3) Return loaned-out-references to objects which have been tested for
   convertability. Helper functions perform the conversion and the caller
   will then free the reference.
   [ this is the model used in PyWin32 ]

   Code in PyWin32 typically looks like:

     if (!PyArg_ParseTuple(args, "O", &ob))
       return NULL;
     if ((unicodeOb = GiveMeUnicode(ob)) == NULL)
       return NULL;
     ...
     Py_DECREF(unicodeOb);

   [ GiveMeUnicode is descriptive here; I forget the name used in PyWin32 ]

   In a "real" situation, the ParseTuple format would be "U" and the
   object would be type-tested for PyStringType or PyUnicodeType.

   Note that GiveMeUnicode() would also do a type-test, but it can't
   produce a *specific* error like ParseTuple (e.g. "string/unicode object
   expected" vs "parameter 3 must be a string/unicode object")

Are there more options? Anybody?


All three of these avoid the secondary buffer. The last is cleanest
w.r.t. keeping the existing "loaned references" behavior, but can get a bit
wordy when you need to convert a bunch of string arguments.

Option (2) adds a good amount of complexity to PyArg_ParseTuple -- it
would need to keep a "free list" in case an error occurred.

Option (1) adds DECREF logic to callers to ensure they clean up. The add'l
logic isn't much more than the other two options (the only change is
adding DECREFs before returning NULL from the "if (!PyArg_ParseTuple..."
condition). Note that the caller would probably need to initialize each
object to NULL before calling ParseTuple.


Personally, I prefer (3) as it makes it very clear that a new object has
been created and must be DECREF'd at some point. Also note that
GiveMeUnicode() could also accept a second argument for the type of
decoding to do (or NULL meaning "UTF-8").

Oh: note there are equivalents of all options for going from
unicode-to-string; the above is all about string-to-unicode. However, the
tricky part of unicode-to-string is determining whether backwards
compatibility will be a requirement. i.e. does existing code that uses the
"t" format suddenly achieve the capability to accept a Unicode object?
This obviously causes problems in all three options: since a new reference
must be created to handle the situation, then who DECREF's it? The old
code certainly doesn't.
[  I'm with Fredrik in saying "no, old code *doesn't* suddenly get
  the ability to accept a Unicode object." The Python code must use str() to
  do the encoding manually (until the old code is upgraded to one of the
  above three options).  ]

I think that's it for me. In the several years I've been thinking on this
problem, I haven't come up with anything but the above three. There may be
a whole new paradigm for argument parsing, but I haven't tried to think on
that one (and just fit in around ParseTuple).

Cheers,
-g

--
Greg Stein, http://www.lyra.org/






From mal at lemburg.com  Fri Nov 12 19:49:52 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 12 Nov 1999 19:49:52 +0100
Subject: [Python-Dev] Internationalization Toolkit
References: 
		<02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
		<382C0FA8.ACB6CCD6@lemburg.com>
		<14380.10955.420102.327867@weyr.cnri.reston.va.us>
		<382C3749.198EEBC6@lemburg.com> <14380.16064.723277.586881@weyr.cnri.reston.va.us>
Message-ID: <382C6150.53BDC803@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > The abbreviation BOM is quite common w/r to Unicode.
> 
>   Yes: "w/r to Unicode".  In sys, it's out of context and should
> receive a more descriptive name.  I think using BOM in unicodec is
> good.

Guido proposed to add it to sys. I originally had it defined in
unicodec.

Perhaps a sys.endian would be more appropriate for sys,
with values 'little' and 'big', or '<' and '>' to conform
to the struct module.

unicodec could then define unicodec.bom depending on the setting
in sys.
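Roughly like this (a sketch only; neither sys.endian nor unicodec.bom
exists yet, and the '<'/'>' values are the ones suggested above):

    # unicodec.py (sketch)
    import sys

    BOM_BE = '\376\377'
    BOM_LE = '\377\376'

    if sys.endian == '>':      # big endian platform
        bom = BOM_BE
    else:                      # '<', little endian platform
        bom = BOM_LE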

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    49 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Sat Nov 13 10:37:35 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Sat, 13 Nov 1999 10:37:35 +0100
Subject: [Python-Dev] just say no...
References: 
Message-ID: <382D315F.A7ADEC42@lemburg.com>

Greg Stein wrote:
> 
> On Fri, 12 Nov 1999, M.-A. Lemburg wrote:
> > Fredrik Lundh wrote:
> >...
> > > why?  I don't understand why "s" and "s#" has
> > > to deal with encoding issues at all...
> > >
> > > > unless, of course, you want to give up Unicode object support
> > > > for all APIs using these parsers.
> > >
> > > hmm.  maybe that's exactly what I want...
> >
> > If we don't add that support, lot's of existing APIs won't
> > accept Unicode object instead of strings. While it could be
> > argued that automatic conversion to UTF-8 is not transparent
> > enough for the user, the other solution of using str(u)
> > everywhere would probably make writing Unicode-aware code a
> > rather clumsy task and introduce other pitfalls, since str(obj)
> > calls PyObject_Str() which also works on integers, floats,
> > etc.
> 
> No no no...
> 
> "s" and "s#" are NOT SUPPOSED TO return a UTF-8 encoding. They are
> supposed to return the raw bytes.

[I've waited quite some time for you to chime in on this one ;-)]

Let me summarize a bit on the general ideas behind "s", "s#"
and the extra buffer:

First, we have a general design question here: should old code
become Unicode compatible or not. As I recall the original idea
about Unicode integration was to follow Perl's idea to have
scripts become Unicode aware by simply adding a 'use utf8;'.

If this is still the case, then we'll have to come up with a
reasonable approach for integrating classical string based
APIs with the new type.

Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g.
the Latin-1 folks) which has some very nice features (see
http://czyborra.com/utf/ ) and which is a true extension of ASCII,
this encoding seems best fit for the purpose.

However, one should not forget that UTF-8 is in fact a
variable length encoding of Unicode characters, that is, up to
3 bytes may form a *single* character. This is obviously not compatible
with definitions that explicitly state data to be using a
8-bit single character encoding, e.g. indexing in UTF-8 doesn't
work like it does in Latin-1 text.

So if we are to do the integration, we'll have to choose
argument parser markers that allow for multi byte characters.
"t#" does not fall into this category, "s#" certainly does,
"s" is argueable.

Also note that we have to watch out for embedded NULL bytes.
UTF-16 has NULL bytes for every character from the Latin-1
domain. If "s" were to give back a pointer to the internal
buffer which is encoded in UTF-16, you would lose data.
UTF-8 doesn't have this problem, since only NULL bytes
map to (single) NULL bytes.
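To make both problems concrete (a sketch with hand-spelled byte values;
no Unicode API is needed for it):

    latin1 = 'Link\366ping'            # 9 characters, 9 bytes in Latin-1
    utf8   = 'Link\303\266ping'        # same text, 10 bytes in UTF-8
    utf16  = '\000L\000i\000n\000k'    # "Link" in UTF-16 (big endian)

    len(latin1), len(utf8)             # (9, 10): byte index != character index
    utf8[4]                            # '\303' -- only half of the o-umlaut
    '\000' in utf16                    # 1: embedded NULs, fatal for "s"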

Now Greg would chime in with the buffer interface and
argue that it should make the underlying internal
format accessible. This is a bad idea, IMHO, since you
shouldn't really have to know what the internal data format
is.

Defining "s#" to return UTF-8 data does not only
make "s" and "s#" return the same data format (which should
always be the case, IMO), but also hides the internal
format from the user and gives him a reliable cross-platform
data representation of Unicode data (note that UTF-8 doesn't
have the byte order problems of UTF-16).

If you are still with me, let's look at what "s" and "s#"
do: they return pointers into data areas which have to
be kept alive until the corresponding object dies.

The only way to support this feature is by allocating
a buffer for just this purpose (on the fly and only if
needed to prevent excessive memory load). The other
options of adding new magic parser markers or switching
to a more generic one all have one downside: you need to
change existing code which is in conflict with the idea
we started out with.

So, again, the question is: do we want this magical
integration or not ? Note that this is a design question,
not one of memory consumption...

--

Ok, the above covered Unicode -> String conversion. Mark
mentioned that he wanted the other way around to also
work in the same fashion, ie. automatic String -> Unicode
conversion. 

This could also be done in the same way by
interpreting the string as UTF-8 encoded Unicode... but we
have the same problem: where to put the data without
generating new intermediate objects. Since only newly
written code will use this feature there is a way to do
this though:

PyArg_ParseTuple(args,"s#",&utf8,&len);

If your C API understands UTF-8 there's nothing more to do,
if not, take Greg's option 3 approach:

PyArg_ParseTuple(args,"O",&obj);
unicode = PyUnicode_FromObject(obj);
...
Py_DECREF(unicode);

Here PyUnicode_FromObject() will return a new
reference if obj is a Unicode object, or create a new
Unicode object by interpreting str(obj) as a UTF-8 encoded string.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    48 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From guido at CNRI.Reston.VA.US  Sat Nov 13 13:12:41 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Sat, 13 Nov 1999 07:12:41 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Fri, 12 Nov 1999 14:59:03 PST."
              
References:  
Message-ID: <199911131212.HAA25895@eric.cnri.reston.va.us>

> I am with Fredrik on that auxilliary buffer. You'll have two dead bodies
> to deal with :-)

I haven't made up my mind yet (due to a very successful
Python-promoting visit to SD'99 east, I'm about 100 msgs behind in
this thread alone) but let me warn you that I can deal with the
carnage, if necessary. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gstein at lyra.org  Sat Nov 13 13:23:54 1999
From: gstein at lyra.org (Greg Stein)
Date: Sat, 13 Nov 1999 04:23:54 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <199911131212.HAA25895@eric.cnri.reston.va.us>
Message-ID: 

On Sat, 13 Nov 1999, Guido van Rossum wrote:
> > I am with Fredrik on that auxilliary buffer. You'll have two dead bodies
> > to deal with :-)
> 
> I haven't made up my mind yet (due to a very successful
> Python-promoting visit to SD'99 east, I'm about 100 msgs behind in
> this thread alone) but let me warn you that I can deal with the
> carnage, if necessary. :-)

Bring it on, big boy!

:-)

--
Greg Stein, http://www.lyra.org/




From mhammond at skippinet.com.au  Sat Nov 13 13:52:18 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sat, 13 Nov 1999 23:52:18 +1100
Subject: [Python-Dev] argument parsing (was: just say no...)
In-Reply-To: 
Message-ID: <00b301bf2dd5$ec4df840$0501a8c0@bobcat>

[Lamenting about PyArg_ParseTuple and managing memory buffers for
String/Unicode conversions.]

So what is really wrong with Marc's proposal about the extra pointer
on the Unicode object?  And to double the carnage, why not add the
equivalent native Unicode buffer to the PyString object?

These would only ever be filled when requested by the conversion
routines.  They have no other effect than their memory is managed by
the object itself; simply a convenience to avoid having extension
modules manage the conversion buffers.

The only overheads appear to be:
* The conversion buffers may be slightly (or much :-) longer-lived -
ie, they are not freed until the object itself is freed.
* String object slightly bigger, and slightly slower to destroy.

It appears to solve the problems, and the cost doesn't seem too high...

Mark.




From guido at CNRI.Reston.VA.US  Sat Nov 13 14:06:26 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Sat, 13 Nov 1999 08:06:26 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Sat, 13 Nov 1999 10:37:35 +0100."
             <382D315F.A7ADEC42@lemburg.com> 
References:   
            <382D315F.A7ADEC42@lemburg.com> 
Message-ID: <199911131306.IAA26030@eric.cnri.reston.va.us>

I think I have a reasonable grasp of the issues here, even though I
still haven't read about 100 msgs in this thread.  Note that t# and
the charbuffer addition to the buffer API were added by Greg Stein
with my support; I'll attempt to reconstruct our thinking at the
time...

[MAL]
> Let me summarize a bit on the general ideas behind "s", "s#"
> and the extra buffer:

I think you left out t#.

> First, we have a general design question here: should old code
> become Unicode compatible or not. As I recall the original idea
> about Unicode integration was to follow Perl's idea to have
> scripts become Unicode aware by simply adding a 'use utf8;'.

I've never heard of this idea before -- or am I taking it too literally?
It smells of a mode to me :-)  I'd rather live in a world where
Unicode just works as long as you use u'...' literals or whatever
convention we decide.

> If this is still the case, then we'll have to come with a
> resonable approach for integrating classical string based
> APIs with the new type.
> 
> Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g.
> the Latin-1 folks) which has some very nice features (see
> http://czyborra.com/utf/ ) and which is a true extension of ASCII,
> this encoding seems best fit for the purpose.

Yes, especially if we fix the default encoding as UTF-8.  (I'm
expecting feedback from HP on this next week, hopefully when I see the
details, it'll be clear that we don't need a per-thread default encoding
to solve their problems; that's quite a likely outcome.  If not, we
have a real-world argument for allowing a variable default encoding,
without carnage.)

> However, one should not forget that UTF-8 is in fact a
> variable length encoding of Unicode characters, that is up to
> 3 bytes form a *single* character. This is obviously not compatible
> with definitions that explicitly state data to be using a
> 8-bit single character encoding, e.g. indexing in UTF-8 doesn't
> work like it does in Latin-1 text.

Sure, but where in current Python are there such requirements?

> So if we are to do the integration, we'll have to choose
> argument parser markers that allow for multi byte characters.
> "t#" does not fall into this category, "s#" certainly does,
> "s" is argueable.

I disagree.  I grepped through the source for s# and t#.  Here's a bit
of background.  Before t# was introduced, s# was being used for two
distinct purposes: (1) to get an 8-bit text string plus its length, in
situations where the length was needed; (2) to get binary data (e.g.
GIF data read from a file in "rb" mode).  Greg pointed out that if we
ever introduced some form of Unicode support, these two had to be
disambiguated.  We found that the majority of uses was for (2)!
Therefore we decided to change the definition of s# to mean only (2),
and introduced t# to mean (1).  Also, we introduced getcharbuffer
corresponding to t#, while getreadbuffer was meant for s#.

Note that the definition of the 's' format was left alone -- as
before, it means you need an 8-bit text string not containing null
bytes.
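
To make the distinction concrete, here is a minimal sketch (hypothetical
function names, not code from the distribution) of how an extension
would use the two markers:

    /* Sketch only: two hypothetical extension functions showing the
       intended s#/t# split. */
    #include "Python.h"

    static PyObject *
    write_binary(PyObject *self, PyObject *args)
    {
        char *data;
        int len;

        /* s# -- raw bytes (e.g. GIF data read in "rb" mode); for a
           Unicode object this would hand back the internal format. */
        if (!PyArg_ParseTuple(args, "s#", &data, &len))
            return NULL;
        /* ... pass the len bytes along unchanged ... */
        Py_INCREF(Py_None);
        return Py_None;
    }

    static PyObject *
    write_text(PyObject *self, PyObject *args)
    {
        char *text;
        int len;

        /* t# -- 8-bit text (possibly a multibyte encoding); for a
           Unicode object this would hand back an 8-bit translation
           with the same lifetime as the original object. */
        if (!PyArg_ParseTuple(args, "t#", &text, &len))
            return NULL;
        /* ... treat the len bytes as character data ... */
        Py_INCREF(Py_None);
        return Py_None;
    }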

Our expectation was that a Unicode string passed to an s# situation
would give a pointer to the internal format plus a byte count (not a
character count!) while t# would get a pointer to some kind of 8-bit
translation/encoding plus a byte count, with the explicit requirement
that the 8-bit translation would have the same lifetime as the
original unicode object.  We decided to leave it up to the next
generation (i.e., Marc-Andre :-) to decide what kind of translation to
use and what to do when there is no reasonable translation.

Any of the following choices is acceptable (from the point of view of
not breaking the intended t# semantics; we can now start deciding
which we like best):

- utf-8
- latin-1
- ascii
- shift-jis
- lower byte of unicode ordinal
- some user- or os-specified multibyte encoding

As far as t# is concerned, for encodings that don't encode all of
Unicode, untranslatable characters could be dealt with in any number
of ways (raise an exception, ignore, replace with '?', make best
effort, etc.).

Given the current context, it should probably be the same as the
default encoding -- i.e., utf-8.  If we end up making the default
user-settable, we'll have to decide what to do with untranslatable
characters -- but that will probably be decided by the user too (it
would be a property of a specific translation specification).

In any case, I feel that t# could receive a multi-byte encoding, 
s# should receive raw binary data, and they should correspond to
getcharbuffer and getreadbuffer, respectively.

(Aside: the symmetry between 's' and 's#' is now lost; 's' matches
't#', there's no match for 's#'.)

> Also note that we have to watch out for embedded NULL bytes.
> UTF-16 has NULL bytes for every character from the Latin-1
> domain. If "s" were to give back a pointer to the internal
> buffer which is encoded in UTF-16, you would loose data.
> UTF-8 doesn't have this problem, since only NULL bytes
> map to (single) NULL bytes.

This is a red herring given my explanation above.

> Now Greg would chime in with the buffer interface and
> argue that it should make the underlying internal
> format accessible. This is a bad idea, IMHO, since you
> shouldn't really have to know what the internal data format
> is.

This is for C code.  Quite likely it *does* know what the internal
data format is!

> Defining "s#" to return UTF-8 data does not only
> make "s" and "s#" return the same data format (which should
> always be the case, IMO),

That was before t# was introduced.  No more, alas.  If you replace s#
with t#, I agree with you completely.

> but also hides the internal
> format from the user and gives him a reliable cross-platform
> data representation of Unicode data (note that UTF-8 doesn't
> have the byte order problems of UTF-16).
> 
> If you are still with, let's look at what "s" and "s#"

(and t#, which is more relevant here)

> do: they return pointers into data areas which have to
> be kept alive until the corresponding object dies.
> 
> The only way to support this feature is by allocating
> a buffer for just this purpose (on the fly and only if
> needed to prevent excessive memory load). The other
> options of adding new magic parser markers or switching
> to more generic one all have one downside: you need to
> change existing code which is in conflict with the idea
> we started out with.

Agreed.  I think this was our thinking when Greg & I introduced t#.
My own preference would be to allocate a whole string object, not
just a buffer; this could then also be used for the .encode() method
using the default encoding.

> So, again, the question is: do we want this magical
> intergration or not ? Note that this is a design question,
> not one of memory consumption...

Yes, I want it.

Note that this doesn't guarantee that all old extensions will work
flawlessly when passed Unicode objects; but I think that it covers
most cases where you could have a reasonable expectation that it
works.

(Hm, unfortunately many reasonable expectations seem to involve
the current user's preferred encoding. :-( )

> --
> 
> Ok, the above covered Unicode -> String conversion. Mark
> mentioned that he wanted the other way around to also
> work in the same fashion, ie. automatic String -> Unicode
> conversion. 
> 
> This could also be done in the same way by
> interpreting the string as UTF-8 encoded Unicode... but we
> have the same problem: where to put the data without
> generating new intermediate objects. Since only newly
> written code will use this feature there is a way to do
> this though:
> 
> PyArg_ParseTuple(args,"s#",&utf8,&len);

No!  That is supposed to give the native representation of the string
object.

I agree that Mark's problem requires a solution too, but it doesn't
have to use existing formatting characters, since there's no backwards
compatibility issue.

> If your C API understands UTF-8 there's nothing more to do,
> if not, take Greg's option 3 approach:
> 
> PyArg_ParseTuple(args,"O",&obj);
> unicode = PyUnicode_FromObject(obj);
> ...
> Py_DECREF(unicode);
> 
> Here PyUnicode_FromObject() will return a new
> reference if obj is an Unicode object or create a new
> Unicode object by interpreting str(obj) as UTF-8 encoded string.

This might work.
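
Fleshed out a little, the pattern would look something like this (a
sketch; PyUnicode_FromObject is the proposed constructor, not an
existing API, and the function name is made up):

    #include "Python.h"

    static PyObject *
    takes_text(PyObject *self, PyObject *args)
    {
        PyObject *obj, *unicode;

        if (!PyArg_ParseTuple(args, "O", &obj))
            return NULL;
        /* New reference: either the Unicode object itself, or a new
           Unicode object built by interpreting str(obj) as UTF-8. */
        unicode = PyUnicode_FromObject(obj);
        if (unicode == NULL)
            return NULL;
        /* ... work on it through the PyUnicode_* APIs ... */
        Py_DECREF(unicode);
        Py_INCREF(Py_None);
        return Py_None;
    }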

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Sat Nov 13 14:06:35 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Sat, 13 Nov 1999 14:06:35 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.5
References: <382C0A54.E6E8328D@lemburg.com>
Message-ID: <382D625B.DC14DBDE@lemburg.com>

FYI, I've uploaded a new version of the proposal which incorporates
proposals for line breaks, case mapping, character properties and
private code points support.

The latest version of the proposal is available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

    ? should Unicode objects support %-formatting ?

    One possibility would be to emulate this via strings and the
    default encoding:

    s = '%s %i abc???' # a Latin-1 encoded string
    t = (u,3)

    # Convert the Latin-1 s to a default-encoding string
    s1 = unicode(s,'latin-1').encode()

    # The '%s' will now add u in the default encoding
    s2 = s1 % t

    # Finally, convert the default-encoding string back to Unicode
    u1 = unicode(s2)

    ? specifying file wrappers:

    Open issues: what to do with Python strings
    fed to the .write() method (may need to know the encoding of the
    strings) and when/if to return Python strings through the .read()
    method.

    Perhaps we need more than one type of wrapper here.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    48 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From jack at oratrix.nl  Sat Nov 13 17:40:34 1999
From: jack at oratrix.nl (Jack Jansen)
Date: Sat, 13 Nov 1999 17:40:34 +0100
Subject: [Python-Dev] just say no... 
In-Reply-To: Message by Greg Stein  ,
	     Fri, 12 Nov 1999 15:05:11 -0800 (PST) ,  
Message-ID: <19991113164039.9B697EA11A@oratrix.oratrix.nl>

Recently, Greg Stein  said:
> This was done last year!! We have "s#" meaning "give me some bytes." We
> have "t#" meaning "give me some 8-bit characters." The Python distribution
> has been completely updated to use the appropriate format in each call.

Oops...

I remember the discussion but I wasn't aware that someone had actually
_implemented_ this:-). Part of my misunderstanding was also caused by
the fact that I inspected what I thought would be the prime candidate
for t#: file.write() to a non-binary file, and it doesn't use the new
format.

I also noted a few inconsistencies at first glance, by the way: most
modules seem to use s# for things like filenames and other
data-that-is-readable-but-shouldn't-be-messed-with, but binascii is an 
exception and it uses t# for uuencoded strings...
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 



From guido at CNRI.Reston.VA.US  Sat Nov 13 20:20:51 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Sat, 13 Nov 1999 14:20:51 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Sat, 13 Nov 1999 17:40:34 +0100."
             <19991113164039.9B697EA11A@oratrix.oratrix.nl> 
References: <19991113164039.9B697EA11A@oratrix.oratrix.nl> 
Message-ID: <199911131920.OAA26165@eric.cnri.reston.va.us>

> I remember the discussion but I wasn't aware that somone had actually
> _implemented_ this:-). Part of my misunderstanding was also caused by
> the fact that I inspected what I thought would be the prime candidate
> for t#: file.write() to a non-binary file, and it doesn't use the new
> format.

I guess that's because file.write() doesn't distinguish between text
and binary files.  Maybe it should: the current implementation
together with my proposed semantics for Unicode strings would mean that
printing a unicode string (to stdout) would dump the internal encoding
to the file.  I guess it should do so only when the file is opened in
binary mode; for files opened in text mode it should use an encoding
(opening a file can specify an encoding; can we change the encoding of
an existing file?).

> I also noted a few inconsistencies at first glance, by the way: most
> modules seem to use s# for things like filenames and other
> data-that-is-readable-but-shouldn't-be-messed-with, but binascii is an 
> exception and it uses t# for uuencoded strings...

Actually, binascii seems to do it right: s# for binary data, t# for
text (uuencoded, hqx, base64).  That is, the b2a variants use s# while
the a2b variants use t#.  The only thing I'm not sure about in that
module are binascii_rledecode_hqx() and binascii_rlecode_hqx() -- I
don't understand where these stand in the complexity of binhex
en/decoding.

--Guido van Rossum (home page: http://www.python.org/~guido/)




From mal at lemburg.com  Sun Nov 14 23:11:54 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Sun, 14 Nov 1999 23:11:54 +0100
Subject: [Python-Dev] just say no...
References:   
	            <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us>
Message-ID: <382F33AA.C3EE825A@lemburg.com>

Guido van Rossum wrote:
> 
> I think I have a reasonable grasp of the issues here, even though I
> still haven't read about 100 msgs in this thread.  Note that t# and
> the charbuffer addition to the buffer API were added by Greg Stein
> with my support; I'll attempt to reconstruct our thinking at the
> time...
>
> [MAL]
> > Let me summarize a bit on the general ideas behind "s", "s#"
> > and the extra buffer:
> 
> I think you left out t#.

On purpose -- according to my thinking. I see "t#" as an interface
to bf_getcharbuffer, which I understand as an 8-bit character buffer...
UTF-8 is a multi byte encoding. It still is character data, but
not necessarily 8 bits in length (up to 24 bits are used).

Anyway, I'm not really interested in having an argument about
this. If you say, "t#" fits the purpose, then that's fine with
me. Still, we should clearly define that "t#" returns
text data and "s#" binary data. Encoding, bit length, etc. should
explicitly remain left undefined.

> > First, we have a general design question here: should old code
> > become Unicode compatible or not. As I recall the original idea
> > about Unicode integration was to follow Perl's idea to have
> > scripts become Unicode aware by simply adding a 'use utf8;'.
> 
> I've never heard of this idea before -- or am I taking it too literal?
> It smells of a mode to me :-)  I'd rather live in a world where
> Unicode just works as long as you use u'...' literals or whatever
> convention we decide.
> 
> > If this is still the case, then we'll have to come with a
> > resonable approach for integrating classical string based
> > APIs with the new type.
> >
> > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g.
> > the Latin-1 folks) which has some very nice features (see
> > http://czyborra.com/utf/ ) and which is a true extension of ASCII,
> > this encoding seems best fit for the purpose.
> 
> Yes, especially if we fix the default encoding as UTF-8.  (I'm
> expecting feedback from HP on this next week, hopefully when I see the
> details, it'll be clear that don't need a per-thread default encoding
> to solve their problems; that's quite a likely outcome.  If not, we
> have a real-world argument for allowing a variable default encoding,
> without carnage.)

Fair enough :-)
 
> > However, one should not forget that UTF-8 is in fact a
> > variable length encoding of Unicode characters, that is up to
> > 3 bytes form a *single* character. This is obviously not compatible
> > with definitions that explicitly state data to be using a
> > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't
> > work like it does in Latin-1 text.
> 
> Sure, but where in current Python are there such requirements?

It was my understanding that "t#" refers to single byte character
data. That's where the above arguments were aiming at...
 
> > So if we are to do the integration, we'll have to choose
> > argument parser markers that allow for multi byte characters.
> > "t#" does not fall into this category, "s#" certainly does,
> > "s" is argueable.
> 
> I disagree.  I grepped through the source for s# and t#.  Here's a bit
> of background.  Before t# was introduced, s# was being used for two
> distinct purposes: (1) to get an 8-bit text string plus its length, in
> situations where the length was needed; (2) to get binary data (e.g.
> GIF data read from a file in "rb" mode).  Greg pointed out that if we
> ever introduced some form of Unicode support, these two had to be
> disambiguated.  We found that the majority of uses was for (2)!
> Therefore we decided to change the definition of s# to mean only (2),
> and introduced t# to mean (1).  Also, we introduced getcharbuffer
> corresponding to t#, while getreadbuffer was meant for s#.

I know it's too late now, but I can't really follow the arguments
here: in what ways are (1) and (2) different from the implementation's
point of view ? If "t#" is to return UTF-8 then the length of the
UTF-8 buffer will not equal the length of the Unicode object, so both
parser markers return essentially the same information. The only
difference would be
on the semantic side: (1) means: give me text data, while (2) does
not specify the data type.

Perhaps I'm missing something...
 
> Note that the definition of the 's' format was left alone -- as
> before, it means you need an 8-bit text string not containing null
> bytes.

This definition should then be changed to "text string without
null bytes" dropping the 8-bit reference.
 
> Our expectation was that a Unicode string passed to an s# situation
> would give a pointer to the internal format plus a byte count (not a
> character count!) while t# would get a pointer to some kind of 8-bit
> translation/encoding plus a byte count, with the explicit requirement
> that the 8-bit translation would have the same lifetime as the
> original unicode object.  We decided to leave it up to the next
> generation (i.e., Marc-Andre :-) to decide what kind of translation to
> use and what to do when there is no reasonable translation.

Hmm, I would strongly object to making "s#" return the internal
format. file.write() would then default to writing UTF-16 data
instead of UTF-8 data. This could result in strange errors
due to the UTF-16 format being endian dependent.

It would also break the symmetry between file.write(u) and
unicode(file.read()), since the default encoding is not used as
internal format for other reasons (see proposal).

> Any of the following choices is acceptable (from the point of view of
> not breaking the intended t# semantics; we can now start deciding
> which we like best):

I think we have already agreed on using UTF-8 for the default
encoding. It has quite a few advantages. See

	http://czyborra.com/utf/

for a good overview of the pros and cons.

> - utf-8
> - latin-1
> - ascii
> - shift-jis
> - lower byte of unicode ordinal
> - some user- or os-specified multibyte encoding
> 
> As far as t# is concerned, for encodings that don't encode all of
> Unicode, untranslatable characters could be dealt with in any number
> of ways (raise an exception, ignore, replace with '?', make best
> effort, etc.).

The usual Python way would be: raise an exception. This is what
the proposal defines for Codecs in case an encoding/decoding
mapping is not possible, BTW. (UTF-8 will always succeed on
output.)
 
> Given the current context, it should probably be the same as the
> default encoding -- i.e., utf-8.  If we end up making the default
> user-settable, we'll have to decide what to do with untranslatable
> characters -- but that will probably be decided by the user too (it
> would be a property of a specific translation specification).
> 
> In any case, I feel that t# could receive a multi-byte encoding,
> s# should receive raw binary data, and they should correspond to
> getcharbuffer and getreadbuffer, respectively.

Why would you want to have "s#" return the raw binary data for
Unicode objects ? 

Note that it is not mentioned anywhere that
"s#" and "t#" do have to necessarily return different things
(binary being a superset of text). I'd opt for "s#" and "t#" both
returning UTF-8 data. This can be implemented by delegating the
buffer slots to the  object (see below).

> > Now Greg would chime in with the buffer interface and
> > argue that it should make the underlying internal
> > format accessible. This is a bad idea, IMHO, since you
> > shouldn't really have to know what the internal data format
> > is.
> 
> This is for C code.  Quite likely it *does* know what the internal
> data format is!

C code can use the PyUnicode_* APIs to access the data. I
don't think that argument parsing is powerful enough to
provide the C code with enough information about the data
contents, e.g. it can only state the encoding length, not the
string length.
 
> > Defining "s#" to return UTF-8 data does not only
> > make "s" and "s#" return the same data format (which should
> > always be the case, IMO),
> 
> That was before t# was introduced.  No more, alas.  If you replace s#
> with t#, I agree with you completely.

Done :-)
 
> > but also hides the internal
> > format from the user and gives him a reliable cross-platform
> > data representation of Unicode data (note that UTF-8 doesn't
> > have the byte order problems of UTF-16).
> >
> > If you are still with, let's look at what "s" and "s#"
> 
> (and t#, which is more relevant here)
> 
> > do: they return pointers into data areas which have to
> > be kept alive until the corresponding object dies.
> >
> > The only way to support this feature is by allocating
> > a buffer for just this purpose (on the fly and only if
> > needed to prevent excessive memory load). The other
> > options of adding new magic parser markers or switching
> > to more generic one all have one downside: you need to
> > change existing code which is in conflict with the idea
> > we started out with.
> 
> Agreed.  I think this was our thinking when Greg & I introduced t#.
> My own preference would be to allocate a whole string object, not
> just a buffer; this could then also be used for the .encode() method
> using the default encoding.

Good point. I'll change  to , a Python
string object created on request.
 
> > So, again, the question is: do we want this magical
> > intergration or not ? Note that this is a design question,
> > not one of memory consumption...
> 
> Yes, I want it.
> 
> Note that this doesn't guarantee that all old extensions will work
> flawlessly when passed Unicode objects; but I think that it covers
> most cases where you could have a reasonable expectation that it
> works.
> 
> (Hm, unfortunately many reasonable expectations seem to involve
> the current user's preferred encoding. :-( )

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    47 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From amk1 at erols.com  Mon Nov 15 02:49:08 1999
From: amk1 at erols.com (A.M. Kuchling)
Date: Sun, 14 Nov 1999 20:49:08 -0500
Subject: [Python-Dev] PyErr_Format security note
Message-ID: <199911150149.UAA00408@mira.erols.com>

I noticed this in PyErr_Format(exception, format, va_alist):

	char buffer[500]; /* Caller is responsible for limiting the format */
	...
	vsprintf(buffer, format, vargs);

Making the caller responsible for this is error-prone.  The danger, of
course, is a buffer overflow caused by generating an error string
that's larger than the buffer, possibly letting people execute
arbitrary code.  We could add a test to the configure script for
vsnprintf() and use it when possible, but that only fixes the problem
on platforms which have it.  Can we find an implementation of
vsnprintf() someplace?
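
For reference, a sketch of the kind of fix being asked for, assuming
some vsnprintf() is available (the platform's or a borrowed one):

    /* Sketch only -- not the current implementation. */
    #include "Python.h"
    #include <stdarg.h>
    #include <stdio.h>

    PyObject *
    PyErr_Format(PyObject *exception, const char *format, ...)
    {
        char buffer[500];
        va_list vargs;

        va_start(vargs, format);
        /* vsnprintf() truncates the message instead of overrunning
           the buffer. */
        vsnprintf(buffer, sizeof(buffer), format, vargs);
        va_end(vargs);

        PyErr_SetString(exception, buffer);
        return NULL;
    }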

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
One form to rule them all, one form to find them, one form to bring them all
and in the darkness rewrite the hell out of them.
    -- Digital Equipment Corporation, in a comment from SENDMAIL Ruleset 3




From gstein at lyra.org  Mon Nov 15 03:11:39 1999
From: gstein at lyra.org (Greg Stein)
Date: Sun, 14 Nov 1999 18:11:39 -0800 (PST)
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: <199911150149.UAA00408@mira.erols.com>
Message-ID: 

On Sun, 14 Nov 1999, A.M. Kuchling wrote:
> Making the caller responsible for this is error-prone.  The danger, of
> course, is a buffer overflow caused by generating an error string
> that's larger than the buffer, possibly letting people execute
> arbitrary code.  We could add a test to the configure script for
> vsnprintf() and use it when possible, but that only fixes the problem
> on platforms which have it.  Can we find an implementation of
> vsnprintf() someplace?

Apache has a safe implementation (they have reviewed the heck out of it
for obvious reasons :-).

In the Apache source distribution, it is located in src/ap/ap_snprintf.c.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Mon Nov 15 09:09:07 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 09:09:07 +0100
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com>
Message-ID: <382FBFA3.B28B8E1E@lemburg.com>

"A.M. Kuchling" wrote:
> 
> I noticed this in PyErr_Format(exception, format, va_alist):
> 
>         char buffer[500]; /* Caller is responsible for limiting the format */
>         ...
>         vsprintf(buffer, format, vargs);
> 
> Making the caller responsible for this is error-prone.  The danger, of
> course, is a buffer overflow caused by generating an error string
> that's larger than the buffer, possibly letting people execute
> arbitrary code.  We could add a test to the configure script for
> vsnprintf() and use it when possible, but that only fixes the problem
> on platforms which have it.  Can we find an implementation of
> vsnprintf() someplace?

In sysmodule.c, this check is done which should be safe enough
since no "return" is issued (Py_FatalError() does an abort()):

  if (vsprintf(buffer, format, va) >= sizeof(buffer))
    Py_FatalError("PySys_WriteStdout/err: buffer overrun");


-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From gstein at lyra.org  Mon Nov 15 10:28:06 1999
From: gstein at lyra.org (Greg Stein)
Date: Mon, 15 Nov 1999 01:28:06 -0800 (PST)
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: <382FBFA3.B28B8E1E@lemburg.com>
Message-ID: 

On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
>...
> In sysmodule.c, this check is done which should be safe enough
> since no "return" is issued (Py_FatalError() does an abort()):
> 
>   if (vsprintf(buffer, format, va) >= sizeof(buffer))
>     Py_FatalError("PySys_WriteStdout/err: buffer overrun");

I believe the vsprintf() call itself would be the problem -- the
buffer has already been overrun by the time its return value is checked.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Mon Nov 15 10:49:26 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 10:49:26 +0100
Subject: [Python-Dev] PyErr_Format security note
References: 
Message-ID: <382FD726.6ACB912F@lemburg.com>

Greg Stein wrote:
> 
> On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
> >...
> > In sysmodule.c, this check is done which should be safe enough
> > since no "return" is issued (Py_FatalError() does an abort()):
> >
> >   if (vsprintf(buffer, format, va) >= sizeof(buffer))
> >     Py_FatalError("PySys_WriteStdout/err: buffer overrun");
> 
> I believe the return from vsprintf() itself would be the problem.

Ouch, yes, you are right... but who could exploit this security
hole ? Since PyErr_Format() is only reachable from C code, only
bad programming style in extensions could make it exploitable
via user input.

Wouldn't it be possible to assign thread globals for these
functions to use ? These would live on the heap instead of
on the stack and eliminate the buffer overrun possibilities
(I guess -- I don't have any experience with these...).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From akuchlin at mems-exchange.org  Mon Nov 15 16:17:58 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Mon, 15 Nov 1999 10:17:58 -0500 (EST)
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: <382FD726.6ACB912F@lemburg.com>
References: 
	<382FD726.6ACB912F@lemburg.com>
Message-ID: <14384.9254.152604.11688@amarok.cnri.reston.va.us>

M.-A. Lemburg writes:
>Ouch, yes, you are right... but who could exploit this security
>hole ? Since PyErr_Format() is only reachable for C code, only
>bad programming style in extensions could make it exploitable
>via user input.

99% of security holes arise out of carelessness, and besides, this
buffer size doesn't seem to be documented in either api.tex or
ext.tex.  I'll look into borrowing Apache's implementation and
modifying it into a varargs form.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
I can also withstand considerably more G-force than most people, even though I
do say so myself.
    -- The Doctor, in "The Ambassadors of Death"




From guido at CNRI.Reston.VA.US  Mon Nov 15 16:23:57 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 10:23:57 -0500
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: Your message of "Sun, 14 Nov 1999 20:49:08 EST."
             <199911150149.UAA00408@mira.erols.com> 
References: <199911150149.UAA00408@mira.erols.com> 
Message-ID: <199911151523.KAA27163@eric.cnri.reston.va.us>

> I noticed this in PyErr_Format(exception, format, va_alist):
> 
> 	char buffer[500]; /* Caller is responsible for limiting the format */
> 	...
> 	vsprintf(buffer, format, vargs);
> 
> Making the caller responsible for this is error-prone.

Agreed.  The limit of 500 chars, while technically undocumented, is
part of the specs for PyErr_Format (which is currently wholly
undocumented).  The current callers all have explicit precautions, but
of course I agree that this is a potential danger.

> The danger, of
> course, is a buffer overflow caused by generating an error string
> that's larger than the buffer, possibly letting people execute
> arbitrary code.  We could add a test to the configure script for
> vsnprintf() and use it when possible, but that only fixes the problem
> on platforms which have it.  Can we find an implementation of
> vsnprintf() someplace?

Assuming that Linux and Solaris have vsnprintf(), can't we just use
the configure script to detect it, and issue a warning blaming the
platform for those platforms that don't have it?  That seems much
simpler (from a maintenance perspective) than carrying our own
implementation around (even if we can borrow the Apache version).
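
Concretely, that would boil down to something like this sketch
(format_message is a made-up helper; HAVE_VSNPRINTF would come from a
configure check such as AC_CHECK_FUNCS(vsnprintf)):

    #include <stdarg.h>
    #include <stdio.h>

    /* Sketch only: bounded formatting where the platform allows it. */
    static void
    format_message(char *buffer, size_t size, const char *format,
                   va_list vargs)
    {
    #ifdef HAVE_VSNPRINTF
        /* Bounded: the message is truncated rather than overflowing. */
        vsnprintf(buffer, size, format, vargs);
    #else
        /* No bounded call available; the caller stays responsible
           for limiting the format, as today. */
        vsprintf(buffer, format, vargs);
    #endif
    }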

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fdrake at acm.org  Mon Nov 15 16:24:27 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Mon, 15 Nov 1999 10:24:27 -0500 (EST)
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382C6150.53BDC803@lemburg.com>
References: 
	<02f901bf2d07$bdf5d950$f29b12c2@secret.pythonware.com>
	<382C0FA8.ACB6CCD6@lemburg.com>
	<14380.10955.420102.327867@weyr.cnri.reston.va.us>
	<382C3749.198EEBC6@lemburg.com>
	<14380.16064.723277.586881@weyr.cnri.reston.va.us>
	<382C6150.53BDC803@lemburg.com>
Message-ID: <14384.9643.145759.816037@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > Guido proposed to add it to sys. I originally had it defined in
 > unicodec.

  Well, he clearly didn't ask me!  ;-)

 > Perhaps a sys.endian would be more appropriate for sys
 > with values 'little' and 'big' or '<' and '>' to be conform
 > to the struct module.
 > 
 > unicodec could then define unicodec.bom depending on the setting
 > in sys.

  This seems more reasonable, though I'd go with BOM instead of bom.
But that's a style issue, so not so important.  If you write bom,
I'll write bom.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From captainrobbo at yahoo.com  Mon Nov 15 16:30:45 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Mon, 15 Nov 1999 07:30:45 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
Message-ID: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>

Some thoughts on the codecs...

1. Stream interface
At the moment a codec has dump and load methods which
read a (slice of a) stream into a string in memory and
vice versa.  As the proposal notes, this could lead to
errors if you take a slice out of a stream.   This is
not just due to character truncation; some Asian
encodings are modal and have shift-in and shift-out
sequences as they move from Western single-byte
characters to double-byte ones.   It also seems a bit
pointless to me as the source (or target) is still a
Unicode string in memory.

This is a real problem - a filter to convert big files
between two encodings should be possible without
knowledge of the particular encoding, as should one on
the input/output of some server.  We can still give a
default implementation for single-byte encodings.

What's a good API for real stream conversion?   just
Codec.encodeStream(infile, outfile)  ?  or is it more
useful to feed the codec with data a chunk at a time?


2. Data driven codecs
I really like codecs being objects, and believe we
could build support for a lot more encodings, a lot
sooner than is otherwise possible, by making them data
driven rather than making each one compiled C code with
static mapping tables.  What do people think about the
approach below?

First of all, the ISO8859-1 series are straight
mappings to Unicode code points.  So one Python script
could parse these files and build the mapping table,
and a very small data file could hold these encodings.
  A compiled helper function analogous to
string.translate() could deal with most of them.

Secondly, the double-byte ones involve a mixture of
algorithms and data.  The worst cases I know are modal
encodings which need a single-byte lookup table, a
double-byte lookup table, and have some very simple
rules about escape sequences in between them.  A
simple state machine could still handle these (and the
single-byte mappings above become extra-simple special
cases); I could imagine feeding it a totally
data-driven set of rules.  

Third, we can massively compress the mapping tables
using a notation which just lists contiguous ranges;
and very often there are relationships between
encodings.  For example, "cpXYZ is just like cpXYY but
with an extra 'smiley' at 0XFE32".  In these cases, a
script can build a family of related codecs in an
auditable manner. 

3. What encodings to distribute?
The only clean answers to this are 'almost none', or
'everything that Unicode 3.0 has a mapping for'.  The
latter is going to add some weight to the
distribution.  What are people's feelings?  Do we ship
any at all apart from the Unicode ones?  Should new
encodings be downloadable from www.python.org?  Should
there be an optional package outside the main
distribution?

Thanks,

Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From akuchlin at mems-exchange.org  Mon Nov 15 16:36:47 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Mon, 15 Nov 1999 10:36:47 -0500 (EST)
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: <199911151523.KAA27163@eric.cnri.reston.va.us>
References: <199911150149.UAA00408@mira.erols.com>
	<199911151523.KAA27163@eric.cnri.reston.va.us>
Message-ID: <14384.10383.718373.432606@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>Assuming that Linux and Solaris have vsnprintf(), can't we just use
>the configure script to detect it, and issue a warning blaming the
>platform for those platforms that don't have it?  That seems much

But people using an already-installed Python binary won't see any such
configure-time warning, and won't find out about the potential
problem.  Plus, how do people fix the problem on platforms that don't
have vsnprintf() -- switch to Solaris or Linux?  Not much of a
solution.  (vsnprintf() isn't ANSI C, though it's a common extension,
so platforms that lack it aren't really deficient.)

Hmm... could we maybe use Python's existing (string % vars) machinery?
No, that seems to be hard, because it would want PyObjects, and we
can't know what Python types to convert the varargs to, unless we
parse the format string (at which point we may as well get a
vsnprintf() implementation).

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
A successful tool is one that was used to do something undreamed of by its
author.
    -- S.C. Johnson




From guido at CNRI.Reston.VA.US  Mon Nov 15 16:50:24 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 10:50:24 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Sun, 14 Nov 1999 23:11:54 +0100."
             <382F33AA.C3EE825A@lemburg.com> 
References:  <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us>  
            <382F33AA.C3EE825A@lemburg.com> 
Message-ID: <199911151550.KAA27188@eric.cnri.reston.va.us>

> On purpose -- according to my thinking. I see "t#" as an interface
> to bf_getcharbuf which I understand as 8-bit character buffer...
> UTF-8 is a multi byte encoding. It still is character data, but
> not necessarily 8 bits in length (up to 24 bits are used).
> 
> Anyway, I'm not really interested in having an argument about
> this. If you say, "t#" fits the purpose, then that's fine with
> me. Still, we should clearly define that "t#" returns
> text data and "s#" binary data. Encoding, bit length, etc. should
> explicitly remain left undefined.

Thanks for not picking an argument.  Multibyte encodings typically
have ASCII as a subset (in such a way that an ASCII string is
represented as itself in bytes).  This is the characteristic that's
needed in my view.

> > > First, we have a general design question here: should old code
> > > become Unicode compatible or not. As I recall the original idea
> > > about Unicode integration was to follow Perl's idea to have
> > > scripts become Unicode aware by simply adding a 'use utf8;'.
> > 
> > I've never heard of this idea before -- or am I taking it too literal?
> > It smells of a mode to me :-)  I'd rather live in a world where
> > Unicode just works as long as you use u'...' literals or whatever
> > convention we decide.
> > 
> > > If this is still the case, then we'll have to come with a
> > > resonable approach for integrating classical string based
> > > APIs with the new type.
> > >
> > > Since UTF-8 is a standard (some would probably prefer UTF-7,5 e.g.
> > > the Latin-1 folks) which has some very nice features (see
> > > http://czyborra.com/utf/ ) and which is a true extension of ASCII,
> > > this encoding seems best fit for the purpose.
> > 
> > Yes, especially if we fix the default encoding as UTF-8.  (I'm
> > expecting feedback from HP on this next week, hopefully when I see the
> > details, it'll be clear that don't need a per-thread default encoding
> > to solve their problems; that's quite a likely outcome.  If not, we
> > have a real-world argument for allowing a variable default encoding,
> > without carnage.)
> 
> Fair enough :-)
>  
> > > However, one should not forget that UTF-8 is in fact a
> > > variable length encoding of Unicode characters, that is up to
> > > 3 bytes form a *single* character. This is obviously not compatible
> > > with definitions that explicitly state data to be using a
> > > 8-bit single character encoding, e.g. indexing in UTF-8 doesn't
> > > work like it does in Latin-1 text.
> > 
> > Sure, but where in current Python are there such requirements?
> 
> It was my understanding that "t#" refers to single byte character
> data. That's where the above arguments were aiming at...

t# refers to byte-encoded data.  Multibyte encodings are explicitly
designed to be passed cleanly through processing steps that handle
single-byte character data, as long as they are 8-bit clean and don't
do too much processing.

> > > So if we are to do the integration, we'll have to choose
> > > argument parser markers that allow for multi byte characters.
> > > "t#" does not fall into this category, "s#" certainly does,
> > > "s" is argueable.
> > 
> > I disagree.  I grepped through the source for s# and t#.  Here's a bit
> > of background.  Before t# was introduced, s# was being used for two
> > distinct purposes: (1) to get an 8-bit text string plus its length, in
> > situations where the length was needed; (2) to get binary data (e.g.
> > GIF data read from a file in "rb" mode).  Greg pointed out that if we
> > ever introduced some form of Unicode support, these two had to be
> > disambiguated.  We found that the majority of uses was for (2)!
> > Therefore we decided to change the definition of s# to mean only (2),
> > and introduced t# to mean (1).  Also, we introduced getcharbuffer
> > corresponding to t#, while getreadbuffer was meant for s#.
> 
> I know it's too late now, but I can't really follow the arguments
> here: in what ways are (1) and (2) different from the implementation's
> point of view ? If "t#" is to return UTF-8 then the length of the
> UTF-8 buffer will not equal the length of the Unicode object, so both
> parser markers return essentially the same information. The only
> difference would be
> on the semantic side: (1) means: give me text data, while (2) does
> not specify the data type.
> 
> Perhaps I'm missing something...

The idea is that (1)/s# disallows any translation of the data, while
(2)/t# requires translation of the data to an ASCII superset (possibly
multibyte, such as UTF-8 or shift-JIS).  (2)/t# assumes that the data
contains text and that if the text consists of only ASCII characters
they are represented as themselves.  (1)/s# makes no such assumption.

In terms of implementation, Unicode objects should translate
themselves to the default encoding for t# (if possible), but they
should make the native representation available for s#.

For example, take an encryption engine.  While it is defined in terms
of byte streams, there's no requirement that the bytes represent
characters -- they could be the bytes of a GIF file, an MP3 file, or a
gzipped tar file.  If we pass Unicode to an encryption engine, we want
Unicode to come out at the other end, not UTF-8.  (If we had wanted to
encrypt UTF-8, we should have fed it UTF-8.)

> > Note that the definition of the 's' format was left alone -- as
> > before, it means you need an 8-bit text string not containing null
> > bytes.
> 
> This definition should then be changed to "text string without
> null bytes" dropping the 8-bit reference.

Aha, I think there's a confusion about what "8-bit" means.  For me, a
multibyte encoding like UTF-8 is still 8-bit.  Am I alone in this?
(As far as I know, C uses char* to represent multibyte characters.)
Maybe we should disambiguate it more explicitly?

> > Our expectation was that a Unicode string passed to an s# situation
> > would give a pointer to the internal format plus a byte count (not a
> > character count!) while t# would get a pointer to some kind of 8-bit
> > translation/encoding plus a byte count, with the explicit requirement
> > that the 8-bit translation would have the same lifetime as the
> > original unicode object.  We decided to leave it up to the next
> > generation (i.e., Marc-Andre :-) to decide what kind of translation to
> > use and what to do when there is no reasonable translation.
> 
> Hmm, I would strongly object to making "s#" return the internal
> format. file.write() would then default to writing UTF-16 data
> instead of UTF-8 data. This could result in strange errors
> due to the UTF-16 format being endian dependent.

But this was the whole design.  file.write() needs to be changed to
use s# when the file is open in binary mode and t# when the file is
open in text mode.
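
A rough sketch of that change (the f_binary flag and the argument type
are made up for illustration; error handling omitted):

    /* Sketch only: file.write() picking the parser marker from the
       mode the file was opened with. */
    #include "Python.h"

    static PyObject *
    file_write(PyFileObject *f, PyObject *args)
    {
        char *s;
        int n;

        if (f->f_binary) {
            /* binary mode: raw bytes, no translation (s#) */
            if (!PyArg_ParseTuple(args, "s#", &s, &n))
                return NULL;
        }
        else {
            /* text mode: 8-bit text; a Unicode argument would be
               translated to the default encoding (t#) */
            if (!PyArg_ParseTuple(args, "t#", &s, &n))
                return NULL;
        }
        fwrite(s, 1, n, f->f_fp);
        Py_INCREF(Py_None);
        return Py_None;
    }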

> It would also break the symmetry between file.write(u) and
> unicode(file.read()), since the default encoding is not used as
> internal format for other reasons (see proposal).

If the file is encoded using UTF-16 or UCS-2, you should open it in
binary mode and use unicode(file.read(), 'utf-16').  (Or perhaps the
app should read the first 2 bytes and check for a BOM and then decide
to choose between 'utf-16-be' and 'utf-16-le'.)

> > Any of the following choices is acceptable (from the point of view of
> > not breaking the intended t# semantics; we can now start deciding
> > which we like best):
> 
> I think we have already agreed on using UTF-8 for the default
> encoding. It has quite a few advantages. See
> 
> 	http://czyborra.com/utf/
> 
> for a good overview of the pros and cons.

Of course.  I was just presenting the list as an argument that if
we changed our mind about the default encoding, t# should follow the
default encoding (and not pick an encoding by other means).

> > - utf-8
> > - latin-1
> > - ascii
> > - shift-jis
> > - lower byte of unicode ordinal
> > - some user- or os-specified multibyte encoding
> > 
> > As far as t# is concerned, for encodings that don't encode all of
> > Unicode, untranslatable characters could be dealt with in any number
> > of ways (raise an exception, ignore, replace with '?', make best
> > effort, etc.).
> 
> The usual Python way would be: raise an exception. This is what
> the proposal defines for Codecs in case an encoding/decoding
> mapping is not possible, BTW. (UTF-8 will always succeed on
> output.)

Did you read Andy Robinson's case study?  He suggested that for
certain encodings there may be other things you can do that are more
user-friendly than raising an exception, depending on the application.
I am proposing to leave this a detail of each specific translation.
There may even be translations that do the same thing except they have
a different behavior for untranslatable cases -- e.g. a strict version
that raises an exception and a non-strict version that replaces bad
characters with '?'.  I think this is one of the powers of having an
extensible set of encodings.

> > Given the current context, it should probably be the same as the
> > default encoding -- i.e., utf-8.  If we end up making the default
> > user-settable, we'll have to decide what to do with untranslatable
> > characters -- but that will probably be decided by the user too (it
> > would be a property of a specific translation specification).
> > 
> > In any case, I feel that t# could receive a multi-byte encoding,
> > s# should receive raw binary data, and they should correspond to
> > getcharbuffer and getreadbuffer, respectively.
> 
> Why would you want to have "s#" return the raw binary data for
> Unicode objects ? 

Because file.write() for a binary file, and other similar things
(e.g. the encryption engine example I mentioned above) must have
*some* way to get at the raw bits.

> Note that it is not mentioned anywhere that
> "s#" and "t#" do have to necessarily return different things
> (binary being a superset of text). I'd opt for "s#" and "t#" both
> returning UTF-8 data. This can be implemented by delegating the
> buffer slots to the  object (see below).

This would defeat the whole purpose of introducing t#.  We might as
well drop t# then altogether if we adopt this.

> > > Now Greg would chime in with the buffer interface and
> > > argue that it should make the underlying internal
> > > format accessible. This is a bad idea, IMHO, since you
> > > shouldn't really have to know what the internal data format
> > > is.
> > 
> > This is for C code.  Quite likely it *does* know what the internal
> > data format is!
> 
> C code can use the PyUnicode_* APIs to access the data. I
> don't think that argument parsing is powerful enough to
> provide the C code with enough information about the data
> contents, e.g. it can only state the encoding length, not the
> string length.

Typically, all the C code does is pass multibyte encoded strings on to
other library routines that know what to do to them, or simply give
them back unchanged at a later time.  It is essential to know the
number of bytes, for memory allocation purposes.  The number of
characters is totally immaterial (and multibyte-handling code knows
how to calculate the number of characters anyway).

> > > Defining "s#" to return UTF-8 data does not only
> > > make "s" and "s#" return the same data format (which should
> > > always be the case, IMO),
> > 
> > That was before t# was introduced.  No more, alas.  If you replace s#
> > with t#, I agree with you completely.
> 
> Done :-)
>  
> > > but also hides the internal
> > > format from the user and gives him a reliable cross-platform
> > > data representation of Unicode data (note that UTF-8 doesn't
> > > have the byte order problems of UTF-16).
> > >
> > > If you are still with, let's look at what "s" and "s#"
> > 
> > (and t#, which is more relevant here)
> > 
> > > do: they return pointers into data areas which have to
> > > be kept alive until the corresponding object dies.
> > >
> > > The only way to support this feature is by allocating
> > > a buffer for just this purpose (on the fly and only if
> > > needed to prevent excessive memory load). The other
> > > options of adding new magic parser markers or switching
> > > to more generic one all have one downside: you need to
> > > change existing code which is in conflict with the idea
> > > we started out with.
> > 
> > Agreed.  I think this was our thinking when Greg & I introduced t#.
> > My own preference would be to allocate a whole string object, not
> > just a buffer; this could then also be used for the .encode() method
> > using the default encoding.
> 
> Good point. I'll change  to , a Python
> string object created on request.
>  
> > > So, again, the question is: do we want this magical
> > > intergration or not ? Note that this is a design question,
> > > not one of memory consumption...
> > 
> > Yes, I want it.
> > 
> > Note that this doesn't guarantee that all old extensions will work
> > flawlessly when passed Unicode objects; but I think that it covers
> > most cases where you could have a reasonable expectation that it
> > works.
> > 
> > (Hm, unfortunately many reasonable expectations seem to involve
> > the current user's preferred encoding. :-( )
> 
> -- 
> Marc-Andre Lemburg

--Guido van Rossum (home page: http://www.python.org/~guido/)



From Mike.Da.Silva at uk.fid-intl.com  Mon Nov 15 17:01:59 1999
From: Mike.Da.Silva at uk.fid-intl.com (Da Silva, Mike)
Date: Mon, 15 Nov 1999 16:01:59 -0000
Subject: [Python-Dev] Some thoughts on the codecs...
Message-ID: 

Andy Robinson wrote:
> 1. Stream interface
> At the moment a codec has dump and load methods which read a (slice of a)
> stream into a string in memory and vice versa.  As the proposal notes, this
> could lead to errors if you take a slice out of a stream.   This is not just
> due to character truncation; some Asian encodings are modal and have
> shift-in and shift-out sequences as they move from Western single-byte
> characters to double-byte ones.   It also seems a bit pointless to me as the
> source (or target) is still a Unicode string in memory.
> This is a real problem - a filter to convert big files between two encodings
> should be possible without knowledge of the particular encoding, as should
> one on the input/output of some server.  We can still give a default
> implementation for single-byte encodings.
> What's a good API for real stream conversion?   just
> Codec.encodeStream(infile, outfile)  ?  or is it more useful to feed the
> codec with data a chunk at a time?

A user defined chunking factor (suitably defaulted) would be useful for
processing large files.

> 2. Data driven codecs
> I really like codecs being objects, and believe we could build support for a
> lot more encodings, a lot sooner than is otherwise possible, by making them
> data driven rather than making each one compiled C code with static mapping
> tables.  What do people think about the approach below?
> First of all, the ISO8859-1 series are straight mappings to Unicode code
> points.  So one Python script could parse these files and build the mapping
> table, and a very small data file could hold these encodings.  A compiled
> helper function analogous to string.translate() could deal with most of
> them.
> Secondly, the double-byte ones involve a mixture of algorithms and data.
> The worst cases I know are modal encodings which need a single-byte lookup
> table, a double-byte lookup table, and have some very simple rules about
> escape sequences in between them.  A simple state machine could still handle
> these (and the single-byte mappings above become extra-simple special
> cases); I could imagine feeding it a totally data-driven set of rules.
> Third, we can massively compress the mapping tables using a notation which
> just lists contiguous ranges; and very often there are relationships between
> encodings.  For example, "cpXYZ is just like cpXYY but with an extra
> 'smiley' at 0XFE32".  In these cases, a script can build a family of related
> codecs in an auditable manner.

The problem here is that we need to decide whether we are Unicode-centric,
or whether Unicode is just another supported encoding. If we are
Unicode-centric, then all code-page translations will require static mapping
tables between the appropriate Unicode character and the relevant code
points in the other encoding.  This would involve (worst case) 64k static
tables for each supported encoding.  Unfortunately this also precludes the
use of algorithmic conversions and or sparse conversion tables because most
of these transformations are relative to a source and target non-Unicode
encoding, eg JIS <---->EUCJIS.  If we are taking the IBM approach (see
CDRA), then we can mix and match approaches, and treat Unicode strings as
just Unicode, and normal strings as being any arbitrary MBCS encoding.

To guarantee the utmost interoperability and Unicode 3.0 (and beyond)
compliance, we should probably assume that all core encodings are relative
to Unicode as the pivot encoding.  This should hopefully avoid any gotcha's
with roundtrips between any two arbitrary native encodings.  The downside is
this will probably be slower than an optimised algorithmic transformation.

> 3. What encodings to distribute?
> The only clean answers to this are 'almost none', or 'everything that
> Unicode 3.0 has a mapping for'.  The latter is going to add some weight to
> the distribution.  What are people's feelings?  Do we ship any at all apart
> from the Unicode ones?  Should new encodings be downloadable from
> www.python.org?  Should there be an optional
> package outside the main distribution?

Ship with Unicode encodings in the core, the rest should be an add on
package.

If we are truly Unicode-centric, this gives us the most value in terms of
accessing a Unicode character properties database, which will provide
language neutral case folding, Hankaku <----> Zenkaku folding (Japan
specific), and composition / normalisation between composed characters and
their component nonspacing characters.

Regards,
Mike da Silva



From captainrobbo at yahoo.com  Mon Nov 15 17:18:13 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Mon, 15 Nov 1999 08:18:13 -0800 (PST)
Subject: [Python-Dev] just say no...
Message-ID: <19991115161813.13111.rocketmail@web606.mail.yahoo.com>

--- Guido van Rossum  wrote:

> Did you read Andy Robinson's case study?  He 
> suggested that for certain encodings there may be 
> other things you can do that are more
> user-friendly than raising an exception, depending
> on the application. I am proposing to leave this a
> detail of each specific translation.
> There may even be translations that do the same
thing
> except they have a different behavior for 
> untranslatable cases -- e.g. a strict version
> that raises an exception and a non-strict version
> that replaces bad characters with '?'.  I think this
> is one of the powers of having an extensible set of 
> encodings.

This would be a desirable option in almost every case.
 Default is an exception (I want to know my data is
not clean), but an option to specify an error
character.  It is usually a question mark but Mike
tells me that some encodings specify the error
character to use.  

Example - I query a Sybase Unicode database containing
European accents or Japanese.  By default it will give
me question marks.  If I issue the command 'set
char_convert utf8', then I see the lot (as garbage,
but never mind).  If it always errored whenever a
query result contained unexpected data, it would be
almost impossible to maintain the database.

If I wrote my own codec class for a family of
encodings, I'd give it an even wider variety of
error-logging options - maybe a mode where it told me
where in the file the dodgy characters were.

We've already taken the key step by allowing codecs to
be separate objects registered at run-time,
implemented in either C or Python.  This means that
once again Python will have the most flexible solution
around.

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From jim at digicool.com  Mon Nov 15 17:29:13 1999
From: jim at digicool.com (Jim Fulton)
Date: Mon, 15 Nov 1999 11:29:13 -0500
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com>
Message-ID: <383034D9.6E1E74D4@digicool.com>

"A.M. Kuchling" wrote:
> 
> I noticed this in PyErr_Format(exception, format, va_alist):
> 
>         char buffer[500]; /* Caller is responsible for limiting the format */
>         ...
>         vsprintf(buffer, format, vargs);
> 
> Making the caller responsible for this is error-prone.  The danger, of
> course, is a buffer overflow caused by generating an error string
> that's larger than the buffer, possibly letting people execute
> arbitrary code.  We could add a test to the configure script for
> vsnprintf() and use it when possible, but that only fixes the problem
> on platforms which have it.  Can we find an implementation of
> vsnprintf() someplace?

I would prefer to see a different interface altogether:

  PyObject *PyErr_StringFormat(errtype, format, buildformat, ...)

So, you could generate an error like this:

  return PyErr_StringFormat(ErrorObject, 
     "You had too many, %d, foos. The last one was %s", 
     "iO", n, someObject)

I implemented this in cPickle. See cPickle_ErrFormat.
(Note that it always returns NULL.)

Jim

--
Jim Fulton           mailto:jim at digicool.com   Python Powered!        
Technical Director   (888) 344-4332            http://www.python.org  
Digital Creations    http://www.digicool.com   http://www.zope.org    

Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email
address may not be added to any commercial mail list with out my
permission.  Violation of my privacy with advertising or SPAM will
result in a suit for a MINIMUM of $500 damages/incident, $1500 for
repeats.



From bwarsaw at cnri.reston.va.us  Mon Nov 15 17:54:10 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Mon, 15 Nov 1999 11:54:10 -0500 (EST)
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com>
	<199911151523.KAA27163@eric.cnri.reston.va.us>
Message-ID: <14384.15026.392781.151886@anthem.cnri.reston.va.us>

>>>>> "Guido" == Guido van Rossum  writes:

    Guido> Assuming that Linux and Solaris have vsnprintf(), can't we
    Guido> just use the configure script to detect it, and issue a
    Guido> warning blaming the platform for those platforms that don't
    Guido> have it?  That seems much simpler (from a maintenance
    Guido> perspective) than carrying our own implementation around
    Guido> (even if we can borrow the Apache version).

Mailman uses vsnprintf in its C wrapper.  There's a simple configure
test...

# Checks for library functions.
AC_CHECK_FUNCS(vsnprintf)

...and for systems that don't have a vsnprintf, I modified a version
from GNU screen.  It may not have gone through the scrutiny of
Apache's implementation, but for Mailman it was more important that it
be GPL'd (not a Python requirement).

-Barry



From jim at digicool.com  Mon Nov 15 17:56:38 1999
From: jim at digicool.com (Jim Fulton)
Date: Mon, 15 Nov 1999 11:56:38 -0500
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com>
		<199911151523.KAA27163@eric.cnri.reston.va.us> <14384.10383.718373.432606@amarok.cnri.reston.va.us>
Message-ID: <38303B46.F6AEEDF1@digicool.com>

"Andrew M. Kuchling" wrote:
> 
> Guido van Rossum writes:
> >Assuming that Linux and Solaris have vsnprintf(), can't we just use
> >the configure script to detect it, and issue a warning blaming the
> >platform for those platforms that don't have it?  That seems much
> 
> But people using an already-installed Python binary won't see any such
> configure-time warning, and won't find out about the potential
> problem.  Plus, how do people fix the problem on platforms that don't
> have vsnprintf() -- switch to Solaris or Linux?  Not much of a
> solution.  (vsnprintf() isn't ANSI C, though it's a common extension,
> so platforms that lack it aren't really deficient.)
> 
> Hmm... could we maybe use Python's existing (string % vars) machinery?
>  No, that seems to be hard, because it would want
> PyObjects, and we can't know what Python types to convert the varargs
> to, unless we parse the format string (at which point we may as well
> get a vsnprintf() implementation).

It's easy. You use two format strings. One a Python string format, 
and the other a Py_BuildValue format. See my other note.

Jim


--
Jim Fulton           mailto:jim at digicool.com   Python Powered!        
Technical Director   (888) 344-4332            http://www.python.org  
Digital Creations    http://www.digicool.com   http://www.zope.org    

Under US Code Title 47, Sec.227(b)(1)(C), Sec.227(a)(2)(B) This email
address may not be added to any commercial mail list with out my
permission.  Violation of my privacy with advertising or SPAM will
result in a suit for a MINIMUM of $500 damages/incident, $1500 for
repeats.



From tismer at appliedbiometrics.com  Mon Nov 15 18:02:20 1999
From: tismer at appliedbiometrics.com (Christian Tismer)
Date: Mon, 15 Nov 1999 18:02:20 +0100
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us>
Message-ID: <38303C9C.42C5C830@appliedbiometrics.com>


Guido van Rossum wrote:
> 
> > I noticed this in PyErr_Format(exception, format, va_alist):
> >
> >       char buffer[500]; /* Caller is responsible for limiting the format */
> >       ...
> >       vsprintf(buffer, format, vargs);
> >
> > Making the caller responsible for this is error-prone.
> 
> Agreed.  The limit of 500 chars, while technically undocumented, is
> part of the specs for PyErr_Format (which is currently wholly
> undocumented).  The current callers all have explicit precautions, but
> of course I agree that this is a potential danger.

All but one (checked them all):
In ceval.c, function call_builtin, there is a possible security hole.
If an extension module happens to create a very long type name
(maybe just via a bug), we will crash.

	}
	PyErr_Format(PyExc_TypeError, "call of non-function (type %s)",
		     func->ob_type->tp_name);
	return NULL;
}

ciao - chris

-- 
Christian Tismer             :^)   
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101   :    *Starship* http://starship.python.net
10553 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home



From guido at CNRI.Reston.VA.US  Mon Nov 15 20:32:00 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 14:32:00 -0500
Subject: [Python-Dev] PyErr_Format security note
In-Reply-To: Your message of "Mon, 15 Nov 1999 18:02:20 +0100."
             <38303C9C.42C5C830@appliedbiometrics.com> 
References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us>  
            <38303C9C.42C5C830@appliedbiometrics.com> 
Message-ID: <199911151932.OAA28008@eric.cnri.reston.va.us>

> All but one (checked them all):

Thanks for checking.

> In ceval.c, function call_builtin, there is a possible security hole.
> If an extension module happens to create a very long type name
> (maybe just via a bug), we will crash.
> 
> 	}
> 	PyErr_Format(PyExc_TypeError, "call of non-function (type %s)",
> 		     func->ob_type->tp_name);
> 	return NULL;
> }

I would think that an extension module with a name of nearly 500
characters would draw a lot of attention as being ridiculous.  If
there was a bug through which you could make tp_name point to such a
long string, you could probably exploit that bug without having to use
this particular PyErr_Format() statement.

However, I agree it's better to be safe than sorry, so I've checked in
a fix making it %.400s.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From tismer at appliedbiometrics.com  Mon Nov 15 20:41:14 1999
From: tismer at appliedbiometrics.com (Christian Tismer)
Date: Mon, 15 Nov 1999 20:41:14 +0100
Subject: [Python-Dev] PyErr_Format security note
References: <199911150149.UAA00408@mira.erols.com> <199911151523.KAA27163@eric.cnri.reston.va.us>  
	            <38303C9C.42C5C830@appliedbiometrics.com> <199911151932.OAA28008@eric.cnri.reston.va.us>
Message-ID: <383061DA.CA5CB373@appliedbiometrics.com>


Guido van Rossum wrote:
> 
> > All but one (checked them all):

[ceval.c without limits]

> I would think that an extension module with a name of nearly 500
> characters would draw a lot of attention as being ridiculous.  If
> there was a bug through which you could make tp_name point to such a
> long string, you could probably exploit that bug without having to use
> this particular PyErr_Format() statement.

Of course this case is very unlikely.
My primary intent was to create such a mess without
an extension, and ExtensionClass seemed to be a candidate since
it synthesizes a type name at runtime (!).
This would have been dangerous since EC is at the heart of Zope.

But I could not get at this special case since EC always
passes the class/instance checks, so this case can never happen :(

The above lousy result was just to say *something* after no success.

> However, I agree it's better to be safe than sorry, so I've checked in
> a fix making it %.400s.

cheap, consistent, fine - thanks - chris

-- 
Christian Tismer             :^)   
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101   :    *Starship* http://starship.python.net
10553 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home



From mal at lemburg.com  Mon Nov 15 20:04:59 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 20:04:59 +0100
Subject: [Python-Dev] just say no...
References:  <382D315F.A7ADEC42@lemburg.com> <199911131306.IAA26030@eric.cnri.reston.va.us>  
	            <382F33AA.C3EE825A@lemburg.com> <199911151550.KAA27188@eric.cnri.reston.va.us>
Message-ID: <3830595B.348E8CC7@lemburg.com>

Guido van Rossum wrote:
> 
> [Misunderstanding in the reasoning behind "t#" and "s#"]
> 
> Thanks for not picking an argument.  Multibyte encodings typically
> have ASCII as a subset (in such a way that an ASCII string is
> represented as itself in bytes).  This is the characteristic that's
> needed in my view.
> 
> > It was my understanding that "t#" refers to single byte character
> > data. That's where the above arguments were aiming at...
> 
> t# refers to byte-encoded data.  Multibyte encodings are explicitly
> designed to be passed cleanly through processing steps that handle
> single-byte character data, as long as they are 8-bit clean and don't
> do too much processing.

Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
"8-bit clean" as you obviously did.
 
> > Perhaps I'm missing something...
> 
> The idea is that (1)/s# disallows any translation of the data, while
> (2)/t# requires translation of the data to an ASCII superset (possibly
> multibyte, such as UTF-8 or shift-JIS).  (2)/t# assumes that the data
> contains text and that if the text consists of only ASCII characters
> they are represented as themselves.  (1)/s# makes no such assumption.
> 
> In terms of implementation, Unicode objects should translate
> themselves to the default encoding for t# (if possible), but they
> should make the native representation available for s#.
> 
> For example, take an encryption engine.  While it is defined in terms
> of byte streams, there's no requirement that the bytes represent
> characters -- they could be the bytes of a GIF file, an MP3 file, or a
> gzipped tar file.  If we pass Unicode to an encryption engine, we want
> Unicode to come out at the other end, not UTF-8.  (If we had wanted to
> encrypt UTF-8, we should have fed it UTF-8.)
> 
> > > Note that the definition of the 's' format was left alone -- as
> > > before, it means you need an 8-bit text string not containing null
> > > bytes.
> >
> > This definition should then be changed to "text string without
> > null bytes" dropping the 8-bit reference.
> 
> Aha, I think there's a confusion about what "8-bit" means.  For me, a
> multibyte encoding like UTF-8 is still 8-bit.  Am I alone in this?
> (As far as I know, C uses char* to represent multibyte characters.)
> Maybe we should disambiguate it more explicitly?

There should be some definition for the two markers and the
ideas behind them in the API guide, I guess.
 
> > Hmm, I would strongly object to making "s#" return the internal
> > format. file.write() would then default to writing UTF-16 data
> > instead of UTF-8 data. This could result in strange errors
> > due to the UTF-16 format being endian dependent.
> 
> But this was the whole design.  file.write() needs to be changed to
> use s# when the file is open in binary mode and t# when the file is
> open in text mode.

Ok, that would make the situation a little clearer (even though
I expect the two different encodings to produce some FAQs). 

I still don't feel very comfortable about the fact that all
existing APIs using "s#" will suddenly receive UTF-16 data if
being passed Unicode objects: this probably won't get us the
"magical" Unicode integration we envision, since "t#" usage is not
very widespread and character handling code will probably not
work well with UTF-16 encoded strings.

Anyway, we should probably try out both methods...

> > It would also break the symmetry between file.write(u) and
> > unicode(file.read()), since the default encoding is not used as
> > internal format for other reasons (see proposal).
> 
> If the file is encoded using UTF-16 or UCS-2, you should open it in
> binary mode and use unicode(file.read(), 'utf-16').  (Or perhaps the
> app should read the first 2 bytes and check for a BOM and then decide
> to choose bewteen 'utf-16-be' and 'utf-16-le'.)

Right, that's the idea (there is a note on this in the Standard
Codec section of the proposal).
 
> > > Any of the following choices is acceptable (from the point of view of
> > > not breaking the intended t# semantics; we can now start deciding
> > > which we like best):
> >
> > I think we have already agreed on using UTF-8 for the default
> > encoding. It has quite a few advantages. See
> >
> >       http://czyborra.com/utf/
> >
> > for a good overview of the pros and cons.
> 
> Of course.  I was just presenting the list as an argument that if
> we changed our mind about the default encoding, t# should follow the
> default encoding (and not pick an encoding by other means).

Ok.
 
> > > - utf-8
> > > - latin-1
> > > - ascii
> > > - shift-jis
> > > - lower byte of unicode ordinal
> > > - some user- or os-specified multibyte encoding
> > >
> > > As far as t# is concerned, for encodings that don't encode all of
> > > Unicode, untranslatable characters could be dealt with in any number
> > > of ways (raise an exception, ignore, replace with '?', make best
> > > effort, etc.).
> >
> > The usual Python way would be: raise an exception. This is what
> > the proposal defines for Codecs in case an encoding/decoding
> > mapping is not possible, BTW. (UTF-8 will always succeed on
> > output.)
> 
> Did you read Andy Robinson's case study?  He suggested that for
> certain encodings there may be other things you can do that are more
> user-friendly than raising an exception, depending on the application.
> I am proposing to leave this a detail of each specific translation.
> There may even be translations that do the same thing except they have
> a different behavior for untranslatable cases -- e.g. a strict version
> that raises an exception and a non-strict version that replaces bad
> characters with '?'.  I think this is one of the powers of having an
> extensible set of encodings.

Agreed, the Codecs should decide for themselves what to do. I'll
add a note to the next version of the proposal.
 
> > > Given the current context, it should probably be the same as the
> > > default encoding -- i.e., utf-8.  If we end up making the default
> > > user-settable, we'll have to decide what to do with untranslatable
> > > characters -- but that will probably be decided by the user too (it
> > > would be a property of a specific translation specification).
> > >
> > > In any case, I feel that t# could receive a multi-byte encoding,
> > > s# should receive raw binary data, and they should correspond to
> > > getcharbuffer and getreadbuffer, respectively.
> >
> > Why would you want to have "s#" return the raw binary data for
> > Unicode objects ?
> 
> Because file.write() for a binary file, and other similar things
> (e.g. the encryption engine example I mentioned above) must have
> *some* way to get at the raw bits.

What for ? Any lossless encoding should do the trick... UTF-8
is just as good as UTF-16 for binary files; plus it's more compact
for ASCII data. I don't really see a need to get explicitly
at the internal data representation because both encodings are
in fact "internal" w/r to Unicode objects.

The only argument I can come up with is that using UTF-16 for
binary files could (possibly) eliminate the UTF-8 conversion step
which is otherwise always needed.
 
> > Note that it is not mentioned anywhere that
> > "s#" and "t#" do have to necessarily return different things
> > (binary being a superset of text). I'd opt for "s#" and "t#" both
> > returning UTF-8 data. This can be implemented by delegating the
> > buffer slots to the  object (see below).
> 
> This would defeat the whole purpose of introducing t#.  We might as
> well drop t# then altogether if we adopt this.

Well... yes ;-)
 
> > > > Now Greg would chime in with the buffer interface and
> > > > argue that it should make the underlying internal
> > > > format accessible. This is a bad idea, IMHO, since you
> > > > shouldn't really have to know what the internal data format
> > > > is.
> > >
> > > This is for C code.  Quite likely it *does* know what the internal
> > > data format is!
> >
> > C code can use the PyUnicode_* APIs to access the data. I
> > don't think that argument parsing is powerful enough to
> > provide the C code with enough information about the data
> > contents, e.g. it can only state the encoding length, not the
> > string length.
> 
> Typically, all the C code does is pass multibyte encoded strings on to
> other library routines that know what to do to them, or simply give
> them back unchanged at a later time.  It is essential to know the
> number of bytes, for memory allocation purposes.  The number of
> characters is totally immaterial (and multibyte-handling code knows
> how to calculate the number of characters anyway).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Mon Nov 15 20:20:55 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 20:20:55 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>
Message-ID: <38305D17.60EC94D0@lemburg.com>

Andy Robinson wrote:
> 
> Some thoughts on the codecs...
> 
> 1. Stream interface
> At the moment a codec has dump and load methods which
> read a (slice of a) stream into a string in memory and
> vice versa.  As the proposal notes, this could lead to
> errors if you take a slice out of a stream.   This is
> not just due to character truncation; some Asian
> encodings are modal and have shift-in and shift-out
> sequences as they move from Western single-byte
> characters to double-byte ones.   It also seems a bit
> pointless to me as the source (or target) is still a
> Unicode string in memory.
> 
> This is a real problem - a filter to convert big files
> between two encodings should be possible without
> knowledge of the particular encoding, as should one on
> the input/output of some server.  We can still give a
> default implementation for single-byte encodings.
> 
> What's a good API for real stream conversion?   just
> Codec.encodeStream(infile, outfile)  ?  or is it more
> useful to feed the codec with data a chunk at a time?

The idea was to use Unicode as intermediate for all
encoding conversions. 

What you envision here are stream recoders. They can
easily be implemented as a useful addition to the Codec
subclasses, but I don't think that these have to go
into the core.
 
> 2. Data driven codecs
> I really like codecs being objects, and believe we
> could build support for a lot more encodings, a lot
> sooner than is otherwise possible, by making them data
> driven rather making each one compiled C code with
> static mapping tables.  What do people think about the
> approach below?
> 
> First of all, the ISO8859-1 series are straight
> mappings to Unicode code points.  So one Python script
> could parse these files and build the mapping table,
> and a very small data file could hold these encodings.
>   A compiled helper function analogous to
> string.translate() could deal with most of them.

The problem with these large tables is that currently
Python modules are not shared among processes since
every process builds its own table.

Static C data has the advantage of being shareable at
the OS level.

You can of course implement Python based lookup tables,
but these should be too large...
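
(For concreteness, the kind of Python-level table Andy is talking
about would look something like the toy sketch below - a byte value ->
Unicode ordinal dictionary, plus the "copy and patch" derivation of
related code pages; all names and the patched slot are invented.)

def make_latin1_table():
    # ISO 8859-1 maps each byte straight onto the Unicode code point
    # with the same value.
    table = {}
    for i in range(256):
        table[i] = i
    return table

def derive_table(base_table, delta):
    # "cpXYZ is just like cpXYY, but with ..." -- copy the base table
    # and overlay the few positions that differ.
    table = base_table.copy()
    table.update(delta)
    return table

cp_base    = make_latin1_table()
cp_variant = derive_table(cp_base, {0xFE: 0x263A})   # invented example slot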
 
> Secondly, the double-byte ones involve a mixture of
> algorithms and data.  The worst cases I know are modal
> encodings which need a single-byte lookup table, a
> double-byte lookup table, and have some very simple
> rules about escape sequences in between them.  A
> simple state machine could still handle these (and the
> single-byte mappings above become extra-simple special
> cases); I could imagine feeding it a totally
> data-driven set of rules.
> 
> Third, we can massively compress the mapping tables
> using a notation which just lists contiguous ranges;
> and very often there are relationships between
> encodings.  For example, "cpXYZ is just like cpXYY but
> with an extra 'smiley' at 0XFE32".  In these cases, a
> script can build a family of related codecs in an
> auditable manner.

These are all great ideas, but I think they unnecessarily
complicate the proposal.
 
> 3. What encodings to distribute?
> The only clean answers to this are 'almost none', or
> 'everything that Unicode 3.0 has a mapping for'.  The
> latter is going to add some weight to the
> distribution.  What are people's feelings?  Do we ship
> any at all apart from the Unicode ones?  Should new
> encodings be downloadable from www.python.org?  Should
> there be an optional package outside the main
> distribution?

Since Codecs can be registered at runtime, there is quite
some potential there for extension writers coding their
own fast codecs. E.g. one could use mxTextTools as codec
engine working at C speeds.

I would propose to only add some very basic encodings to
the standard distribution, e.g. the ones mentioned under
Standard Codecs in the proposal:

  'utf-8':		8-bit variable length encoding
  'utf-16':		16-bit variable length encoding (little/big endian)
  'utf-16-le':		utf-16 but explicitly little endian
  'utf-16-be':		utf-16 but explicitly big endian
  'ascii':		7-bit ASCII codepage
  'latin-1':		Latin-1 codepage
  'html-entities':	Latin-1 + HTML entities;
			see htmlentitydefs.py from the standard Python Lib
  'jis' (a popular version XXX):
			Japanese character encoding
  'unicode-escape':	See Unicode Constructors for a definition
  'native':		Dump of the Internal Format used by Python

Perhaps not even 'html-entities' (even though it would make
a cool replacement for cgi.escape()) and maybe we should
also place the JIS encoding into a separate Unicode package.
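
To illustrate the run-time registration idea (the registry functions
and the toy codec below are invented for this sketch; the actual
registration API still needs to be specified):

codec_registry = {}

def register(name, codec):
    codec_registry[name.lower()] = codec

def lookup(name):
    return codec_registry[name.lower()]

class Rot13Codec:
    # A deliberately trivial third-party "codec", just to show the
    # plug-in shape; a real codec would convert to/from Unicode.
    def encode(self, text):
        out = []
        for ch in text:
            if 'a' <= ch <= 'z':
                out.append(chr((ord(ch) - ord('a') + 13) % 26 + ord('a')))
            elif 'A' <= ch <= 'Z':
                out.append(chr((ord(ch) - ord('A') + 13) % 26 + ord('A')))
            else:
                out.append(ch)
        return ''.join(out)
    decode = encode

register('rot-13', Rot13Codec())
print lookup('rot-13').encode('Python')    # prints 'Clguba'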

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Mon Nov 15 20:26:16 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 20:26:16 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: 
Message-ID: <38305E58.28B20E24@lemburg.com>

"Da Silva, Mike" wrote:
> 
> Andy Robinson wrote:
> --
> 1.      Stream interface
> At the moment a codec has dump and load methods which read a (slice of a)
> stream into a string in memory and vice versa.  As the proposal notes, this
> could lead to errors if you take a slice out of a stream.   This is not just
> due to character truncation; some Asian encodings are modal and have
> shift-in and shift-out sequences as they move from Western single-byte
> characters to double-byte ones.   It also seems a bit pointless to me as the
> source (or target) is still a Unicode string in memory.
> This is a real problem - a filter to convert big files between two encodings
> should be possible without knowledge of the particular encoding, as should
> one on the input/output of some server.  We can still give a default
> implementation for single-byte encodings.
> What's a good API for real stream conversion?   just
> Codec.encodeStream(infile, outfile)  ?  or is it more useful to feed the
> codec with data a chunk at a time?
> --
> A user defined chunking factor (suitably defaulted) would be useful for
> processing large files.
> --
> 2.      Data driven codecs
> I really like codecs being objects, and believe we could build support for a
> lot more encodings, a lot sooner than is otherwise possible, by making them
> data driven rather making each one compiled C code with static mapping
> tables.  What do people think about the approach below?
> First of all, the ISO8859-1 series are straight mappings to Unicode code
> points.  So one Python script could parse these files and build the mapping
> table, and a very small data file could hold these encodings.  A compiled
> helper function analogous to string.translate() could deal with most of
> them.
> Secondly, the double-byte ones involve a mixture of algorithms and data.
> The worst cases I know are modal encodings which need a single-byte lookup
> table, a double-byte lookup table, and have some very simple rules about
> escape sequences in between them.  A simple state machine could still handle
> these (and the single-byte mappings above become extra-simple special
> cases); I could imagine feeding it a totally data-driven set of rules.
> Third, we can massively compress the mapping tables using a notation which
> just lists contiguous ranges; and very often there are relationships between
> encodings.  For example, "cpXYZ is just like cpXYY but with an extra
> 'smiley' at 0XFE32".  In these cases, a script can build a family of related
> codecs in an auditable manner.
> --
> The problem here is that we need to decide whether we are Unicode-centric,
> or whether Unicode is just another supported encoding. If we are
> Unicode-centric, then all code-page translations will require static mapping
> tables between the appropriate Unicode character and the relevant code
> points in the other encoding.  This would involve (worst case) 64k static
> tables for each supported encoding.  Unfortunately this also precludes the
> use of algorithmic conversions and or sparse conversion tables because most
> of these transformations are relative to a source and target non-Unicode
> encoding, eg JIS <---->EUCJIS.  If we are taking the IBM approach (see
> CDRA), then we can mix and match approaches, and treat Unicode strings as
> just Unicode, and normal strings as being any arbitrary MBCS encoding.
> 
> To guarantee the utmost interoperability and Unicode 3.0 (and beyond)
> compliance, we should probably assume that all core encodings are relative
> to Unicode as the pivot encoding.  This should hopefully avoid any gotcha's
> with roundtrips between any two arbitrary native encodings.  The downside is
> this will probably be slower than an optimised algorithmic transformation.

Optimizations should go into separate packages for direct EncodingA
-> EncodingB conversions. I don't think we need them in the core.
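
In other words, the core model stays "everything goes through
Unicode"; a direct EncodingA -> EncodingB recoder is then just an
optimization of a sketch like this (assuming the proposed unicode()
constructor and an encode method on Unicode objects):

def recode(data, from_encoding, to_encoding):
    u = unicode(data, from_encoding)    # EncodingA -> Unicode (the pivot)
    return u.encode(to_encoding)        # Unicode   -> EncodingB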

> --
> 3.      What encodings to distribute?
> The only clean answers to this are 'almost none', or 'everything that
> Unicode 3.0 has a mapping for'.  The latter is going to add some weight to
> the distribution.  What are people's feelings?  Do we ship any at all apart
> from the Unicode ones?  Should new encodings be downloadable from
> www.python.org  ?  Should there be an optional
> package outside the main distribution?
> --
> Ship with Unicode encodings in the core, the rest should be an add on
> package.
> 
> If we are truly Unicode-centric, this gives us the most value in terms of
> accessing a Unicode character properties database, which will provide
> language neutral case folding, Hankaku <----> Zenkaku folding (Japan
> specific), and composition / normalisation between composed characters and
> their component nonspacing characters.

From the proposal:

"""
Unicode Character Properties:
-----------------------------

A separate module "unicodedata" should provide a compact interface to
all Unicode character properties defined in the standard's
UnicodeData.txt file.

Among other things, these properties provide ways to recognize
numbers, digits, spaces, whitespace, etc.

Since this module will have to provide access to all Unicode
characters, it will eventually have to contain the data from
UnicodeData.txt which takes up around 200kB. For this reason, the data
should be stored in static C data. This enables compilation as a
shared module which the underlying OS can share between processes
(unlike normal Python code modules).

XXX Define the interface...

"""

Special CJK packages can then access this data for the purposes
you mentioned above.
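
The interface is deliberately left open above ("XXX Define the
interface..."), but purely as a strawman, such lookups might end up
looking something like:

import unicodedata

print unicodedata.category(u'A')       # e.g. 'Lu' for an uppercase letter
print unicodedata.decimal(u'7')        # 7, the decimal value of the digit
print unicodedata.numeric(u'\u00BD')   # 0.5 for VULGAR FRACTION ONE HALF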

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From guido at CNRI.Reston.VA.US  Mon Nov 15 22:37:28 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 16:37:28 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: Your message of "Mon, 15 Nov 1999 20:20:55 +0100."
             <38305D17.60EC94D0@lemburg.com> 
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>  
            <38305D17.60EC94D0@lemburg.com> 
Message-ID: <199911152137.QAA28280@eric.cnri.reston.va.us>

> Andy Robinson wrote:
> > 
> > Some thoughts on the codecs...
> > 
> > 1. Stream interface
> > At the moment a codec has dump and load methods which
> > read a (slice of a) stream into a string in memory and
> > vice versa.  As the proposal notes, this could lead to
> > errors if you take a slice out of a stream.   This is
> > not just due to character truncation; some Asian
> > encodings are modal and have shift-in and shift-out
> > sequences as they move from Western single-byte
> > characters to double-byte ones.   It also seems a bit
> > pointless to me as the source (or target) is still a
> > Unicode string in memory.
> > 
> > This is a real problem - a filter to convert big files
> > between two encodings should be possible without
> > knowledge of the particular encoding, as should one on
> > the input/output of some server.  We can still give a
> > default implementation for single-byte encodings.
> > 
> > What's a good API for real stream conversion?   just
> > Codec.encodeStream(infile, outfile)  ?  or is it more
> > useful to feed the codec with data a chunk at a time?

M.-A. Lemburg responds:

> The idea was to use Unicode as intermediate for all
> encoding conversions. 
> 
> What you envision here are stream recoders. They can
> easily be implemented as a useful addition to the Codec
> subclasses, but I don't think that these have to go
> into the core.

What I wanted was a codec API that acts somewhat like a buffered file;
the buffer makes it possible to handle shift states efficiently.  This
is not exactly what Andy shows, but it's not what Marc's current spec
has either.

I had thought something more like what Java does: an output stream
codec's constructor takes a writable file object and the object
returned by the constructor has a write() method, a flush() method and
a close() method.  It acts like a buffering interface to the
underlying file; this allows it to generate the minimal number of
shift sequences.  Similar for input stream codecs.

Andy's file translation example could then be written as follows:

# assuming variables input_file, input_encoding, output_file,
# output_encoding, and constant BUFFER_SIZE

f = open(input_file, "rb")
f1 = unicodec.codecs[input_encoding].stream_reader(f)
g = open(output_file, "wb")
g1 = unicodec.codecs[output_encoding].stream_writer(g)

while 1:
    buffer = f1.read(BUFFER_SIZE)
    if not buffer:
        break
    g1.write(buffer)

g1.close()
f1.close()

Note that we could possibly make these the only API that a codec needs
to provide; the string object <--> unicode object conversions can be
done using this and the cStringIO module.  (On the other hand it seems
a common case that would be quite useful.)
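
A very rough sketch of that shape (class names invented; Latin-1 used
as a stand-in because it carries no shift state - a modal encoding
would buffer inside write() so it can minimize shift sequences):

class Latin1StreamWriter:
    def __init__(self, fileobj):
        self.fileobj = fileobj
    def write(self, ustr):
        # Latin-1 has no shift state, so just convert and pass through.
        self.fileobj.write(''.join(map(lambda c: chr(ord(c)), ustr)))
    def flush(self):
        self.fileobj.flush()
    def close(self):
        self.flush()
        self.fileobj.close()

class Latin1StreamReader:
    def __init__(self, fileobj):
        self.fileobj = fileobj
    def read(self, size=-1):
        data = self.fileobj.read(size)
        return u''.join(map(unichr, map(ord, data)))
    def close(self):
        self.fileobj.close()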

> > 2. Data driven codecs
> > I really like codecs being objects, and believe we
> > could build support for a lot more encodings, a lot
> > sooner than is otherwise possible, by making them data
> > driven rather making each one compiled C code with
> > static mapping tables.  What do people think about the
> > approach below?
> > 
> > First of all, the ISO8859-1 series are straight
> > mappings to Unicode code points.  So one Python script
> > could parse these files and build the mapping table,
> > and a very small data file could hold these encodings.
> >   A compiled helper function analogous to
> > string.translate() could deal with most of them.
> 
> The problem with these large tables is that currently
> Python modules are not shared among processes since
> every process builds its own table.
> 
> Static C data has the advantage of being shareable at
> the OS level.

Don't worry about it.  128K is too small to care, I think...

> You can of course implement Python based lookup tables,
> but these should be too large...
>  
> > Secondly, the double-byte ones involve a mixture of
> > algorithms and data.  The worst cases I know are modal
> > encodings which need a single-byte lookup table, a
> > double-byte lookup table, and have some very simple
> > rules about escape sequences in between them.  A
> > simple state machine could still handle these (and the
> > single-byte mappings above become extra-simple special
> > cases); I could imagine feeding it a totally
> > data-driven set of rules.
> > 
> > Third, we can massively compress the mapping tables
> > using a notation which just lists contiguous ranges;
> > and very often there are relationships between
> > encodings.  For example, "cpXYZ is just like cpXYY but
> > with an extra 'smiley' at 0XFE32".  In these cases, a
> > script can build a family of related codecs in an
> > auditable manner.
> 
> These are all great ideas, but I think they unnecessarily
> complicate the proposal.

Agreed, let's leave the *implementation* of codecs out of the current
efforts.

However I want to make sure that the *interface* to codecs is defined
right, because changing it will be expensive.  (This is Linus
Torvalds' philosophy on drivers -- he doesn't care about bugs in
drivers, as they will get fixed; however he greatly cares about
defining the driver APIs correctly.)

> > 3. What encodings to distribute?
> > The only clean answers to this are 'almost none', or
> > 'everything that Unicode 3.0 has a mapping for'.  The
> > latter is going to add some weight to the
> > distribution.  What are people's feelings?  Do we ship
> > any at all apart from the Unicode ones?  Should new
> > encodings be downloadable from www.python.org?  Should
> > there be an optional package outside the main
> > distribution?
> 
> Since Codecs can be registered at runtime, there is quite
> some potential there for extension writers coding their
> own fast codecs. E.g. one could use mxTextTools as codec
> engine working at C speeds.

(Do you think you'll be able to extort some money from HP for these? :-)

> I would propose to only add some very basic encodings to
> the standard distribution, e.g. the ones mentioned under
> Standard Codecs in the proposal:
> 
>   'utf-8':		8-bit variable length encoding
>   'utf-16':		16-bit variable length encoding (little/big endian)
>   'utf-16-le':		utf-16 but explicitly little endian
>   'utf-16-be':		utf-16 but explicitly big endian
>   'ascii':		7-bit ASCII codepage
>   'latin-1':		Latin-1 codepage
>   'html-entities':	Latin-1 + HTML entities;
> 			see htmlentitydefs.py from the standard Python Lib
>   'jis' (a popular version XXX):
> 			Japanese character encoding
>   'unicode-escape':	See Unicode Constructors for a definition
>   'native':		Dump of the Internal Format used by Python
> 
> Perhaps not even 'html-entities' (even though it would make
> a cool replacement for cgi.escape()) and maybe we should
> also place the JIS encoding into a separate Unicode package.

I'd drop html-entities, it seems too cutesie.  (And who uses these
anyway, outside browsers?)

For JIS (shift-JIS?) I hope that Andy can help us with some pointers
and validation.

And unicode-escape: now that you mention it, this is a section of
the proposal that I don't understand.  I quote it here:

| Python should provide a built-in constructor for Unicode strings which
| is available through __builtins__:
| 
|   u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

What do you mean by this notation?  Since encoding names are not
always legal Python identifiers (most contain hyphens), I don't
understand what you really meant here.  Do you mean to say that it has
to be a keyword argument?  I would disagree; and then I would have
expected the notation [,encoding=].

| With the 'unicode-escape' encoding being defined as:
| 
|   u = u'<unicode-escape encoded Python string>'
| 
| · for single characters (and this includes all \XXX sequences except \uXXXX),
|   take the ordinal and interpret it as Unicode ordinal;
| 
| · for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX 
|   instead, e.g. \u03C0 to represent the character Pi.

I've looked at this several times and I don't see the difference
between the two bullets.  (Ironically, you are using a non-ASCII
character here that doesn't always display, depending on where I look
at your mail :-).

Can you give some examples?

Is u'\u0020' different from u'\x20' (a space)?

Does '\u0020' (no u prefix) have a meaning?

Also, I remember reading Tim Peters who suggested that a "raw unicode"
notation (ur"...") might be necessary, to encode regular expressions.
I tend to agree.

While I'm on the topic, I don't see in your proposal a description of
the source file character encoding.  Currently, this is undefined, and
in fact can be (ab)used to enter non-ASCII in string literals.  For
example, a programmer named François might write a file containing
this statement:

  print "Written by François." # (There's a cedilla in there!)

(He assumes his source character encoding is Latin-1, and he doesn't
want to have to type \347 when he can type a cedilla on his keyboard.)

If his source file (or .pyc file!)  is executed by a Japanese user,
this will probably print some garbage.

Using the new Unicode strings, François could change his program as
follows:

  print unicode("Written by François.", "latin-1")

Assuming that François sets his sys.stdout to use Latin-1, while the
Japanese user sets his to shift-JIS (or whatever his kanjiterm uses).

But when the Japanese user views François' source file, he will again
see garbage.  If he uses a generic tool to translate latin-1 files to
shift-JIS (assuming shift-JIS has a cedilla character) the program
will no longer work correctly -- the string "latin-1" has to be
changed to "shift-jis".

What should we do about this?  The safest and most radical solution is
to disallow non-ASCII source characters; François will then have to
type

  print u"Written by Fran\u00E7ois."

but, knowing François, he probably won't like this solution very much
(since he didn't like the \347 version either).

--Guido van Rossum (home page: http://www.python.org/~guido/)



From andy at robanal.demon.co.uk  Mon Nov 15 22:41:21 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Mon, 15 Nov 1999 21:41:21 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <38305D17.60EC94D0@lemburg.com>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com>
Message-ID: <38307984.12653394@post.demon.co.uk>

On Mon, 15 Nov 1999 20:20:55 +0100, you wrote:

>These are all great ideas, but I think they unnecessarily
>complicate the proposal.

However, to claim that Python is properly internationalized, we will
need a large number of multi-byte encodings to be available.  It's a
large amount of work, it must be provably correct, and someone's going
to have to do it.  So if anyone with more C expertise than me - not
hard :-) - is interested...

I'm not suggesting putting my points in the Unicode proposal - in
fact, I'm very happy we have a proposal which allows for extension,
and lets us work on the encodings separately (and later).

>Since Codecs can be registered at runtime, there is quite
>some potential there for extension writers coding their
>own fast codecs. E.g. one could use mxTextTools as codec
>engine working at C speeds.
Exactly my thoughts, although I was thinking of a more slimmed-down
and specialized one.  The right tool might be usable for things like
compression algorithms too.  A separate project from the Unicode stuff,
but if anyone is interested, talk to me.

>I would propose to only add some very basic encodings to
>the standard distribution, e.g. the ones mentioned under
>Standard Codecs in the proposal:
>
>  'utf-8':		8-bit variable length encoding
>  'utf-16':		16-bit variable length encoding (little/big endian)
>  'utf-16-le':		utf-16 but explicitly little endian
>  'utf-16-be':		utf-16 but explicitly big endian
>  'ascii':		7-bit ASCII codepage
>  'latin-1':		Latin-1 codepage
>  'html-entities':	Latin-1 + HTML entities;
>			see htmlentitydefs.py from the standard Python Lib
>  'jis' (a popular version XXX):
>			Japanese character encoding
>  'unicode-escape':	See Unicode Constructors for a definition
>  'native':		Dump of the Internal Format used by Python
>
Leave JISXXX and the CJK stuff out.  If you get into Japanese, you
really need to cover ShiftJIS, EUC-JP and JIS; they are big, and there
are lots of options about how to do it.  The other ones are
algorithmic and can be small and fast and fit into the core.

Ditto with HTML, and maybe even escaped-unicode too.

In summary, the current discussion is clearly doing the right things,
but is only covering a small percentage of what needs to be done to
internationalize Python fully.

- Andy




From guido at CNRI.Reston.VA.US  Mon Nov 15 22:49:26 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Mon, 15 Nov 1999 16:49:26 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: Your message of "Mon, 15 Nov 1999 21:41:21 GMT."
             <38307984.12653394@post.demon.co.uk> 
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com>  
            <38307984.12653394@post.demon.co.uk> 
Message-ID: <199911152149.QAA28345@eric.cnri.reston.va.us>

> In summary, the current discussion is clearly doing the right things,
> but is only covering a small percentage of what needs to be done to
> internationalize Python fully.

Agreed.  So let's focus on defining interfaces that are correct and
convenient so others who want to add codecs won't have to fight our
architecture!

Is the current architecture good enough so that the Japanese codecs
will fit in it?  (I'm particularly worried about the stream codecs,
see my previous message.)

--Guido van Rossum (home page: http://www.python.org/~guido/)




From andy at robanal.demon.co.uk  Mon Nov 15 22:58:34 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Mon, 15 Nov 1999 21:58:34 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <199911152149.QAA28345@eric.cnri.reston.va.us>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com>   <38307984.12653394@post.demon.co.uk> <199911152149.QAA28345@eric.cnri.reston.va.us>
Message-ID: <3831806d.14422147@post.demon.co.uk>

On Mon, 15 Nov 1999 16:49:26 -0500, you wrote:

>> In summary, the current discussion is clearly doing the right things,
>> but is only covering a small percentage of what needs to be done to
>> internationalize Python fully.
>
>Agreed.  So let's focus on defining interfaces that are correct and
>convenient so others who want to add codecs won't have to fight our
>architecture!
>
>Is the current architecture good enough so that the Japanese codecs
>will fit in it?  (I'm particularly worried about the stream codecs,
>see my previous message.)
>
No, I don't think it is good enough.  We need a stream codec, and as
you said the string and file interfaces can be built out of that.  

You guys will know better than me what the best patterns for that
are...

- Andy







From andy at robanal.demon.co.uk  Mon Nov 15 23:30:53 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Mon, 15 Nov 1999 22:30:53 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>   <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: <383086da.16067684@post.demon.co.uk>

On Mon, 15 Nov 1999 16:37:28 -0500, you wrote:

># assuming variables input_file, input_encoding, output_file,
># output_encoding, and constant BUFFER_SIZE
>
>f = open(input_file, "rb")
>f1 = unicodec.codecs[input_encoding].stream_reader(f)
>g = open(output_file, "wb")
>g1 = unicodec.codecs[output_encoding].stream_writer(g)
>
>while 1:
>      buffer = f1.read(BUFFER_SIZE)
>      if not buffer:
>	 break
>      g1.write(buffer)
>
>g1.close()
>f1.close()
>
>Note that we could possibly make these the only API that a codec needs
>to provide; the string object <--> unicode object conversions can be
>done using this and the cStringIO module.  (On the other hand it seems
>a common case that would be quite useful.)
Perfect.  I'd keep the string ones - easy to implement but a big
convenience.

The proposal also says:
>For explicit handling of Unicode using files, the unicodec module
>could provide stream wrappers which provide transparent
>encoding/decoding for any open stream (file-like object):
>
>  import unicodec
>  file = open('mytext.txt','rb')
>  ufile = unicodec.stream(file,'utf-16')
>  u = ufile.read()
>  ...
>  ufile.close()

It seems to me that if we go for stream_reader, it replaces this bit
of the proposal too - no need for unicodec to provide anything.  If
you want to have a convenience function there to save a line or two,
you could have
	unicodec.open(filename, mode, encoding)
which returned a stream_reader.
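
Something along these lines, say (a sketch only - the unicodec module
and the stream_reader/stream_writer factories are the still-hypothetical
ones from Guido's example):

import unicodec    # proposed module, does not exist yet

def uopen(filename, mode, encoding):
    f = open(filename, mode)              # caller includes 'b' in mode
    codec = unicodec.codecs[encoding]     # registry mapping from the example
    if 'w' in mode or 'a' in mode:
        return codec.stream_writer(f)
    return codec.stream_reader(f)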


- Andy




From mal at lemburg.com  Mon Nov 15 23:54:38 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Mon, 15 Nov 1999 23:54:38 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>  
	            <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: <38308F2E.44B9C6BF@lemburg.com>

[I'll get back on this tomorrow, just some quick notes here...]

Guido van Rossum wrote:
> 
> > Andy Robinson wrote:
> > >
> > > Some thoughts on the codecs...
> > >
> > > 1. Stream interface
> > > At the moment a codec has dump and load methods which
> > > read a (slice of a) stream into a string in memory and
> > > vice versa.  As the proposal notes, this could lead to
> > > errors if you take a slice out of a stream.   This is
> > > not just due to character truncation; some Asian
> > > encodings are modal and have shift-in and shift-out
> > > sequences as they move from Western single-byte
> > > characters to double-byte ones.   It also seems a bit
> > > pointless to me as the source (or target) is still a
> > > Unicode string in memory.
> > >
> > > This is a real problem - a filter to convert big files
> > > between two encodings should be possible without
> > > knowledge of the particular encoding, as should one on
> > > the input/output of some server.  We can still give a
> > > default implementation for single-byte encodings.
> > >
> > > What's a good API for real stream conversion?   just
> > > Codec.encodeStream(infile, outfile)  ?  or is it more
> > > useful to feed the codec with data a chunk at a time?
> 
> M.-A. Lemburg responds:
> 
> > The idea was to use Unicode as intermediate for all
> > encoding conversions.
> >
> > What you envision here are stream recoders. They can
> > easily be implemented as a useful addition to the Codec
> > subclasses, but I don't think that these have to go
> > into the core.
> 
> What I wanted was a codec API that acts somewhat like a buffered file;
> the buffer makes it possible to handle shift states efficiently.  This
> is not exactly what Andy shows, but it's not what Marc's current spec
> has either.
> 
> I had thought something more like what Java does: an output stream
> codec's constructor takes a writable file object and the object
> returned by the constructor has a write() method, a flush() method and
> a close() method.  It acts like a buffering interface to the
> underlying file; this allows it to generate the minimal number of
> shift sequences.  Similar for input stream codecs.

The Codecs provide implementations for encoding and decoding;
they are not intended as complete wrappers for e.g. files or
sockets.

The unicodec module will define a generic stream wrapper
(which is yet to be defined) for dealing with files, sockets,
etc. It will use the codec registry to do the actual codec
work.
 
From the proposal:
"""
For explicit handling of Unicode using files, the unicodec module
could provide stream wrappers which provide transparent
encoding/decoding for any open stream (file-like object):

  import unicodec
  file = open('mytext.txt','rb')
  ufile = unicodec.stream(file,'utf-16')
  u = ufile.read()
  ...
  ufile.close()

XXX unicodec.file(<filename>,<mode>,<encoding>) could be provided as
    short-hand for unicodec.file(open(<filename>,<mode>),<encoding>) which
    also assures that <mode> contains the 'b' character when needed.

XXX Specify the wrapper(s)...

    Open issues: what to do with Python strings
    fed to the .write() method (may need to know the encoding of the
    strings) and when/if to return Python strings through the .read()
    method.

    Perhaps we need more than one type of wrapper here.
"""

> Andy's file translation example could then be written as follows:
> 
> # assuming variables input_file, input_encoding, output_file,
> # output_encoding, and constant BUFFER_SIZE
> 
> f = open(input_file, "rb")
> f1 = unicodec.codecs[input_encoding].stream_reader(f)
> g = open(output_file, "wb")
> g1 = unicodec.codecs[output_encoding].stream_writer(g)
> 
> while 1:
>       buffer = f1.read(BUFFER_SIZE)
>       if not buffer:
>          break
>       g1.write(buffer)
> 
> g1.close()
> f1.close()

 
> Note that we could possibly make these the only API that a codec needs
> to provide; the string object <--> unicode object conversions can be
> done using this and the cStringIO module.  (On the other hand it seems
> a common case that would be quite useful.)

You wouldn't want to go via cStringIO for *every* encoding
translation.

The Codec interface defines two pairs of methods
on purpose: one which works internally (ie. directly between
strings and Unicode objects), and one which works externally
(directly between a stream and Unicode objects).
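
Roughly this outline, in other words (the stream pair uses the
dump/load names mentioned earlier in the thread; the in-memory pair is
shown here as encode/decode, and both names and signatures should be
treated as guesses, not as the final interface):

class Codec:
    # in-memory pair: Python string <-> Unicode object
    def encode(self, u):
        raise NotImplementedError
    def decode(self, data):
        raise NotImplementedError

    # stream pair: Unicode object <-> file-like object
    def dump(self, u, stream):
        stream.write(self.encode(u))
    def load(self, stream, length=-1):
        return self.decode(stream.read(length))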

> > > 2. Data driven codecs
> > > I really like codecs being objects, and believe we
> > > could build support for a lot more encodings, a lot
> > > sooner than is otherwise possible, by making them data
> > > driven rather making each one compiled C code with
> > > static mapping tables.  What do people think about the
> > > approach below?
> > >
> > > First of all, the ISO8859-1 series are straight
> > > mappings to Unicode code points.  So one Python script
> > > could parse these files and build the mapping table,
> > > and a very small data file could hold these encodings.
> > >   A compiled helper function analogous to
> > > string.translate() could deal with most of them.
> >
> > The problem with these large tables is that currently
> > Python modules are not shared among processes since
> > every process builds its own table.
> >
> > Static C data has the advantage of being shareable at
> > the OS level.
> 
> Don't worry about it.  128K is too small to care, I think...

Huh ? 128K for every process using Python ? That quickly
sums up to lots of megabytes lying around pretty much unused.

> > You can of course implement Python based lookup tables,
> > but these should be too large...
> >
> > > Secondly, the double-byte ones involve a mixture of
> > > algorithms and data.  The worst cases I know are modal
> > > encodings which need a single-byte lookup table, a
> > > double-byte lookup table, and have some very simple
> > > rules about escape sequences in between them.  A
> > > simple state machine could still handle these (and the
> > > single-byte mappings above become extra-simple special
> > > cases); I could imagine feeding it a totally
> > > data-driven set of rules.
> > >
> > > Third, we can massively compress the mapping tables
> > > using a notation which just lists contiguous ranges;
> > > and very often there are relationships between
> > > encodings.  For example, "cpXYZ is just like cpXYY but
> > > with an extra 'smiley' at 0XFE32".  In these cases, a
> > > script can build a family of related codecs in an
> > > auditable manner.
> >
> > These are all great ideas, but I think they unnecessarily
> > complicate the proposal.
> 
> Agreed, let's leave the *implementation* of codecs out of the current
> efforts.
> 
> However I want to make sure that the *interface* to codecs is defined
> right, because changing it will be expensive.  (This is Linus
> Torvalds' philosophy on drivers -- he doesn't care about bugs in
> drivers, as they will get fixed; however he greatly cares about
> defining the driver APIs correctly.)
> 
> > > 3. What encodings to distribute?
> > > The only clean answers to this are 'almost none', or
> > > 'everything that Unicode 3.0 has a mapping for'.  The
> > > latter is going to add some weight to the
> > > distribution.  What are people's feelings?  Do we ship
> > > any at all apart from the Unicode ones?  Should new
> > > encodings be downloadable from www.python.org?  Should
> > > there be an optional package outside the main
> > > distribution?
> >
> > Since Codecs can be registered at runtime, there is quite
> > some potential there for extension writers coding their
> > own fast codecs. E.g. one could use mxTextTools as codec
> > engine working at C speeds.
> 
> (Do you think you'll be able to extort some money from HP for these? :-)

Don't know, it depends on what their specs look like. I use
mxTextTools for fast HTML file processing. It uses a small
Turing machine with some extra magic and is progammable via
Python tuples.
 
> > I would propose to only add some very basic encodings to
> > the standard distribution, e.g. the ones mentioned under
> > Standard Codecs in the proposal:
> >
> >   'utf-8':            8-bit variable length encoding
> >   'utf-16':           16-bit variable length encoding (little/big endian)
> >   'utf-16-le':                utf-16 but explicitly little endian
> >   'utf-16-be':                utf-16 but explicitly big endian
> >   'ascii':            7-bit ASCII codepage
> >   'latin-1':          Latin-1 codepage
> >   'html-entities':    Latin-1 + HTML entities;
> >                       see htmlentitydefs.py from the standard Python Lib
> >   'jis' (a popular version XXX):
> >                       Japanese character encoding
> >   'unicode-escape':   See Unicode Constructors for a definition
> >   'native':           Dump of the Internal Format used by Python
> >
> > Perhaps not even 'html-entities' (even though it would make
> > a cool replacement for cgi.escape()) and maybe we should
> > also place the JIS encoding into a separate Unicode package.
> 
> I'd drop html-entities, it seems too cutesie.  (And who uses these
> anyway, outside browsers?)

Ok.
 
> For JIS (shift-JIS?) I hope that Andy can help us with some pointers
> and validation.
> 
> And unicode-escape: now that you mention it, this is a section of
> the proposal that I don't understand.  I quote it here:
> 
> | Python should provide a built-in constructor for Unicode strings which
> | is available through __builtins__:
> |
> |   u = unicode(<encoded Python string>[,<encoding name>=<default encoding>])
>                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I meant this as an optional second argument defaulting to
whatever we define <default encoding> to mean, e.g. 'utf-8'.

u = unicode("string","utf-8") == unicode("string")

The <encoding name> argument must be a string identifying one
of the registered codecs.
 
> | With the 'unicode-escape' encoding being defined as:
> |
> |   u = u'<unicode-escape encoded Python string>'
> |
> | · for single characters (and this includes all \XXX sequences except \uXXXX),
> |   take the ordinal and interpret it as Unicode ordinal;
> |
> | · for \uXXXX sequences, insert the Unicode character with ordinal 0xXXXX
> |   instead, e.g. \u03C0 to represent the character Pi.
> 
> I've looked at this several times and I don't see the difference
> between the two bullets.  (Ironically, you are using a non-ASCII
> character here that doesn't always display, depending on where I look
> at your mail :-).

The first bullet covers the normal Python string characters
and escapes, e.g. \n and \267 (the center dot ;-), while the
second explains how \uXXXX is interpreted.
 
> Can you give some examples?
> 
> Is u'\u0020' different from u'\x20' (a space)?

No, they both map to the same Unicode ordinal.
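In other words, sketching the intended semantics:

    u'\u0020' == u'\x20' == u' '    # one and the same character, ordinal 32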

> Does '\u0020' (no u prefix) have a meaning?

No, \uXXXX is only defined for u"" strings or strings that are
used to build Unicode objects with this encoding:

u = u'\u0020' == unicode(r'\u0020','unicode-escape')

Note that writing \uXX is an error, e.g. u"\u12 " will cause
a syntax error.
 
Aside: I just noticed that '\x2010' doesn't give '\x20' + '10'
but instead '\x10' -- is this intended ?

> Also, I remember reading Tim Peters who suggested that a "raw unicode"
> notation (ur"...") might be necessary, to encode regular expressions.
> I tend to agree.

This can be had via unicode():

u = unicode(r'\a\b\c\u0020','unicode-escaped')

If that's too long, define a ur() function which wraps up the
above line in a function.
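Something along these lines (a sketch only; the caller must pass a raw
string so the other backslash escapes survive, and I'm assuming the codec
is registered under the name 'unicode-escape'):

    def ur(s):
        # s is expected to be an r'...' literal
        return unicode(s, 'unicode-escape')

    u = ur(r'\a\b\c\u0020')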

> While I'm on the topic, I don't see in your proposal a description of
> the source file character encoding.  Currently, this is undefined, and
> in fact can be (ab)used to enter non-ASCII in string literals.  For
> example, a programmer named François might write a file containing
> this statement:
> 
>   print "Written by Fran?ois." # (There's a cedilla in there!)
> 
> (He assumes his source character encoding is Latin-1, and he doesn't
> want to have to type \347 when he can type a cedilla on his keyboard.)
> 
> If his source file (or .pyc file!)  is executed by a Japanese user,
> this will probably print some garbage.
> 
> Using the new Unicode strings, François could change his program as
> follows:
> 
>   print unicode("Written by François.", "latin-1")
> 
> Assuming that François sets his sys.stdout to use Latin-1, while the
> Japanese user sets his to shift-JIS (or whatever his kanjiterm uses).
> 
> But when the Japanese user views François' source file, he will again
> see garbage.  If he uses a generic tool to translate latin-1 files to
> shift-JIS (assuming shift-JIS has a cedilla character) the program
> will no longer work correctly -- the string "latin-1" has to be
> changed to "shift-jis".
> 
> What should we do about this?  The safest and most radical solution is
> to disallow non-ASCII source characters; François will then have to
> type
> 
>   print u"Written by Fran\u00E7ois."
> 
> but, knowing François, he probably won't like this solution very much
> (since he didn't like the \347 version either).

I think best is to leave it undefined... as with all files,
only the programmer knows what format and encoding it contains,
e.g. a Japanese programmer might want to use a shift-JIS editor
to enter strings directly in shift-JIS via

u = unicode("...shift-JIS encoded text...","shift-jis")

Of course, this is not readable using an ASCII editor, but
Python will continue to produce the intended string.
NLS strings don't belong in program text anyway: l10n usually
takes the gettext() approach to handle these issues.
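I.e. roughly this (sketch only; how the catalog gets filled is up to the
l10n machinery, the names here are made up):

    catalog = {}                        # message id -> translated string

    def _(msgid):
        return catalog.get(msgid, msgid)

    print _("Written by Francois.")     # source stays ASCII, translations live elsewhere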

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    46 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From andy at robanal.demon.co.uk  Tue Nov 16 01:09:28 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Tue, 16 Nov 1999 00:09:28 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <38308F2E.44B9C6BF@lemburg.com>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>   <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <38308F2E.44B9C6BF@lemburg.com>
Message-ID: <3839a078.22625844@post.demon.co.uk>

On Mon, 15 Nov 1999 23:54:38 +0100, you wrote:

>[I'll get back on this tomorrow, just some quick notes here...]
>The Codecs provide implementations for encoding and decoding,
>they are not intended as complete wrappers for e.g. files or
>sockets.
>
>The unicodec module will define a generic stream wrapper
>(which is yet to be defined) for dealing with files, sockets,
>etc. It will use the codec registry to do the actual codec
>work.
> 
>XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as
>    short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which
>    also assures that <mode> contains the 'b' character when needed.
>
>The Codec interface defines two pairs of methods
>on purpose: one which works internally (ie. directly between
>strings and Unicode objects), and one which works externally
>(directly between a stream and Unicode objects).

That's the problem Guido and I are worried about.  Your present API is
not enough to build stream encoders.  The 'slurp it into a unicode
string in one go' approach fails for big files or for network
connections.  And you just cannot build a generic stream reader/writer
by slicing it into strings.   The solution must be specific to the
codec - only it knows how much to buffer, when to flip states etc.  

So the codec should provide proper stream reading and writing
services.  

Unicodec can then wrap those up in labour-saving ways - I'm not fussy
which but I like the one-line file-open utility.
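Something like this, say (just to show the shape of it -- uopen(),
unicodec.lookup() and StreamReader are invented names, not existing APIs):

    def uopen(filename, mode='r', encoding='utf-8'):
        # returns a file-like object whose read() hands back Unicode
        codec = unicodec.lookup(encoding)       # assumed registry call
        return codec.StreamReader(open(filename, mode + 'b'))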


- Andy








From tim_one at email.msn.com  Tue Nov 16 06:38:32 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:38:32 -0500
Subject: [Python-Dev] Unicode proposal: %-formatting ?
In-Reply-To: <382AE7D9.147D58CB@lemburg.com>
Message-ID: <000001bf2ff4$d36e2540$042d153f@tim>

[MAL]
> I wonder how we could add %-formatting to Unicode strings without
> duplicating the PyString_Format() logic.
>
> First, do we need Unicode object %-formatting at all ?

Sure -- in the end, all the world speaks Unicode natively and encodings
become historical baggage.  Granted I won't live that long, but I may last
long enough to see encodings become almost purely an I/O hassle, with all
computation done in Unicode.

> Second, here is an emulation using strings and <the default encoding>
> that should give an idea of how one could work with the different
> encodings:
>
>     s = '%s %i abc???' # a Latin-1 encoded string
>     t = (u,3)

What's u?  A Unicode object?  Another Latin-1 string?  A default-encoded
string?  How does the following know the difference?

>     # Convert Latin-1 s to a <default encoding> string via Unicode
>     s1 = unicode(s,'latin-1').encode()
>
>     # The '%s' will now add u in <default encoding>
>     s2 = s1 % t
>
>     # Finally, convert the <default encoding> encoded string to Unicode
>     u1 = unicode(s2)

I don't expect this actually works:  for example, change %s to %4s.
Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to
know that some (or all) characters in u consume multiple bytes, so can't
extract "the right" number of bytes from u.  I think % formating has to know
the truth of what you're doing.

> Note that .encode() defaults to the current setting of
> <default encoding>.
>
> Provided u maps to Latin-1, an alternative would be:
>
>     u1 = unicode('%s %i abc???' % (u.encode('latin-1'),3), 'latin-1')

More interesting is fmt % tuple where everything is Unicode; people can muck
with Latin-1 directly today using regular strings, so the example above
mostly shows artificial convolution.





From tim_one at email.msn.com  Tue Nov 16 06:38:40 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:38:40 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <382BDD81.458D3125@lemburg.com>
Message-ID: <000101bf2ff4$d636bb20$042d153f@tim>

[MAL, on raw Unicode strings]
> ...
> Agreed... note that you could also write your own codec for just this
> reason and then use:
>
> u = unicode('....\u1234...\...\...','raw-unicode-escaped')
>
> Put that into a function called 'ur' and you have:
>
> u = ur('...\u4545...\...\...')
>
> which is not that far away from ur'...' w/r to cosmetics.

Well, not quite.  In general you need to pass raw strings:

u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
            ^
u = ur(r'...\u4545...\...\...')
       ^

else Python will replace all the other backslash sequences.  This is a
crucial distinction at times; e.g., else \b in a Unicode regexp will expand
into a backspace character before the regexp processor ever sees it (\b is
supposed to be a word boundary assertion).
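Concretely:

    len('\b')     # 1 -- a single backspace character
    len(r'\b')    # 2 -- backslash + 'b', which is what the regexp engine must see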





From tim_one at email.msn.com  Tue Nov 16 06:44:42 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:44:42 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: 
Message-ID: <000201bf2ff5$ae6aefc0$042d153f@tim>

[Tim, wonders why Perl and Tcl went w/ UTF-8 internally]

[Greg Stein]
> Probably for the exact reason that you stated in your messages: many
> 8-bit (7-bit?) functions continue to work quite well when given a
> UTF-8-encoded string. i.e. they didn't have to rewrite the entire
> Perl/TCL interpreter to deal with a new string type.
>
> I'd guess it is a helluva lot easier for us to add a Python Type than
> for Perl or TCL to whack around with new string types (since they use
> strings so heavily).

Sounds convincing to me!  Bumped into an old thread on c.l.p.m. that
suggested Perl was also worried about UCS-2's 64K code point limit.  But I'm
already on record as predicting we'll regret any decision .





From tim_one at email.msn.com  Tue Nov 16 06:52:12 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:52:12 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: 
Message-ID: <000501bf2ff6$ba943a80$042d153f@tim>

[Da Silva, Mike]
> ...
> 5.	UTF-16 requires string operations that do not make assumptions
> about nulls - this means re-implementing most of the C runtime
> functions to work with unsigned shorts.

Python strings are already null-friendly, so Python has already recoded
everything it needs to get away from the no-null assumption; stropmodule.c
is < 1,500 lines of code, and MAL can turn it into C++ template functions in
his sleep .





From tim_one at email.msn.com  Tue Nov 16 06:56:18 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 00:56:18 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <19991112121303.27452.rocketmail@ web605.yahoomail.com>
Message-ID: <000601bf2ff7$4d8a4c80$042d153f@tim>

[Andy Robinson]
> ...
> I presume no one is actually advocating dropping
> ordinary Python strings, or the ability to do
>    rawdata = open('myfile.txt', 'rb').read()
> without any transformations?

If anyone has advocated either, they've successfully hidden it from me.
Anyone?





From tim_one at email.msn.com  Tue Nov 16 07:09:04 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 01:09:04 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382BF6C3.D79840EC@lemburg.com>
Message-ID: <000701bf2ff9$15cecda0$042d153f@tim>

[MAL]
> BTW, wouldn't it be possible to take pcre and have it
> use Py_Unicode instead of char ? [Of course, there would have to
> be some extensions for character classes etc.]

No, alas.  The assumption that characters are 8 bits is ubiquitous, in both
obvious and subtle ways.

if ((start_bits[c/8] & (1 << (c&7))) == 0) start_match++; else break;





From tim_one at email.msn.com  Tue Nov 16 07:19:16 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 01:19:16 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <382C3749.198EEBC6@lemburg.com>
Message-ID: <000801bf2ffa$82273400$042d153f@tim>

[MAL]
> sys.bom should return the byte order mark (BOM) for the format used
> internally. The unicodec module should provide symbols for all
> possible values of this variable:
>
>   BOM_BE: '\376\377' 
>     (corresponds to Unicode 0x0000FEFF in UTF-16 
>      == ZERO WIDTH NO-BREAK SPACE)
>
>   BOM_LE: '\377\376' 
>     (corresponds to Unicode 0x0000FFFE in UTF-16 
>      == illegal Unicode character)
>
>   BOM4_BE: '\000\000\377\376'
>     (corresponds to Unicode 0x0000FEFF in UCS-4)

Should be
    BOM4_BE: '\000\000\376\377'   
 
>   BOM4_LE: '\376\377\000\000'
>     (corresponds to Unicode 0x0000FFFE in UCS-4)

Should be
    BOM4_LE: '\377\376\000\000'





From tim_one at email.msn.com  Tue Nov 16 07:31:39 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 01:31:39 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: <14380.16437.71847.832880@weyr.cnri.reston.va.us>
Message-ID: <000901bf2ffc$3d4bb8e0$042d153f@tim>

[Fred L. Drake, Jr.]
> ...
>   I wasn't suggesting the PyStringObject be changed, only that the
> PyUnicodeObject could maintain a reference.  Consider:
>
>         s = fp.read()
>         u = unicode(s, 'utf-8')
>
> u would now hold a reference to s, and s/s# would return a pointer
> into s instead of re-building the UTF-8 form.  I talked myself out of
> this because it would be too easy to keep a lot more string objects
> around than were actually needed.

Yet another use for a weak reference <0.5 wink>.





From tim_one at email.msn.com  Tue Nov 16 07:41:44 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 01:41:44 -0500
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: 
Message-ID: <000b01bf2ffd$a5ad69a0$042d153f@tim>

[MAL]
>   BOM_BE: '\376\377'
>     (corresponds to Unicode 0x0000FEFF in UTF-16
>      == ZERO WIDTH NO-BREAK SPACE)

[Greg Stein]
> Are you sure about that interpretation? I thought the BOM characters
> (0xFEFF and 0xFFFE) were *reserved* in the UCS-2 space.

I can't speak to MAL's degree of certainty , but he's right about this
stuff.  There is only one BOM character, U+FEFF, which is the zero-width
no-break space.  The byte-swapped form is not only reserved, it's guaranteed
never to be assigned to a character.
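Which is exactly what makes BOM sniffing reliable; e.g. (a sketch, the file
name is made up):

    f = open('data.txt', 'rb')
    bom = f.read(2)
    if bom == '\376\377':
        byteorder = 'big-endian'
    elif bom == '\377\376':
        byteorder = 'little-endian'
    else:
        byteorder = None    # no BOM -- fall back to a default and rewind
        f.seek(0)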





From tim_one at email.msn.com  Tue Nov 16 08:47:06 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 02:47:06 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: <000d01bf3006$c7823700$042d153f@tim>

[Guido]
> ...
> While I'm on the topic, I don't see in your proposal a description of
> the source file character encoding.  Currently, this is undefined, and
> in fact can be (ab)used to enter non-ASCII in string literals.
> ...
> What should we do about this?  The safest and most radical solution is
> to disallow non-ASCII source characters; François will then have to
> type
>
>   print u"Written by Fran\u00E7ois."
>
> but, knowing François, he probably won't like this solution very much
> (since he didn't like the \347 version either).

So long as Python opens source files using libc text mode, it can't
guarantee more than C does:  the presence of any character other than tab,
newline, and ASCII 32-126 inclusive renders the file contents undefined.

Go beyond that, and you've got the same problem as mailers and browsers, and
so also the same solution:  open source files in binary mode, and add a
pragma specifying the intended charset.
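One possible spelling for such a pragma, borrowing the Emacs file-variables
convention (purely illustrative -- nothing like this is in the proposal):

    # -*- coding: latin-1 -*-       # declares the charset of this source file
    print "Written by Francois."    # non-ASCII literals would now have a defined meaning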

As a practical matter, declare that Python source is Latin-1 for now, and
declare any *system* that doesn't support that non-conforming .

python-is-the-measure-of-all-things-ly y'rs  - tim





From tim_one at email.msn.com  Tue Nov 16 08:47:08 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Tue, 16 Nov 1999 02:47:08 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <38308F2E.44B9C6BF@lemburg.com>
Message-ID: <000e01bf3006$c8c11fa0$042d153f@tim>

[Guido]
>> Does '\u0020' (no u prefix) have a meaning?

[MAL]
> No, \uXXXX is only defined for u"" strings or strings that are
> used to build Unicode objects with this encoding:

I believe your intent is that '\u0020' be exactly those 6 characters, just
as today.  That is, it does have a meaning, but its meaning differs between
Unicode string literals and regular string literals.

> Note that writing \uXX is an error, e.g. u"\u12 " will cause
> a syntax error.

Although I believe your intent  is that, just as today, '\u12' is not
an error.

> Aside: I just noticed that '\x2010' doesn't give '\x20' + '10'
> but instead '\x10' -- is this intended ?

Yes; see 2.4.1 ("String literals") of the Lang Ref.  Blame the C committee
for not defining \x in a platform-independent way.  Note that a Python \x
escape consumes *all* following hex characters, no matter how many -- and
ignores all but the last two.
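So, under the current rules:

    '\x2010' == '\x10'      # all four hex digits consumed, only the last two kept
    len('\x2010') == 1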

> This [raw Unicode strings] can be had via unicode():
>
> u = unicode(r'\a\b\c\u0020','unicode-escaped')
>
> If that's too long, define a ur() function which wraps up the
> above line in a function.

As before, I think that's fine for now, but won't stand forever.





From fredrik at pythonware.com  Tue Nov 16 09:39:20 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 09:39:20 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>             <38305D17.60EC94D0@lemburg.com>  <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: <010001bf300e$14741310$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> I had thought something more like what Java does: an output stream
> codec's constructor takes a writable file object and the object
> returned by the constructor has a write() method, a flush() method and
> a close() method.  It acts like a buffering interface to the
> underlying file; this allows it to generate the minimal number of
> shift sequences.  Similar for input stream codecs.

note that the html/sgml/xml parsers generally
support the feed/close protocol.  to be able
to use these codecs in that context, we need

1) codecs written according to the "data
   consumer model", instead of the "stream"
   model (a usage sketch follows below).

        class myDecoder:
            def __init__(self, target):
                self.target = target
                self.state = ...
            def feed(self, data):
                ... extract as much data as possible ...
                self.target.feed(extracted data)
            def close(self):
                ... extract what's left ...
                self.target.feed(additional data)
                self.target.close()

or

2) make threads mandatory, just like in Java.

or

3) add light-weight threads (ala stackless python)
   to the interpreter...

(I vote for alternative 3, but that's another story ;-)
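To make alternative 1 concrete, such a decoder would be chained in front of
an existing feed-style parser roughly like this (sketch; myDecoder is the
outline above, the file name is made up):

    import sgmllib

    parser = sgmllib.SGMLParser()
    decoder = myDecoder(parser)      # pushes decoded data into the parser
    f = open('page.html', 'rb')
    while 1:
        data = f.read(8192)
        if not data:
            break
        decoder.feed(data)
    decoder.close()                  # flush what's left, then close the parser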






From fredrik at pythonware.com  Tue Nov 16 09:58:50 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 09:58:50 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf2ff4$d636bb20$042d153f@tim>
Message-ID: <016a01bf3010$cde52620$f29b12c2@secret.pythonware.com>

Tim Peters  wrote:
> (\b is supposed to be a word boundary assertion).

in some places, that is.



    Main Entry: reg·u·lar
    Pronunciation: 're-gy&-l&r, 're-g(&-)l&r

    1 : belonging to a religious order
    2 a : formed, built, arranged, or ordered according
    to some established rule, law, principle, or type ...
    3 a : ORDERLY, METHODICAL  ...
    4 a : constituted, conducted, or done in conformity
    with established or prescribed usages, rules, or
    discipline ...




From jack at oratrix.nl  Tue Nov 16 12:05:55 1999
From: jack at oratrix.nl (Jack Jansen)
Date: Tue, 16 Nov 1999 12:05:55 +0100
Subject: [Python-Dev] Some thoughts on the codecs... 
In-Reply-To: Message by "M.-A. Lemburg"  ,
	     Mon, 15 Nov 1999 20:20:55 +0100 , <38305D17.60EC94D0@lemburg.com> 
Message-ID: <19991116110555.8B43335BB1E@snelboot.oratrix.nl>

> I would propose to only add some very basic encodings to
> the standard distribution, e.g. the ones mentioned under
> Standard Codecs in the proposal:
> 
>   'utf-8':		8-bit variable length encoding
>   'utf-16':		16-bit variable length encoding (little/big endian)
>   'utf-16-le':		utf-16 but explicitly little endian
>   'utf-16-be':		utf-16 but explicitly big endian
>   'ascii':		7-bit ASCII codepage
>   'latin-1':		Latin-1 codepage
>   'html-entities':	Latin-1 + HTML entities;
> 			see htmlentitydefs.py from the standard Python Lib
>   'jis' (a popular version XXX):
> 			Japanese character encoding
>   'unicode-escape':	See Unicode Constructors for a definition
>   'native':		Dump of the Internal Format used by Python

I would suggest adding the DOS, Windows and Macintosh standard 8-bit charsets
(their equivalents of latin-1) too, as documents in these encodings are pretty
ubiquitous. But maybe these should only be added on the respective platforms.
--
Jack Jansen             | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen at oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack    | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm 





From mal at lemburg.com  Tue Nov 16 09:35:28 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 09:35:28 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <000e01bf3006$c8c11fa0$042d153f@tim>
Message-ID: <38311750.22D17EC1@lemburg.com>

Tim Peters wrote:
> 
> [Guido]
> >> Does '\u0020' (no u prefix) have a meaning?
> 
> [MAL]
> > No, \uXXXX is only defined for u"" strings or strings that are
> > used to build Unicode objects with this encoding:
> 
> I believe your intent is that '\u0020' be exactly those 6 characters, just
> as today.  That is, it does have a meaning, but its meaning differs between
> Unicode string literals and regular string literals.

Right.
 
> > Note that writing \uXX is an error, e.g. u"\u12 " will cause
> > a syntax error.
> 
> Although I believe your intent  is that, just as today, '\u12' is not
> an error.

Right again :-) "\u12" gives a 4 byte string, u"\u12" produces an
exception.
 
> > Aside: I just noticed that '\x2010' doesn't give '\x20' + '10'
> > but instead '\x10' -- is this intended ?
> 
> Yes; see 2.4.1 ("String literals") of the Lang Ref.  Blame the C committee
> for not defining \x in a platform-independent way.  Note that a Python \x
> escape consumes *all* following hex characters, no matter how many -- and
> ignores all but the last two.

Strange definition...
 
> > This [raw Unicode strings] can be had via unicode():
> >
> > u = unicode(r'\a\b\c\u0020','unicode-escaped')
> >
> > If that's too long, define a ur() function which wraps up the
> > above line in a function.
> 
> As before, I think that's fine for now, but won't stand forever.

If Guido agrees to ur"", I can put that into the proposal too
-- it's just that things are starting to get a little crowded
for a strawman proposal ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 11:50:31 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 11:50:31 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <38307984.12653394@post.demon.co.uk>
Message-ID: <383136F7.AB73A90@lemburg.com>

Andy Robinson wrote:
> 
> Leave JISXXX and the CJK stuff out.  If you get into Japanese, you
> really need to cover ShiftJIS, EUC-JP and JIS, they are big, and there
> are lots of options about how to do it.  The other ones are
> algorithmic and can be small and fast and fit into the core.
> 
> Ditto with HTML, and maybe even escaped-unicode too.

So I can drop JIS ? [I won't be able to drop the escaped unicode
codec because this is needed for u"" and ur"".]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 11:42:19 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 11:42:19 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf2ff4$d636bb20$042d153f@tim>
Message-ID: <3831350B.8F69CB6D@lemburg.com>

Tim Peters wrote:
> 
> [MAL, on raw Unicode strings]
> > ...
> > Agreed... note that you could also write your own codec for just this
> > reason and then use:
> >
> > u = unicode('....\u1234...\...\...','raw-unicode-escaped')
> >
> > Put that into a function called 'ur' and you have:
> >
> > u = ur('...\u4545...\...\...')
> >
> > which is not that far away from ur'...' w/r to cosmetics.
> 
> Well, not quite.  In general you need to pass raw strings:
> 
> u = unicode(r'....\u1234...\...\...','raw-unicode-escaped')
>             ^
> u = ur(r'...\u4545...\...\...')
>        ^
> 
> else Python will replace all the other backslash sequences.  This is a
> crucial distinction at times; e.g., else \b in a Unicode regexp will expand
> into a backspace character before the regexp processor ever sees it (\b is
> supposed to be a word boundary assertion).

Right.

Here is a sample implementation of what I had in mind:

""" Demo for 'unicode-escape' encoding.
"""
import struct,string,re

pack_format = '>H'

def convert_string(s):

    l = map(None,s)
    for i in range(len(l)):
	l[i] = struct.pack(pack_format,ord(l[i]))
    return l

u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})')

def unicode_unescape(s):

    l = []
    start = 0
    while start < len(s):
	m = u_escape.search(s,start)
	if not m:
	    l[len(l):] = convert_string(s[start:])
	    break
	m_start,m_end = m.span()
	if m_start > start:
	    l[len(l):] = convert_string(s[start:m_start])
	hexcode = m.group(1)
	#print hexcode,start,m_start
	if len(hexcode) != 4:
	    raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode
	ordinal = string.atoi(hexcode,16)
	l.append(struct.pack(pack_format,ordinal))
	start = m_end
    #print l
    return string.join(l,'')
    
def hexstr(s,sep=''):

    return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % ord(x),s),sep)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 11:40:42 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 11:40:42 +0100
Subject: [Python-Dev] Unicode proposal: %-formatting ?
References: <000001bf2ff4$d36e2540$042d153f@tim>
Message-ID: <383134AA.4B49D178@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > I wonder how we could add %-formatting to Unicode strings without
> > duplicating the PyString_Format() logic.
> >
> > First, do we need Unicode object %-formatting at all ?
> 
> Sure -- in the end, all the world speaks Unicode natively and encodings
> become historical baggage.  Granted I won't live that long, but I may last
> long enough to see encodings become almost purely an I/O hassle, with all
> computation done in Unicode.
> 
> > Second, here is an emulation using strings and <the default encoding>
> > that should give an idea of how one could work with the different
> > encodings:
> >
> >     s = '%s %i abc???' # a Latin-1 encoded string
> >     t = (u,3)
> 
> What's u?  A Unicode object?  Another Latin-1 string?  A default-encoded
> string?  How does the following know the difference?

u refers to a Unicode object in the proposal. Sorry, forgot to
mention that.
 
> >     # Convert Latin-1 s to a <default encoding> string via Unicode
> >     s1 = unicode(s,'latin-1').encode()
> >
> >     # The '%s' will now add u in <default encoding>
> >     s2 = s1 % t
> >
> >     # Finally, convert the <default encoding> encoded string to Unicode
> >     u1 = unicode(s2)
> 
> I don't expect this actually works:  for example, change %s to %4s.
> Assuming u is either UTF-8 or Unicode, PyString_Format isn't smart enough to
> know that some (or all) characters in u consume multiple bytes, so can't
> extract "the right" number of bytes from u.  I think % formating has to know
> the truth of what you're doing.

Hmm, guess you're right... format parameters should indeed refer
to characters rather than number of encoding bytes.

This means a new PyUnicode_Format() implementation mapping
Unicode format objects to Unicode objects.
 
> > Note that .encode() defaults to the current setting of
> > <default encoding>.
> >
> > Provided u maps to Latin-1, an alternative would be:
> >
> >     u1 = unicode('%s %i abc???' % (u.encode('latin-1'),3), 'latin-1')
> 
> More interesting is fmt % tuple where everything is Unicode; people can muck
> with Latin-1 directly today using regular strings, so the example above
> mostly shows artificial convolution.

... hmm, there is a problem there: how should the PyUnicode_Format()
API deal with '%s' when it sees a Unicode object as argument ?

E.g. what would you get in these cases:

u = u"%s %s" % (u"abc", "abc")

Perhaps we need a new marker for "insert Unicode object here".
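(One plausible rule -- nothing is settled here -- would be for '%s' to
coerce its argument to Unicode, so that

    u"%s %s" % (u"abc", "abc") == u"abc abc"

but that is exactly the decision that still has to be made.)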

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 11:48:13 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 11:48:13 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>   <38305D17.60EC94D0@lemburg.com> <199911152137.QAA28280@eric.cnri.reston.va.us> <38308F2E.44B9C6BF@lemburg.com> <3839a078.22625844@post.demon.co.uk>
Message-ID: <3831366D.8A09E194@lemburg.com>

Andy Robinson wrote:
> 
> On Mon, 15 Nov 1999 23:54:38 +0100, you wrote:
> 
> >[I'll get back on this tomorrow, just some quick notes here...]
> >The Codecs provide implementations for encoding and decoding,
> >they are not intended as complete wrappers for e.g. files or
> >sockets.
> >
> >The unicodec module will define a generic stream wrapper
> >(which is yet to be defined) for dealing with files, sockets,
> >etc. It will use the codec registry to do the actual codec
> >work.
> >
> >XXX unicodec.file(<filename>,<mode>,<encname>) could be provided as
> >    short-hand for unicodec.file(open(<filename>,<mode>),<encname>) which
> >    also assures that <mode> contains the 'b' character when needed.
> >
> >The Codec interface defines two pairs of methods
> >on purpose: one which works internally (ie. directly between
> >strings and Unicode objects), and one which works externally
> >(directly between a stream and Unicode objects).
> 
> That's the problem Guido and I are worried about.  Your present API is
> not enough to build stream encoders.  The 'slurp it into a unicode
> string in one go' approach fails for big files or for network
> connections.  And you just cannot build a generic stream reader/writer
> by slicing it into strings.   The solution must be specific to the
> codec - only it knows how much to buffer, when to flip states etc.
> 
> So the codec should provide proper stream reading and writing
> services.

I guess I'll have to rethink the Codec specs. Some leads:

1. introduce a new StreamCodec class which is designed for
   handling stream encoding and decoding (and supports
   state)

2. give more information to the unicodec registry: 
   one could register classes instead of instances which the Unicode
   implementation would then instantiate whenever it needs to
   apply the conversion; since this is only needed for encodings
   maintaining state, the registry would only have to do the
   instantiation for these codecs and could use cached instances for
   stateless codecs.
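A rough sketch of lead 2 (all names invented for illustration):

    import string, types

    _codecs = {}                         # encoding name -> codec class or instance

    def register(name, codec):
        _codecs[string.lower(name)] = codec

    def lookup(name):
        codec = _codecs[string.lower(name)]
        if type(codec) is types.ClassType:
            return codec()               # stateful codec: fresh instance per use
        return codec                     # stateless codec: shared cached instance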
 
> Unicodec can then wrap those up in labour-saving ways - I'm not fussy
> which but I like the one-line file-open utility.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From fredrik at pythonware.com  Tue Nov 16 12:38:31 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 12:38:31 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com>
Message-ID: <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com>

> I would propose to only add some very basic encodings to
> the standard distribution, e.g. the ones mentioned under
> Standard Codecs in the proposal:
> 
>   'utf-8': 8-bit variable length encoding
>   'utf-16': 16-bit variable length encoding (little/big endian)
>   'utf-16-le': utf-16 but explicitly little endian
>   'utf-16-be': utf-16 but explicitly big endian
>   'ascii': 7-bit ASCII codepage
>   'latin-1': Latin-1 codepage
>   'html-entities': Latin-1 + HTML entities;
> see htmlentitydefs.py from the standard Python Lib
>   'jis' (a popular version XXX):
> Japanese character encoding
>   'unicode-escape': See Unicode Constructors for a definition
>   'native': Dump of the Internal Format used by Python

since this is already very close, maybe we could adopt
the naming guidelines from XML:

    In an encoding declaration, the values "UTF-8", "UTF-16",
    "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used
    for the various encodings and transformations of
    Unicode/ISO/IEC 10646, the values "ISO-8859-1",
    "ISO-8859-2", ... "ISO-8859-9" should be used for the parts
    of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS",
    and "EUC-JP" should be used for the various encoded
    forms of JIS X-0208-1997.

    XML processors may recognize other encodings; it is
    recommended that character encodings registered
    (as charsets) with the Internet Assigned Numbers
    Authority [IANA], other than those just listed,
    should be referred to using their registered names.

    Note that these registered names are defined to be
    case-insensitive, so processors wishing to match
    against them should do so in a case-insensitive way.

(ie "iso-8859-1" instead of "latin-1", etc -- at least as
aliases...).






From gstein at lyra.org  Tue Nov 16 12:45:48 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 03:45:48 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com>
Message-ID: 

On Tue, 16 Nov 1999, Fredrik Lundh wrote:
>...
> since this is already very close, maybe we could adopt
> the naming guidelines from XML:
> 
>     In an encoding declaration, the values "UTF-8", "UTF-16",
>     "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used
>     for the various encodings and transformations of
>     Unicode/ISO/IEC 10646, the values "ISO-8859-1",
>     "ISO-8859-2", ... "ISO-8859-9" should be used for the parts
>     of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS",
>     and "EUC-JP" should be used for the various encoded
>     forms of JIS X-0208-1997.
> 
>     XML processors may recognize other encodings; it is
>     recommended that character encodings registered
>     (as charsets) with the Internet Assigned Numbers
>     Authority [IANA], other than those just listed,
>     should be referred to using their registered names.
> 
>     Note that these registered names are defined to be
>     case-insensitive, so processors wishing to match
>     against them should do so in a case-insensitive way.
> 
> (ie "iso-8859-1" instead of "latin-1", etc -- at least as
> aliases...).

+1

(as we'd say in Apache-land... :-)

-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Tue Nov 16 13:04:47 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 04:04:47 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <3830595B.348E8CC7@lemburg.com>
Message-ID: 

On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
> Guido van Rossum wrote:
>...
> > t# refers to byte-encoded data.  Multibyte encodings are explicitly
> > designed to be passed cleanly through processing steps that handle
> > single-byte character data, as long as they are 8-bit clean and don't
> > do too much processing.
> 
> Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
> "8-bit clean" as you obviously did.

Hrm. That might be dangerous. Many of the functions that use "t#" assume
that each character is 8 bits long, i.e. the returned length == the number
of characters.

I'm not sure what the implications would be if you interpret the semantics
of "t#" as multi-byte characters.

>...
> > For example, take an encryption engine.  While it is defined in terms
> > of byte streams, there's no requirement that the bytes represent
> > characters -- they could be the bytes of a GIF file, an MP3 file, or a
> > gzipped tar file.  If we pass Unicode to an encryption engine, we want
> > Unicode to come out at the other end, not UTF-8.  (If we had wanted to
> > encrypt UTF-8, we should have fed it UTF-8.)

Heck. I just want to quickly throw the data onto my disk. I'll write a
BOM, followed by the raw data. Done. It's even portable.

>...
> > Aha, I think there's a confusion about what "8-bit" means.  For me, a
> > multibyte encoding like UTF-8 is still 8-bit.  Am I alone in this?

Maybe. I don't see multi-byte characters as 8-bit (in the sense of the "t"
format).

> > (As far as I know, C uses char* to represent multibyte characters.)
> > Maybe we should disambiguate it more explicitly?

We can disambiguate with a new format character, or we can clarify the
semantics of "t" to mean single- *or* multi- byte characters. Again, I
think there may be trouble if the semantics of "t" are defined to allow
multibyte characters.

> There should be some definition for the two markers and the
> ideas behind them in the API guide, I guess.

Certainly.

[ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]

> > > Hmm, I would strongly object to making "s#" return the internal
> > > format. file.write() would then default to writing UTF-16 data
> > > instead of UTF-8 data. This could result in strange errors
> > > due to the UTF-16 format being endian dependent.
> > 
> > But this was the whole design.  file.write() needs to be changed to
> > use s# when the file is open in binary mode and t# when the file is
> > open in text mode.

Interesting idea, but that presumes that "t" will be defined for the Unicode
object (i.e. it implements the getcharbuffer type slot). Because of the
multi-byte problem, I don't think it will.
[ not to mention, that I don't think the Unicode object should implicitly
  do a UTF-8 conversion and hold a ref to the resulting string ]

>...
> I still don't feel very comfortable about the fact that all
> existing APIs using "s#" will suddenly receive UTF-16 data if
> being passed Unicode objects: this probably won't get us the
> "magical" Unicode integration we invision, since "t#" usage is not
> very wide spread and character handling code will probably not
> work well with UTF-16 encoded strings.

I'm not sure that we should definitely go for "magical." Perl has magic in
it, and that is one of its worst faults. Go for clean and predictable, and
leave as much logic to the Python level as possible. The interpreter
should provide a minimum of functionality, rather than second-guessing and
trying to be neat and sneaky with its operation.

>...
> > Because file.write() for a binary file, and other similar things
> > (e.g. the encryption engine example I mentioned above) must have
> > *some* way to get at the raw bits.
> 
> What for ?

How about: "because I'm the application developer, and I say that I want
the raw bytes in the file."

> Any lossless encoding should do the trick... UTF-8
> is just as good as UTF-16 for binary files; plus it's more compact
> for ASCII data. I don't really see a need to get explicitly
> at the internal data representation because both encodings are
> in fact "internal" w/r to Unicode objects.
> 
> The only argument I can come up with is that using UTF-16 for
> binary files could (possibly) eliminate the UTF-8 conversion step
> which is otherwise always needed.

The argument that I come up with is "don't tell me how to design my
storage format, and don't make Python force me into one."

If I want to write Unicode text to a file, the most natural thing to do
is:

open('file', 'w').write(u)

If you do a conversion on me, then I'm not writing Unicode. I've got to go
and do some nasty conversion which just monkeys up my program.

If I have a Unicode object, but I *want* to write UTF-8 to the file, then
the cleanest thing is:

open('file', 'w').write(encode(u, 'utf-8'))

This is clear that I've got a Unicode object input, but I'm writing UTF-8.

I have a second argument, too: See my first argument. :-)

Really... this is kind of what Fredrik was trying to say: don't get in the
way of the application programmer. Give them tools, but avoid policy and
gimmicks and other "magic".

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Tue Nov 16 13:09:17 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 04:09:17 -0800 (PST)
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <199911152137.QAA28280@eric.cnri.reston.va.us>
Message-ID: 

On Mon, 15 Nov 1999, Guido van Rossum wrote:
>...
> > The problem with these large tables is that currently
> > Python modules are not shared among processes since
> > every process builds its own table.
> > 
> > Static C data has the advantage of being shareable at
> > the OS level.
> 
> Don't worry about it.  128K is too small to care, I think...

This is the reason Python starts up so slow and has a large memory
footprint. There hasn't been any concern for moving stuff into shared data
pages. As a result, a process must map in a bunch of vmem pages, for no
other reason than to allocate Python structures in that memory and copy
constants in.

Go start Perl 100 times, then do the same with Python. Python is
significantly slower. I've actually written a web app in PHP because
another one that I did in Python had slow response time.
[ yah: the Real Man Answer is to write a real/good mod_python. ]

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From captainrobbo at yahoo.com  Tue Nov 16 13:18:19 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 16 Nov 1999 04:18:19 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
Message-ID: <19991116121819.21509.rocketmail@web606.mail.yahoo.com>


--- "M.-A. Lemburg"  wrote:
> So I can drop JIS ? [I won't be able to drop the
> escaped unicode
> codec because this is needed for u"" and ur"".]

Drop Japanese from the core language.  

JIS0208 is a big character set with three popular
encodings (Shift-JIS, EUC-JP and JIS), and a host of
slight variations; it has 6879 characters, and there
are a range of options a user might need to set for it
to be useful.  So let's assume for now this a separate
package.  There's a good chance I'll do it but it is
not a small job.  If you start statically linking in
tables of 7000 characters for one Asian language,
you'll have to do the lot.

As for the single-byte Latin ones, a prototype Python
module could be whipped up in a couple of evenings,
and a tiny C function which does single-byte to
double-byte mappings and vice versa could make it
fast.  We can have an extensible, data driven solution
in no time without having to build it into the core.
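Sketched in Python (illustration only -- the mapping table is where the
real work is, and the function name is made up):

    import struct, string

    def decode_single_byte(s, table):
        # table maps byte values 0-255 to Unicode ordinals
        l = []
        for ch in s:
            l.append(struct.pack('>H', table[ord(ch)]))
        return string.join(l, '')

The C version is the same loop over an array of unsigned shorts.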

The way I see it, to claim that python has i18n, a
serious effort is needed to ensure every major
encoding in the world is available to Python users.  
But that's separate from the core language.  Your spec
should only cover what is going to be hard-coded into
Python.  

I'd like to see one paragraph in your spec stating
that our architecture seperates the encodings
themselves from the core language changes, and that
getting them sorted is a logically separate (but
important) project.  Ideally, we could put together a
separate proposal for the encoding library itself and
run it by some world class experts in that field, but
after yours is done.


- Andy

 



=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From guido at CNRI.Reston.VA.US  Tue Nov 16 14:28:42 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 08:28:42 -0500
Subject: [Python-Dev] Unicode proposal: %-formatting ?
In-Reply-To: Your message of "Tue, 16 Nov 1999 11:40:42 +0100."
             <383134AA.4B49D178@lemburg.com> 
References: <000001bf2ff4$d36e2540$042d153f@tim>  
            <383134AA.4B49D178@lemburg.com> 
Message-ID: <199911161328.IAA29042@eric.cnri.reston.va.us>

> ... hmm, there is a problem there: how should the PyUnicode_Format()
> API deal with '%s' when it sees a Unicode object as argument ?
> 
> E.g. what would you get in these cases:
> 
> u = u"%s %s" % (u"abc", "abc")


From guido at CNRI.Reston.VA.US  Tue Nov 16 14:45:17 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 08:45:17 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Tue, 16 Nov 1999 04:04:47 PST."
              
References:  
Message-ID: <199911161345.IAA29064@eric.cnri.reston.va.us>

> > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
> > "8-bit clean" as you obviously did.
> 
> Hrm. That might be dangerous. Many of the functions that use "t#" assume
> that each character is 8-bits long. i.e. the returned length == the number
> of characters.
> 
> I'm not sure what the implications would be if you interpret the semantics
> of "t#" as multi-byte characters.

Hrm.  Can you quote examples of users of t# who would be confused by
multibyte characters?  I guess that there are quite a few places where
they will be considered illegal, but that's okay -- the string will be
parsed at some point and rejected, e.g. as an illegal filename,
hostname or whatever.  On the other hand, there are quite some places
where I would think that multibyte characters would do just the right
thing.  Many places using t# could just as well be using 's' except
they need to know the length and they don't want to call strlen().
In all cases I've looked at, the reason they need the length is that
they are allocating a buffer (or checking whether it fits in a
statically allocated buffer) -- and there the number of bytes in a
multibyte string is just fine.

Note that I take the same stance on 's' -- it should return multibyte
characters.

> > What for ?
> 
> How about: "because I'm the application developer, and I say that I want
> the raw bytes in the file."

Here I'm with you, man!

> Greg Stein, http://www.lyra.org/

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gward at cnri.reston.va.us  Tue Nov 16 15:10:33 1999
From: gward at cnri.reston.va.us (Greg Ward)
Date: Tue, 16 Nov 1999 09:10:33 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: ; from gstein@lyra.org on Tue, Nov 16, 1999 at 04:09:17AM -0800
References: <199911152137.QAA28280@eric.cnri.reston.va.us> 
Message-ID: <19991116091032.A4063@cnri.reston.va.us>

On 16 November 1999, Greg Stein said:
> This is the reason Python starts up so slow and has a large memory
> footprint. There hasn't been any concern for moving stuff into shared data
> pages. As a result, a process must map in a bunch of vmem pages, for no
> other reason than to allocate Python structures in that memory and copy
> constants in.
> 
> Go start Perl 100 times, then do the same with Python. Python is
> significantly slower. I've actually written a web app in PHP because
> another one that I did in Python had slow response time.
> [ yah: the Real Man Answer is to write a real/good mod_python. ]

I don't think this is the only factor in startup overhead.  Try looking
into the number of system calls for the trivial startup case of each
interpreter:

  $ truss perl -e 1 2> perl.log 
  $ truss python -c 1 2> python.log

(This is on Solaris; I did the same thing on Linux with "strace", and on
IRIX with "par -s -SS".  Dunno about other Unices.)  The results are
interesting, and useful despite the platform and version disparities.

(For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on
Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX.  The Solaris is 2.6,
using the Official CNRI Python Build by Barry, and the ditto Perl build
by me; the Linux system is starship, using whatever Perl and Python the
Starship Masters provide us with; the IRIX box is an elderly but
well-maintained SGI Challenge running IRIX 5.3.)

Also, this is with an empty PYTHONPATH.  The Solaris build of Python has
different prefix and exec_prefix, but on the Linux and IRIX builds, they
are the same.  (I think this will reflect poorly on the Solaris
version.)  PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect
startup of the trivial "1" script, so I haven't paid attention to them.

First, the size of log files (in lines), i.e. number of system calls:

               Solaris     Linux    IRIX[1]
  Perl              88        85      70
  Python           425       316     257

[1] after chopping off the summary counts from the "par" output -- ie.
    these really are the number of system calls, not the number of
    lines in the log files

Next, the number of "open" calls:

               Solaris     Linux    IRIX
  Perl             16         10       9
  Python          107         71      48

(It looks as though *all* of the Perl 'open' calls are due to the
dynamic linker going through /usr/lib and/or /lib.)

And the number of unsuccessful "open" calls:

               Solaris     Linux    IRIX
  Perl              6          1       3
  Python           77         49      32

Number of "mmap" calls:

               Solaris     Linux    IRIX
  Perl              25        25       1
  Python            36        24       1

...nope, guess we can't blame mmap for any Perl/Python startup
disparity.

How about "brk":

               Solaris     Linux    IRIX
  Perl               6        11      12
  Python            47        39      25

...ok, looks like Greg's gripe about memory holds some water.

Rerunning "truss" on Solaris with "python -S -c 1" drastically reduces
the startup overhead as measured by "number of system calls".  Some
quick timing experiments show a drastic speedup (in wall-clock time) by
adding "-S": about 37% faster under Solaris, 56% faster under Linux, and
35% under IRIX.  These figures should be taken with a large grain of
salt, as the Linux and IRIX systems were fairly well loaded at the time,
and the wall-clock results I measured had huge variance.  Still, it gets
the point across.

Oh, also for the record, all timings were done like:

   perl -e 'for $i (1 .. 100) { system "python", "-S", "-c", "1"; }'

because I wanted to guarantee no shell was involved in the Python
startup.

        Greg
-- 
Greg Ward - software developer                    gward at cnri.reston.va.us
Corporation for National Research Initiatives    
1895 Preston White Drive                           voice: +1-703-620-8990
Reston, Virginia, USA  20191-5434                    fax: +1-703-620-0913



From mal at lemburg.com  Tue Nov 16 12:33:07 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 12:33:07 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991116110555.8B43335BB1E@snelboot.oratrix.nl>
Message-ID: <383140F3.EDDB307A@lemburg.com>

Jack Jansen wrote:
> 
> > I would propose to only add some very basic encodings to
> > the standard distribution, e.g. the ones mentioned under
> > Standard Codecs in the proposal:
> >
> >   'utf-8':            8-bit variable length encoding
> >   'utf-16':           16-bit variable length encoding (little/big endian)
> >   'utf-16-le':                utf-16 but explicitly little endian
> >   'utf-16-be':                utf-16 but explicitly big endian
> >   'ascii':            7-bit ASCII codepage
> >   'latin-1':          Latin-1 codepage
> >   'html-entities':    Latin-1 + HTML entities;
> >                       see htmlentitydefs.py from the standard Python Lib
> >   'jis' (a popular version XXX):
> >                       Japanese character encoding
> >   'unicode-escape':   See Unicode Constructors for a definition
> >   'native':           Dump of the Internal Format used by Python
> 
> I would suggest adding the DOS, Windows and Macintosh standard 8-bit charsets
> (their equivalents of latin-1) too, as documents in these encodings are pretty
> ubiquitous. But maybe these should only be added on the respective platforms.

Good idea. Which code pages would those be?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 15:13:25 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 15:13:25 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.6
References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com>
Message-ID: <38316685.7977448D@lemburg.com>

FYI, I've uploaded a new version of the proposal which incorporates
many things we have discussed lately, e.g. the buffer interface,
"s#" vs. "t#", etc.

The latest version of the proposal is available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

    · Unicode objects support for %-formatting

    · specifying StreamCodecs

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Tue Nov 16 13:54:51 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 13:54:51 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com> <38305D17.60EC94D0@lemburg.com> <024b01bf3027$1cff1480$f29b12c2@secret.pythonware.com>
Message-ID: <3831541B.B242FFA9@lemburg.com>

Fredrik Lundh wrote:
> 
> > I would propose to only add some very basic encodings to
> > the standard distribution, e.g. the ones mentioned under
> > Standard Codecs in the proposal:
> >
> >   'utf-8': 8-bit variable length encoding
> >   'utf-16': 16-bit variable length encoding (little/big endian)
> >   'utf-16-le': utf-16 but explicitly little endian
> >   'utf-16-be': utf-16 but explicitly big endian
> >   'ascii': 7-bit ASCII codepage
> >   'latin-1': Latin-1 codepage
> >   'html-entities': Latin-1 + HTML entities;
> > see htmlentitydefs.py from the standard Python Lib
> >   'jis' (a popular version XXX):
> > Japanese character encoding
> >   'unicode-escape': See Unicode Constructors for a definition
> >   'native': Dump of the Internal Format used by Python
> 
> since this is already very close, maybe we could adopt
> the naming guidelines from XML:
> 
>     In an encoding declaration, the values "UTF-8", "UTF-16",
>     "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used
>     for the various encodings and transformations of
>     Unicode/ISO/IEC 10646, the values "ISO-8859-1",
>     "ISO-8859-2", ... "ISO-8859-9" should be used for the parts
>     of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS",
>     and "EUC-JP" should be used for the various encoded
>     forms of JIS X-0208-1997.
> 
>     XML processors may recognize other encodings; it is
>     recommended that character encodings registered
>     (as charsets) with the Internet Assigned Numbers
>     Authority [IANA], other than those just listed,
>     should be referred to using their registered names.
> 
>     Note that these registered names are defined to be
>     case-insensitive, so processors wishing to match
>     against them should do so in a case-insensitive way.
> 
> (ie "iso-8859-1" instead of "latin-1", etc -- at least as
> aliases...).

From the proposal:
"""
General Remarks:
----------------

* Unicode encoding names should be lower case on output and
  case-insensitive on input (they will be converted to lower case
  by all APIs taking an encoding name as input).

  Encoding names should follow the name conventions as used by the
  Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is
  written as 'utf-16'.
"""

Is there a naming scheme definition for these encoding names?
(The quote you gave above doesn't really sound like a definition
to me.)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 14:15:19 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 14:15:19 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <19991116121819.21509.rocketmail@web606.mail.yahoo.com>
Message-ID: <383158E7.BC574A1F@lemburg.com>

Andy Robinson wrote:
> 
> --- "M.-A. Lemburg"  wrote:
> > So I can drop JIS ? [I won't be able to drop the
> > escaped unicode
> > codec because this is needed for u"" and ur"".]
> 
> Drop Japanese from the core language.

Done ... that one was easy ;-)
 
> JIS0208 is a big character set with three popular
> encodings (Shift-JIS, EUC-JP and JIS), and a host of
> slight variations; it has 6879 characters, and there
> are a range of options a user might need to set for it
> to be useful.  So let's assume for now this a separate
> package.  There's a good chance I'll do it but it is
> not a small job.  If you start statically linking in
> tables of 7000 characters for one Asian language,
> you'll have to do the lot.
> 
> As for the single-byte Latin ones, a prototype Python
> module could be whipped up in a couple of evenings,
> and a tiny C function which does single-byte to
> double-byte mappings and vice versa could make it
> fast.  We can have an extensible, data driven solution
> in no time without having to build it into the core.

Perhaps these helper functions could be integrated into
the core to avoid compilation when adding a new codec.

> The way I see it, to claim that python has i18n, a
> serious effort is needed to ensure every major
> encoding in the world is available to Python users.
> But that's separate to the core languages.  Your spec
> should only cover what is going to be hard-coded into
> Python.

Right.
 
> I'd like to see one paragraph in your spec stating
> that our architecture separates the encodings
> themselves from the core language changes, and that
> getting them sorted is a logically separate (but
> important) project.  Ideally, we could put together a
> separate proposal for the encoding library itself and
> run it by some world class experts in that field, but
> after yours is done.

I've added:
All other encodings such as the CJK ones to support Asian scripts
should be implemented in separate packages which do not get included
in the core Python distribution and are not a part of this proposal.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 14:06:39 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 14:06:39 +0100
Subject: [Python-Dev] just say no...
References: 
Message-ID: <383156DF.2209053F@lemburg.com>

Greg Stein wrote:
> 
> On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
> > Guido van Rossum wrote:
> >...
> > > t# refers to byte-encoded data.  Multibyte encodings are explicitly
> > > designed to be passed cleanly through processing steps that handle
> > > single-byte character data, as long as they are 8-bit clean and don't
> > > do too much processing.
> >
> > Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
> > "8-bit clean" as you obviously did.
> 
> Hrm. That might be dangerous. Many of the functions that use "t#" assume
> that each character is 8-bits long. i.e. the returned length == the number
> of characters.
> 
> I'm not sure what the implications would be if you interpret the semantics
> of "t#" as multi-byte characters.

FYI, the next version of the proposal now says "s#" gives you
UTF-16 and "t#" returns UTF-8. File objects opened in text mode
will use "t#" and binary ones use "s#".

I'll just use explicit u.encode('utf-8') calls if I want to write
UTF-8 to binary files -- perhaps everyone else should too ;-)
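
(As a minimal usage sketch, assuming the proposed Unicode object API --
'u' stands for any Unicode object:)

    # write UTF-8 encoded text to a file opened in binary mode
    f = open('data.bin', 'wb')
    f.write(u.encode('utf-8'))
    f.close()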

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From akuchlin at mems-exchange.org  Tue Nov 16 15:35:39 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 16 Nov 1999 09:35:39 -0500 (EST)
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <19991116091032.A4063@cnri.reston.va.us>
References: <199911152137.QAA28280@eric.cnri.reston.va.us>
	
	<19991116091032.A4063@cnri.reston.va.us>
Message-ID: <14385.27579.292173.433577@amarok.cnri.reston.va.us>

Greg Ward writes:
>Next, the number of "open" calls:
>               Solaris     Linux    IRIX
>  Perl             16         10       9
>  Python          107         71      48

Running 'python -v' explains this:

amarok akuchlin>python -v
# /usr/local/lib/python1.5/exceptions.pyc matches /usr/local/lib/python1.5/exceptions.py
import exceptions # precompiled from /usr/local/lib/python1.5/exceptions.pyc
# /usr/local/lib/python1.5/site.pyc matches /usr/local/lib/python1.5/site.py
import site # precompiled from /usr/local/lib/python1.5/site.pyc
# /usr/local/lib/python1.5/os.pyc matches /usr/local/lib/python1.5/os.py
import os # precompiled from /usr/local/lib/python1.5/os.pyc
import posix # builtin
# /usr/local/lib/python1.5/posixpath.pyc matches /usr/local/lib/python1.5/posixpath.py
import posixpath # precompiled from /usr/local/lib/python1.5/posixpath.pyc
# /usr/local/lib/python1.5/stat.pyc matches /usr/local/lib/python1.5/stat.py
import stat # precompiled from /usr/local/lib/python1.5/stat.pyc
# /usr/local/lib/python1.5/UserDict.pyc matches /usr/local/lib/python1.5/UserDict.py
import UserDict # precompiled from /usr/local/lib/python1.5/UserDict.pyc
Python 1.5.2 (#80, May 25 1999, 18:06:07)  [GCC 2.8.1] on sunos5
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
import readline # dynamically loaded from /usr/local/lib/python1.5/lib-dynload/readline.so

And each import tries several different forms of the module name:

stat("/usr/local/lib/python1.5/os", 0xEFFFD5E0) Err#2 ENOENT
open("/usr/local/lib/python1.5/os.so", O_RDONLY) Err#2 ENOENT
open("/usr/local/lib/python1.5/osmodule.so", O_RDONLY) Err#2 ENOENT
open("/usr/local/lib/python1.5/os.py", O_RDONLY) = 4

I don't see how this is fixable, unless we strip down site.py, which
drags in os, which drags in os.path and stat and UserDict. 

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
I'm going stir-crazy, and I've joined the ranks of the walking brain-dead, but
otherwise I'm just peachy.
    -- Lyta Hall on parenthood, in SANDMAN #40: "Parliament of Rooks"




From guido at CNRI.Reston.VA.US  Tue Nov 16 15:43:07 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 09:43:07 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Tue, 16 Nov 1999 14:06:39 +0100."
             <383156DF.2209053F@lemburg.com> 
References:   
            <383156DF.2209053F@lemburg.com> 
Message-ID: <199911161443.JAA29149@eric.cnri.reston.va.us>

> FYI, the next version of the proposal now says "s#" gives you
> UTF-16 and "t#" returns UTF-8. File objects opened in text mode
> will use "t#" and binary ones use "s#".

Good.

> I'll just use explicit u.encode('utf-8') calls if I want to write
> UTF-8 to binary files -- perhaps everyone else should too ;-)

You could write UTF-8 to files opened in text mode too; at least most
actual systems will leave the UTF-8 escapes alone and just do LF ->
CRLF translation, which should be fine.

--Guido van Rossum (home page: http://www.python.org/~guido/)




From fdrake at acm.org  Tue Nov 16 15:50:55 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 09:50:55 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: <000901bf2ffc$3d4bb8e0$042d153f@tim>
References: <14380.16437.71847.832880@weyr.cnri.reston.va.us>
	<000901bf2ffc$3d4bb8e0$042d153f@tim>
Message-ID: <14385.28495.685427.598748@weyr.cnri.reston.va.us>

Tim Peters writes:
 > Yet another use for a weak reference <0.5 wink>.

  Those just keep popping up!  I seem to recall Diane Hackborne
actually implemented these under the name "vref" long ago; perhaps
that's worth revisiting after all?  (Not the implementation so much as 
the idea.)  I think to make it general would cost one PyObject* in
each object's structure, and some code in some constructors (maybe),
and all destructors, but not much.
  Is this worth pursuing, or is it locked out of the core because of
the added space for the PyObject*?  (Note that the concept isn't
necessarily useful for all object types -- numbers in particular --
but it only makes sense to bother if it works for everything, even if
it's not very useful in some cases.)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From fdrake at acm.org  Tue Nov 16 16:12:43 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 10:12:43 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: 
References: <3830595B.348E8CC7@lemburg.com>
	
Message-ID: <14385.29803.459364.456840@weyr.cnri.reston.va.us>

Greg Stein writes:
 > [ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]

  And the sooner I receive them, the sooner they can be integrated!
Any plans to get them to me?  I'll probably want to do another release 
before the IPC8.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From mal at lemburg.com  Tue Nov 16 15:36:54 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 15:36:54 +0100
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
References: <199911152137.QAA28280@eric.cnri.reston.va.us>  <19991116091032.A4063@cnri.reston.va.us>
Message-ID: <38316C06.8B0E1D7B@lemburg.com>

Greg Ward wrote:
> 
> > Go start Perl 100 times, then do the same with Python. Python is
> > significantly slower. I've actually written a web app in PHP because
> > another one that I did in Python had slow response time.
> > [ yah: the Real Man Answer is to write a real/good mod_python. ]
> 
> I don't think this is the only factor in startup overhead.  Try looking
> into the number of system calls for the trivial startup case of each
> interpreter:
> 
>   $ truss perl -e 1 2> perl.log
>   $ truss python -c 1 2> python.log
> 
> (This is on Solaris; I did the same thing on Linux with "strace", and on
> IRIX with "par -s -SS".  Dunno about other Unices.)  The results are
> interesting, and useful despite the platform and version disparities.
> 
> (For the record: Python 1.5.2 on all three platforms; Perl 5.005_03 on
> Solaris, 5.004_05 on Linux, and 5.004_04 on IRIX.  The Solaris is 2.6,
> using the Official CNRI Python Build by Barry, and the ditto Perl build
> by me; the Linux system is starship, using whatever Perl and Python the
> Starship Masters provide us with; the IRIX box is an elderly but
> well-maintained SGI Challenge running IRIX 5.3.)
> 
> Also, this is with an empty PYTHONPATH.  The Solaris build of Python has
> different prefix and exec_prefix, but on the Linux and IRIX builds, they
> are the same.  (I think this will reflect poorly on the Solaris
> version.)  PERLLIB, PERL5LIB, and Perl's builtin @INC should not affect
> startup of the trivial "1" script, so I haven't paid attention to them.

For kicks I've done a similar test with cgipython, the 
one file version of Python 1.5.2:
 
> First, the size of log files (in lines), i.e. number of system calls:
> 
>                Solaris     Linux    IRIX[1]
>   Perl              88        85      70
>   Python           425       316     257

    cgipython                  182 
 
> [1] after chopping off the summary counts from the "par" output -- ie.
>     these really are the number of system calls, not the number of
>     lines in the log files
> 
> Next, the number of "open" calls:
> 
>                Solaris     Linux    IRIX
>   Perl             16         10       9
>   Python          107         71      48

    cgipython                   33 

> (It looks as though *all* of the Perl 'open' calls are due to the
> dynamic linker going through /usr/lib and/or /lib.)
> 
> And the number of unsuccessful "open" calls:
> 
>                Solaris     Linux    IRIX
>   Perl              6          1       3
>   Python           77         49      32

    cgipython                   28

Note that cgipython does search for sitecustomize.py.

> 
> Number of "mmap" calls:
> 
>                Solaris     Linux    IRIX
>   Perl              25        25       1
>   Python            36        24       1

    cgipython                   13

> 
> ...nope, guess we can't blame mmap for any Perl/Python startup
> disparity.
> 
> How about "brk":
> 
>                Solaris     Linux    IRIX
>   Perl               6        11      12
>   Python            47        39      25

    cgipython                   41 (?)

So at least in theory, using cgipython for the intended
purpose should gain some performance.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Tue Nov 16 17:00:58 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 17:00:58 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
Message-ID: <38317FBA.4F3D6B1F@lemburg.com>

Here is a new proposal for the codec interface:

class Codec:

    def encode(self,u,slice=None):
	
	""" Return the Unicode object u encoded as Python string.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is encoded.

	    The method may not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	"""
	...

    def decode(self,s,slice=None):

	""" Return an equivalent Unicode object for the encoded Python
	    string s.

	    If slice is given (as slice object), only the sliced part
	    of the Python string is decoded and returned as Unicode
	    object.  Note that this can cause the decoding algorithm
	    to fail due to truncations in the encoding.

	    The method may not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	""" 
	...
	

class StreamCodec(Codec):

    def __init__(self,stream=None,errors='strict'):

	""" Creates a StreamCodec instance.

	    stream must be a file-like object open for reading and/or
	    writing binary data depending on the intended codec
            action or None.

	    The StreamCodec may implement different error handling
	    schemes by providing the errors argument. These parameters
	    are known (they need not all be supported by StreamCodec
            subclasses): 

	     'strict'             - raise a UnicodeError (or a subclass)
	     'ignore'             - ignore the character and continue with the next
	     (a single character) - replace erroneous characters with the given
	                            character (may also be a Unicode character)

	"""
	self.stream = stream

    def write(self,u,slice=None):

	""" Writes the Unicode object's contents encoded to self.stream.

	    stream must be a file-like object open for writing binary
	    data.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is written.

        """
	... the base class should provide a default implementation
	    of this method using self.encode ...
	
    def read(self,length=None):

	""" Reads an encoded string from the stream and returns
	    an equivalent Unicode object.

	    If length is given, only length Unicode characters are
	    returned (the StreamCodec instance reads as many raw bytes
            as needed to fulfill this requirement). Otherwise, all
	    available data is read and decoded.

        """
	... the base class should provide a default implementation
	    of this method using self.decode ...


It is not required by the unicodec.register() API to provide a
subclass of these base classes, only the given methods must be present;
this allows writing Codecs as extensions types.  All Codecs must
provide the .encode()/.decode() methods. Codecs having the .read()
and/or .write() methods are considered to be StreamCodecs.

The Unicode implementation will by itself only use the
stateless .encode() and .decode() methods.

All other conversions have to be done by explicitly instantiating
the appropriate [Stream]Codec.
--

Feel free to beat on this one ;-)
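
(To make the interface concrete, here is a rough sketch of a trivial,
stateless codec following the proposed signatures; the hex format is
made up purely for illustration and error handling is simplified:)

    import string

    class HexCodec:

        def encode(self, u, slice=None):
            # each character becomes four hex digits
            if slice is not None:
                u = u[slice.start:slice.stop]
            chunks = []
            for ch in u:
                chunks.append('%04x' % ord(ch))
            return string.join(chunks, '')

        def decode(self, s, slice=None):
            if slice is not None:
                s = s[slice.start:slice.stop]
            if len(s) % 4:
                raise ValueError('truncated input')
            chars = []
            for i in range(0, len(s), 4):
                # chr() stands in for whatever the Unicode character
                # constructor ends up being called
                chars.append(chr(string.atoi(s[i:i+4], 16)))
            return string.join(chars, '')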

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Tue Nov 16 17:08:49 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 17:08:49 +0100
Subject: [Python-Dev] just say no...
References: <14380.16437.71847.832880@weyr.cnri.reston.va.us>
		<000901bf2ffc$3d4bb8e0$042d153f@tim> <14385.28495.685427.598748@weyr.cnri.reston.va.us>
Message-ID: <38318191.11D93903@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> Tim Peters writes:
>  > Yet another use for a weak reference <0.5 wink>.
> 
>   Those just keep popping up!  I seem to recall Diane Hackborne
> actually implemented these under the name "vref" long ago; perhaps
> that's worth revisiting after all?  (Not the implementation so much as
> the idea.)  I think to make it general would cost one PyObject* in
> each object's structure, and some code in some constructors (maybe),
> and all destructors, but not much.
>   Is this worth pursuing, or is it locked out of the core because of
> the added space for the PyObject*?  (Note that the concept isn't
> necessarily useful for all object types -- numbers in particular --
> but it only makes sense to bother if it works for everything, even if
> it's not very useful in some cases.)

FYI, there's mxProxy which implements a flavor of them. Look
in the standard places for mx stuff ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fdrake at acm.org  Tue Nov 16 17:14:06 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 11:14:06 -0500 (EST)
Subject: [Python-Dev] just say no...
In-Reply-To: <38318191.11D93903@lemburg.com>
References: <14380.16437.71847.832880@weyr.cnri.reston.va.us>
	<000901bf2ffc$3d4bb8e0$042d153f@tim>
	<14385.28495.685427.598748@weyr.cnri.reston.va.us>
	<38318191.11D93903@lemburg.com>
Message-ID: <14385.33486.855802.187739@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > FYI, there's mxProxy which implements a flavor of them. Look
 > in the standard places for mx stuff ;-)

  Yes, but still not in the core.  So we have two general examples
(vrefs and mxProxy) and there's WeakDict (or something like that).  I
think there really needs to be a core facility for this.  There are a
lot of users (including myself) who think that things are far less
useful if they're not in the core.  (No, I'm not saying that
everything should be in the core, or even that it needs a lot more
stuff.  I just don't want to be writing code that requires a lot of
separate packages to be installed.  At least not until we can tell an
installation tool to "install this and everything it depends on." ;)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From bwarsaw at cnri.reston.va.us  Tue Nov 16 17:14:55 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Tue, 16 Nov 1999 11:14:55 -0500 (EST)
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
References: <199911152137.QAA28280@eric.cnri.reston.va.us>
	
	<19991116091032.A4063@cnri.reston.va.us>
	<14385.27579.292173.433577@amarok.cnri.reston.va.us>
Message-ID: <14385.33535.23316.286575@anthem.cnri.reston.va.us>

>>>>> "AMK" == Andrew M Kuchling  writes:

    AMK> I don't see how this is fixable, unless we strip down
    AMK> site.py, which drags in os, which drags in os.path and stat
    AMK> and UserDict.

One approach might be to support loading modules out of jar files (or
whatever) using Greg imputils.  We could put the bootstrap .pyc files
in this jar and teach Python to import from it first.  Python
installations could even craft their own modules.jar file to include
whatever modules they are willing to "hard code".  This, with -S might
make Python start up much faster, at the small cost of some
flexibility (which could be regained with a c.l. switch or other
mechanism to bypass modules.jar).
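
(A rough sketch of the idea, with the imputils details left out; the
archive is assumed to map module names to marshalled code objects, e.g.
read once at startup from a single modules.jar-style file:)

    import sys, imp, marshal

    def import_bootstrap(name, archive):
        # satisfy an import from the in-memory archive instead of the
        # file system
        if sys.modules.has_key(name):
            return sys.modules[name]
        code = marshal.loads(archive[name])
        module = imp.new_module(name)
        sys.modules[name] = module
        exec code in module.__dict__
        return module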

-Barry



From guido at CNRI.Reston.VA.US  Tue Nov 16 17:20:28 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 11:20:28 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Tue, 16 Nov 1999 17:00:58 +0100."
             <38317FBA.4F3D6B1F@lemburg.com> 
References: <38317FBA.4F3D6B1F@lemburg.com> 
Message-ID: <199911161620.LAA02643@eric.cnri.reston.va.us>

> It is not required by the unicodec.register() API to provide a
> subclass of these base classes, only the given methods must be present;
> this allows writing Codecs as extensions types.  All Codecs must
> provide the .encode()/.decode() methods. Codecs having the .read()
> and/or .write() methods are considered to be StreamCodecs.
> 
> The Unicode implementation will by itself only use the
> stateless .encode() and .decode() methods.
> 
> All other conversions have to be done by explicitly instantiating
> the appropriate [Stream]Codec.

Looks okay, although I'd like someone to implement a simple
shift-state-based stream codec to check this out further.

I have some questions about the constructor.  You seem to imply
that instantiating the class without arguments creates a codec without
state.  That's fine.  When given a stream argument, shouldn't the
direction of the stream be given as an additional argument, so the
proper state for encoding or decoding can be set up?  I can see that
for an implementation it might be more convenient to have separate
classes for encoders and decoders -- certainly the state being kept is
very different.

Also, I don't want to ignore the alternative interface that was
suggested by /F.  It uses feed() similar to htmllib c.s.  This has
some advantages (although we might want to define some compatibility
so it can also feed directly into a file).

Perhaps someone should go ahead and implement prototype codecs using
either paradigm and then write some simple apps, so we can make a
better decision.

In any case I think the specs for the codec registry API aren't on the
critical path; integration of /F's basic unicode object is the first
thing we need.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Tue Nov 16 17:27:53 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 11:27:53 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: Your message of "Tue, 16 Nov 1999 11:14:55 EST."
             <14385.33535.23316.286575@anthem.cnri.reston.va.us> 
References: <199911152137.QAA28280@eric.cnri.reston.va.us>  <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us>  
            <14385.33535.23316.286575@anthem.cnri.reston.va.us> 
Message-ID: <199911161627.LAA02665@eric.cnri.reston.va.us>

> >>>>> "AMK" == Andrew M Kuchling  writes:
> 
>     AMK> I don't see how this is fixable, unless we strip down
>     AMK> site.py, which drags in os, which drags in os.path and stat
>     AMK> and UserDict.
> 
> One approach might be to support loading modules out of jar files (or
> whatever) using Greg imputils.  We could put the bootstrap .pyc files
> in this jar and teach Python to import from it first.  Python
> installations could even craft their own modules.jar file to include
> whatever modules they are willing to "hard code".  This, with -S might
> make Python start up much faster, at the small cost of some
> flexibility (which could be regained with a c.l. switch or other
> mechanism to bypass modules.jar).

A completely different approach (which, incidentally, HP has lobbied
for before; and which has been implemented by Sjoerd Mullender for one
particular application) would be to cache a mapping from module names
to filenames in a dbm file.  For Sjoerd's app (which imported hundreds
of modules) this made a huge difference.  The problem is that it's
hard to deal with issues like updating the cache while sharing it with
other processes and even other users...  But if those can be solved,
this could greatly reduce the number of stats and unsuccessful opens,
without having to resort to jar files.
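
(A rough sketch of the dbm-cache idea; the locking and cache-invalidation
issues mentioned above are simply ignored here, and all names are made up:)

    import anydbm, os

    def find_module_cached(name, path, cachefile='.modulecache'):
        # map module names to filenames in a dbm file to avoid repeated
        # stat()/open() attempts along the whole path
        cache = anydbm.open(cachefile, 'c')
        try:
            if cache.has_key(name) and os.path.exists(cache[name]):
                return cache[name]
            for dir in path:
                for suffix in ('.py', '.pyc'):
                    filename = os.path.join(dir, name + suffix)
                    if os.path.exists(filename):
                        cache[name] = filename
                        return filename
            return None
        finally:
            cache.close()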

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gmcm at hypernet.com  Tue Nov 16 17:56:19 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Tue, 16 Nov 1999 11:56:19 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <14385.33535.23316.286575@anthem.cnri.reston.va.us>
Message-ID: <1269351119-9152905@hypernet.com>

Barry A. Warsaw writes:

> One approach might be to support loading modules out of jar files
> (or whatever) using Greg imputils.  We could put the bootstrap
> .pyc files in this jar and teach Python to import from it first. 
> Python installations could even craft their own modules.jar file
> to include whatever modules they are willing to "hard code". 
> This, with -S might make Python start up much faster, at the
> small cost of some flexibility (which could be regained with a
> c.l. switch or other mechanism to bypass modules.jar).

Couple hundred Windows users have been doing this for 
months (http://starship.python.net/crew/gmcm/install.html). 
The .pyz files are cross-platform, although the "embedding" 
app would have to be redone for *nix, (and all the embedding 
really does is keep Python from hunting all over your disk). 
Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a 
diskette with a little room left over.

but-since-its-WIndows-it-must-be-tainted-ly y'rs


- Gordon



From guido at CNRI.Reston.VA.US  Tue Nov 16 18:00:15 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 12:00:15 -0500
Subject: [Python-Dev] Python 1.6 status
Message-ID: <199911161700.MAA02716@eric.cnri.reston.va.us>

Greg Stein recently reminded me that he was holding off on 1.6 patches
because he was under the impression that I wasn't accepting them yet.

The situation is rather more complicated than that.  There are a great
deal of things that need to be done, and for many of them I'd be most
happy to receive patches!  For other things, however, I'm still in the
requirements analysis phase, and patches might be premature (e.g., I
want to redesign the import mechanisms, and while I like some of the
prototypes that have been posted, I'm not ready to commit to any
specific implementation).

How do you know for which things I'm ready for patches?  Ask me.  I've
tried to make lists before, and there are probably some hints in the
TODO FAQ wizard as well as in the "requests" section of the Python
Bugs List.

Greg also suggested that I might receive more patches if I opened up
the CVS tree for checkins by certain valued contributors.  On the one
hand I'm reluctant to do that (I feel I have a pretty good track
record of checking in patches that are mailed to me, assuming I agree
with them) but on the other hand there might be something to say for
this, because it gives contributors more of a sense of belonging to
the inner core.  Of course, checkin privileges don't mean you can
check in anything you like -- as in the Apache world, changes must be
discussed and approved by the group, and I would like to have a veto.
However once a change is approved, it's much easier if the contributor
can check the code in without having to go through me all the time.

A drawback may be that some people will make very forceful requests to
be given checkin privileges, only to never use them; just like there
are some members of python-dev who have never contributed.  I
definitely want to limit the number of privileged contributors to a
very small number (e.g. 10-15).

One additional detail is the legal side -- contributors will have to
sign some kind of legal document similar to the current (wetsign.html)
release form, but guiding all future contributions.  I'll have to
discuss this with CNRI's legal team.

Greg, I understand you have checkin privileges for Apache.  What is
the procedure there for handing out those privileges?  What is the
procedure for using them?  (E.g. if you made a bogus change to part of
Apache you're not supposed to work on, what happens?)

I'm hoping for several kind of responses to this email:

- uncontroversial patches

- questions about whether specific issues are sufficiently settled to
start coding a patch

- discussion threads opening up some issues that haven't been settled
yet (like the current, very productive, thread in i18n)

- posts summarizing issues that were settled long ago in the past,
requesting reverification that the issue is still settled

- suggestions for new issues that maybe ought to be settled in 1.6

- requests for checkin privileges, preferably with a specific issue or
area of expertise for which the requestor will take responsibility

--Guido van Rossum (home page: http://www.python.org/~guido/)



From akuchlin at mems-exchange.org  Tue Nov 16 18:11:48 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 16 Nov 1999 12:11:48 -0500 (EST)
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us>
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: <14385.36948.610106.195971@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>I'm hoping for several kind of responses to this email:

My list of things to do for 1.6 is:

   * Translate re.py to C and switch to the latest PCRE 2 codebase
(mostly done, perhaps ready for public review in a week or so).

   * Go through the O'Reilly POSIX book and draw up a list of missing
POSIX functions that aren't available in the posix module.  This
was sparked by Greg Ward showing me a Perl daemonize() function
he'd written, and I realized that some of the functions it used
weren't available in Python at all.  (setsid() was one of them, I
think.)

   * A while back I got approval to add the mmapfile module to the
core.  The outstanding issue there is that the constructor has a
different interface on Unix and Windows platforms.

On Windows:
mm = mmapfile.mmapfile("filename", "tag name", )

On Unix, it looks like the mmap() function:

mm = mmapfile.mmapfile(, , 
                        (like MAP_SHARED),
		        (like PROT_READ, PROT_READWRITE) 
                      )

Can we reconcile these interfaces, have two different function names,
or what?
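
(One way to paper this over would be a small wrapper module along these
lines; everything below is hypothetical and just restates the two
constructors sketched above, with the flags/prot constants supplied by
the caller on Unix:)

    import sys, os

    def open_mapping(filename, size, tagname='', flags=None, prot=None):
        # hide the platform-specific mmapfile constructor behind one call
        import mmapfile
        if sys.platform == 'win32':
            return mmapfile.mmapfile(filename, tagname, size)
        else:
            fd = os.open(filename, os.O_RDWR)
            return mmapfile.mmapfile(fd, size, flags, prot)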

>- suggestions for new issues that maybe ought to be settled in 1.6

Perhaps we should figure out what new capabilities, if any, should be
added in 1.6.  Fred has mentioned weak references, and there are other
possibilities such as ExtensionClass.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Society, my dear, is like salt water, good to swim in but hard to swallow.
    -- Arthur Stringer, _The Silver Poppy_




From beazley at cs.uchicago.edu  Tue Nov 16 18:24:24 1999
From: beazley at cs.uchicago.edu (David Beazley)
Date: Tue, 16 Nov 1999 11:24:24 -0600 (CST)
Subject: [Python-Dev] Python 1.6 status
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
	<14385.36948.610106.195971@amarok.cnri.reston.va.us>
Message-ID: <199911161724.LAA13496@gargoyle.cs.uchicago.edu>

Andrew M. Kuchling writes:
> Guido van Rossum writes:
> >I'm hoping for several kind of responses to this email:
> 
>    * Go through the O'Reilly POSIX book and draw up a list of missing
> POSIX functions that aren't available in the posix module.  This
> was sparked by Greg Ward showing me a Perl daemonize() function
> he'd written, and I realized that some of the functions it used
> weren't available in Python at all.  (setsid() was one of them, I
> think.)
> 

I second this!   This was one of the things I noticed when doing the
Essential Reference Book.   Assuming no one has done it already,
I wouldn't mind volunteering to take a crack at it.

Cheers,

Dave





From fdrake at acm.org  Tue Nov 16 18:25:02 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 12:25:02 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <199911161620.LAA02643@eric.cnri.reston.va.us>
References: <38317FBA.4F3D6B1F@lemburg.com>
	<199911161620.LAA02643@eric.cnri.reston.va.us>
Message-ID: <14385.37742.816993.642515@weyr.cnri.reston.va.us>

Guido van Rossum writes:
 > Also, I don't want to ignore the alternative interface that was
 > suggested by /F.  It uses feed() similar to htmllib c.s.  This has
 > some advantages (although we might want to define some compatibility
 > so it can also feed directly into a file).

  I think one or the other can be used, and then a wrapper that
converts to the other interface.  Perhaps the encoders should provide
feed(), and a file-like wrapper can convert write() to feed().  It
could also be done the other way; I'm not sure if it matters which is
"normal."  (Or perhaps feed() was badly named and should be write()?
The general intent was a little different, I think, but an output file 
is very much a stream consumer.)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From akuchlin at mems-exchange.org  Tue Nov 16 18:32:41 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Tue, 16 Nov 1999 12:32:41 -0500 (EST)
Subject: [Python-Dev] mmapfile module
In-Reply-To: <199911161720.MAA02764@eric.cnri.reston.va.us>
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
	<14385.36948.610106.195971@amarok.cnri.reston.va.us>
	<199911161720.MAA02764@eric.cnri.reston.va.us>
Message-ID: <14385.38201.301429.786642@amarok.cnri.reston.va.us>

Guido van Rossum writes:
>Hm, this seems to require a higher-level Python module to hide the
>differences.  Maybe the Unix version could also use a filename?  I
>would think that mmap'ed files should always be backed by a file (not
>by a pipe, socket etc.).  Or is there an issue with secure creation of
>temp files?  This is a question for a separate thread.

Hmm... I don't know of any way to use mmap() on non-file things,
either; there are odd special cases, like using MAP_ANONYMOUS on
/dev/zero to allocate memory, but that's still using a file.  On the
other hand, there may be some special case where you need to do that.
We could add a fileno() method to get the file descriptor, but I don't
know if that's useful to Windows.  (Is Sam Rushing, the original
author of the Win32 mmapfile, on this list?)  

What do we do about the tagname, which is a Win32 argument that has no
Unix counterpart -- I'm not even sure what its function is.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
I had it in me to be the Pierce Brosnan of my generation.
    -- Vincent Me's past career plans in EGYPT #1



From mal at lemburg.com  Tue Nov 16 18:53:46 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 16 Nov 1999 18:53:46 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us>
Message-ID: <38319A2A.4385D2E7@lemburg.com>

Guido van Rossum wrote:
> 
> > It is not required by the unicodec.register() API to provide a
> > subclass of these base classes, only the given methods must be present;
> > this allows writing Codecs as extensions types.  All Codecs must
> > provide the .encode()/.decode() methods. Codecs having the .read()
> > and/or .write() methods are considered to be StreamCodecs.
> >
> > The Unicode implementation will by itself only use the
> > stateless .encode() and .decode() methods.
> >
> > All other conversions have to be done by explicitly instantiating
> > the appropriate [Stream]Codec.
> 
> Looks okay, although I'd like someone to implement a simple
> shift-state-based stream codec to check this out further.
> 
> I have some questions about the constructor.  You seem to imply
> that instantiating the class without arguments creates a codec without
> state.  That's fine.  When given a stream argument, shouldn't the
> direction of the stream be given as an additional argument, so the
> proper state for encoding or decoding can be set up?  I can see that
> for an implementation it might be more convenient to have separate
> classes for encoders and decoders -- certainly the state being kept is
> very different.

Wouldn't it be possible to have the read/write methods set up
the state when called for the first time ?

Note that I wrote ".read() and/or .write() methods" in the proposal
on purpose: you can of course implement Codecs which only implement
one of them, i.e. Readers and Writers. The registry doesn't care
about them anyway :-)

Then, if you use a Reader for writing, it will result in an
AttributeError...
 
> Also, I don't want to ignore the alternative interface that was
> suggested by /F.  It uses feed() similar to htmllib c.s.  This has
> some advantages (although we might want to define some compatibility
> so it can also feed directly into a file).

AFAIK, .feed() and .finalize() (or .close() etc.) have a different
background: you add data in chunks and then process it at some
final stage rather than for each feed. This is often more
efficient.

With respect to codecs this would mean that you buffer the
output in memory, first doing only preliminary operations on
the feeds and then apply some final logic to the buffer at
the time .finalize() is called.

We could define a StreamCodec subclass for this kind of operation.
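
(Something along these lines, as a rough sketch -- the class name and
details are made up:)

    import string

    class BufferedWriter:
        # collect the feeds in memory and only run the encoder over the
        # whole buffer when finalize() is called
        def __init__(self, encode, stream):
            self.encode = encode        # a stateless encode function
            self.stream = stream
            self.buffer = []
        def feed(self, u):
            self.buffer.append(u)
        def finalize(self):
            self.stream.write(self.encode(string.join(self.buffer, '')))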

> Perhaps someone should go ahead and implement prototype codecs using
> either paradigm and then write some simple apps, so we can make a
> better decision.
> 
> In any case I think the specs codec registry API aren't on the
> critical path, integration of /F's basic unicode object is the first
> thing we need.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    45 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From gward at cnri.reston.va.us  Tue Nov 16 18:54:06 1999
From: gward at cnri.reston.va.us (Greg Ward)
Date: Tue, 16 Nov 1999 12:54:06 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <199911161627.LAA02665@eric.cnri.reston.va.us>; from guido@cnri.reston.va.us on Tue, Nov 16, 1999 at 11:27:53AM -0500
References: <199911152137.QAA28280@eric.cnri.reston.va.us>  <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> <199911161627.LAA02665@eric.cnri.reston.va.us>
Message-ID: <19991116125405.B4063@cnri.reston.va.us>

On 16 November 1999, Guido van Rossum said:
> A completely different approach (which, incidentally, HP has lobbied
> for before; and which has been implemented by Sjoerd Mullender for one
> particular application) would be to cache a mapping from module names
> to filenames in a dbm file.  For Sjoerd's app (which imported hundreds
> of modules) this made a huge difference.

Hey, this could be a big win for Zope startup.  Dunno how much of that
20-30 sec startup overhead is due to loading modules, but I'm sure it's
a sizeable percentage.  Any Zope-heads listening?

> The problem is that it's
> hard to deal with issues like updating the cache while sharing it with
> other processes and even other users...

Probably not a concern in the case of Zope: one installation, one
process, only gets started when it's explicitly shut down and
restarted.  HmmmMMMMmmm...

        Greg



From petrilli at amber.org  Tue Nov 16 19:04:46 1999
From: petrilli at amber.org (Christopher Petrilli)
Date: Tue, 16 Nov 1999 13:04:46 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <19991116125405.B4063@cnri.reston.va.us>; from gward@cnri.reston.va.us on Tue, Nov 16, 1999 at 12:54:06PM -0500
References: <199911152137.QAA28280@eric.cnri.reston.va.us>  <19991116091032.A4063@cnri.reston.va.us> <14385.27579.292173.433577@amarok.cnri.reston.va.us> <14385.33535.23316.286575@anthem.cnri.reston.va.us> <199911161627.LAA02665@eric.cnri.reston.va.us> <19991116125405.B4063@cnri.reston.va.us>
Message-ID: <19991116130446.A3068@trump.amber.org>

Greg Ward [gward at cnri.reston.va.us] wrote:
> On 16 November 1999, Guido van Rossum said:
> > A completely different approach (which, incidentally, HP has lobbied
> > for before; and which has been implemented by Sjoerd Mullender for one
> > particular application) would be to cache a mapping from module names
> > to filenames in a dbm file.  For Sjoerd's app (which imported hundreds
> > of modules) this made a huge difference.
> 
> Hey, this could be a big win for Zope startup.  Dunno how much of that
> 20-30 sec startup overhead is due to loading modules, but I'm sure it's
> a sizeable percentage.  Any Zope-heads listening?

Wow, that's a huge startup time that I've personally never seen.  I can't
imagine it... even loading the Oracle libraries dynamically, which are HUGE
(2Mb or so), it's only a couple of seconds.

> > The problem is that it's
> > hard to deal with issues like updating the cache while sharing it with
> > other processes and even other users...
> 
> Probably not a concern in the case of Zope: one installation, one
> process, only gets started when it's explicitly shut down and
> restarted.  HmmmMMMMmmm...

This doesn't resolve it for a lot of other users of Python, however... and Zope
would always benefit, especially when you're running multiple instances
on the same machine... it would perhaps share more code.

Chris
-- 
| Christopher Petrilli
| petrilli at amber.org



From gmcm at hypernet.com  Tue Nov 16 19:04:41 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Tue, 16 Nov 1999 13:04:41 -0500
Subject: [Python-Dev] mmapfile module
In-Reply-To: <14385.38201.301429.786642@amarok.cnri.reston.va.us>
References: <199911161720.MAA02764@eric.cnri.reston.va.us>
Message-ID: <1269347016-9399681@hypernet.com>

Andrew M. Kuchling wrote:

> Hmm... I don't know of any way to use mmap() on non-file things,
> either; there are odd special cases, like using MAP_ANONYMOUS on
> /dev/zero to allocate memory, but that's still using a file.  On
> the other hand, there may be some special case where you need to
> do that. We could add a fileno() method to get the file
> descriptor, but I don't know if that's useful to Windows.  (Is
> Sam Rushing, the original author of the Win32 mmapfile, on this
> list?)  
> 
> What do we do about the tagname, which is a Win32 argument that
> has no Unix counterpart -- I'm not even sure what its function
> is.

On Windows, a mmap is always backed by disk (swap 
space), but is not necessarily associated with a (user-land) 
file. The tagname is like the "name" associated with a 
semaphore; two processes opening the same tagname get 
shared memory.

Fileno (in the C runtime sense) would be useless on Windows. 
As with all Win32 resources, there's a "handle", which is 
analogous. But different enough, it seems to me, to confound 
any attempts at a common API.

Another fundamental difference (IIRC) is that Windows mmap's 
can be resized on the fly.

- Gordon



From guido at CNRI.Reston.VA.US  Tue Nov 16 19:09:43 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Tue, 16 Nov 1999 13:09:43 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Tue, 16 Nov 1999 18:53:46 +0100."
             <38319A2A.4385D2E7@lemburg.com> 
References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us>  
            <38319A2A.4385D2E7@lemburg.com> 
Message-ID: <199911161809.NAA02894@eric.cnri.reston.va.us>

> > I have some questions about the constructor.  You seem to imply
> > that instantiating the class without arguments creates a codec without
> > state.  That's fine.  When given a stream argument, shouldn't the
> > direction of the stream be given as an additional argument, so the
> > proper state for encoding or decoding can be set up?  I can see that
> > for an implementation it might be more convenient to have separate
> > classes for encoders and decoders -- certainly the state being kept is
> > very different.
> 
> Wouldn't it be possible to have the read/write methods set up
> the state when called for the first time ?

Hm, I'd rather be explicit.  We don't do this for files either.

> Note that I wrote ".read() and/or .write() methods" in the proposal
> on purpose: you can of course implement Codecs which only implement
> one of them, i.e. Readers and Writers. The registry doesn't care
> about them anyway :-)
> 
> Then, if you use a Reader for writing, it will result in an
> AttributeError...
>  
> > Also, I don't want to ignore the alternative interface that was
> > suggested by /F.  It uses feed() similar to htmllib c.s.  This has
> > some advantages (although we might want to define some compatibility
> > so it can also feed directly into a file).
> 
> AFAIK, .feed() and .finalize() (or .close() etc.) have a different
> background: you add data in chunks and then process it at some
> final stage rather than for each feed. This is often more
> efficient.
> 
> With respect to codecs this would mean that you buffer the
> output in memory, first doing only preliminary operations on
> the feeds and then apply some final logic to the buffer at
> the time .finalize() is called.

This is part of the purpose, yes.

> We could define a StreamCodec subclass for this kind of operation.

The difference is that to decode from a file, your proposed interface
is to call read() on the codec which will in turn call read() on the
stream.  In /F's version, I call read() on the stream (geting multibyte
encoded data), feed() that to the codec, which in turn calls feed() to
some other back end -- perhaps another codec which in turn feed()s its
converted data to another file, perhaps an XML parser.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fdrake at acm.org  Tue Nov 16 19:16:42 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Tue, 16 Nov 1999 13:16:42 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <38319A2A.4385D2E7@lemburg.com>
References: <38317FBA.4F3D6B1F@lemburg.com>
	<199911161620.LAA02643@eric.cnri.reston.va.us>
	<38319A2A.4385D2E7@lemburg.com>
Message-ID: <14385.40842.709711.12141@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > Wouldn't it be possible to have the read/write methods set up
 > the state when called for the first time ?

  That slows things down; the constructor should handle initialization.
Perhaps what gets registered should be:  encoding function, decoding
function, stream encoder factory (can be a class), stream decoder
factory (again, can be a class).  These can be encapsulated either
before or after hitting the registry, and can be None.  The registry
can provide default implementations from what is provided (stream
handlers from the functions, or functions from the stream handlers) as 
required.
  Ideally, I should be able to write a module with four well-known
entry points and then provide the module object itself as the
registration entry.  Or I could construct a new object that has the
right interface and register that if it made more sense for the
encoding.
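
(In other words, something like this sketch; the entry point names are
invented for the example:)

    import string

    _registry = {}

    def register(name, codecs):
        # 'codecs' may be a module or any other object exposing some of
        # the four well-known entry points; missing ones are stored as
        # None and can be synthesized by the registry later on
        _registry[string.lower(name)] = (getattr(codecs, 'encode', None),
                                         getattr(codecs, 'decode', None),
                                         getattr(codecs, 'StreamEncoder', None),
                                         getattr(codecs, 'StreamDecoder', None))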

 > AFAIK, .feed() and .finalize() (or .close() etc.) have a different
 > background: you add data in chunks and then process it at some
 > final stage rather than for each feed. This is often more

  Many of the classes that provide feed() do as much work as possible
as data is fed into them (see htmllib.HTMLParser); this structure is
commonly used to support asynchronous operation.

 > With respect to codecs this would mean that you buffer the
 > output in memory, first doing only preliminary operations on
 > the feeds and then apply some final logic to the buffer at
 > the time .finalize() is called.

  That depends on the encoding.  I'd expect it to feed encoded data to 
a sink as quickly as it could and let the target decide what needs to
happen.  If buffering is needed, the target could be a StringIO or
whatever.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From fredrik at pythonware.com  Tue Nov 16 20:32:21 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 20:32:21 +0100
Subject: [Python-Dev] mmapfile module
References: <199911161700.MAA02716@eric.cnri.reston.va.us><14385.36948.610106.195971@amarok.cnri.reston.va.us><199911161720.MAA02764@eric.cnri.reston.va.us> <14385.38201.301429.786642@amarok.cnri.reston.va.us>
Message-ID: <002201bf3069$4e232a50$f29b12c2@secret.pythonware.com>

> Hmm... I don't know of any way to use mmap() on non-file things,
> either; there are odd special cases, like using MAP_ANONYMOUS on
> /dev/zero to allocate memory, but that's still using a file.

but that's not always the case -- OSF/1 supports
truly anonymous mappings, for example.  in fact,
it bombs if you use ANONYMOUS with a file handle:

$ man mmap

    ...

    If MAP_ANONYMOUS is set in the flags parameter:

        +  A new memory region is created and initialized to all zeros.  This
           memory region can be shared only with descendents of the current pro-
           cess.

        +  If the filedes parameter is not -1, the mmap() function fails.

    ...

(btw, doing anonymous maps isn't exactly an odd special
case under this operating system; it's the only memory-
allocation mechanism provided by the kernel...)






From fredrik at pythonware.com  Tue Nov 16 20:33:52 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 20:33:52 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>  <199911161620.LAA02643@eric.cnri.reston.va.us>
Message-ID: <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> Also, I don't want to ignore the alternative interface that was
> suggested by /F.  It uses feed() similar to htmllib c.s.  This has
> some advantages (although we might want to define some
> compatibility so it can also feed directly into a file).

seeing this made me switch on my brain for a moment,
and recall how things are done in PIL (which is, as I've
bragged about before, another library with an internal
format, and many possible external encodings).  among
other things, PIL lets you read and write images to both
ordinary files and arbitrary file objects, but it also lets
you incrementally decode images by feeding it chunks
of data (through ImageFile.Parser).  and it's fast -- it has
to be, since images tend to contain lots of pixels...

anyway, here's what I came up with (code will follow,
if someone's interested).

--------------------------------------------------------------------
A PIL-like Unicode Codec Proposal
--------------------------------------------------------------------

In the PIL model, the codecs are called with a piece of data, and
return the result to the caller.  The codecs maintain internal state
when needed.

class decoder:

    def decode(self, s, offset=0):
        # decode as much data as we possibly can from the
        # given string.  if there's not enough data in the
        # input string to form a full character, return
        # what we've got this far (this might be an empty
        # string).

    def flush(self):
        # flush the decoding buffers.  this should usually
        # return None, unless the fact that knowing that the
        # input stream has ended means that the state can be
        # interpreted in a meaningful way.  however, if the
        # state indicates that there last character was not
        # finished, this method should raise a UnicodeError
        # exception.

class encoder:

    def encode(self, u, offset=0, buffersize=0):
        # encode data from the given offset in the input
        # unicode string into a buffer of the given size
        # (or slightly larger, if required to proceed).
        # if the buffer size is 0, the encoder is free
        # to pick a suitable size itself (if at all
        # possible, it should make it large enough to
        # encode the entire input string).  returns a
        # 2-tuple containing the encoded data, and the
        # number of characters consumed by this call.

    def flush(self):
        # flush the encoding buffers.  returns an ordinary
        # string (which may be empty), or None.

Note that a codec instance can be used for a single string; the codec
registry should hold codec factories, not codec instances.  In
addition, you may use a single type or class to implement both
interfaces at once.

--------------------------------------------------------------------
Use Cases
--------------------------------------------------------------------

A null decoder:

    class decoder:
        def decode(self, s, offset=0):
            return s[offset:]
        def flush(self):
            pass

A null encoder:

    class encoder:
        def encode(self, s, offset=0, buffersize=0):
            if buffersize:
                s = s[offset:offset+buffersize]
            else:
                s = s[offset:]
            return s, len(s)
        def flush(self):
            pass

Decoding a string:

    def decode(s, encoding):
        c = registry.getdecoder(encoding)
        u = c.decode(s)
        t = c.flush()
        if not t:
            return u
        return u + t # not very common

Encoding a string:

    def encode(u, encoding):
        c = registry.getencoder(encoding)
        p = []
        o = 0
        while o < len(u):
            s, n = c.encode(u, o)
            p.append(s)
            o = o + n
        if len(p) == 1:
            return p[0]
        return string.join(p, "") # not very common

Implementing stream codecs is left as an exercise (see the zlib
material in the eff-bot guide for a decoder example).
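
(For what it's worth, a rough sketch of how the stream-decoder exercise
might come out on top of the interface above; error handling is left
out, and the decoder is assumed to buffer any incomplete trailing
character internally:)

    class StreamReader:
        # wrap a decoder instance around an ordinary file-like object
        def __init__(self, file, decoder):
            self.file = file
            self.decoder = decoder
        def read(self, size=-1):
            data = self.file.read(size)
            if not data:
                # end of input: let the decoder flush, or complain about
                # an unfinished character
                return self.decoder.flush() or ''
            return self.decoder.decode(data)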

--- end of proposal




From fredrik at pythonware.com  Tue Nov 16 20:37:40 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Tue, 16 Nov 1999 20:37:40 +0100
Subject: [Python-Dev] Python 1.6 status
References: <199911161700.MAA02716@eric.cnri.reston.va.us> <14385.36948.610106.195971@amarok.cnri.reston.va.us>
Message-ID: <003d01bf306a$0bdea330$f29b12c2@secret.pythonware.com>

>    * Go through the O'Reilly POSIX book and draw up a list of missing
> POSIX functions that aren't available in the posix module.  This
> was sparked by Greg Ward showing me a Perl daemonize() function
> he'd written, and I realized that some of the functions it used
> weren't available in Python at all.  (setsid() was one of them, I
> think.)

$ python
Python 1.5.2 (#1, Aug 23 1999, 14:42:39)  [GCC 2.7.2.3] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import os
>>> os.setsid
<built-in function setsid>







From mhammond at skippinet.com.au  Tue Nov 16 22:54:15 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 17 Nov 1999 08:54:15 +1100
Subject: [Python-Dev] Some thoughts on the codecs... 
In-Reply-To: <19991116110555.8B43335BB1E@snelboot.oratrix.nl>
Message-ID: <00f701bf307d$20f0cb00$0501a8c0@bobcat>

[Andy writes:]
> Leave JISXXX and the CJK stuff out.  If you get into Japanese, you
> really need to cover ShiftJIS, EUC-JP and JIS, they are big, and
> there

[Then Marc replies:]
> 2. give more information to the unicodec registry:
>    one could register classes instead of instances which the Unicode

[Jack chimes in with:]
> I would suggest adding the Dos, Windows and Macintosh
> standard 8-bit charsets
> (their equivalents of latin-1) too, as documents in these
> encoding are pretty
> ubiquitous. But maybe these should only be added on the
> respective platforms.

[And the conversation twisted around to Greg noting:]
> Next, the number of "open" calls:
>
>               Solaris     Linux    IRIX
>  Perl             16         10       9
>  Python          107         71      48

This is leading me to conclude that our "codec registry" should be the
file system, and Python modules.

Would it be possible to define a "standard package" called
"encodings", and when we need an encoding, we simply attempt to load a
module from that package?  The key benefits I see are:

* No need to load modules simply to register a codec (which would make
the number of open calls even higher, and the startup time even
slower.)  This gives us true demand-loading of the codecs, rather
than explicit load-and-register.

* Making language specific distributions becomes simple - simply
select a different set of modules from the "encodings" directory.  The
Python source distribution has them all, but (say) the Windows binary
installer selects only a few.  The Japanese binary installer for
Windows installs a few more.

* Installing new codecs becomes trivial - no need to hack site.py
etc - simply copy the new "codec module" to the encodings directory
and you are done.

* No serious problem for GMcM's installer nor for freeze

We would probably need to assume that certain codecs exist for _all_
platforms and languages - but this is no different to assuming that
"exceptions.py" also exists for all platforms.

Is this worthy of consideration?

Mark.




From andy at robanal.demon.co.uk  Wed Nov 17 01:14:06 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Wed, 17 Nov 1999 00:14:06 GMT
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <010001bf300e$14741310$f29b12c2@secret.pythonware.com>
References: <19991115153045.9641.rocketmail@web604.mail.yahoo.com>             <38305D17.60EC94D0@lemburg.com>  <199911152137.QAA28280@eric.cnri.reston.va.us> <010001bf300e$14741310$f29b12c2@secret.pythonware.com>
Message-ID: <3836f28c.4929177@post.demon.co.uk>

On Tue, 16 Nov 1999 09:39:20 +0100, you wrote:

>1) codecs written according to the "data
>   consumer model", instead of the "stream"
>   model.
>
>        class myDecoder:
>            def __init__(self, target):
>                self.target = target
>                self.state = ...
>            def feed(self, data):
>                ... extract as much data as possible ...
>                self.target.feed(extracted data)
>            def close(self):
>                ... extract what's left ...
>                self.target.feed(additional data)
>                self.target.close()
>
Apart from feed() instead of write(), how is that different from a
Java-like Stream writer as Guido suggested?  He said:

>Andy's file translation example could then be written as follows:
>
># assuming variables input_file, input_encoding, output_file,
># output_encoding, and constant BUFFER_SIZE
>
>f = open(input_file, "rb")
>f1 = unicodec.codecs[input_encoding].stream_reader(f)
>g = open(output_file, "wb")
>g1 = unicodec.codecs[output_encoding].stream_writer(g)
>
>while 1:
>      buffer = f1.read(BUFFER_SIZE)
>      if not buffer:
>	 break
>      g1.write(buffer)
>
>g1.close()
>f1.close()
>
>Note that we could possibly make these the only API that a codec needs
>to provide; the string object <--> unicode object conversions can be
>done using this and the cStringIO module.  (On the other hand it seems
>a common case that would be quite useful.)

- Andy



From gstein at lyra.org  Wed Nov 17 03:03:21 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 18:03:21 -0800 (PST)
Subject: [Python-Dev] shared data
In-Reply-To: <1269351119-9152905@hypernet.com>
Message-ID: 

On Tue, 16 Nov 1999, Gordon McMillan wrote:
> Barry A. Warsaw writes:
> > One approach might be to support loading modules out of jar files
> > (or whatever) using Greg's imputils.  We could put the bootstrap
> > .pyc files in this jar and teach Python to import from it first. 
> > Python installations could even craft their own modules.jar file
> > to include whatever modules they are willing to "hard code". 
> > This, with -S might make Python start up much faster, at the
> > small cost of some flexibility (which could be regained with a
> > c.l. switch or other mechanism to bypass modules.jar).
> 
> Couple hundred Windows users have been doing this for 
> months (http://starship.python.net/crew/gmcm/install.html). 
> The .pyz files are cross-platform, although the "embedding" 
> app would have to be redone for *nix, (and all the embedding 
> really does is keep Python from hunting all over your disk). 
> Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a 
> diskette with a little room left over.

I've got a patch from Jim Ahlstrom to provide a "standardized" library
file. I've got to review and fold that thing in (I'll post here when that
is done).

As Gordon states: yes, the startup time is considerably improved.

The DBM approach is interesting. That could definitely be used thru an
imputils Importer; it would be quite interesting to try that out.

(Note that with the library-style approach, updates would be even harder
to deal with than what Sjoerd saw with the DBM approach; I would guess
that the "right" approach is to rebuild the library from scratch and
atomically replace the thing (but that would bust people with open
references...))

Certainly something to look at.

Cheers,
-g

p.s. I also want to try mmap'ing a library and creating code objects that
use PyBufferObjects (rather than PyStringObjects) that refer to portions
of the mmap. Presuming the mmap is shared, there "should" be a large
reduction in heap usage. Question is that I don't know the proportion of
code bytes to other heap usage caused by loading a .pyc.

p.p.s. I also want to try the buffer approach for frozen code.

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Wed Nov 17 03:29:42 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 18:29:42 -0800 (PST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <14385.40842.709711.12141@weyr.cnri.reston.va.us>
Message-ID: 

On Tue, 16 Nov 1999, Fred L. Drake, Jr. wrote:
> M.-A. Lemburg writes:
>  > Wouldn't it be possible to have the read/write methods set up
>  > the state when called for the first time ?
> 
>   That slows them down; the constructor should handle initialization.
> Perhaps what gets registered should be:  encoding function, decoding
> function, stream encoder factory (can be a class), stream decoder
> factory (again, can be a class).  These can be encapsulated either
> before or after hitting the registry, and can be None.  The registry

I'm with Fred here; he beat me to the punch (and his email is better than 
what I'd write anyhow :-).

I'd like to see the API be *functions* rather than a particular class
specification. If the spec is going to say "do not alter/store state",
then a function makes much more sense than a method on an object.

Of course, bound method objects could be registered. This might occur if
you have a general JIS encode/decoder but need to instantiate it a little
differently for each JIS variant.
(Andy also mentioned something about "options" in JIS encoding/decoding)

> and provide default implementations from what is provided (stream
> handlers from the functions, or functions from the stream handlers) as 
> required.

Excellent idea...

"I'll provide the encode/decode functions, but I don't have a spiffy
algorithm for streaming -- please provide a stream wrapper for my
functions."

>   Ideally, I should be able to write a module with four well-known
> entry points and then provide the module object itself as the
> registration entry.  Or I could construct a new object that has the
> right interface and register that if it made more sense for the
> encoding.

Mark's idea about throwing these things into a package for on-demand
registrations is much better than a "register-beforehand" model. When the
module is loaded from the package, it calls a registration function to
insert its 4-tuple of registration data.
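
A rough sketch of what that registry side could look like (the function
names and the 4-tuple layout here are assumptions, not agreed API):

_registry = {}

def register(encoding, encode=None, decode=None,
             stream_reader=None, stream_writer=None):
    _registry[encoding] = (encode, decode, stream_reader, stream_writer)

def register_module(encoding, module):
    # Fred's variant: the module object itself is the registration
    # entry; whatever well-known entry points it lacks come out as None
    def _get(mod, name):
        if hasattr(mod, name):
            return getattr(mod, name)
        return None
    register(encoding,
             _get(module, "encode"),
             _get(module, "decode"),
             _get(module, "stream_reader"),
             _get(module, "stream_writer"))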

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Wed Nov 17 03:40:07 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 16 Nov 1999 18:40:07 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs... 
In-Reply-To: <00f701bf307d$20f0cb00$0501a8c0@bobcat>
Message-ID: 

On Wed, 17 Nov 1999, Mark Hammond wrote:
>...
> Would it be possible to define a "standard package" called
> "encodings", and when we need an encoding, we simply attempt to load a
> module from that package?  The key benefits I see are:
>...
> Is this worthy of consideration?

Absolutely!

You will need to provide a way for a module (in the "codec" package) to
state *beforehand* that it should be loaded for the X, Y, and Z encodings.
This might be in terms of little "info" files that get dropped into the
package. The __init__.py module scans the directory for the info files and
loads them to build an encoding => module-name mapping.
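
A rough sketch of such an __init__.py (the one-line ".info" file format
-- "encoding-name module-name" -- is pure invention here):

import os, string

_map = {}

def _scan():
    # build the encoding -> module-name map from the info files that
    # codec packages dropped into this directory
    pkgdir = os.path.dirname(__file__)
    for name in os.listdir(pkgdir):
        if name[-5:] != ".info":
            continue
        f = open(os.path.join(pkgdir, name))
        for line in f.readlines():
            fields = string.split(line)
            if len(fields) == 2:
                _map[string.lower(fields[0])] = fields[1]
        f.close()

def lookup(encoding):
    # import the implementing module on first use
    modname = _map[string.lower(encoding)]
    return __import__("encodings." + modname, globals(), locals(),
                      [modname])

_scan()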

The alternative would be to have stub modules like:

iso-8859-1.py:

import unicodec

def encode_1(...)
  ...
def encode_2(...)
  ...
...

unicodec.register('iso-8859-1', encode_1, decode_1)
unicodec.register('iso-8859-2', encode_2, decode_2)
...


iso-8859-2.py:
import iso-8859-1


I believe that encoding names are legitimate file names, but they aren't
necessarily Python identifiers. That kind of bungs up "import
codec.iso-8859-1". The codec package would need to programmatically import
the modules. Clients should not be directly importing the modules, so I
don't see a difficulty here.
[ if we do decide to allow clients access to the modules, then maybe they
  have to arrive through a "helper" module that has a nice name, or the
  codec package provides a "module = code.load('iso-8859-1')" idiom. ]

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From mhammond at skippinet.com.au  Wed Nov 17 03:57:48 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 17 Nov 1999 13:57:48 +1100
Subject: [Python-Dev] Some thoughts on the codecs... 
In-Reply-To: 
Message-ID: <010501bf30a7$88c00320$0501a8c0@bobcat>

> You will need to provide a way for a module (in the "codec"
> package) to
> state *beforehand* that it should be loaded for the X, Y, and
...

> The alternative would be to have stub modules like:

Actually, I was thinking even more radically - drop the codec registry
altogether, and use modules with "well-known" names  (a slight
precedent, but Python isn't averse to well-known names in general)

eg:
iso-8859-1.py:

import unicodec
def encode(...):
  ...
def decode(...):
  ...

iso-8859-2.py:
from iso-8859-1 import *

The codec registry then is trivial, and effectively does not exist
(can't get much more trivial than something that doesn't exist :-):

def getencoder(encoding):
  mod = __import__("encodings." + encoding, globals(), locals(), ["encode"])
  return getattr(mod, "encode")


> I believe that encoding names are legitimate file names, but
> they aren't
> necessarily Python identifiers. That kind of bungs up "import
> codec.iso-8859-1".

Agreed - clients should never need to import them, and codecs that
wish to import other codecs could use "__import__"

Of course, I am not averse to the idea of a registry as well and
having the modules manually register themselves - but it doesn't seem
to buy much, and the logic for getting a codec becomes more complex -
ie, it needs to determine the module to import, then look in the
registry - if it needs to determine the module anyway, why not just
get it from the module and be done with it?

Mark.




From andy at robanal.demon.co.uk  Wed Nov 17 01:18:22 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Wed, 17 Nov 1999 00:18:22 GMT
Subject: [Python-Dev] Some thoughts on the codecs... 
In-Reply-To: <00f701bf307d$20f0cb00$0501a8c0@bobcat>
References: <00f701bf307d$20f0cb00$0501a8c0@bobcat>
Message-ID: <3837f379.5166829@post.demon.co.uk>

On Wed, 17 Nov 1999 08:54:15 +1100, you wrote:

>This is leading me to conclude that our "codec registry" should be the
>file system, and Python modules.
>
>Would it be possible to define a "standard package" called
>"encodings", and when we need an encoding, we simply attempt to load a
>module from that package?  The key benefits I see are:
[snip]
>Is this worthy of consideration?

Exactly what I am aiming for.  The real icing on the cake would be a
small state machine or some helper functions in C which made it
possible to write fast codecs in pure Python, but that can come a bit
later when we have examples up and running.   

- Andy





From andy at robanal.demon.co.uk  Wed Nov 17 01:08:01 1999
From: andy at robanal.demon.co.uk (Andy Robinson)
Date: Wed, 17 Nov 1999 00:08:01 GMT
Subject: [Python-Dev] Internationalization Toolkit
In-Reply-To: <000601bf2ff7$4d8a4c80$042d153f@tim>
References: <000601bf2ff7$4d8a4c80$042d153f@tim>
Message-ID: <3834f142.4599884@post.demon.co.uk>

On Tue, 16 Nov 1999 00:56:18 -0500, you wrote:

>[Andy Robinson]
>> ...
>> I presume no one is actually advocating dropping
>> ordinary Python strings, or the ability to do
>>    rawdata = open('myfile.txt', 'rb').read()
>> without any transformations?
>
>If anyone has advocated either, they've successfully hidden it from me.
>Anyone?

Well, I hear statements looking forward to when all string-handling is
done in Unicode internally.  This scares the hell out of me - it is
what VB does and that bit us badly on simple stream operations.  For
encoding work, you will always need raw strings, and often need
Unicode ones.

- Andy



From tim_one at email.msn.com  Wed Nov 17 08:33:06 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 02:33:06 -0500
Subject: [Python-Dev] Unicode proposal: %-formatting ?
In-Reply-To: <383134AA.4B49D178@lemburg.com>
Message-ID: <000001bf30cd$fd6be9c0$a42d153f@tim>

[MAL]
> ...
> This means a new PyUnicode_Format() implementation mapping
> Unicode format objects to Unicode objects.

It's a bitch, isn't it <0.5 wink>?  I hope they're paying you a lot for
this!

> ... hmm, there is a problem there: how should the PyUnicode_Format()
> API deal with '%s' when it sees a Unicode object as argument ?

Anything other than taking the Unicode characters as-is would be
incomprehensible.  I mean, it's a Unicode format string sucking up Unicode
strings -- what else could possibly make *sense*?

> E.g. what would you get in these cases:
>
> u = u"%s %s" % (u"abc", "abc")

That u"abc" gets substituted as-is seems screamingly necessary to me.

I'm more baffled about what "abc" should do.  I didn't understand the t#/s#
etc arguments, and how those do or don't relate to what str() does.  On the
face of it, the idea that a gazillion and one distinct encodings all get
lumped into "a string object" without remembering their nature makes about
as much sense as if Python were to treat all instances of all user-defined
classes as being of a single InstanceType type  -- except in the
latter case you at least get a __class__ attribute to find your way home
again.

As an ignorant user, I would hope that

    u"%s" % string

had enough sense to know what string's encoding is all on its own, and
promote it correctly to Unicode by magic.

> Perhaps we need a new marker for "insert Unicode object here".

%s means string, and at this level a Unicode object *is* "a string".  If
this isn't obvious, it's likely because we're too clever about what
non-Unicode string objects do in this context.





From captainrobbo at yahoo.com  Wed Nov 17 08:53:53 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 16 Nov 1999 23:53:53 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs... 
Message-ID: <19991117075353.16046.rocketmail@web606.mail.yahoo.com>

--- Mark Hammond  wrote:
> Actually, I was thinking even more radically - drop
> the codec registry
> all together, and use modules with "well-known"
> names  (a slight
> precedent, but Python isnt adverse to well-known
> names in general)
> 
> eg:
> iso-8859-1.py:
> 
> import unicodec
> def encode(...):
>   ...
> def decode(...):
>   ...
> 
> iso-8859-2.py:
> from iso-8859-1 import *
> 
This is the simplest if each codec really is likely to
be implemented in a separate module.  But just look at
the data!  All the iso-8859 encodings need identical
functionality, and just have a different mapping table
with 256 elements.  It would be trivial to implement
these in one module.  And the wide variety of Japanese
encodings (mostly corporate or historical variants of
the same character set) are again best treated from
one code base with a bunch of mapping tables and
routines to generate the variants - basically one can
store the deltas.
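
As a rough illustration of the shared-code-plus-tables idea (a sketch
only: since the new Unicode type isn't there yet, decoded text is shown
as a plain list of ordinals, and the tables are not real ISO 8859 data):

import string

class CharmapCodec:
    # one implementation, many encodings: each ISO 8859 variant just
    # supplies its own 256-entry table of Unicode ordinals
    def __init__(self, decoding_map):
        self.decoding_map = decoding_map
        # build the reverse table used for encoding
        self.encoding_map = {}
        for byte in range(256):
            self.encoding_map[decoding_map[byte]] = byte

    def decode(self, s):
        # one byte in, one ordinal out
        return map(lambda ch, m=self.decoding_map: m[ord(ch)], s)

    def encode(self, ordinals):
        return string.join(
            map(lambda o, m=self.encoding_map: chr(m[o]), ordinals), "")

A single module could then build one CharmapCodec per table and
register the whole family in a loop, whatever the registration
API turns out to be.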

So the choice is between possibly having a lot of
almost-dummy modules, or having Python modules which
generate and register a logical family of encodings.  

I may have some time next week and will try to code up
a few so we can pound on something.

- Andy



=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From captainrobbo at yahoo.com  Wed Nov 17 08:58:23 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Tue, 16 Nov 1999 23:58:23 -0800 (PST)
Subject: [Python-Dev] Unicode proposal: %-formatting ?
Message-ID: <19991117075823.6498.rocketmail@web602.mail.yahoo.com>


--- Tim Peters  wrote:
> I'm more baffled about what "abc" should do.  I
> didn't understand the t#/s#
> etc arguments, and how those do or don't relate to
> what str() does.  On the
> face of it, the idea that a gazillion and one
> distinct encodings all get
> lumped into "a string object" without remembering
> their nature makes about
> as much sense as if Python were to treat all
> instances of all user-defined
> classes as being of a single InstanceType type
>  -- except in the
> latter case you at least get a __class__ attribute
> to find your way home
> again.

Well said.  When the core stuff is done, I'm going to
implement a set of "TypedString" helper routines which
will remember what they are encoded in and won't let
you abuse them by concatenating or otherwise mixing
different encodings.  If you are consciously working
with multi-encoding data, this higher level of
abstraction is really useful.  But I reckon that can
be done in pure Python (just overload '%', '+' etc.
with some encoding checks).
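
A back-of-envelope sketch of that idea (the class and
attribute names are made up):

class TypedString:
    # remembers its encoding and refuses to be combined with
    # differently-encoded data
    def __init__(self, data, encoding):
        self.data = data
        self.encoding = encoding

    def __add__(self, other):
        if isinstance(other, TypedString):
            if other.encoding != self.encoding:
                raise ValueError("cannot mix %s and %s data"
                                 % (self.encoding, other.encoding))
            other = other.data
        return TypedString(self.data + other, self.encoding)

    def __str__(self):
        return self.data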

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From mal at lemburg.com  Wed Nov 17 11:03:59 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Nov 1999 11:03:59 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000201bf30d3$cb2cb240$a42d153f@tim>
Message-ID: <38327D8F.7A5352E6@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > ...demo script...
> 
> It looks like
> 
>     r'\\u0000'
> 
> will get translated into a 2-character Unicode string.

Right...

> That's probably not
> good, if for no other reason than that Java would not do this (it would
> create the obvious 7-character Unicode string), and having something that
> looks like a Java escape that doesn't *work* like the Java escape will be
> confusing as heck for JPython users.  Keeping track of even-vs-odd number of
> backslashes can't be done with a regexp search, but is easy if the code is
> simple :
> ...Tim's version of the demo...

Guido and I have decided to turn \uXXXX into a standard
escape sequence with no further magic applied. \uXXXX will
only be expanded in u"" strings.

Here's the new scheme:

With the 'unicode-escape' encoding being defined as:

* all non-escape characters represent themselves as a Unicode ordinal
  (e.g. 'a' -> U+0061).

* all existing defined Python escape sequences are interpreted as
  Unicode ordinals; note that \xXXXX can represent all Unicode
  ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.

* a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
  error to have fewer than 4 digits after \u.

Examples:

u'abc'          -> U+0061 U+0062 U+0063
u'\u1234'       -> U+1234
u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+000A

Now how should we define ur"abc\u1234\n"  ... ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    44 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From tim_one at email.msn.com  Wed Nov 17 10:31:27 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 04:31:27 -0500
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: <000801bf30de$85bea500$a42d153f@tim>

[Guido]
> ...
> I'm hoping for several kind of responses to this email:
> ...
> - requests for checkin privileges, preferably with a specific issue
> or area of expertise for which the requestor will take responsibility.

I'm specifically requesting not to have checkin privileges.  So there.

I see two problems:

1. When patches go thru you, you at least eyeball them.  This catches bugs
and design errors early.

2. For a multi-platform app, few people have adequate resources for testing;
e.g., I can test under an obsolete version of Win95, and NT if I have to,
but that's it.  You may not actually do better testing than that, but having
patches go thru you allows me the comfort of believing you do .





From mal at lemburg.com  Wed Nov 17 11:11:05 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Nov 1999 11:11:05 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: <00f701bf307d$20f0cb00$0501a8c0@bobcat>
Message-ID: <38327F39.AA381647@lemburg.com>

Mark Hammond wrote:
> 
> This is leading me to conclude that our "codec registry" should be the
> file system, and Python modules.
> 
> Would it be possible to define a "standard package" called
> "encodings", and when we need an encoding, we simply attempt to load a
> module from that package?  The key benefits I see are:
> 
> * No need to load modules simply to register a codec (which would make
> the number of open calls even higher, and the startup time even
> slower.)  This makes it truly demand-loading of the codecs, rather
> than explicit load-and-register.
> 
> * Making language specific distributions becomes simple - simply
> select a different set of modules from the "encodings" directory.  The
> Python source distribution has them all, but (say) the Windows binary
> installer selects only a few.  The Japanese binary installer for
> Windows installs a few more.
> 
> * Installing new codecs becomes trivial - no need to hack site.py
> etc - simply copy the new "codec module" to the encodings directory
> and you are done.
> 
> * No serious problem for GMcM's installer nor for freeze
> 
> We would probably need to assume that certain codes exist for _all_
> platforms and language - but this is no different to assuming that
> "exceptions.py" also exists for all platforms.
> 
> Is this worthy of consideration?

Why not... using the new registry scheme I proposed in the
thread "Codecs and StreamCodecs" you could implement this
via factory_functions and lazy imports (with the encoding
name folded to make up a proper Python identifier, e.g.
hyphens get converted to '' and spaces to '_').

I'd suggest grouping encodings:

[encodings]
	[iso]
		[iso88591]
		[iso88592]
	[jis]
		...
	[cyrillic]
		...
	[misc]

The unicodec registry could then query encodings.get(encoding,action)
and the package would take care of the rest.

Note that the "walk-me-up-scotty" import patch would probably
be nice in this situation too, e.g. to reach the modules in
[misc] or in higher levels such as the ones in [iso] from
[iso88591].

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    44 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Wed Nov 17 10:29:34 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Nov 1999 10:29:34 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>  <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com>
Message-ID: <3832757E.B9503606@lemburg.com>

Fredrik Lundh wrote:
> 
> --------------------------------------------------------------------
> A PIL-like Unicode Codec Proposal
> --------------------------------------------------------------------
> 
> In the PIL model, the codecs are called with a piece of data, and
> returns the result to the caller.  The codecs maintain internal state
> when needed.
> 
> class decoder:
> 
>     def decode(self, s, offset=0):
>         # decode as much data as we possibly can from the
>         # given string.  if there's not enough data in the
>         # input string to form a full character, return
>         # what we've got this far (this might be an empty
>         # string).
> 
>     def flush(self):
>         # flush the decoding buffers.  this should usually
>         # return None, unless the fact that knowing that the
>         # input stream has ended means that the state can be
>         # interpreted in a meaningful way.  however, if the
>         # state indicates that there last character was not
>         # finished, this method should raise a UnicodeError
>         # exception.

Could you explain the reason for having a .flush() method
and what it should return.

Note that the .decode method is not so much different
from my Codec.decode method except that it uses a single
offset where my version uses a slice (the offset is probably
the better variant, because it avoids data truncation).
 
> class encoder:
> 
>     def encode(self, u, offset=0, buffersize=0):
>         # encode data from the given offset in the input
>         # unicode string into a buffer of the given size
>         # (or slightly larger, if required to proceed).
>         # if the buffer size is 0, the decoder is free
>         # to pick a suitable size itself (if at all
>         # possible, it should make it large enough to
>         # encode the entire input string).  returns a
>         # 2-tuple containing the encoded data, and the
>         # number of characters consumed by this call.

Ditto.
 
>     def flush(self):
>         # flush the encoding buffers.  returns an ordinary
>         # string (which may be empty), or None.
> 
> Note that a codec instance can be used for a single string; the codec
> registry should hold codec factories, not codec instances.  In
> addition, you may use a single type or class to implement both
> interfaces at once.

Perhaps I'm missing something, but how would you define
stream codecs using this interface ? 

> Implementing stream codecs is left as an exercise (see the zlib
> material in the eff-bot guide for a decoder example).

...?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    44 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Wed Nov 17 10:55:05 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 17 Nov 1999 10:55:05 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>
		<199911161620.LAA02643@eric.cnri.reston.va.us>
		<38319A2A.4385D2E7@lemburg.com> <14385.40842.709711.12141@weyr.cnri.reston.va.us>
Message-ID: <38327B79.2415786B@lemburg.com>

"Fred L. Drake, Jr." wrote:
> 
> M.-A. Lemburg writes:
>  > Wouldn't it be possible to have the read/write methods set up
>  > the state when called for the first time ?
> 
>   That slows them down; the constructor should handle initialization.
> Perhaps what gets registered should be:  encoding function, decoding
> function, stream encoder factory (can be a class), stream decoder
> factory (again, can be a class).

Guido proposed the factory approach too, though not separated
into these 4 APIs (note that your proposal looks very much like
what I had in the early version of my proposal).

Anyway, I think that factory functions are the way to go,
because they offer more flexibility w/r to reusing already
instantiated codecs, importing modules on-the-fly as was
suggested in another thread (thereby making codec module
import lazy) or mapping encoder and decoder requests all
to one class.

So here's a new registry approach:

unicodec.register(encoding,factory_function,action)

with 
	encoding - name of the supported encoding, e.g. Shift_JIS
	factory_function - a function that returns an object
                   or function ready to be used for action
	action - a string stating the supported action:
			'encode'
			'decode'
			'stream write'
			'stream read'

The factory_function API depends on the implementation of
the codec. The returned object's interface on the value of action:

Codecs:
-------

obj = factory_function_for_<encoding>(errors='strict')

'encode': obj(u,slice=None) -> Python string
'decode': obj(s,offset=0,chunksize=0) -> (Unicode object, bytes consumed)

factory_functions are free to return simple function objects
for stateless encodings.

StreamCodecs:
-------------

obj = factory_function_for_<encoding>(stream,errors='strict')

obj should provide access to all methods defined for the stream
object, overriding these:

'stream write': obj.write(u,slice=None) -> bytes written to stream
		obj.flush() -> ???
'stream read':  obj.read(chunksize=0) -> (Unicode object, bytes read)
		obj.flush() -> ???

errors is defined like in my Codec spec. The codecs are
expected to use this argument to handle error conditions.

I'm not sure what Fredrik intended with the .flush() methods,
so the definition is still open. I would expect it to do some
finalization of state.

Perhaps we need another set of actions for the .feed()/.close()
approach...

As in earlier version of the proposal:
The registry should provide default implementations for
missing action factory_functions using the other registered
functions, e.g. 'stream write' can be emulated using
'encode' and 'stream read' using 'decode'. The same probably
holds for the feed approach.
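
A sketch of how that fallback could work (the registry layout and the
getwriter name follow the proposal above, but none of this is settled
API):

_registry = {}

def register(encoding, factory_function, action):
    _registry[(encoding, action)] = factory_function

def getwriter(stream, encoding, errors='strict'):
    # hand back a stream writer, emulating one on top of the plain
    # 'encode' factory if no 'stream write' factory was registered
    try:
        factory = _registry[(encoding, 'stream write')]
    except KeyError:
        encode = _registry[(encoding, 'encode')](errors)
        return _EmulatedWriter(stream, encode)
    return factory(stream, errors)

class _EmulatedWriter:
    def __init__(self, stream, encode):
        self.stream = stream
        self.encode = encode
    def write(self, u):
        self.stream.write(self.encode(u))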

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    44 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From tim_one at email.msn.com  Wed Nov 17 09:14:38 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 03:14:38 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <3831350B.8F69CB6D@lemburg.com>
Message-ID: <000201bf30d3$cb2cb240$a42d153f@tim>

[MAL]
> ...
> Here is a sample implementation of what I had in mind:
>
> """ Demo for 'unicode-escape' encoding.
> """
> import struct,string,re
>
> pack_format = '>H'
>
> def convert_string(s):
>
>     l = map(None,s)
>     for i in range(len(l)):
> 	l[i] = struct.pack(pack_format,ord(l[i]))
>     return l
>
> u_escape = re.compile(r'\\u([0-9a-fA-F]{0,4})')
>
> def unicode_unescape(s):
>
>     l = []
>     start = 0
>     while start < len(s):
> 	m = u_escape.search(s,start)
> 	if not m:
> 	    l[len(l):] = convert_string(s[start:])
> 	    break
> 	m_start,m_end = m.span()
> 	if m_start > start:
> 	    l[len(l):] = convert_string(s[start:m_start])
> 	hexcode = m.group(1)
> 	#print hexcode,start,m_start
> 	if len(hexcode) != 4:
> 	    raise SyntaxError,'illegal \\uXXXX sequence: \\u%s' % hexcode
> 	ordinal = string.atoi(hexcode,16)
> 	l.append(struct.pack(pack_format,ordinal))
> 	start = m_end
>     #print l
>     return string.join(l,'')
>
> def hexstr(s,sep=''):
>
>     return string.join(map(lambda x,hex=hex,ord=ord: '%02x' % ord(x),s),sep)

It looks like

    r'\\u0000'

will get translated into a 2-character Unicode string.  That's probably not
good, if for no other reason than that Java would not do this (it would
create the obvious 7-character Unicode string), and having something that
looks like a Java escape that doesn't *work* like the Java escape will be
confusing as heck for JPython users.  Keeping track of even-vs-odd number of
backslashes can't be done with a regexp search, but is easy if the code is
simple :

def unicode_unescape(s):
    from string import atoi
    import array
    i, n = 0, len(s)
    result = array.array('H') # unsigned short, native order
    while i < n:
        ch = s[i]
        i = i+1
        if ch != "\\":
            result.append(ord(ch))
            continue
        if i == n:
            raise ValueError("string ends with lone backslash")
        ch = s[i]
        i = i+1
        if ch != "u":
            result.append(ord("\\"))
            result.append(ord(ch))
            continue
        hexchars = s[i:i+4]
        if len(hexchars) != 4:
            raise ValueError("\\u escape at end not followed by "
                             "at least 4 characters")
        i = i+4
        for ch in hexchars:
            if ch not in "01234567890abcdefABCDEF":
                raise ValueError("\\u" + hexchars + " contains "
                                 "non-hex characters")
        result.append(atoi(hexchars, 16))

    # print result
    return result.tostring()
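
For what it's worth, a quick sanity check of the even-vs-odd handling
(each resulting character is two bytes in the .tostring() output):

assert len(unicode_unescape(r'\u0041')) == 2    # one character, U+0041
assert len(unicode_unescape(r'\\u0041')) == 14  # seven literal characters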





From tim_one at email.msn.com  Wed Nov 17 09:47:48 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 03:47:48 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: <383156DF.2209053F@lemburg.com>
Message-ID: <000401bf30d8$6cf30bc0$a42d153f@tim>

[MAL]
> FYI, the next version of the proposal ...
> File objects opened in text mode will use "t#" and binary ones use "s#".

Am I the only one who sees magical distinctions between text and binary mode
as a Really Bad Idea?  I wouldn't have guessed the Unix natives here would
quietly acquiesce to importing a bit of Windows madness .





From tim_one at email.msn.com  Wed Nov 17 09:47:46 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 03:47:46 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <383140F3.EDDB307A@lemburg.com>
Message-ID: <000301bf30d8$6bbd4ae0$a42d153f@tim>

[Jack Jansen]
> I would suggest adding the Dos, Windows and Macintosh standard
> 8-bit charsets (their equivalents of latin-1) too, as documents
> in these encoding are pretty ubiquitous. But maybe these should
> only be added on the respective platforms.

[MAL]
> Good idea. What code pages would that be ?

I'm not clear on what's being suggested; e.g., Windows supports *many*
different "code pages".  CP 1252 is default in the U.S., and is an extension
of Latin-1.  See e.g.

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

which appears to be up-to-date (has 0x80 as the euro symbol, Unicode
U+20AC -- although whether your version of U.S. Windows actually has this
depends on whether you installed the service pack that added it!).

See

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP850.TXT

for the closest DOS got.





From tim_one at email.msn.com  Wed Nov 17 10:05:21 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 04:05:21 -0500
Subject: Weak refs (was [Python-Dev] just say no...)
In-Reply-To: <14385.33486.855802.187739@weyr.cnri.reston.va.us>
Message-ID: <000601bf30da$e069d820$a42d153f@tim>

[Fred L. Drake, Jr., pines for some flavor of weak refs; MAL reminds us
 of his work; & back to Fred]

>   Yes, but still not in the core.  So we have two general examples
> (vrefs and mxProxy) and there's WeakDict (or something like that).  I
> think there really needs to be a core facility for this.

This kind of thing certainly belongs in the core (for efficiency and smooth
integration) -- if it belongs in the language at all.  This was discussed at
length here some months ago; that's what prompted MAL to "do something"
about it.  Guido hasn't shown visible interest, and nobody has been willing
to fight him to the death over it.  So it languishes.  Buy him lunch
tomorrow and get him excited .





From tim_one at email.msn.com  Wed Nov 17 10:10:24 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 04:10:24 -0500
Subject: [Python-Dev] shared data (was: Some thoughts on the codecs...)
In-Reply-To: <1269351119-9152905@hypernet.com>
Message-ID: <000701bf30db$94d4ac40$a42d153f@tim>

[Gordon McMillan]
> ...
> Yeah, it's faster. And I can put Python+Tcl/Tk+IDLE on a
> diskette with a little room left over.

That's truly remarkable (he says while waiting for the Inbox Repair Tool to
finish repairing his 50Mb Outlook mail file ...)!

> but-since-its-WIndows-it-must-be-tainted-ly y'rs

Indeed -- if it runs on Windows, it's a worthless piece o' crap .





From fredrik at pythonware.com  Wed Nov 17 12:00:10 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 17 Nov 1999 12:00:10 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>  <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com>
Message-ID: <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com>

M.-A. Lemburg  wrote:
> >     def flush(self):
> >         # flush the decoding buffers.  this should usually
> >         # return None, unless the fact that knowing that the
> >         # input stream has ended means that the state can be
> >         # interpreted in a meaningful way.  however, if the
> >         # state indicates that there last character was not
> >         # finished, this method should raise a UnicodeError
> >         # exception.
>
> Could you explain for reason for having a .flush() method
> and what it should return.

in most cases, it should either return None, or
raise a UnicodeError exception:

    >>> u = unicode("? i ?a ? e ?", "iso-latin-1")
    >>> # yes, that's a valid Swedish sentence ;-)
    >>> s = u.encode("utf-8")
    >>> d = decoder("utf-8")
    >>> d.decode(s[:-1])
    "? i ?a ? e "
    >>> d.flush()
    UnicodeError: last character not complete

on the other hand, there are situations where it
might actually return a string.  consider a "HTML
entity decoder" which uses the following pattern
to match a character entity: "&\w+;?" (note that
the trailing semicolon is optional).

    >>> u = unicode("? i ?a ? e ?", "iso-latin-1")
    >>> s = u.encode("html-entities")
    >>> d = decoder("html-entities")
    >>> d.decode(s[:-1])
    "? i ?a ? e "
    >>> d.flush()
    "?"

> Perhaps I'm missing something, but how would you define
> stream codecs using this interface ?

input: read chunks of data, decode, and
keep extra data in a local buffer.

output: encode data into suitable chunks,
and write to the output stream (that's why
there's a buffersize argument to encode --
if someone writes a 10mb unicode string to
an encoded stream, python shouldn't allocate
an extra 10-30 megabytes just to be able to
encode the darn thing...)
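
a sketch of that output side, assuming an encoder that
follows the encode(u, offset, buffersize)/flush()
interface from the proposal (the class name is made up):

class EncodingWriter:
    def __init__(self, stream, encoder, buffersize=16384):
        self.stream = stream
        self.encoder = encoder
        self.buffersize = buffersize

    def write(self, u):
        # feed the encoder a bounded amount at a time, so writing a
        # 10mb unicode string never needs a 10-30mb output buffer
        offset = 0
        while offset < len(u):
            data, consumed = self.encoder.encode(u, offset, self.buffersize)
            self.stream.write(data)
            offset = offset + consumed

    def close(self):
        data = self.encoder.flush()
        if data:
            self.stream.write(data)
        self.stream.close()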

> > Implementing stream codecs is left as an exercise (see the zlib
> > material in the eff-bot guide for a decoder example).

everybody should have a copy of the eff-bot guide ;-)

(but alright, I plan to post a complete utf-8 implementation
in a not too distant future).






From gstein at lyra.org  Wed Nov 17 11:57:36 1999
From: gstein at lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 02:57:36 -0800 (PST)
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: <38327F39.AA381647@lemburg.com>
Message-ID: 

On Wed, 17 Nov 1999, M.-A. Lemburg wrote:
>...
> I'd suggest grouping encodings:
> 
> [encodings]
> 	[iso}
> 		[iso88591]
> 		[iso88592]
> 	[jis]
> 		...
> 	[cyrillic]
> 		...
> 	[misc]

WHY?!?!

This is taking a simple solution and making it complicated. I see no
benefit to creating yet-another-level-of-hierarchy. Why should they be
grouped?

Leave the modules just under "encodings" and be done with it.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Wed Nov 17 12:14:01 1999
From: gstein at lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 03:14:01 -0800 (PST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <38327B79.2415786B@lemburg.com>
Message-ID: 

On Wed, 17 Nov 1999, M.-A. Lemburg wrote:
>...
> Anyway, I think that factory functions are the way to go,
> because they offer more flexibility w/r to reusing already
> instantiated codecs, importing modules on-the-fly as was
> suggested in another thread (thereby making codec module
> import lazy) or mapping encoder and decoder requests all
> to one class.

Why a factory? I've got a simple encode() function. I don't need a
factory. "flexibility" at the cost of complexity (IMO).

> So here's a new registry approach:
> 
> unicodec.register(encoding,factory_function,action)
> 
> with 
> 	encoding - name of the supported encoding, e.g. Shift_JIS
> 	factory_function - a function that returns an object
>                    or function ready to be used for action
> 	action - a string stating the supported action:
> 			'encode'
> 			'decode'
> 			'stream write'
> 			'stream read'

This action thing is subject to error. *if* you're wanting to go this
route, then have:

unicodec.register_encode(...)
unicodec.register_decode(...)
unicodec.register_stream_write(...)
unicodec.register_stream_read(...)

They are equivalent. Guido has also told me in the past that he dislikes
parameters that alter semantics -- preferring different functions instead.
(this is why there are a good number of PyBufferObject interfaces; I had
fewer to start with)

This suggested approach is also quite a bit more wordy/annoying than
Fred's alternative:

unicodec.register('iso-8859-1', encoder, decoder, None, None)

And don't say "future compatibility allows us to add new actions." Well,
those same future changes can add new registration functions or additional
parameters to the single register() function.

Not that I'm advocating it, but register() could also take a single
parameter: if a class, then instantiate it and call methods for each
action; if an instance, then just call methods for each action.

[ and the third/original variety: a function object as the first param is
  the actual hook, and params 2 thru 4 (each are optional, or just the
  stream funcs?) are the other hook functions ]

> The factory_function API depends on the implementation of
> the codec. The returned object's interface on the value of action:
> 
> Codecs:
> -------
> 
> obj = factory_function_for_<encoding>(errors='strict')

Where does this "errors" value come from? How does a user alter that
value? Without an ability to change this, I see no reason for a factory.
[ and no: don't tell me it is a thread-state value :-) ]

On the other hand: presuming the "errors" thing is valid, *then* I see a
need for a factory.

Truly... I dislike factories. IMO, they just add code/complexity in many
cases where the functionality isn't needed. But that's just me :-)

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From captainrobbo at yahoo.com  Wed Nov 17 12:17:00 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 17 Nov 1999 03:17:00 -0800 (PST)
Subject: [Python-Dev] Rosette i18n API
Message-ID: <19991117111700.8831.rocketmail@web603.mail.yahoo.com>

There is a very capable C++ library at

http://rosette.basistech.com/

It is well worth looking at the things this API
actually lets you do for ideas on patterns.

- Andy

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From gstein at lyra.org  Wed Nov 17 12:21:18 1999
From: gstein at lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 03:21:18 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <000401bf30d8$6cf30bc0$a42d153f@tim>
Message-ID: 

On Wed, 17 Nov 1999, Tim Peters wrote:
> [MAL]
> > FYI, the next version of the proposal ...
> > File objects opened in text mode will use "t#" and binary ones use "s#".
> 
> Am I the only one who sees magical distinctions between text and binary mode
> as a Really Bad Idea?  I wouldn't have guessed the Unix natives here would
> quietly acquiesce to importing a bit of Windows madness .

It's a seductive idea... yes, it feels wrong, but then... it seems kind of
right, too...

:-)

Yes. It is a mode. Is it bad? Not sure. You've already told the system
that you want to treat the file differently. Much like you're treating it
differently when you specify 'r' vs. 'w'.

The real annoying thing would be to assume that opening a file as 'r'
means that I *meant* text mode and to start using "t#". In actuality, I
typically open files that way since I do most of my coding on Linux. If
I now have to pay attention to things and open it as 'rb', then I'll be
pissed.

And the change in behavior and bugs that interpreting 'r' as text would
introduce? Ack!

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From fredrik at pythonware.com  Wed Nov 17 12:36:32 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 17 Nov 1999 12:36:32 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: 
Message-ID: <001b01bf30ef$ffb08a20$f29b12c2@secret.pythonware.com>

Greg Stein  wrote:
> Why a factory? I've got a simple encode() function. I don't need a
> factory. "flexibility" at the cost of complexity (IMO).

so where do you put the state?

how do you reset the state between
strings?

how do you handle incremental
decoding/encoding?

etc.

(I suggest taking another look at PIL's codec
design.  it solves all these problems with a
minimum of code, and it works -- people
have been hammering on PIL for years...)






From gstein at lyra.org  Wed Nov 17 12:34:30 1999
From: gstein at lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 03:34:30 -0800 (PST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <001b01bf30ef$ffb08a20$f29b12c2@secret.pythonware.com>
Message-ID: 

On Wed, 17 Nov 1999, Fredrik Lundh wrote:
> Greg Stein  wrote:
> > Why a factory? I've got a simple encode() function. I don't need a
> > factory. "flexibility" at the cost of complexity (IMO).
> 
> so where do you put the state?

encode() is not supposed to retain state. It is supposed to do a complete
translation. It is not a stream thingy, which may have received partial
characters.

> how do you reset the state between
> strings?

There is none :-)

> how do you handle incremental
> decoding/encoding?

Streams.

-g

--
Greg Stein, http://www.lyra.org/




From fredrik at pythonware.com  Wed Nov 17 12:46:01 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 17 Nov 1999 12:46:01 +0100
Subject: [Python-Dev] Python 1.6 status
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: <004c01bf30f1$537102b0$f29b12c2@secret.pythonware.com>

Guido van Rossum  wrote:
> - suggestions for new issues that maybe ought to be settled in 1.6

three things: imputil, imputil, imputil






From fredrik at pythonware.com  Wed Nov 17 12:51:33 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Wed, 17 Nov 1999 12:51:33 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: 
Message-ID: <006201bf30f2$194626f0$f29b12c2@secret.pythonware.com>

Greg Stein  wrote:
> > so where do you put the state?
>
> encode() is not supposed to retain state. It is supposed to do a complete
> translation. It is not a stream thingy, which may have received partial
> characters.
>
> > how do you handle incremental
> > decoding/encoding?
> 
> Streams.

hmm.  why have two different mechanisms when
you can do the same thing with one?






From gstein at lyra.org  Wed Nov 17 14:01:47 1999
From: gstein at lyra.org (Greg Stein)
Date: Wed, 17 Nov 1999 05:01:47 -0800 (PST)
Subject: [Python-Dev] Apache process (was: Python 1.6 status)
In-Reply-To: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: 

On Tue, 16 Nov 1999, Guido van Rossum wrote:
>...
> Greg, I understand you have checkin privileges for Apache.  What is
> the procedure there for handing out those privileges?  What is the
> procedure for using them?  (E.g. if you made a bogus change to part of
> Apache you're not supposed to work on, what happens?)

Somebody proposes that a person is added to the list of people with
checkin privileges. If nobody else in the group vetoes that, then they're
in (their system doesn't require continual participation by each member,
so it can only operate at a veto level, rather than a unanimous assent).
It is basically determined on the basis of merit -- has the person been
active (on the Apache developer's mailing list) and has the person
contributed something significant? Further, by providing commit access,
will they further the goals of Apache? And, of course, does their
temperament seem to fit in with the other group members?

I can make any change that I'd like. However, there are about 20 other
people who can easily revert or alter my changes if they're bogus.
There are no programmatic restrictions.... You could say it is based on
mutual respect and a social contract of behavior. Large changes should be
discussed before committing to CVS. Bug fixes, doc enhancements, minor
functional improvements, etc, all follow a commit-then-review process. I
just check the thing in. Others see the diff (emailed to the checkins
mailing list (this is different from Python-checkins which only says what
files are changed, rather than providing the diff)) and can comment on the
change, make their own changes, etc.

To be concrete: I added the Expat code that now appears in Apache 1.3.9.
Before doing so, I queried the group. There were some issues that I dealt
with before finally commiting Expat to the CVS repository. On another
occasion, I added a new API to Apache; again, I proposed it first, got an
"all OK" and committed it. I've done a couple bug fixes which I just
checked in.
[ "all OK" means three +1 votes and no vetoes. everybody has veto
  ability (but the responsibility to explain why and to remove their veto 
  when their concerns are addressed). ]

On many occasions, I've reviewed the diffs that were posted to the
checkins list, and made comments back to the author. I've caught a few
problems this way.

For Apache 2.0, even large changes are commit-then-review at this point.
At some point, it will switch over to review-then-commit and the project
will start moving towards stabilization/release. (bug fixes and stuff will
always remain commit-then-review)

I'll note that the process works very well given that diffs are emailed. I
doubt that it would be effective if people had to fetch CVS diffs
themselves.

Your note also implies "areas of ownership". This doesn't really exist
within Apache. There aren't even "primary authors" or things like that. I
have the ability/rights to change any portions: from the low-level
networking, to the documentation, to the server-side include processing.
Of course, if I'm going to make a big change, then I'll be posting a patch
for review first, and whoever has worked in that area in the past
may/will/should comment.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/




From guido at CNRI.Reston.VA.US  Wed Nov 17 14:32:05 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:32:05 -0500
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: Your message of "Wed, 17 Nov 1999 04:31:27 EST."
             <000801bf30de$85bea500$a42d153f@tim> 
References: <000801bf30de$85bea500$a42d153f@tim> 
Message-ID: <199911171332.IAA03266@kaluha.cnri.reston.va.us>

> I'm specifically requesting not to have checkin privileges.  So there.

I will force nobody to use checkin privileges.  However I see that
for some contributors, checkin privileges will save me and them time.

> I see two problems:
> 
> 1. When patches go thru you, you at least eyeball them.  This catches bugs
> and design errors early.

I will still eyeball them -- only after the fact.  Since checkins are
pretty public, being slapped on the wrist for a bad checkin is a
pretty big embarrassment, so few contributors will check in buggy code
more than once.  Moreover, there will be more eyeballs.

> 2. For a multi-platform app, few people have adequate resources for testing;
> e.g., I can test under an obsolete version of Win95, and NT if I have to,
> but that's it.  You may not actually do better testing than that, but having
> patches go thru you allows me the comfort of believing you do .

I expect that the same mechanisms will apply.  I have access to
Solaris, Linux and Windows (NT + 98) but it's actually a lot easier to
check portability after things have been checked in.  And again, there
will be more testers.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Wed Nov 17 14:34:23 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:34:23 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: Your message of "Tue, 16 Nov 1999 23:53:53 PST."
             <19991117075353.16046.rocketmail@web606.mail.yahoo.com> 
References: <19991117075353.16046.rocketmail@web606.mail.yahoo.com> 
Message-ID: <199911171334.IAA03374@kaluha.cnri.reston.va.us>

> This is the simplest if each codec really is likely to
> be implemented in a separate module.  But just look at
> the data!  All the iso-8859 encodings need identical
> functionality, and just have a different mapping table
> with 256 elements.  It would be trivial to implement
> these in one module.  And the wide variety of Japanese
> encodings (mostly corporate or historical variants of
> the same character set) are again best treated from
> one code base with a bunch of mapping tables and
> routines to generate the variants - basically one can
> store the deltas.
> 
> So the choice is between possibly having a lot of
> almost-dummy modules, or having Python modules which
> generate and register a logical family of encodings.  
> 
> I may have some time next week and will try to code up
> a few so we can pound on something.

I see no problem with having a lot of near-dummy modules if it
simplifies the architecture.  You can still do code sharing.  Files
are cheap; APIs are expensive.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Wed Nov 17 14:38:35 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:38:35 -0500
Subject: [Python-Dev] Some thoughts on the codecs...
In-Reply-To: Your message of "Wed, 17 Nov 1999 02:57:36 PST."
              
References:  
Message-ID: <199911171338.IAA03511@kaluha.cnri.reston.va.us>

> This is taking a simple solution and making it complicated. I see no
> benefit to the creating yet-another-level-of-hierarchy. Why should they be
> grouped?
> 
> Leave the modules just under "encodings" and be done with it.

Agreed.  Tim Peters once remarked that Python likes shallow encodings
(or perhaps that *I* like them :-).  This is one such case where I
would strongly urge for the simplicity of a shallow hierarchy.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Wed Nov 17 14:43:44 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:43:44 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Wed, 17 Nov 1999 03:14:01 PST."
              
References:  
Message-ID: <199911171343.IAA03636@kaluha.cnri.reston.va.us>

> Why a factory? I've got a simple encode() function. I don't need a
> factory. "flexibility" at the cost of complexity (IMO).

Unless there are certain cases where factories are useful.  But let's
read on...

> > 	action - a string stating the supported action:
> > 			'encode'
> > 			'decode'
> > 			'stream write'
> > 			'stream read'
> 
> This action thing is subject to error. *if* you're wanting to go this
> route, then have:
> 
> unicodec.register_encode(...)
> unicodec.register_decode(...)
> unicodec.register_stream_write(...)
> unicodec.register_stream_read(...)
> 
> They are equivalent. Guido has also told me in the past that he dislikes
> parameters that alter semantics -- preferring different functions instead.

Yes, indeed!  (But weren't we going to do away with the whole registry
idea in favor of an encodings package?)

> Not that I'm advocating it, but register() could also take a single
> parameter: if a class, then instantiate it and call methods for each
> action; if an instance, then just call methods for each action.

Nah, that's bad -- a class is just a factory, and once you are
allowing classes it's really good to also allow factory functions.

> [ and the third/original variety: a function object as the first param is
>   the actual hook, and params 2 thru 4 (each are optional, or just the
>   stream funcs?) are the other hook functions ]

Fine too.  They should all be optional.

> > obj = factory_function_for_(errors='strict')
> 
> Where does this "errors" value come from? How does a user alter that
> value? Without an ability to change this, I see no reason for a factory.
> [ and no: don't tell me it is a thread-state value :-) ]
> 
> On the other hand: presuming the "errors" thing is valid, *then* I see a
> need for a factory.

The idea is that various places that take an encoding name can also
take a codec instance.  So the user can call the factory function /
class constructor.

> Truly... I dislike factories. IMO, they just add code/complexity in many
> cases where the functionality isn't needed. But that's just me :-)

Get over it...  In a sense, every Python class is a factory for its
own instances!  I think you must be confusing Python with Java or
C++. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Wed Nov 17 14:56:56 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Wed, 17 Nov 1999 08:56:56 -0500
Subject: [Python-Dev] Apache process (was: Python 1.6 status)
In-Reply-To: Your message of "Wed, 17 Nov 1999 05:01:47 PST."
              
References:  
Message-ID: <199911171356.IAA04005@kaluha.cnri.reston.va.us>

> Somebody proposes that a person is added to the list of people with
> checkin privileges. If nobody else in the group vetoes that, then they're
> in (their system doesn't require continual participation by each member,
> so it can only operate at a veto level, rather than a unanimous assent).
> It is basically determined on the basis of merit -- has the person been
> active (on the Apache developer's mailing list) and has the person
> contributed something significant? Further, by providing commit access,
> will they further the goals of Apache? And, of course, does their
> temperament seem to fit in with the other group members?

This makes sense, but I have one concern: if somebody who isn't liked
very much (say a capable hacker who is a real troublemaker) asks for
privileges, would people veto this?  I'd be reluctant to go on record
as veto'ing a particular person.  (E.g. there are a few troublemakers
in c.l.py, and I would never want them to join python-dev let alone
give them commit privileges, but I'm not sure if I would want to
discuss this on a publicly archived mailing list -- or even on a
privately archived mailing list, given that the number of members
might be in the hundreds.)

[...stuff I like...]

> I'll note that the process works very well given that diffs are emailed. I
> doubt that it would be effective if people had to fetch CVS diffs
> themselves.

That's a great idea; I'll see if we can do that to our checkin email,
regardless of whether we hand out commit privileges.

> Your note also implies "areas of ownership". This doesn't really exist
> within Apache. There aren't even "primary authors" or things like that. I
> have the ability/rights to change any portions: from the low-level
> networking, to the documentation, to the server-side include processing.

But that's Apache, which is explicitly run as a collective.  In
Python, I definitely want to have ownership of certain sections of the
code.  But I agree that this doesn't need to be formalized by access
control lists; the social process you describe sounds like it will
work just fine.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From fdrake at acm.org  Wed Nov 17 15:44:25 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Wed, 17 Nov 1999 09:44:25 -0500 (EST)
Subject: Weak refs (was [Python-Dev] just say no...)
In-Reply-To: <000601bf30da$e069d820$a42d153f@tim>
References: <14385.33486.855802.187739@weyr.cnri.reston.va.us>
	<000601bf30da$e069d820$a42d153f@tim>
Message-ID: <14386.48969.630893.119344@weyr.cnri.reston.va.us>

Tim Peters writes:
 > about it.  Guido hasn't shown visible interest, and nobody has been willing
 > to fight him to the death over it.  So it languishes.  Buy him lunch
 > tomorrow and get him excited .

  Guido has asked me to pursue this topic, so I'll be checking out
available implementations and seeing if any are adoptable or if
something different is needed to be fully general and
well-integrated.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From tim_one at email.msn.com  Thu Nov 18 04:21:16 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:21:16 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: <38327D8F.7A5352E6@lemburg.com>
Message-ID: <000101bf3173$f9805340$c0a0143f@tim>

[MAL]
> Guido and I have decided to turn \uXXXX into a standard
> escape sequence with no further magic applied. \uXXXX will
> only be expanded in u"" strings.

Does that exclude ur"" strings?  Not arguing either way, just don't know
what all this means.

> Here's the new scheme:
>
> With the 'unicode-escape' encoding being defined as:
>
> - all non-escape characters represent themselves as a Unicode ordinal
>   (e.g. 'a' -> U+0061).

Same as before (scream if that's wrong).

> - all existing defined Python escape sequences are interpreted as
>   Unicode ordinals;

Same as before (ditto).

> note that \xXXXX can represent all Unicode ordinals,

This means that the definition of \xXXXX has changed, then -- as you pointed
out just yesterday , \xABCDq currently acts like \xCDq.  Does the new
\x definition apply only in u"" strings, or in "" strings too?  What is the
new \x definition?

> and \OOO (octal) can represent Unicode ordinals up to U+01FF.

Same as before (ditto).

> - a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
>   error to have fewer than 4 digits after \u.

Same as before (ditto).

IOW, I don't see anything that's changed other than an unspecified new
treatment of \x escapes, and possibly that ur"" strings don't expand \u
escapes.

> Examples:
>
> u'abc'          -> U+0061 U+0062 U+0063
> u'\u1234'       -> U+1234
> u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+05c

The last example is damaged (U+05c isn't legit).  Other than that, these
look the same as before.

> Now how should we define ur"abc\u1234\n"  ... ?

If strings carried an encoding tag with them, the obvious answer is that
this acts exactly like r"abc\u1234\n" acts today except gets a
"unicode-escaped" encoding tag instead of a "[whatever the default is
today]" encoding tag.

If strings don't carry an encoding tag with them, you're in a bit of a
pickle:  you'll have to convert it to a regular string or a Unicode string,
but in either case have no way to communicate that it may need further
processing; i.e., no way to distinguish it from a regular or Unicode string
produced by any other mechanism.  The code I posted yesterday remains my
best answer to that unpleasant puzzle (i.e., produce a Unicode string,
fiddling with backslashes just enough to get the \u escapes expanded, in the
same way Java's (conceptual) preprocessor does it).





From tim_one at email.msn.com  Thu Nov 18 04:21:19 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:21:19 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: 
Message-ID: <000201bf3173$fb7f7ea0$c0a0143f@tim>

[MAL]
> File objects opened in text mode will use "t#" and binary
> ones use "s#".

[Greg Stein]
> ...
> The real annoying thing would be to assume that opening a file as 'r'
> means that I *meant* text mode and to start using "t#".

Isn't that exactly what MAL said would happen?  Note that a "t" flag for
"text mode" is an MS extension -- C doesn't define "t", and Python doesn't
either; a lone "r" has always meant text mode.

> In actuality, I typically open files that way since I do most of my
> coding on Linux. If I now have to pay attention to things and open it
> as 'rb', then I'll be pissed.
>
> And the change in behavior and bugs that interpreting 'r' as text would
> introduce? Ack!

'r' is already interpreted as text mode, but so far, on Unix-like systems,
there's been no difference between text and binary modes.  Introducing a
distinction will certainly cause problems.  I don't know what the
compensating advantages are thought to be.





From tim_one at email.msn.com  Thu Nov 18 04:23:00 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:23:00 -0500
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: <199911171332.IAA03266@kaluha.cnri.reston.va.us>
Message-ID: <000301bf3174$37b465c0$c0a0143f@tim>

[Guido]
> I will force nobody to use checkin privileges.

That almost went without saying .

> However I see that for some contributors, checkin privileges will
> save me and them time.

Then it's Good!  Provided it doesn't hurt language stability.  I agree that
changing the system to mail out diffs addresses what I was worried about
there.





From tim_one at email.msn.com  Thu Nov 18 04:31:38 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:31:38 -0500
Subject: [Python-Dev] Apache process (was: Python 1.6 status)
In-Reply-To: <199911171356.IAA04005@kaluha.cnri.reston.va.us>
Message-ID: <000401bf3175$6c089660$c0a0143f@tim>

[Greg]
> ...
> Somebody proposes that a person is added to the list of people with
> checkin privileges. If nobody else in the group vetoes that, then
> they're in ...

[Guido]
> This makes sense, but I have one concern: if somebody who isn't liked
> very much (say a capable hacker who is a real troublemaker) asks for
> privileges, would people veto this?

It seems that a key point in Greg's description is that people don't propose
*themselves* for checkin.  They have to talk someone else into proposing
them.  That should keep Endang out of the running for a few years .

After that, I care more about their code than their personalities.  If the
stuff they check in is good, fine; if it's not, lock 'em out for direct
cause.

> I'd be reluctant to go on record as veto'ing a particular person.

Secret Ballot run off a web page -- although not so secret you can't see who
voted for what .





From tim_one at email.msn.com  Thu Nov 18 04:37:18 1999
From: tim_one at email.msn.com (Tim Peters)
Date: Wed, 17 Nov 1999 22:37:18 -0500
Subject: Weak refs (was [Python-Dev] just say no...)
In-Reply-To: <14386.48969.630893.119344@weyr.cnri.reston.va.us>
Message-ID: <000501bf3176$36a5ca00$c0a0143f@tim>

[Fred L. Drake, Jr.]
> Guido has asked me to pursue this topic [weak refs], so I'll be
> checking out available implementations and seeing if any are
> adoptable or if something different is needed to be fully general
> and well-integrated.

Just don't let "fully general" stop anything for its sake alone; e.g., if
there's a slick trick that *could* exempt numbers, that's all to the good!
Adding a pointer to every object is really unattractive, while adding a flag
or two to type objects is dirt cheap.

Note in passing that current Java addresses weak refs too (several flavors
of 'em! -- very elaborate).





From gstein at lyra.org  Thu Nov 18 09:09:24 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 18 Nov 1999 00:09:24 -0800 (PST)
Subject: [Python-Dev] just say no...
In-Reply-To: <000201bf3173$fb7f7ea0$c0a0143f@tim>
Message-ID: 

On Wed, 17 Nov 1999, Tim Peters wrote:
>...
> 'r' is already interpreted as text mode, but so far, on Unix-like systems,
> there's been no difference between text and binary modes.  Introducing a
> distinction will certainly cause problems.  I don't know what the
> compensating advantages are thought to be.

Wow. "compensating advantages" ... Excellent "power phrase" there.

hehe...

-g

--
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Thu Nov 18 09:15:04 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:15:04 +0100
Subject: [Python-Dev] just say no...
References: <000201bf3173$fb7f7ea0$c0a0143f@tim>
Message-ID: <3833B588.1E31F01B@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > File objects opened in text mode will use "t#" and binary
> > ones use "s#".
> 
> [Greg Stein]
> > ...
> > The real annoying thing would be to assume that opening a file as 'r'
> > means that I *meant* text mode and to start using "t#".
> 
> Isn't that exactly what MAL said would happen?  Note that a "t" flag for
> "text mode" is an MS extension -- C doesn't define "t", and Python doesn't
> either; a lone "r" has always meant text mode.

Em, I think you've got something wrong here: "t#" refers to the
parsing marker used for writing data to files opened in text mode.

Until now, all files used the "s#" parsing marker for writing
data, regardless of being opened in text or binary mode. The
new interpretation (new, because there previously was none ;-)
of the buffer interface forces this to be changed to regain
conformance.

> > In actuality, I typically open files that way since I do most of my
> > coding on Linux. If I now have to pay attention to things and open it
> > as 'rb', then I'll be pissed.
> >
> > And the change in behavior and bugs that interpreting 'r' as text would
> > introduce? Ack!
> 
> 'r' is already interpreted as text mode, but so far, on Unix-like systems,
> there's been no difference between text and binary modes.  Introducing a
> distinction will certainly cause problems.  I don't know what the
> compensating advantages are thought to be.

I guess you won't notice any difference: strings define both
interfaces ("s#" and "t#") to mean the same thing. Only other
buffer compatible types may now fail to write to text files
-- which is not so bad, because it forces the programmer to
rethink what he really intended when opening the file in text
mode.

Besides, if you are writing portable scripts you should pay
close attention to "r" vs. "rb" anyway.

[Strange, I find myself arguing for a feature that I don't
like myself ;-)]

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 18 09:59:21 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:59:21 +0100
Subject: [Python-Dev] Python 1.6 status
References: <199911161700.MAA02716@eric.cnri.reston.va.us> <004c01bf30f1$537102b0$f29b12c2@secret.pythonware.com>
Message-ID: <3833BFE9.6FD118B1@lemburg.com>

Fredrik Lundh wrote:
> 
> Guido van Rossum  wrote:
> > - suggestions for new issues that maybe ought to be settled in 1.6
> 
> three things: imputil, imputil, imputil

But please don't add the current version as default importer...
its strategy is way too slow for real life apps (yes, I've tested
this: imports typically take twice as long as with the builtin
importer).

I'd opt for an import manager which provides a useful API for
import hooks to register themselves with. What we really need
is not yet another complete reimplementation of what the
builtin importer does, but rather a more detailed exposure of
the various import aspects: finding modules and loading modules.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 18 09:50:36 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:50:36 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <38317FBA.4F3D6B1F@lemburg.com>  <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com>
Message-ID: <3833BDDC.7CD2CC1F@lemburg.com>

Fredrik Lundh wrote:
> 
> M.-A. Lemburg  wrote:
> > >     def flush(self):
> > >         # flush the decoding buffers.  this should usually
> > >         # return None, unless the fact that knowing that the
> > >         # input stream has ended means that the state can be
> > >         # interpreted in a meaningful way.  however, if the
> > >         # state indicates that there last character was not
> > >         # finished, this method should raise a UnicodeError
> > >         # exception.
> >
> > Could you explain for reason for having a .flush() method
> > and what it should return.
> 
> in most cases, it should either return None, or
> raise a UnicodeError exception:
> 
>     >>> u = unicode("å i åa ä e ö", "iso-latin-1")
>     >>> # yes, that's a valid Swedish sentence ;-)
>     >>> s = u.encode("utf-8")
>     >>> d = decoder("utf-8")
>     >>> d.decode(s[:-1])
>     "å i åa ä e "
>     >>> d.flush()
>     UnicodeError: last character not complete
> 
> on the other hand, there are situations where it
> might actually return a string.  consider a "HTML
> entity decoder" which uses the following pattern
> to match a character entity: "&\w+;?" (note that
> the trailing semicolon is optional).
> 
>     >>> u = unicode("å i åa ä e ö", "iso-latin-1")
>     >>> s = u.encode("html-entities")
>     >>> d = decoder("html-entities")
>     >>> d.decode(s[:-1])
>     "å i åa ä e "
>     >>> d.flush()
>     "ö"

Ah, ok. So the .flush() method checks for proper
string endings and then either returns the remaining
input or raises an error.
 
> > Perhaps I'm missing something, but how would you define
> > stream codecs using this interface ?
> 
> input: read chunks of data, decode, and
> keep extra data in a local buffer.
> 
> output: encode data into suitable chunks,
> and write to the output stream (that's why
> there's a buffersize argument to encode --
> if someone writes a 10mb unicode string to
> an encoded stream, python shouldn't allocate
> an extra 10-30 megabytes just to be able to
> encode the darn thing...)

So the stream codecs would be wrappers around the
string codecs.
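
For example (a minimal illustration of the wrapper idea, not the
interface I'm proposing):

    class EncodedStreamWriter:
        # wrap a plain string-codec encode function around an output stream
        def __init__(self, stream, encode):
            self.stream = stream
            self.encode = encode
        def write(self, u):
            self.stream.write(self.encode(u))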

Have you read my latest version of the Codec interface ?
Wouldn't that be a reasonable approach ? Note that I have
integrated your ideas into the new API -- it's basically
only missing the .flush() methods, which I can add now
that I know what you meant.
 
> > > Implementing stream codecs is left as an exercise (see the zlib
> > > material in the eff-bot guide for a decoder example).
> 
> everybody should have a copy of the eff-bot guide ;-)

Sure, but the format, the format... make it printed and add
a CD and you would probably have a good selling book
there ;-)
 
> (but alright, I plan to post a complete utf-8 implementation
> in a not too distant future).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 18 09:16:48 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:16:48 +0100
Subject: [Python-Dev] Some thoughts on the codecs...
References: 
Message-ID: <3833B5F0.FA4620AD@lemburg.com>

Greg Stein wrote:
> 
> On Wed, 17 Nov 1999, M.-A. Lemburg wrote:
> >...
> > I'd suggest grouping encodings:
> >
> > [encodings]
> >       [iso]
> >               [iso88591]
> >               [iso88592]
> >       [jis]
> >               ...
> >       [cyrillic]
> >               ...
> >       [misc]
> 
> WHY?!?!
> 
> This is taking a simple solution and making it complicated. I see no
> benefit to the creating yet-another-level-of-hierarchy. Why should they be
> grouped?
> 
> Leave the modules just under "encodings" and be done with it.

Nevermind, was just an idea...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 18 09:43:31 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 09:43:31 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References:  <199911171343.IAA03636@kaluha.cnri.reston.va.us>
Message-ID: <3833BC33.66E134F@lemburg.com>

Guido van Rossum wrote:
> 
> > Why a factory? I've got a simple encode() function. I don't need a
> > factory. "flexibility" at the cost of complexity (IMO).
> 
> Unless there are certain cases where factories are useful.  But let's
> read on...
>
> > >     action - a string stating the supported action:
> > >                     'encode'
> > >                     'decode'
> > >                     'stream write'
> > >                     'stream read'
> >
> > This action thing is subject to error. *if* you're wanting to go this
> > route, then have:
> >
> > unicodec.register_encode(...)
> > unicodec.register_decode(...)
> > unicodec.register_stream_write(...)
> > unicodec.register_stream_read(...)
> >
> > They are equivalent. Guido has also told me in the past that he dislikes
> > parameters that alter semantics -- preferring different functions instead.
> 
> Yes, indeed!

Ok.

> (But weren't we going to do away with the whole registry
> idea in favor of an encodings package?)

One way or another, the Unicode implementation will have to
access a dictionary containing references to the codecs for
a particular encoding. You won't get around registering these
at some point... be it in a lazy way, on-the-fly or by some
other means.

What we could do is implement the lookup like this:

1. call encodings.lookup_(encoding) and use the
   return value for the conversion
2. if all fails, cop out with an error

Step 1. would do all the import magic and then register
the found codecs in some dictionary for faster access
(perhaps this could be done in a way that is directly
available to the Unicode implementation, e.g. in a
global internal dictionary -- the one I originally had in
mind for the unicodec registry).

> > Not that I'm advocating it, but register() could also take a single
> > parameter: if a class, then instantiate it and call methods for each
> > action; if an instance, then just call methods for each action.
> 
> Nah, that's bad -- a class is just a factory, and once you are
> allowing classes it's really good to also allow factory functions.
> 
> > [ and the third/original variety: a function object as the first param is
> >   the actual hook, and params 2 thru 4 (each are optional, or just the
> >   stream funcs?) are the other hook functions ]
> 
> Fine too.  They should all be optional.

Ok.
 
> > > obj = factory_function_for_(errors='strict')
> >
> > Where does this "errors" value come from? How does a user alter that
> > value? Without an ability to change this, I see no reason for a factory.
> > [ and no: don't tell me it is a thread-state value :-) ]
> >
> > On the other hand: presuming the "errors" thing is valid, *then* I see a
> > need for a factory.
> 
> The idea is that various places that take an encoding name can also
> take a codec instance.  So the user can call the factory function /
> class constructor.

Right. The argument is reachable via:

Codec = encodings.lookup_encode('utf-8')
codec = Codec(errors='?')
s = codec(u"abc????")

s would then equal 'abc??'.

--

Should I go ahead then and change the registry business to
the new strategy (via the encodings package in the above
sense) ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mhammond at skippinet.com.au  Thu Nov 18 11:57:44 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu, 18 Nov 1999 21:57:44 +1100
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <3833BC33.66E134F@lemburg.com>
Message-ID: <002401bf31b3$bf16c230$0501a8c0@bobcat>

[Guido]
> > (But weren't we going to do away with the whole registry
> > idea in favor of an encodings package?)
>
[MAL]
> One way or another, the Unicode implementation will have to
> access a dictionary containing references to the codecs for
> a particular encoding. You won't get around registering these
> at some point... be it in a lazy way, on-the-fly or by some
> other means.

What is wrong with my idea of using well-known names from the encodings
module?  The dict then is "encodings..__dict__".  All
encodings "just work" because they leverage the Python module
system.  Unless I'm missing something, there is no need for any extra
registry at all.  I guess it would actually resolve to 2 dict lookups,
but that's OK surely?
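
A minimal sketch of what I mean (the well-known names "encode" and
"decode" are placeholders here, not a proposal):

    import sys

    def lookup(encoding):
        modname = "encodings." + encoding
        __import__(modname)                # cheap after the first import
        module = sys.modules[modname]      # dict lookup number one
        d = module.__dict__                # dict lookup number two
        return d["encode"], d["decode"]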

Mark.




From mal at lemburg.com  Thu Nov 18 10:39:30 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 10:39:30 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf3173$f9805340$c0a0143f@tim>
Message-ID: <3833C952.C6F154B1@lemburg.com>

Tim Peters wrote:
> 
> [MAL]
> > Guido and I have decided to turn \uXXXX into a standard
> > escape sequence with no further magic applied. \uXXXX will
> > only be expanded in u"" strings.
> 
> Does that exclude ur"" strings?  Not arguing either way, just don't know
> what all this means.
> 
> > Here's the new scheme:
> >
> > With the 'unicode-escape' encoding being defined as:
> >
> > - all non-escape characters represent themselves as a Unicode ordinal
> >   (e.g. 'a' -> U+0061).
> 
> Same as before (scream if that's wrong).
> 
> > - all existing defined Python escape sequences are interpreted as
> >   Unicode ordinals;
> 
> Same as before (ditto).
> 
> > note that \xXXXX can represent all Unicode ordinals,
> 
> This means that the definition of \xXXXX has changed, then -- as you pointed
> out just yesterday , \xABCDq currently acts like \xCDq.  Does the new
> \x definition apply only in u"" strings, or in "" strings too?  What is the
> new \x definition?

Guido decided to make \xYYXX return U+YYXX *only* within u""
strings. In  "" (Python strings) the same sequence will result
in chr(0xXX).
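
In other words, under the proposal (illustration only, nothing
implements this yet):

    u"\x1234"    # one Unicode character, U+1234
    "\x1234"     # unchanged old behaviour: chr(0x34) -- only the last two hex digits count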
 
> > and \OOO (octal) can represent Unicode ordinals up to U+01FF.
> 
> Same as before (ditto).
> 
> > - a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
> >   error to have fewer than 4 digits after \u.
> 
> Same as before (ditto).
> 
> IOW, I don't see anything that's changed other than an unspecified new
> treatment of \x escapes, and possibly that ur"" strings don't expand \u
> escapes.

The difference is that we no longer take the two step approach.
\uXXXX is treated at the same time all other escape sequences
are decoded (the previous version first scanned and decoded
all standard Python sequences and then turned to the \uXXXX
sequences in a second scan).
 
> > Examples:
> >
> > u'abc'          -> U+0061 U+0062 U+0063
> > u'\u1234'       -> U+1234
> > u'abc\u1234\n'  -> U+0061 U+0062 U+0063 U+1234 U+05c
> 
> The last example is damaged (U+05c isn't legit).  Other than that, these
> look the same as before.

Corrected; thanks.
 
> > Now how should we define ur"abc\u1234\n"  ... ?
> 
> If strings carried an encoding tag with them, the obvious answer is that
> this acts exactly like r"abc\u1234\n" acts today except gets a
> "unicode-escaped" encoding tag instead of a "[whatever the default is
> today]" encoding tag.
> 
> If strings don't carry an encoding tag with them, you're in a bit of a
> pickle:  you'll have to convert it to a regular string or a Unicode string,
> but in either case have no way to communicate that it may need further
> processing; i.e., no way to distinguish it from a regular or Unicode string
> produced by any other mechanism.  The code I posted yesterday remains my
> best answer to that unpleasant puzzle (i.e., produce a Unicode string,
> fiddling with backslashes just enough to get the \u escapes expanded, in the
> same way Java's (conceptual) preprocessor does it).

They don't have such tags... so I guess we're in trouble ;-)

I guess to make ur"" have a meaning at all, we'd need to go
the Java preprocessor way here, i.e. scan the string *only*
for \uXXXX sequences, decode these and convert the rest as-is
to Unicode ordinals.
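
A rough sketch of that scanning (unichr and unicode stand for the
proposed new builtins; the regular expression treatment is deliberately
naive):

    import re

    def decode_raw_unicode_escape(s):
        # expand \uXXXX only; every other byte is taken as-is as the
        # Unicode ordinal of the same value
        return re.sub(r"\\u([0-9a-fA-F]{4})",
                      lambda m: unichr(int(m.group(1), 16)),
                      unicode(s, "latin-1"))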

Would that be ok ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 18 12:41:32 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 12:41:32 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
Message-ID: <3833E5EC.AAFE5016@lemburg.com>

Mark Hammond wrote:
> 
> [Guido]
> > > (But weren't we going to do away with the whole registry
> > > idea in favor of an encodings package?)
> >
> [MAL]
> > One way or another, the Unicode implementation will have to
> > access a dictionary containing references to the codecs for
> > a particular encoding. You won't get around registering these
> > at some point... be it in a lazy way, on-the-fly or by some
> > other means.
> 
> What is wrong with my idea of using well-known names from the encodings
> module?  The dict then is "encodings..__dict__".  All
> encodings "just work" because they leverage the Python module
> system.  Unless I'm missing something, there is no need for any extra
> registry at all.  I guess it would actually resolve to 2 dict lookups,
> but that's OK surely?

The problem is that the encoding names are not Python identifiers,
e.g. iso-8859-1 is not allowed as an identifier. This and
the fact that applications may want to ship their own codecs (which
do not get installed under the system wide encodings package)
make the registry necessary.

I don't see a problem with the registry though -- the encodings
package can take care of the registration process without any
user interaction. There would only have to be an API for
looking up an encoding published by the encodings package for
the Unicode implementation to use. The magic behind that API
is left to the encodings package...

BTW, nothing's wrong with your idea :-) In fact, I like it
a lot because it keeps the encoding modules out of the
top-level scope which is good.

PS: we could probably even take the whole codec idea one step
further and also allow other input/output formats to be registered,
e.g. stream ciphers or pickle mechanisms. The step in that
direction is not a big one: we'd only have to drop the specification
of the Unicode object in the spec and replace it with an arbitrary
object. Of course, this will still have to be a Unicode object
for use by the Unicode implementation.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From gmcm at hypernet.com  Thu Nov 18 15:19:48 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Thu, 18 Nov 1999 09:19:48 -0500
Subject: [Python-Dev] Python 1.6 status
In-Reply-To: <3833BFE9.6FD118B1@lemburg.com>
Message-ID: <1269187709-18981857@hypernet.com>

Marc-Andre wrote:

> Fredrik Lundh wrote:
> > 
> > Guido van Rossum  wrote:
> > > - suggestions for new issues that maybe ought to be settled in 1.6
> > 
> > three things: imputil, imputil, imputil
> 
> But please don't add the current version as default importer...
> its strategy is way too slow for real life apps (yes, I've tested
> this: imports typically take twice as long as with the builtin
> importer).

I think imputil's emulation of the builtin importer is more of a 
demonstration than a serious implementation. As for speed, it 
depends on the test. 
 
> I'd opt for an import manager which provides a useful API for
> import hooks to register themselves with. 

I think that rather than blindly chain themselves together, there 
should be a simple minded manager. This could let the 
programmer prioritize them.

> What we really need
> is not yet another complete reimplementation of what the
> builtin importer does, but rather a more detailed exposure of
> the various import aspects: finding modules and loading modules.

The first clause I sort of agree with - the current 
implementation is a fine implementation of a filesystem 
directory based importer.

I strongly disagree with the second clause. The current import 
hooks are just such a detailed exposure; and they are 
incomprehensible and unmanageable.

I guess you want to tweak the "finding" part of the builtin 
import mechanism. But that's no reason to ask all importers 
to break themselves up into "find" and "load" pieces. It's a 
reason to ask that the standard importer be, in some sense, 
"subclassable" (ie, expose hooks, or perhaps be an extension 
class like thingie).

- Gordon



From jim at interet.com  Thu Nov 18 15:39:20 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Thu, 18 Nov 1999 09:39:20 -0500
Subject: [Python-Dev] Python 1.6 status
References: <1269187709-18981857@hypernet.com>
Message-ID: <38340F98.212F61@interet.com>

Gordon McMillan wrote:
> 
> Marc-Andre wrote:
> 
> > Fredrik Lundh wrote:
> > >
> > > Guido van Rossum  wrote:
> > > > - suggestions for new issues that maybe ought to be settled in 1.6
> > >
> > > three things: imputil, imputil, imputil
> >
> > But please don't add the current version as default importer...
> > its strategy is way too slow for real life apps (yes, I've tested
> > this: imports typically take twice as long as with the builtin
> > importer).
> 
> I think imputil's emulation of the builtin importer is more of a
> demonstration than a serious implementation. As for speed, it
> depends on the test.

IMHO the current import mechanism is good for developers who must
work on the library code in the directory tree, but a disaster
for sysadmins who must distribute Python applications either
internally to a number of machines or commercially.  What we
need is a standard Python library file like a Java "Jar" file.
Imputil can support this as 130 lines of Python.  I have also
written one in C.  I like the imputil approach, but if we want
to add a library importer to import.c, I volunteer to write it.

I don't want to just add more complicated and unmanageable hooks
which people will all use different ways and just add to the
confusion.

It is easy to install packages by just making them into a library
file and throwing it into a directory.  So why aren't we doing it?
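
The core of such an importer is tiny (sketch only -- the library file
format and the read_source() accessor are made up for illustration):

    import imp

    def load_from_library(library, fqname):
        # 'library' is any object that can hand back the source stored
        # for a module inside the archive
        source = library.read_source(fqname)
        module = imp.new_module(fqname)
        exec source in module.__dict__
        # a real importer would also record the module in sys.modules
        return module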

Jim Ahlstrom



From guido at CNRI.Reston.VA.US  Thu Nov 18 16:30:28 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 10:30:28 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
In-Reply-To: Your message of "Thu, 18 Nov 1999 09:19:48 EST."
             <1269187709-18981857@hypernet.com> 
References: <1269187709-18981857@hypernet.com> 
Message-ID: <199911181530.KAA03887@eric.cnri.reston.va.us>

Gordon McMillan wrote:

> Marc-Andre wrote:
> 
> > Fredrik Lundh wrote:
> >
> > > Guido van Rossum  wrote:
> > > > - suggestions for new issues that maybe ought to be settled in 1.6
> > > 
> > > three things: imputil, imputil, imputil
> > 
> > But please don't add the current version as default importer...
> > its strategy is way too slow for real life apps (yes, I've tested
> > this: imports typically take twice as long as with the builtin
> > importer).
> 
> I think imputil's emulation of the builtin importer is more of a 
> demonstration than a serious implementation. As for speed, it 
> depends on the test. 

Agreed.  I like some of imputil's features, but I think the API
needs to be redesigned.

> > I'd opt for an import manager which provides a useful API for
> > import hooks to register themselves with. 
> 
> I think that rather than blindly chain themselves together, there 
> should be a simple minded manager. This could let the 
> programmer prioritize them.

Indeed.  (A list of importers has been suggested, to replace the list
of directories currently used.)
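
Something like this, just to make the idea concrete (none of these names
are decided):

    # instead of walking a list of directories, walk a list of importers
    for importer in sys.importers:
        module = importer.import_module(fqname)
        if module is not None:
            break
    else:
        raise ImportError, "no module named " + fqname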

> > What we really need
> > is not yet another complete reimplementation of what the
> > builtin importer does, but rather a more detailed exposure of
> > the various import aspects: finding modules and loading modules.
> 
> The first clause I sort of agree with - the current 
> implementation is a fine implementation of a filesystem 
> directory based importer.
> 
> I strongly disagree with the second clause. The current import 
> hooks are just such a detailed exposure; and they are 
> incomprehensible and unmanageable.

Based on how many people have successfully written import hooks, I
have to agree. :-(

> I guess you want to tweak the "finding" part of the builtin 
> import mechanism. But that's no reason to ask all importers 
> to break themselves up into "find" and "load" pieces. It's a 
> reason to ask that the standard importer be, in some sense, 
> "subclassable" (ie, expose hooks, or perhaps be an extension 
> class like thingie).

Agreed.  Subclassing is a good way towards flexibility.

And Jim Ahlstrom writes:

> IMHO the current import mechanism is good for developers who must
> work on the library code in the directory tree, but a disaster
> for sysadmins who must distribute Python applications either
> internally to a number of machines or commercially.

Unfortunately, you're right. :-(

> What we need is a standard Python library file like a Java "Jar"
> file.  Imputil can support this as 130 lines of Python.  I have also
> written one in C.  I like the imputil approach, but if we want to
> add a library importer to import.c, I volunteer to write it.

Please volunteer to design or at least review the grand architecture
-- see below.

> I don't want to just add more complicated and unmanageable hooks
> which people will all use different ways and just add to the
> confusion.

You're so right!

> It is easy to install packages by just making them into a library
> file and throwing it into a directory.  So why aren't we doing it?

Rhetorical question. :-)

So here's a challenge: redesign the import API from scratch.

Let me start with some requirements.

Compatibility issues:
---------------------

- the core API may be incompatible, as long as compatibility layers
can be provided in pure Python

- support for rexec functionality

- support for freeze functionality

- load .py/.pyc/.pyo files and shared libraries from files

- support for packages

- sys.path and sys.modules should still exist; sys.path might
have a slightly different meaning

- $PYTHONPATH and $PYTHONHOME should still be supported

(I wouldn't mind a splitting up of importdl.c into several
platform-specific files, one of which is chosen by the configure
script; but that's a bit of a separate issue.)

New features:
-------------

- Integrated support for Greg Ward's distribution utilities (i.e. a
  module prepared by the distutil tools should install painlessly)

- Good support for prospective authors of "all-in-one" packaging
  tools like Gordon McMillan's win32 installer or /F's squish.  (But
  I *don't* require backwards compatibility for existing tools.)

- Standard import from zip or jar files, in two ways:

  (1) an entry on sys.path can be a zip/jar file instead of a directory;
      its contents will be searched for modules or packages

  (2) a file in a directory that's on sys.path can be a zip/jar file;
      its contents will be considered as a package (note that this is
      different from (1)!)

  I don't particularly care about supporting all zip compression
  schemes; if Java gets away with only supporting gzip compression
  in jar files, so can we.

- Easy ways to subclass or augment the import mechanism along
  different dimensions.  For example, while none of the following
  features should be part of the core implementation, it should be
  easy to add any or all:

  - support for a new compression scheme to the zip importer

  - support for a new archive format, e.g. tar

  - a hook to import from URLs or other data sources (e.g. a
    "module server" imported in CORBA) (this needn't be supported
    through $PYTHONPATH though)

  - a hook that imports from compressed .py or .pyc/.pyo files

  - a hook to auto-generate .py files from other filename
    extensions (as currently implemented by ILU)

  - a cache for file locations in directories/archives, to improve
    startup time

  - a completely different source of imported modules, e.g. for an
    embedded system or PalmOS (which has no traditional filesystem)

- Note that different kinds of hooks should (ideally, and within
  reason) properly combine, as follows: if I write a hook to recognize
  .spam files and automatically translate them into .py files, and you
  write a hook to support a new archive format, then if both hooks are
  installed together, it should be possible to find a .spam file in an
  archive and do the right thing, without any extra action.  Right?

- It should be possible to write hooks in C/C++ as well as Python

- Applications embedding Python may supply their own implementations,
  default search path, etc., but don't have to if they want to piggyback
  on an existing Python installation (even though the latter is
  fraught with risk, it's cheaper and easier to understand).

Implementation:
---------------

- There must clearly be some code in C that can import certain
  essential modules (to solve the chicken-or-egg problem), but I don't
  mind if the majority of the implementation is written in Python.
  Using Python makes it easy to subclass.

- In order to support importing from zip/jar files using compression,
  we'd at least need the zlib extension module and hence libz itself,
  which may not be available everywhere.

- I suppose that the bootstrap is solved using a mechanism very
  similar to what freeze currently used (other solutions seem to be
  platform dependent).

- I also want to still support importing *everything* from the
  filesystem, if only for development.  (It's hard enough to deal with
  the fact that exceptions.py is needed during Py_Initialize();
  I want to be able to hack on the import code written in Python
  without having to rebuild the executable all the time.)

Let's first complete the requirements gathering.  Are these
requirements reasonable?  Will they make an implementation too
complex?  Am I missing anything?

Finally, to what extent does this impact the desire for dealing
differently with the Python bytecode compiler (e.g. supporting
optimizers written in Python)?  And does it affect the desire to
implement the read-eval-print loop (the >>> prompt) in Python?

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Thu Nov 18 16:37:49 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 10:37:49 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Thu, 18 Nov 1999 12:41:32 +0100."
             <3833E5EC.AAFE5016@lemburg.com> 
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>  
            <3833E5EC.AAFE5016@lemburg.com> 
Message-ID: <199911181537.KAA03911@eric.cnri.reston.va.us>

> The problem is that the encoding names are not Python identifiers,
> e.g. iso-8859-1 is not allowed as an identifier.

This is easily taken care of by translating each string of consecutive
non-identifier-characters to an underscore, so this would import the
iso_8859_1.py module.  (I also noticed in an earlier post that the
official name for Shift_JIS has an underscore, while most other
encodings use hyphens.)
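
E.g. (one possible spelling of the translation):

    import re

    def encoding_to_module_name(name):
        # "ISO-8859-1" -> "iso_8859_1", "Shift_JIS" -> "shift_jis"
        return re.sub(r"[^A-Za-z0-9]+", "_", name).lower()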

> This and
> the fact that applications may want to ship their own codecs (which
> do not get installed under the system wide encodings package)
> make the registry necessary.

But it could be enough to register a package where to look for
encodings (in addition to the system package).

Or there could be a registry for encoding search functions.  (See the
import discussion.)

> I don't see a problem with the registry though -- the encodings
> package can take care of the registration process without any
> user interaction. There would only have to be an API for
> looking up an encoding published by the encodings package for
> the Unicode implementation to use. The magic behind that API
> is left to the encodings package...

I think that the collection of encodings will eventually grow large
enough to make it a requirement to avoid doing work proportional to
the number of supported encodings at startup (or even when an encoding
is referenced for the first time).  Any "lazy" mechanism (of which
module search is an example) will do.

> BTW, nothing's wrong with your idea :-) In fact, I like it
> a lot because it keeps the encoding modules out of the
> top-level scope which is good.

Yes.

> PS: we could probably even take the whole codec idea one step
> further and also allow other input/output formats to be registered,
> e.g. stream ciphers or pickle mechanisms. The step in that
> direction is not a big one: we'd only have to drop the specification
> of the Unicode object in the spec and replace it with an arbitrary
> object. Of course, this will still have to be a Unicode object
> for use by the Unicode implementation.

This is a step towards Java's architecture of stackable streams.

But I'm always in favor of tackling what we know we need before
tackling the most generalized version of the problem.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From mal at lemburg.com  Thu Nov 18 16:52:26 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 16:52:26 +0100
Subject: [Python-Dev] Python 1.6 status
References: <1269187709-18981857@hypernet.com> <38340F98.212F61@interet.com>
Message-ID: <383420BA.EF8A6AC5@lemburg.com>

[imputil and friends]

"James C. Ahlstrom" wrote:
> 
> IMHO the current import mechanism is good for developers who must
> work on the library code in the directory tree, but a disaster
> for sysadmins who must distribute Python applications either
> internally to a number of machines or commercially.  What we
> need is a standard Python library file like a Java "Jar" file.
> Imputil can support this as 130 lines of Python.  I have also
> written one in C.  I like the imputil approach, but if we want
> to add a library importer to import.c, I volunteer to write it.
> 
> I don't want to just add more complicated and unmanageable hooks
> which people will all use different ways and just add to the
> confusion.
> 
> It is easy to install packages by just making them into a library
> file and throwing it into a directory.  So why aren't we doing it?

Perhaps we ought to rethink the strategy under a different
light: what are the real requirements we have for Python imports ?

Perhaps the outcome is only the addition of say one or two features
and those can probably easily be added to the builtin system...
then we can just forget about the whole import hook dilemma
for quite a while (AFAIK, this is how we got packages into the
core -- people weren't happy with the import hook).

Well, just an idea... I have other threads to follow :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From fdrake at acm.org  Thu Nov 18 17:01:47 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 18 Nov 1999 11:01:47 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <3833E5EC.AAFE5016@lemburg.com>
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
	<3833E5EC.AAFE5016@lemburg.com>
Message-ID: <14388.8939.911928.41746@weyr.cnri.reston.va.us>

M.-A. Lemburg writes:
 > The problem is that the encoding names are not Python identifiers,
 > e.g. iso-8859-1 is not allowed as an identifier. This and
 > the fact that applications may want to ship their own codecs (which
 > do not get installed under the system wide encodings package)
 > make the registry necessary.

  This isn't a substantial problem.  Try this on for size (probably
not too different from what everyone is already thinking, but let's
make it clear).  This could be in encodings/__init__.py; I've tried to 
be really clear on the names.  (No testing, only partially complete.)

------------------------------------------------------------------------
import string
import sys

try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO


class EncodingError(Exception):
    def __init__(self, encoding, error):
        self.encoding = encoding
        self.strerror = "%s %s" % (error, `encoding`)
        self.error = error
        Exception.__init__(self, encoding, error)


_registry = {}

def registerEncoding(encoding, encode=None, decode=None,
                     make_stream_encoder=None, make_stream_decoder=None):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        info = _registry[encoding]
    else:
        info = _registry[encoding] = Codec(encoding)
    info._update(encode, decode,
                 make_stream_encoder, make_stream_decoder)


def getCodec(encoding):
    encoding = encoding.lower()
    if _registry.has_key(encoding):
        return _registry[encoding]

    # load the module
    modname = "encodings." + encoding.replace("-", "_")
    try:
        __import__(modname)
    except ImportError:
        raise EncodingError(encoding, "unknown encoding")

    # if the module registered, use the codec as-is:
    if _registry.has_key(encoding):
        return _registry[encoding]

    # nothing registered, use well-known names
    module = sys.modules[modname]
    codec = _registry[encoding] = Codec(encoding)
    encode = getattr(module, "encode", None)
    decode = getattr(module, "decode", None)
    make_stream_encoder = getattr(module, "make_stream_encoder", None)
    make_stream_decoder = getattr(module, "make_stream_decoder", None)
    codec._update(encode, decode,
                  make_stream_encoder, make_stream_decoder)
    return codec


class Codec:
    __encode = None
    __decode = None
    __stream_encoder_factory = None
    __stream_decoder_factory = None

    def __init__(self, name):
        self.name = name

    def encode(self, u):
        if self.__stream_encoder_factory:
            sio = StringIO()
            encoder = self.__stream_encoder_factory(sio)
            encoder.write(u)
            encoder.flush()
            return sio.getvalue()
        else:
            raise EncodingError("no encoder available for " + `self.name`)

    # similar for decode()...

    def make_stream_encoder(self, target):
        if self.__stream_encoder_factory:
            return self.__stream_encoder_factory(target)
        elif self.__encode:
            return DefaultStreamEncoder(target, self.__encode)
        else:
            raise EncodingError("no encoder available for " + `self.name`)

    # similar for make_stream_decoder()...

    def _update(self, encode, decode,
                make_stream_encoder, make_stream_decoder):
        self.__encode = encode or self.__encode
        self.__decode = decode or self.__decode
        self.__stream_encoder_factory = (
            make_stream_encoder or self.__stream_encoder_factory)
        self.__stream_decoder_factory = (
            make_stream_decoder or self.__stream_decoder_factory)
------------------------------------------------------------------------

 > I don't see a problem with the registry though -- the encodings
 > package can take care of the registration process without any

  No problem at all; we just need to make sure the right magic is
there for the "normal" case.

 > PS: we could probably even take the whole codec idea one step
 > further and also allow other input/output formats to be registered,

  File formats are different from text encodings, so let's keep them
separate.  Yes, a registry can be a good approach whenever the various 
things being registered are sufficiently similar semantically, but the 
behavior of the registry/lookup can be very different for each type of 
thing.  Let's not over-generalize.


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From fdrake at acm.org  Thu Nov 18 17:02:45 1999
From: fdrake at acm.org (Fred L. Drake, Jr.)
Date: Thu, 18 Nov 1999 11:02:45 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: <3833E5EC.AAFE5016@lemburg.com>
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
	<3833E5EC.AAFE5016@lemburg.com>
Message-ID: <14388.8997.703108.401808@weyr.cnri.reston.va.us>

  Er, I should note that the sample code I just sent makes use of
string methods.  ;)


  -Fred

--
Fred L. Drake, Jr.	     
Corporation for National Research Initiatives



From mal at lemburg.com  Thu Nov 18 17:23:09 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 17:23:09 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>  
	            <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us>
Message-ID: <383427ED.45A01BBB@lemburg.com>

Guido van Rossum wrote:
> 
> > The problem is that the encoding names are not Python identifiers,
> > e.g. iso-8859-1 is allowed as identifier.
> 
> This is easily taken care of by translating each string of consecutive
> non-identifier-characters to an underscore, so this would import the
> iso_8859_1.py module.  (I also noticed in an earlier post that the
> official name for Shift_JIS has an underscore, while most other
> encodings use hyphens.)

Right. That's one way of doing it.

> > This and
> > the fact that applications may want to ship their own codecs (which
> > do not get installed under the system wide encodings package)
> > make the registry necessary.
> 
> But it could be enough to register a package where to look for
> encodings (in addition to the system package).
> 
> Or there could be a registry for encoding search functions.  (See the
> import discussion.)

Like a path of search functions ? Not a bad idea... I will still
want the internal dict for caching purposes though. I'm not sure
how often these encodings will be looked up, but even a few hundred
function calls will slow down the Unicode implementation quite a bit.

The implementation could proceed as follows:

def lookup(encoding):

    # check the cache first (much like sys.modules does for imports)
    codecs = _internal_dict.get(encoding, None)
    if codecs:
        return codecs
    for query in sys.encoders:
        codecs = query(encoding)
        if codecs:
            break
    else:
        raise UnicodeError('unknown encoding: %s' % encoding)
    _internal_dict[encoding] = codecs
    return codecs

For simplicity, codecs should be a tuple (encoder,decoder,
stream_writer,stream_reader) of factory functions.

...that is if we can agree on these 4 APIs :-) Here are my
current versions:
-----------------------------------------------------------------------
class Codec:

    """ Defines the interface for stateless encoders/decoders.
    """

    def __init__(self,errors='strict'):

	""" Creates a Codec instance.

	    The Codec may implement different error handling
	    schemes by providing the errors argument. These values
	    are defined:

	     'strict' - raise a UnicodeError (or a subclass)
	     'ignore' - ignore the character and continue with the next
	     (a single character)
	              - replace erroneous characters with the given
	                character (may also be a Unicode character)

	"""
	self.errors = errors

    def encode(self,u,slice=None):
	
	""" Return the Unicode object u encoded as Python string.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is encoded.

	    The method may not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	"""
	...

    def decode(self,s,offset=0):

	""" Decodes data from the Python string s and returns a tuple 
	    (Unicode object, bytes consumed).
	
	    If offset is given, the decoding process starts at
	    s[offset]. It defaults to 0.

	    The method may not store state in the Codec instance. Use
	    StreamCodec for codecs which have to keep state in order to
	    make encoding/decoding efficient.

	""" 
	...


StreamWriter and StreamReader define the interface for stateful
encoders/decoders:

class StreamWriter(Codec):

    def __init__(self,stream,errors='strict'):

	""" Creates a StreamWriter instance.

	    stream must be a file-like object open for writing
	    (binary) data.

	    The StreamWriter may implement different error handling
	    schemes by providing the errors argument. These values
	    are defined:

	     'strict' - raise a UnicodeError (or a subclass)
	     'ignore' - ignore the character and continue with the next
	     (a single character)
	              - replace erroneous characters with the given
	                character (may also be a Unicode character)

	"""
	self.stream = stream

    def write(self,u,slice=None):

	""" Writes the Unicode object's contents encoded to self.stream
	    and returns the number of bytes written.

	    If slice is given (as slice object), only the sliced part
	    of the Unicode object is written.

        """
	... the base class should provide a default implementation
	    of this method using self.encode ...
	
    def flush(self):

	""" Flushed the codec buffers used for keeping state.

	    Returns values are not defined. Implementations are free to
	    return None, raise an exception (in case there is pending
	    data in the buffers which could not be decoded) or
	    return any remaining data from the state buffers used.

	"""
	pass

class StreamReader(Codec):

    def __init__(self,stream,errors='strict'):

	""" Creates a StreamReader instance.

	    stream must be a file-like object open for reading
	    (binary) data.

	    The StreamReader may implement different error handling
	    schemes by providing the errors argument. These values
	    are defined:

	     'strict' - raise a UnicodeError (or a subclass)
	     'ignore' - ignore the character and continue with the next
	     (a single character)
	              - replace erroneous characters with the given
	                character (may also be a Unicode character)

	"""
	self.stream = stream

    def read(self,chunksize=0):

	""" Decodes data from the stream self.stream and returns a tuple 
	    (Unicode object, bytes consumed).

	    chunksize indicates the approximate maximum number of
	    bytes to read from the stream for decoding purposes. The
	    decoder can modify this setting as appropriate. The default
	    value 0 indicates to read and decode as much as possible.
	    The chunksize is intended to prevent having to decode huge
	    files in one step.

        """
	... the base class should provide a default implementation
	    of this method using self.decode ...

    def flush(self):

	""" Flushed the codec buffers used for keeping state.

	    Returns values are not defined. Implementations are free to
	    return None, raise an exception (in case there is pending
	    data in the buffers which could not be decoded) or
	    return any remaining data from the state buffers used.

	"""

In addition to the above methods, the StreamWriter and StreamReader
instances should also provide access to all other methods defined for
the stream object.

Stream codecs are free to combine the StreamWriter and StreamReader
interfaces into one class.
-----------------------------------------------------------------------

> > I don't see a problem with the registry though -- the encodings
> > package can take care of the registration process without any
> > user interaction. There would only have to be an API for
> > looking up an encoding published by the encodings package for
> > the Unicode implementation to use. The magic behind that API
> > is left to the encodings package...
> 
> I think that the collection of encodings will eventually grow large
> enough to make it a requirement to avoid doing work proportional to
> the number of supported encodings at startup (or even when an encoding
> is referenced for the first time).  Any "lazy" mechanism (of which
> module search is an example) will do.

Right. The list of search functions should provide this kind
of laziness. It also provides ways to implement other strategies
to look for codecs, e.g. PIL could provide such a search function
for its codecs, mxCrypto for the included ciphers, etc.
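
To make that concrete, here is a minimal sketch of what such a search
function might look like, assuming the encodings-package layout sketched
above (one module per encoding, non-identifier characters mapped to
underscores); the "encodings" package name and the getregentry() hook are
illustrative assumptions, not an agreed API:

import re

def encodings_search(encoding):
    # hypothetical search function for the proposed sys.encoders path;
    # map e.g. 'iso-8859-1' to the module name 'iso_8859_1'
    modname = re.sub('[^A-Za-z0-9_]+', '_', encoding)
    try:
        mod = __import__('encodings.' + modname, {}, {}, ['getregentry'])
    except ImportError:
        return None      # unknown here; let the next search function try
    # each encoding module is assumed to expose its factory tuple as
    # (encoder, decoder, stream_writer, stream_reader)
    return mod.getregentry()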
 
> > BTW, nothing's wrong with your idea :-) In fact, I like it
> > a lot because it keeps the encoding modules out of the
> > top-level scope which is good.
> 
> Yes.
> 
> > PS: we could probably even take the whole codec idea one step
> > further and also allow other input/output formats to be registered,
> > e.g. stream ciphers or pickle mechanisms. The step in that
> > direction is not a big one: we'd only have to drop the specification
> > of the Unicode object in the spec and replace it with an arbitrary
> > object. Of course, this will still have to be a Unicode object
> > for use by the Unicode implementation.
> 
> This is a step towards Java's architecture of stackable streams.
> 
> But I'm always in favor of tackling what we know we need before
> tackling the most generalized version of the problem.

Well, I just wanted to mention the possibility... might be
something to look into next year. I find it rather thrilling
to be able to create encrypted streams by just hooking together
a few stream codecs...

f = open('myfile.txt','wb')

# the factory tuple is (encoder, decoder, stream_writer, stream_reader),
# so the stream writer factory is at index 2
CipherWriter = sys.codec('rc5-cipher')[2]
sf = CipherWriter(f,key='xxxxxxxx')

UTF8Writer = sys.codec('utf-8')[2]
sfx = UTF8Writer(sf)

sfx.write('asdfasdfasdfasdf')
sfx.close()

Hmm, we should probably define the additional constructor
arguments to be keyword arguments... writers/readers other
than Unicode ones will probably need different kinds of
parameters (such as the key in the above example).

Ahem, ...I'm getting distracted here :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From bwarsaw at cnri.reston.va.us  Thu Nov 18 17:23:41 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Thu, 18 Nov 1999 11:23:41 -0500 (EST)
Subject: [Python-Dev] Codecs and StreamCodecs
References: <002401bf31b3$bf16c230$0501a8c0@bobcat>
	<3833E5EC.AAFE5016@lemburg.com>
	<14388.8997.703108.401808@weyr.cnri.reston.va.us>
Message-ID: <14388.10253.902424.904199@anthem.cnri.reston.va.us>

>>>>> "Fred" == Fred L Drake, Jr  writes:

    Fred>   Er, I should note that the sample code I just sent makes
    Fred> use of string methods.  ;)

Yay!



From guido at CNRI.Reston.VA.US  Thu Nov 18 17:37:08 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 11:37:08 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Thu, 18 Nov 1999 17:23:09 +0100."
             <383427ED.45A01BBB@lemburg.com> 
References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us>  
            <383427ED.45A01BBB@lemburg.com> 
Message-ID: <199911181637.LAA04260@eric.cnri.reston.va.us>

> Like a path of search functions? Not a bad idea... I will still
> want the internal dict for caching purposes though. I'm not sure
> how often these encodings will be looked up, but even a few hundred
> function calls would slow down the Unicode implementation quite a bit.

Of course.  (It's like sys.modules caching the results of an import).

[...]
>     def flush(self):
> 
> 	""" Flushes the codec buffers used for keeping state.
> 
> 	    Return values are not defined. Implementations are free to
> 	    return None, raise an exception (in case there is pending
> 	    data in the buffers which could not be decoded) or
> 	    return any remaining data from the state buffers used.
> 
> 	"""

I don't know where this came from, but a flush() should work like
flush() on a file.  It doesn't return a value, it just sends any
remaining data to the underlying stream (for output).  For input it
shouldn't be supported at all.

The idea is that flush() should do the same to the encoder state that
close() followed by a reopen() would do.  Well, more or less.  But if
the process were to be killed right after a flush(), the data written
to disk should be a complete encoding, and not have a lingering shift
state.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Thu Nov 18 17:59:06 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 11:59:06 -0500
Subject: [Python-Dev] Codecs and StreamCodecs
In-Reply-To: Your message of "Thu, 18 Nov 1999 09:50:36 +0100."
             <3833BDDC.7CD2CC1F@lemburg.com> 
References: <38317FBA.4F3D6B1F@lemburg.com> <199911161620.LAA02643@eric.cnri.reston.va.us> <002e01bf3069$8477b440$f29b12c2@secret.pythonware.com> <3832757E.B9503606@lemburg.com> <004101bf30ea$eb3801e0$f29b12c2@secret.pythonware.com>  
            <3833BDDC.7CD2CC1F@lemburg.com> 
Message-ID: <199911181659.LAA04303@eric.cnri.reston.va.us>

[Responding to some lingering mails]

[/F]
> >     >>> u = unicode("å i åa ä e ö", "iso-latin-1")
> >     >>> s = u.encode("html-entities")
> >     >>> d = decoder("html-entities")
> >     >>> d.decode(s[:-1])
> >     "å i åa ä e "
> >     >>> d.flush()
> >     "ö"

[MAL]
> Ah, ok. So the .flush() method checks for proper
> string endings and then either returns the remaining
> input or raises an error.

No, please.  See my previous post on flush().

> > input: read chunks of data, decode, and
> > keep extra data in a local buffer.
> > 
> > output: encode data into suitable chunks,
> > and write to the output stream (that's why
> > there's a buffersize argument to encode --
> > if someone writes a 10mb unicode string to
> > an encoded stream, python shouldn't allocate
> > an extra 10-30 megabytes just to be able to
> > encode the darn thing...)
> 
> So the stream codecs would be wrappers around the
> string codecs.

No -- the other way around.  Think of the stream encoder as a little
FSM engine that you feed with unicode characters and which sends bytes
to the backend stream.  When a unicode character comes in that
requires a particular shift state, and the FSM isn't in that shift
state, it emits the escape sequence to enter that shift state first.
It should use standard buffered writes to the output stream; i.e. one
call to feed the encoder could cause several calls to write() on the
output stream, or vice versa (if you fed the encoder a single
character it might keep it in its own buffer).  That's all up to the
codec implementation.

The flush() forces the FSM into the "neutral" shift state, possibly
writing an escape sequence to leave the current shift state, and
empties the internal buffer.

The string codec CONCEPTUALLY uses the stream codec to a cStringIO
object, using flush() to force the final output.  However the
implementation may take a shortcut.  For stateless encodings the
stream codec may call on the string codec, but that's all an
implementation issue.
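
A toy sketch of that relationship (purely illustrative -- the shift-state
"encoding" below is made up, and StringIO stands in for the cStringIO of
the day):

from io import StringIO

class ToyShiftStateWriter:
    # writes '>' to enter a "wide" shift state for non-ASCII characters
    # and '<' to leave it again; not a real codec, just an FSM example
    def __init__(self, stream):
        self.stream = stream
        self.shifted = 0
    def write(self, text):
        for ch in text:
            if ord(ch) > 127 and not self.shifted:
                self.stream.write('>')     # escape into the shift state
                self.shifted = 1
            elif ord(ch) <= 127 and self.shifted:
                self.stream.write('<')     # escape back out
                self.shifted = 0
            self.stream.write(ch)
    def flush(self):
        # force the FSM back into the neutral shift state
        if self.shifted:
            self.stream.write('<')
            self.shifted = 0

def toy_encode(text):
    # the string codec conceptually drives the stream codec
    buf = StringIO()
    writer = ToyShiftStateWriter(buf)
    writer.write(text)
    writer.flush()      # complete encoding, no lingering shift state
    return buf.getvalue()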

For input, things are slightly different (you don't know how much
encoded data you must read to give you N Unicode characters, so you
may have to make a guess and hold on to some data that you read
unnecessarily -- either in encoded form or in Unicode form, at the
discretion of the implementation).  Using seek() on the input stream is
forbidden (it could be a pipe or socket).

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Thu Nov 18 18:11:51 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 12:11:51 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: Your message of "Thu, 18 Nov 1999 10:39:30 +0100."
             <3833C952.C6F154B1@lemburg.com> 
References: <000101bf3173$f9805340$c0a0143f@tim>  
            <3833C952.C6F154B1@lemburg.com> 
Message-ID: <199911181711.MAA04342@eric.cnri.reston.va.us>

> > > Now how should we define ur"abc\u1234\n"  ... ?
> > 
> > If strings carried an encoding tag with them, the obvious answer is that
> > this acts exactly like r"abc\u1234\n" acts today except gets a
> > "unicode-escaped" encoding tag instead of a "[whatever the default is
> > today]" encoding tag.
> > 
> > If strings don't carry an encoding tag with them, you're in a bit of a
> > pickle:  you'll have to convert it to a regular string or a Unicode string,
> > but in either case have no way to communicate that it may need further
> > processing; i.e., no way to distinguish it from a regular or Unicode string
> > produced by any other mechanism.  The code I posted yesterday remains my
> > best answer to that unpleasant puzzle (i.e., produce a Unicode string,
> > fiddling with backslashes just enough to get the \u escapes expanded, in the
> > same way Java's (conceptual) preprocessor does it).
> 
> They don't have such tags... so I guess we're in trouble ;-)
> 
> I guess to make ur"" have a meaning at all, we'd need to go
> the Java preprocessor way here, i.e. scan the string *only*
> for \uXXXX sequences, decode these and convert the rest as-is
> to Unicode ordinals.
> 
> Would that be ok ?

Read Tim's code (posted about 40 messages ago in this list).

Like Java, it interprets \u.... when the number of backslashes is odd,
but not when it's even.  So \\u.... returns exactly that, while
\\\u.... returns two backslashes and a unicode character.

This is nice and can be done regardless of whether we are going to
interpret other \ escapes or not.
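
In other words (a hedged sketch, not the actual implementation -- the
function name and the regular expression are mine):

import re

_u_escape = re.compile(r'(\\+)u([0-9a-fA-F]{4})')

def expand_u_escapes(text):
    # expand \uXXXX only when it is preceded by an odd number of
    # backslashes; with an even number, every backslash is literal
    def repl(m):
        backslashes, hexdigits = m.group(1), m.group(2)
        if len(backslashes) % 2:
            return backslashes[:-1] + chr(int(hexdigits, 16))
        return m.group(0)
    return _u_escape.sub(repl, text)

# expand_u_escapes(r'\u0041')   -> 'A'
# expand_u_escapes(r'\\u0041')  -> unchanged (backslash, backslash, 'u0041')
# expand_u_escapes(r'\\\u0041') -> two backslashes followed by 'A'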

--Guido van Rossum (home page: http://www.python.org/~guido/)



From skip at mojam.com  Thu Nov 18 18:34:51 1999
From: skip at mojam.com (Skip Montanaro)
Date: Thu, 18 Nov 1999 11:34:51 -0600 (CST)
Subject: [Python-Dev] just say no...
In-Reply-To: <000401bf30d8$6cf30bc0$a42d153f@tim>
References: <383156DF.2209053F@lemburg.com>
	<000401bf30d8$6cf30bc0$a42d153f@tim>
Message-ID: <14388.14523.158050.594595@dolphin.mojam.com>

    >> FYI, the next version of the proposal ...  File objects opened in
    >> text mode will use "t#" and binary ones use "s#".

    Tim> Am I the only one who sees magical distinctions between text and
    Tim> binary mode as a Really Bad Idea? 

No.

    Tim> I wouldn't have guessed the Unix natives here would quietly
    Tim> acquiesce to importing a bit of Windows madness .

We figured you and Guido would come to our rescue... ;-)

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From mal at lemburg.com  Thu Nov 18 19:15:54 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 19:15:54 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.7
References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com>
Message-ID: <3834425A.8E9C3B7E@lemburg.com>

FYI, I've uploaded a new version of the proposal which includes
new codec APIs, a new codec search mechanism and some minor
fixes here and there.

The latest version of the proposal is available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

    - Unicode object support for %-formatting

    - Design of the internal C API and the Python API for
      the Unicode character properties database

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Thu Nov 18 19:32:49 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 19:32:49 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf3173$f9805340$c0a0143f@tim>  
	            <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us>
Message-ID: <38344651.960878A2@lemburg.com>

Guido van Rossum wrote:
> 
> > I guess to make ur"" have a meaning at all, we'd need to go
> > the Java preprocessor way here, i.e. scan the string *only*
> > for \uXXXX sequences, decode these and convert the rest as-is
> > to Unicode ordinals.
> >
> > Would that be ok ?
> 
> Read Tim's code (posted about 40 messages ago in this list).

I did, but wasn't sure whether he was arguing for going the
Java way...
 
> Like Java, it interprets \u.... when the number of backslashes is odd,
> but not when it's even.  So \\u.... returns exactly that, while
> \\\u.... returns two backslashes and a unicode character.
> 
> This is nice and can be done regardless of whether we are going to
> interpret other \ escapes or not.

So I'll take that as: this is what we want in Python too :-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From mal at lemburg.com  Thu Nov 18 19:38:41 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Thu, 18 Nov 1999 19:38:41 +0100
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
References: <000101bf3173$f9805340$c0a0143f@tim>  
	            <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us>
Message-ID: <383447B1.1B7B594C@lemburg.com>

Would this definition be fine ?
"""

  u = ur''

The 'raw-unicode-escape' encoding is defined as follows:

- a \uXXXX sequence represents the U+XXXX Unicode character if and
  only if the number of leading backslashes is odd

- all other characters represent themselves as Unicode ordinals
  (e.g. 'b' -> U+0062)

"""

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/



From guido at CNRI.Reston.VA.US  Thu Nov 18 19:46:35 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 13:46:35 -0500
Subject: [Python-Dev] just say no...
In-Reply-To: Your message of "Thu, 18 Nov 1999 11:34:51 CST."
             <14388.14523.158050.594595@dolphin.mojam.com> 
References: <383156DF.2209053F@lemburg.com> <000401bf30d8$6cf30bc0$a42d153f@tim>  
            <14388.14523.158050.594595@dolphin.mojam.com> 
Message-ID: <199911181846.NAA04547@eric.cnri.reston.va.us>

>     >> FYI, the next version of the proposal ...  File objects opened in
>     >> text mode will use "t#" and binary ones use "s#".
> 
>     Tim> Am I the only one who sees magical distinctions between text and
>     Tim> binary mode as a Really Bad Idea? 
> 
> No.
> 
>     Tim> I wouldn't have guessed the Unix natives here would quietly
>     Tim> acquiesce to importing a bit of Windows madness .
> 
> We figured you and Guido would come to our rescue... ;-)

Don't count on me.  My brain is totally cross-platform these days, and
writing "rb" or "wb" for files containing binary data is second nature
for me.  I actually *like* it.

Anyway, the Unicode stuff ought to have a wrapper open(filename, mode,
encoding) where the 'b' will be added to the mode if you don't give it
and it's needed.
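
Something along these lines, say (a minimal sketch; the function name,
and reusing the lookup() factory tuple from earlier in this thread, are
assumptions rather than a settled API):

def unicode_open(filename, mode='r', encoding=None):
    if encoding is None:
        return open(filename, mode)
    if 'b' not in mode:
        mode = mode + 'b'       # encoded data is bytes, so force binary
    f = open(filename, mode)
    encoder, decoder, stream_writer, stream_reader = lookup(encoding)
    if mode[0] in 'wa':
        return stream_writer(f)
    return stream_reader(f)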

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Thu Nov 18 19:50:20 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 13:50:20 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: Your message of "Thu, 18 Nov 1999 19:32:49 +0100."
             <38344651.960878A2@lemburg.com> 
References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us>  
            <38344651.960878A2@lemburg.com> 
Message-ID: <199911181850.NAA04576@eric.cnri.reston.va.us>

> > Like Java, it interprets \u.... when the number of backslashes is odd,
> > but not when it's even.  So \\u.... returns exactly that, while
> > \\\u.... returns two backslashes and a unicode character.
> > 
> > This is nice and can be done regardless of whether we are going to
> > interpret other \ escapes or not.
> 
> So I'll take that as: this is what we want in Python too :-)

I'll reserve judgement until we've got some experience with it in the
field, but it seems the best compromise.  It also gives a clear
explanation about why we have \uXXXX when we already have \xXXXX.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From guido at CNRI.Reston.VA.US  Thu Nov 18 19:57:36 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 13:57:36 -0500
Subject: UTF-8 in source code (Re: [Python-Dev] Internationalization Toolkit)
In-Reply-To: Your message of "Thu, 18 Nov 1999 19:38:41 +0100."
             <383447B1.1B7B594C@lemburg.com> 
References: <000101bf3173$f9805340$c0a0143f@tim> <3833C952.C6F154B1@lemburg.com> <199911181711.MAA04342@eric.cnri.reston.va.us>  
            <383447B1.1B7B594C@lemburg.com> 
Message-ID: <199911181857.NAA04617@eric.cnri.reston.va.us>

> Would this definition be fine ?
> """
> 
>   u = ur''
> 
> The 'raw-unicode-escape' encoding is defined as follows:
> 
> - a \uXXXX sequence represents the U+XXXX Unicode character if and
>   only if the number of leading backslashes is odd
> 
> - all other characters represent themselves as Unicode ordinals
>   (e.g. 'b' -> U+0062)
> 
> """

Yes.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From skip at mojam.com  Thu Nov 18 20:09:46 1999
From: skip at mojam.com (Skip Montanaro)
Date: Thu, 18 Nov 1999 13:09:46 -0600 (CST)
Subject: [Python-Dev] Unicode Proposal: Version 0.7
In-Reply-To: <3834425A.8E9C3B7E@lemburg.com>
References: <382C0A54.E6E8328D@lemburg.com>
	<382D625B.DC14DBDE@lemburg.com>
	<38316685.7977448D@lemburg.com>
	<3834425A.8E9C3B7E@lemburg.com>
Message-ID: <14388.20218.294814.234327@dolphin.mojam.com>

I haven't been following this discussion closely at all, and have no
previous experience with Unicode, so please pardon a couple stupid questions
from the peanut gallery:

    1. What does U+0061 mean (other than 'a')?  That is, what is U?

    2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter
       description.  Given a Unicode object with encoding e1, how do I write
       it to a file that is to be encoded with encoding e2?  Seems like I
       would do something like

           u1 = unicode(s, encoding=e1)
	   f = open("somefile", "wb")
	   u2 = unicode(u1, encoding=e2)
	   f.write(u2)

       Is that how it would be done?  Does this question even make sense?

    3. What will the impact be on programmers such as myself currently
       living with blinders on (that is, writing in plain old 7-bit ASCII)?

Thx,

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From jim at interet.com  Thu Nov 18 20:23:53 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Thu, 18 Nov 1999 14:23:53 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us>
Message-ID: <38345249.4AFD91DA@interet.com>

Guido van Rossum wrote:
>
> Let's first complete the requirements gathering.

Yes.

> Are these
> requirements reasonable?  Will they make an implementation too
> complex?

I think you can get 90% of where you want to be with something
much simpler.  And the simpler implementation will be useful in
the 100% solution, so it is not wasted time.

How about if we just design a Python archive file format; provide
code in the core (in Python or C) to import from it; provide a
Python program to create archive files; and provide a Standard
Directory to put archives in so they can be found quickly.  For
extensibility and control, we add functions to the imp module.
Detailed comments follow:


> Compatibility issues:
> ---------------------
> [list of current features...]

Easily met by keeping the current C code.

> 
> New features:
> -------------
> 
> - Integrated support for Greg Ward's distribution utilities (i.e. a
>   module prepared by the distutil tools should install painlessly)
> 
> - Good support for prospective authors of "all-in-one" packaging tool
>   authors like Gordon McMillan's win32 installer or /F's squish.  (But
>   I *don't* require backwards compatibility for existing tools.)

These tools go well beyond just an archive file format, but hopefully
a file format will help.  Greg and Gordon should be able to control the
format so it meets their needs.  We need a standard format.
 
> - Standard import from zip or jar files, in two ways:
> 
>   (1) an entry on sys.path can be a zip/jar file instead of a directory;
>       its contents will be searched for modules or packages
> 
>   (2) a file in a directory that's on sys.path can be a zip/jar file;
>       its contents will be considered as a package (note that this is
>       different from (1)!)

I don't like sys.path at all.  It is currently part of the problem.
I suggest that archive files MUST be put into a known directory.
On Windows this is the directory of the executable, sys.executable.
On Unix this is $PREFIX plus the version, namely
  "%s/lib/python%s/" % (sys.prefix, sys.version[0:3]).
Other platforms can have different rules.

We should also have the ability to append archive files to the
executable or a shared library assuming the OS allows this
(Windows and Linux do allow it).  This is the first location
searched, nails the archive to the interpreter, insulates us
from an erroneous sys.path, and enables single-file Python programs.

>   I don't particularly care about supporting all zip compression
>   schemes; if Java gets away with only supporting gzip compression
>   in jar files, so can we.

We don't need compression.  The whole ./Lib is 1.2 Meg, and if we
compress it to zero we save a Meg.  Irrelevant.  Installers provide compression
anyway so when Python programs are shipped, they will be compressed
then.

Problems are that Python does not ship with compression, we will
have to add it, we will have to support it and its current method
of compression forever, and it adds complexity.
 
> - Easy ways to subclass or augment the import mechanism along
>   different dimensions.  For example, while none of the following
>   features should be part of the core implementation, it should be
>   easy to add any or all:
>
>  [ List of new features including hooks...]

Sigh, this proposal does not provide for this.  It seems
like a job for imputil.  But if the file format and import code
is available from the imp module, it can be used as part of the
solution.

>   - support for a new compression scheme to the zip importer

I guess compression should be easy to add if Python ships with
a compression module.
 
>   - a cache for file locations in directories/archives, to improve
>     startup time

If the Python library is available as an archive, I think
startup will be greatly improved anyway.
 
> Implementation:
> ---------------
> 
> - There must clearly be some code in C that can import certain
>   essential modules (to solve the chicken-or-egg problem), but I don't
>   mind if the majority of the implementation is written in Python.
>   Using Python makes it easy to subclass.

Yes.
 
> - In order to support importing from zip/jar files using compression,
>   we'd at least need the zlib extension module and hence libz itself,
>   which may not be available everywhere.

That's a good reason to omit compression.  At least for now.
 
> - I suppose that the bootstrap is solved using a mechanism very
>   similar to what freeze currently used (other solutions seem to be
>   platform dependent).

Yes, except that we need to be careful to preserve the freeze feature
for users.  We don't want to take it over.
 
> - I also want to still support importing *everything* from the
>   filesystem, if only for development.  (It's hard enough to deal with
>   the fact that exceptions.py is needed during Py_Initialize();
>   I want to be able to hack on the import code written in Python
>   without having to rebuild the executable all the time.)

Yes, we need a function in imp to turn archives off:
  import imp
  imp.archiveEnable(0)
 
> Finally, to what extent does this impact the desire for dealing
> differently with the Python bytecode compiler (e.g. supporting
> optimizers written in Python)?  And does it affect the desire to
> implement the read-eval-print loop (the >>> prompt) in Python?

I don't think it impacts these at all.

Jim Ahlstrom



From guido at CNRI.Reston.VA.US  Thu Nov 18 20:55:02 1999
From: guido at CNRI.Reston.VA.US (Guido van Rossum)
Date: Thu, 18 Nov 1999 14:55:02 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
In-Reply-To: Your message of "Thu, 18 Nov 1999 14:23:53 EST."
             <38345249.4AFD91DA@interet.com> 
References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us>  
            <38345249.4AFD91DA@interet.com> 
Message-ID: <199911181955.OAA04830@eric.cnri.reston.va.us>

> I think you can get 90% of where you want to be with something
> much simpler.  And the simpler implementation will be useful in
> the 100% solution, so it is not wasted time.

Agreed, but I'm not sure that it addresses the problems that started
this thread.  I can't really tell, since the message starting the
thread just requested imputil, without saying which parts of it were
needed.  A followup claimed that imputil was a fine prototype but too
slow for real work.

I inferred that flexibility was requested.  But maybe that was
projection since that was on my own list.  (I'm happy with the
performance and find manipulating zip or jar files clumsy, so I'm not
too concerned about all the nice things you can *do* with that
flexibility. :-)

> How about if we just design a Python archive file format; provide
> code in the core (in Python or C) to import from it; provide a
> Python program to create archive files; and provide a Standard
> Directory to put archives in so they can be found quickly.  For
> extensibility and control, we add functions to the imp module.
> Detailed comments follow:

> These tools go well beyond just an archive file format, but hopefully
> a file format will help.  Greg and Gordon should be able to control the
> format so it meets their needs.  We need a standard format.

I think the standard format should be a subclass of zip or jar (which
is itself a subclass of zip).  We have already written (at CNRI, as
yet unreleased) the necessary Python tools to manipulate zip archives;
moreover 3rd party tools are abundantly available, both on Unix and on
Windows (as well as in Java).  Zip files also lend themselves to
self-extracting archives and similar things, because the file index is
at the end, so I think that Greg & Gordon should be happy.

> I don't like sys.path at all.  It is currently part of the problem.

Eh?  That's the first thing I hear something bad about it.  Maybe
that's because you live on Windows -- on Unix, search paths are
ubiquitous.

> I suggest that archive files MUST be put into a known directory.

Why?  Maybe this works on Windows; on Unix this is asking for trouble
because it prevents users from augmenting the installation provided by
the sysadmin.  Even on newer Windows versions, users without admin
perms may not be allowed to add files to that privileged directory.

> On Windows this is the directory of the executable, sys.executable.
> On Unix this $PREFIX plus version, namely
>   "%s/lib/python%s/" % (sys.prefix, sys.version[0:3]).
> Other platforms can have different rules.
> 
> We should also have the ability to append archive files to the
> executable or a shared library assuming the OS allows this
> (Windows and Linux do allow it).  This is the first location
> searched, nails the archive to the interpreter, insulates us
> from an erroneous sys.path, and enables single-file Python programs.

OK for the executable.  I'm not sure what the point is of appending an
archive to the shared library?  Anyway, does it matter (on Windows) if
you add it to python16.dll or to python.exe?

> We don't need compression.  The whole ./Lib is 1.2 Meg, and if we
> compress it to zero we save a Meg.  Irrelevant.  Installers provide compression
> anyway so when Python programs are shipped, they will be compressed
> then.
> 
> Problems are that Python does not ship with compression, we will
> have to add it, we will have to support it and its current method
> of compression forever, and it adds complexity.

OK, OK.  I think most zip tools have a way to turn off the
compression.  (Anyway, it's a matter of more I/O time vs. more CPU
time; hardware for both is getting better faster than we can tweak the
code :-)

> Sigh, this proposal does not provide for this.  It seems
> like a job for imputil.  But if the file format and import code
> is available from the imp module, it can be used as part of the
> solution.

Well, the question is really if we want flexibility or archive files.
I care more about the flexibility.  If we get a clear vote for archive
files, I see no problem with implementing that first.

> If the Python library is available as an archive, I think
> startup will be greatly improved anyway.

Really?  I know about all the system calls it makes, but I don't
really see much of a delay -- I have a prompt in well under 0.1
second.

--Guido van Rossum (home page: http://www.python.org/~guido/)



From gstein at lyra.org  Thu Nov 18 23:03:55 1999
From: gstein at lyra.org (Greg Stein)
Date: Thu, 18 Nov 1999 14:03:55 -0800 (PST)
Subject: [Python-Dev] file modes (was: just say no...)
In-Reply-To: <3833B588.1E31F01B@lemburg.com>
Message-ID: 

On Thu, 18 Nov 1999, M.-A. Lemburg wrote:
> Tim Peters wrote:
> > [MAL]
> > > File objects opened in text mode will use "t#" and binary
> > > ones use "s#".
> > 
> > [Greg Stein]
> > > ...
> > > The real annoying thing would be to assume that opening a file as 'r'
> > > means that I *meant* text mode and to start using "t#".
> > 
> > Isn't that exactly what MAL said would happen?  Note that a "t" flag for
> > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't
> > either; a lone "r" has always meant text mode.
> 
> Em, I think you've got something wrong here: "t#" refers to the
> parsing marker used for writing data to files opened in text mode.

Nope. We've got it right :-)

Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to
refer to the parse marker.

>...
> I guess you won't notice any difference: strings define both
> interfaces ("s#" and "t#") to mean the same thing. Only other
> buffer compatible types may now fail to write to text files
> -- which is not so bad, because it forces the programmer to
> rethink what he really intended when opening the file in text
> mode.

It *is* bad if it breaks my existing programs in subtle ways that are a
bitch to track down.

> Besides, if you are writing portable scripts you should pay
> close attention to "r" vs. "rb" anyway.

I'm not writing portable scripts. I mentioned that once before. I don't
want a difference between 'r' and 'rb' on my Linux box. It was never there
before, I'm lazy, and I don't want to see it added :-).

Honestly, I don't know offhand of any Python types that respond to "s#" and
"t#" in different ways, such that changing file.write would end up writing
something different (and thereby breaking existing code).

I just don't like introducing text/binary to *nix platforms where it didn't
exist before.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From skip at mojam.com  Thu Nov 18 23:15:43 1999
From: skip at mojam.com (Skip Montanaro)
Date: Thu, 18 Nov 1999 16:15:43 -0600 (CST)
Subject: [Python-Dev] file modes (was: just say no...)
In-Reply-To: 
References: <3833B588.1E31F01B@lemburg.com>
	
Message-ID: <14388.31375.296388.973848@dolphin.mojam.com>

    Greg> I'm not writing portable scripts. I mentioned that once before. I
    Greg> don't want a difference between 'r' and 'rb' on my Linux box. It
    Greg> was never there before, I'm lazy, and I don't want to see it added
    Greg> :-).

    ...

    Greg> I just don't like introduce text/binary to *nix platforms where it
    Greg> didn't exist before.

I'll vote with Greg, Guido's cross-platform conversion notwithstanding.  If
I haven't been writing portable scripts up to this point because I only care
about a single target platform, why break my scripts for me?  Forcing me to
use "rb" or "wb" on my open calls isn't going to make them portable anyway.

There are probably many other harder to identify and correct portability
issues than binary file access anyway.  Seems like requiring "b" is just
going to cause gratuitous breakage with no obvious increase in portability.

porta-nanny.py-anyone?-ly y'rs,

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From jim at interet.com  Thu Nov 18 23:40:05 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Thu, 18 Nov 1999 17:40:05 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us>  
	            <38345249.4AFD91DA@interet.com> <199911181955.OAA04830@eric.cnri.reston.va.us>
Message-ID: <38348045.BB95F783@interet.com>

Guido van Rossum wrote:

> I think the standard format should be a subclass of zip or jar (which
> is itself a subclass of zip).  We have already written (at CNRI, as
> yet unreleased) the necessary Python tools to manipulate zip archives;
> moreover 3rd party tools are abundantly available, both on Unix and on
> Windows (as well as in Java).  Zip files also lend themselves to
> self-extracting archives and similar things, because the file index is
> at the end, so I think that Greg & Gordon should be happy.

Think about multiple packages in multiple zip files.  The zip files
store file directories.  That means we would need a sys.zippath to
search the zip files.  I don't want another PYTHONPATH phenomenon.

Greg Stein and I once discussed this (and Gordon I think).  They
argued that the directories should be flattened.  That is, think of
all directories which can be reached on PYTHONPATH.  Throw
away all initial paths.  The resultant archive has *.pyc at the top level,
as well as package directories only.  The search path is "." in every
archive file.  No directory information is stored, only module names,
some with dots.
 
> > I don't like sys.path at all.  It is currently part of the problem.
> 
> Eh?  That's the first thing I hear something bad about it.  Maybe
> that's because you live on Windows -- on Unix, search paths are
> ubiquitous.

On windows, just print sys.path.  It is junk.  A commercial
distribution has to "just work", and it fails if a second installation
(by someone else) changes PYTHONPATH to suit their app.  I am trying
to get to "just works", no excuses, no complications.
 
> > I suggest that archive files MUST be put into a known directory.
> 
> Why?  Maybe this works on Windows; on Unix this is asking for trouble
> because it prevents users from augmenting the installation provided by
> the sysadmin.  Even on newer Windows versions, users without admin
> perms may not be allowed to add files to that privileged directory.

It works on Windows because programs install themselves in their own
subdirectories, and can put files there instead of /windows/system32.
This holds true for Windows 2000 also.  A Unix-style installation
to /windows/system32 would (may?) require "administrator" privilege.

On Unix you are right.  I didn't think of that because I am the Unix
sysadmin here, so I can put things where I want.  The Windows
solution doesn't fit with Unix, because executables go in a ./bin
directory and putting library files there is a no-no.  Hmmmm...
This needs more thought.  Anyone else have ideas??

> > We should also have the ability to append archive files to the
> > executable or a shared library assuming the OS allows this
>
> OK for the executable.  I'm not sure what the point is of appending an
> archive to the shared library?  Anyway, does it matter (on Windows) if
> you add it to python16.dll or to python.exe?

The point of using python16.dll is to append the Python library to
it, and append to python.exe (or use files) for everything else.
That way, the 1.6 interpreter is linked to the 1.6 Lib, upgrading
to 1.7 means replacing only one file, and there is no wasted storage
in multiple Lib's.  I am thinking of multiple Python programs in
different directories.

But maybe you are right.  On Windows, if python.exe can be put in
/windows/system32 then it really doesn't matter.

> OK, OK.  I think most zip tools have a way to turn off the
> compression.  (Anyway, it's a matter of more I/O time vs. more CPU
> time; hardare for both is getting better faster than we can tweak the
> code :-)

Well, if Python now has its own compression that is built
in and comes with it, then that is different.  Maybe compression
is OK.
 
> Well, the question is really if we want flexibility or archive files.
> I care more about the flexibility.  If we get a clear vote for archive
> files, I see no problem with implementing that first.

I don't like flexibility, I like standardization and simplicity.
Flexibility just encourages users to do the wrong thing.

Everyone vote please.  I don't have a solid feeling about
what people want, only what they don't like.
 
> > If the Python library is available as an archive, I think
> > startup will be greatly improved anyway.
> 
> Really?  I know about all the system calls it makes, but I don't
> really see much of a delay -- I have a prompt in well under 0.1
> second.

So do I.  I guess I was just echoing someone else's complaint.

JimA



From mal at lemburg.com  Fri Nov 19 00:28:31 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 00:28:31 +0100
Subject: [Python-Dev] file modes (was: just say no...)
References: 
Message-ID: <38348B9F.A31B09C4@lemburg.com>

Greg Stein wrote:
> 
> On Thu, 18 Nov 1999, M.-A. Lemburg wrote:
> > Tim Peters wrote:
> > > [MAL]
> > > > File objects opened in text mode will use "t#" and binary
> > > > ones use "s#".
> > >
> > > [Greg Stein]
> > > > ...
> > > > The real annoying thing would be to assume that opening a file as 'r'
> > > > means that I *meant* text mode and to start using "t#".
> > >
> > > Isn't that exactly what MAL said would happen?  Note that a "t" flag for
> > > "text mode" is an MS extension -- C doesn't define "t", and Python doesn't
> > > either; a lone "r" has always meant text mode.
> >
> > Em, I think you've got something wrong here: "t#" refers to the
> > parsing marker used for writing data to files opened in text mode.
> 
> Nope. We've got it right :-)
> 
> Tim and I used 'r' and "t" to refer to file-open modes. I used "t#" to
> refer to the parse marker.

Ah, ok. But "t" as file opener is non-portable anyways, so I'll
skip it here :-)
 
> >...
> > I guess you won't notice any difference: strings define both
> > interfaces ("s#" and "t#") to mean the same thing. Only other
> > buffer compatible types may now fail to write to text files
> > -- which is not so bad, because it forces the programmer to
> > rethink what he really intended when opening the file in text
> > mode.
> 
> It *is* bad if it breaks my existing programs in subtle ways that are a
> bitch to track down.
> 
> > Besides, if you are writing portable scripts you should pay
> > close attention to "r" vs. "rb" anyway.
> 
> I'm not writing portable scripts. I mentioned that once before. I don't
> want a difference between 'r' and 'rb' on my Linux box. It was never there
> before, I'm lazy, and I don't want to see it added :-).
> 
> Honestly, I don't know offhand of any Python types that repond to "s#" and
> "t#" in different ways, such that changing file.write would end up writing
> something different (and thereby breaking existing code).
> 
> I just don't like introduce text/binary to *nix platforms where it didn't
> exist before.

Please remember that up until now you were probably only using
strings to write to files. Python strings don't differentiate
between "t#" and "s#" so you wont see any change in function
or find subtle errors being introduced.

If you are already using the buffer feature for e.g. arrays, which
also implement "s#" but don't support "t#" for obvious reasons,
you'll run into trouble, but then: arrays are binary data,
so changing from text mode to binary mode is well worth the
effort even if you just consider it a nuisance.

Since the buffer interface and its consequences haven't been
published yet, there are probably very few users out there who would
actually run into any problems. And even if they do, it's a
good chance to catch subtle bugs which would only have shown
up when trying to port to another platform.

I'll leave the rest for Guido to answer, since it was his idea ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 19 00:41:32 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 00:41:32 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.7
References: <382C0A54.E6E8328D@lemburg.com>
		<382D625B.DC14DBDE@lemburg.com>
		<38316685.7977448D@lemburg.com>
		<3834425A.8E9C3B7E@lemburg.com> <14388.20218.294814.234327@dolphin.mojam.com>
Message-ID: <38348EAC.82B41A4D@lemburg.com>

Skip Montanaro wrote:
> 
> I haven't been following this discussion closely at all, and have no
> previous experience with Unicode, so please pardon a couple stupid questions
> from the peanut gallery:
> 
>     1. What does U+0061 mean (other than 'a')?  That is, what is U?

U+XXXX means Unicode character with ordinal hex number XXXX. It is
basically just another way to say, hey I want the Unicode character
at position 0xXXXX in the Unicode spec.
 
>     2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter
>        description.  Given a Unicode object with encoding e1, how do I write
>        it to a file that is to be encoded with encoding e2?  Seems like I
>        would do something like
> 
>            u1 = unicode(s, encoding=e1)
>            f = open("somefile", "wb")
>            u2 = unicode(u1, encoding=e2)
>            f.write(u2)
> 
>        Is that how it would be done?  Does this question even make sense?

The unicode() constructor converts all input to Unicode as
basis for other conversions. In the above example, s would be
converted to Unicode using the assumption that the bytes in
s represent characters encoded using the encoding given in e1.
The line with u2 would raise a TypeError, because u1 is not
a string. To convert a Unicode object u1 to another encoding,
you would have to call the .encode() method with the intended
new encoding. The Unicode object will then take care of the
conversion of its internal Unicode data into a string using
the given encoding, e.g. you'd write:

f.write(u1.encode(e2))
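
Put differently, Skip's example would read roughly like this under the
proposed API (hypothetical, since unicode() and .encode() are still
proposal-stage here; e1 and e2 are just example encodings):

e1, e2 = "latin-1", "utf-8"      # example encodings
s = "caf\xe9"                    # byte string known to be encoded as e1
u1 = unicode(s, e1)              # decode: e1 bytes -> Unicode object
f = open("somefile", "wb")       # binary mode, since we write encoded bytes
f.write(u1.encode(e2))           # encode: Unicode -> e2 bytes
f.close()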
 
>     3. What will the impact be on programmers such as myself currently
>        living with blinders on (that is, writing in plain old 7-bit ASCII)?

If you don't want your scripts to know about Unicode, nothing
will really change. In case you do use e.g. Latin-1 characters
in your scripts for strings, you are asked to include a pragma
in the comment lines at the beginning of the script (so that
programmers viewing your code using other encoding have a chance
to figure out what you've written).

Here's the text from the proposal:
"""
Note that you should provide some hint to the encoding you used to
write your programs as pragma line in one the first few comment lines
of the source file (e.g. '# source file encoding: latin-1'). If you
only use 7-bit ASCII then everything is fine and no such notice is
needed, but if you include Latin-1 characters not defined in ASCII, it
may well be worthwhile including a hint since people in other
countries will want to be able to read your source strings too.
"""

Other than that you can continue to use normal strings like
you always have.

Hope that clarifies things at least a bit,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mhammond at skippinet.com.au  Fri Nov 19 01:27:09 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 19 Nov 1999 11:27:09 +1100
Subject: [Python-Dev] file modes (was: just say no...)
In-Reply-To: <38348B9F.A31B09C4@lemburg.com>
Message-ID: <003401bf3224$d231be30$0501a8c0@bobcat>

[MAL]

> If you are already using the buffer feature for e.g. array which
> also implement "s#" but don't support "t#" for obvious reasons
> you'll run into trouble, but then: arrays are binary data,
> so changing from text mode to binary mode is well worth the
> effort even if you just consider it a nuisance.

Breaking existing code that works should be considered more than a
nuisance.

However, one answer would be to have "t#" _prefer_ to use the text
buffer, but not insist on it.  eg, the logic for processing "t#" could
check if the text buffer is supported, and if not move back to the
blob buffer.

This should mean that all existing code still works, except for
objects that support both buffers to mean different things.  AFAIK
there are no objects that qualify today, so it should work fine.

Unix users _will_ need to revisit their thinking about "text mode" vs
"binary mode" when writing these new objects (such as Unicode), but
IMO that is more than reasonable - Unix users don't bother qualifying
the open mode of their files, simply because it has no effect on their
files.  If for certain objects or requirements there _is_ a
distinction, then new code can start to think these issues through.
"Portable File IO" will simply be extended from "portable among
all platforms" to "portable among all platforms and objects".

Mark.




From gmcm at hypernet.com  Fri Nov 19 03:23:44 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Thu, 18 Nov 1999 21:23:44 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
In-Reply-To: <38348045.BB95F783@interet.com>
Message-ID: <1269144272-21594530@hypernet.com>

[Guido]
> > I think the standard format should be a subclass of zip or jar
> > (which is itself a subclass of zip).  We have already written
> > (at CNRI, as yet unreleased) the necessary Python tools to
> > manipulate zip archives; moreover 3rd party tools are
> > abundantly available, both on Unix and on Windows (as well as
> > in Java).  Zip files also lend themselves to self-extracting
> > archives and similar things, because the file index is at the
> > end, so I think that Greg & Gordon should be happy.

No problem (I created my own formats for relatively minor 
reasons).
 
[JimA]
> Think about multiple packages in multiple zip files.  The zip
> files store file directories.  That means we would need a
> sys.zippath to search the zip files.  I don't want another
> PYTHONPATH phenomenon.

What if sys.path looked like:
 [DirImporter('.'), ZlibImporter('c:/python/stdlib.pyz'), ...]
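
(A rough sketch of what such an entry could look like -- DirImporter is
taken from the line above, but the get_code() protocol and its internals
are made up purely for illustration:)

import os

class DirImporter:
    # hypothetical sys.path entry: an importer object instead of a string
    def __init__(self, path):
        self.path = path
    def get_code(self, modname):
        # return a code object for modname, or None so that the next
        # sys.path entry gets a chance
        filename = os.path.join(self.path, modname + '.py')
        if not os.path.exists(filename):
            return None
        return compile(open(filename).read(), filename, 'exec')

# sys.path could then mix plain directories and archive importers, e.g.
# sys.path = [DirImporter('.'), ZlibImporter('c:/python/stdlib.pyz'), ...]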
 
> Greg Stein and I once discussed this (and Gordon I think).  They
> argued that the directories should be flattened.  That is, think
> of all directories which can be reached on PYTHONPATH.  Throw
> away all initial paths.  The resultant archive has *.pyc at the
> top level, as well as package directories only.  The search path
> is "." in every archive file.  No directory information is
> stored, only module names, some with dots.

While I do flat archives (no dots, but that's a different story), 
there's no reason the archive couldn't be structured. Flat 
archives are definitely simpler.
 
[JimA]
> > > I don't like sys.path at all.  It is currently part of the
> > > problem.
[Guido] 
> > Eh?  That's the first thing I hear something bad about it. 
> > Maybe that's because you live on Windows -- on Unix, search
> > paths are ubiquitous.
> 
> On windows, just print sys.path.  It is junk.  A commercial
> distribution has to "just work", and it fails if a second
> installation (by someone else) changes PYTHONPATH to suit their
> app.  I am trying to get to "just works", no excuses, no
> complications.

		Py_Initialize ();
		PyRun_SimpleString ("import sys; del sys.path[1:]");

Yeah, there's a hole there. Fixable if you could do a little
pre-Py_Initialize twiddling.
 
> > > I suggest that archive files MUST be put into a known
> > > directory.

No way. Hard code a directory? Overwrite someone else's 
Python "standalone"? Write to a C: partition that is 
deliberately sized to hold nothing but Windows? Make 
network installations impossible?
 
> > Why?  Maybe this works on Windows; on Unix this is asking for
> > trouble because it prevents users from augmenting the
> > installation provided by the sysadmin.  Even on newer Windows
> > versions, users without admin perms may not be allowed to add
> > files to that privileged directory.
> 
> It works on Windows because programs install themselves in their
> own subdirectories, and can put files there instead of
> /windows/system32. This holds true for Windows 2000 also.  A
> Unix-style installation to /windows/system32 would (may?) require
> "administrator" privilege.

There's nothing Unix-style about installing to 
/Windows/system32. 'Course *they* have symbolic links that 
actually work...
 
> On Unix you are right.  I didn't think of that because I am the
> Unix sysadmin here, so I can put things where I want.  The
> Windows solution doesn't fit with Unix, because executables go in
> a ./bin directory and putting library files there is a no-no. 
> Hmmmm... This needs more thought.  Anyone else have ideas??

The official Windows solution is stuff in registry about app 
paths and such. Putting the dlls in the exe's directory is a 
workaround which works and is more manageable than the 
official solution.
 
> > > We should also have the ability to append archive files to
> > > the executable or a shared library assuming the OS allows
> > > this

That's a handy trick on Windows, but it's got nothing to do 
with Python.

> > Well, the question is really if we want flexibility or archive
> > files. I care more about the flexibility.  If we get a clear
> > vote for archive files, I see no problem with implementing that
> > first.
> 
> I don't like flexibility, I like standardization and simplicity.
> Flexibility just encourages users to do the wrong thing.

I've noticed that the people who think there should only be one 
way to do things never agree on what it is.
 
> Everyone vote please.  I don't have a solid feeling about
> what people want, only what they don't like.

Flexibility. You can put Christian's favorite Einstein quote here 
too.
 
> > > If the Python library is available as an archive, I think
> > > startup will be greatly improved anyway.
> > 
> > Really?  I know about all the system calls it makes, but I
> > don't really see much of a delay -- I have a prompt in well
> > under 0.1 second.
> 
> So do I.  I guess I was just echoing someone else's complaint.

Install some stuff. Deinstall some of it. Repeat (mixing up the 
order) until your registry and hard drive are shattered into tiny 
little fragments. It doesn't take long (there's lots of stuff a 
defragmenter can't touch once it's there).


- Gordon



From mal at lemburg.com  Fri Nov 19 10:08:44 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 10:08:44 +0100
Subject: [Python-Dev] file modes (was: just say no...)
References: <003401bf3224$d231be30$0501a8c0@bobcat>
Message-ID: <3835139C.344F3EEE@lemburg.com>

Mark Hammond wrote:
> 
> [MAL]
> 
> > If you are already using the buffer feature for e.g. array which
> > also implement "s#" but don't support "t#" for obvious reasons
> > you'll run into trouble, but then: arrays are binary data,
> > so changing from text mode to binary mode is well worth the
> > effort even if you just consider it a nuisance.
> 
> Breaking existing code that works should be considered more than a
> nuisance.

It's an error that's pretty easy to fix... that's what I was
referring to with "nuisance". All you have to do is open
the file in binary mode and you're done.
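
In user code it's a one-character change to the mode string, e.g.
(sketch only; tostring() was the spelling of the day):

    import array
    a = array.array('b', [1, 2, 3])
    f = open("data.bin", "wb")     # explicit binary mode, no newline translation
    f.write(a.tobytes())           # tostring() in older Pythons
    f.close()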

BTW, the change will only affect platforms that don't distinguish
between text and binary mode, e.g. Unix ones.
 
> However, one answer would be to have "t#" _prefer_ to use the text
> buffer, but not insist on it.  eg, the logic for processing "t#" could
> check if the text buffer is supported, and if not move back to the
> blob buffer.

I doubt that this conforms to what the buffer interface wants
to reflect: if the getcharbuf slot is not implemented this means
"I am not text". If you write non-text to a text file,
this may cause line breaks to be interpreted in ways that are
incompatible with the binary data, i.e. when you read the data
back in, it may fail to load because e.g. '\n' was converted to
'\r\n'.
 
> This should mean that all existing code still works, except for
> objects that support both buffers to mean different things.  AFAIK
> there are no objects that qualify today, so it should work fine.

Well, even though the code would work, it might break badly
someday for the above reasons. Better fix that now when there
aren't too many possible cases around than at some later
point where the user has to figure out the problem for himself
due to the system not warning him about this.
 
> Unix users _will_ need to revisit their thinking about "text mode" vs
> "binary mode" when writing these new objects (such as Unicode), but
> IMO that is more than reasonable - Unix users don't bother qualifying
> the open mode of their files, simply because it has no effect on their
> files.  If for certain objects or requirements there _is_ a
> distinction, then new code can start to think these issues through.
> "Portable File IO" will simply be extended from simply "portable among
> all platforms" to "portable among all platforms and objects".

Right.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    42 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From mal at lemburg.com  Fri Nov 19 10:56:03 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 10:56:03 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <002401bf31b3$bf16c230$0501a8c0@bobcat> <3833E5EC.AAFE5016@lemburg.com> <199911181537.KAA03911@eric.cnri.reston.va.us>  
	            <383427ED.45A01BBB@lemburg.com> <199911181637.LAA04260@eric.cnri.reston.va.us>
Message-ID: <38351EB3.153FCDFC@lemburg.com>

Guido van Rossum wrote:
> 
> > Like a path of search functions ? Not a bad idea... I will still
> > want the internal dict for caching purposes though. I'm not sure
> > how often these encodings will be used, but even a few hundred function
> > calls will slow down the Unicode implementation quite a bit.
> 
> Of course.  (It's like sys.modules caching the results of an import).

I've fixed the "path of search functions" approach in the latest
version of the spec.
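
In rough pseudo-Python, the search path plus cache could look something
like this (all names here are made up for illustration, not part of the
spec):

    _search_path = []   # registered search functions, tried in order
    _cache = {}         # encoding name -> codec object

    def register(search_func):
        _search_path.append(search_func)

    def lookup(encoding):
        codec = _cache.get(encoding)
        if codec is None:
            for search in _search_path:
                codec = search(encoding)
                if codec is not None:
                    _cache[encoding] = codec
                    break
            else:
                raise LookupError("unknown encoding: %s" % encoding)
        return codec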
 
> [...]
> >     def flush(self):
> >
> >       """ Flushes the codec buffers used for keeping state.
> >
> >           Return values are not defined. Implementations are free to
> >           return None, raise an exception (in case there is pending
> >           data in the buffers which could not be decoded) or
> >           return any remaining data from the state buffers used.
> >
> >       """
> 
> I don't know where this came from, but a flush() should work like
> flush() on a file. 

It came from Fredrik's proposal.

> It doesn't return a value, it just sends any
> remaining data to the underlying stream (for output).  For input it
> shouldn't be supported at all.
> 
> The idea is that flush() should do the same to the encoder state that
> close() followed by a reopen() would do.  Well, more or less.  But if
> the process were to be killed right after a flush(), the data written
> to disk should be a complete encoding, and not have a lingering shift
> state.

Ok. I've modified the API as follows:

StreamWriter:
    def flush(self):

	""" Flushes and resets the codec buffers used for keeping state.

	    Calling this method should ensure that the data on the
	    output is put into a clean state, that allows appending
	    of new fresh data without having to rescan the whole
	    stream to recover state.

	"""
	pass

StreamReader:
    def read(self,chunksize=0):

	""" Decodes data from the stream self.stream and returns a tuple 
	    (Unicode object, bytes consumed).

	    chunksize indicates the approximate maximum number of
	    bytes to read from the stream for decoding purposes. The
	    decoder can modify this setting as appropriate. The default
	    value 0 indicates to read and decode as much as possible.
	    The chunksize is intended to prevent having to decode huge
	    files in one step.

	    The method should use a greedy read strategy meaning that
	    it should read as much data as is allowed within the
	    definition of the encoding and the given chunksize, e.g.
            if optional encoding endings or state markers are
	    available on the stream, these should be read too.

        """
	... the base class should provide a default implementation
	    of this method using self.decode ...

    def reset(self):

	""" Resets the codec buffers used for keeping state.

	    Note that no stream repositioning should take place.
	    This method is primarily intended to recover from
	    decoding errors.

	"""
	pass

The .reset() method replaces the .flush() method on StreamReaders.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    42 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Fri Nov 19 10:22:48 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Fri, 19 Nov 1999 10:22:48 +0100
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
References: <1269187709-18981857@hypernet.com> <199911181530.KAA03887@eric.cnri.reston.va.us>
Message-ID: <383516E8.EE66B527@lemburg.com>

Guido van Rossum wrote:
>
> Let's first complete the requirements gathering.  Are these
> requirements reasonable?  Will they make an implementation too
> complex?  Am I missing anything?

Since you were asking: I would like functionality equivalent
to my latest import patch for a slightly different lookup scheme
for module import inside packages to become a core feature.

If it becomes a core feature I promise to never again start
threads about relative imports :-)

Here's the summary again:
"""
[The patch] changes the default import mechanism to work like this:

>>> import d # from directory a/b/c/
try a.b.c.d
try a.b.d
try a.d
try d
fail

instead of just doing the current two-level lookup:

>>> import d # from directory a/b/c/
try a.b.c.d
try d
fail

As a result, relative imports referring to higher level packages
work out of the box without any ugly underscores in the import name.
Plus the whole scheme is pretty simple to explain and straightforward.
"""

You can find the patch attached to the message "Walking up the package
hierarchy" in the python-dev mailing list archive.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    42 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/





From captainrobbo at yahoo.com  Fri Nov 19 14:01:04 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Fri, 19 Nov 1999 05:01:04 -0800 (PST)
Subject: [Python-Dev] Codecs and StreamCodecs
Message-ID: <19991119130104.21726.rocketmail@web605.yahoomail.com>


--- "M.-A. Lemburg"  wrote:
> Guido van Rossum wrote:
> > I don't know where this came from, but a flush()
> should work like
> > flush() on a file. 
> 
> It came from Fredrik's proposal.
> 
> > It doesn't return a value, it just sends any
> > remaining data to the underlying stream (for
> output).  For input it
> > shouldn't be supported at all.
> > 
> > The idea is that flush() should do the same to the
> encoder state that
> > close() followed by a reopen() would do.  Well,
> more or less.  But if
> > the process were to be killed right after a
> flush(), the data written
> > to disk should be a complete encoding, and not
> have a lingering shift
> > state.
> 
This could be useful in real life.  
For example, iso-2022-jp has a 'single-byte-mode'
and a 'double-byte-mode' with shift-sequences to
separate them.  The rule is that each line in the 
text file or email message or whatever must begin
and end in single-byte mode.  So I would take flush()
to mean 'shift back to ASCII now'.
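
For the record, the shift sequences in question are (illustration only):

    TO_JIS   = "\x1b$B"   # escape into double-byte (JIS X 0208) mode
    TO_ASCII = "\x1b(B"   # escape back to single-byte ASCII mode

    # Under this reading, flush() on an encoder that is currently in
    # double-byte mode would emit TO_ASCII so the output ends cleanly.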

Calling flush and reopen would thus "almost" get the
same data across.

I'm trying to think if it would be dangerous.  Do web
and ftp servers often call flush() in the middle of
transmitting a block of text?

- Andy


=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

__________________________________________________
Do You Yahoo!?
Bid and sell for free at http://auctions.yahoo.com



From fredrik at pythonware.com  Fri Nov 19 14:33:50 1999
From: fredrik at pythonware.com (Fredrik Lundh)
Date: Fri, 19 Nov 1999 14:33:50 +0100
Subject: [Python-Dev] Codecs and StreamCodecs
References: <19991119130104.21726.rocketmail@web605.yahoomail.com>
Message-ID: <000701bf3292$b7c49130$f29b12c2@secret.pythonware.com>

Andy Robinson  wrote:
> So I would take flush() to mean 'shift back to
> ASCII now'.

if we're still talking about my "just one
codec, please" proposal, that's exactly
what encoder.flush should do.

while decoder.flush should raise an
exception if you're still in double byte mode
(at least if running in 'strict' mode).

> Calling flush and reopen would thus "almost" get the
> same data across.
> 
> I'm trying to think if it would be dangerous.  Do web
> and ftp servers often call flush() in the middle of
> transmitting a block of text?

again, if we're talking about my proposal,
these flush methods are only called by the
string or stream wrappers, never by the
applications.  see the original post for
details.






From gstein at lyra.org  Fri Nov 19 14:29:50 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 05:29:50 -0800 (PST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <199911181530.KAA03887@eric.cnri.reston.va.us>
Message-ID: 

On Thu, 18 Nov 1999, Guido van Rossum wrote:
> Gordon McMillan wrote:
>...
> > I think imputil's emulation of the builtin importer is more of a 
> > demonstration than a serious implementation. As for speed, it 
> > depends on the test. 
> 
> Agreed.  I like some of imputil's features, but I think the API
> needs to be redesigned.

In what ways? It sounds like you've applied some thought. Do you have any
concrete ideas yet, or "just a feeling" :-)  I'm working through some
changes from JimA right now, and would welcome other suggestions. I think
there may be some outstanding stuff from MAL, but I'm not sure (Marc?)

>...
> So here's a challenge: redesign the import API from scratch.

I would suggest starting with imputil and altering as necessary. I'll use
that viewpoint below.

> Let me start with some requirements.
> 
> Compatibility issues:
> ---------------------
> 
> - the core API may be incompatible, as long as compatibility layers
> can be provided in pure Python

Which APIs are you referring to? The "imp" module? The C functions? The
__import__ and reload builtins?

I'm guessing some of imp, the two builtins, and only one or two C
functions.

> - support for rexec functionality

No problem. I can think of a number of ways to do this.

> - support for freeze functionality

No problem. A function in "imp" must be exposed to Python to support this
within the imputil framework.

> - load .py/.pyc/.pyo files and shared libraries from files

No problem. Again, a function is needed for platform-specific loading of
shared libraries.

> - support for packages

No problem. Demo's in current imputil.

> - sys.path and sys.modules should still exist; sys.path might
> have a slightly different meaning

I would suggest that both retain their *exact* meaning. We introduce
sys.importers -- a list of importers to check, in sequence. The first
importer on that list uses sys.path to look for and load modules. The
second importer loads builtins and frozen code (i.e. modules not on
sys.path).

Users can insert/append new importers or alter sys.path as before.

sys.modules continues to record name:module mappings.
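
Roughly, and with made-up names (nothing here is meant to be the final
spelling):

    import sys

    def import_top(fqname):
        module = sys.modules.get(fqname)
        if module is not None:
            return module                 # cached, as today
        for importer in sys.importers:
            module = importer.import_module(fqname)   # hypothetical method
            if module is not None:
                return module
        raise ImportError("no importer could load " + fqname)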

> - $PYTHONPATH and $PYTHONHOME should still be supported

No problem.

> (I wouldn't mind a splitting up of importdl.c into several
> platform-specific files, one of which is chosen by the configure
> script; but that's a bit of a separate issue.)

Easy enough. The standard importer can select the appropriate
platform-specific module/function to perform the load. i.e. these can move
to Modules/ and be split into a module-per-platform.

> New features:
> -------------
> 
> - Integrated support for Greg Ward's distribution utilities (i.e. a
>   module prepared by the distutil tools should install painlessly)

I don't know the specific requirements/functionality that would be
required here (does Greg? :-), but I can't imagine any problem with this.

> - Good support for prospective authors of "all-in-one" packaging tool
>   authors like Gordon McMillan's win32 installer or /F's squish.  (But
>   I *don't* require backwards compatibility for existing tools.)

Um. *No* problem. :-)

> - Standard import from zip or jar files, in two ways:
> 
>   (1) an entry on sys.path can be a zip/jar file instead of a directory;
>       its contents will be searched for modules or packages

While this could easily be done, I might argue against it. Old
apps/modules that process sys.path might get confused.

If compatibility is not an issue, then "No problem."

An alternative would be an Importer instance added to sys.importers that
is configured for a specific archive (in other words, don't add the zip
file to sys.path, add ZipImporter(file) to sys.importers).

Another alternative is an Importer that looks at a "sys.py_archives" list.
Or an Importer that has a py_archives instance attribute.

>   (2) a file in a directory that's on sys.path can be a zip/jar file;
>       its contents will be considered as a package (note that this is
>       different from (1)!)

No problem. This will slow things down, as a stat() for *.zip and/or *.jar
must be done, in addition to *.py, *.pyc, and *.pyo.

>   I don't particularly care about supporting all zip compression
>   schemes; if Java gets away with only supporting gzip compression
>   in jar files, so can we.

I presume we would support whatever zlib gives us, and no more.

> - Easy ways to subclass or augment the import mechanism along
>   different dimensions.  For example, while none of the following
>   features should be part of the core implementation, it should be
>   easy to add any or all:
> 
>   - support for a new compression scheme to the zip importer

Presuming ZipImporter is a class (derived from Importer), then this
ability is wholly dependent upon the author of ZipImporter providing the
hook.

The Importer class is already designed for subclassing (and its interface 
is very narrow, which means delegation is also *very* easy; see
imputil.FuncImporter).

>   - support for a new archive format, e.g. tar

A cakewalk. Gordon, JimA, and myself each have archive formats. :-)

>   - a hook to import from URLs or other data sources (e.g. a
>     "module server" imported in CORBA) (this needn't be supported
>     through $PYTHONPATH though)

No problem at all.

>   - a hook that imports from compressed .py or .pyc/.pyo files

No problem at all.

>   - a hook to auto-generate .py files from other filename
>     extensions (as currently implemented by ILU)

No problem at all.

>   - a cache for file locations in directories/archives, to improve
>     startup time

No problem at all.

>   - a completely different source of imported modules, e.g. for an
>     embedded system or PalmOS (which has no traditional filesystem)

No problem at all.

In each of the above cases, the Importer.get_code() method just needs to
grab the byte codes from the XYZ data source. That data source can be
compressed, across a network, generated on the fly, or whatever. Each
importer can certainly create a cache based on its concept of "location".
In some cases, that would be a mapping from module name to filesystem
path, or to a URL, or to a compiled-in, frozen module.
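
For example, a toy Importer that serves modules out of an in-memory
mapping might look like this (the get_code() signature and return
convention are assumed from imputil and may differ in detail):

    import imputil, marshal

    class DictImporter(imputil.Importer):
        def __init__(self, code_map):
            self.code_map = code_map      # fqname -> marshalled code string

        def get_code(self, parent, modname, fqname):
            raw = self.code_map.get(fqname)
            if raw is None:
                return None               # not ours; let other importers try
            code = marshal.loads(raw)
            return 0, code, {}            # (is-package flag, code, extras)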

> - Note that different kinds of hooks should (ideally, and within
>   reason) properly combine, as follows: if I write a hook to recognize
>   .spam files and automatically translate them into .py files, and you
>   write a hook to support a new archive format, then if both hooks are
>   installed together, it should be possible to find a .spam file in an
>   archive and do the right thing, without any extra action.  Right?

Ack. Very, very difficult.

The imputil scheme combines the concept of locating/loading into one step.
There is only one "hook" in the imputil system. Its semantic is "map this
name to a code/module object and return it; if you don't have it, then
return None."

Your compositing example is based on the capabilities of the
find-then-load paradigm of the existing "ihooks.py". One module finds
something (foo.spam) and the other module loads it (by generating a .py).

All is not lost, however. I can easily envision the get_code() hook as
allowing any kind of return type. If it isn't a code or module object,
then another hook is called to transform it.
[ actually, I'd design it similarly: a *series* of hooks would be called
  until somebody transforms the foo.spam into a code/module object. ]
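
Something along these lines (names invented for the sketch):

    transformers = []    # e.g. a ".spam source -> code object" translator

    def to_code(result):
        for transform in transformers:
            out = transform(result)
            if out is not None:
                return out
        raise ImportError("no transformer could handle %r" % (result,))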

The compositing would be limited only by the (Python-based) Importer
classes. For example, my ZipImporter might expect to zip up .pyc files
*only*. Obviously, you would want to alter this to support zipping any
file, then use the suffix to determine what to do at unzip time.

> - It should be possible to write hooks in C/C++ as well as Python

Use FuncImporter to delegate to an extension module.

This is one of the benefits of imputil's single/narrow interface.

> - Applications embedding Python may supply their own implementations,
>   default search path, etc., but don't have to if they want to piggyback
>   on an existing Python installation (even though the latter is
>   fraught with risk, it's cheaper and easier to understand).

An application would have full control over the contents of sys.importers.

For a restricted execution app, it might install an Importer that loads
files from *one* directory only which is configured from a specific
Win32 Registry entry. That importer could also refuse to load shared
modules. The BuiltinImporter would still be present (although the app
would certainly omit all but the necessary builtins from the build).
Frozen modules could be excluded.

> Implementation:
> ---------------
> 
> - There must clearly be some code in C that can import certain
>   essential modules (to solve the chicken-or-egg problem), but I don't
>   mind if the majority of the implementation is written in Python.
>   Using Python makes it easy to subclass.

I posited once before that the cost of import is mostly I/O rather than
CPU, so using Python should not be an issue. MAL demonstrated that a good
design for the Importer classes is also required. Based on this, I'm a
*strong* advocate of moving as much as possible into Python (to get
Python's ease-of-coding with little relative cost).

The (core) C code should be able to search a path for a module and import
it. It does not require dynamic loading or packages. This will be used to
import exceptions.py, then imputil.py, then site.py.

The platform-specific module that performs dynamic loading must be a
statically linked module (in Modules/ ... it doesn't have to be in the
Python/ directory).

site.py can complete the bootstrap by setting up sys.importers with the
appropriate Importer instances (this is where an application can define
its own policy). sys.path was initially set by the import.c bootstrap code
(from the compiled-in path and environment variables).

Note that imputil.py would not install any hooks when it is loaded. That
is up to site.py. This implies the core C code will import a total of
three modules using its builtin system. After that, the imputil mechanism
would be importing everything (site.py would .install() an Importer which
then takes over the __import__ hook).
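
A hypothetical site.py fragment under this scheme (the importer names
come from imputil; the rest is assumed, not decided):

    import sys, imputil

    sys.importers = [
        imputil.SysPathImporter(),    # scans sys.path for .py/.pyc/.pyo
        imputil.BuiltinImporter(),    # builtin and frozen modules
    ]
    # ...plus whatever hook-installation step imputil ends up exposing,
    # i.e. something that takes over __import__ (exact spelling assumed).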

Further note that the "import" Python statement could be simplified to use
only the hook. However, this would require the core importer to inject
some module names into the imputil module's namespace (since it couldn't
use an import statement until a hook was installed). While this
simplification is "neat", it complicates the run-time system (the import
statement is broken until a hook is installed).

Therefore, the core C code must also support importing builtins. "sys" and
"imp" are needed by imputil to bootstrap.

The core importer should not need to deal with dynamic-load modules.

To support frozen apps, the core importer would need to support loading
the three modules as frozen modules.

The builtin/frozen importing would be exposed thru "imp" for use by
imputil for future imports. imputil would load and use the (builtin)
platform-specific module to do dynamic-load imports.

> - In order to support importing from zip/jar files using compression,
>   we'd at least need the zlib extension module and hence libz itself,
>   which may not be available everywhere.

Yes. I don't see this as a requirement, though. We wouldn't start to use
these by default, would we? Or insist on zlib being present? I see this as
more along the lines of "we have provided a standardized Importer to do
this, *provided* you have zlib support."

> - I suppose that the bootstrap is solved using a mechanism very
>   similar to what freeze currently used (other solutions seem to be
>   platform dependent).

The bootstrap that I outlined above could be done in C code. The import
code would be stripped down dramatically because you'll drop package
support and dynamic loading.

Alternatively, you could probably do the path-scanning in Python and
freeze that into the interpreter. Personally, I don't like this idea as it
would not buy you much at all (it would still need to return to C for
accessing a number of scanning functions and module importing funcs).

> - I also want to still support importing *everything* from the
>   filesystem, if only for development.  (It's hard enough to deal with
>   the fact that exceptions.py is needed during Py_Initialize();
>   I want to be able to hack on the import code written in Python
>   without having to rebuild the executable all the time.

My outline above does not freeze anything. Everything resides in the
filesystem. The C code merely needs a path-scanning loop and functions to
import .py*, builtin, and frozen types of modules.

If somebody nukes their imputil.py or site.py, then they return to Python
1.4 behavior where the core interpreter uses a path for importing (i.e. no
packages). They lose dynamically-loaded module support.

> Let's first complete the requirements gathering.  Are these
> requirements reasonable?  Will they make an implementation too
> complex?  Am I missing anything?

I'm not a fan of the compositing due to it requiring a change to semantics
that I believe are very useful and very clean. However, I outlined a
possible, clean solution to do that (a secondary set of hooks for
transforming get_code() return values).

The requirements are otherwise reasonable to me, as I see that they can
all be readily solved (i.e. they aren't burdensome).

While this email may be long, I do not believe the resulting system would
be complex. From the user-visible side of things, nothing would be
changed. sys.path is still present and operates as before. They *do* have
new functionality they can grow into, though (sys.importers). The
underlying C code is simplified, and the platform-specific dynamic-load
stuff can be distributed to distinct modules, as needed
(e.g. BeOS/dynloadmodule.c and PC/dynloadmodule.c).

> Finally, to what extent does this impact the desire for dealing
> differently with the Python bytecode compiler (e.g. supporting
> optimizers written in Python)?  And does it affect the desire to
> implement the read-eval-print loop (the >>> prompt) in Python?

If the three startup files require byte-compilation, then you could have
some issues (i.e. the byte-compiler must be present).

Once you hit site.py, you have a "full" environment and can easily detect
and import a read-eval-print loop module (i.e. why return to Python? just 
start things up right there).

site.py can also install new optimizers as desired, a new Python-based
parser or compiler, or whatever...  If Python is built without a parser or
compiler (I hope that's an option!), then the three startup modules would
simply be frozen into the executable.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/





From bwarsaw at cnri.reston.va.us  Fri Nov 19 17:30:15 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Fri, 19 Nov 1999 11:30:15 -0500 (EST)
Subject: [Python-Dev] CVS log messages with diffs
References: <199911161700.MAA02716@eric.cnri.reston.va.us>
Message-ID: <14389.31511.706588.20840@anthem.cnri.reston.va.us>

There was a suggestion to start augmenting the checkin emails to
include the diffs of the checkin.  This would let you keep a current
snapshot of the tree without having to do a direct `cvs update'.

I think I can add this without a ton of pain.  It would not be
optional however, and the emails would get larger (and some checkins
could be very large).  There's also the question of whether to
generate unified or context diffs.  Personally, I find context diffs
easier to read; unified diffs are smaller but not by enough to really
matter.

So here's an informal poll.  If you don't care either way, you don't
need to respond.  Otherwise please just respond to me and not to the
list.

1. Would you like to start receiving diffs in the checkin messages?

2. If you answer `yes' to #1 above, would you prefer unified or
   context diffs?

-Barry



From bwarsaw at cnri.reston.va.us  Fri Nov 19 18:04:51 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Fri, 19 Nov 1999 12:04:51 -0500 (EST)
Subject: [Python-Dev] Another 1.6 wish
Message-ID: <14389.33587.947368.547023@anthem.cnri.reston.va.us>

We had some discussion a while back about enabling thread support by
default, if the underlying OS supports it obviously.  I'd like to see
that happen for 1.6.  IIRC, this shouldn't be too hard -- just a few
tweaks of the configure script (and who knows what for those minority
platforms that don't use configure :).

-Barry



From akuchlin at mems-exchange.org  Fri Nov 19 18:07:07 1999
From: akuchlin at mems-exchange.org (Andrew M. Kuchling)
Date: Fri, 19 Nov 1999 12:07:07 -0500 (EST)
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <14389.33587.947368.547023@anthem.cnri.reston.va.us>
References: <14389.33587.947368.547023@anthem.cnri.reston.va.us>
Message-ID: <14389.33723.270207.374259@amarok.cnri.reston.va.us>

Barry A. Warsaw writes:
>We had some discussion a while back about enabling thread support by
>default, if the underlying OS supports it obviously.  I'd like to see

That reminds me... what about the free threading patches?  Perhaps
they should be added to the list of issues to consider for 1.6.

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Oh, my fingers! My arms! My legs! My everything! Argh...
    -- The Doctor, in "Nightmare of Eden"




From petrilli at amber.org  Fri Nov 19 18:23:02 1999
From: petrilli at amber.org (Christopher Petrilli)
Date: Fri, 19 Nov 1999 12:23:02 -0500
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <14389.33723.270207.374259@amarok.cnri.reston.va.us>; from akuchlin@mems-exchange.org on Fri, Nov 19, 1999 at 12:07:07PM -0500
References: <14389.33587.947368.547023@anthem.cnri.reston.va.us> <14389.33723.270207.374259@amarok.cnri.reston.va.us>
Message-ID: <19991119122302.B23400@trump.amber.org>

Andrew M. Kuchling [akuchlin at mems-exchange.org] wrote:
> Barry A. Warsaw writes:
> >We had some discussion a while back about enabling thread support by
> >default, if the underlying OS supports it obviously.  I'd like to see

Yes pretty please!  One of the biggest problems we have in the Zope world
is that for some unknown reason, most of the Linux RPMs don't have threading
on in them, so people end up having to compile it anyway... while this
is a silly thing, it does create problems, and means that we deal with
a lot of "dumb" problems.

> That reminds me... what about the free threading patches?  Perhaps
> they should be added to the list of issues to consider for 1.6.

My recollection was that unfortunately MOST of the time, they actually
slowed things down because of the number of locks involved...  Guido
can no doubt shed more light on this, but... there was a reason.

Chris
-- 
| Christopher Petrilli
| petrilli at amber.org



From gmcm at hypernet.com  Fri Nov 19 19:22:37 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 19 Nov 1999 13:22:37 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
In-Reply-To: <199911181530.KAA03887@eric.cnri.reston.va.us>
References: Your message of "Thu, 18 Nov 1999 09:19:48 EST."             <1269187709-18981857@hypernet.com> 
Message-ID: <1269086690-25057991@hypernet.com>

[Guido]
> Compatibility issues:
> ---------------------
> 
> - the core API may be incompatible, as long as compatibility
> layers can be provided in pure Python

Good idea. Question: we have keyword import, __import__, 
imp and PyImport_*. Which of those (if any) define the "core 
API"?

[rexec, freeze: yes]

> - load .py/.pyc/.pyo files and shared libraries from files

Shared libraries? Might that not involve some rather shady 
platform-specific magic? If it can be kept kosher, I'm all for it; 
but I'd say no if it involved, um, undocumented features.
 
> support for packages

Absolutely. I'll just comment that the concept of 
package.__path__ is also affected by the next point.
> 
> - sys.path and sys.modules should still exist; sys.path might
> have a slightly different meaning
> 
> - $PYTHONPATH and $PYTHONHOME should still be supported

If sys.path changes meaning, should not $PYTHONPATH 
also?

> New features:
> -------------
> 
> - Integrated support for Greg Ward's distribution utilities (i.e.
> a
>   module prepared by the distutil tools should install
>   painlessly)

I assume that this is mostly a matter of $PYTHONPATH and 
other path manipulation mechanisms?
 
> - Good support for prospective authors of "all-in-one" packaging
> tool
>   authors like Gordon McMillan's win32 installer or /F's squish. 
>   (But I *don't* require backwards compatibility for existing
>   tools.)

I guess you've forgotten: I'm that *really* tall guy .
 
> - Standard import from zip or jar files, in two ways:
> 
>   (1) an entry on sys.path can be a zip/jar file instead of a
>   directory;
>       its contents will be searched for modules or packages

I don't mind this, but it depends on whether sys.path changes 
meaning.
 
>   (2) a file in a directory that's on sys.path can be a zip/jar
>   file;
>       its contents will be considered as a package (note that
>       this is different from (1)!)

But it's affected by the same considerations (eg, do we start 
with filesystem names and wrap them in importers, or do we 
just start with importer instances / specifications for importer 
instances).
 
>   I don't particularly care about supporting all zip compression
>   schemes; if Java gets away with only supporting gzip
>   compression in jar files, so can we.

I think this is a matter of what zip compression is officially 
blessed. I don't mind if it's none; providing / creating zipped 
versions for platforms that support it is nearly trivial.

> - Easy ways to subclass or augment the import mechanism along
>   different dimensions.  For example, while none of the following
>   features should be part of the core implementation, it should
>   be easy to add any or all:
> 
>   - support for a new compression scheme to the zip importer
> 
>   - support for a new archive format, e.g. tar
> 
>   - a hook to import from URLs or other data sources (e.g. a
>     "module server" imported in CORBA) (this needn't be supported
>     through $PYTHONPATH though)

Which begs the question of the meaning of sys.path; and if it's 
still filesystem names, how do you get one of these in there?
 
>   - a hook that imports from compressed .py or .pyc/.pyo files
> 
>   - a hook to auto-generate .py files from other filename
>     extensions (as currently implemented by ILU)
> 
>   - a cache for file locations in directories/archives, to
>   improve
>     startup time
> 
>   - a completely different source of imported modules, e.g. for
>   an
>     embedded system or PalmOS (which has no traditional
>     filesystem)
> 
> - Note that different kinds of hooks should (ideally, and within
>   reason) properly combine, as follows: if I write a hook to
>   recognize .spam files and automatically translate them into .py
>   files, and you write a hook to support a new archive format,
>   then if both hooks are installed together, it should be
>   possible to find a .spam file in an archive and do the right
>   thing, without any extra action.  Right?

A bit of discussion: I've got 2 kinds of archives. One can 
contain anything & is much like a zip (and probably should be 
a zip). The other contains only compressed .pyc or .pyo. The 
latter keys contents by logical name, not filesystem name. No 
extensions, and when a package is imported, the code object 
returned is the __init__ code object, (vs returning None and 
letting the import mechanism come back and ask for 
package.__init__).

When you're building an archive, you have to go thru the .py / 
.pyc / .pyo / is it a package / maybe compile logic anyway. 
Why not get it all over with, so that at runtime there are no 
choices to be made.

Which means (for this kind of archive) that including 
somebody's .spam in your archive isn't a matter of a hook, but 
a matter of adding to the archive's build smarts.
 
> - It should be possible to write hooks in C/C++ as well as Python
> 
> - Applications embedding Python may supply their own
> implementations,
>   default search path, etc., but don't have to if they want to
>   piggyback on an existing Python installation (even though the
>   latter is fraught with risk, it's cheaper and easier to
>   understand).

A way of tweaking that which will become sys.path before 
Py_Initialize would be *most* welcome.

> Implementation:
> ---------------
> 
> - There must clearly be some code in C that can import certain
>   essential modules (to solve the chicken-or-egg problem), but I
>   don't mind if the majority of the implementation is written in
>   Python. Using Python makes it easy to subclass.
> 
> - In order to support importing from zip/jar files using
> compression,
>   we'd at least need the zlib extension module and hence libz
>   itself, which may not be available everywhere.
> 
> - I suppose that the bootstrap is solved using a mechanism very
>   similar to what freeze currently used (other solutions seem to
>   be platform dependent).

There are other possibilites here, but I have only half-
formulated ideas at the moment. The critical part for 
embedding is to be able to *completely* control all path 
related logic.
 
> - I also want to still support importing *everything* from the
>   filesystem, if only for development.  (It's hard enough to deal
>   with the fact that exceptions.py is needed during
>   Py_Initialize(); I want to be able to hack on the import code
>   written in Python without having to rebuild the executable all
>   the time.
> 
> Let's first complete the requirements gathering.  Are these
> requirements reasonable?  Will they make an implementation too
> complex?  Am I missing anything?

I'll summarize as follows:
 1) What "sys.path" means (and how it's construction can be 
manipulated) is critical.
 2) See 1.
 
> Finally, to what extent does this impact the desire for dealing
> differently with the Python bytecode compiler (e.g. supporting
> optimizers written in Python)?  And does it affect the desire to
> implement the read-eval-print loop (the >>> prompt) in 
Python?

I can assure you that code.py runs fine out of an archive :-).

- Gordon



From gstein at lyra.org  Fri Nov 19 22:06:14 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 13:06:14 -0800 (PST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: 
Message-ID: 

[ taking the liberty to CC: this back to python-dev ]

On Fri, 19 Nov 1999, David Ascher wrote:
> > >   (2) a file in a directory that's on sys.path can be a zip/jar file;
> > >       its contents will be considered as a package (note that this is
> > >       different from (1)!)
> > 
> > No problem. This will slow things down, as a stat() for *.zip and/or *.jar
> > must be done, in addition to *.py, *.pyc, and *.pyo.
> 
> Aside: it strikes me that for Python programs which import lots of files,
> 'front-loading' the stat calls could make sense.  When you first look at a
> directory in sys.path, you read the entire directory in memory, and
> successive imports do a stat on the directory to see if it's changed, and
> if not use the in-memory data.  Or am I completely off my rocker here?

Not at all. I thought of this last night after my email. Since the
Importer can easily retain state, it can hold a cache of the directory
listings. If it doesn't find the file in its cached state, then it can
reload the information from disk. If it finds it in the cache, but not on
disk, then it can remove the item from its cache.

The problem occurs when your path is [A, B], the file is in B, and you add
something to A on-the-fly. The cache might direct the importer at B,
missing your file.

Of course, with the appropriate caveats/warnings, the system would work
quite well. It really only breaks during development (which is one reason 
why I didn't accept some caching changes to imputil from MAL; but that
was for the Importer in there; Python's new Importer could have a cache).

I'm also not quite sure what the cost of reading a directory is, compared
to issuing a bunch of stat() calls. Each directory read is an
opendir/readdir(s)/closedir. Note that the DBM approach is kind of
similar, but will amortize this cost over many processes.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From Jasbahr at origin.EA.com  Fri Nov 19 21:59:11 1999
From: Jasbahr at origin.EA.com (Asbahr, Jason)
Date: Fri, 19 Nov 1999 14:59:11 -0600
Subject: [Python-Dev] Another 1.6 wish
Message-ID: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com>

My first Python-Dev post.  :-)

>We had some discussion a while back about enabling thread support by
>default, if the underlying OS supports it obviously.  

What's the consensus about Python microthreads -- a likely candidate
for incorporation in 1.6 (or later)?

Also, we have a couple minor convenience functions for Python in an 
MSDEV environment, an exposure of OutputDebugString for writing to 
the DevStudio log window and a means of tripping DevStudio C/C++ layer
breakpoints from Python code (currently experimental).  The msvcrt 
module seems like a likely candidate for these, would these be 
welcome additions?

Thanks,

Jason Asbahr
Origin Systems, Inc.
jasbahr at origin.ea.com



From gstein at lyra.org  Fri Nov 19 22:35:34 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 13:35:34 -0800 (PST)
Subject: [Python-Dev] Re: [Python-checkins] CVS log messages with diffs
In-Reply-To: <14389.31511.706588.20840@anthem.cnri.reston.va.us>
Message-ID: 

On Fri, 19 Nov 1999, Barry A. Warsaw wrote:
> There was a suggestion to start augmenting the checkin emails to
> include the diffs of the checkin.  This would let you keep a current
> snapshot of the tree without having to do a direct `cvs update'.

I've been using diffs-in-checkin for review, rather than to keep a local
snapshot updated. I guess you use the email for this (procmail truly is
frightening), but I think for most people it would be for purposes of
review.

>...context vs unified...
> So here's an informal poll.  If you don't care either way, you don't
> need to respond.  Otherwise please just respond to me and not to the
> list.
> 
> 1. Would you like to start receiving diffs in the checkin messages?

Absolutely.

> 2. If you answer `yes' to #1 above, would you prefer unified or
>    context diffs?

Don't care.

I've attached an archive of the files that I use in my CVS repository to
do emailed diffs. These came from Ken Coar (an Apache guy) as an
extraction from the Apache repository. Yes, they do use Perl. I'm not a
Perl guy, so I probably would break things if I tried to "fix" the scripts
by converting them to Python (in fact, Greg Ward helped to improve
log_accum.pl for me!). I certainly would not be adverse to Python versions
of these files, or other cleanups.

I trimmed down the "avail" file, leaving a few examples. It works with
cvs_acls.pl to provide per-CVS-module read/write access control.

I'm currently running mod_dav, PyOpenGL, XML-SIG, PyWin32, and two other
small projects out of this repository. It has been working quite well.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cvs-for-barry.tar.gz
Type: application/octet-stream
Size: 9668 bytes

From bwarsaw at cnri.reston.va.us  Fri Nov 19 22:45:14 1999
From: bwarsaw at cnri.reston.va.us (Barry A. Warsaw)
Date: Fri, 19 Nov 1999 16:45:14 -0500 (EST)
Subject: [Python-Dev] Re: [Python-checkins] CVS log messages with diffs
References: <14389.31511.706588.20840@anthem.cnri.reston.va.us>
	
Message-ID: <14389.50410.358686.637483@anthem.cnri.reston.va.us>

>>>>> "GS" == Greg Stein  writes:

    GS> I've been using diffs-in-checkin for review, rather than to
    GS> keep a local snapshot updated.

Interesting; I hadn't though about this use for the diffs.

    GS> I've attached an archive of the files that I use in my CVS
    GS> repository to do emailed diffs. These came from Ken Coar (an
    GS> Apache guy) as an extraction from the Apache repository. Yes,
    GS> they do use Perl. I'm not a Perl guy, so I probably would
    GS> break things if I tried to "fix" the scripts by converting
    GS> them to Python (in fact, Greg Ward helped to improve
    GS> log_accum.pl for me!). I certainly would not be adverse to
    GS> Python versions of these files, or other cleanups.

Well, we all know Greg Ward's one of those subversive types, but then
again it's great to have (hopefully now-loyal) defectors in our camp,
just to keep us honest :)

Anyway, thanks for sending the code, it'll come in handy if I get
stuck.  Of course, my P**l skills are so rusted I don't think even an
oilcan-armed Dorothy could lube 'em up, so I'm not sure how much use I
can put them to.  Besides, I already have a huge kludge that gets run
on each commit, and I don't think it'll be too hard to add diff
generation... IF the informal vote goes that way.

-Barry



From gmcm at hypernet.com  Fri Nov 19 22:56:20 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 19 Nov 1999 16:56:20 -0500
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: 
References: 
Message-ID: <1269073918-25826188@hypernet.com>

[David Ascher got involuntarily forwarded]
> > Aside: it strikes me that for Python programs which import lots
> > of files, 'front-loading' the stat calls could make sense. 
> > When you first look at a directory in sys.path, you read the
> > entire directory in memory, and successive imports do a stat on
> > the directory to see if it's changed, and if not use the
> > in-memory data.  Or am I completely off my rocker here?

I posted something here about dircache not too long ago. 
Essentially, I found it completely unreliable on NT and on 
Linux to stat the directory. There was some test code 
attached.
 


- Gordon



From gstein at lyra.org  Fri Nov 19 23:09:36 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 14:09:36 -0800 (PST)
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <19991119122302.B23400@trump.amber.org>
Message-ID: 

On Fri, 19 Nov 1999, Christopher Petrilli wrote:
> Andrew M. Kuchling [akuchlin at mems-exchange.org] wrote:
> > Barry A. Warsaw writes:
> > >We had some discussion a while back about enabling thread support by
> > >default, if the underlying OS supports it obviously.  I'd like to see

Definitely.

I think you still want a --disable-threads option, but the default really
ought to include them.

> Yes pretty please!  One of the biggest problems we have in the Zope world
> is that for some unknown reason, most of hte Linux RPMs don't have threading
> on in them, so people end up having to compile it anyway... while this
> is a silly thing, it does create problems, and means that we deal with
> a lot of "dumb" problems.

Yah. It's a pain. My RedHat 6.1 box has 1.5.2 with threads. I haven't
actually had to build my own Python(!). Man... imagine that. After almost
five years of using Linux/Python, I can actually rely on the OS getting it
right! :-)

> > That reminds me... what about the free threading patches?  Perhaps
> > they should be added to the list of issues to consider for 1.6.
> 
> My recolection was that unfortunately MOST of the time, they actually
> slowed down things because of the number of locks involved...  Guido
> can no doubt shed more light onto this, but... there was a reason.

Yes, there were problems in the first round with locks and lock
contention. The main issue is that a list must always use a lock to keep
itself consistent. Always. There is no way for an application to say "hey,
list object! I've got a higher-level construct here that guarantees there
will be no cross-thread use of this list. Ignore the locking." Another
issue that can't be avoided is using atomic increment/decrement for the
object refcounts.

Guido has already asked me about free threading patches for 1.6. I don't
know if his intent was to include them, or simply to have them available
for those who need them.

Certainly, this time around they will be simpler since Guido folded in
some of the support stuff (e.g. PyThreadState and per-thread exceptions).
There are some other supporting changes that could definitely go into the
core interpreter. The slow part comes when you start to add integrity
locks to list, dict, etc. That is when the question on whether to include
free threading comes up.

Design-wise, there is a change or two that I would probably make.

Note that shoving free-threading into the standard interpreter would get
more eyeballs at the thing, and that people may have great ideas for
reducing the overheads.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Fri Nov 19 23:11:02 1999
From: gstein at lyra.org (Greg Stein)
Date: Fri, 19 Nov 1999 14:11:02 -0800 (PST)
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com>
Message-ID: 

On Fri, 19 Nov 1999, Asbahr, Jason wrote:
> >We had some discussion a while back about enabling thread support by
> >default, if the underlying OS supports it obviously.  
> 
> What's the consensus about Python microthreads -- a likely candidate
> for incorporation in 1.6 (or later)?

microthreads? eh?

> Also, we have a couple minor convenience functions for Python in an 
> MSDEV environment, an exposure of OutputDebugString for writing to 
> the DevStudio log window and a means of tripping DevStudio C/C++ layer
> breakpoints from Python code (currently experimental).  The msvcrt 
> module seems like a likely candidate for these, would these be 
> welcome additions?

Sure. I don't see why not. I know that I've use OutputDebugString a
bazillion times from the Python layer. The breakpoint thingy... dunno, but
I don't see a reason to exclude it.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From skip at mojam.com  Fri Nov 19 23:11:38 1999
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 19 Nov 1999 16:11:38 -0600 (CST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: 
References: 
	
Message-ID: <14389.51994.809130.22062@dolphin.mojam.com>

    Greg> The problem occurs when you path is [A, B], the file is in B, and
    Greg> you add something to A on-the-fly. The cache might direct the
    Greg> importer at B, missing your file.

Typically your path will be relatively short (< 20 directories), right?
Just stat the directories before consulting the cache.  If any changed since
the last time the cache was built, then invalidate the entire cache (or that
portion of the cached information that is downstream from the first modified
directory).  It's still going to be cheaper than performing listdir for each
directory in the path, and like you said, only require flushes during
development or installation actions.
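
Something like this, roughly (the follow-ups below explain why the mtime
check is shakier than it looks):

    import os, stat

    _dir_cache = {}    # directory path -> (mtime at listing time, entries)

    def listdir_cached(path):
        mtime = os.stat(path)[stat.ST_MTIME]
        cached = _dir_cache.get(path)
        if cached is None or cached[0] != mtime:
            cached = (mtime, os.listdir(path))
            _dir_cache[path] = cached
        return cached[1]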

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...




From skip at mojam.com  Fri Nov 19 23:15:14 1999
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 19 Nov 1999 16:15:14 -0600 (CST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <1269073918-25826188@hypernet.com>
References: 
	<1269073918-25826188@hypernet.com>
Message-ID: <14389.52210.833368.249942@dolphin.mojam.com>

    Gordon> I posted something here about dircache not too long ago.
    Gordon> Essentially, I found it completely unreliable on NT and on Linux
    Gordon> to stat the directory. There was some test code attached.

The modtime of the directory's stat info should only change if you add or
delete entries in the directory.  Were you perhaps expecting changes when
other operations took place, like rewriting an existing file? 

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From skip at mojam.com  Fri Nov 19 23:34:42 1999
From: skip at mojam.com (Skip Montanaro)
Date: Fri, 19 Nov 1999 16:34:42 -0600
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <1269073918-25826188@hypernet.com>
References: 
	<1269073918-25826188@hypernet.com>
Message-ID: <199911192234.QAA24710@dolphin.mojam.com>

Gordon wrote:

    Gordon> I posted something here about dircache not too long ago.
    Gordon> Essentially, I found it completely unreliable on NT and on Linux
    Gordon> to stat the directory. There was some test code attached.

to which I replied:

    Skip> The modtime of the directory's stat info should only change if you
    Skip> add or delete entries in the directory.  Were you perhaps
    Skip> expecting changes when other operations took place, like rewriting
    Skip> an existing file?

I took a couple minutes to write a simple script to check things.  It
created a file, changed its mode, then unlinked it.  I was a bit surprised
that deleting a file didn't appear to change the directory's mod time.  Then
I realized that since file times are only recorded with one-second
precision, you might see no change to the directory's mtime in some
circumstances.  Adding a sleep to the script between directory operations
resolved the apparent inconsistency.  Still, as Gordon stated, you probably
can't count on directory modtimes to tell you when to invalidate the cache.
It's consistent, just not reliable...
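
(A rough reconstruction of the kind of check described above -- not the
actual script:)

    import os, stat, time

    def dir_mtime(path):
        return os.stat(path)[stat.ST_MTIME]

    print(dir_mtime("."))
    open("tmpfile", "w").close()
    time.sleep(1)                  # without the sleeps, 1-second (or 2-second
    print(dir_mtime("."))          # FAT) timestamps can hide the change
    os.chmod("tmpfile", 0o644)
    time.sleep(1)
    print(dir_mtime("."))
    os.unlink("tmpfile")
    time.sleep(1)
    print(dir_mtime("."))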

if-we-slow-import-down-enough-we-can-use-this-trick-though-ly y'rs,

Skip Montanaro | http://www.mojam.com/
skip at mojam.com | http://www.musi-cal.com/
847-971-7098   | Python: Programming the way Guido indented...



From mhammond at skippinet.com.au  Sat Nov 20 01:04:28 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sat, 20 Nov 1999 11:04:28 +1100
Subject: [Python-Dev] Another 1.6 wish
In-Reply-To: <11A17AA2B9EAD111BCEA00A0C9B4179303385C08@molach.origin.ea.com>
Message-ID: <005f01bf32ea$d0b82b90$0501a8c0@bobcat>

> Also, we have a couple minor convenience functions for Python in an
> MSDEV environment, an exposure of OutputDebugString for writing to
> the DevStudio log window and a means of tripping DevStudio C/C++
layer
> breakpoints from Python code (currently experimental).  The msvcrt
> module seems like a likely candidate for these, would these be
> welcome additions?

These are both available in the win32api module.  They don't really fit
in the "msvcrt" module, as they are not part of the C runtime library,
but the win32 API itself.
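
For example (with the win32 extensions installed):

    import win32api
    win32api.OutputDebugString("hello from Python\n")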

This is really a pointer to the fact that some or all of the win32api
should be moved into the core - registry access is the thing people
most want, but there are plenty of other useful things that people
regularly use...

Guido objects to the coding style, but hopefully that won't be a big
issue.  IMO, the coding style isn't "bad" - it is just more an "MS"
flavour than a "Python" flavour - presumably people reading the code
will have some experience with Windows, so it won't look completely
foreign to them.  The good thing about taking it "as-is" is that it
has been fairly well bashed on over a few years, so is really quite
stable.  The final "coding style" issue is that there are no "doc
strings" - all documentation is embedded in C comments, and extracted
using a tool called "autoduck" (similar to "autodoc").  However, I'm
sure we can arrange something there, too.

Mark.




From jcw at equi4.com  Sat Nov 20 01:21:43 1999
From: jcw at equi4.com (Jean-Claude Wippler)
Date: Sat, 20 Nov 1999 01:21:43 +0100
Subject: [Python-Dev] Import redesign [LONG]
References: 
		<1269073918-25826188@hypernet.com> <199911192234.QAA24710@dolphin.mojam.com>
Message-ID: <3835E997.8A4F5BC5@equi4.com>

Skip Montanaro wrote:
>
[dir stat cache times]
> I took a couple minutes to write a simple script to check things.  It
> created a file, changed its mode, then unlinked it.  I was a bit
> surprised that deleting a file didn't appear to change the directory's
> mod time.  Then I realized that since file times are only recorded
> with one-second

Or two, on Windows with older (FAT, as opposed to VFAT) file systems.

> precision, you might see no change to the directory's mtime in some
> circumstances.  Adding a sleep to the script between directory
> operations resolved the apparent inconsistency.  Still, as Gordon
> stated, you probably can't count on directory modtimes to tell you
> when to invalidate the cache. It's consistent, just not reliable...
> 
> if-we-slow-import-down-enough-we-can-use-this-trick-though-ly y'rs,

If the dir stat time is less than 2 seconds ago, flush - always.

If the dir stat time says it hasn't been changed for at least 2 seconds
then you can cache all entries and trust that any change is detected.
In other words: take the *current* time into account, then it can work.

I think.  Maybe.  Until you get into network drives and clock skew...
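
Concretely, something along these lines (untested; the names are mine, and
the 2-second slop is just the coarsest timestamp granularity mentioned
above):

    import os, stat, time

    SLOP = 2.0      # coarsest timestamp granularity we expect (FAT)

    def cache_usable(dirname):
        # Only trust a cached listing when the directory's mtime is at
        # least SLOP seconds in the past; a more recent change could
        # still be hidden by the timestamp granularity, so flush.
        mtime = os.stat(dirname)[stat.ST_MTIME]
        return time.time() - mtime >= SLOP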

-- Jean-Claude



From gmcm at hypernet.com  Sat Nov 20 04:43:32 1999
From: gmcm at hypernet.com (Gordon McMillan)
Date: Fri, 19 Nov 1999 22:43:32 -0500
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <3835E997.8A4F5BC5@equi4.com>
Message-ID: <1269053086-27079185@hypernet.com>

Jean-Claude wrote:
> Skip Montanaro wrote:
> >
> [dir stat cache times]
> > ...  Then I realized that since
> > file times are only recorded with one-second
> 
> Or two, on Windows with older (FAT, as opposed to VFAT) file
> systems.

Oh lordy, it gets worse. 

With a time.sleep(1.0) between new files, Linux detects the 
change in the dir's mtime immediately. Cool.

On NT, I get an average 2.0 sec delay. But sometimes it 
doesn't detect the change even after 100 secs (and my script quits). Then 
I added a stat of some file in the directory before the stat of 
the directory (not the file I added). Now it acts just like Linux - 
no delay (on both FAT and NTFS partitions). OK...

> I think.  Maybe.  Until you get into network drives and clock
> skew...

No success whatsoever in either direction across Samba. In 
fact the mtime of my Linux home directory as seen from NT is 
Jan 1, 1980.

- Gordon



From gstein at lyra.org  Sat Nov 20 13:06:48 1999
From: gstein at lyra.org (Greg Stein)
Date: Sat, 20 Nov 1999 04:06:48 -0800 (PST)
Subject: [Python-Dev] updated imputil
Message-ID: 

I've updated imputil... The main changes are that I added SysPathImporter
and BuiltinImporter. I also did some restructuring to help with
bootstrapping the module (removing the dependence on os.py).

For testing a revamped Python import system, you can import the module
and call imputil._test_revamp() to set it up. This will load normal,
builtin, and frozen modules via imputil. Dynamic modules are still
handled by Python, however.
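
For example, a quick interactive test looks something like this (the
modules imported at the end are arbitrary):

    import imputil
    imputil._test_revamp()      # installs the Importer-based import hooks

    # imports after this point go through imputil; dynamic (.so/.pyd)
    # modules are still handled by the builtin mechanism, as noted above
    import string
    import copy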

I ran a timing comparison of importing all modules in /usr/lib/python1.5
(using standard and imputil-based importing). The standard mechanism can
do it in about 8.8 seconds. Through imputil, it does it in about 13.0
seconds. Note that I haven't profiled/optimized any of the Importer stuff
(yet).

The point about dynamic modules actually uncovered a basic problem that I
need to resolve now. The current imputil assumes that if a particular
Importer loaded the top-level module in a package, then that Importer is
responsible for loading all other modules within that package. In my
particular test, I tried to import "xml.parsers.pyexpat". The two package
modules were handled by SysPathImporter. The pyexpat module is a dynamic
load module, so it is *not* handled by the Importer -- bam. Failure.

Basically, each part of "xml.parsers.pyexpat" may need to use a different
Importer...

Off to ponder,
-g

-- 
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Sat Nov 20 13:11:37 1999
From: gstein at lyra.org (Greg Stein)
Date: Sat, 20 Nov 1999 04:11:37 -0800 (PST)
Subject: [Python-Dev] updated imputil
In-Reply-To: 
Message-ID: 

oops... forgot:

   http://www.lyra.org/greg/python/imputil.py

-g

-- 
Greg Stein, http://www.lyra.org/




From skip at mojam.com  Sat Nov 20 15:16:58 1999
From: skip at mojam.com (Skip Montanaro)
Date: Sat, 20 Nov 1999 08:16:58 -0600 (CST)
Subject: [Python-Dev] Import redesign [LONG]
In-Reply-To: <1269053086-27079185@hypernet.com>
References: <3835E997.8A4F5BC5@equi4.com>
	<1269053086-27079185@hypernet.com>
Message-ID: <14390.44378.83128.546732@dolphin.mojam.com>

    Gordon> No success whatsoever in either direction across Samba. In fact
    Gordon> the mtime of my Linux home directory as seen from NT is Jan 1,
    Gordon> 1980.

Ain't life grand? :-(

Ah, well, it was a nice idea...

S



From jim at interet.com  Mon Nov 22 17:43:39 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Mon, 22 Nov 1999 11:43:39 -0500
Subject: [Python-Dev] Import redesign [LONG]
References: 
Message-ID: <383972BB.C65DEB26@interet.com>

Greg Stein wrote:
> 
> I would suggest that both retain their *exact* meaning. We introduce
> sys.importers -- a list of importers to check, in sequence. The first
> importer on that list uses sys.path to look for and load modules. The
> second importer loads builtins and frozen code (i.e. modules not on
> sys.path).

We should retain the current order.  I think it is:
first builtin, next frozen, next sys.path.
I really think frozen modules should be loaded in preference
to sys.path.  After all, they are compiled in.
 
> Users can insert/append new importers or alter sys.path as before.

I agree with Greg that sys.path should remain as it is.  A list
of importers can add the extra functionality.  Users will
probably want to adjust the order of the list.

> > Implementation:
> > ---------------
> >
> > - There must clearly be some code in C that can import certain
> >   essential modules (to solve the chicken-or-egg problem), but I don't
> >   mind if the majority of the implementation is written in Python.
> >   Using Python makes it easy to subclass.
> 
> I posited once before that the cost of import is mostly I/O rather than
> CPU, so using Python should not be an issue. MAL demonstrated that a good
> design for the Importer classes is also required. Based on this, I'm a
> *strong* advocate of moving as much as possible into Python (to get
> Python's ease-of-coding with little relative cost).

Yes, I agree.  And I think the main() should be written in Python.  Lots
of Python should be written in Python.

> The (core) C code should be able to search a path for a module and import
> it. It does not require dynamic loading or packages. This will be used to
> import exceptions.py, then imputil.py, then site.py.

But these can be frozen in (as you mention below).  I dislike depending
on sys.path to load essential modules.  If they are not frozen in,
then we need a command line argument to specify their path, with
sys.path used otherwise.
 
Jim Ahlstrom



From jim at interet.com  Mon Nov 22 18:25:46 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Mon, 22 Nov 1999 12:25:46 -0500
Subject: [Python-Dev] Import redesign (was: Python 1.6 status)
References: <1269144272-21594530@hypernet.com>
Message-ID: <38397C9A.DF6B7112@interet.com>

Gordon McMillan wrote:

> [JimA]
> > Think about multiple packages in multiple zip files.  The zip
> > files store file directories.  That means we would need a
> > sys.zippath to search the zip files.  I don't want another
> > PYTHONPATH phenomenon.
> 
> What if sys.path looked like:
>  [DirImporter('.'), ZlibImporter('c:/python/stdlib.pyz'), ...]

Well, that changes the current meaning of sys.path.
 
> > > > I suggest that archive files MUST be put into a known
> > > > directory.
> 
> No way. Hard code a directory? Overwrite someone else's
> Python "standalone"? Write to a C: partition that is
> deliberately sized to hold nothing but Windows? Make
> network installations impossible?

Ooops.  I didn't mean a known directory you couldn't change.
But I did mean a directory you shouldn't change.

But you are right.  The directory should be configurable.  But
I would still like to see a highly encouraged directory.  I
don't yet have a good design for this.  Anyone have ideas on an
official way to find library files?

I think a Python library file is a Good Thing, but it is not useful if
the archive can't be found.

I am thinking of a busy SysAdmin with someone nagging him/her to
install Python.  SysAdmin doesn't want another headache.  What if
Python becomes popular and users want it on Unix and PCs?  More
work!  There should be a standard way to do this that just works
and is dumb-stupid-simple.  This is a Python promotion issue.  Yes
everyone here can make sys.path work, but that is not the point.

> The official Windows solution is stuff in registry about app
> paths and such. Putting the dlls in the exe's directory is a
> workaround which works and is more managable than the
> official solution.

I agree completely.
 
> > > > We should also have the ability to append archive files to
> > > > the executable or a shared library assuming the OS allows
> > > > this
> 
> That's a handy trick on Windows, but it's got nothing to do
> with Python.

It also works on Linux.  I don't know about other systems.
 
> Flexibility. You can put Christian's favorite Einstein quote here
> too.

I hope we can still have ease of use with all this flexibility.
As I said, we need to promote Python.
 
Jim Ahlstrom



From mal at lemburg.com  Tue Nov 23 14:32:42 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Tue, 23 Nov 1999 14:32:42 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.8
References: <382C0A54.E6E8328D@lemburg.com> <382D625B.DC14DBDE@lemburg.com> <38316685.7977448D@lemburg.com> <3834425A.8E9C3B7E@lemburg.com>
Message-ID: <383A977A.C20E6518@lemburg.com>

FYI, I've uploaded a new version of the proposal which includes
the encodings package, definition of the 'raw unicode escape'
encoding (available via e.g. ur""), Unicode format strings and
a new method .breaklines().

The latest version of the proposal is available at:

        http://starship.skyport.net/~lemburg/unicode-proposal.txt

Older versions are available as:

        http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt

Some POD (points of discussion) that are still open:

* Stream readers:

  What about .readline(), .readlines() ? These could be implemented
  using .read() as generic functions instead of requiring their
  implementation by all codecs (see the sketch after this list). Also
  see Line Breaks.

* Python interface for the Unicode property database

* What other special Unicode formatting characters should be
  enhanced to work with Unicode input ? Currently only the
  following special semantics are defined:

    u"%s %s" % (u"abc", "abc") should return u"abc abc".


Pretty quiet around here lately...
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    38 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From jcw at equi4.com  Tue Nov 23 16:17:36 1999
From: jcw at equi4.com (Jean-Claude Wippler)
Date: Tue, 23 Nov 1999 16:17:36 +0100
Subject: [Python-Dev] New thread ideas in Perl-land
Message-ID: <383AB010.DD46A1FB@equi4.com>

Just got a note about a paper on a new way of dealing with threads, as
presented to the Perl-Porters list.  The idea is described in:
	http://www.cpan.org/modules/by-authors/id/G/GB/GBARTELS/thread_0001.txt

I have no time to dive in, comment, or even judge the relevance of this,
but perhaps someone else on this list wishes to check it out.

The author of this is Greg London .

-- Jean-Claude



From mhammond at skippinet.com.au  Tue Nov 23 23:45:14 1999
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 24 Nov 1999 09:45:14 +1100
Subject: [Python-Dev] Unicode Proposal: Version 0.8
In-Reply-To: <383A977A.C20E6518@lemburg.com>
Message-ID: <002301bf3604$68fd8f00$0501a8c0@bobcat>

> Pretty quiet around here lately...

My guess is that most positions and opinions have been covered.  It is
now probably time for less talk, and more code!

Is it time to start an implementation plan?  Do we start with /F's
Unicode implementation (which /G *smirk* seemed to approve of)?  Who
does what?  When can we start to play with it?

And a key point that seems to have been thrust in our faces at the
start and hardly mentioned recently - does the proposal as it stands
meet our sponsor's (HP) requirements?

Mark.




From gstein at lyra.org  Wed Nov 24 01:40:44 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 23 Nov 1999 16:40:44 -0800 (PST)
Subject: [Python-Dev] Re: updated imputil
In-Reply-To: 
Message-ID: 

 :-)

On Sat, 20 Nov 1999, Greg Stein wrote:
>...
> The point about dynamic modules actually discovered a basic problem that I
> need to resolve now. The current imputil assumes that if a particular
> Importer loaded the top-level module in a package, then that Importer is
> responsible for loading all other modules within that package. In my
> particular test, I tried to import "xml.parsers.pyexpat". The two package
> modules were handled by SysPathImporter. The pyexpat module is a dynamic
> load module, so it is *not* handled by the Importer -- bam. Failure.
> 
> Basically, each part of "xml.parsers.pyexpat" may need to use a different
> Importer...

I've thought about this and decided the issue is with my particular
Importer, rather than the imputil design. The PathImporter traverses a set
of paths and establishes a package hierarchy based on a filesystem layout.
It should be able to load dynamic modules from within that filesystem
area.

A couple alternatives, and why I don't believe they work as well:

* A separate importer to just load dynamic libraries: this would need to
  replicate PathImporter's mapping of Python module/package hierarchy onto
  the filesystem. There would also be a sequencing issue because one
  Importer's paths would be searched before the other's paths. Current
  Python import rules establish that a module earlier in sys.path
  (whether a dyn-lib or not) is loaded before one later in the path. This
  behavior could be broken if two Importers were used.

* A design whereby other types of modules can be placed into the
  filesystem and multiple Importers are used to load parts of the path
  (e.g. PathImporter for xml.parsers and DynLibImporter for pyexpat). This
  design doesn't work well because the mapping of Python module/package to
  the filesystem is established by PathImporter -- trying to mix a "private"
  mapping design among Importers creates too much coupling.


There is also an argument that the design is fundamentally incorrect :-).
I would argue against that, however. I'm not sure what form an argument
*against* imputil would take, so I'm not sure how to preempt it :-). But we
can get an idea of various arguments by hypothesizing different scenarios
and requiring that the imputil design satisfies them.

The two alternatives above examined the use of a secondary
Importer to load things out of the filesystem (and explained why two
Importers, in whatever configuration, are not a good thing). Let's state for
argument's sake that files of some type T must be placeable within the
filesystem (i.e. according to the layout defined by PathImporter). We'll
also say that PathImporter doesn't understand T, since the latter was
designed later or is private to some app. The way to solve this is to
allow PathImporter to recognize it through some configuration of the
instance (e.g. self.recognized_types). A set of hooks in the PathImporter
would then understand how to map files of type T to a code or module
object. (alternatively, a generalized set of hooks at the Importer class
level) Note that you could easily have a utility function that scans
sys.importers for a PathImporter instance and adds the data to recognize a
new type -- this would allow for simple installation of new types.
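
To make that concrete, the utility function could look something like this
(only PathImporter and sys.importers come from the discussion above; every
other name is made up for illustration):

    import sys
    from imputil import PathImporter    # assuming that's where it lives

    def register_suffix(suffix, handler):
        # handler(filename, fqname) -> module (or code object); the
        # PathImporter would call it for files matching the suffix.
        for importer in sys.importers:
            if isinstance(importer, PathImporter):
                importer.recognized_types[suffix] = handler
                return
        raise ImportError, "no PathImporter found on sys.importers"

    # e.g. teach the importer about some private ".tmpl" module type
    def load_template(filename, fqname):
        pass    # ...build and return a module object from the file...

    register_suffix(".tmpl", load_template)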

Note that PathImporter inherently defines a 1:1 mapping from a module to a
file. Archives (zip or jar files) cannot be recognized and handled by
PathImporter. An archive defines an entirely different style of mapping
between a module/package and a file in the filesystem. Of course, an
Importer that uses archives can certainly look for them in sys.path.

The imputil design is derived directly from the "import" statement. "Here
is a module/package name, give me a module."  (this is embodied in the
get_code() method in Importer)

The find/load design established by ihooks is very filesystem-based. In
many situations, find and load are deeply intertwined. If you want to take the
URL case, then just examine the actual network activity -- preferably, you
want a single transaction (e.g. one HTTP GET). Find/load implies two
transactions. With nifty context handling between the two steps, you can
get away with a single transaction. But the point is that the design
requires you to work around its inherent two-step mechanism and
establish a single step. This is weird, of course, because importing is
never *just* a find or a load, but always both.

Well... since I've satisfied myself that PathImporter needs to load
dynamic lib modules, I'm off to code it...

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From gstein at lyra.org  Wed Nov 24 02:45:29 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 23 Nov 1999 17:45:29 -0800 (PST)
Subject: [Python-Dev] breaking out code for dynamic loading
Message-ID: 

Guido,

I can't find the message, but it seems that at some point you mentioned
wanting to break out importdl.c into separate files. The configure process
could then select the appropriate one to use for the platform.

Sounded great until I looked at importdl.c. There are 13 variants of
dynamic loading. That would imply 13 separate files/modules.

I'd be happy to break these out, but are you actually interested in that
many resulting modules? If so, then any suggestions for naming?
(e.g. aix_dynload, win32_dynload, mac_dynload)

Here are the variants:

* NeXT, using FVM shlibs             (USE_RLD)
* NeXT, using frameworks             (USE_DYLD)
* dl / GNU dld                       (USE_DL)
* SunOS, IRIX 5 shared libs          (USE_SHLIB)
* AIX dynamic linking                (_AIX)
* Win32 platform                     (MS_WIN32)
* Win16 platform                     (MS_WIN16)
* OS/2 dynamic linking               (PYOS_OS2)
* Mac CFM                            (USE_MAC_DYNAMIC_LOADING)
* HP/UX dyn linking                  (hpux)
* NetBSD shared libs                 (__NetBSD__)
* FreeBSD shared libs                (__FreeBSD__)
* BeOS shared libs                   (__BEOS__)


Could I suggest a new top-level directory in the Python distribution named
"Platform"? Move BeOS, PC, and PCbuild in there (bring back Mac?). Add new
directories for each of the above platforms and move the appropriate
portion of importdl.c into there as a Python C Extension Module. (the
module would still be statically linked into the interpreter!)

./configure could select the module and write a Setup.dynload, much like
it does with Setup.thread.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/





From gstein at lyra.org  Wed Nov 24 03:43:50 1999
From: gstein at lyra.org (Greg Stein)
Date: Tue, 23 Nov 1999 18:43:50 -0800 (PST)
Subject: [Python-Dev] another round of imputil work completed
In-Reply-To: 
Message-ID: 

On Tue, 23 Nov 1999, Greg Stein wrote:
>...
> Well... since I've satisfied to myself that PathImporter needs to load
> dynamic lib modules, I'm off to code it...

All right. imputil.py now comes with code to emulate the builtin Python
import mechanism. It loads all the same types of files, uses sys.path, and
(pointed out by JimA) loads builtins before looking on the path.

The only "feature" it doesn't support is using package.__path__ to look
for submodules. I never liked that thing, so it isn't in there.
(imputil *does* set the __path__ attribute, tho)

Code is available at:

   http://www.lyra.org/greg/python/imputil.py


Next step is to add a "standard" library/archive format. JimA and I have
been tossing some stuff back and forth on this.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/




From mal at lemburg.com  Wed Nov 24 09:34:52 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 24 Nov 1999 09:34:52 +0100
Subject: [Python-Dev] Unicode Proposal: Version 0.8
References: <002301bf3604$68fd8f00$0501a8c0@bobcat>
Message-ID: <383BA32C.2E6F4780@lemburg.com>

Mark Hammond wrote:
> 
> > Pretty quiet around here lately...
> 
> My guess is that most positions and opinions have been covered.  It is
> now probably time for less talk, and more code!

Or that everybody is on holidays... like Guido.
 
> It is time to start an implementation plan?  Do we start with /F's
> Unicode implementation (which /G *smirk* seemed to approve of)?  Who
> does what?  When can we start to play with it?

This depends on whether HP agrees on the current specs. If they
do, there should be code by mid December, I guess.
 
> And a key point that seems to have been thrust in our faces at the
> start and hardly mentioned recently - does the proposal as it stands
> meet our sponsor's (HP) requirements?

Haven't heard anything from them yet (this is probably mainly
due to Guido being offline).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    37 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From mal at lemburg.com  Wed Nov 24 10:32:46 1999
From: mal at lemburg.com (M.-A. Lemburg)
Date: Wed, 24 Nov 1999 10:32:46 +0100
Subject: [Python-Dev] Import Design
Message-ID: <383BB0BE.BF116A28@lemburg.com>

Before hooking on to some more PathBuiltinImporters ;-), I'd like
to spawn a thread leading in a different direction...

There has been some discussion of what we really expect the
import mechanism to be able to do. Here's a summary of what I
think we need:

* compatibility with the existing import mechanism

* imports from library archives (e.g. .pyl or .par-files)

* a modified intra package import lookup scheme (the thingy
  which I call "walk-me-up-Scotty" patch -- see previous posts)

And for some fancy stuff:

* imports from URLs (e.g. these could be put on the path for
  automatic inclusion in the import scan or be passed explicitly
  to __import__)

* a (file based) static lookup cache to enhance lookup
  performance, which is enabled via a command line switch
  (rather than being enabled by default), so that the
  user can decide whether to apply this optimization or
  not

The point I want to make is: there aren't all that many features
we are really looking for, so why not incorporate these into
the builtin importer and only *then* start thinking about
schemes for hooks, managers, etc. ?!

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    37 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




From captainrobbo at yahoo.com  Wed Nov 24 12:40:16 1999
From: captainrobbo at yahoo.com (=?iso-8859-1?q?Andy=20Robinson?=)
Date: Wed, 24 Nov 1999 03:40:16 -0800 (PST)
Subject: [Python-Dev] Unicode Proposal: Version 0.8
Message-ID: <19991124114016.7706.rocketmail@web601.mail.yahoo.com>

--- Mark Hammond  wrote:
> > Pretty quiet around here lately...
> 
> My guess is that most positions and opinions have
> been covered.  It is
> now probably time for less talk, and more code!
> 
> It is time to start an implementation plan?  Do we
> start with /F's
> Unicode implementation (which /G *smirk* seemed to
> approve of)?  Who
> does what?  When can we start to play with it?
> 
> And a key point that seems to have been thrust in
> our faces at the
> start and hardly mentioned recently - does the
> proposal as it stands
> meet our sponsor's (HP) requirements?
> 
> Mark.

I had a long chat with them on Friday :-)  They want
it done, but nobody is actively working on it now as
far as I can tell, and they are very busy.

The per-thread thing was a red herring - they just
want to be able to do (for example) web servers
handling different encodings from a central unicode
database, so per-output-stream works just fine.

They will be at IPC8; I'd suggest we do a round of
prototyping, insist they read it and then discuss
it at IPC8, and be prepared to rework things
thereafter.  Hopefully then we'll have a plan on
how to tackle the much larger (but less
interesting to python-dev) job of writing and
verifying all the codecs and utilities.


Andy Robinson



=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.




From jim at interet.com  Wed Nov 24 15:43:57 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Wed, 24 Nov 1999 09:43:57 -0500
Subject: [Python-Dev] Re: updated imputil
References: 
Message-ID: <383BF9AD.E183FB98@interet.com>

Greg Stein wrote:
> * A separate importer to just load dynamic libraries: this would need to
>   replicate PathImporter's mapping of Python module/package hierarchy onto
>   the filesystem. There would also be a sequencing issue because one
>   Importer's paths would be searched before the other's paths. Current
>   Python import rules establishes that a module earlier in sys.path
>   (whether a dyn-lib or not) is loaded before one later in the path. This
>   behavior could be broken if two Importers were used.

I would like to argue that on Windows, import of dynamic libraries is
broken.  If a file something.pyd is imported, then sys.path is searched
to find the module.  If a file something.dll is imported, the same thing
happens.  But Windows defines its own search order for *.dll files, which
Python ignores.  I would suggest that this is wrong for files named
*.dll, but OK for files named *.pyd.

A SysAdmin should be able to install and maintain *.dll as she has
been trained to do.  This makes maintaining Python installations
simpler and less surprising.

I have no solution to the backward compatibility problem.  But the
code is only a couple lines.  A LoadLibrary() call does its own
path searching.

Jim Ahlstrom



From jim at interet.com  Wed Nov 24 16:06:17 1999
From: jim at interet.com (James C. Ahlstrom)
Date: Wed, 24 Nov 1999 10:06:17 -0500
Subject: [Python-Dev] Import Design
References: <383BB0BE.BF116A28@lemburg.com>
Message-ID: <383BFEE9.B4FE1F19@interet.com>

"M.-A. Lemburg" wrote:

> The point I want to make is: there aren't all that many features
> we are really looking for, so why not incorporate these into
> the builtin importer and only *then* start thinking about
> schemes for hooks, managers, etc. ?!

Marc has made this point before, and I think it should be
considered carefully.  It is a lot of work to re-create the
current import logic in Python and it is almost guaranteed
to be slower.  So why do it?

I like imputil.py because it leads
to very simple Python installations.  I view this as
a Python promotion issue.  If we have a boot mechanism plus
archive files, we can have few-file Python installations
with package addition being just adding another file.

But at least some of this code must be in C.  I volunteer to
write the rest of it in C if that is what people want.  But it
would add two hundred more lines of code to import.c.  So
maybe now is the time to switch to imputil, instead of waiting
for later.

But I am indifferent as long as I can tell a Python user
to just put an archive file libpy.pyl in his Python directory
and everything will Just Work.

Jim Ahlstrom



From bwarsaw at python.org  Tue Nov 30 21:23:40 1999
From: bwarsaw at python.org (Barry Warsaw)
Date: Tue, 30 Nov 1999 15:23:40 -0500 (EST)
Subject: [Python-Dev] CFP Developers' Day - 8th International Python Conference
Message-ID: <14404.12876.847116.288848@anthem.cnri.reston.va.us>

Hello Python Developers!

Thursday January 27 2000, the final day of the 8th International
Python Conference, is Developers' Day, where Python hackers get
together to discuss and reach agreements on the outstanding issues
facing Python.  This is also your once-a-year chance for face-to-face
interactions with Python's creator Guido van Rossum and other
experienced Python developers.

To make Developers' Day a success, we need you!  We're looking for a
few good champions to lead topic sessions.  As a champion, you will
choose a topic that fires you up and write a short position paper for
publication on the web prior to the conference.  You'll also prepare
introductory material for the topic overview session, and lead a 90
minute topic breakout group.

We've had great champions and topics in previous years, and many
features of today's Python had their start at past Developers' Days.
This is your chance to help shape the future of Python for 1.6,
2.0 and beyond.

If you are interested in becoming a topic champion, you must email me
by Wednesday December 15, 1999.  For more information, please visit
the IPC8 Developers' Day web page at

    

This page has more detail on schedule, suggested topics, important
dates, etc.  To volunteer as a champion, or to ask other questions,
you can email me at bwarsaw at python.org.

-Barry