From martin@loewis.home.cs.tu-berlin.de Sun Feb 4 15:13:21 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sun, 4 Feb 2001 16:13:21 +0100 Subject: [I18n-sig] Re: [4suite] 4Suite 0.10.2 alpha 1 In-Reply-To: <3A7D15EB.970D327E@fourthought.com> (message from Uche Ogbuji on Sun, 04 Feb 2001 01:42:19 -0700) References: <3A7D15EB.970D327E@fourthought.com> Message-ID: <200102041513.f14FDLZ01273@mira.informatik.hu-berlin.de> > Please test the new internationalization: French and German translations > have been added courtesy of Alexandre and Martin. This is indeed causing problems for me. Invoking 4xslt gives: Traceback (most recent call last): File "/usr/local/bin/4xslt", line 4, in ? from xml.xslt import _4xslt File "/usr/local/lib/python2.1/site-packages/_xmlplus/xslt/__init__.py", line 16, in ? from xml import xpath File "/usr/local/lib/python2.1/site-packages/_xmlplus/xpath/__init__.py", line 41, in ? import XPathParserBase File "/usr/local/lib/python2.1/site-packages/_xmlplus/xpath/XPathParserBase.py", line 7, in ? gettext.install('4Suite', locale_dir) File "/usr/local/lib/python2.1/gettext.py", line 251, in install translation(domain, localedir).install(unicode) File "/usr/local/lib/python2.1/gettext.py", line 238, in translation raise IOError(ENOENT, 'No translation file found for domain', domain) IOError: [Errno 2] No translation file found for domain: '4Suite' The problem is two-fold: For one thing, there is no German xpath message catalog. However, it shouldn't fail if LANG is set to an unsupported language, so you should catch IOError also. I consider this a gettext bug: gettext should not fail in the absence of a catalog, but default to the "C" locale. Regards, Martin From paulp@ActiveState.com Tue Feb 6 14:49:09 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 06:49:09 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model Message-ID: <3A800EE5.A8122B3C@ActiveState.com> I went to a very interesting talk about internationalization by Tim Bray, one of the editors of the XML spec and a real expert on i18n. It inspired me to wrestle one more time with the architectural issues in Python that are preventing us from saying that it is a really internationalized language. Those geek cruises aren't just about sun, surf and sand. There's a pretty high level of intellectual give and take also! Email me for more info... Anyhow, we deferred many of these issues (probably out of exhaustion) the last time we talked about it, but we cannot and should not do so forever. In particular, I do not think that we should add more features for working with Unicode (e.g. unichr) before thinking through the issues. --- Abstract Many of the world's written languages have more than 255 characters. Therefore Python is out of date in its insistence that "basic strings" are lists of characters with ordinals between 0 and 255. Python's basic character type must allow at least enough characters for Eastern languages. Problem Description Python's Western bias stems from a variety of issues. The first problem is that Python's native character type is an 8-bit character. You can see that it is an 8-bit character by trying to insert a value with an ordinal higher than 255. Python should allow for ordinal numbers up to at least the size of the character repertoire of a single Eastern language such as Chinese or Japanese. Whenever a Python file object is "read", it returns one of these lists of 8-bit characters. 
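A minimal sketch of that limit, against the Python 2.0/2.1 behaviour described here: the native string type tops out at ordinal 255, while the separate Unicode type does not.

print repr(chr(65))            # 'A'
try:
    chr(0x4E2D)                # a CJK ideograph -- beyond the 8-bit range
except ValueError:
    print "chr() refuses ordinals above 255"
print repr(unichr(0x4E2D))     # u'\u4e2d'
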
The standard file object "read" method can never return a list of Chinese or Japanese characters. This is an unacceptable state of affairs in the 21st century. Goals 1. Python should have a single string type. It should support Eastern characters as well as it does European characters. Operationally speaking: type("") == type(chr(150)) == type(chr(1500)) == type(file.read()) 2. It should be easier and more efficient to encode and decode information being sent to and retrieved from devices. 3. It should remain possible to work with the byte-level representation. This is sometimes useful for performance reasons. Definitions Character Set A character set is a mapping from integers to characters. Note that both integers and characters are abstractions. In other words, a decision to use a particular character set does not in any way mandate a particular implementation or representation for characters. In Python terms, a character set can be thought of as no more or less than a pair of functions: ord() and chr(). ASCII, for instance, is a pair of functions defined only for 0 through 127 and ISO Latin 1 is defined only for 0 through 255. Character sets typically also define a mapping from characters to names of those characters in some natural language (often English) and to a simple graphical representation that native language speakers would recognize. It is not possible to have a concept of "character" without having a character set. After all, characters must be chosen from some repertoire and there must be a mapping from characters to integers (defined by ord). Character Encoding A character encoding is a mechanism for representing characters in terms of bits. Character encodings are only relevant when information is passed from Python to some system that works with the characters in terms of representation rather than abstraction. Just as a Python programmer would not care about the representation of a long integer, they should not care about the representation of a string. Understanding the distinction between an abstract character and its bit-level representation is essential to understanding this Python character model. A Python programmer does not need to know or care whether a long integer is represented as two's complement, one's complement or in terms of ASCII digits. Similarly a Python programmer does not need to know or care how characters are represented in memory. We might even change the representation over time to achieve higher performance. Universal Character Set There is only one standardized international character set that allows for mixed-language information. It is called the Universal Character Set and it is logically defined for characters 0 through 2^32 but practically is deployed for characters 0 through 2^16. The Universal Character Set is an international standard in the sense that it is standardized by ISO and has the force of law in international agreements. A popular subset of the Universal Character Set is called Unicode. The most popular subset of Unicode is called the "Unicode Basic Multilingual Plane (Unicode BMP)". The Unicode BMP has space for all of the world's major languages including Chinese, Korean, Japanese and Vietnamese. There are 2^16 characters in the Unicode BMP. The Unicode BMP subset of UCS is becoming a de facto standard on the Web. In any modern browser you can create an HTML or XML document with the character reference &#301; and get back a rendered version of Unicode character 301. 
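As a quick check with the Unicode type Python already ships, that numeric character reference names the same abstract character that unichr() produces -- a small sketch:

c = unichr(301)
print repr(c)    # u'\u012d', LATIN SMALL LETTER I WITH BREVE
print ord(c)     # 301
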
In other words, Unicode is becoming the de facto character set for the Internet in addition to being the officially mandated character set for international commerce. In addition to defining ord() and chr(), Unicode provides a database of information about characters. Each character has an English-language name, a classification (letter, number, etc.), a "demonstration" glyph and so forth. The Unicode Controversy Unicode is not entirely uncontroversial. In particular there are Japanese speakers who dislike the way Unicode merges characters from various languages that were considered "the same" by the experts that defined the specification. Nevertheless Unicode is in use as the character set for important Japanese software such as the two most popular word processors, Ichitaro and Microsoft Word. Other programming languages have also moved to use Unicode as the basic character set instead of ASCII or ISO Latin 1. From memory, I believe that this is the case for: Java Perl JavaScript Visual Basic TCL XML is also Unicode based. Note that the difference between all of these languages and Python is that Unicode is the *basic* character type. Even when you type ASCII literals, they are immediately converted to Unicode. It is the author's belief that this "running code" is evidence of Unicode's practical applicability. Arguments against it seem more rooted in theory than in practical problems. On the other hand, this belief is informed by those who have done heavy work with Asian characters and not based on my own direct experience. Python Character Set As discussed before, Python's native character set happens to consist of exactly 256 characters. If we increase the size of Python's character set, no existing code would break and there would be no cost in functionality. Given that Unicode is a standard character set and it is richer than Python's, Python should move to that character set. Once Python moves to that character set it will no longer be necessary to have a distinction between "Unicode string" and "regular string." This means that Unicode literals and escape codes can also be merged with ordinary literals and escape codes. unichr can be merged with chr. Character Strings and Byte Arrays Two of the most common constructs in computer science are strings of characters and strings of bytes. A string of bytes can be represented as a string of characters between 0 and 255. Therefore the only reason to have a distinction between Unicode strings and byte strings is for implementation simplicity and performance purposes. This distinction should only be made visible to the average Python programmer in rare circumstances. Advanced Python programmers will sometimes care about true "byte strings". They will sometimes want to build and parse information according to its representation instead of its abstract form. This should be done with byte arrays. It should be possible to read bytes from and write bytes to arrays. It should also be possible to use regular expressions on byte arrays. Character Encodings for I/O Information is typically read from devices such as file systems and network cards one byte at a time. Unicode BMP characters can have values up to 2^16 (or even higher, if you include all of UCS). There is a fundamental disconnect there. Each character cannot be represented as a single byte anymore. To solve this problem, there are several "encodings" for large characters that describe how to represent them as series of bytes. Unfortunately, there is not one, single, dominant encoding. 
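A small sketch with the codecs that Python 2.0/2.1 already ships makes the disconnect concrete: one character, several byte representations, and sometimes none at all.

ch = u"\u00e9"                        # LATIN SMALL LETTER E WITH ACUTE
print repr(ch.encode("latin-1"))      # '\xe9'      -- one byte
print repr(ch.encode("utf-8"))        # '\xc3\xa9'  -- two bytes
print repr(ch.encode("utf-16-be"))    # '\x00\xe9'  -- two different bytes
try:
    ch.encode("ascii")
except UnicodeError:
    print "no ASCII representation at all"
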
There are at least a dozen popular ones including ASCII (which supports only 0-127), ISO Latin 1 (which supports only 0-255), others in the ISO "extended ASCII" family (which support different European scripts), UTF-8 (used heavily in C programs and on Unix), UTF-16 (preferred by Java and Windows), Shift-JIS (preferred in Japan) and so forth. This means that the only safe way to read data from a file into Python strings is to specify the encoding explicitly. Python's current assumption is that each byte translates into a character of the same ordinal. This is only true for "ISO Latin 1". Python should require the user to specify this explicitly instead. Any code that does I/O should be changed to require the user to specify the encoding that the I/O should use. It is the opinion of the author that there should be no default encoding at all. If you want to read ASCII text, you should specify ASCII explicitly. If you want to read ISO Latin 1, you should specify it explicitly. Once data is read into Python objects the original encoding is irrelevant. This is similar to reading an integer from a binary file, an ASCII file or a packed decimal file. The original bits and bytes representation of the integer is disconnected from the abstract representation of the integer object. Proposed I/O API This encoding could be chosen at various levels. In some applications it may make sense to specify the encoding on every read or write as an extra argument to the read and write methods. In most applications it makes more sense to attach that information to the file object as an attribute and have the read and write methods default the encoding to the property value. This attribute value could be initially set as an extra argument to the "open" function. Here is some Python code demonstrating a proposed API: fileobj = fopen("foo", "r", "ASCII") # only accepts values < 128 fileobj2 = fopen("bar", "r", "ISO Latin 1") # byte-values "as is" fileobj3 = fopen("baz", "r", "UTF-8") fileobj2.encoding = "UTF-16" # changed my mind! data = fileobj2.read(1024, "UTF-8" ) # changed my mind again For efficiency, it should also be possible to read raw bytes into a memory buffer without doing any interpretation: moredata = fileobj2.readbytes(1024) This will generate a byte array, not a character string. This is logically equivalent to reading the file as "ISO Latin 1" (which happens to map bytes to characters with the same ordinals) and generating a byte array by copying characters to bytes but it is much more efficient. Python File Encoding It should be possible to create Python files in any of the common encodings that are backwards compatible with ASCII. This includes ASCII itself, all language-specific "extended ASCII" variants (e.g. ISO Latin 1), Shift-JIS and UTF-8 which can actually encode any UCS character value. The precise variant of "super-ASCII" must be declared with a specialized comment that precedes any other lines other than the shebang line if present. It has a syntax like this: #?encoding="UTF-8" #?encoding="ISO-8859-1" ... #?encoding="ISO-8859-9" #?encoding="Shift_JIS" For now, this is the complete list of legal encodings. Others may be added in the future. Python files which use non-ASCII characters without defining an encoding should be immediately deprecated and made illegal in some future version of Python. C APIs The only time representation matters is when data is being moved from Python's internal model to something outside of Python's control or vice versa. 
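Referring back to the Proposed I/O API above: a rough, hypothetical sketch of fopen() can be built today on top of the codecs module that ships with Python 2.0/2.1 (the encoding names below are the codec registry's spellings, e.g. "latin-1" rather than "ISO Latin 1", and the wrapper itself is only an illustration, not the proposed implementation).

import codecs

def fopen(filename, mode="r", encoding=None):
    # Hypothetical helper: insist on an explicit encoding, then delegate
    # to codecs.open(), which decodes on read and encodes on write.
    if encoding is None:
        raise ValueError("an explicit encoding is required")
    return codecs.open(filename, mode, encoding)

out = fopen("baz.txt", "w", "utf-8")
out.write(u"\u4e2d\u6587")                           # two Chinese characters
out.close()
print repr(fopen("baz.txt", "r", "utf-8").read())    # u'\u4e2d\u6587'
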
Reading and writing from a device is a special case discussed above. Sending information from Python to C code is also an issue. Python already has a rule that allows the automatic conversion of characters up to 255 into their C equivalents. Once the Python character type is expanded, characters outside of that range should trigger an exception (just as converting a large long integer to a C int triggers an exception). Some might claim it is inappropriate to presume that the character-for- byte mapping is the correct "encoding" for information passing from Python to C. It is best not to think of it as an encoding. It is merely the most straightforward mapping from a Python type to a C type. In addition to being straightforward, I claim it is the best thing for several reasons: * It is what Python already does with string objects (but not Unicode objects). * Once I/O is handled "properly", (see above) it should be extremely rare to have characters in strings above 128 that mean anything OTHER than character values. Binary data should go into byte arrays. * It preserves the length of the string so that the length C sees is the same as the length Python sees. * It does not require us to make an arbitrary choice of UTF-8 versus UTF-16. * It means that C extensions can be internationalized by switching from C's char type to a wchar_t and switching from the string format code to the Unicode format code. Python's built-in modules should migrate from char to wchar_t (aka Py_UNICODE) over time. That is, more and more functions should support characters greater than 255 over time. Rough Implementation Requirements Combine String and Unicode Types: The StringType and UnicodeType objects should be aliases for the same object. All PyString_* and PyUnicode_* functions should work with objects of this type. Remove Unicode String Literals Ordinary string literals should allow large character escape codes and generate Unicode string objects. Unicode objects should "repr" themselves as Python string objects. Unicode string literals should be deprecated. Generalize C-level Unicode conversion The format string "S" and the PyString_AsString functions should accept Unicode values and convert them to character arrays by converting each value to its equivalent byte-value. Values greater than 255 should generate an exception. New function: fopen fopen should be like Python's current open function except that it should allow and require an encoding parameter. It should be considered a replacement for open. fopen should return an encoding-aware file object. open should eventually be deprecated. Add byte arrays The regular expression library should be generalized to handle byte arrays without converting them to Python strings. This will allow those who need to work with bytes to do so more efficiently. In general, it should be possible to use byte arrays where-ever it is possible to use strings. Byte arrays could be thought of as a special kind of "limited but efficient" string. Arguably we could go so far as to call them "byte strings" and reuse Python's current string implementation. The primary differences would be in their "repr", "type" and literal syntax. In a sense we would have kept the existing distinction between Unicode strings and 8-bit strings but made Unicode the "default" and provided 8-bit strings as an efficient alternative. 
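The byte-for-character narrowing described above can be expressed with today's codecs, since Latin-1 maps each character 0-255 to the byte with the same ordinal; narrow() below is a hypothetical helper standing in for the proposed "S" / PyString_AsString behaviour.

def narrow(s):
    # Characters 0-255 map to the byte with the same ordinal;
    # anything larger is an error, mirroring the proposal.
    return s.encode("latin-1")

print repr(narrow(u"caf\u00e9"))      # 'caf\xe9' -- length preserved
try:
    narrow(u"\u4e2d")
except UnicodeError:
    print "characters above 255 cannot be narrowed to single bytes"
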
Appendix: Using Non-Unicode character sets Let's presume that a linguistics researcher objected to the unification of Han characters in Unicode and wanted to invent a character set that included separate characters for all Chinese, Japanese and Korean character sets. Perhaps they also want to support some non-standard character set like Klingon. Klingon is actually scheduled to become part of Unicode eventually but let's presume it wasn't. This section will demonstrate that this researcher is no worse off under the new system than they were under historical Python. Adopting Unicode as a standard has no downside for someone in this situation. They have several options under the new system: 1. Ignore Unicode Read in the bytes using the encoding "RAW" which would mean that each byte would be translated into a character between 0 and 255. It would be a synonym for ISO Latin 1. Now you can process the data using exactly the same Python code that you would have used in Python 1.5 through Python 2.0. The only difference is that the in-memory representation of the data MIGHT be less space efficient because Unicode characters MIGHT be implemented internally as 16- or 32-bit integers. This solution is the simplest and easiest to code. 2. Use Byte Arrays As discussed earlier, a byte array is like a string where the characters are restricted to characters between 0 and 255. The only virtues of byte arrays are that they enforce this rule and they can be implemented in a more memory-efficient manner. According to the proposal, it should be possible to load data into a byte array (or "byte string") using the "readbytes" method. This solution is the most efficient. 3. Use Unicode's Private Use Area (PUA) Unicode is an extensible standard. There are certain character codes reserved for private use between consenting parties. You could map characters like Klingon or certain Korean ideographs into the private use area. Obviously the Unicode character database would not have meaningful information about these characters and rendering systems would not know how to render them. But this situation is no worse than in today's Python. There is no character database for arbitrary character sets and there is no automatic way to render them. One limitation to this approach is that the Private Use Area can only handle so many characters. The BMP PUA can hold thousands and if we step up to "full" Unicode support we have room for hundreds of thousands. This solution gets the maximum benefit from Unicode for the characters that are defined by Unicode without losing the ability to refer to characters outside of Unicode. 4. Use A Higher Level Encoding You could wrap Korean characters in ... tags. You could describe a character as \KLINGON-KAHK (i.e. 13 Unicode characters). You could use a special Unicode character as an "escape flag" to say that the next character should be interpreted specially. This solution is the most self-descriptive and extensible. In summary, expanding Python's character type to support Unicode characters does not restrict even the most esoteric, Unicode-hostile types of text processing. Therefore there is no basis for objecting to Unicode as some form of restriction. Those who need to use another logical character set have as much ability to do so as they always have. Conclusion Python needs to support international characters. The "ASCII" of internationalized characters is Unicode. Most other languages have moved or are moving their basic character and string types to support Unicode. 
Python should also. From mal@lemburg.com Tue Feb 6 15:09:46 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 06 Feb 2001 16:09:46 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> Message-ID: <3A8013BA.2FF93E8B@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > [pre-PEP] > > > > You have a lot of good points in there (also some inaccuracies) and > > I agree that Python should move to using Unicode for text data > > and arrays for binary data. > > That's my primary goal. If we can all agree that is the goal then we can > start to design new features with that mind. I'm overjoyed to have you > on board. I'm pretty sure Fredrick agrees with the goals (probably not > every implementation detail). I'll send to i18n sig and see if I can get > buy-in from Andy Robinson et. al. Then it's just Guido. Oh, I think that everybody agrees on moving to Unicode as basic text storage container. The question is how to get there ;-) Today we are facing a problem in that strings are also used as containers for binary data and no distinction is made between the two. We also have to watch out for external interfaces which still use 8-bit character data, so there's a lot ahead. > > Some things you may be missing though is that Python already > > has support for a few features you mention, e.g. codecs.open() > > provide more or less what you have in mind with fopen() and > > the compiler can already unify Unicode and string literals using > > the -U command line option. > > The problem with unifying string literals without unifying string > *types* is that many functions probably check for and type("") not > type(u""). Well, with -U on, Python will compile "" into u"", so you can already test Unicode compatibility today... last I tried, Python didn't even start up :-( > > What you don't talk about in the PEP is that Python's stdlib isn't > > even Unicode aware yet, and whatever unification steps we take, > > this project will have to preceed it. > > I'm not convinced that is true. We should be able to figure it out > quickly though. We can use that knowledge to base future design upon. The problem with many stdlib modules is that they don't make a difference between text and binary data (and often can't, e.g. take sockets), so we'll have to figure out a way to differentiate between the two. We'll also need an easy-to-use binary data type -- as you mention in the PEP, we could take the old string implementation as basis and then perhaps turn u"" into "" and use b"" to mean what "" does now (string object). > > The problem with making the > > stdlib Unicode aware is that of deciding which parts deal with > > text data or binary data -- the code sometimes makes assumptions > > about the nature of the data and at other times it simply doesn't > > care. > > Can you give an example? If the new string type is 100% backwards > compatible in every way with the old string type then the only code that > should break is silly code that did stuff like: > > try: > something = chr( somethingelse ) > except ValueError: > print "Unicode is evil!" > > Note that I expect types.StringType == types(chr(10000)) etc. Sure, but there are interfaces which don't differentiate between text and binary data, e.g. many IO-operations don't care about what exactly they are writing or reading. 
We'd probably define a new set of text data APIs (meaning methods) to make this difference clear and visible, e.g. .writetext() and .readtext(). > > In this light I think you ought to focus Python 3k with your > > PEP. This will also enable better merging techniques due to the > > lifting of the type/class difference. > > Python3K is a beautiful dream but we have problems we need to solve > today. We could start moving to a Unicode future in baby steps right > now. Your "open" function could be moved into builtins as "fopen". > Python's "binary" open function could be deprecated under its current > name and perhaps renamed. Hmm, I'd prefer to keep things separate for a while and then switch over to new APIs once we get used to them. > The sooner we start the sooner we finish. You and /F laid some beautiful > groundwork. Now we just need to keep up the momentum. I think we can do > this without a big backwards compatibility earthquake. VB and TCL > figured out how to do it... ... and we should probably try to learn from them. They have put a considerable amount of work into getting the low-level interfacing issues straight. It would be nice if we could avoid adding more conversion magic... -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Tue Feb 6 15:54:49 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 07:54:49 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> Message-ID: <3A801E49.F8DF70E2@ActiveState.com> "M.-A. Lemburg" wrote: > > ... > > Oh, I think that everybody agrees on moving to Unicode as > basic text storage container. The last time we went around there was an anti-Unicode faction who argued that adding Unicode support was fine but making it the default would inconvenience Japanese users. > ... > Well, with -U on, Python will compile "" into u"", so you can > already test Unicode compatibility today... last I tried, Python > didn't even start up :-( I'm going to say again that I don't see that as a test of Unicode-compatibility. It is a test of compatibility with our existing Unicode object. If we simply allowed string objects to support higher character numbers I *cannot see* how that could break existing code. > ... > We can use that knowledge to base future design upon. The problem > with many stdlib modules is that they don't make a difference > between text and binary data (and often can't, e.g. take sockets), > so we'll have to figure out a way to differentiate between the > two. We'll also need an easy-to-use binary data type -- as you > mention in the PEP, we could take the old string implementation > as basis and then perhaps turn u"" into "" and use b"" to mean > what "" does now (string object). I agree that we need all of this but I strongly disagree that there is any dependency relationship between improving the Unicode-awareness of I/O routines (sockets and files) and allowing string objects to support higher character numbers. I claim that allowing higher character numbers in strings will not break socket objects. It might simply be the case that for a while socket objects never create these higher charcters. 
Similarly, we could improve socket objects so that they have different readtext/readbinary and writetext/writebinary without unifying the string objects. There are lots of small changes we can make without breaking anything. One I would like to see right now is a unification of chr() and unichr(). We are just making life harder for ourselves by walking further and further down one path when "everyone agrees" that we are eventually going to end up on another path. > ... It would be nice if we could avoid > adding more conversion magic... We already have more "magic" in our conversions than we need. I don't think I'm proposing any new conversions. Paul Prescod From tdickenson@geminidataloggers.com Tue Feb 6 16:54:22 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Tue, 06 Feb 2001 16:54:22 +0000 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A800EE5.A8122B3C@ActiveState.com> References: <3A800EE5.A8122B3C@ActiveState.com> Message-ID: Its annoying (for me) that the discussion of this is happening on python-dev, rather than the i18n-sig list. should I join python-dev list too? Toby Dickenson tdickenson@geminidataloggers.com From mal@lemburg.com Tue Feb 6 17:43:05 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 06 Feb 2001 18:43:05 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> Message-ID: <3A8037A9.2E842800@lemburg.com> [Moving the follow ups to i18n-sig...] Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > ... > > > > Oh, I think that everybody agrees on moving to Unicode as > > basic text storage container. > > The last time we went around there was an anti-Unicode faction who > argued that adding Unicode support was fine but making it the default > would inconvenience Japanese users. Unicode is the defacto international standard for unified script encodings. Discussing whether Unicode is good or bad is really beyond the scope of language design and should be dealt with in other more suitable forums, IMHO. > > ... > > Well, with -U on, Python will compile "" into u"", so you can > > already test Unicode compatibility today... last I tried, Python > > didn't even start up :-( > > I'm going to say again that I don't see that as a test of > Unicode-compatibility. It is a test of compatibility with our existing > Unicode object. If we simply allowed string objects to support higher > character numbers I *cannot see* how that could break existing code. It's a nice way of identifying problem locations in existing Python code. I don't understand your statement about allowing string objects to support "higher" ordinals... are you proposing to add a third character type ? > > ... > > We can use that knowledge to base future design upon. The problem > > with many stdlib modules is that they don't make a difference > > between text and binary data (and often can't, e.g. take sockets), > > so we'll have to figure out a way to differentiate between the > > two. We'll also need an easy-to-use binary data type -- as you > > mention in the PEP, we could take the old string implementation > > as basis and then perhaps turn u"" into "" and use b"" to mean > > what "" does now (string object). 
> > I agree that we need all of this but I strongly disagree that there is > any dependency relationship between improving the Unicode-awareness of > I/O routines (sockets and files) and allowing string objects to support > higher character numbers. I claim that allowing higher character numbers > in strings will not break socket objects. It might simply be the case > that for a while socket objects never create these higher charcters. > > Similarly, we could improve socket objects so that they have different > readtext/readbinary and writetext/writebinary without unifying the > string objects. There are lots of small changes we can make without > breaking anything. One I would like to see right now is a unification of > chr() and unichr(). This won't work: programs simply do not expect to get Unicode characters out of chr() and would break. OTOH, programs using unichr() don't expect 8bit-strings as output. Let's keep the two worlds well separated for a while and unify afterwards (this is much easier to do when everything's in place and well tested). > We are just making life harder for ourselves by walking further and > further down one path when "everyone agrees" that we are eventually > going to end up on another path. No. We are just sending off a pioneer team to try to find an alternative path. Once that path is found we can switch signs to have the mainstream use the new alternative path. > > ... It would be nice if we could avoid > > adding more conversion magic... > > We already have more "magic" in our conversions than we need. I don't > think I'm proposing any new conversions. Well, let's hope so :-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Tue Feb 6 18:27:10 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 10:27:10 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <3A8037A9.2E842800@lemburg.com> Message-ID: <3A8041FE.F506891F@ActiveState.com> "M.-A. Lemburg" wrote: > > ... > > Unicode is the defacto international standard for unified > script encodings. Discussing whether Unicode is good or bad is > really beyond the scope of language design and should be dealt > with in other more suitable forums, IMHO. We are in violent agreement. >... > > I don't understand your statement about allowing string objects > to support "higher" ordinals... are you proposing to add a third > character type ? Yes and no. I want to make a type with a superset of the functionality of strings and Unicode strings. > > Similarly, we could improve socket objects so that they have different > > readtext/readbinary and writetext/writebinary without unifying the > > string objects. There are lots of small changes we can make without > > breaking anything. Before we go on: do you agree that we could add fopen and readtext/readbinary on various I/O types without breaking anything? And that that we should do so? > > One I would like to see right now is a unification of > > chr() and unichr(). > > This won't work: programs simply do not expect to get Unicode > characters out of chr() and would break. 
Why would a program pass a large integer to chr() if it cannot handle the resulting wide string???? > OTOH, programs using > unichr() don't expect 8bit-strings as output. Where would an 8bit string break code that expected a Unicode string? The upward conversion is automatic and lossless! Having chr() and unichr() is like having a special function for adding integers versus longs. IMO it is madness. > Let's keep the two worlds well separated for a while and > unify afterwards (this is much easier to do when everything's > in place and well tested). No, the more we keep the worlds seperated the more code will be written that expects to deal with two separate types. We need to get people thinking in terms of strings of characters not strings of bytes and we need to do it as soon as possible. Paul Prescod From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 20:49:42 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 6 Feb 2001 21:49:42 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A800EE5.A8122B3C@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 06:49:09 -0800) References: <3A800EE5.A8122B3C@ActiveState.com> Message-ID: <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> Hi Paul, Interesting remarks. I comment only on those where I disagree. > 1. Python should have a single string type. I disagree. There should be a character string type and a byte string type, at least. I would agree that a single character string type is desirable. > type("") == type(chr(150)) == type(chr(1500)) == type(file.read()) I disagree. For the last one, much depends on what file is. If it is a byte-oriented file, reading from it should not return character strings. > 2. It should be easier and more efficient to encode and decode > information being sent to and retrieved from devices. I disagree. Easier, maybe; more efficient - I don't think Python is particular inefficient in encoding/decoding. > It is not possible to have a concept of "character" without having > a character set. After all, characters must be chosen from some > repertoire and there must be a mapping from characters to integers > (defined by ord). Sure it is possible. Different character sets (in your terminology) have common characters, which is a phenomenon that your definition cannot describe. Mathematically speaking, there is an unlimited domain CHAR (the set of all characters), and then a character set would map a subset of NAT (the set of all natural numbers, including zero) to a subset of CHAR. Then, a character is an element of CHAR. Depending on the character set, it has different associated numbers, though (or may not have an associated ordinal at all). > A character encoding is a mechanism for representing characters > in terms of bits. More generally, it is a mechanism for representing character sequences in terms of bit sequences. Otherwise, you can not cover the phenomenon that the encoding of a string is not the concatenation of the encodings of the individual characters in some encodings. Also, this term is often called "coded character set" (CCS). > A Python programmer does not need to know or care whether a long > integer is represented as twos complement, ones complement or > in terms of ASCII digits. They need to know if they want to explain the outcome of, say, hex(~1) (for that, they need the size of the internal representation at a minimum). In general, I agree. 
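Martin's point above that an encoding maps character *sequences* (not individual characters) to byte sequences can be seen with the existing utf-16 codec, which emits a byte-order mark once per stream -- a small sketch:

a  = u"a".encode("utf-16")
b  = u"b".encode("utf-16")
ab = u"ab".encode("utf-16")
print len(a), len(b), len(ab)   # 4 4 6 -- only one BOM in the combined string
print ab == a + b               # prints 0: not the concatenation of the parts
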
> Similarly a Python programmer does not need to know or care > how characters are represented in memory. We might even > change the representation over time to achieve higher > performance. Programmers need to know the character set, at a minimum. Since you were assuming that you can't have characters without character sets, I guess you've assumed that as implied. > Universal Character Set > > There is only one standardized international character set that > allows for mixed-language information. Not true. E.g. ISO 8859-5 allows both Russian and English text, ISO 8859-2 allows English, Polish, German, Slovakian, and a few others. ISO 2022 (and by reference all incorporated character sets) supports virtually all existing languages. > A popular subset of the Universal Character Set is called > Unicode. The most popular subset of Unicode is called the "Unicode > Basic Multilingual Plane (Unicode BMP)". Isn't the BMP the same as Unicode, as it is the BMP (i.e. group 0, plane 0) of ISO 10646? > Java > It is the author's belief this "running code" is evidence of > Unicode's practical applicability. At least in the case of Java, I disagree. It very much depends on the exact version of the JVM that you are using, but I had the following problems: - AWT would not find a font to display a specific character, although such a font was available. After changing JDK configuration files, AWT would not be able to display strings that mix languages. - JDK could not print a non-Latin-1 string to System.out; there was no way of telling it that it should use UTF-8 for output. (sounds familiar ?-) - While javac would accept non-ASCII letters in class names, the interpreter would refuse to load class files with "funny characters". Please note that all of these occured on the first attempt to use a certain feature which works "in theory". Since Java's Unicode support is considered as most advanced by many, I think there is still a long way to go. BTW, for dealing with GUI output, I believe that Tk's handling is most advanced. > As discussed before, Python's native character set happens to consist > of exactly 255 characters. If we increase the size of Python's > character set, no existing code would break and there would be no > cost in functionality. Sure. Code that treats character strings as if they are byte strings will break. > Once Python moves to that character set it will no longer be necessary > to have a distinction between "Unicode string" and "regular string." Right. The distinction will between "character string" and "byte string". > This means that Unicode literals and escape codes can also be > merged with ordinary literals and escape codes. unichr can be merged > with chr. Not sure. That means that there won't be byte string literals. It is particular worrying that you want to remove the way to get the numeric value of a byte in a byte string. > Two of the most common constructs in computer science are strings of > characters and strings of bytes. A string of bytes can be represented > as a string of characters between 0 and 255. Therefore the only > reason to have a distinction between Unicode strings and byte > strings is for implementation simplicity and performance purposes. > This distinction should only be made visible to the average Python > programmer in rare circumstances. Are you saying that byte strings are visible to the average programmer in rare circumstances only? Then I disagree; byte strings are extremely common, as they are what file.read returns. 
> Unfortunately, there is not one, single, dominant encoding. There are > at least a dozen popular ones including ASCII (which supports only > 0-127), ISO Latin 1 (which supports only 0-255), others in the ISO > "extended ASCII" family (which support different European scripts), > UTF-8 (used heavily in C programs and on Unix), UTF-16 (preferred by > Java and Windows), Shift-JIS (preferred in Japan) and so forth. This > means that the only safe way to read data from a file into Python > strings is to specify the encoding explicitly. Note how you are mixing character sets and encodings here. As you had defined earlier, a single character set (such as US-ASCII) can have multiply encodings (e.g. with checksum bit or without). > Python's current assumption is that each byte translates into a > character of the same ordinal. This is only true for "ISO Latin 1". I disagree. With your definition of character set, many character sets have the property that a single byte is sufficient to represent a single character (e.g. all of ISO 8859). You seem to assume that the current Python character set is Latin-1, which it is not. Instead, Python's character set is defined by the application and the operating system. > Any code that does I/O should be changed to require the user to > specify the encoding that the I/O should use. It is the opinion of > the author that there should be no default encoding at all. Not sure. IMO, the default should be to read and write byte strings. > Here is some Python code demonstrating a proposed API: > > fileobj = fopen("foo", "r", "ASCII") # only accepts values < 128 > fileobj2 = fopen("bar", "r", "ISO Latin 1") # byte-values "as is" > fileobj3 = fopen("baz", "r", "UTF-8") Sounds good. Note that the proper way to write this is fileobj = codecs.open("foo", "r", "ASCII") # etc > fileobj2.encoding = "UTF-16" # changed my mind! Why is that a requirement. In a normal stream, you cannot change the encoding in the middle - in particular not from Latin 1 single-byte to UTF-16. > For efficiency, it should also be possible to read raw bytes into > a memory buffer without doing any interpretation: > > moredata = fileobj2.readbytes(1024) Disagree. If a file is open for reading characters, reading bytes from the middle is not possible. If made possible, it won't be more efficient, as you have to keep track of the encoder's state. Instead, the right way to write this is fileobj2 = open("bar", "rb") moredata = fileobj2.read(1024) > It should be possible to create Python files in any of the common > encodings that are backwards compatible with ASCII. By "Python files", you mean source code, I assume? > #?encoding="UTF-8" > #?encoding="ISO-8859-1" The specific syntax may be debatable; I dislike semantics being put in comments. There should be first-class syntax for that. Agree on the principle approach. > Python files which use non-ASCII characters without defining an > encoding should be immediately deprecated and made illegal in some > future version of Python. Agree. > Python already has a rule that allows the automatic conversion > of characters up to 255 into their C equivalents. If it is a character (i.e. Unicode) string, it only converts 127 characters in that way. > Once the Python character type is expanded, characters outside > of that range should trigger an exception (just as converting a > large long integer to a C int triggers an exception). Agree; that is what it does today. 
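Both behaviours Martin refers to can be demonstrated under Python 2.0/2.1, where the default conversion of character strings stops at ASCII and an oversized long refuses to become an int -- a small sketch:

print str(u"abc")             # fine: every ordinal is below 128
try:
    str(u"caf\u00e9")         # the default ASCII conversion refuses 0xE9
except UnicodeError:
    print "default conversion only covers ASCII"
try:
    int(10L ** 20)
except OverflowError:
    print "long too large to convert to int"
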
> Some might claim it is inappropriate to presume that > the character-for- byte mapping is the correct "encoding" for > information passing from Python to C. Indeed, I would claim so. I could not phrase a rebuttal, though, because your understanding of the desired Python type system seems not to match mine. > Python's built-in modules should migrate from char to wchar_t (aka > Py_UNICODE) over time. That is, more and more functions should > support characters greater than 255 over time. Some certainly should. Others, which were designed for dealing with byte strings, should not. > The StringType and UnicodeType objects should be aliases for > the same object. All PyString_* and PyUnicode_* functions should > work with objects of this type. Disagree. There should be support for a byte string type. > Ordinary string literals should allow large character escape codes > and generate Unicode string objects. That is available today with the -U option. I'm -0 on disallowing byte string literals, as I don't consider them too important. > The format string "S" and the PyString_AsString functions should > accept Unicode values and convert them to character arrays > by converting each value to its equivalent byte-value. Values > greater than 255 should generate an exception. Disagree. Conversion should be automatic only up to 127; everything else gives questionable results. > fopen should be like Python's current open function except that > it should allow and require an encoding parameter. Disagree. This is codec.open. > In general, it should be possible to use byte arrays where-ever > it is possible to use strings. Byte arrays could be thought of > as a special kind of "limited but efficient" string. Arguably we > could go so far as to call them "byte strings" and reuse Python's > current string implementation. The primary differences would be > in their "repr", "type" and literal syntax. Agreed. > Appendix: Using Non-Unicode character sets > > Let's presume that a linguistics researcher objected to the > unification of Han characters in Unicode and wanted to invent a > character set that included separate characters for all Chinese, > Japanese and Korean character sets. With ISO 10646, he could easily do so in a private-use plane. Of course, implementations that only provide BMP support are somewhat handicapped here. > Python needs to support international characters. The "ASCII" of > internationalized characters is Unicode. Most other languages have > moved or are moving their basic character and string types to > support Unicode. Python should also. And indeed, Python does today. I don't see a problem *at all* with the structure of the Unicode support in Python 2.0. As initial experiences show, application *will* need to be modified to take Unicode into account; I doubt that any enhancements will change that. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 21:04:10 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 6 Feb 2001 22:04:10 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: (message from Toby Dickenson on Tue, 06 Feb 2001 16:54:22 +0000) References: <3A800EE5.A8122B3C@ActiveState.com> Message-ID: <200102062104.f16L4AY01228@mira.informatik.hu-berlin.de> > Its annoying (for me) that the discussion of this is happening on > python-dev, rather than the i18n-sig list. i18n-sig clearly seems to be the right place; I'm equally annoyed. 
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 21:16:38 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 6 Feb 2001 22:16:38 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A8041FE.F506891F@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 10:27:10 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <3A8037A9.2E842800@lemburg.com> <3A8041FE.F506891F@ActiveState.com> Message-ID: <200102062116.f16LGcE01306@mira.informatik.hu-berlin.de> > Before we go on: do you agree that we could add fopen and > readtext/readbinary on various I/O types without breaking anything? That's a trivial question: Simply adding the functions will likely not break anything, unless somebody else already had been using these names. > And that that we should do so? No. Your fopen is already available, and readtext/readbinary only work on a per-file basis, not on a per-read basis. > > This won't work: programs simply do not expect to get Unicode > > characters out of chr() and would break. > > Why would a program pass a large integer to chr() if it cannot handle > the resulting wide string???? It won't. What it might do is to interpret the result as a byte string, which would break depending on how exactly your new type system works. > No, the more we keep the worlds seperated the more code will be written > that expects to deal with two separate types. We need to get people > thinking in terms of strings of characters not strings of bytes and we > need to do it as soon as possible. For that, we need a patch first. Any volunteer attempting such a patch risks being ignored, thus wasting his time. E.g. I invented a Unicode-for-Python solution several years ago which was used rarely. Marc-Andre developed one which was integrated in Python 2.0; that is the one you want to tear down now. Why do yo think you will have more luck? In any case, I encourage you to try. I promise I will analyse your patch and find its weaknesses with respect to existing applications (I'm pretty sure there will be weaknesses). Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 21:00:59 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 6 Feb 2001 22:00:59 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A801E49.F8DF70E2@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 07:54:49 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> Message-ID: <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> > If we simply allowed string objects to support higher character > numbers I *cannot see* how that could break existing code. To take a specific example: What would you change about imp and py_compile.py? What is the type of imp.get_magic()? If character string, what about this fragment? import imp MAGIC = imp.get_magic() def wr_long(f, x): """Internal; write a 32-bit int to a file in little-endian order.""" f.write(chr( x & 0xff)) f.write(chr((x >> 8) & 0xff)) f.write(chr((x >> 16) & 0xff)) f.write(chr((x >> 24) & 0xff)) ... fc = open(cfile, 'wb') fc.write('\0\0\0\0') wr_long(fc, timestamp) fc.write(MAGIC) Would that continue to write the same file that the current version writes? 
> We are just making life harder for ourselves by walking further and > further down one path when "everyone agrees" that we are eventually > going to end up on another path. I think a problem of discussing on a theoretical level is that the impact of changes is not clear. You seem to claim that you want changes that have zero impact on existing programs. Can you provide a patch implementing these changes, so that others can experiment and find out whether their application would break? Regards, Martin From paulp@ActiveState.com Tue Feb 6 23:05:29 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 15:05:29 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> Message-ID: <3A808339.7B2BD5D6@ActiveState.com> "Martin v. Loewis" wrote: > > > If we simply allowed string objects to support higher character > > numbers I *cannot see* how that could break existing code. > > To take a specific example: What would you change about imp and > py_compile.py? What is the type of imp.get_magic()? If character > string, what about this fragment? > > ... > > Would that continue to write the same file that the current version > writes? Yes. Why wouldn't it? You haven't specified an encoding for the file write so it would default to what it does today. You aren't using any large characters so there is no need for multi-byte encoding. Below is some code that may further illuminate my idea. wr_long is basically your code but it shows that chr and unichr are interchangable by allowing you to pass in "func". magic is also passed in as a string or unicode string with no ill effects. I had to define a unicode() and oldstr() function to work around a bug in the way Python does default conversions between Unicode strings and ordinary strings. It should just map equivalent ordinals as my functions do. import imp def wr_long(f, x, func, magic): """Internal; write a 32-bit int to a file in little-endian order.""" f.write(func( x & 0xff)) f.write(func((x >> 8) & 0xff)) f.write(func((x >> 16) & 0xff)) f.write(func((x >> 24) & 0xff)) f.write('\0\0\0\0') f.write(oldstr(magic)) def unicode(string): return u"".join([unichr(ord(char)) for char in string]) def oldstr(string): return "".join([chr(ord(char)) for char in string]) wr_long(open("out1.txt","wb"), 5, chr, str(imp.get_magic())) wr_long(open("out2.txt","wb"), 5, chr, str(imp.get_magic())) wr_long(open("out3.txt","wb"), 5, unichr, unicode(imp.get_magic())) wr_long(open("out4.txt","wb"), 5, unichr, str(imp.get_magic())) assert( open("out1.txt").read() == open("out2.txt").read() == open("out3.txt").read() == open("out4.txt").read()) Paul Prescod From paulp@ActiveState.com Tue Feb 6 23:07:08 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 15:07:08 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <3A800EE5.A8122B3C@ActiveState.com> <200102062104.f16L4AY01228@mira.informatik.hu-berlin.de> Message-ID: <3A80839C.19C69C35@ActiveState.com> The sig is the right place to work out the details but I think that Guido needs to decide that unifying the string and unicode types is the right thing before we can get there (and hopefully before we spend too much energy arguing about details). "Martin v. 
Loewis" wrote: > > > Its annoying (for me) that the discussion of this is happening on > > python-dev, rather than the i18n-sig list. > > i18n-sig clearly seems to be the right place; I'm equally annoyed. > > Regards, > Martin > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig From paulp@ActiveState.com Tue Feb 6 23:21:38 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 15:21:38 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> Message-ID: <3A808702.5FF36669@ActiveState.com> Let me say one more thing. Unicode and string types are *already widely interoperable*. You run into problems: a) when you try to convert a character greater than 128. In my opinion this is just a poor design decision that can be easily reversed b) some code does an explicit check for types.StringType which of course is not compatible with types.UnicodeType. This can only be fixed by merging the features of types.StringType and types.UnicodeType so that they can be the same object. This is not as trivial as the other fix in terms of lines of code that must change but conceptually it doesn't seem complicated at all. I think a lot of Unicode interoperability problems would just go away if "a" was fixed... Paul Prescod From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 23:50:52 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 7 Feb 2001 00:50:52 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A808339.7B2BD5D6@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 15:05:29 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> Message-ID: <200102062350.f16Noqc02391@mira.informatik.hu-berlin.de> > Yes. Why wouldn't it? > > You haven't specified an encoding for the file write so it would default > to what it does today. You aren't using any large characters so there is > no need for multi-byte encoding. I'm certainly using characters > 128. In UTF-8, they would become multi-byte. I'm not certain whether this would cause a problem; you did not give all implementation details of your approach, so it is hard to say. For example, f.write would use the s# conversion (since the file was opened in binary). What exactly would that do? If your change would be to *just* widen the internal representation of characters, it would do PyString_AS_STRING/PyString_GET_SIZE, so it would return a pointer to the internal representation. As a result, writing the MAGIC would result in only two bytes of the magic being written, with intermediate \0 bytes; that would be wrong. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 23:30:23 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Wed, 7 Feb 2001 00:30:23 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A808339.7B2BD5D6@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 15:05:29 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> Message-ID: <200102062330.f16NUNX02359@mira.informatik.hu-berlin.de> From martin@loewis.home.cs.tu-berlin.de Tue Feb 6 23:54:47 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 7 Feb 2001 00:54:47 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A80839C.19C69C35@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 15:07:08 -0800) References: <3A800EE5.A8122B3C@ActiveState.com> <200102062104.f16L4AY01228@mira.informatik.hu-berlin.de> <3A80839C.19C69C35@ActiveState.com> Message-ID: <200102062354.f16NslG02393@mira.informatik.hu-berlin.de> > The sig is the right place to work out the details but I think that > Guido needs to decide that unifying the string and unicode types is > the right thing before we can get there (and hopefully before we > spend too much energy arguing about details). I think it must be exactly vice versa. An agreement "in principle" is worth nothing if it then turns out that an implementation is not feasible, or would have undesirable side effects. That is how PEPs work: you first work out all the details, get feedback from the community, and *then* can ask for BDFL pronouncement. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Feb 7 00:00:11 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 7 Feb 2001 01:00:11 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A808702.5FF36669@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 15:21:38 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> Message-ID: <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> > a) when you try to convert a character greater than 128. In my opinion > this is just a poor design decision that can be easily reversed Technically, you can easily convert expand it to 256; not that easily beyond. Then, people who put KOI8-R into their Python source code will complain why the strings come out incorrectly, even though they set their language to Russion, and even though it worked that way in earlier Python versions. Or, if they then tag their sources as KOI8-R, writing strings to a "plain" file will fail, as they have characters > 256 in the string. > I think a lot of Unicode interoperability problems would just go > away if "a" was fixed... No, that would be just open a new can of worms. Again, provide a specific patch, and I can tell you specific problems. 
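A minimal sketch of the failure mode described above (Python 2.0 semantics assumed; the KOI8-R byte values are only illustrative): bytes that are silently widened ordinal-for-ordinal round-trip unchanged only as long as nothing downstream re-encodes them.

    # Six KOI8-R bytes intended as Russian text, held in a plain byte string.
    koi8_bytes = '\xf0\xd2\xc9\xd7\xc5\xd4'
    # Ordinal-preserving widening, the kind of automatic conversion discussed
    # in this thread.
    widened = unicode(koi8_bytes, 'latin-1')
    # Writing back by ordinal reproduces the original bytes...
    print repr(widened.encode('latin-1'))
    # ...but any component that re-encodes the text (say, as UTF-8) silently
    # produces different bytes than the ones the programmer put in the source.
    print repr(widened.encode('utf-8'))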
Regards, Martin From paulp@ActiveState.com Wed Feb 7 00:07:47 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 16:07:47 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> <200102062350.f16Noqc02391@mira.informatik.hu-berlin.de> Message-ID: <3A8091D3.F45F666A@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > I'm certainly using characters > 128. In UTF-8, they would become > multi-byte. I'm not certain whether this would cause a problem; you > did not give all implementation details of your approach, so it is > hard to say. I think this is specified properly in the PEP but I know it is way too much learn in one day so I'm not blaming you. I'm just pointing out that it isn't as underspecified as it seems: Python already has a rule that allows the automatic conversion of characters up to 255 into their C equivalents. Once the Python character type is expanded, characters outside of that range should trigger an exception (just as converting a large long integer to a C int triggers an exception). > For example, f.write would use the s# conversion (since the file was > opened in binary). What exactly would that do? Answer above. > If your change would be to *just* widen the internal representation of > characters, it would do PyString_AS_STRING/PyString_GET_SIZE, so it > would return a pointer to the internal representation. Is it a requirement that PyString_AS_STRING return a pointer to the internal representation instead of a narrowed equivalent? Paul Prescod From paulp@ActiveState.com Wed Feb 7 00:09:28 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 16:09:28 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <3A800EE5.A8122B3C@ActiveState.com> <200102062104.f16L4AY01228@mira.informatik.hu-berlin.de> <3A80839C.19C69C35@ActiveState.com> <200102062354.f16NslG02393@mira.informatik.hu-berlin.de> Message-ID: <3A809238.2219FA2E@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > I think it must be exactly vice versa. An agreement "in principle" is > worth nothing if it then turns out that an implementation is not > feasible, or would have undesirable side effects. That is how PEPs > work: you first work out all the details, get feedback from the > community, and *then* can ask for BDFL pronouncement. If Guido is philosophically opposed to Unicode as some people were the last time we discussed it, then I do not have time to work out details and then later find out that the project was doomed from the start because of the philosophical issue. Paul Prescod From paulp@ActiveState.com Wed Feb 7 00:21:50 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 16:21:50 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> Message-ID: <3A80951E.DF725F03@ActiveState.com> "Martin v. Loewis" wrote: > > > a) when you try to convert a character greater than 128. 
In my opinion > > this is just a poor design decision that can be easily reversed > > Technically, you can easily convert expand it to 256; not that easily > beyond. Beyond that is like putting a long integer into a 32 bit integer slot. It's a TypeError. > Then, people who put KOI8-R into their Python source code will > complain why the strings come out incorrectly, even though they set > their language to Russion, and even though it worked that way in > earlier Python versions. I don't follow. If I have: a="abcXXXdef" XXX is a series of non-ASCII bytes. Those are mapped into Unicode characters with the same ordinals. Now you write them to a file. You presumably do not specify an encoding on the file write operation. So the characters get mapped back to bytes with the same ordinals. It all behaves as it did in Python 1.0 ... You can only introduce characters greater than 256 into strings explicitly and presumably legacy code does not do that because there was no way to do that! > > I think a lot of Unicode interoperability problems would just go > > away if "a" was fixed... > > No, that would be just open a new can of worms. > > Again, provide a specific patch, and I can tell you specific problems. It isn't the appropriate time to create such a core code patch. I'm trying to figure out our direction so that we can figure out what can be done in the short term. The only two things I can think of are merge chr/unichr (easy) and provide encoding-smart alternatives to open() and read() (also easy). The encoding-smart alternatives should also be documented as preferred replacements as soon as possible. Paul Prescod From paulp@ActiveState.com Wed Feb 7 01:12:43 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 06 Feb 2001 17:12:43 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> Message-ID: <3A80A10B.1E978B30@ActiveState.com> "Martin v. Loewis" wrote: > > ... > I disagree. There should be a character string type and a byte string > type, at least. I would agree that a single character string type is > desirable. It depends on whether we decide to talk about "byte strings" or "byte arrays". > > type("") == type(chr(150)) == type(chr(1500)) == type(file.read()) > > I disagree. For the last one, much depends on what file is. If it is a > byte-oriented file, reading from it should not return character > strings. I don't think that there should be such a thing as a byte-oriented file...but that's a pretty small detail. I think that the result of the read() function should be consistently a character string and not different from one type of file object to another...getting a byte array/string/thing should be a seperate method. > > 2. It should be easier and more efficient to encode and decode > > information being sent to and retrieved from devices. > > I disagree. Easier, maybe; more efficient - I don't think Python is > particular inefficient in encoding/decoding. Once I have a file object, I don't know of a way to read unicode from it without reading bytes and then decoding into another string...but I may just not know that there is a more efficient way. > Sure it is possible. Different character sets (in your terminology) > have common characters, which is a phenomenon that your definition > cannot describe. 
Mathematically speaking, there is an unlimited domain > CHAR (the set of all characters), CHAR is not a useful set in a computer science sense because if items from it are addressable or comparable then there exists an ord() function. Therefore there is a character set. If the items are not addressable or comparable then how would you make use of it? We could argue about the platonic truth embedded in the word "character" but I think that's a waste of time. > More generally, it is a mechanism for representing character sequences > in terms of bit sequences. Otherwise, you can not cover the phenomenon > that the encoding of a string is not the concatenation of the > encodings of the individual characters in some encodings. > > Also, this term is often called "coded character set" (CCS). Fair enough. > > Similarly a Python programmer does not need to know or care > > how characters are represented in memory. We might even > > change the representation over time to achieve higher > > performance. > > Programmers need to know the character set, at a minimum. Since you > were assuming that you can't have characters without character sets, I > guess you've assumed that as implied. The whole point of these two sections is that programmers should care alot about the character set and not at all about its in-memory representation. > > Universal Character Set > > > > There is only one standardized international character set that > > allows for mixed-language information. > > Not true. E.g. ISO 8859-5 allows both Russian and English text, > ISO 8859-2 allows English, Polish, German, Slovakian, and a few > others. If you want to use a definition of "international" that means "European" then I guess that's fair. But you don't say you've internationalized a computer program when you've added support for the Canadian dollar along with the American one. :) > ISO 2022 (and by reference all incorporated character sets) > supports virtually all existing languages. I do not believe that ISO 2022 is really considered a character set. > > A popular subset of the Universal Character Set is called > > Unicode. The most popular subset of Unicode is called the "Unicode > > Basic Multilingual Plane (Unicode BMP)". > > Isn't the BMP the same as Unicode, as it is the BMP (i.e. group 0, > plane 0) of ISO 10646? No, Unicode has space for 16 planes: UTF-16 extra planes (to be filled by Unicode 4 and ISO-10646-2) Non-Han Supplementary Plane 1: {U-00010000..U-0001FFFF} Etruscan: {U-00010200..U-00010227} Gothic: {U-00010230..U-0001024B} Klingon: {U-000123D0..U-000123F9} Western Musical Symbols: {U-0001D103..U-0001D1D7} Han Supplementary Plane 2: {U-00020000..U-0002FFFF} Reserved Planes 3..13: {U-00030000..U-000DFFFF} Plane 14: {U-000E0000..U-000EFFFF} Language Tag Characters: {U-000E0000..U-000E007F} Private Use Planes: {U-000F0000..U-0010FFFF} > > Java > > It is the author's belief this "running code" is evidence of > > Unicode's practical applicability. > > At least in the case of Java, I disagree. It very much depends on the > exact version of the JVM that you are using, but I had the following > problems: I'm not saying that any particular Unicode-using system is perfect. I'm saying that they work. I don't think that Java would work better if it used something other than Unicode. > Sure. Code that treats character strings as if they are byte strings > will break. We've discussed this further and I think I may yet convince you otherwise... 
> > This means that Unicode literals and escape codes can also be > > merged with ordinary literals and escape codes. unichr can be merged > > with chr. > > Not sure. That means that there won't be byte string literals. It is > particular worrying that you want to remove the way to get the numeric > value of a byte in a byte string. I don't recall suggesting any such thing! chr() of a byte string should return the byte value. chr() of a unicode string should return the character value. > Are you saying that byte strings are visible to the average programmer > in rare circumstances only? Then I disagree; byte strings are > extremely common, as they are what file.read returns. Not under my proposal. file.read returns a character string. Sometimes the character string contains characters between 0 and 255 and is indistinguishable from today's string type. Sometimes the file object knows that you want the data decoded and it returns large characters. > > Unfortunately, there is not one, single, dominant encoding. There are > > at least a dozen popular ones including ASCII (which supports only > > 0-127), ISO Latin 1 (which supports only 0-255), others in the ISO > > "extended ASCII" family (which support different European scripts), > > UTF-8 (used heavily in C programs and on Unix), UTF-16 (preferred by > > Java and Windows), Shift-JIS (preferred in Japan) and so forth. This > > means that the only safe way to read data from a file into Python > > strings is to specify the encoding explicitly. > > Note how you are mixing character sets and encodings here. As you had > defined earlier, a single character set (such as US-ASCII) can have > multiply encodings (e.g. with checksum bit or without). I believe that ASCII is both a character set and an encoding. If not, what is the name for the encoding we've been using prior to Unicode? > > Any code that does I/O should be changed to require the user to > > specify the encoding that the I/O should use. It is the opinion of > > the author that there should be no default encoding at all. > > Not sure. IMO, the default should be to read and write byte strings. The default for current Python code, yes. The default going forward? We could debate that. > Sounds good. Note that the proper way to write this is We need a built-in function that everyone uses as an alternative to the byte/string-ambiguous "open". > fileobj = codecs.open("foo", "r", "ASCII") > # etc > > > fileobj2.encoding = "UTF-16" # changed my mind! > > Why is that a requirement. In a normal stream, you cannot change the > encoding in the middle - in particular not from Latin 1 single-byte to > UTF-16. What is a "normal stream?" Python must be able to handle all streams, right? I can imagine all kinds of pickle-like or structured stream file formats that switch back and forth between binary information, strings and unicode. I'd rather not require our users to handle these in multiple passes. BTW, you only know the encoding of an XML file after you've read the first line... > Disagree. If a file is open for reading characters, reading bytes from > the middle is not possible. If made possible, it won't be more efficient, > as you have to keep track of the encoder's state. Instead, the right way > to write this is > > fileobj2 = open("bar", "rb") > moredata = fileobj2.read(1024) I disagree on many levels...but I'm willing to put off this argument. > ... > > #?encoding="UTF-8" > > #?encoding="ISO-8859-1" > > The specific syntax may be debatable; I dislike semantics being put in > comments. 
There should be first-class syntax for that. Agree on the > principle approach. We need a backwards-compatible syntax... > > Python already has a rule that allows the automatic conversion > > of characters up to 255 into their C equivalents. > > If it is a character (i.e. Unicode) string, it only converts 127 > characters in that way. Yes, this is an annoying difference. But I was talking about *Python strings* not Unicode strings. > > Ordinary string literals should allow large character escape codes > > and generate Unicode string objects. > > That is available today with the -U option. I'm -0 on disallowing byte > string literals, as I don't consider them too important. I don't know what you mean by disallowing byte string literals. If I type: a="abcdef" Python is ambiguous whether this is a character string literal or a byte string literal. I'm planning on interpreting it as a character string literal. That's just a definitional thing and it doesn't break anything or remove anything. It doesn't even hurt if you use escapes to embed nulls or other control characters. Unicode character equivalents exist for all of them. > > The format string "S" and the PyString_AsString functions should > > accept Unicode values and convert them to character arrays > > by converting each value to its equivalent byte-value. Values > > greater than 255 should generate an exception. > > Disagree. Conversion should be automatic only up to 127; everything > else gives questionable results. This is a fundamental disagreement that we will have to work through. What is "questionable" about interpreting a unicode 245 as a character 245? If you wanted UTF-8 you would have asked for UTF-8!!! > > fopen should be like Python's current open function except that > > it should allow and require an encoding parameter. > > Disagree. This is codec.open. code.open will never become popular. > > Python needs to support international characters. The "ASCII" of > > internationalized characters is Unicode. Most other languages have > > moved or are moving their basic character and string types to > > support Unicode. Python should also. > > And indeed, Python does today. I don't see a problem *at all* with the > structure of the Unicode support in Python 2.0. As initial experiences > show, application *will* need to be modified to take Unicode into > account; I doubt that any enhancements will change that. Let's say you are a Chinese TCL programmer. If you know the escape code for a Kanji character you put it in a string literal just as a Westerner would do. The same Chinese Python programmer must use a special syntax of string literal and the object he creates has a different type and lots and lots of trivial, otherwise language-agnostic code crashes because it tests for type("") when it could handle large character codes without a problem. I see this as a big problem... Paul Prescod From brian@tomigaya.shibuya.tokyo.jp Wed Feb 7 06:01:06 2001 From: brian@tomigaya.shibuya.tokyo.jp (Hooper Brian) Date: Wed, 7 Feb 2001 15:01:06 +0900 (JST) Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model Message-ID: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> Hi there, this is Brian Hooper from Japan, --- Paul Prescod wrote: > If Guido is philosophically opposed to Unicode as > some people were the > last time we discussed it, then I do not have time > to work out details > and then later find out that the project was doomed > from the start > because of the philosophical issue. 
As someone who is frequently using Python with Japanese from day to day, I'd just like to offer that I think that most Japanese users are not philosophically opposed to Unicode, they would just like support for Unicode to have as little an impact as possible on older pre-Unicode-support code. One fairly extended discussion on this list concerned how to allow for a different encoding default than UTF-8, since a lot of programs here are written to handle EUC and SJIS directly as byte-string literals. The best thing, at least from the point of view of supporting old code, would be to be able to continue to have Python continue to handle SJIS and EUC (which, in spite of Unicode support in Windows, etc., are still by far the dominant encodings for information interchange in Japan) without trying to help out by converting it into characters. If my input is a blob of binary data, then having the bytes of that data automatically grouped into two- or four- bytes per character, or automatically converted into Unicode, isn't so nice if what I actually wanted was the binary data as is. What about adding an optional encoding argument to the existing open(), allowing encoding to be passed to that, and using 'raw' as the default format (what it does now)? As one example of this, Java (unless you give the compiler an -encoding flag) assumes that string literals and file input is in Unicode, but for example in web programming, where almost all the clients are using SJIS or EUC, and the designers of the web sites are also using SJIS or EUC, none of the input is in Unicode. This is also kind of a pain with JSP where pages are compiled int servlets by the server, again in the "wrong" encoding. Unicode _support_ is already here, on many fronts, but compatibility is important, because the old encodings will take a long time to go away, I think. I agree that Unicode is where we want to go - being able to do things like cleanly slice double-byte strings without having to worry about breaking the encoding would be a refreshing change from the current state of things, and it would be nice to be able to have a useful string length measure also! I do however think that some things _will_ break in the process of getting there... the question is just how much will break, and when. In this sense, adding new functions like fopen() seems like a reasonable solution to me, since it doesn't change the way already existing constructs work. Sorry that this message is kind of a ramble, but I hope it adds to the discussion. Cheers, -Brian __________________________________________________ Do You Yahoo!? インスタントメッセージを送ろう! Yahoo!メッセンジャー http://messenger.yahoo.co.jp/ From martin@loewis.home.cs.tu-berlin.de Wed Feb 7 07:25:04 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Wed, 7 Feb 2001 08:25:04 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A8091D3.F45F666A@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 16:07:47 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> <200102062350.f16Noqc02391@mira.informatik.hu-berlin.de> <3A8091D3.F45F666A@ActiveState.com> Message-ID: <200102070725.f177P4X00905@mira.informatik.hu-berlin.de> > Python already has a rule that allows the automatic conversion > of characters up to 255 into their C equivalents. Once the Python > character type is expanded, characters outside of that range should > trigger an exception (just as converting a large long integer to a > C int triggers an exception). > > > For example, f.write would use the s# conversion (since the file was > > opened in binary). What exactly would that do? > > Answer above. So every s and s# conversion would trigger a copying of the string. How is that implemented? Currently, every Unicode object has a reference to a string object that is produced by converting to the default character set. Would it grow another reference to a string object that is carrying the Latin-1-conversion? > Is it a requirement that PyString_AS_STRING return a pointer to the > internal representation instead of a narrowed equivalent? Certainly. Applications expect to write to the resulting memory, and expect to change the underlying string; this is valid only if one had been passing NULL to PyString_FromStringAndSize. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Feb 7 07:32:53 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 7 Feb 2001 08:32:53 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A80951E.DF725F03@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 16:21:50 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> Message-ID: <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> > > Then, people who put KOI8-R into their Python source code will > > complain why the strings come out incorrectly, even though they set > > their language to Russion, and even though it worked that way in > > earlier Python versions. > > I don't follow. > > If I have: > > a="abcXXXdef" > > XXX is a series of non-ASCII bytes. Those are mapped into Unicode > characters with the same ordinals. Now you write them to a file. You > presumably do not specify an encoding on the file write operation. So > the characters get mapped back to bytes with the same ordinals. It all > behaves as it did in Python 1.0 ... They don't write them to a file. Instead, they print them in the IDLE terminal, or display them in a Tk or PythonWin window. Both support arbitrary many characters, and will treat the bytes as characters originating from Latin-1 (according to their ordinals). 
Or, they pass them as attributes in a DOM method, which, on write-back, will encode every string as UTF-8 (as that is the default encoding of XML). Then the characters will get changed, when they shouldn't. > You can only introduce characters greater than 256 into strings > explicitly and presumably legacy code does not do that because there > was no way to do that! Legacy code will pass them to applications that know to operate with the full Unicode character set, e.g. by applying encodings where necessary, or selecting proper fonts (which might include applying encodings). *That* is where it will break, and the library has no way of telling whether the strings where meant as byte strings (in an unspecified character set), or as Unicode character strings. > It isn't the appropriate time to create such a core code patch. I'm > trying to figure out our direction so that we can figure out what can be > done in the short term. The only two things I can think of are merge > chr/unichr (easy) and provide encoding-smart alternatives to open() and > read() (also easy). The encoding-smart alternatives should also be > documented as preferred replacements as soon as possible. I'm not sure they are preferred. They are if you know the encoding of your data sources. If you don't, you better be safe than sorry. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Feb 7 08:06:40 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 7 Feb 2001 09:06:40 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A80A10B.1E978B30@ActiveState.com> (message from Paul Prescod on Tue, 06 Feb 2001 17:12:43 -0800) References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> <3A80A10B.1E978B30@ActiveState.com> Message-ID: <200102070806.f1786eg01079@mira.informatik.hu-berlin.de> > Once I have a file object, I don't know of a way to read unicode from it > without reading bytes and then decoding into another string...but I may > just not know that there is a more efficient way. Just try reader = codecs.lookup("ISO-8859-2")[2] charfile = reader(file) There could be a convenience function, but that also is a detail. > CHAR is not a useful set in a computer science sense because if items > from it are addressable or comparable then there exists an ord() > function. This domain was for definition purposes only; I would not assume that items are addressable or comparable except for equality (i.e. they are unordered). > Therefore there is a character set. If the items are not > addressable or comparable then how would you make use of it? To represent a character in a computer, you need to have a character set; I certainly agree with that. I was just pointing out that the *same* character can exist in different character sets. > > > There is only one standardized international character set that > > > allows for mixed-language information. > > > > Not true. E.g. ISO 8859-5 allows both Russian and English text, > > ISO 8859-2 allows English, Polish, German, Slovakian, and a few > > others. > > If you want to use a definition of "international" that means "European" > then I guess that's fair. But you don't say you've internationalized a > computer program when you've added support for the Canadian dollar along > with the American one. :) My definition of "international standard" is "defined by an international organization", such as ISO. So ISO 8859 certainly qualifies. 
ISO 646 (aka ASCII) is also an international standard; it even allows for "national variants", but it does not allow mixed-language information. As for ISO 8859, it also supports Arabic and Hebrew, BTW. > > Isn't the BMP the same as Unicode, as it is the BMP (i.e. group 0, > > plane 0) of ISO 10646? > > No, Unicode has space for 16 planes: > > UTF-16 extra planes (to be filled by Unicode 4 and ISO-10646-2) Ok. Good that they consider that part of Unicode now; that was not always the case. > I don't recall suggesting any such thing! chr() of a byte string should > return the byte value. chr() of a unicode string should return the > character value. chr of a byte string? How exactly do I write this down? I.e. if I have chr(42), what do I get? > Not under my proposal. file.read returns a character string. Sometimes > the character string contains characters between 0 and 255 and is > indistinguishable from today's string type. Sometimes the file object > knows that you want the data decoded and it returns large characters. I guess we have to defer this until I see whether it is feasible (which I believe it is not - it was the mistake Sun made in the early JDKs). > I believe that ASCII is both a character set and an encoding. If not, > what is the name for the encoding we've been using prior to Unicode? For ASCII, only a single encoding is common today. I think there used to be other modes of operation, but nobody cared to give them names. > > Sounds good. Note that the proper way to write this is > > We need a built-in function that everyone uses as an alternative to the > byte/string-ambiguous "open". Why is that a requirement? > > fileobj = codecs.open("foo", "r", "ASCII") > > # etc > > > > > fileobj2.encoding = "UTF-16" # changed my mind! > > > > Why is that a requirement. In a normal stream, you cannot change the > > encoding in the middle - in particular not from Latin 1 single-byte to > > UTF-16. > > What is a "normal stream?" I meant the one returned from open(). > I can imagine all kinds of pickle-like or structured stream file > formats that switch back and forth between binary information, > strings and unicode. For example? If a format supports mixing binary and text information, it needs to specify what encoding to use for the text fragments, and it needs to specify how exactly conversion is performed (in case of stateful codecs). It is certainly the application's job to get this right; only the application knows how the format is supposed to work. > BTW, you only know the encoding of an XML file after you've read the > first line... Certainly. You don't know the encoding of a MIME message until you have seen the Content-Type and Content-Transfer-Encoding fields. > > The specific syntax may be debatable; I dislike semantics being put in > > comments. There should be first-class syntax for that. Agree on the > > principle approach. > > We need a backwards-compatible syntax... Why is that? The backwards-compatible way of writing funny bytes is to use \x escapes. > This is a fundamental disagreement that we will have to work through. > What is "questionable" about interpreting a unicode 245 as a character > 245? If you wanted UTF-8 you would have asked for UTF-8!!! Likewise, if you want Latin-1 you should ask for it. Explicit is better than implicit. > > Disagree. This is codec.open. > > code.open will never become popular. Why is that? > Let's say you are a Chinese TCL programmer. 
If you know the escape code > for a Kanji character you put it in a string literal just as a Westerner > would do. If, as a programmer, I have to use escape codes to put a character into my source, I consider this quite inconvenient. Instead, I'd like to use my keyboard to put in the characters I care about, and I'd like them to be printed in the way I recognize them. > The same Chinese Python programmer must use a special syntax of string > literal and the object he creates has a different type and lots and lots > of trivial That Chinese Python programmer should use his editor of choice, and put _() around strings that are meant as text (as opposed to strings that are protocol). At the beginning of the module, he should write def _(str):return unicode(str, "BIG-5") (assuming BIG-5 is what his editor produces). Not that inconvenient, and I doubt the same thing is easier in Tcl. > otherwise language-agnostic code crashes because it tests for > type("") when it could handle large character codes without a > problem. Yes, using type("") is a problem. I'd like to see a symbolic name StringTypes = [StringType, UnicodeType] in the types module. Regards, Martin From fredrik@pythonware.com Wed Feb 7 10:00:03 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 7 Feb 2001 11:00:03 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> Message-ID: <00cf01c090ec$c4eb7220$0900a8c0@SPIFF> martin wrote: > To take a specific example: What would you change about imp and > py_compile.py? What is the type of imp.get_magic()? If character > string, what about this fragment? > > import imp > MAGIC = imp.get_magic() > > def wr_long(f, x): > """Internal; write a 32-bit int to a file in little-endian order.""" > f.write(chr( x & 0xff)) > f.write(chr((x >> 8) & 0xff)) > f.write(chr((x >> 16) & 0xff)) > f.write(chr((x >> 24) & 0xff)) > ... > fc = open(cfile, 'wb') > fc.write('\0\0\0\0') > wr_long(fc, timestamp) > fc.write(MAGIC) > > Would that continue to write the same file that the current version > writes? yes (file opened in binary mode, no encoding, no code points above 255) Cheers /F From tdickenson@geminidataloggers.com Wed Feb 7 10:35:53 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Wed, 07 Feb 2001 10:35:53 +0000 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A8041FE.F506891F@ActiveState.com> References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <3A8037A9.2E842800@lemburg.com> <3A8041FE.F506891F@ActiveState.com> Message-ID: <1c728tobr3u4impgmih5nn6mmr5i00o2gg@4ax.com> On Tue, 06 Feb 2001 10:27:10 -0800, Paul Prescod wrote: >"M.-A. Lemburg" wrote: >>=20 >> ... >>=20 >> Unicode is the defacto international standard for unified >> script encodings. Discussing whether Unicode is good or bad is >> really beyond the scope of language design and should be dealt >> with in other more suitable forums, IMHO. > >We are in violent agreement. > >>... >>=20 >> I don't understand your statement about allowing string objects >> to support "higher" ordinals... are you proposing to add a third >> character type ? > >Yes and no. 
I want to make a type with a superset of the functionality >of strings and Unicode strings. > >> > Similarly, we could improve socket objects so that they have = different >> > readtext/readbinary and writetext/writebinary without unifying the >> > string objects. There are lots of small changes we can make without >> > breaking anything.=20 > >Before we go on: do you agree that we could add fopen and >readtext/readbinary on various I/O types without breaking anything? >And >that that we should do so? I dislike the idea of burdening the file object interface with separate functions for binary and text IO, and a way of changing the encoding. There are many other types/classes that support the file interface, and I think it is desirable to support text IO on all of them. The wrapper approach from the codecs module seems better, since it can be used to convert any byte file into a text file. Also consider a hypothetical new storage device that stores unicode natively: how should it implement readbytes? (however, an implicit 'import codecs.open as fopen' may make sense) >> > One I would like to see right now is a unification of >> > chr() and unichr(). >>=20 >> This won't work: programs simply do not expect to get Unicode >> characters out of chr() and would break.=20 > >Why would a program pass a large integer to chr() if it cannot handle >the resulting wide string???? > >> OTOH, programs using >> unichr() don't expect 8bit-strings as output. We can unify these two only if we change the default encoding from ASCII to latin1, otherwise: Python 2.0 (#6, Oct 6 2000, 15:49:48) [MSC 32 bit (Intel)] on win32 Type "copyright", "credits" or "license" for more information. >>> >>> u'\310'+unichr(200) u'\310\310' >>> u'\310'+chr(200) Traceback (most recent call last): File "", line 1, in ? UnicodeError: ASCII decoding error: ordinal not in range(128) The counter-argument from last time around was that this will do the wrong thing for anyone mixing unicode objects with plain strings containing non-latin1 content. This argument goes away once there is only one type used for storing text. Toby Dickenson tdickenson@geminidataloggers.com From tdickenson@geminidataloggers.com Wed Feb 7 11:03:18 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Wed, 07 Feb 2001 11:03:18 +0000 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> Message-ID: On Tue, 6 Feb 2001 21:49:42 +0100, "Martin v. Loewis" wrote: >Hi Paul, > >Interesting remarks. I comment only on those where I disagree. > >> 1. Python should have a single string type.=20 > >I disagree. There should be a character string type and a byte string >type, at least. I would agree that a single character string type is >desirable. There is already a large body of code that mixes text and binary data in the same type. If we have separate text/binary types, then we need to plan a transition period to allow code to distinguish between the two uses. >> Two of the most common constructs in computer science are strings = of >> characters and strings of bytes. A string of bytes can be = represented >> as a string of characters between 0 and 255. Therefore the only >> reason to have a distinction between Unicode strings and byte >> strings is for implementation simplicity and performance purposes. 
>> This distinction should only be made visible to the average Python >> programmer in rare circumstances. I disagree. Many programmers will be satisfied when they read a byte string from a text file, print it, and see "Hello World". Much better that we distinguish the two types, so that it looks like binary data when printed. Toby Dickenson tdickenson@geminidataloggers.com From mal@lemburg.com Wed Feb 7 11:47:53 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 07 Feb 2001 12:47:53 +0100 Subject: [I18n-sig] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <3A8037A9.2E842800@lemburg.com> <3A8041FE.F506891F@ActiveState.com> Message-ID: <3A8135E9.E360A267@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > >... > > > > I don't understand your statement about allowing string objects > > to support "higher" ordinals... are you proposing to add a third > > character type ? > > Yes and no. I want to make a type with a superset of the functionality > of strings and Unicode strings. Hmm and I was under the impression that we try to replace strings with Unicode and then perhaps reuse the 8-bit string implementation for binary data. > > > Similarly, we could improve socket objects so that they have different > > > readtext/readbinary and writetext/writebinary without unifying the > > > string objects. There are lots of small changes we can make without > > > breaking anything. > > Before we go on: do you agree that we could add fopen and > readtext/readbinary on various I/O types without breaking anything? And > that that we should do so? Sure. We can always add new things, then deprecate the old stuff and slowly move to the new methods as standard. E.g. adding .readtext() and .writetext() would be a good start in that direction since those names make it clear that the code will deal with text rather than binary data. > > > One I would like to see right now is a unification of > > > chr() and unichr(). > > > > This won't work: programs simply do not expect to get Unicode > > characters out of chr() and would break. > > Why would a program pass a large integer to chr() if it cannot handle > the resulting wide string???? As result of an error. Ok, some other part in the program will then probably break, but this hides the original error location. > > OTOH, programs using > > unichr() don't expect 8bit-strings as output. > > Where would an 8bit string break code that expected a Unicode string? > The upward conversion is automatic and lossless! But why would you want to do upward conversion on single characters ? That would only cost performance. > Having chr() and unichr() is like having a special function for adding > integers versus longs. IMO it is madness. No. chr() is a constructor for a single 8-bit character, unichr() is the corresponding constructor for a single Unicode character. This is much like the difference between int() and long(). > > Let's keep the two worlds well separated for a while and > > unify afterwards (this is much easier to do when everything's > > in place and well tested). > > No, the more we keep the worlds seperated the more code will be written > that expects to deal with two separate types. We need to get people > thinking in terms of strings of characters not strings of bytes and we > need to do it as soon as possible. 
Ok, then let me put it this way: let's first make people aware that there is an important difference between text data and binary data. Once this is being accepted, we can move on to thinking about making Unicode the standard for text data. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed Feb 7 12:58:32 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 07 Feb 2001 13:58:32 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> Message-ID: <3A814678.2F245D14@lemburg.com> Hooper Brian wrote: > ... > What about adding an > optional encoding argument to the existing open(), > allowing encoding to be passed to that, and using 'raw' as > the default format (what it does now)? This is what codecs.open() already provides. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From uche.ogbuji@fourthought.com Wed Feb 7 19:21:28 2001 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Wed, 07 Feb 2001 12:21:28 -0700 Subject: [I18n-sig] Re: [4suite] 4Suite 0.10.2 alpha 1 In-Reply-To: Message from "Martin v. Loewis" of "Sun, 04 Feb 2001 16:13:21 +0100." <200102041513.f14FDLZ01273@mira.informatik.hu-berlin.de> Message-ID: <200102071921.MAA07019@localhost.localdomain> > > Please test the new internationalization: French and German translations > > hve been added courtesy Alexandre and Martin. > > This is indeed causing problems for me. Invoking 4xslt gives [snip] O'oer. I'm glad I happened to read i18n-sig before releaseing 0.10.2. My procmail recipes were lame and dumped all three copies of your message here. > The problem is two-fold: For one thing, there is no German xpath > message catalog. However, it shouldn't fail if LANG is set to an > unsupported language, so you should catch IOError also. OK. By the way, did you have any comments on the update procedure I suggested to you and Alexandre? I'd like to get the German Translations of XPath (and ODS, etc.) in before release if possible. Meanwhile, I'll add the IOError to the exceptions list. Thanks. -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +1 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From paulp@ActiveState.com Wed Feb 7 19:44:05 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 11:44:05 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> <3A814678.2F245D14@lemburg.com> Message-ID: <3A81A585.F0771269@ActiveState.com> "M.-A. Lemburg" wrote: > > Hooper Brian wrote: > > ... > > What about adding an > > optional encoding argument to the existing open(), > > allowing encoding to be passed to that, and using 'raw' as > > the default format (what it does now)? > > This is what codecs.open() already provides. There is a reason that Brian and I independently invented the same idea. It's because Joe Programmer without a degree in rocket science is going to expect it to work that way. 
Joe Programmer does not know what a codec is, will not consider importing the codecs module and will have no idea what to do with the object once they've got there hands on it. It's a million times easier to tell a programmer: "If you expect to read ASCII data add a third argument with the string 'ASCII', if you know about encodings choose another one. If you know what raw binary data is, and want to read it, here's another function." One important part of Python philosophy is making it easy to do the right thing and a little bit more work to do the wrong thing. Right now we have the exact opposite situation. We make it incredibly convenient for programmers to read data that they may consider strings or may consider binary data into the same string type and then we complain: "Oh geez, we can't do anything intelligent with strings because we don't know whether the user intended them to be really strings or binary data." Paul Prescod From paulp@ActiveState.com Wed Feb 7 19:51:51 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 11:51:51 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> Message-ID: <3A81A757.78B3F527@ActiveState.com> Hooper Brian wrote: > > ... > > As someone who is frequently using Python with Japanese > from day to day, I'd just like to offer that I think that > most Japanese users are not philosophically opposed to > Unicode, they would just like support for Unicode to have > as little an impact as possible on older > pre-Unicode-support code. One fairly extended discussion > on this list concerned how to allow for a different > encoding default than UTF-8, since a lot of programs here > are written to handle EUC and SJIS directly as byte-string > literals. In my opinion there should be *no* encoding default. New code should always specify an encoding. Old code should continue to work the same. > ... What about adding an > optional encoding argument to the existing open(), > allowing encoding to be passed to that, and using 'raw' as > the default format (what it does now)? I'm not content to have a "default" in the long term. Users should just choose their encodings. Why would your Japanese user prefer to work with the raw bytes of their Shift-JIS instead of having it decoded into Unicode characters? Requiring Asians hacking bytes instead of characters is what we are trying to avoid! Shift-JIS and Unicode are not at odds. Shift-JIS is a great *encoding* for Unicode (the abstract character set). Shift-JIS is what should be on the disk. Unicode is what you should be working with in memory. Of course there will always be some corner cases where this is not the case but that should be the general model... Paul Prescod From paulp@ActiveState.com Wed Feb 7 19:59:35 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 11:59:35 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> <200102062350.f16Noqc02391@mira.informatik.hu-berlin.de> <3A8091D3.F45F666A@ActiveState.com> <200102070725.f177P4X00905@mira.informatik.hu-berlin.de> Message-ID: <3A81A927.FAE4303D@ActiveState.com> "Martin v. Loewis" wrote: > > ... 
> > So every s and s# conversion would trigger a copying of the > string. How is that implemented? Currently, every Unicode object has a > reference to a string object that is produced by converting to the > default character set. Would it grow another reference to a string > object that is carrying the Latin-1-conversion? I'm not clear on the status of the concept of "default charater set." First, I think you mean "default character encoding". Second, I thought that that idea was removed from user-view at least, wasn't it? I was thinking that we would use that slot to hold the char->ord->char conversion (which you can interpret as Latin-1 or not depending on your philosophy). > Certainly. Applications expect to write to the resulting memory, and > expect to change the underlying string; this is valid only if one had > been passing NULL to PyString_FromStringAndSize. The documentation says that the PyString_AsString and PyString_AS_STRING buffers must never be modified. I forgot that the "real" protocol is that that buffer can be modified. We'll need to copy its contents back to the Unicode string before the next operation that uses the Unicode value. Not rocket science but somewhat tedious. Paul Prescod From paulp@ActiveState.com Wed Feb 7 20:13:48 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 12:13:48 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> Message-ID: <3A81AC7C.3FFE73E5@ActiveState.com> "Martin v. Loewis" wrote: > > > ... > > XXX is a series of non-ASCII bytes. Those are mapped into Unicode > > characters with the same ordinals. Now you write them to a file. You > > presumably do not specify an encoding on the file write operation. So > > the characters get mapped back to bytes with the same ordinals. It all > > behaves as it did in Python 1.0 ... > > They don't write them to a file. Instead, they print them in the IDLE > terminal, or display them in a Tk or PythonWin window. Both support > arbitrary many characters, and will treat the bytes as characters > originating from Latin-1 (according to their ordinals). I'm lost here. Let's say I'm using Python 1.5. I have some KOI8-R data in a string literal. PythonWin and Tk expect Unicode. How could they display the characters correctly? > Or, they pass them as attributes in a DOM method, which, on > write-back, will encode every string as UTF-8 (as that is the default > encoding of XML). Then the characters will get changed, when they > shouldn't. What do you think *should* happen? These are the only choices I can think of: 1. DOM encodes it as UTF-8 2. DOM blindly passes it through and creates illegal XML 3. (correct) User explicitly decodes data into Unicode charset. 3) is unchanged today and under my proposal. You've got some bytes. Python doesn't know what you mean. The only way to let it know what you mean is to decode it. >... > Legacy code will pass them to applications that know to operate with > the full Unicode character set, e.g. by applying encodings where > necessary, or selecting proper fonts (which might include applying > encodings). 
*That* is where it will break, and the library has no way > of telling whether the strings where meant as byte strings (in an > unspecified character set), or as Unicode character strings. The only sane thing to do when you don't know is to pass the characters as-is, char->ord->char. > > It isn't the appropriate time to create such a core code patch. I'm > > trying to figure out our direction so that we can figure out what can be > > done in the short term. The only two things I can think of are merge > > chr/unichr (easy) and provide encoding-smart alternatives to open() and > > read() (also easy). The encoding-smart alternatives should also be > > documented as preferred replacements as soon as possible. > > I'm not sure they are preferred. They are if you know the encoding of > your data sources. If you don't, you better be safe than sorry. If you don't know the encoding of your data sources then you should say that explicitly in code rather than using the same functions as people who *do* know what their encoding is. Explicit is better than implicit, right? Our current default is totally implicit. Paul Prescod From paulp@ActiveState.com Wed Feb 7 20:35:51 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 12:35:51 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> <3A80A10B.1E978B30@ActiveState.com> <200102070806.f1786eg01079@mira.informatik.hu-berlin.de> Message-ID: <3A81B1A7.4E1D022C@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > Just try > > reader = codecs.lookup("ISO-8859-2")[2] > charfile = reader(file) > > There could be a convenience function, but that also is a detail. Usability is not a detail in this particular case. We are trying to change people's behavior and help them make more robust code. >... > My definition of "international standard" is "defined by an > international organization", such as ISO. So ISO 8859 certainly > qualifies. ISO 646 (aka ASCII) is also an international standard; it > even allows for "national variants", but it does not allow > mixed-language information. As for ISO 8859, it also supports Arabic > and Hebrew, BTW. That's fine. I'll change the document to be more explicit. Would you agree that: "Unicode is the only *character set* that supports *all of the world's major written languages.*" > ... > > I don't recall suggesting any such thing! chr() of a byte string should > > return the byte value. chr() of a unicode string should return the > > character value. > > chr of a byte string? How exactly do I write this down? I.e. if I have > chr(42), what do I get? Sorry, I meant ord. ord of a byte string (or byte array) should return the byte value. Ord of a character string should return the character value. > > Not under my proposal. file.read returns a character string. Sometimes > > the character string contains characters between 0 and 255 and is > > indistinguishable from today's string type. Sometimes the file object > > knows that you want the data decoded and it returns large characters. > > I guess we have to defer this until I see whether it is feasible > (which I believe it is not - it was the mistake Sun made in the early > JDKs). What was the mistake? > > I can imagine all kinds of pickle-like or structured stream file > > formats that switch back and forth between binary information, > > strings and unicode. > > For example? 
If a format supports mixing binary and text information, > it needs to specify what encoding to use for the text fragments, and > it needs to specify how exactly conversion is performed (in case of > stateful codecs). It is certainly the application's job to get this > right; only the application knows how the format is supposed to work. You and I agree that streams can change encoding mid-stream. You probably think that should be handled by passing the stream to various codecs as you read (or by doing double-buffer reads). I think that it should be possible right in the read method. But I don't care enough to argue about it. > > > The specific syntax may be debatable; I dislike semantics being put in > > > comments. There should be first-class syntax for that. Agree on the > > > principle approach. > > > > We need a backwards-compatible syntax... > > Why is that? The backwards-compatible way of writing funny bytes is to use \x escapes. Maybe we don't need a backards-compatible syntax after all. I haven't thought through all of those issues. > > This is a fundamental disagreement that we will have to work through. > > What is "questionable" about interpreting a unicode 245 as a character > > 245? If you wanted UTF-8 you would have asked for UTF-8!!! > > Likewise, if you want Latin-1 you should ask for it. Explicit is > better than implicit. It's funny how we switch back and forth. If I say that Python reads byte 245 into character 245 and thus uses Latin 1 as its default encoding I'm told I'm wrong. Python has no native encoding. If I claim that in passing data to C we should treat character 245 as the C "char" with the value 245 you tell me that I'm proposing Latin 1 as the default encoding. Python has a concept of character that extends from 0 to 255. C has a concept of character that extends from 0 to 255. There is no issue of "encoding" as long as you stay within those ranges. This is *exactly* like the int/long int situation. Once you get out of these ranges you switch the type in C to wchar_t and you are off to the races. If you can't change the C code then that means you work around it from the Python side -- you UTF-8 encode it before passing it to the C code. > ... > That Chinese Python programmer should use his editor of choice, and > put _() around strings that are meant as text (as opposed to strings > that are protocol). I don't know what you mean by "protocol" here. But nevertheless, you are saying that the Chinese programmer must do more than the English programmer does and I consider that a problem. > Yes, using type("") is a problem. I'd like to see a symbolic name > > StringTypes = [StringType, UnicodeType] > > in the types module. That doesn't help to reform the mass of code out there. Paul Prescod From paulp@ActiveState.com Wed Feb 7 20:38:35 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 12:38:35 -0800 Subject: [I18n-sig] Re: [Python-Dev] unichr References: Message-ID: <3A81B24B.6AE348A9@ActiveState.com> Ka-Ping Yee wrote: > > ... > > At the moment, since the default encoding is ASCII, something like > > u"abc" + chr(200) > > would cause an exception because 200 is outside of the ASCII range. Yes, this is another mistake in Python's current handling of strings. there is absolutely nothing special about the 128-255 range of characters. We shouldn't start throwing exceptions until we get to 256. 
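(A quick illustration of the behaviour under discussion, assuming a stock Python 2.0/2.1 interpreter with the ASCII default left in place; the explicit decode is the workaround available today:

>>> u"abc" + chr(200)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)
>>> u"abc" + unicode(chr(200), "latin-1")
u'abc\xc8'

Under the change argued for here, the first expression would simply return the second result.)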
Paul prescod From paulp@ActiveState.com Wed Feb 7 22:53:53 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 14:53:53 -0800 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <3A8037A9.2E842800@lemburg.com> <3A8041FE.F506891F@ActiveState.com> <1c728tobr3u4impgmih5nn6mmr5i00o2gg@4ax.com> Message-ID: <3A81D201.DC88CDC0@ActiveState.com> Toby Dickenson wrote: > > I dislike the idea of burdening the file object interface with > separate functions for binary and text IO, and a way of changing the > encoding. There are many other types/classes that support the file > interface, and I think it is desirable to support text IO on all of > them. It is not burdensome to change each of them over. It's probably about 10 lines of code each. > The wrapper approach from the codecs module seems better, since it can > be used to convert any byte file into a text file. The wrapper approach is not user friendly and users will not make use of it unless they are already i18n experts. My goal is to nudge people toward thinking about i18n. > Also consider a hypothetical new storage device that stores unicode > natively: how should it implement readbytes? It could simply choose not to. > We can unify these two only if we change the default encoding from > ASCII to latin1, otherwise: I prefer not to think of it as a "default encoding of Latin1" and more as "doing the obvious thing." C has a character 245. Python has a character 245. Only someone who knows too much would expect anything other than an obvious mapping. > The counter-argument from last time around was that this will do the > wrong thing for anyone mixing unicode objects with plain strings > containing non-latin1 content. This argument goes away once there is > only one type used for storing text. That's where I'm trying to get to but I'm trying to minimize the amount of cruft added to the language between here and there. Paul Prescod From andy@reportlab.com Wed Feb 7 23:06:12 2001 From: andy@reportlab.com (Andy Robinson) Date: Wed, 7 Feb 2001 23:06:12 -0000 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A801E49.F8DF70E2@ActiveState.com> Message-ID: > The last time we went around there was an anti-Unicode faction who > argued that adding Unicode support was fine but making it > the default would inconvenience Japanese users. Whoops, I nearly missed the biggest debate of the year! I guess the faction was Brian and I, and our concerns were misunderstood. We can lay this to rest forever now as the current implementation and forward direction incorporate everything I originally hoped for: (1) Frequently you need to work with byte arrays, but need a rich bunch of string-like routines - search and replace, regex etc. This applies both to non-natural-language data and also to the special case of corrupt native encodings that need repair. We loosely defined the 'string interface' in UserString, so that other people could define string-like types if they wished and so that users can expect to find certain methods and operations in both Unicode and Byte Array types. I'd be really happy one day to explicitly type x= ByteArray('some raw data') as long as I had my old friends split, join, find etc. (2) Japanese projects often need small extensions to codecs to deal with user-defined characters. 
Java and VB give you some canned codecs but no way to extend them. All the Python asian codec drafts involve 'open' code you can hack and use simple dictionaries for mapping tables; so it will be really easy to roll your own "Shift-JIS-plus" with 20 extra characters mapping to a private use area. This will be a huge win over other languages. (3) The Unicode conversion was based on a more general notion of 'stream conversion filters' which work with bytes. This leaves the door open to writing, for example, a direct Shift-JIS-to-EUC filter which adds nothing in the case of clean data but is much more robust in the case of user-defined characters or which can handle cleanup of misencoded data. We could also write image manipulation or crypto codecs. Some of us hope to provide general machinery for fast handling of byte-stream-filters which could be useful in image processing and crypto as well as encodings. This might need an extended or different lookup function (after all, neither end of the filter need be Unicode) but could be cleanly layered on top of the codec mechanism we have built in. (4) I agree 100% on being explicit whenever you do I/O or conversion and on generally using Unicode characters where possible. Defaults are evil. But we needed a compatibility route to get there. Guido has said that long term there will be Unicode strings and Byte Arrays. That's the time to require arguments to open(). > Similarly, we could improve socket objects so that they > have different > readtext/readbinary and writetext/writebinary without unifying the > string objects. There are lots of small changes we can make without > breaking anything. One I would like to see right now is a > unification of > chr() and unichr(). Here's a thought. How about BinaryFile/BinarySocket/ByteArray which do not need an encoding, and File/Socket/String which require explicit encodings on opeening. We keep broad parity between their methods. That seems more straightforward to me than having text/binary methods, and also provides a cleaner upgrade path for existing code. - Andy From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 00:22:50 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 01:22:50 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A81A757.78B3F527@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 11:51:51 -0800) References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> <3A81A757.78B3F527@ActiveState.com> Message-ID: <200102080022.f180Mo101584@mira.informatik.hu-berlin.de> > In my opinion there should be *no* encoding default. New code should > always specify an encoding. Old code should continue to work the same. However, matter-of-factually, you propose that ISO-8859-1 is the default encoding, as this is the encoding that is used when converting character strings to char* in the C API. I'd certainly call it a default. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 00:16:34 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 01:16:34 +0100 Subject: [I18n-sig] Re: [4suite] 4Suite 0.10.2 alpha 1 In-Reply-To: <200102071921.MAA07019@localhost.localdomain> (message from Uche Ogbuji on Wed, 07 Feb 2001 12:21:28 -0700) References: <200102071921.MAA07019@localhost.localdomain> Message-ID: <200102080016.f180GYD01555@mira.informatik.hu-berlin.de> > OK. By the way, did you have any comments on the update procedure I > suggested to you and Alexandre? 
I'd like to get the German > Translations of XPath (and ODS, etc.) in before release if possible. I don't know what the proposal exactly was (*). Here's how updates are typically done in the Linux Internationalization Project: - each version of the .pot (**) file has a unique identification (e.g. 0.10.1a). - each translator indicates which version of the pot file his translation corresponds to. - once the message catalog changes, the *full* .pot is distributed to translators. - each translator uses GNU msgmerge to carry-over old translations into the new catalog, and then updates the catalog (again indicating which version this is a translation of) Using both unique identifications and msgmerge allows for quite automatic processing, while at the same time giving good consistency checks and flexible analysis of the changes. Automation goes as far that the Robot produces the merged catalogs, but that is not a requirement for me. Regards, Martin (*) I think you suggested to send diffs; that would be troublesome. (**) What you call en_US.po really is the .pot file, as it is the output of the extractor. It would become a .po file if a translator checked it for proper application of US-English spelling etc. From martin@loewis.home.cs.tu-berlin.de Wed Feb 7 23:59:37 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 00:59:37 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: (message from Toby Dickenson on Wed, 07 Feb 2001 11:03:18 +0000) References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> Message-ID: <200102072359.f17NxbL01137@mira.informatik.hu-berlin.de> > >> 1. Python should have a single string type. > > > >I disagree. There should be a character string type and a byte string > >type, at least. I would agree that a single character string type is > >desirable. > > There is already a large body of code that mixes text and binary data > in the same type. If we have separate text/binary types, then we need > to plan a transition period to allow code to distinguish between the > two uses. I think the current Unicode implementation has this property: Unicode is the type for representing character strings; the string type the one for representing byte strings. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 00:27:29 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 01:27:29 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A81A927.FAE4303D@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 11:59:35 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> <200102062350.f16Noqc02391@mira.informatik.hu-berlin.de> <3A8091D3.F45F666A@ActiveState.com> <200102070725.f177P4X00905@mira.informatik.hu-berlin.de> <3A81A927.FAE4303D@ActiveState.com> Message-ID: <200102080027.f180RTl01586@mira.informatik.hu-berlin.de> > I'm not clear on the status of the concept of "default charater set." > First, I think you mean "default character encoding". Both encoding and character set, yes. I disagree with the notion that any encoding is a Unicode encoding, since not all encodings can represent all of Unicode; nor where they originally designed to encode Unicode. 
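A small example of that distinction, as a hypothetical interpreter session (Python 2.0/2.1 era): ISO-8859-2 can represent U+013D but ASCII cannot, so neither can serve as a general-purpose encoding of Unicode:

>>> u"\u013d".encode("iso-8859-2")
'\xa5'
>>> u"\u013d".encode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)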
> Second, I thought that that idea was removed from user-view at > least, wasn't it? Yes, unless you modify sitecustomize.py. > I was thinking that we would use that slot to hold the > char->ord->char conversion (which you can interpret as Latin-1 or > not depending on your philosophy). I would interpret it that way. What do you do about t# conversions, then? > The documentation says that the PyString_AsString and PyString_AS_STRING > buffers must never be modified. I forgot that the "real" protocol is > that that buffer can be modified. We'll need to copy its contents back > to the Unicode string before the next operation that uses the Unicode > value. Not rocket science but somewhat tedious. This scheme is easy to break; the application could hold onto the pointer and start using the object already. It remains to be seen whether existing code would break; this I can only speculate about as I don't know the exact scheme that you have in mind. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 00:37:56 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 01:37:56 +0100 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A81AC7C.3FFE73E5@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 12:13:48 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> Message-ID: <200102080037.f180bul01609@mira.informatik.hu-berlin.de> > > They don't write them to a file. Instead, they print them in the IDLE > > terminal, or display them in a Tk or PythonWin window. Both support > > arbitrary many characters, and will treat the bytes as characters > > originating from Latin-1 (according to their ordinals). > > I'm lost here. Let's say I'm using Python 1.5. I have some KOI8-R data > in a string literal. PythonWin and Tk expect Unicode. How could they > display the characters correctly? No, PythonWin and Tk both tell apart Unicode and byte strings (although Tk uses quite a funny algorithm to do so). If they see a byte string, they convert it using the platform encoding (which is user-settable on both Windows and Unix) to a Unicode string, and display that. > > Or, they pass them as attributes in a DOM method, which, on > > write-back, will encode every string as UTF-8 (as that is the default > > encoding of XML). Then the characters will get changed, when they > > shouldn't. > > What do you think *should* happen? These are the only choices I can > think of: > > 1. DOM encodes it as UTF-8 > 2. DOM blindly passes it through and creates illegal XML > 3. (correct) User explicitly decodes data into Unicode charset. What users expect to happen is 2; blindly pass-through. They think they can get it right; given enough control, this is feasible. It was even common practice in the absence of Unicode objects, so a lot of code depends on libraries passing things through as-is. > The only sane thing to do when you don't know is to pass the characters > as-is, char->ord->char. So libraries need a way of telling for sure. 
With Python 2.0, they can look at the type() and tell that something is really meant as a character string; otherwise, I agree, they have to pass through. Under your proposal, this strategy will fail: libraries cannot tell for sure anymore that something is really meant as a character string. > > > The encoding-smart alternatives should also be > > > documented as preferred replacements as soon as possible. > > > > I'm not sure they are preferred. They are if you know the encoding of > > your data sources. If you don't, you better be safe than sorry. > > If you don't know the encoding of your data sources then you should say > that explicitly in code rather than using the same functions as people > who *do* know what their encoding is. Explicit is better than implicit, > right? Our current default is totally implicit. No, it's not. The current default is: always produce byte strings. In many applications, people certainly *should* use character strings, but they have to change their code for that. Telling everybody to use fopen for everything is wrong; telling them to use codecs.open for character streams is right. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 00:21:00 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 01:21:00 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A81A585.F0771269@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 11:44:05 -0800) References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> <3A814678.2F245D14@lemburg.com> <3A81A585.F0771269@ActiveState.com> Message-ID: <200102080021.f180L0201582@mira.informatik.hu-berlin.de> > There is a reason that Brian and I independently invented the same idea. > It's because Joe Programmer without a degree in rocket science is going > to expect it to work that way. Joe Programmer does not know what a codec > is, will not consider importing the codecs module and will have no idea > what to do with the object once they've got there hands on it. > > It's a million times easier to tell a programmer: "If you expect to read > ASCII data add a third argument with the string 'ASCII', if you know > about encodings choose another one. If you know what raw binary data is, > and want to read it, here's another function." Of course, if Joe Programmer would suddenly be confronted with all his open calls failing, he'd hate the new release, and would start to flame comp.lang.python. If he guesses that there is some issue with ASCII in his program, he'd probably look into the documentation of open(); I agree. If that would point to codecs.open, I think Joe could arrange to import the codecs module and invoke the open function. Regards, Martin From paulp@ActiveState.com Thu Feb 8 01:10:53 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 17:10:53 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> <3A814678.2F245D14@lemburg.com> <3A81A585.F0771269@ActiveState.com> <200102080021.f180L0201582@mira.informatik.hu-berlin.de> Message-ID: <3A81F21D.650888C2@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > Of course, if Joe Programmer would suddenly be confronted with all his > open calls failing, he'd hate the new release, and would start to > flame comp.lang.python. I don't believe that open() calls should fail! 
We should present a pair of explicit alternatives for strings and binary data that are as easy as open() and document them as the recommended way. We should change the tutorials and the books to encourage people to choose the right function for the right job. Years from now we should deprecate open as an old way of doing things that has been superceded. > If he guesses that there is some issue with ASCII in his program, he'd > probably look into the documentation of open(); I agree. How would a user guess that there is "some issue with ASCII." Only the I18N-heads in this mailing list even understand that there is an issue. We need to inform people that there is a decision to be made. > If that would > point to codecs.open, I think Joe could arrange to import the codecs > module and invoke the open function. I think it would be a really big mistake to make the right thing involve so much more code than the easy thing. What is your aversion to fopen/stropen/txtopen/binopen or whatever you want to call them? Why not make life easier? Paul Prescod From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 01:08:56 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 02:08:56 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A81B1A7.4E1D022C@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 12:35:51 -0800) References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> <3A80A10B.1E978B30@ActiveState.com> <200102070806.f1786eg01079@mira.informatik.hu-berlin.de> <3A81B1A7.4E1D022C@ActiveState.com> Message-ID: <200102080108.f1818uG01762@mira.informatik.hu-berlin.de> > > Just try > > > > reader = codecs.lookup("ISO-8859-2")[2] > > charfile = reader(file) > > > > There could be a convenience function, but that also is a detail. > > Usability is not a detail in this particular case. We are trying to > change people's behavior and help them make more robust code. Ok, just propose a specific patch; I'd recommend to add another function to the codecs module, rather than adding another built-in. > That's fine. I'll change the document to be more explicit. Would you > agree that: "Unicode is the only *character set* that supports *all of > the world's major written languages.*" That is certainly the case. > > > Not under my proposal. file.read returns a character string. Sometimes > > > the character string contains characters between 0 and 255 and is > > > indistinguishable from today's string type. Sometimes the file object > > > knows that you want the data decoded and it returns large characters. > > > > I guess we have to defer this until I see whether it is feasible > > (which I believe it is not - it was the mistake Sun made in the early > > JDKs). > > What was the mistake? Java early had methods that treated Strings and byte array interchangably if the strings had character values below 256. One left-over from that is public String(byte[] ascii, int hibyte); // in class java.lang.String It would use the ascii array, and fill it with hibyte in-between; hibyte was typically 0. The documentation now says # Deprecated. This method does not properly convert bytes into # characters. As of JDK 1.1, the preferred way to do this is via the # String constructors that take a character-encoding name or that use # the platform's default encoding. The reverse operation of that is getBytes(nt srcBegin, int srcEnd, byte[] dst, int dstBegin): # Deprecated. 
This method does not properly convert characters into # bytes. As of JDK 1.1, the preferred way to do this is via the # getBytes(String enc) method, which takes a character-encoding name, # or the getBytes() method, which uses the platform's default # encoding. I'd say your proposal is in the direction of repeating this mistake. > You and I agree that streams can change encoding mid-stream. You > probably think that should be handled by passing the stream to various > codecs as you read (or by doing double-buffer reads). I think that it > should be possible right in the read method. Please take it as a fact that it is impossible to do that at an arbitrary point in the stream; codecs that need to maintain state will result strangely. > It's funny how we switch back and forth. If I say that Python reads byte > 245 into character 245 and thus uses Latin 1 as its default encoding I'm > told I'm wrong. Python has no native encoding. If I claim that in > passing data to C we should treat character 245 as the C "char" with the > value 245 you tell me that I'm proposing Latin 1 as the default > encoding. Python has no default character set *in its byte string type*. Once you have Unicode objects, talking about language-specified character sets is meaningful. > Python has a concept of character that extends from 0 to 255. C has a > concept of character that extends from 0 to 255. There is no issue of > "encoding" as long as you stay within those ranges. C supports various character sets, depending on context. Encodings do matter here already, e.g. when selecting fonts. Some character sets supported in C have characters >256, even if they are stored in char* (in particular, MBCS have these properties). > > That Chinese Python programmer should use his editor of choice, and > > put _() around strings that are meant as text (as opposed to strings > > that are protocol). > > I don't know what you mean by "protocol" here. If you do print "GET "+url+" HTTP/1.0" then the strings are really not meant to be human-readable, they are part of some machine-to-machine communication protocol. > But nevertheless, you are saying that the Chinese programmer must do > more than the English programmer does and I consider that a problem. It just works for the English programmer by coincidence; that programmer should really tell apart text and byte strings in source as well. Following the Unicode path, source files should be UTF-8, but that won't work in practice because of missing editor support. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 01:37:05 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 02:37:05 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A81F21D.650888C2@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 17:10:53 -0800) References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> <3A814678.2F245D14@lemburg.com> <3A81A585.F0771269@ActiveState.com> <200102080021.f180L0201582@mira.informatik.hu-berlin.de> <3A81F21D.650888C2@ActiveState.com> Message-ID: <200102080137.f181b5101963@mira.informatik.hu-berlin.de> > What is your aversion to fopen/stropen/txtopen/binopen or whatever you > want to call them? I'm opposed to adding new builtins. There are already way too many builtins. Just have a look at dir(__builtins__) and try to explain what each and every of them exactly does. People had been using string.join happily without demanding that it is builtin. 
I'd admit that codecs.open seems wrong also - it is not a codec that is being opened. New builtins are worse, IMO (what is an f, a str, or a bin?). Adding flags to open looks acceptable, though. Regards, Martin From paulp@ActiveState.com Thu Feb 8 02:24:37 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 18:24:37 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <20010207060106.20984.qmail@web102.mail.yahoo.co.jp> <3A814678.2F245D14@lemburg.com> <3A81A585.F0771269@ActiveState.com> <200102080021.f180L0201582@mira.informatik.hu-berlin.de> <3A81F21D.650888C2@ActiveState.com> <200102080137.f181b5101963@mira.informatik.hu-berlin.de> Message-ID: <3A820365.3A351F72@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > I'd admit that codecs.open seems wrong also - it is not a codec that > is being opened. New builtins are worse, IMO (what is an f, a str, or > a bin?). Adding flags to open looks acceptable, though. open already has two optional arguments. I want to add a new mandatory argument. I don't see a way to do it cleanly. Actually, I thought of something which I'll explain in more detail further down. "fopen" stands for "file open". Now that you mention it, "fileopen" is probably the best name -- more descriptive even than today's "open". It would have a mandatory encoding attribute which can be None only if you use the "b" flag to indicate that you want binary data. ---- fileopen (filename, encoding, [mode[, bufsize]])) Return a new file object (described earlier under Built-in Types). The first and third argument are the same as for stdio's fopen(): filename is the file name to be opened, mode indicates how the file is to be opened: 'r' for reading, 'w' for writing (truncating an existing file), and 'a' opens it for appending (which on some Unix systems means that all writes append to the end of the file, regardless of the current seek position). Modes 'r+', 'w+' and 'a+' open the file for updating (note that 'w+' truncates the file). If the file cannot be opened, IOError is raised. If mode is omitted, it defaults to 'r'. The encoding attribute should be a string indicating the encoding of the file. Common values are "ASCII" (for English-only text), "ISO Latin 1" for most Western scripts. "UTF-8" and "UTF-16" are often used for mixed language documents. "Shift-JIS" and "Big5" are typically used to read Eastern scripts. The special value "RAW" means that the file object should return bytes as-is with no translation into a "byte string". The optional bufsize argument specifies the file's desired buffer size: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size. A negative bufsize means to use the system default, which is usually line buffered for for tty devices and fully buffered for other files. If omitted, the system default is used. --- "open" could actually be extended to be like "fileopen" if we look at the second parameter and interpret it according to its contents. If it matches the regexp [rwa]+?b? then we treat it as the "deprecated form." Otherwise we treat it as an encoding. I don't think we have to worry about an encoding whose name matches that pattern any time soon! So in documentation encoding would NOT be optional but in practice there would be a period in which it would be optional so that people could migrate their code. 
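To make the transition idea concrete, here is one possible sketch of how it could sit on top of the existing codecs module (the name open_compat, the exact regexp, and the handling of "RAW" are illustrative assumptions, not part of the proposal text):

import re, codecs

_OLD_MODE = re.compile(r"[rwa]\+?b?$")   # Paul's "deprecated form" pattern

def fileopen(filename, encoding, mode="r", bufsize=-1):
    # encoding is mandatory; the special value "RAW" means untranslated bytes
    if encoding == "RAW":
        return open(filename, mode + "b", bufsize)
    return codecs.open(filename, mode, encoding, "strict", bufsize)

def open_compat(filename, arg="r", *rest):
    # migration shim: a mode-looking second argument keeps today's meaning,
    # anything else is interpreted as an encoding name
    if _OLD_MODE.match(arg):
        return open(filename, arg, *rest)
    return fileopen(filename, arg, *rest)

Whether the builtin should grow this behaviour, or whether it belongs in the codecs module, is exactly the point argued over in the following messages.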
Paul Prescod From paulp@ActiveState.com Thu Feb 8 02:40:29 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 18:40:29 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <3A800EE5.A8122B3C@ActiveState.com> <200102062049.f16Kngq01092@mira.informatik.hu-berlin.de> <3A80A10B.1E978B30@ActiveState.com> <200102070806.f1786eg01079@mira.informatik.hu-berlin.de> <3A81B1A7.4E1D022C@ActiveState.com> <200102080108.f1818uG01762@mira.informatik.hu-berlin.de> Message-ID: <3A82071D.812227F1@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > public String(byte[] ascii, int hibyte); // in class java.lang.String > > It would use the ascii array, and fill it with hibyte in-between; > hibyte was typically 0. The documentation now says > > # Deprecated. This method does not properly convert bytes into > # characters. That's right. This function could generate invalid Unicode. That's totally different than what I'm proposing! > ... > It just works for the English programmer by coincidence; that > programmer should really tell apart text and byte strings in source as > well. Are you really saying that if you were a writing a Python book you would say that the appropriate way to write a "Hello World" program is: print _("Hello World") Please give some thought to usability! I love Python because it is syntactically clean and semantically simple. I can show people Python code and they immediately understand it. If you are right, then Python is a scripting language that truly has a simpler syntax for "byte strings" than it does for "character strings". If that's so then there is something seriously broken in the language and we need to figure out how to fix it. Paul Prescod From paulp@ActiveState.com Thu Feb 8 03:04:50 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 19:04:50 -0800 Subject: [I18n-sig] Re: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> Message-ID: <3A820CD2.25C3F978@ActiveState.com> "Martin v. Loewis" wrote: > > > > > I'm lost here. Let's say I'm using Python 1.5. I have some KOI8-R data > > in a string literal. PythonWin and Tk expect Unicode. How could they > > display the characters correctly? > > No, PythonWin and Tk both tell apart Unicode and byte strings > (although Tk uses quite a funny algorithm to do so). If they see a > byte string, they convert it using the platform encoding (which is > user-settable on both Windows and Unix) to a Unicode string, and > display that. And if they read in a file from a Frenchmen then they get random Russian characters on their screen. Or they crash the third-party software because it couldn't decode properly. Or ... This is what we need to move away from. The first step is to get people to stop accidently passing around character strings as byte strings. To do that we need to make it as easy as possible to get properly decoded strings into Python. > > ... > > What do you think *should* happen? These are the only choices I can > > think of: > > > > 1. DOM encodes it as UTF-8 > > 2. 
DOM blindly passes it through and creates illegal XML > > 3. (correct) User explicitly decodes data into Unicode charset. > > What users expect to happen is 2; blindly pass-through. They think > they can get it right; given enough control, this is feasible. It was > even common practice in the absence of Unicode objects, so a lot of > code depends on libraries passing things through as-is. Surely you agree with me that it is inappropriate for a user to *expect* a DOM implementation to pass on binary data unmolested. That some particular DOM may do so (like minidom) is probably just a performance optimizatoin quirk that could go away at any time. Why would we go out of our way to support people making this mistake? > > If you don't know the encoding of your data sources then you should say > > that explicitly in code rather than using the same functions as people > > who *do* know what their encoding is. Explicit is better than implicit, > > right? Our current default is totally implicit. > > No, it's not. The current default is: always produce byte strings. A "byte string" is not something you'll find defined in the Python tutorial, language reference or library reference. People who use open() do not know that they are making a choice. If you ask a hundred Python programmers whether the result of open() is a character stream or a byte stream, most will say character stream. The same goes for string literals. The section of the Python language reference describing string literals does not mention the word "byte" once. It mentions the world character on almost every other line. > In > many applications, people certainly *should* use character strings, > but they have to change their code for that. Telling everybody to use > fopen for everything is wrong; telling them to use codecs.open for > character streams is right. In another message you admitted that the codec mechanism is somewhat user unfriendly...so I hope we agree that we need something better. People need to start making a choice and we have to make that as easy for them as possible! Paul Prescod From uche.ogbuji@fourthought.com Thu Feb 8 03:22:12 2001 From: uche.ogbuji@fourthought.com (Uche Ogbuji) Date: Wed, 07 Feb 2001 20:22:12 -0700 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: Message from "M.-A. Lemburg" of "Wed, 07 Feb 2001 13:58:32 +0100." <3A814678.2F245D14@lemburg.com> Message-ID: <200102080322.UAA03196@localhost.localdomain> > Hooper Brian wrote: > > ... > > What about adding an > > optional encoding argument to the existing open(), > > allowing encoding to be passed to that, and using 'raw' as > > the default format (what it does now)? > > This is what codecs.open() already provides. I think this should be codecs.fopen() to avoid any confusion. -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +1 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. 
C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python From paulp@ActiveState.com Thu Feb 8 03:30:59 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 07 Feb 2001 19:30:59 -0800 Subject: [I18n-sig] Concatenation Message-ID: <3A8212F3.6F7371D2@ActiveState.com> Would anyone out there that would object if this were allowed in Python 2.1: >>> u"abc"+"\245" u"abc\245" I can vaguely (only vaguely) understand the arguments about casting when passing high-bit data to a C-API but I wonder if anyone would argue that the code above is ambiguous in its intent. Paul Prescod From mal@lemburg.com Thu Feb 8 10:01:38 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 11:01:38 +0100 Subject: [I18n-sig] Re: [Python-Dev] unichr References: <3A81B24B.6AE348A9@ActiveState.com> Message-ID: <3A826E82.446C68F9@lemburg.com> Paul Prescod wrote: > > Ka-Ping Yee wrote: > > > > ... > > > > At the moment, since the default encoding is ASCII, something like > > > > u"abc" + chr(200) > > > > would cause an exception because 200 is outside of the ASCII range. > > Yes, this is another mistake in Python's current handling of strings. > there is absolutely nothing special about the 128-255 range of > characters. We shouldn't start throwing exceptions until we get to 256. You are forgetting that the range 128-255 is used by many codepages to support language specific characters. chr(0xE0) will give different characters in the US than e.g. in Russia. If we were to simply let these conversions slip through, then people would find garbled data in their text files. Of course, if a user explicitly sets the default encoding to Latin-1, then everything will be fine, but for ASCII (which is the base of most character encodings in use today) there is little other we can do except to raise an exception. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tdickenson@geminidataloggers.com Thu Feb 8 10:26:00 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Thu, 8 Feb 2001 10:26:00 -0000 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model Message-ID: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> > > There is already a large body of code that mixes text and > binary data > > in the same type. If we have separate text/binary types, > then we need > > to plan a transition period to allow code to distinguish between the > > two uses. > > I think the current Unicode implementation has this property: Unicode > is the type for representing character strings; the string type the > one for representing byte strings. The problem isnt so much in the current implementation; its in the code that has been written to that implementation. At the moment it is unnatural to write print u"hello world" rather than the easier print "hello world" even though the message is clearly text. I think we agree that, eventually, we would like the simple notation for a string literal to create a unicode string. What Im not sure about is whether we can make that change soon. How often are string literals used to create what is logically just binary data? 
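Two literals of the kind in question, to make the contrast concrete; both are written with today's plain-string notation even though only the first is logically text (the PNG signature is just an arbitrary example of byte-oriented data):

greeting = "hello world"          # logically a character string
png_magic = "\x89PNG\r\n\x1a\n"   # logically raw bytes, not text

The b"..." notation discussed below would let the second case say what it means.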
From tdickenson@geminidataloggers.com Thu Feb 8 11:03:16 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Thu, 08 Feb 2001 11:03:16 +0000 Subject: [I18n-sig] Concatenation In-Reply-To: <3A8212F3.6F7371D2@ActiveState.com> References: <3A8212F3.6F7371D2@ActiveState.com> Message-ID: On Wed, 07 Feb 2001 19:30:59 -0800, Paul Prescod wrote: >Would anyone out there that would object if this were allowed in Python >2.1: In 2.1, yes. For as long as we have text data stored in a mix of string and unicode objects, this rule is a good way of picking up encoding-assumption bugs early. (for 2.0 I argued against this, but today I can recognise its usefulness) >>>> u"abc"+"\245" >u"abc\245" Of course, this should work once type(u"abc")==type("\245"). I think we agree this is the long term goal. >I wonder if anyone would argue that >the code above is ambiguous in its intent. A small variation: >>> x = 'd' >>> print u"abc"+x abcd >>> x = "\245" >>> print u"abc"+x Traceback (most recent call last): File "", line 1, in ? UnicodeError: ASCII decoding error: ordinal not in range(128) Toby Dickenson tdickenson@geminidataloggers.com From mal@lemburg.com Thu Feb 8 11:29:11 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 12:29:11 +0100 Subject: [I18n-sig] Concatenation References: <3A8212F3.6F7371D2@ActiveState.com> Message-ID: <3A828307.3AD1504C@lemburg.com> Paul Prescod wrote: > > Would anyone out there that would object if this were allowed in Python > 2.1: > > >>> u"abc"+"\245" > u"abc\245" > > I can vaguely (only vaguely) understand the arguments about casting when > passing high-bit data to a C-API but I wonder if anyone would argue that > the code above is ambiguous in its intent. Please see my other reply on this subject. We can't simply ignore the default encoding here or else people will lose data ! -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Feb 8 12:40:19 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 13:40:19 +0100 Subject: [I18n-sig] Move to codecs.open() as builtin open() (Pre-PEP: Proposed Python Character Model) References: <200102080322.UAA03196@localhost.localdomain> Message-ID: <3A8293B3.D8D4A2B3@lemburg.com> Uche Ogbuji wrote: > > > Hooper Brian wrote: > > > ... > > > What about adding an > > > optional encoding argument to the existing open(), > > > allowing encoding to be passed to that, and using 'raw' as > > > the default format (what it does now)? > > > > This is what codecs.open() already provides. > > I think this should be codecs.fopen() to avoid any confusion. Isn't the need to import it from codecs enough to notice the difference ? from codecs import open as fopen also does the trick in 2.1, BTW. Perhaps we should make codecs.open the new open() in 2.2 ?! (the API would have to be tweaked a bit though to make the argument order match the open() API) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Feb 8 13:26:02 2001 From: mal@lemburg.com (M.-A.
Lemburg) Date: Thu, 08 Feb 2001 14:26:02 +0100 Subject: [I18n-sig] Re: Pre-PEP: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808339.7B2BD5D6@ActiveState.com> <200102062350.f16Noqc02391@mira.informatik.hu-berlin.de> <3A8091D3.F45F666A@ActiveState.com> <200102070725.f177P4X00905@mira.informatik.hu-berlin.de> <3A81A927.FAE4303D@ActiveState.com> Message-ID: <3A829E6A.5129C048@lemburg.com> Paul Prescod wrote: > > "Martin v. Loewis" wrote: > > > > ... > > > > So every s and s# conversion would trigger a copying of the > > string. How is that implemented? Currently, every Unicode object has a > > reference to a string object that is produced by converting to the > > default character set. Would it grow another reference to a string > > object that is carrying the Latin-1-conversion? > > I'm not clear on the status of the concept of "default charater set." > First, I think you mean "default character encoding". Second, I thought > that that idea was removed from user-view at least, wasn't it? I was > thinking that we would use that slot to hold the char->ord->char > conversion (which you can interpret as Latin-1 or not depending on your > philosophy). The extra slot is a merely needed to implement s and s# conversions since these pass back references to a real C char buffer. Let's *not* do more of those... > > Certainly. Applications expect to write to the resulting memory, and > > expect to change the underlying string; this is valid only if one had > > been passing NULL to PyString_FromStringAndSize. > > The documentation says that the PyString_AsString and PyString_AS_STRING > buffers must never be modified. I forgot that the "real" protocol is > that that buffer can be modified. We'll need to copy its contents back > to the Unicode string before the next operation that uses the Unicode > value. Not rocket science but somewhat tedious. Paul, please have a look at the es and es# conversions -- I think these do what you have in mind here. Writing to buffers returned by s or s# is never permitted, you'd have to use w# to get at a writeable C buffer. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Feb 8 13:34:07 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 14:34:07 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> Message-ID: <3A82A04F.5A03CAB2@lemburg.com> Toby Dickenson wrote: > > > > There is already a large body of code that mixes text and > > binary data > > > in the same type. If we have separate text/binary types, > > then we need > > > to plan a transition period to allow code to distinguish between the > > > two uses. > > > > I think the current Unicode implementation has this property: Unicode > > is the type for representing character strings; the string type the > > one for representing byte strings. > > The problem isnt so much in the current implementation; its in the code that > has been written to that implementation. At the moment it is unnatural to > write > > print u"hello world" > > rather than the easier > > print "hello world" > > even though the message is clearly text. 
Sure, but how is Python going to deduce this information from the string ? I once proposed to use a new qualifier for binary data, e.g. b"binary data" or d"binary data". Don't remember the outcome though as this was during the heated debate over how to do Unicode right earlier last year. Perhaps the only new type we need is an easy to manage binary data type that behaves very much like the old-school strings. In Py3K we can then all fit them into a new class hierarchy to come close to unification:

        binary data string
          |           |
        text data string
          |           |
          |           |
   Unicode string   encoded 8-bit string
                     (with encoding information !)

> I think we agree that, eventually, we would like the simple notation for a > string literal to create a unicode string. What I'm not sure about is whether > we can make that change soon. How often are string literals used to create > what is logically just binary data? Often enough to make "python -U" fail badly... -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tdickenson@geminidataloggers.com Thu Feb 8 14:09:15 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Thu, 8 Feb 2001 14:09:15 -0000 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model Message-ID: <9FC702711D39D3118D4900902778ADC83244A2@JUPITER> > I once proposed to use a new qualifier for binary data, e.g. > b"binary data" or d"binary data". Don't remember the outcome though > as this was during the heated debate over how to do Unicode right > earlier last year. > > Perhaps the only new type we need is an easy to manage > binary data type that behaves very much like the old-school > strings. Yes, that all sounds like a good idea. I think changing some "strings" to b"strings" is a necessary step on the way to 'python -U'. I would want to avoid the need for a 2.0-style 'default encoding', so I suggest it shouldn't be possible to mix this type with other strings: >>> "1"+b"2" Traceback (most recent call last): File "", line 1, in ? TypeError: cannot add type "binary" to string >>> "3"==b"3" 0 From paulp@ActiveState.com Thu Feb 8 15:16:01 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 8 Feb 2001 07:16:01 -0800 (PST) Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <9FC702711D39D3118D4900902778ADC83244A2@JUPITER> Message-ID: I really like the idea of the b"..." prefix. Is anyone opposed? ------ I think we are in sight of agreement on 1. [file]?open(filename, encoding, ...) 2. b"..." 3. an encoding declaration at the top of files 4. that concatenating Python strings and Unicode strings should do the "obvious" thing for characters from 127-255 and nothing for characters beyond. 5. a bytestring type that behaves in every way shape and form like our current string type but has a different type() and repr(). These would all be small but important incremental moves to a better Python. As time goes by we can deprecate more and more "ambiguous" usages like: * regular string literals that use non-ASCII characters when there is no encoding declaration * open() calls that do not specify an encoding (or "RAW") Paul Prescod From paulp@ActiveState.com Thu Feb 8 15:22:51 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 8 Feb 2001 07:22:51 -0800 (PST) Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: Message-ID: On Thu, 8 Feb 2001, Paul Prescod wrote: > > 4.
that concatenating Python strings and Unicode strings should do the > "obvious" thing for charcters from 127-255 and nothing for characters > beyond. Sorry, I see now that this is still controversial... Paul Prescod From paulp@ActiveState.com Thu Feb 8 15:31:21 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 8 Feb 2001 07:31:21 -0800 (PST) Subject: [I18n-sig] Re: [Python-Dev] unichr In-Reply-To: <3A826E82.446C68F9@lemburg.com> Message-ID: On Thu, 8 Feb 2001, M.-A. Lemburg wrote: > You are forgetting that the range 128-255 is used by many codepages > to support language specific characters. No, I'm not forgetting that. I just don't think it is relevant. > chr(0xE0) will give different > characters in the US than e.g. in Russia. If we were to simply > let these conversions slip through, then people would find garbled > data in their text files. People in Russia understand the concept of code pages. They know that if they put "special" characters in their files they will be interpreted on other platforms as Western European characters. If we make it easy for them to explicitly state their encoding then the will do so and get better behavior then they did before. We can also simplify Python and remove an arbitrary restriction at the same time. > Of course, if a user explicitly sets the default encoding to > Latin-1, then everything will be fine, but for ASCII (which is > the base of most character encodings in use today) there is > little other we can do except to raise an exception. I don't think the "default encoding" is a relevant concept. Most people came out strongly against it on the Python lists and it was hidden from user view for that reason. It is a terrible idea to encourage people to write software that works right on their computer but not on anyone else's. I think that we should view the "default encoding" as an implementation artifact and nothing more. We need to define portable rules that will consistently make sense everywhere. Paul Prescod From tdickenson@geminidataloggers.com Thu Feb 8 15:33:48 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Thu, 08 Feb 2001 15:33:48 +0000 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: References: <9FC702711D39D3118D4900902778ADC83244A2@JUPITER> Message-ID: On Thu, 8 Feb 2001 07:16:01 -0800 (PST), Paul Prescod wrote: >5. a bytestring type that behaves in every way shape and form like our >current string type but has a different type() and repr(). I mentioned some other desirable differences too: >I would want to avoid the need for a 2.0-style 'default encoding', so I >suggest it shouldnt be possible to mix this type with other strings: > >>>> "1"+b"2" >Traceback (most recent call last): > File "", line 1, in ? >TypeError: cannot add type "binary" to string >>>> "3"=3D=3Db"3" >0 Without a default encoding would need to change its str() too. Toby Dickenson tdickenson@geminidataloggers.com From mal@lemburg.com Thu Feb 8 16:33:49 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 17:33:49 +0100 Subject: [I18n-sig] Binary data b"strings" (Pre-PEP: Proposed Python Character Model) References: <9FC702711D39D3118D4900902778ADC83244A2@JUPITER> Message-ID: <3A82CA6D.313D5E39@lemburg.com> Toby Dickenson wrote: > > > I once proposed to use a new qualifier for binary data, e.g. > > b"binary data" or d"binary data". Don't remember the outcome though > > as this was during the heated debate over how to do Unicode right > > earlier last year. 
> > > > Perhaps the only new type we need is an easy to manage > > binary data type that behaves very much like the old-school > > strings. > > Yes, that all sounds like a good idea. I think changing some "strings" to > b"strings" is a necessary step on the way to 'python -U'. > > I would want to avoid the need for a 2.0-style 'default encoding', so I > suggest it shouldnt be possible to mix this type with other strings: > > >>> "1"+b"2" > Traceback (most recent call last): > File "", line 1, in ? > TypeError: cannot add type "binary" to string > >>> "3"==b"3" > 0 Right. This will cause people to rethink whether they are using the object for text data or binary data. I still think that at the interface level, b"" and "" should be treated the same (except that b""-strings should not implement the char buffer interface). OTOH, these b""-strings should implement the same methods as the array type and probably seemlessly interact with it too. I don't know which type should be considered "better" in coercion though, b""-strings or arrays (I guess b""-strings). (Waiting for someone to tear down the idea... ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tdickenson@geminidataloggers.com Thu Feb 8 16:41:09 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Thu, 8 Feb 2001 16:41:09 -0000 Subject: [I18n-sig] Binary data b"strings" (Pre-PEP: Proposed Python Character Model) Message-ID: <9FC702711D39D3118D4900902778ADC83244A4@JUPITER> > OTOH, these b""-strings should implement the same methods as the > array type and probably seemlessly interact with it too. I don't > know which type should be considered "better" in coercion > though, b""-strings or arrays (I guess b""-strings). That raises the question of mutability... instances of the array type are mutable. On every occasion that I have needed a mutable string type, it has been when the string was holding binary data. Do we want b"strings" to be mutable? Do we want b"strings" to be the same as the array type? (if not now, maybe at the same time we unify plain "strings" and u"strings") From mal@lemburg.com Thu Feb 8 16:45:21 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 17:45:21 +0100 Subject: [I18n-sig] Re: [Python-Dev] unichr References: Message-ID: <3A82CD21.8F47E109@lemburg.com> Paul Prescod wrote: > > On Thu, 8 Feb 2001, M.-A. Lemburg wrote: > > > You are forgetting that the range 128-255 is used by many codepages > > to support language specific characters. > > No, I'm not forgetting that. I just don't think it is relevant. It is not irrelevant as you describe below... > > chr(0xE0) will give different > > characters in the US than e.g. in Russia. If we were to simply > > let these conversions slip through, then people would find garbled > > data in their text files. > > People in Russia understand the concept of code pages. They know that > if they put "special" characters in their files they will be interpreted > on other platforms as Western European characters. If we make it easy for > them to explicitly state their encoding then the will do so and get better > behavior then they did before. We can also simplify Python and remove an > arbitrary restriction at the same time. 
Well, we can remove the restriction for string literals, but the same coercion happens for generated strings and these are not under control of some source encoding parameter. I once suggested that strings (the 8-bit ones) get an .encoding attribute to carry along that information, but it quickly showed that the idea would not be of much use because of the generation problem and because the only coercion from a string with encoding information and one without that information is to produce a new string without encoding information (or maybe not coerce them at all). See the python-dev archives for more on this idea (early last year). > > Of course, if a user explicitly sets the default encoding to > > Latin-1, then everything will be fine, but for ASCII (which is > > the base of most character encodings in use today) there is > > little other we can do except to raise an exception. > > I don't think the "default encoding" is a relevant concept. Most people > came out strongly against it on the Python lists and it was hidden from > user view for that reason. It is a terrible idea to encourage people to > write software that works right on their computer but not on anyone > else's. I think that we should view the "default encoding" as an > implementation artifact and nothing more. We need to define portable rules > that will consistently make sense everywhere. That is exactly why we made as hard as possible for people to *change* the default. It is pretty obvious that they are on their own when trying to fiddle with site.py or sitecustomize.py. Still, I believe its a valid idea. Back when I wrote the proposal for Unicode integration I had fixed the default encoding to UTF-8. As the first working patches appeared, there was a long and heated discussion about what encoding to choose as default (people didn't like UTF-8). There were basically two camps: UTF-8 and Latin-1. We then decided to make the encoding a variable for have people try out different encodings. Next, the idea of a locale based default encoding was brought up. Fredrik and I then implemented the needed magic to figure out the platform specific default encoding, but subsequently the idea was dropped by our BDFL in favour of ASCII which is what we see now. The support code was left in the distribution... and Pythoneers quickly found it ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Thu Feb 8 16:54:27 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 08 Feb 2001 17:54:27 +0100 Subject: [I18n-sig] Binary data b"strings" (Pre-PEP: Proposed Python Character Model) References: <9FC702711D39D3118D4900902778ADC83244A4@JUPITER> Message-ID: <3A82CF42.C1B2F426@lemburg.com> Toby Dickenson wrote: > > > OTOH, these b""-strings should implement the same methods as the > > array type and probably seemlessly interact with it too. I don't > > know which type should be considered "better" in coercion > > though, b""-strings or arrays (I guess b""-strings). > > That raises the question of mutability... instances of the array type are > mutable. On every occasion that I have needed a mutable string type, it has > been when the string was holding binary data. > > Do we want b"strings" to be mutable? 
No way -- that would confuse the hell out of newbies and everyone else ;-) I basically want to reuse the string and/or buffer implementation and add some more useful methods to it, eg. things from the struct module and array module. > Do we want b"strings" to be the same as the array type? (if not now, maybe > at the same time we unify plain "strings" and u"strings") Not really. Arrays should still be the right type for mutable binary data chunks, even at that point. (This idea clearly needs some more thought... :) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Thu Feb 8 17:44:08 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 8 Feb 2001 09:44:08 -0800 (PST) Subject: [I18n-sig] Strawman Proposal: Binary Strings Message-ID: A binary string is a string that is declared by the user to be a carrier of binary data and not (directly) of textual data (Unicode characters). In order to get a rapid adoption of binary strings, they are designed to be as similar to Python strings as is possible. This means that they have all of the same methods, are immutable and so forth. They also follow Python's existing string->Unicode coercion rules. These rules are arguably too "loose" but experience shows that coercion rules are often highly personal and the arguments one way or the other tend to be philosophical rather than practical. For example, Java and JavaScript automatically coerce objects to strings when they are added to strings. Python does not. Neither choice seems a large mistake. Binary strings differ from regular strings in the following ways: a) they have a unique type object named types.BinaryString b) they are constucted in Python code in one of three ways: 1. using a "b" prefix on string literals 2. using a function called binary() 3. from some other C-coded function such as a file i/o library c) they repr() themselves with a b"" prefix as per Unicode strings One reason to add the binary data type is because at some point in the future may deprecate the construction of binary data in ordinary string literals. Although details remain to be worked out, it is a goal that in the future string literals will always be interpreted as character strings. That might mean that non-ASCII characters will some day be disallowed or that they wil be interpeted according to a declared Unicode transformation encoding. Conventions for binary file I/O will be worked out in a separate proposal. From tdickenson@geminidataloggers.com Thu Feb 8 18:16:39 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Thu, 08 Feb 2001 18:16:39 +0000 Subject: [I18n-sig] Strawman Proposal: Binary Strings In-Reply-To: References: Message-ID: On Thu, 8 Feb 2001 09:44:08 -0800 (PST), Paul Prescod wrote: > b.2. using a function called binary() You say that precise coercion rules are a personal preference, but adding a coercion function just helps this ambiguity to persist. What if string.encode() returned a binary string.... would we need a 'binary()' builtin at all? >They also follow >Python's existing string->Unicode coercion rules. I agree any explicit coecion should follow the same rules as Unicode. Im not sure we agree on whether that coercion happens automatically and implicitly, as it does with Unicode strings; I feel fairly strongly that it shouldnt. (Ill justify that tomorrow if we do disagree). 
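To make the strawman's b""-strings easier to picture, here is a rough Python sketch of the intended surface behaviour. The BinaryString class, the binary() helper and the error message are illustrative stand-ins for the proposed built-in type, not an implementation; it leans on UserString (available since Python 1.6) only because built-in types cannot be subclassed in this era.

from UserString import UserString

class BinaryString(UserString):
    """Illustrative stand-in for the proposed immutable binary string type."""

    def __repr__(self):
        # The strawman asks for a b"" prefix on repr(), as u"" does for Unicode.
        return 'b' + repr(self.data)

    def __add__(self, other):
        # Refuse to silently mix binary data with Unicode text.
        if type(other) is type(u''):
            raise TypeError('cannot add type "binary" to Unicode string')
        return BinaryString(self.data + str(other))

def binary(obj):
    """Rough equivalent of the proposed binary() constructor."""
    return BinaryString(str(obj))

raw = binary("GIF89a\x01\x00")
print repr(raw)    # -> b'GIF89a\x01\x00'

Whether concatenation with plain 8-bit strings should also be refused is exactly the coercion question under discussion; the sketch only blocks the Unicode case.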
An extra difference: d) The str() is the same as the repr(). I think this makes sense. The library reference says str() returns "a nicely printable representation of an object" - and raw binary data definitely isnt. It gives users a chance to think about what they are storing in the string. Also, having repr the same as str is the same as lists, dicts, and other 'data container' types. Toby Dickenson tdickenson@geminidataloggers.com From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 19:29:36 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 20:29:36 +0100 Subject: [I18n-sig] Re: Python Character Model In-Reply-To: <3A820CD2.25C3F978@ActiveState.com> (message from Paul Prescod on Wed, 07 Feb 2001 19:04:50 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> Message-ID: <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> > And if they read in a file from a Frenchmen then they get random Russian > characters on their screen. Or they crash the third-party software > because it couldn't decode properly. Or ... > > This is what we need to move away from. Move-away-from, perhaps. Outright force moving by breaking people's code, no. > Surely you agree with me that it is inappropriate for a user to > *expect* a DOM implementation to pass on binary data > unmolested. That some particular DOM may do so (like minidom) is > probably just a performance optimizatoin quirk that could go away at > any time. Why would we go out of our way to support people making > this mistake? Because of backwards compatibility. Breaking people's programs is not good - even if they are using a style or an algorithm that you despise. > In another message you admitted that the codec mechanism is somewhat > user unfriendly...so I hope we agree that we need something better. No, I admitted that it is inconsequential if read as English. It is no more or less friendly than a module that's called, say, file, so you'd use file.open. In either case, the user will have to learn what to use. Many Python users won't guess the right meaning into codec, just as many people won't guess what "modem" stands for - yet they are fully capable of using it (despite its demodulating nature :). 
Regards, Martin From paulp@ActiveState.com Thu Feb 8 20:11:12 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 12:11:12 -0800 Subject: [I18n-sig] Re: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> Message-ID: <3A82FD60.EFB38FAD@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > Move-away-from, perhaps. Outright force moving by breaking people's > code, no. Guido has been very clear that breaking incorrect code is not necessarily a problem. Remember the two-arg socket issue? Anyhow, I was against the two-arg socket change and I would be against a string/unicode unification *today*. But I strongly believe that we should announce that that is the direction we are going so that people can fix their code to conform with the coming world order. We have a deprecation/warning mechanism precisely so that we can change the language gradually -- even in backwards-incompatible ways. > > In another message you admitted that the codec mechanism is somewhat > > user unfriendly...so I hope we agree that we need something better. > > No, I admitted that it is inconsequential if read as English. It is no > more or less friendly than a module that's called, say, file, so you'd > use file.open. In either case, the user will have to learn what to use. I'll repeat the point that when you make the recommended way to do things harder than the non-recommended (in this case implicit) way, people will be slow to move if they ever move at all. Usability! > Many Python users won't guess the right meaning into codec, just as > many people won't guess what "modem" stands for - yet they are fully > capable of using it (despite its demodulating nature :). You seem to agree above that file is a better name so I'm not sure why we're still talking about "codec"! The only question is whether it should be file.open, fileopen or reusing our existing open. Do you have any problems with reusing open with a quasi-mandatory (HIGHLY RECOMMENDED!) encoding argument? Paul Prescod From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 19:58:20 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 20:58:20 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> (message from Toby Dickenson on Thu, 8 Feb 2001 10:26:00 -0000) References: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> Message-ID: <200102081958.f18JwKd01167@mira.informatik.hu-berlin.de> > print u"hello world" > > rather than the easier > > print "hello world" > > even though the message is clearly text. You can easily have the latter being Unicode by invoking Python with the -U option. If the pragma PEP is ever implemented, one pragma should be reserved to declare the source file encoding, and another one to declare all strings as Unicode in this file. 
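For readers who have not tried it, the effect of the -U switch Martin mentions can be checked with a one-line script; run with plain "python" it prints 0, run with "python -U" it prints 1, because every plain literal is then compiled as a Unicode literal (this is the existing 2.0/2.1 option, nothing new is assumed).

# check_u.py
print type("hello world") is type(u"hello world")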
> I think we agree that, eventually, we would like the simple notation > for a string literal to create a unicode string. What Im not sure > about is whether we can make that change soon. How often are string > literals used to create what is logically just binary data? Let's have a look. Excluding __doc__ strings (which can be recognized syntactically), performing grep '"' in the Python library, I get BaseHTTPServer.py:__version__ = "0.2" BaseHTTPServer.py:__all__ = ["HTTPServer", "BaseHTTPRequestHandler"] Both are "protocol" in some sense, i.e. not meant to be human-readable. +2 for binary data BaseHTTPServer.py:DEFAULT_ERROR_MESSAGE = """\ This is text, giving +1 for binary data. Actually, it is HTML, so when transferring it, it needs to be encoded in some encoding; so it *could* be considered as the encoded message instead BaseHTTPServer.py: sys_version = "Python/" + string.split(sys.version)[0] BaseHTTPServer.py: server_version = "BaseHTTP/" + __version__ BaseHTTPServer.py: self.request_version = version = "HTTP/0.9" # Default BaseHTTPServer.py: self.send_error(400, "Bad request version (%s)BaseHTTPServer.py: "Bad HTTP/0.9 request type (%s BaseHTTPServer.py: self.send_error(400, "Bad request syntax (%s)" % ` BaseHTTPServer.py: self.send_error(501, "Unsupported method (%s)" % ` Part of the HTTP protocol, thus binary data. +9 BaseHTTPServer.py: self.log_error("code %d, message %s", code, message) Log file; this is text, so +8 self.wfile.write("%s %s %s\r\n" % HTTP protocol, +9 There are a few more. In total, BaseHTTPServer.py contains more binary strings than text strings. For other files, the ratio may vary. In general, I believe "binary" strings in source code, as many of the strings are typically processed by some other program which expects a specific byte sequence, rather than a character string. Human-readable strings or probably more common in GUI applications. One should think about i18n here, which means that the actual localized message catalogs must be separate from the program logic. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 20:09:26 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 21:09:26 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A82A04F.5A03CAB2@lemburg.com> (mal@lemburg.com) References: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> <3A82A04F.5A03CAB2@lemburg.com> Message-ID: <200102082009.f18K9QI01197@mira.informatik.hu-berlin.de> > encoded 8-bit string (with encoding > information !) I'd like to point out that this is something that Bill Janssen always wanted to see. In CORBA, they number encodings for efficient representation; that's something that Python could do as well. CORBA took the OSF charset registry. That was a mistake, they think about using the IANA registry now. This registry provides both textual and numeric identifiers for encodings (numeric in the form of MIBEnum values). Regards, Martin From paulp@ActiveState.com Thu Feb 8 20:24:49 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 12:24:49 -0800 Subject: [I18n-sig] Strawman Proposal: Binary Strings References: Message-ID: <3A830091.3D855EDD@ActiveState.com> Toby Dickenson wrote: > > ... > > You say that precise coercion rules are a personal preference, but > adding a coercion function just helps this ambiguity to persist. > > What if string.encode() returned a binary string.... would we need a > 'binary()' builtin at all? I guess not. 
But the encode method might already be in use. If we combine your restrictive coercion suggestion with this suggestion we might break some (admittedly newish) code. How about "str.binencode(encoding)". Also, it isn't entirely unbelievable that someone might want to encode from a string to a string. e.g. base64 (do we call that an encoding??) So having an binencode() seperate from encode() might be a good idea. Alternate names are "binary", "asbinary", "tobinary", "getbinary" and any underscore-separated variant. > ... > I agree any explicit coecion should follow the same rules as Unicode. > Im not sure we agree on whether that coercion happens automatically > and implicitly, as it does with Unicode strings; I feel fairly > strongly that it shouldnt. (Ill justify that tomorrow if we do > disagree). If we were inventing something from whole cloth I would agree with you. But I want people to quickly port their string-using applications over to binary-strings and if we require a bunch more explicit conversions then they will move more slowly. Nevertheless, I'm not willing to fight about the issue. There are two votes against coercion already and if the response is similarly anti-coercion then I'll agree. > An extra difference: > > d) The str() is the same as the repr(). That sounds okay with me... Paul Prescod From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 20:46:16 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 8 Feb 2001 21:46:16 +0100 Subject: [I18n-sig] Re: Python Character Model In-Reply-To: <3A82FD60.EFB38FAD@ActiveState.com> (message from Paul Prescod on Thu, 08 Feb 2001 12:11:12 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> Message-ID: <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> > > Move-away-from, perhaps. Outright force moving by breaking people's > > code, no. > > Guido has been very clear that breaking incorrect code is not > necessarily a problem. For that, a requirement is/should be that the code was documented as incorrect (which I think was the case in the socket calls - even though examples had it wrong). In this case, I think the code currently clearly *is* correct. Nowhere in the Python reference manual is using bytes > 128 in a string declared as incorrect. Instead, the documentation says # 8-bit characters may be used in string literals and comments but # their interpretation is platform dependent; the proper way to insert # 8-bit characters in string literals is by using octal or hexadecimal # escape sequences. So while people should be aware that their scripts many not work on other platforms, I think they are granted permission to use that, and can expect that this continues to work in the same platform-dependent way in the future (if used on the same platform). 
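For reference, the distinction the reference manual draws can be shown in a couple of lines: the escape spellings pin down the byte values regardless of how the source file itself was saved, while typing the accented character directly into the literal stores whatever byte the editor's character set assigns to it.

e_acute = "\xe9"      # byte 0xE9 (e-acute in Latin-1), spelled portably
e_acute2 = "\351"     # the same byte written as an octal escape
print ord(e_acute), ord(e_acute2)    # -> 233 233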
> > Many Python users won't guess the right meaning into codec, just as > > many people won't guess what "modem" stands for - yet they are fully > > capable of using it (despite its demodulating nature :). > > You seem to agree above that file is a better name so I'm not sure why > we're still talking about "codec"! Because that is how it works in Python 2.0. I'm against moving functions around like that, as this gives a clear message that python-dev can and will break anything anytime. Please see a recent message from Fredrik Lundh who was complaining about this very problem: people seem to be fond of breaking other people's code "for their own good". > Do you have any problems with reusing open with a quasi-mandatory > (HIGHLY RECOMMENDED!) encoding argument? No, that is certainly better than another builtin. It'd depend on the exact signature, though - as you point out, open has already two optional arguments. Adding a third one won't fly; it has to be a keyword argument, then. Regards, Martin From paulp@ActiveState.com Thu Feb 8 21:35:12 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 13:35:12 -0800 Subject: [I18n-sig] Re: Python Character Model References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> Message-ID: <3A831110.6AADE590@ActiveState.com> "Martin v. Loewis" wrote: > > > ... > > You seem to agree above that file is a better name so I'm not sure why > > we're still talking about "codec"! > > Because that is how it works in Python 2.0. I'm against moving > functions around like that, as this gives a clear message that > python-dev can and will break anything anytime. Aliasing a function into builtins would not break anything! > Please see a recent > message from Fredrik Lundh who was complaining about this very > problem: people seem to be fond of breaking other people's code "for > their own good". That will happen in the language's evolution. We sometimes make mistakes that must be fixed later. I think most of us agree that in the year 2001 it is a mistake to have literal strings map to byte strings instead of character strings. We're going to have to break some code to fix that mistake. The only question is whether we do it a little at a time like K&R C->ANSI C->C++ or in a big bang like C++ -> Java or VB to VB.NET. I prefer the former. > > Do you have any problems with reusing open with a quasi-mandatory > > (HIGHLY RECOMMENDED!) encoding argument? > > No, that is certainly better than another builtin. It'd depend on the > exact signature, though - as you point out, open has already two > optional arguments. Adding a third one won't fly; it has to be a > keyword argument, then. At the bottom of one of my messages I proposed that we insert it as the second argument. Although the encoding and mode are both strings there is no syntactic overlap between [rwa][+]?[tb]+ and the set of existent or proposed encodings. 
If we merely outlaw encodings with that name then we can quickly figure out whether the second argument is a mode or an encoding. So the documented syntax would be open(filename, encoding, [[mode], bytes]) And the documentation would say: "There is an obsolete variant that does not require an encoding string. This may cause a warning in future versions of Python and be removed sometime after that." Paul Prescod From paulp@ActiveState.com Thu Feb 8 23:14:40 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 15:14:40 -0800 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model References: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> <200102081958.f18JwKd01167@mira.informatik.hu-berlin.de> Message-ID: <3A832860.B5D15B3D@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > For other files, the ratio may vary. In general, I believe "binary" > strings in source code, as many of the strings are typically processed > by some other program which expects a specific byte sequence, rather > than a character string. I think that your counting methodology is highly suspect. I consider a binary string to be a string that contains elements that the author did not think of in terms of some subset of Unicode. So for example: sys_version = "Python/" + string.split(sys.version)[0] Nobody would ever expect sys_version to have anything other than Unicode characters in it. The pattern of strings produced here will always be composed only of Unicode-legal elements. A GIF file is binary because most bytes are not intended to be Unicode characters. According to your definition, an XML document comprising a SOAP message is "binary" rather than "text" despite what the XML specification says. After all, what could be more "protocol" than SOAP. Things like the Python version and SOAP messages are designed to be both protocol and text. Thats a major part of what distinguishes SOAP from DCOM or IIOP for example. Paul Prescod From paulp@ActiveState.com Thu Feb 8 23:23:50 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 15:23:50 -0800 Subject: [I18n-sig] Binary data b"strings" (Pre-PEP: Proposed Python Character Model) References: <9FC702711D39D3118D4900902778ADC83244A2@JUPITER> <3A82CA6D.313D5E39@lemburg.com> Message-ID: <3A832A86.71833150@ActiveState.com> I've thought about this coercion issue more...I think we need to auto-coerece these binary strings using some well-defined rule (NOT a default encoding!). "M.-A. Lemburg" wrote: > > > ... > > > > I would want to avoid the need for a 2.0-style 'default encoding', so I > > suggest it shouldnt be possible to mix this type with other strings: > > > > >>> "1"+b"2" > > Traceback (most recent call last): > > File "", line 1, in ? > > TypeError: cannot add type "binary" to string > > >>> "3"==b"3" > > 0 > > Right. This will cause people to rethink whether they are > using the object for text data or binary data. I still think that > at the interface level, b"" and "" should be treated the same (except > that b""-strings should not implement the char buffer interface). If C functions auto-convert these things then people will coerce them by passing them through C functions. e.g. the regular expression engine or null encoding functions or whatever. If we do NOT auto-coerce these things then they will not be compatible with many parts of the Python infrastructure, the regular expression engine and codecs being the most important examples. 
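A minimal sketch of the open(filename, encoding, mode) calling convention Paul proposes above, under the assumption that a simple pattern is enough to tell a mode string from an encoding name. The wrapper name, the pattern and the reliance on codecs.open() are illustrative choices, not part of the proposal.

import re
import codecs

# Second arguments that look like a mode ('r', 'wb', 'a+', ...); anything
# else in that position is treated as an encoding name.
_MODE = re.compile(r'^[rwa]\+?[tb]*$')

def open_with_encoding(filename, encoding_or_mode='r', mode='r'):
    if _MODE.match(encoding_or_mode):
        # Old-style call: open_with_encoding('data.bin', 'rb')
        return open(filename, encoding_or_mode)
    # New-style call: open_with_encoding('readme.txt', 'koi8-r', 'r');
    # codecs.open() returns a file-like object whose read() yields Unicode.
    return codecs.open(filename, mode, encoding_or_mode)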
A clear requirement from Andy Robinson was that string-like code should work on binary data because often binary strings are "really" un-decoded strings. I think he is speaking on behalf of a lot of serious internationalizers there. > OTOH, these b""-strings should implement the same methods as the > array type and probably seemlessly interact with it too. I don't > know which type should be considered "better" in coercion > though, b""-strings or arrays (I guess b""-strings). Let's keep arrays separate. Arrays are mutable! If users ask for some particular features from arrays to be also implemented on byte strings, so be it. Let's only add magic after we know we really need it. Paul Prescod From paulp@ActiveState.com Thu Feb 8 23:45:26 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 15:45:26 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes Message-ID: <3A832F96.B6FEEB51@ActiveState.com> Python source files may declare their encoding in the first few lines. An encoding declaration must be found before the first statement in the source file. The encoding declaration is not a pragma. It does not show up in the parse tree and has no semantic meaning for the compiler itself. It is conceptually handled in a pre-compile "encoding sniffing" step. This step is done using the Latin 1 encoding. The encoding declaration has the following basic syntax: #?encoding="" is the encoding name and must be associated with a registered codec. The appropriate codec is used to decode the source file. The decoded result is passed to the compiler. Once the decoding is done, the encoding declaration has no other effect. In other words, it does not further affect the interpretation of string literals with non-ASCII characters or anything else. The encoding declaration SHOULD be present in all Python source files encoded in any character encoding other than 7-bit ASCII. Some future version of Python may make this an absolute requirement. From paulp@ActiveState.com Fri Feb 9 00:24:39 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 08 Feb 2001 16:24:39 -0800 Subject: [I18n-sig] Strawman Proposal: Smart String Test Message-ID: <3A8338C7.6824679C@ActiveState.com> The types module will contain a new function called isstring(obj) types.isstring returns true if the object could be directly interpreted as a string. This is defined as: "implements the string interface and is compatible with the re regular expression engine." At the moment no user types fit this criteria so there is no mechanism for extending the behavior of the types.isstring function yet. It's initial behavior is: def isstring(obj): return type(obj) in (StringType, UnicodeType) Paul Prescod From martin@loewis.home.cs.tu-berlin.de Thu Feb 8 22:03:23 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Thu, 8 Feb 2001 23:03:23 +0100 Subject: [I18n-sig] Re: Python Character Model In-Reply-To: <3A831110.6AADE590@ActiveState.com> (message from Paul Prescod on Thu, 08 Feb 2001 13:35:12 -0800) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> Message-ID: <200102082203.f18M3N105616@mira.informatik.hu-berlin.de> > At the bottom of one of my messages I proposed that we insert it as the > second argument. Although the encoding and mode are both strings there > is no syntactic overlap between [rwa][+]?[tb]+ and the set of existent > or proposed encodings. I missed that; that is a good approach. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Fri Feb 9 08:24:08 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 9 Feb 2001 09:24:08 +0100 Subject: [I18n-sig] Pre-PEP: Proposed Python Character Model In-Reply-To: <3A832860.B5D15B3D@ActiveState.com> (message from Paul Prescod on Thu, 08 Feb 2001 15:14:40 -0800) References: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> <200102081958.f18JwKd01167@mira.informatik.hu-berlin.de> <3A832860.B5D15B3D@ActiveState.com> Message-ID: <200102090824.f198O8901238@mira.informatik.hu-berlin.de> > So for example: > > sys_version = "Python/" + string.split(sys.version)[0] > > Nobody would ever expect sys_version to have anything other than Unicode > characters in it. My point is that sys_version is used in self.send_header('Server', self.version_string()) That is, it is sent following a specific transfer syntax of the underlying protocol (HTTP), and that transfer syntax is defined in terms of byte sequences. There is a constraint in the protocol that most of the bytes must be restricted to the printable characters of ASCII, though. Suppose we raise exceptions at some time if something other than bytes are written into a byte stream which has no associated encoding. Then, I suspect, that fragment should rewritten as sys_version = b"Python/" + string.split(sys.version)[0].encode("ASCII") The Server: header that we send will be a byte sequence, not a text message. > According to your definition, an XML document comprising a SOAP message > is "binary" rather than "text" despite what the XML specification says. > After all, what could be more "protocol" than SOAP. It depends. If it goes through an encoding before being transmitted, then it should be represented as a character string. If it is written to a socket directly, e.g. with msg = "some SOAP specific elements I don't know" s.write(msg) Then certainly, yes, that document is represented in a binary string. Please note that some XML document can be represented in many ways: character strings, binary strings, DOM trees, SAX event sequences, etc. 
The "XML document comprising a SOAP message", in itself, has no inherent representation; whether a specific representation ought to be treated as text or binary primarily depends on whether there is encoding or not. Regards, Martin From tdickenson@geminidataloggers.com Fri Feb 9 09:46:12 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Fri, 09 Feb 2001 09:46:12 +0000 Subject: [I18n-sig] Strawman Proposal: Binary Strings In-Reply-To: <3A830091.3D855EDD@ActiveState.com> References: <3A830091.3D855EDD@ActiveState.com> Message-ID: On Thu, 08 Feb 2001 12:24:49 -0800, Paul Prescod wrote: >> What if string.encode() returned a binary string.... would we need a >> 'binary()' builtin at all? > >I guess not. But the encode method might already be in use. If we >combine your restrictive coercion suggestion with this suggestion we >might break some (admittedly newish) code. How about >"str.binencode(encoding)". > >Also, it isn't entirely unbelievable that someone might want to encode >from a string to a string. e.g. base64 (do we call that an encoding??) >So having an binencode() seperate from encode() might be a good idea. >Alternate names are "binary", "asbinary", "tobinary", "getbinary" and >any underscore-separated variant. Yes, the type of value returned from string.encode(x) depends on x. I intended to suggest that string.encode('latin1') would be the best way to convert from string to binary. However, I now see that wont work for plain strings: their .encode() method always goes via unicode, using the default encoding. So: Im happy with you .binary() method on strings. Add it bstrings too (as a 'return self'), but not unicode strings. >> I agree any explicit coecion should follow the same rules as Unicode. >> Im not sure we agree on whether that coercion happens automatically >> and implicitly, as it does with Unicode strings; I feel fairly >> strongly that it shouldnt. (Ill justify that tomorrow if we do >> disagree). > >If we were inventing something from whole cloth I would agree with you. >But I want people to quickly port their string-using applications over >to binary-strings and if we require a bunch more explicit conversions >then they will move more slowly. > >Nevertheless, I'm not willing to fight about the issue. There are two >votes against coercion already and if the response is similarly >anti-coercion then I'll agree. Waaaaaah. There are some backward-compatability issues that complicate my comparison proposal..... Consider some old code that print md5('some stuff').digest() =3D=3D 'reference' We want this to do the right thing after: * changing .digest() to return a string * changing 'reference' to b'reference' * changing both Therefore we have to allow string/bstring comparisons. However, raising an exception on unicode/bstring comparison still makes sense. Toby Dickenson tdickenson@geminidataloggers.com From mal@lemburg.com Fri Feb 9 10:10:33 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 11:10:33 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A832F96.B6FEEB51@ActiveState.com> Message-ID: <3A83C219.86D3F355@lemburg.com> Paul Prescod wrote: > > Python source files may declare their encoding in the first few lines. > An encoding declaration must be found before the first statement in the > source file. > > The encoding declaration is not a pragma. It does not show up in the > parse tree and has no semantic meaning for the compiler itself. 
It is > conceptually handled in a pre-compile "encoding sniffing" step. This > step is done using the Latin 1 encoding. I'd rather restrict this to ASCII since codec names must be ASCII and this would also allow detecting wrong formats of the source file in addition to make UTF-16 detection possible. > The encoding declaration has the following basic syntax: > > #?encoding="" > > is the encoding name and must be associated with a > registered codec. The appropriate codec is used to decode the source > file. Decode to what other format ? Unicode, the current locale's encoding ? What would happen if the decoding step fails ? > The decoded result is passed to the compiler. Once the decoding is > done, the encoding declaration has no other effect. In other words, it > does not further affect the interpretation of string literals with > non-ASCII characters or anything else. But if it doesn't affect the interpretation of string literals then what benefits do we gain from knowing the encoding ? > The encoding declaration SHOULD be present in all Python source files > encoded in any character encoding other than 7-bit ASCII. Some future > version of Python may make this an absolute requirement. I think that such a scheme is indeed possible, but not until we have made all strings default to Unicode. Then decoding to Unicode would be the proper thing to do. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Feb 9 10:26:07 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 11:26:07 +0100 Subject: [I18n-sig] Strawman Proposal: Smart String Test References: <3A8338C7.6824679C@ActiveState.com> Message-ID: <3A83C5BF.30606DD2@lemburg.com> Paul Prescod wrote: > > The types module will contain a new function called > > isstring(obj) > > types.isstring returns true if the object could be directly interpreted > as a string. This is defined as: "implements the string interface and is > compatible with the re regular expression engine." re compatibility is given by read buffer compatibility; it is not restricted to strings. In fact re works on buffers and mmap'ed files too. > At the moment no user > types fit this criteria so there is no mechanism for extending the > behavior of the types.isstring function yet. It's initial behavior is: > > def isstring(obj): > return type(obj) in (StringType, UnicodeType) Looks like a hack which would only serve a temporary need... The right thing to do would be to add a new abstract string type object and then have isinstance() return 1 for StringType and UnicodeType when asked for the new abstract type. The problem with this approach is that we would be constructing a forward compatible mechanism before having designed the class hierarchie (see my other post) for these types, e.g. binary data string (BinaryDataType) | | text data string (TextDataType) | | | | Unicode string encoded 8-bit string (with encoding (UnicodeType) (StringType) information !) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Feb 9 11:56:53 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 12:56:53 +0100 Subject: [I18n-sig] BOF session on IPC9 DevDay about this ? 
Message-ID: <3A83DB05.15628F36@lemburg.com> Would anyone here like to talk through these proposals on DevDay ? Perhaps we could arrange some BOF-session for it, or integrate it into one of the scheduled sessions ?! (Don't know what the procedure for this is, that's why I put Jeff Bauer, the chair of the DevDay on CC). Thanks, -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Fri Feb 9 15:29:54 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 9 Feb 2001 07:29:54 -0800 (PST) Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A83C219.86D3F355@lemburg.com> Message-ID: On Fri, 9 Feb 2001, M.-A. Lemburg wrote: > ...> > I'd rather restrict this to ASCII since codec names must be ASCII > and this would also allow detecting wrong formats of the source file > in addition to make UTF-16 detection possible. That's fine with me. > > is the encoding name and must be associated with a > > registered codec. The appropriate codec is used to decode the source > > file. > > Decode to what other format ? Unicode, the current locale's encoding ? > What would happen if the decoding step fails ? We would decode to Unicode. If Decoding fails you get some kind of EncodingException error. This would be trapped in import machinery to be raised as an ImportError for imported modules. > > The decoded result is passed to the compiler. Once the decoding is > > done, the encoding declaration has no other effect. In other words, it > > does not further affect the interpretation of string literals with > > non-ASCII characters or anything else. > > But if it doesn't affect the interpretation of string literals then > what benefits do we gain from knowing the encoding ? Let's say that you have a string literal: a="XX" XX are bytes representing a character. If the character represented has an ordinal less than 255 then this would work. More often you would say: a=u"XX" The system would treat those examples no differently than this one:t XX="a" This keeps the model very simple and allows us to evolve to wide-character variable names some day. > I think that such a scheme is indeed possible, but not until we > have made all strings default to Unicode. Then decoding to Unicode > would be the proper thing to do. Making all strings default to Unicode is a good idea but it is a separate project. I think that my proposal above is still useful. It means that a Russian can type Unicode characters into their document using their KOI8-R editor. They can't type those Unicode characters directly into a string literal but why would they want to now that we have Unicode? If there is some reason they want to keep typing wide chars into string literals then there must be some problem with our Unicode support and we should work that out. Until we work that out, they probably just wouldn't use our encoding declaration feature. Paul Prescod From paulp@ActiveState.com Fri Feb 9 15:39:02 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 9 Feb 2001 07:39:02 -0800 (PST) Subject: [I18n-sig] Strawman Proposal: Smart String Test In-Reply-To: <3A83C5BF.30606DD2@lemburg.com> Message-ID: On Fri, 9 Feb 2001, M.-A. Lemburg wrote: > > types.isstring returns true if the object could be directly interpreted > > as a string. 
This is defined as: "implements the string interface and is > > compatible with the re regular expression engine." > > re compatibility is given by read buffer compatibility; it is > not restricted to strings. In fact re works on buffers and mmap'ed > files too. There are two conditions listed above. mmap'd files (for example) do not support the string interface. There is no join(), search() etc. > > At the moment no user > > types fit this criteria so there is no mechanism for extending the > > behavior of the types.isstring function yet. It's initial behavior is: > > > > def isstring(obj): > > return type(obj) in (StringType, UnicodeType) > > Looks like a hack which would only serve a temporary need... So? Sometimes a temporary hack is the right thing to do. If you want to revive the types sig to figure out a hierarchical interface concept then go ahead. I trying to solve a very simple, localized and pervasive problem: type(foo)==type("") You proposed that we should handle it by having a tuple or list called StringTypes in the types module. I tried to make a solution that is more forward-thinking because we can bring in your interface hierarchy magic later. Code will just keep working when we figure out how to "subclass strings" because the determination will be made by the isstring abstraction. Is there a practical problem with this solution? Paul Prescod From paulp@ActiveState.com Fri Feb 9 15:40:57 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 9 Feb 2001 07:40:57 -0800 (PST) Subject: [I18n-sig] BOF session on IPC9 DevDay about this ? In-Reply-To: <3A83DB05.15628F36@lemburg.com> Message-ID: Are those BOFs ever useful? Everybody goes in with good intentions and leaves with good intentions and nothing happens...a good email war culminating in one or more PEPs seems more useful to me...the result is concrete. Paul Prescod From tdickenson@geminidataloggers.com Fri Feb 9 16:15:21 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Fri, 09 Feb 2001 16:15:21 +0000 Subject: [I18n-sig] Strawman Proposal: Smart String Test In-Reply-To: References: <3A83C5BF.30606DD2@lemburg.com> Message-ID: <4o588t4683cpu32srcmp928b1m5dr003i3@4ax.com> >Is there a practical problem with this solution? def isstring(obj): return type(obj) in (StringType, UnicodeType) or isinstance(obj, UserString) Toby Dickenson tdickenson@geminidataloggers.com From paulp@ActiveState.com Fri Feb 9 16:40:28 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 9 Feb 2001 08:40:28 -0800 (PST) Subject: [I18n-sig] Strawman Proposal: Smart String Test In-Reply-To: <4o588t4683cpu32srcmp928b1m5dr003i3@4ax.com> Message-ID: On Fri, 9 Feb 2001, Toby Dickenson wrote: > Paul Prescod wrote: > >Is there a practical problem with this solution? > > def isstring(obj): > return type(obj) in (StringType, UnicodeType) or isinstance(obj, > UserString) Are you saying that there is a problem with isstring? Or proposing a slightly different formulation? If the latter: does UserString really behave enough like a string to "work"? I've never tried it. In particular, does passing a UserString to a regexp do what you expect? Can you pass a UserString to the open() command as a filename, etc.? Paul Prescod From mal@lemburg.com Fri Feb 9 17:24:45 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 18:24:45 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: Message-ID: <3A8427DD.2BABEF53@lemburg.com> Paul Prescod wrote: > > On Fri, 9 Feb 2001, M.-A. 
Lemburg wrote: > > > ...> > > I'd rather restrict this to ASCII since codec names must be ASCII > > and this would also allow detecting wrong formats of the source file > > in addition to make UTF-16 detection possible. > > That's fine with me. > > > > is the encoding name and must be associated with a > > > registered codec. The appropriate codec is used to decode the source > > > file. > > > > Decode to what other format ? Unicode, the current locale's encoding ? > > What would happen if the decoding step fails ? > > We would decode to Unicode. If Decoding fails you get some kind of > EncodingException error. This would be trapped in import machinery to be > raised as an ImportError for imported modules. > > > > The decoded result is passed to the compiler. Once the decoding is > > > done, the encoding declaration has no other effect. In other words, it > > > does not further affect the interpretation of string literals with > > > non-ASCII characters or anything else. > > > > But if it doesn't affect the interpretation of string literals then > > what benefits do we gain from knowing the encoding ? > > Let's say that you have a string literal: > > a="XX" > > XX are bytes representing a character. If the character represented has an > ordinal less than 255 then this would work. More often you would say: > > a=u"XX" > > The system would treat those examples no differently than this one:t > > XX="a" > > This keeps the model very simple and allows us to evolve to wide-character > variable names some day. > > > I think that such a scheme is indeed possible, but not until we > > have made all strings default to Unicode. Then decoding to Unicode > > would be the proper thing to do. > > Making all strings default to Unicode is a good idea but it is a separate > project. I think that my proposal above is still useful. It means that a > Russian can type Unicode characters into their document using their KOI8-R > editor. > > They can't type those Unicode characters directly into a string literal > but why would they want to now that we have Unicode? If there is some > reason they want to keep typing wide chars into string literals then there > must be some problem with our Unicode support and we should work that out. > Until we work that out, they probably just wouldn't use our encoding > declaration feature. Ah, ok. The encoding information will only be applied to literal Unicode strings (u"text"), right ? That's in line with what we have already discussed here or on python-dev some time ago. Only then we tried to achive this using some form of pragma statement. So what this strawman suggest is in summary: 1. add an encoding identifier to the top of a source code file 2. use that encoding information to decode u"..." literals into Unicode 3. leave all other literals and text alone Sounds ok, even though it should probably made clear that only the u"" literals actually use the encoding information (perhaps the name should be #?unicode-encoding="" ?) and nothing else. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Feb 9 17:31:38 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 18:31:38 +0100 Subject: [I18n-sig] BOF session on IPC9 DevDay about this ? References: Message-ID: <3A84297A.1F9CAB19@lemburg.com> Paul Prescod wrote: > > Are those BOFs ever useful? 
Everybody goes in with good intentions and > leaves with good intentions and nothing happens...a good email war > culminating in one or more PEPs seems more useful to me...the result is > concrete. I just think that it might be a good idea since people usually don't have the time to read all the mails on high traffic threads like these. It is still useful to get some feedback from them, since the few participants in this thread are not representative for the typical i18n Python user and it is very likely that some aspects are simply forgotten due to a limited view on the implications of a proposal (believe me, I've made that experience more than once during the Unicode integration phase). Anyway, was just a thought. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Fri Feb 9 18:10:38 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 09 Feb 2001 10:10:38 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> Message-ID: <3A84329E.7B7CE012@ActiveState.com> "M.-A. Lemburg" wrote: > > ... > > Ah, ok. The encoding information will only be applied to literal > Unicode strings (u"text"), right ? No, that's very different than what I am suggesting. The encoding is applied to the *text file*. In the initial version, the only place Python would allow Unicode characters is in Unicode literals so currently the only USEFUL place to put those special characters is in Unicode literals. But Python may one day allow Unicode variable names or "simple string literals" and this mechanism will not change its definition or behavior. Only the Python grammar will change. The interpretation of string literals is a totally separate issue. Paul Prescod From mal@lemburg.com Fri Feb 9 19:30:33 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 20:30:33 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> Message-ID: <3A844559.8C284F7A@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > ... > > > > Ah, ok. The encoding information will only be applied to literal > > Unicode strings (u"text"), right ? > > No, that's very different than what I am suggesting. > > The encoding is applied to the *text file*. -1 The parser has no idea of what to do with Unicode input... this would mean that we would have to make it Unicode aware and this opens a new can of worms; not only in the case where this encoding specifier is used. Also, string literals ("text") would have to translate the Unicode input passed to the parser back to ASCII (or whatever the default encoding is) and this would break code which currently uses strings for data or some specific text encoding. The result would be way to much breakage. > In the initial version, the > only place Python would allow Unicode characters is in Unicode literals > so currently the only USEFUL place to put those special characters is in > Unicode literals. But Python may one day allow Unicode variable names or > "simple string literals" and this mechanism will not change its > definition or behavior. Only the Python grammar will change. Sorry, Paul, but this will never happen. Python is an ASCII programming language and does good at it. > The interpretation of string literals is a totally separate issue. See above. 
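For readers following the exchange, this is roughly what a module using the strawman declaration would look like. The file's on-disk encoding (Latin-1), the variable name and the print line are hypothetical, and which literals the decoding step ultimately affects is precisely the point still being argued.

#?encoding="latin-1"
# Hypothetical source file saved as Latin-1.  Under the strawman the
# declaration only drives a decode-before-compile step; the accented
# character below reaches the compiler as the intended code point.
city = u"Orléans"
print city.encode("utf-8")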
-- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Fri Feb 9 20:04:23 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 09 Feb 2001 12:04:23 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> Message-ID: <3A844D47.8DACEE2B@ActiveState.com> "M.-A. Lemburg" wrote: > > ... > > The parser has no idea of what to do with Unicode input... > this would mean that we would have to make it Unicode > aware and this opens a new can of worms; not only in the case > where this encoding specifier is used. Obviously the parser cannot be made unicode aware for Python 2.1 but why not for Python 2.2? What's so difficult about it? There's no rocket science. Also, if we wanted a quick hack, couldn't we implement it at first by "decoding" to UTF-8? Then the parser could look for UTF-8 in Unicode string literals and translate those into real Unicode. > Also, string literals ("text") would have to translate the > Unicode input passed to the parser back to ASCII (or whatever > the default encoding is) and this would break code which currently > uses strings for data or some specific text encoding. It would only break code that adds the encoding declaration. If you don't add the declaration you don't break any code! Plus, we all agree that passing binary data in literal strings should be a deprecated usage eventually. That's why we're inventing binary strings. > ... > Sorry, Paul, but this will never happen. Python is an ASCII > programming language and does good at it. I am amazed to hear you say that. Why SHOULDN'T we allow Chinese variables names some day? This is the 21st century. If we don't go after Asian markets someone else will! I've gotta admit that that kind of Euro-centric attitude sort of annoys me... Paul Prescod From martin@loewis.home.cs.tu-berlin.de Fri Feb 9 20:23:44 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 9 Feb 2001 21:23:44 +0100 Subject: [I18n-sig] Strawman Proposal: Smart String Test In-Reply-To: (message from Paul Prescod on Fri, 9 Feb 2001 07:39:02 -0800 (PST)) References: Message-ID: <200102092023.f19KNic00993@mira.informatik.hu-berlin.de> > So? Sometimes a temporary hack is the right thing to do. If you want to > revive the types sig to figure out a hierarchical interface concept then > go ahead. I trying to solve a very simple, localized and pervasive > problem: > > type(foo)==type("") > > You proposed that we should handle it by having a tuple or list called > StringTypes in the types module. I tried to make a solution that is more > forward-thinking because we can bring in your interface hierarchy magic > later. I'm in favour of adding types.isstring. I know that I have added try: StringTypes = [types.StringType, types.UnicodeType] except AttributeError: StringType = [types.StringType] ... if type(foo) in StringTypes: into many places, and that I had considered suggesting types.StringTypes as a standard feature. I did not provide a patch since it won't help for programs that need to be backwards-compatible. types.isstring won't help for backwards compatibility, either, but it is enough simplification over the original test (type(foo) == type("")), and has a great chance of being forwards-compatible. 
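Spelled out, the compatibility shim Martin describes might look like the following sketch; both branches bind the same StringTypes name so the membership test works on pre-Unicode Pythons as well.

import types

try:
    # Python 2.0 and later: both 8-bit and Unicode string types exist.
    StringTypes = (types.StringType, types.UnicodeType)
except AttributeError:
    # Older Pythons without Unicode: fall back to the 8-bit type alone.
    StringTypes = (types.StringType,)

def isstring(obj):
    # Forward-compatible spelling of the common type(foo) == type("") check.
    return type(obj) in StringTypes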
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Fri Feb 9 20:27:14 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 9 Feb 2001 21:27:14 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A8427DD.2BABEF53@lemburg.com> (mal@lemburg.com) References: <3A8427DD.2BABEF53@lemburg.com> Message-ID: <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> > So what this strawman suggests is in summary: > > 1. add an encoding identifier to the top of a source code file > 2. use that encoding information to decode u"..." literals into > Unicode > 3. leave all other literals and text alone I think the proposal was to do 3. raise an error if another literal uses bytes > 127 instead. Since users need to actively change their source to use the encoding declaration, they'll combine this with putting u in front of every affected string. If they then still have strings with bytes >127, they need to use the \x notation, as the string should not contain text. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Fri Feb 9 20:47:26 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 9 Feb 2001 21:47:26 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A844D47.8DACEE2B@ActiveState.com> (message from Paul Prescod on Fri, 09 Feb 2001 12:04:23 -0800) References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> Message-ID: <200102092047.f19KlQO01127@mira.informatik.hu-berlin.de> > > Sorry, Paul, but this will never happen. Python is an ASCII > > programming language and does good at it. > > I am amazed to hear you say that. Why SHOULDN'T we allow Chinese > variable names some day? This is the 21st century. If we don't go after > Asian markets someone else will! I've gotta admit that that kind of > Euro-centric attitude sort of annoys me... I'm not sure it is Euro-centric; many European languages have characters that can't be used in identifiers, either. People have learned to accept this restriction. Furthermore, there have been experiments with allowing arbitrary characters in identifiers, so users could use their language for identifiers. Turns out that this is nonsense, since it gives a mix of English and local language; i.e.

    from string import atoi
    def zähle():
        global Zähler
        try:
            Zähler = Zähler + 1
            print >>Datei, atoi(Zähler)
        except IOError:
            raise Fehler()

does not read very well. People have then attempted to translate the keywords as well, which would roughly give

    Aus Zeichenkette importiere AzuG   # ASCII zu Ganzzahl
    def zähle():
        globaler Zähler   # oder vielleicht besser: globales Zähler?
        versuche:
            Zähler = Zähler + 1
            print >>Datei, atoi(Zähler)
        außer EinAusFehler:
            wirf Fehler()

which is clearly nonsense: for one thing, most people will have difficulty recognizing the logic, even if they know both Python and German. In addition, the constructs read well only in English - other languages have different grammatical structures, for which you'd pick different syntactical rules in your programming language. So Python's syntax and standard library are already English-centric; allowing additional identifiers won't fix that. I'd really like to know whether speakers of non-Roman and non-Germanic languages feel differently about this issue: Should it be possible to pick Kanji characters as identifiers? Regards, Martin P.S.
Something than can and should be done about Python itself is translating the doc strings. Any volunteers that are interested in doing so, please contact me - or just start translating the message catalog that sits in the non-dist branch of the Python CVS. From mal@lemburg.com Fri Feb 9 21:39:42 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 22:39:42 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> Message-ID: <3A84639E.B86D57F2@lemburg.com> "Martin v. Loewis" wrote: > > > So what this strawman suggest is in summary: > > > > 1. add an encoding identifier to the top of a source code file > > 2. use that encoding information to decode u"..." literals into > > Unicode > > 3. leave all other literals and text alone > > I think the proposal was to do > > 3. raise an error if another literal uses bytes > 127 > > instead. Since users need to actively change their source to use the > encoding declaration, they'll combine this with putting u in front of > every affected string. If they then still have strings with bytes > >127, they need to use the \x notation, as the string should not > contain text. Hmm, are you sure this would make the encoding declaration a popular tool ? If we would just allow ASCII-supersets as source file encoding, then we wouldn't have to make that restriction, since only the Unicode literal handling in the parser would have to be adjusted (and this is easy to do). This would make UTF-16 encodings impossible, but I think that two-byte encodings not the right approach to maintainable programs anyways ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Feb 9 21:55:46 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 09 Feb 2001 22:55:46 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> Message-ID: <3A846762.C0735C2F@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > ... > > > > The parser has no idea of what to do with Unicode input... > > this would mean that we would have to make it Unicode > > aware and this opens a new can of worms; not only in the case > > where this encoding specifier is used. > > Obviously the parser cannot be made unicode aware for Python 2.1 but why > not for Python 2.2? What's so difficult about it? There's no rocket > science. > > Also, if we wanted a quick hack, couldn't we implement it at first by > "decoding" to UTF-8? Then the parser could look for UTF-8 in Unicode > string literals and translate those into real Unicode. I don't want to do "quick hacks", so this is a non-option. Making the parser Unicode aware is non-trivial as it requires changing lots of the internals which expect 8-bit C char buffers. It will eventually happen, but is not high priority since it only servers a convenience and not a real need. > > Also, string literals ("text") would have to translate the > > Unicode input passed to the parser back to ASCII (or whatever > > the default encoding is) and this would break code which currently > > uses strings for data or some specific text encoding. > > It would only break code that adds the encoding declaration. 
If you > don't add the declaration you don't break any code! If we change the parser to use Unicode, then we would have to decode *all* program text into Unicode and this is very likely to fail for people who put non-ASCII characters into their string literals. > Plus, we all agree that passing binary data in literal strings should be > a deprecated usage eventually. That's why we're inventing binary > strings. Yes, but this move needs time... binary strings are meant as easy to use alternative, so that programmers can easily make the required changes to their code (adding a few b's in front of their string literals won't hurt that much). > > ... > > Sorry, Paul, but this will never happen. Python is an ASCII > > programming language and does good at it. > > I am amazed to hear you say that. Why SHOULDN'T we allow Chinese > variables names some day? This is the 21st century. If we don't go after > Asian markets someone else will! I've gotta admit that that kind of > Euro-centric attitude sort of annoys me... ASCII is not Euro-centric at all since it is a common subset of very many common encodings which are in use today. Latin-1 would be, though... which is why ASCII was chosen as standard default encoding. The added flexibility in choosing identifiers would soon turn against the programmers themselves. Others have tried this and failed badly (e.g. look at the language specific versions of Visual Basic). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Fri Feb 9 22:13:36 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 9 Feb 2001 23:13:36 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A84639E.B86D57F2@lemburg.com> (mal@lemburg.com) References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> Message-ID: <200102092213.f19MDaY01399@mira.informatik.hu-berlin.de> > If we would just allow ASCII-supersets as source file encoding, > then we wouldn't have to make that restriction, since only the > Unicode literal handling in the parser would have to be adjusted > (and this is easy to do). That would work, I'm not feeling to strongly either way. > This would make UTF-16 encodings impossible, but I think that > two-byte encodings not the right approach to maintainable programs > anyways ;-) I certainly agree. I think Python should assume UTF-8 for Unicode strings in the long run unless declared otherwise, as that seems to be the winning encoding - unless MS can talk everybody into putting a BOM into every file. Regards, Martin From paulp@ActiveState.com Sat Feb 10 03:07:39 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 09 Feb 2001 19:07:39 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <3A846762.C0735C2F@lemburg.com> Message-ID: <3A84B07B.24834996@ActiveState.com> "M.-A. Lemburg" wrote: > > > ... > > Also, if we wanted a quick hack, couldn't we implement it at first by > > "decoding" to UTF-8? Then the parser could look for UTF-8 in Unicode > > string literals and translate those into real Unicode. > > I don't want to do "quick hacks", so this is a non-option. 
If it works and it is easy, there should not be a problem! > Making the parser Unicode aware is non-trivial as it requires > changing lots of the internals which expect 8-bit C char buffers. Are you talking about the Python internals or the parser internals? If the former, then I do not think you are correct. Only the parser needs to change. > If we change the parser to use Unicode, then we would > have to decode *all* program text into Unicode and this is very > likely to fail for people who put non-ASCII characters into their > string literals. Files with no declaration could be interpreted byte for char just as they are today! > .... > ASCII is not Euro-centric at all since it is a common subset > of very many common encodings which are in use today. Oh come on! The ASCII characters are sufficient to encode English and a very few other languages. > Latin-1 > would be, though... which is why ASCII was chosen as standard > default encoding. We could go back and forth on this but let me suggest you type in a program with Latin 1 in your Unicode literals and try and see what happens. Python already "recognizes" that there is a single logical translation from "old style strings" to Unicode strings and vice versa. > The added flexibility in choosing identifiers would soon turn > against the programmers themselves. Others have tried this and > failed badly (e.g. look at the language specific versions of > Visual Basic). That's a totally different and unrelated issue. Nobody is talking about language specific Pythons. We're talking about allowing people to name variables in their own languages. I think that anything else is Euro-centric. Paul Prescod From paulp@ActiveState.com Sat Feb 10 03:11:33 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 09 Feb 2001 19:11:33 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> Message-ID: <3A84B165.5F1D20D4@ActiveState.com> "M.-A. Lemburg" wrote: > > ... > > Hmm, are you sure this would make the encoding declaration a > popular tool ? > > If we would just allow ASCII-supersets as source file encoding, > then we wouldn't have to make that restriction, since only the > Unicode literal handling in the parser would have to be adjusted > (and this is easy to do). We have always said that only ASCII-supersets should be legal source file encodings. The compromise is to make the use of non-ASCII bytes only legal inside of Unicode literals. Then in the future we can either go "my way" (decode the whole file) or "your way" (decode only literals). Is that acceptable? Paul Prescod From paulp@ActiveState.com Sat Feb 10 04:01:36 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 09 Feb 2001 20:01:36 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <200102092047.f19KlQO01127@mira.informatik.hu-berlin.de> Message-ID: <3A84BD20.3042C87A@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > Furthermore, there have been experiments with allowing arbitrary > characters in identifiers, so users could use their language for > identifiers. Turns out that this is nonsense, since it gives a mix of > English and local language; i.e. > > ... > > def zähle(): > global Zähler I have seen this kind of code on Python-list.
Maybe the examples did not use "funny characters" but they certainly used other languages. I see no reason to restrict it, or only to English-like languages. You and I may or may not agree that it is a great idea but do you really feel comfortable saying: "this is technically possible and maybe someone has even submitted a patch to allow it but we won't support it because we think everyone should code in English." Paul Prescod From tim.one@home.com Sat Feb 10 04:45:45 2001 From: tim.one@home.com (Tim Peters) Date: Fri, 9 Feb 2001 23:45:45 -0500 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A84BD20.3042C87A@ActiveState.com> Message-ID: [Paul Prescod, to Martin] > ... do you really feel comfortable saying: "this is technically > possible and maybe someone has even submitted a patch to allow > it but we won't support it because we think everyone should code > in English." I'm comfortable with saying that, regardless of tech issues. The keywords and builtins and standard libraries are all written with English names no matter what, and that maximizes readability regardless of native tongue. Readability is important with Python. It's not like Python is unique here, either: from Algol to Ruby, virtually everyone from Europe to Asia who invents their own programming language sticks to English too. Now Java has supported Unicode source code since its beginning. If someone can dig up evidence that this has done more than complicate their compilers, *that* would be good to hear. I simply see no (zilch, nada) demand for this, outside of a handful of Gallic purists who would abandon the language anyway as soon as they realized Guido was Dutch <0.9 wink>. From martin@loewis.home.cs.tu-berlin.de Sat Feb 10 06:27:45 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 10 Feb 2001 07:27:45 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A84BD20.3042C87A@ActiveState.com> (message from Paul Prescod on Fri, 09 Feb 2001 20:01:36 -0800) References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <200102092047.f19KlQO01127@mira.informatik.hu-berlin.de> <3A84BD20.3042C87A@ActiveState.com> Message-ID: <200102100627.f1A6Rj400812@mira.informatik.hu-berlin.de> > you really feel comfortable saying: "this is technically possible > and maybe someone has even submitted a patch to allow it but we > won't support it because we think everyone should code in English." I would not feel comfortable, because I doubt it is technically possible. It may work for identifiers of variables (including functions and classes), however, it will surely fail for the names of packages and modules. As I said before, I tried to write such a program in Java, which is intended to support this. It failed, because the interpreter could not find the class file (which must have the name of the class in Java, unfortunately). Regards, Martin From martin@loewis.home.cs.tu-berlin.de Sat Feb 10 06:55:12 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Sat, 10 Feb 2001 07:55:12 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A84B165.5F1D20D4@ActiveState.com> (message from Paul Prescod on Fri, 09 Feb 2001 19:11:33 -0800) References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> <3A84B165.5F1D20D4@ActiveState.com> Message-ID: <200102100655.f1A6tCT01130@mira.informatik.hu-berlin.de> > We have always said that only ASCII-supersets should be legal source > file encodings. That may be a bit too restrictive. I understand that people use all of EUC-JP, Shift-JIS, and ISO-2022-JP to encode Japanese text. I'm not certain whether iso-2022 is used in source code, but the first two certainly are (euc-jp on Unix, shift-jis on Windows). My understanding is that only EUC-JP is an ASCII superset (*) (i.e. all bytes representing JIS characters are >127); in Shift-JIS, the encoding of a character is two bytes, of which only the first byte is always >128. Since Shift-JIS is quite common, it should be supported as a file encoding. Regards, Martin (*) ignoring the question whether \x24 is the DOLLAR SIGN or the YEN SIGN. From mal@lemburg.com Sat Feb 10 12:14:57 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 13:14:57 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> <3A84B165.5F1D20D4@ActiveState.com> Message-ID: <3A8530C1.479544D2@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > ... > > > > Hmm, are you sure this would make the encoding declaration a > > popular tool ? > > > > If we would just allow ASCII-supersets as source file encoding, > > then we wouldn't have to make that restriction, since only the > > Unicode literal handling in the parser would have to be adjusted > > (and this is easy to do). > > We have always said that only ASCII-supersets should be legal source > file encodings. Right. > The compromise is to make the use of non-ASCII bytes only legal inside > of Unicode literals. Then in the future we can either go "my way" > (decode the whole file) or "your way" (decode only literals). > > Is that acceptable? No, it's too restrictive and would break programs written using non-ASCII characters in normal string literals. We could agree on this though: 1. programs which do not use the encoding declaration are free to use non-ASCII bytes in literals; Unicode literals must use Latin-1 (for historic reasons) 2. programs which do make use of the encoding declaration may only use non-ASCII bytes in Unicode literals; these are then interpreted using the given encoding information and decoded into Unicode during the compilation step Part 1 assures backward compatibility. Part 2 assures that programmers start to think about where they have to use Unicode and which program literals are allowed to go into string literals. Part 1 is already implemented, part 2 is easy to do, since only the compiler will have to be changed (in two places). How's that for a compromise ? -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 12:37:06 2001 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Sat, 10 Feb 2001 13:37:06 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <3A846762.C0735C2F@lemburg.com> <3A84B07B.24834996@ActiveState.com> Message-ID: <3A8535F2.B56642B6@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > > ... > > > Also, if we wanted a quick hack, couldn't we implement it at first by > > > "decoding" to UTF-8? Then the parser could look for UTF-8 in Unicode > > > string literals and translate those into real Unicode. > > > > I don't want to do "quick hacks", so this is a non-option. > > If it works and it is easy, there should not be a problem! This is how I started into the Unicode debate (making UTF-8 the default encoding). It doesn't work out... let's not restart that discussion. > > Making the parser Unicode aware is non-trivial as it requires > > changing lots of the internals which expect 8-bit C char buffers. > > Are you talking about the Python internals or the parser internals. If > the former, then I do not think you are correct. Only the parser needs > to change. The parser would have to accept Py_UNICODE strings and work on these. The compiler needs to be able to convert Py_UNICODE back to char for e.g. string literals. We'd also have to provide external interfaces which convert char input for the parser into Unicode. This would introduce many new locations of possible breakage (please remember that variable length encodings are *very* touchy about wrong byte sequences). > > If we change the parser to use Unicode, then we would > > have to decode *all* program text into Unicode and this is very > > likely to fail for people who put non-ASCII characters into their > > string literals. > > Files with no declaration could be interpreted byte for char just as > they are today! Then we'd have to write two sets of parsers and compilers: one for Py_UNICODE and one for char... no way ;-) > > .... > > ASCII is not Euro-centric at all since it is a common subset > > of very many common encodings which are in use today. > > Oh come on! The ASCII characters are sufficient to encode English and a > very few other languages. Paul, programs have been written in ASCII for many many years. Are you trying to tell me that 30+ years of common usage should be ignored ? Programmers have gotten along with ASCII quite well, not only English speaking ones -- ASCII can be used to approximate quite a few other languages as well (provided you ignore accents and the like). For most other languages there are transliterations into ASCII which are in common use. For other good arguments, see Tim's post on the subject. > > Latin-1 > > would be, though... which is why ASCII was chosen as standard > > default encoding. > > We could go back and forth on this but let me suggest you type in a > program with Latin 1 in your Unicode literals and try and see what > happens. Python already "recognizes" that there is a single logical > translation from "old style strings" to Unicode strings and vice versa. Fact is, I would never use Latin-1 characters outside of literals. All my programs are written in (more or less ;) English, even the comments and doc-strings. If you ever write applications which programmers from around the world are supposed to comprehend and maintain, then English is the only reasonable common base, at least IMHO. 
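As a concrete illustration of the literal behaviour referred to above - in Python 2.0/2.1, non-ASCII bytes inside a u"..." literal are taken to be Latin-1 - here is a minimal sketch; the \xe4 escape is used so the example is independent of the source file's own encoding, but a raw Latin-1 byte in the source has the same effect in those versions:

    s = "\xe4"      # plain 8-bit string: a single byte with value 0xE4
    u = u"\xe4"     # Unicode literal: the same value read as Latin-1,
                    # i.e. U+00E4, LATIN SMALL LETTER A WITH DIAERESIS

    print len(s), len(u)            # -> 1 1
    print repr(u)                   # -> u'\xe4'
    print repr(u.encode("utf-8"))   # -> '\xc3\xa4' (two bytes in UTF-8)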
> > The added flexibility in choosing identifiers would soon turn > > against the programmers themselves. Others have tried this and > > failed badly (e.g. look at the language specific versions of > > Visual Basic). > > That's a totally different and unrelated issue. Nobody is talking about > language specific Pythons. We're talking about allowing people to name > variables in their own languages. I think that anything else is > Euro-centric. Funny, how you always refer to "Euro"-centric... ASCII is an American standard ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 14:56:10 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 15:56:10 +0100 Subject: [I18n-sig] Strawman Proposal: Binary Strings References: <3A830091.3D855EDD@ActiveState.com> Message-ID: <3A85568A.5B694917@lemburg.com> Toby Dickenson wrote: > > On Thu, 08 Feb 2001 12:24:49 -0800, Paul Prescod > wrote: > > >> What if string.encode() returned a binary string.... would we need a > >> 'binary()' builtin at all? binary() is needed one way or another. It is standard Python philosophy that all types need to have an exposed constructor and these should do some form of implicit or explicit but well-defined coercion from other data types to binary strings. About changing .encode() or the existing codecs to return binary strings instead of normal strings: I'm -1 on this one since it will break existing code. The outcome of .encode() is totally up the codec doing the work, BTW (and this is by design), so new codecs could choose to return binary strings. For converting strings or Unicode to binary data, I'd suggest to add a "binary" codec which then returns the raw bytes of the string ior Unicode object in question as binary string. Note that changing e.g. .encode('latin-1') to return a binary string doesn't really make sense, since here we know the encoding ! Instead, strings should probably carry along the encoding information in an additional attribute (it is not always useful, but can help in a few situations) provided that it is known. This would give us three string types: 1. standard 8-bit strings with encoding attribute 2. binary 8-bit strings without encoding attribute or a constant value of 'binary' for this attribute 3. Unicode strings which don't need an encoding attribute :-) Hmm, getting all these to properly interoperate without breaking existing code will be troublesome... -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Sat Feb 10 15:17:29 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 07:17:29 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: Message-ID: <3A855B89.459A18E4@ActiveState.com> Tim Peters wrote: > > ... > > I'm comfortable with saying that, regardless of tech issues. The keywords > and builtins and standard libraries are all written with English names no > matter what, and that maximizes readability regardless of native tongue. People keep bringing up this issue of keywords. I've never disputed that the keywords should always be English. > Now Java has supported Unicode source code since its beginning. 
If someone > can dig up evidence that this has done more than complicate their compilers, > *that* would be good to hear. I simply see no (zilch, nada) demand for > this, outside of a handful of Gallic purists who would abandon the language > anyway as soon as they realized Guido was Dutch <0.9 wink>. I'm not personally willing to design in such a limitiation. I have seen a lot of code that mixes other languages with English. e.g.: http://starship.python.net/pipermail/python-de/2000q3/000597.html I don't think this guy is doing anything wrong. If a Japansese person asks me if they could do the same I would say: "Not now, but hopefully someday." There are a lot of people who write code that will never be seen by a speaker of an ASCII-compatible language. Why should they be forced to write it in ASCII? Paul Prescod From paulp@ActiveState.com Sat Feb 10 15:24:02 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 07:24:02 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <200102092047.f19KlQO01127@mira.informatik.hu-berlin.de> <3A84BD20.3042C87A@ActiveState.com> <200102100627.f1A6Rj400812@mira.informatik.hu-berlin.de> Message-ID: <3A855D12.46E4376E@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > I would not feel comfortable, because I doubt it is technically > possible. It may work for identifiers of variables (including > functions and classes), however, it will surely fail for the names of > packages and modules. It isn't worth arguing about because I'm not even proposing to move to Unicode variable names today. But can we agree not to cut ourselves off from that option? > As I said before, I tried to write such a program in Java, which is > intended to support this. It failed, because the interpreter could not > find the class file (which must have the name of the class in Java, > unfortunately). As you point out, the problem is much more serious in Java because of the classname/filename binding. Anyhow, file systems are getting more and more i18n aware. Paul Prescod From paulp@ActiveState.com Sat Feb 10 15:27:58 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 07:27:58 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> <3A84B165.5F1D20D4@ActiveState.com> <3A8530C1.479544D2@lemburg.com> Message-ID: <3A855DFE.6FA48392@ActiveState.com> "M.-A. Lemburg" wrote: > > > ... > > We could agree on this though: > > 1. programs which do not use the encoding declaration are free > to use non-ASCII bytes in literals; Unicode literals must > use Latin-1 (for historic reasons) > > 2. programs which do make use of the encoding declaration may > only use non-ASCII bytes in Unicode literals; these are then > interpreted using the given encoding information and decoded > into Unicode during the compilation step I thought that's what I suggested! I am comfortable with that design. Paul Prescod From guido@digicool.com Sat Feb 10 15:32:19 2001 From: guido@digicool.com (Guido van Rossum) Date: Sat, 10 Feb 2001 10:32:19 -0500 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: Your message of "Sat, 10 Feb 2001 07:24:02 PST." 
<3A855D12.46E4376E@ActiveState.com> References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <200102092047.f19KlQO01127@mira.informatik.hu-berlin.de> <3A84BD20.3042C87A@ActiveState.com> <200102100627.f1A6Rj400812@mira.informatik.hu-berlin.de> <3A855D12.46E4376E@ActiveState.com> Message-ID: <200102101532.KAA27642@cj20424-a.reston1.va.home.com> > As you point out, the problem is much more serious in Java because of > the classname/filename binding. And Python doesn't have this problem? --Guido van Rossum (home page: http://www.python.org/~guido/) From paulp@ActiveState.com Sat Feb 10 15:37:04 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 07:37:04 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> <3A84B165.5F1D20D4@ActiveState.com> <200102100655.f1A6tCT01130@mira.informatik.hu-berlin.de> Message-ID: <3A856020.7D55951C@ActiveState.com> "Martin v. Loewis" wrote: > > ... > > My understanding is that only EUC-JP is an ASCII superset (*) > (i.e. all bytes representing JIS characters are >127); in Shift-JIS, > the encoding of a character is two bytes, of which only the first byte > is always >128. Since Shift-JIS is quite common, it should be > supported as a file encoding. I don't think it is reasonable in the short term to support characte sets that cannot be lexed with the current Python lexer. I think we should design with Shift-JIS in mind for the future but for now I think we should limit our list of supported encodings to those that don't require large Python parser changes. Paul Prescod From paulp@ActiveState.com Sat Feb 10 15:46:32 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 07:46:32 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <3A84329E.7B7CE012@ActiveState.com> <3A844559.8C284F7A@lemburg.com> <3A844D47.8DACEE2B@ActiveState.com> <200102092047.f19KlQO01127@mira.informatik.hu-berlin.de> <3A84BD20.3042C87A@ActiveState.com> <200102100627.f1A6Rj400812@mira.informatik.hu-berlin.de> <3A855D12.46E4376E@ActiveState.com> <200102101532.KAA27642@cj20424-a.reston1.va.home.com> Message-ID: <3A856258.46EFCF68@ActiveState.com> Guido van Rossum wrote: > > > As you point out, the problem is much more serious in Java because of > > the classname/filename binding. > > And Python doesn't have this problem? The problem is not as serious in Python because of "import as". I suspect that clever use of introspective features would also allow you to map classnames back and forth between filesystem ASCII and your native language. Nevertheless, I want to point out that I am not advocating that Python support full Unicode source files in the short term. I will carefully scrutinize any new language feature designed with the assumption that Python will be "forever ASCII." I see that as Y2K thinking. Also, our i18n migration issues are painful enough right now to make me "twice shy" about assumptions. I don't want to go through this again in five years. 
Paul Prescod From paulp@ActiveState.com Sat Feb 10 15:58:22 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 07:58:22 -0800 Subject: [I18n-sig] Strawman Proposal: Encoding Declaration V2 Message-ID: <3A85651E.C11C7B2B@ActiveState.com> The encoding declaration controls the interpretation of non-ASCII bytes in the Python source file. The declaration manages the mapping of non-ASCII byte strings into Unicode characters. A source file with an encoding declaration must only use non-ASCII bytes in places that can legally support Unicode characters. In Python 2.x the only place is within a Unicode literal. This restriction may be lifted in future versions of Python. In Python 2.x, the initial parsing of a Python script is done in terms of the file's byte values. Therefore it is not legal to use any byte sequence that has a byte that would be interpreted as a special character (e.g. quote character or backslash) according to the ASCII character set. This restriction may be lifted in future versions of Python. The encoding declaration must be found before the first statement in the source file. The declaration is not a pragma. It does not show up in the parse tree and has no semantic meaning for the compiler itself. It is conceptually handled in a pre-compile "encoding sniffing" step. This step is also done using the ASCII encoding. The encoding declaration has the following basic syntax:

    #?encoding="<encoding name>"

<encoding name> is the encoding name and must be associated with a registered codec. The codec is used to interpret non-ASCII byte sequences. The encoding declaration should be present in all Python source files containing non-ASCII bytes. Some future version of Python may make this an absolute requirement. Paul Prescod From andy@reportlab.com Sat Feb 10 16:13:37 2001 From: andy@reportlab.com (Andy Robinson) Date: Sat, 10 Feb 2001 16:13:37 -0000 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? Message-ID: This reminds me a lot of another debate going on close to home :-)

- people who are in favour assume everyone else is, and that the only question is how to get there
- people who are against are just plain worried but can't say why
- the government stays very quiet and avoids asking for a referendum

I want to re-ask the big question: is it desirable that the standard string type should become a Unicode string one day? To my knowledge, all the pressure for making Unicode strings the fundamental data type comes from Americans and Western Europeans who think they are doing the right thing. This is far from proven. Please consider these points:

1. To my knowledge we have not seen posts from anyone outside the ISO-Latin-1 zone in this thread.
2. I have been told that there are angry mumblings on the Python-Japan mailing list that such a change would break all their existing Python programs; I'm trying to set up my tools to ask out loud in that forum.
3. Ruby was designed in Japan and that's where most of its users are. They have a few conversion functions and seem perfectly happy.
4. Visual Basic running under Windows 2000 with every international option I can find will accept unicode characters in string literals but will not accept characters outside of ISO-Latin-1 in identifiers.
5. All the Japanese-written code I have seen (not much of it is in Python, lots in Java and VB) either uses English variable names or the romanized Japanese ones ('total'='gokei'). No one I know of has complained about this limitation.
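Returning to the Encoding Declaration V2 strawman above, a source file using the proposed declaration might look like the following sketch. The #?encoding syntax is the strawman's own, not an implemented feature, and "iso-8859-1" is just an illustrative codec name:

    #?encoding="iso-8859-1"

    # Non-ASCII bytes may appear only inside Unicode literals; under the
    # proposal they are decoded with the declared codec at compile time.
    greeting = u"Grüße"

    # Plain string literals stay ASCII (use escapes for raw byte values).
    data = "\x47\x72"

    print greeting.encode("utf-8")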
I do NOT want to kill off this discussion, which is producing an interesting proposal and I am in favour of many point it raises. However, I think we should make a real effort to see what the market actually wants and if the implied goal is right. It would be tragic to break old code one day for improvements nobody cares about - or, worse, to alienate exactly the people we are trying to cater for. Now, who can we ask outide our own community who could have insights into this? My shortlist so far is: - Frank Chen (wrote our Chinese and Korean Codecs) - Tamito Kajiyama (wrote our Japanese Codecs) - Ruby mailing lists - Python Japan Mailing List - Basis Technologies (Tom, are you there?) - Digital Garage and recent escapees (Brian?) - the CTO of a Kuwaiti bank I know - Ken Lunde (author of that CJKV book) - Tony Graham (author, of Unicode: A Primer and a member of the Unicode consortium) - James Clark (XML fame, lives in Thailand) I'm going to try to think up a questionnaire. If anyone can suggest other domain experts, or mailing lists of user groups in other language zones, I will be happy to try and pursue them and get some real hard data. Best Regards, Andy Robinson From andy@reportlab.com Sat Feb 10 16:58:39 2001 From: andy@reportlab.com (Andy Robinson) Date: Sat, 10 Feb 2001 16:58:39 -0000 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: Message-ID: > Now, who can we ask outide our own community who could > have insights into this? My shortlist so far is: > [snip] > - Frank Chen (wrote our Chinese and Korean Codecs) > - Tamito Kajiyama (wrote our Japanese Codecs) > - Basis Technologies (Tom, are you there?) > - Digital Garage and recent escapees (Brian?) Sorry, really bad wording. The above are definitely valued parts of our community - no offence intended! - Andy Robinson From paulp@ActiveState.com Sat Feb 10 19:17:19 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 11:17:19 -0800 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? References: Message-ID: <3A8593BF.8AFCEBB3@ActiveState.com> Andy, I think that part of the reason that Westerners push harder for Unicode than Japanese is because we are pressured (rightly) to right software that works world-wide and it is simply not sane to try to do that by supporting multiple character sets. Multiple encodings maybe. Multiple character sets? Forget it. I don't know of any commercial software written in Japan but used in the west so I think that they probably have less I18N pressure than we do. Unicode is only interesting when you want the same software to run in multiple character set environments! Andy Robinson wrote: > > ... > > 2. I have been told that there are angry mumblings on the > Python-Japan mailing list that such a change would break all > their existing Python programs; I'm trying to set up my tools to > ask out loud in that forum. I don't think it is posssible to say in the abstract that a move to Unicode would break code. Depending on implementation strategy it might. But I can't imagine there is really a ton of code that would break merely from widening the character. > 3. Ruby was designed in Japan and that's where most of its users are. > They have a few conversion functions and seem perfectly happy. Don't know enough to comment except to point out that Ruby has a command line option to set the character set to Kanji. > 4. 
Visual Basic running under Windows 2000 with every international > option I can find will accept unicode characters in string literals > but will not accept characters outside of ISO-Latin-1 in The VB in Visual Studio 7 will happily accept wide characters (e.g. U+652B: CJK Unified Ideograph) on Windows 2000. Of course you need to set your font to have the right character. Compared to where we were a few years ago (better install DOS-J!) this is a real miracle. Of course Unix systems will move over more slowly (grumble..). Nevertheless its coming: http://www.li18nux.net/li18nux2k/ > > I'm going to try to think up a questionnaire. If anyone can suggest > other domain experts, or mailing lists of user groups in other > language > zones, I will be happy to try and pursue them and get some real hard > data. I like your list but I don't know that there is really a reasonable question we can ask. What does it mean for Python's "standard string type" to be "Unicode?" If you ask the question as: "Should Python's standard string type support ordinal values beyond 255?", who would say no? If you say: "Should Python standardize on the Unicode character set" you might get different answers. As you yourself point out, it depends on whether that means that you would LOSE the ability to do string-like things on byte-strings. Paul Prescod From paulp@ActiveState.com Sat Feb 10 19:45:58 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 11:45:58 -0800 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? References: Message-ID: <3A859A76.D4C30372@ActiveState.com> Andy Robinson wrote: > > ... > > 4. Visual Basic running under Windows 2000 with every international > option I can find will accept unicode characters in string literals > but will not accept characters outside of ISO-Latin-1 in The more I look at I18N in VB.NET, the more impressed I am. It has no language restrictions on variable names etc. Protected Sub Form1_Click(ByVal sender As Object, ByVal e As System.EventArgs) Dim ? As String Dim font As New System.Drawing.Font("Batang", 10) ? = "??" TextBox1.Text = ? End Sub Each "?" is an ideograph. It seems to "just work". Paul Prescod From tree@basistech.com Sat Feb 10 21:17:47 2001 From: tree@basistech.com (Tom Emerson) Date: Sat, 10 Feb 2001 16:17:47 -0500 Subject: [I18n-sig] Random thoughts on Unicode and Python Message-ID: <14981.45051.945099.633730@cymru.basistech.com> Andy has raised some important and interesting points. I'd like to chime in with some random thoughts. > 2. I have been told that there are angry mumblings on the > Python-Japan mailing list that such a change would break all > their existing Python programs; I'm trying to set up my tools to > ask out loud in that forum. Both Shift-JIS and EUC-JP are 8-bit, multibyte encodings. You can use them on systems that are 8-bit clean and things "just work". You don't need to worry about embedded nulls or any other such noise. While you can't use len() to get the number of *characters* in a Shift-JIS/EUC-JP encoded string, you can find out how many "octets" are in it so you can loop over it and calculate the character length. In essence the Japanese (and Chinese and Koreans) are using the existing Python string type as a raw-byte string, and imposing the semantics over that. The Ruby string class is a byte-string. You can specify how the bytes are to be treated for operations such as regular expression searches and such. It supports EUC-JP, Shift JIS, UTF-8, or just plan bytes. 
You can set the default when you configure the sources, on the command-line when you invoke the interpreter, or (I believe) at runtime. Ruby also contains a library with a replacement String class for dealing with EUC-JP and Shift-JIS encoded strings. ----- The internal representation used for strings is an orthogonal issue to how raw bytes are interpreted for string operations. This is what Emacs 20 does: in essence it uses ISO 2022 internally to allow characters from multiple character sets to be represented. ----- The interpretation of strings and the interpretation of bytes in a source file are different things: Dylan, for example, supports Unicode and byte strings, but the language definition requires identifiers and keywords to be in the US-ASCII range. Java, on the other hand, specifies Unicode as language's character set: even source files are encoded in UTF-8, allowing identifiers to be in the user's language. IMHO either is fine. Note that if the language allows identifiers to include 8-bit characters then users can already use identifiers in their local language: it "just works". ----- Japanese and Chinese arguments against Unicode are often ideological: "It doesn't contain all of the characters we need." Of course they forget to mention that the character sets in regular use in these locales, JIS X 0201-1990, JIS X 0212-1990, GB 2312-80, and Big Five, are all represented in Unicode. The same is true for Korean: all of the hanja in KS C 5601 et al. are available in Unicode, as are the precomposed han'gul. -- Tom Emerson Basis Technology Corp. Stringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Sat Feb 10 21:56:09 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 22:56:09 +0100 Subject: [I18n-sig] Random thoughts on Unicode and Python References: <14981.45051.945099.633730@cymru.basistech.com> Message-ID: <3A85B8F9.1F494BF8@lemburg.com> Tom Emerson wrote: > > Andy has raised some important and interesting points. I'd like to > chime in with some random thoughts. > > > 2. I have been told that there are angry mumblings on the > > Python-Japan mailing list that such a change would break all > > their existing Python programs; I'm trying to set up my tools to > > ask out loud in that forum. > > Both Shift-JIS and EUC-JP are 8-bit, multibyte encodings. You can use > them on systems that are 8-bit clean and things "just work". You don't > need to worry about embedded nulls or any other such noise. While you > can't use len() to get the number of *characters* in a > Shift-JIS/EUC-JP encoded string, you can find out how many "octets" > are in it so you can loop over it and calculate the character length. > > In essence the Japanese (and Chinese and Koreans) are using the > existing Python string type as a raw-byte string, and imposing the > semantics over that. > > The Ruby string class is a byte-string. You can specify how the bytes > are to be treated for operations such as regular expression searches > and such. It supports EUC-JP, Shift JIS, UTF-8, or just plan > bytes. You can set the default when you configure the sources, on the > command-line when you invoke the interpreter, or (I believe) at > runtime. > > Ruby also contains a library with a replacement String class for > dealing with EUC-JP and Shift-JIS encoded strings. How does Ruby (which seems to be the direct Python-competitor in Japan) deal with the difference between binary data and text data ? 
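To make the earlier point about counting characters versus octets concrete, here is a minimal Python 2.x sketch for EUC-JP byte strings. It assumes only ASCII bytes and two-byte JIS X 0208 sequences and ignores the rarer SS2/SS3 forms, so it is an illustration rather than a complete decoder:

    def euc_jp_char_count(s):
        "Count characters (not octets) in a simplified EUC-JP byte string."
        count = 0
        i = 0
        while i < len(s):
            if ord(s[i]) >= 0x80:
                i = i + 2      # lead byte of a two-byte character
            else:
                i = i + 1      # plain ASCII byte
            count = count + 1
        return count

    print len("\xb0\xa1abc")                 # -> 5 octets
    print euc_jp_char_count("\xb0\xa1abc")   # -> 4 characters (1 kanji + "abc")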
I think that much concern about these proposals lies in a misunderstanding of the general idea behind the proposed move to Unicode for text data: We are trying to tell people that storing text data is better done in Unicode than in a raw data buffer like Python's current string data type. This doesn't mean that working with text encoded in such a binary data buffer will somehow fail in a future Python version, it only means that the programmer will sooner or later have to decide whether she wants to store text data or binary data and then choose the proper type of storage to be able to take advantage of the advanced features which a text data type can provide over a binary data buffer. The model which we are currently talking about can be outlined as follows:

              binary data string *)
                      |
                      |
              text data string
               |              |
               |              |
      Unicode string     encoded 8-bit string (with encoding
           *)                  information !)

    *) these are implemented in Python 1.6-2.1.

How does this compare to e.g. Ruby ? -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 22:03:51 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 23:03:51 +0100 Subject: [I18n-sig] Storing string encoding information (Pre-PEP: Proposed Python Character Model) References: <9FC702711D39D3118D4900902778ADC81287BF@JUPITER> <3A82A04F.5A03CAB2@lemburg.com> <200102082009.f18K9QI01197@mira.informatik.hu-berlin.de> Message-ID: <3A85BAC7.303460B1@lemburg.com> "Martin v. Loewis" wrote: > > > encoded 8-bit string (with encoding > > information !) > > I'd like to point out that this is something that Bill Janssen always > wanted to see. In CORBA, they number encodings for efficient > representation; that's something that Python could do as well. CORBA > took the OSF charset registry. That was a mistake, they think about > using the IANA registry now. This registry provides both textual and > numeric identifiers for encodings (numeric in the form of MIBEnum > values). I was thinking of using plain integers which map into a list of currently used encodings. Every time a new encoding is used, the new encoding is appended to the list and the new index is used in the generated string objects. This allows us to separate the internal representation of the encoding from an outside view, e.g. there could be translators which map the integers into IANA identifiers or OSF charset numbers. We'd have to find a way to store this encoding information in Python pickles and the marshal format, though... a job for our compression experts ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 22:08:06 2001 From: mal@lemburg.com (M.-A.
Lemburg) Date: Sat, 10 Feb 2001 23:08:06 +0100 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> Message-ID: <3A85BBC6.BBAA8D70@lemburg.com> Paul Prescod wrote: > > At the bottom of one of my messages I proposed that we insert it as the > second argument. Although the encoding and mode are both strings there > is no syntactic overlap between [rwa][+]?[tb]+ and the set of existent > or proposed encodings. If we merely outlaw encodings with that name then > we can quickly figure out whether the second argument is a mode or an > encoding. So the documented syntax would be > > open(filename, encoding, [[mode], bytes]) > > And the documentation would say: > > "There is an obsolete variant that does not require an encoding string. > This may cause a warning in future versions of Python and be removed > sometime after that." Any reason why we cannot use a keyword argument for encoding and put it at the end of the argument list ? The result is: 1. no ambiguity 2. backward compatibility 3. good visibility of what the argument stands for (without having to look up the manual for e.g. the meaning of 'mbcs') -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 22:16:02 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 23:16:02 +0100 Subject: [I18n-sig] Binary data b"strings" (Pre-PEP: Proposed Python Character Model) References: <9FC702711D39D3118D4900902778ADC83244A2@JUPITER> <3A82CA6D.313D5E39@lemburg.com> <3A832A86.71833150@ActiveState.com> Message-ID: <3A85BDA2.BBE72C77@lemburg.com> Paul Prescod wrote: > > I've thought about this coercion issue more...I think we need to > auto-coerece these binary strings using some well-defined rule (NOT a > default encoding!). > > "M.-A. Lemburg" wrote: > > > > > ... > > > > > > I would want to avoid the need for a 2.0-style 'default encoding', so I > > > suggest it shouldnt be possible to mix this type with other strings: > > > > > > >>> "1"+b"2" > > > Traceback (most recent call last): > > > File "", line 1, in ? > > > TypeError: cannot add type "binary" to string > > > >>> "3"==b"3" > > > 0 > > > > Right. This will cause people to rethink whether they are > > using the object for text data or binary data. I still think that > > at the interface level, b"" and "" should be treated the same (except > > that b""-strings should not implement the char buffer interface). > > If C functions auto-convert these things then people will coerce them by > passing them through C functions. e.g. the regular expression engine or > null encoding functions or whatever. 
> > If we do NOT auto-coerce these things then they will not be compatible > with many parts of the Python infrastructure, the regular expression > engine and codecs being the most important examples. A clear requirement > from Andy Robinson was that string-like code should work on binary data > because often binary strings are "really" un-decoded strings. I think he > is speaking on behalf of a lot of serious internationalizers there. b""-strings will expose all necessary APIs to be compatible with the re-engine, with codecs and most other C level interfaces which use the s or s# parser marker. In reality the only breakage will be for code which explicitly requests a string object and these instances should really be modified to work using the above parser markers instead. Given these semantics, auto-conversion is not really necessary for b""-strings. Note that I see b""-string as replacement for our current 8-bit strings in the context of handling non-text data. 8-bit strings should still remain intact and available (even after making "" produce Unicode strings), but should be extended to provide additional encoding information (see the small image I posted on the "Storing encoding information" thread). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 22:20:05 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 23:20:05 +0100 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: <3A8427DD.2BABEF53@lemburg.com> <200102092027.f19KREW00995@mira.informatik.hu-berlin.de> <3A84639E.B86D57F2@lemburg.com> <3A84B165.5F1D20D4@ActiveState.com> <3A8530C1.479544D2@lemburg.com> <3A855DFE.6FA48392@ActiveState.com> Message-ID: <3A85BE95.565653D1@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > > ... > > > > We could agree on this though: > > > > 1. programs which do not use the encoding declaration are free > > to use non-ASCII bytes in literals; Unicode literals must > > use Latin-1 (for historic reasons) > > > > 2. programs which do make use of the encoding declaration may > > only use non-ASCII bytes in Unicode literals; these are then > > interpreted using the given encoding information and decoded > > into Unicode during the compilation step > > I thought that's what I suggested! I am comfortable with that design. Well, not quite since 2. doesn't decode the whole file, but only the Unicode literals. The restriction on the rest of the file could be made a convention or be actually checked to assure forward compatibility. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tree@basistech.com Sat Feb 10 22:45:25 2001 From: tree@basistech.com (Tom Emerson) Date: Sat, 10 Feb 2001 17:45:25 -0500 Subject: [I18n-sig] Random thoughts on Unicode and Python In-Reply-To: <3A85B8F9.1F494BF8@lemburg.com> References: <14981.45051.945099.633730@cymru.basistech.com> <3A85B8F9.1F494BF8@lemburg.com> Message-ID: <14981.50309.44552.652650@cymru.basistech.com> M.-A. Lemburg writes: > How does Ruby (which seems to be the direct Python-competitor > in Japan) deal with the difference between binary data and > text data ? Strings are strings. 
The interpretation of the bytes in a string is affected by the setting of the KCODE built-in variable. > I think that much concern about these proposals lies in a misunder- > standing of the general idea behind the proposed move to Unicode for > text data: Agreed. > The module which we are currently talking about can be outlined > as follows: > > binary data string *) > | > | > text data string > | | > | | > Unicode string encoded 8-bit string (with encoding > *) information !) > > *) these are implemented in Python 1.6-2.1. > > How does this compare to e.g. Ruby ? As I said, Ruby has a String type, and an override for Japanese-encoded strings. The above is much more similar to the model used by Dylan. -tree -- Tom Emerson Basis Technology Corp. Stringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@digicool.com Sat Feb 10 22:23:54 2001 From: guido@digicool.com (Guido van Rossum) Date: Sat, 10 Feb 2001 17:23:54 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: Your message of "Sat, 10 Feb 2001 23:08:06 +0100." <3A85BBC6.BBAA8D70@lemburg.com> References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> Message-ID: <200102102223.RAA28498@cj20424-a.reston1.va.home.com> > Paul Prescod wrote: > > > > At the bottom of one of my messages I proposed that we insert it as the > > second argument. Although the encoding and mode are both strings there > > is no syntactic overlap between [rwa][+]?[tb]+ and the set of existent > > or proposed encodings. If we merely outlaw encodings with that name then > > we can quickly figure out whether the second argument is a mode or an > > encoding. So the documented syntax would be > > > > open(filename, encoding, [[mode], bytes]) > > > > And the documentation would say: > > > > "There is an obsolete variant that does not require an encoding string. > > This may cause a warning in future versions of Python and be removed > > sometime after that." I am appalled at this lack of respect for existing conventions, when a simple and obvious alternative (see below) is easily available. I will have a hard time not to take this into account when I finally get to reading up on your proposals. > Any reason why we cannot use a keyword argument for encoding > and put it at the end of the argument list ? The result is: > > 1. no ambiguity > 2. backward compatibility > 3. good visibility of what the argument stands for (without having > to look up the manual for e.g. the meaning of 'mbcs') Of course this is what should be done when adding a new argument to an existing API. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Sat Feb 10 22:26:10 2001 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Sat, 10 Feb 2001 23:26:10 +0100 Subject: [I18n-sig] Strawman Proposal: Encoding Declaration V2 References: <3A85651E.C11C7B2B@ActiveState.com> Message-ID: <3A85C002.60873564@lemburg.com> Paul Prescod wrote: > > The encoding declaration controls the interpretation of non-ASCII bytes > in the Python source file. The declaration manages the mapping of > non-ASCII byte strings into Unicode characters. > > A source file with an encoding declaration must only use non-ASCII bytes > in places that can legally support Unicode characters. In Python 2.x the > only place is within a Unicode literal. This restriction may be lifted > in future versions of Python. > > In Python 2.x, the initial parsing of a Python script is done in terms > of the file's byte values. Therefore it is not legal to use any byte > sequence that has a byte that would be interpreted as a special > character (e.g. quote character or backslash) according to the ASCII > character set. This restriction may be lifted in future versions of > Python. > > The encoding declaration must be found before the first statement in the > source file. The declaration is not a pragma. It does not show up in the > parse tree and has no semantic meaning for the compiler itself. It is > conceptually handled in a pre-compile "encoding sniffing" step. This > step is also done using the ASCII encoding. > > The encoding declaration has the following basic syntax: > > #?encoding="" > > is the encoding name and must be associated with a > registered codec. The codec is used to interpret non-ASCII byte > sequences. > > The encoding declaration should be present in all Python source files > containing non-ASCII bytes. Some future version of Python may make this > an absolute requirement. Sounds overly complicated to me; even though the resulting semantics seem to be the same as those which I summarized in the last mail on the original "Encoding Declaration" thread: """ 1. programs which do not use the encoding declaration are free to use non-ASCII bytes in literals; Unicode literals must use Latin-1 (for historic reasons) 2. programs which do make use of the encoding declaration may only use non-ASCII bytes in Unicode literals; these are then interpreted using the given encoding information and decoded into Unicode during the compilation step Part 1 assures backward compatibility. Part 2 assures that programmers start to think about where they have to use Unicode and which program literals are allowed to go into string literals. Part 1 is already implemented, part 2 is easy to do, since only the compiler will have to be changed (in two places). """ If you want to keep your version, please add an explicit section about 1. to it. Otherwise it will cause unnecessary confusion. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sat Feb 10 22:32:03 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 10 Feb 2001 23:32:03 +0100 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? 
References: Message-ID: <3A85C163.4CFAAE4@lemburg.com> Andy Robinson wrote: > > This reminds me a lot of another debating going on close to home :-) > > - people who are in favour assume everyone else is, and that the only > question is how to get there > - people who are against are just plain worried but can't say why > - the government stays very quiet and avoids asking for a referendum > > I want to re-ask the big question: is it desirable that the > standard string type should become a Unicode string one day? Note that we are not moving to *one* new string type, but instead make use of object orientation and fit the current use of strings into different subclasses of a binary string type: binary data string *) | | text data string | | | | Unicode string encoded 8-bit string (with encoding *) information !) *) these are implemented in Python 1.6-2.1. The basic idea here is to differentiate between text data and binary data. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Sat Feb 10 23:43:04 2001 From: andy@reportlab.com (Andy Robinson) Date: Sat, 10 Feb 2001 23:43:04 -0000 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: <3A859A76.D4C30372@ActiveState.com> Message-ID: > The more I look at I18N in VB.NET, the more impressed I am. > It has no language restrictions on variable names etc. > > Protected Sub Form1_Click(ByVal sender As Object, ByVal e As > System.EventArgs) > Dim ? As String > Dim font As New System.Drawing.Font("Batang", 10) > > ? = "??" > > TextBox1.Text = ? > End Sub > > Each "?" is an ideograph. It seems to "just work". That is good news. I'm still on Visual Studio 6, MS Office 2000 and Win2000, and was still busy being impressed that I could write Word docs in Japanese. I'd better do some catching up! - Andy From andy@reportlab.com Sat Feb 10 23:43:06 2001 From: andy@reportlab.com (Andy Robinson) Date: Sat, 10 Feb 2001 23:43:06 -0000 Subject: [I18n-sig] Random thoughts on Unicode and Python In-Reply-To: <14981.45051.945099.633730@cymru.basistech.com> Message-ID: > Both Shift-JIS and EUC-JP are 8-bit, multibyte encodings. > You can use > them on systems that are 8-bit clean and things "just > work". You don't > need to worry about embedded nulls or any other such noise. > While you > can't use len() to get the number of *characters* in a > Shift-JIS/EUC-JP encoded string, you can find out how many "octets" > are in it so you can loop over it and calculate the > character length. > > In essence the Japanese (and Chinese and Koreans) are using the > existing Python string type as a raw-byte string, and imposing the > semantics over that. That's my concern, and the thing I want to poll people on. If Python "just works" for these users, and if we already offer Unicode strings and a good codec library for people to use when they want to, is there really a need to go further? > Japanese and Chinese arguments against Unicode are often > ideological: > "It doesn't contain all of the characters we need." Of course they > forget to mention that the character sets in regular use in these > locales, JIS X 0201-1990, JIS X 0212-1990, GB 2312-80, and Big Five, > are all represented in Unicode. The same is true for Korean: all of > the hanja in KS C 5601 et al. are available in Unicode, as are the > precomposed han'gul. That's interesting. 
I have never heard that objection voiced before and agree that it is unfounded. I have seen objections based on two specific families of problems: (1) user defined characters: the big three Japanese encodings use the Kuten space of 94x94 characters. There are lots of slight venddor variations on the basic JIS0208 character set, as well as people adding new Gaiji in their office workgroups. Generic conversion routines from, say, EUC to Shift-JIS still work perfectly whether you use Shift-JIS, cp932, or cp932 plus ten extra in-house characters. Conversions to Unicode involve selecting new codecs, or even making new ones, for all these situations. (2) slightly corrupt data: Let's say you are dealing with files or database fields containing some truncated kanji. If you use 8-bit-clean strings and no conversion, the data will not be corrupted or changed; if you try to magically convert it to Unicode you will get error messages or possibly even more corruption. Maybe you're writing an app whose job is to get text from machine A to machine B without changing it; suddenly it will stop working. I know people who spent weeks debugging a VB print spooler which was cutting up Postscript files containing kanji. Suddenly upgrading to a new version of Python where all your data undergoes invisible transformations to Unicode and back is going to cause trouble for quite a few people. Arguably, it is GOOD trouble which will force them to standardise their character sets, document their extensions and clean their data - but it it still going to be trouble. It's a bit different in a language like Java which was defined to be Unicode-based from day one. - Andy From paulp@ActiveState.com Sun Feb 11 03:44:35 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 19:44:35 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> Message-ID: <3A860AA3.655F4207@ActiveState.com> Guido van Rossum wrote: > > ... > > open(filename, encoding, [[mode], bytes]) > > > > And the documentation would say: > > > > "There is an obsolete variant that does not require an encoding string. > > This may cause a warning in future versions of Python and be removed > > sometime after that." > > I am appalled at this lack of respect for existing conventions, You're the one who told everyone to move from string functions to string methods. This is a move of similar scope but for a much more important purpose than merely changing coding style. > when a > simple and obvious alternative (see below) is easily available. I > will have a hard time not to take this into account when I finally get > to reading up on your proposals. 
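Just to put the two spellings under discussion side by side, here is a rough sketch (the file name and encoding are purely illustrative, and nothing here is a settled API):

    # the positional form I proposed: encoding as the second argument
    f = open("data.txt", "UTF-8", "r")

    # the keyword form Marc-Andre and Guido prefer
    f = open("data.txt", "r", encoding="UTF-8")
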
There is an important reason that we did not use a keyword argument. We (at least some subset of the people in the i18n-sig) want every single new instance of the "open" function to declare an encoding. Right now we allow a lot of "ambiguous data" into the system. We do not know whether the user meant it to be binary or textual data and so we don't know the correct/valid coercions, conversions and operations. We are trying to retroactively make an open function that strongly encourages (and perhaps finally forces) people to make their intent known. The open extension is a backwards compatible way to allow people to move from the "old" ambiguous form to the new form. I considered it pretty well thought out in terms of backwards and forwards compatibility. We could also just invent a new function like "file" or "fileopen" but upgrading "open" seemed to show the *most* respect for existing conventions (and clutters up builtins the least). Paul Prescod From tree@basistech.com Sun Feb 11 04:06:01 2001 From: tree@basistech.com (Tom Emerson) Date: Sat, 10 Feb 2001 23:06:01 -0500 Subject: [I18n-sig] Random thoughts on Unicode and Python In-Reply-To: References: <14981.45051.945099.633730@cymru.basistech.com> Message-ID: <14982.4009.542031.914222@cymru.basistech.com> Andy Robinson writes: > (1) user defined characters: the big three Japanese encodings > use the Kuten space of 94x94 characters. There are lots of slight > venddor variations on the basic JIS0208 character set, as well > as people adding new Gaiji in their office workgroups. Generic > conversion routines from, say, EUC to Shift-JIS still work > perfectly whether you use Shift-JIS, cp932, or cp932 plus > ten extra in-house characters. Conversions to Unicode involve > selecting new codecs, or even making new ones, for all these > situations. There is no reason that we couldn't provide a set of unified codecs for EUC-JP, Shift JIS, ISO-2022-JP, and CP932 that provide appropriate mappings between the EUDC sections in the legacy character sets and the PUA of Unicode, such that these conversions work. > (2) slightly corrupt data: Let's say you are dealing with files > or database fields containing some truncated kanji. If you > use 8-bit-clean strings and no conversion, the data will not > be corrupted or changed; if you try to magically convert > it to Unicode you will get error messages or possibly even > more corruption. Maybe you're writing an app whose job is > to get text from machine A to machine B without changing it; > suddenly it will stop working. I know people who spent > weeks debugging a VB print spooler which was cutting up > Postscript files containing kanji. Yes, this is a problem that I cannot suggest a good answer to: reality raises its ugly head. > Suddenly upgrading to a new version of Python where all > your data undergoes invisible transformations to Unicode > and back is going to cause trouble for quite a few people. Absolutely. -tree -- Tom Emerson Basis Technology Corp. Stringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From paulp@ActiveState.com Sun Feb 11 04:01:02 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 20:01:02 -0800 Subject: [I18n-sig] Random thoughts on Unicode and Python References: Message-ID: <3A860E7E.A58E37BC@ActiveState.com> Andy Robinson wrote: > > ... > > That's my concern, and the thing I want to poll people on. 
> If Python "just works" for these users, and if we already offer > Unicode strings and a good codec library for people to use when they > want to, is there really a need to go further? Let me point out again that while I don't want to discount the needs of these people, the fact is that over here in the West we need to use Unicode ourselves! I've already figured out how the Unicode works and how it interacts with "ordinary strings" but I don't think that everybody I hire to work at ActiveState should have to figure that out themselves. Obviously the Unicode source file issue is separate but the "Unicode as basic string literal" helps all of us. In a year, a lot of my work will involve XML on a Unicode-enabled operating system. I'll only have to think about 8-bit extended ASCII because Python forces me to sometimes. Now I know most people are not going to be moving to full Unicode as quickly as I am but that is the future and we need to start laying the groundwork now. >... > (2) slightly corrupt data: Let's say you are dealing with files > or database fields containing some truncated kanji. If you > use 8-bit-clean strings and no conversion, the data will not > be corrupted or changed; if you try to magically convert > it to Unicode you will get error messages or possibly even > more corruption. I think we've all agreed that Python should never, ever, magically convert binary data to Unicode. I think that most people's fears about Unicode are precisely that it will some day magically covert binary data to Unicode. But we all agree that that should never happen. Even in my original proposal when I said that the standard string should be widened to Unicode, I never, ever, suggested that binary data should be converted to Unicode. Rather I said that in some cases Unicode characters could be a transport -- a representation layer -- for binary data. Just as in some cases integers are a transport for characters or (shudder pointers). > Suddenly upgrading to a new version of Python where all > your data undergoes invisible transformations to Unicode > and back is going to cause trouble for quite a few people. But I do not believe that anyone has ever suggested that! I understand where the misunderstanding comes from but it is nevertheless a misunderstanding. Paul Prescod From brian@tomigaya.shibuya.tokyo.jp Sun Feb 11 05:58:44 2001 From: brian@tomigaya.shibuya.tokyo.jp (Brian Takashi Hooper) Date: Sun, 11 Feb 2001 14:58:44 +0900 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: <3A8593BF.8AFCEBB3@ActiveState.com> References: <3A8593BF.8AFCEBB3@ActiveState.com> Message-ID: <20010211140545.49DF.BRIAN@tomigaya.shibuya.tokyo.jp> Hi there, Brian in Tokyo again, On Sat, 10 Feb 2001 11:17:19 -0800 Paul Prescod wrote: > Andy, I think that part of the reason that Westerners push harder for > Unicode than Japanese is because we are pressured (rightly) to right > software that works world-wide and it is simply not sane to try to do > that by supporting multiple character sets. Multiple encodings maybe. > Multiple character sets? Forget it. I think this is a true and valid point (that Westerners are more likely to want to make internationalized software), but it sounds here like because Westerners want to make it easier to internationalize software, that that is a valid reason to make it harder to make software that has no particular need for internationalization, in non-Western languages, and change the _meaning_ of such a basic data type as the Python string. 
If in fact, as the proposal proposes, usage of open() without an encoding, for example, is at some point deprecated, then if I am manipulating non-Unicode data in "" strings, then I think I _do_ at some point have to port them over. b"" then becomes different from "", because "" is now automatically being interpreted behind the scenes into an internal Unicode representation. If the blob of binary data actually happened to be in Unicode, or some Unicode-favored representation (like UTF-8), then I might be happy about this - but if it wasn't, I think that this result would instead be rather dismaying. The current Unicode support is more explicit about this - the meaning of the string literal itself has not changed, so I can continue to ignore Unicode in cases where it serves no useful purpose. I realize that it would be nicer from a design perspective, more consistent, to have Python string mean only character data, but right now, it does sometimes mean binary and sometimes mean characters. The only one who can distinguish which is the programmer - if at some point "" means only Unicode character strings, then the programmer _does_, I think, have to go through all their programs looking for places where they are using strings to hold non-Unicode character data, or binary data, and explicitly convert them over. I have difficulty seeing how we would be able to provide a smooth upgrade path - maybe a command-line backwards compatibility option? Maybe defaults? I've heard a lot of people voicing dislike for default encodings, but from my perspective, something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are, strictly speaking, not supersets of ASCII because the ASCII ranges are usually interpreted as JIS-Roman, which contains about 4 different characters) is functionally a default encoding... Requiring encoding declarations, as the proposal suggests, is nice for people working in the i18n domain, but is an unnecessary inconvenience for those who are not. > > I don't know of any commercial software written in Japan but used in the > west so I think that they probably have less I18N pressure than we do. > Unicode is only interesting when you want the same software to run in > multiple character set environments! That's exactly true. The point I would like to make is that a lot, probably the majority of Python software and libraries that are out there today, don't have any need to run in multiple character set environments. Python is useful for a lot more things than just for commercial development of products designed for international markets. > > Andy Robinson wrote: > > > > ... > > > > 2. I have been told that there are angry mumblings on the > > Python-Japan mailing list that such a change would break all > > their existing Python programs; I'm trying to set up my tools to > > ask out loud in that forum. > > I don't think it is posssible to say in the abstract that a move to > Unicode would break code. Depending on implementation strategy it might. > But I can't imagine there is really a ton of code that would break > merely from widening the character. See above. I think there is, at least outside of Europe. Is it a higher priority for Python to make it easier for Western users to internationalize, or to save people who currently use Python strings to manipulate binary data the trouble of having to port their applications to support the new conventions? 
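To make the porting burden concrete, here is a rough sketch of the kind of code I have in mind (the byte values are Shift-JIS purely for illustration, and the b"" and explicit-decode spellings are of course the proposals, not current Python):

    # today: a plain string literal quietly holding Shift-JIS encoded text
    s = "\x93\xfa\x96\x7b\x8c\xea"    # "Nihongo" as raw Shift-JIS bytes
    print len(s)                      # 6 octets, not 3 characters

    # under the proposals this would eventually have to become either
    s = b"\x93\xfa\x96\x7b\x8c\xea"   # explicitly binary data, or
    s = unicode("\x93\xfa\x96\x7b\x8c\xea", "shift-jis")
    # explicitly decoded text, assuming a Shift-JIS codec is installed
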
I guess my own personal preference is not to change things too much, because from my perspective, the Unicode support is fine - if it's not broken, don't fix it. Maybe it would be instructive to take the current proposal and any others that come out, and without actually implementing, pretend-apply the changes to parts of the existing code base to try to see how big the effect would be? That way, neither of us has to accept just on faith that changing so-and-so would or would not break existing code... --Brian From frank63@ms5.hinet.net Sun Feb 11 13:20:10 2001 From: frank63@ms5.hinet.net (Frank Chen) Date: Sun, 11 Feb 2001 13:20:10 -0000 Subject: [I18n-sig] Re: Strawman Proposal: Encoding Declaration V2 Message-ID: <200102110601.OAA14610@ms5.hinet.net> > Date: Sat, 10 Feb 2001 07:58:22 -0800 > From: Paul Prescod > Organization: ActiveState > To: "i18n-sig@python.org" > Subject: [I18n-sig] Strawman Proposal: Encoding Declaration V2 > > > A source file with an encoding declaration must only use non-ASCII bytes > in places that can legally support Unicode characters. In Python 2.x the > only place is within a Unicode literal. This restriction may be lifted > in future versions of Python. So, if one day I declare Big5 as the encoding, I cannot use any ASCII character in my Python script? Does it mean that if I set a = "characters='abc'", it won't work in the future? Do I need to use Big5 characters as identifiers and also as the contents of strings when the encoding declaration is set to Big5? > > The encoding declaration must be found before the first statement in the > source file. The declaration is not a pragma. It does not show up in the > parse tree and has no semantic meaning for the compiler itself. It is > conceptually handled in a pre-compile "encoding sniffing" step. This > step is also done using the ASCII encoding. > Like a preprocessor, to convert local encoding characters into Unicode first? And then feed it to the compiler? Frank Chen From frank63@ms5.hinet.net Sun Feb 11 14:02:45 2001 From: frank63@ms5.hinet.net (Frank Chen) Date: Sun, 11 Feb 2001 14:02:45 -0000 Subject: [I18n-sig] Re: All this Unicode discussion Message-ID: <200102110601.OAA14633@ms5.hinet.net> > Brian and I are worried about all these proposals flying around. > Americans seem to feel that having Unicode everywhere is > 'the right thing'. But we have not heard from enough people > in Japan or in Chinese-speaking countries, and the list has > NEVER had input from e.g. Arabic speakers or Eastern Europe. > In fact, it looks like some people in mainland China strongly object to Unicode in Chinese software. To them, the Han Unification for CJK reveals a lack of understanding of CJK ideography. If, in the future, UCS-4 can provide a complete allocation area for each written language, especially for CJK, I think it is fine to use Unicode as the internal data type. I am even wondering whether there is a chance to bring ancient Egyptian hieroglyphics into Unicode, though it is a dead script. > Is it really desirable, long term, to have Unicode strings as the > default > type in Python? Do we need separate Unicode file and Binary > file and socket types? Or are we better with what we have now - > no fundamental changes, but with codecs and Unicode strings > when you want them? Looking at the proposal, it seems not to treat Unicode as the pivot internally, but as an add-on when an encoding declaration is set. If there is no encoding declaration setting, it should function like before, right?
Or if it is set to Latin-1, it should work like current Python, right? For now, I can put Big5 characters in Python strings, and the Windows or Chinese emulator can interpret Big5 strings correctly when Python displays them on the screen. I think the future version should keep this alive. But I am worries about the conversion time when mapping to Unicode. The Python start-up time for initialization may take too long. > > In addition, are there any benefits or problems when you > deal with double-byte data in Java, VB, or any other languages > you are familiar with? > I think the reason that Java or Windows use Unicode in internal processing is mainly for quick universal delivering. And the reason why Unicode raises is the same, for many local encodings slow down the productivity when the product is world-widely spreaded. So, if Python wants to ship with i18n & 10n (then it can display local encoding message with its environment in different areas and the like), it surely can use Unicode for delivering efficiency. Frank Chen From paulp@ActiveState.com Sun Feb 11 07:05:03 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sat, 10 Feb 2001 23:05:03 -0800 Subject: [I18n-sig] Re:Strawman Proposal: Encoding Declaration V2 References: <200102110601.OAA14610@ms5.hinet.net> Message-ID: <3A86399F.F4B2C6E5@ActiveState.com> Frank Chen wrote: > > ... > > So, if one day I declare Big5 as the encoding, I cannot use any ASCII > character in my Python script? > Does it mean this? > if I set a = "characters='abc'", in the future it doesn't work? I need to > use Big5 characters > as identifiers and also the contents of strings when encoding declaraction > is set to Big5? I'm pretty sure that ASCII characters are Big5 characters and they are encoded in the same way as in pure ASCII. So yes, you can continue to use ASCII characters in Big5-encoded scripts. The current proposal only has any "effect" on Unicode literals anyhow. The only danger is that just as today you must not use a Big5 character with a second byte that would confuse an ASCII-based parser. The second byte must never equate to ASCII "\" or '"'. I presume you are already careful about that. > ... > > Like a preprocessor, to convert local encoding characters into Unicode > first? > And then feed it to the compiler? *Conceptually* this is how I think of it. That *could* one day allow identifiers to be in any language. It also means that we could one day get rid of the silly restrictions on the second byte of two-byte characters. Others think of it as just a post-parse transformation on ONLY Unicode (u"") literals. Until the issue of non-ASCII identifiers comes up, there is no practical difference. So you can think of it either way. The first implementation will likely be a post-parse transformation because it is easier to implement in a non-Unicode parser. Paul Prescod From paulp@ActiveState.com Sun Feb 11 08:05:22 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 00:05:22 -0800 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? References: <3A8593BF.8AFCEBB3@ActiveState.com> <20010211140545.49DF.BRIAN@tomigaya.shibuya.tokyo.jp> Message-ID: <3A8647C2.398822CB@ActiveState.com> Brian Takashi Hooper wrote: > > ... 
> > I think this is a true and valid point (that Westerners are more likely > to want to make internationalized software), but it sounds here like > because Westerners want to make it easier to internationalize software, > that that is a valid reason to make it harder to make software that has > no particular need for internationalization, in non-Western languages, > and change the _meaning_ of such a basic data type as the Python string. I do not think that any of the proposals make it much harder to make non-internationlized software. We are merely asking people to be explicit about their assumptions so that code will have a better chance of working on other people's computers. That means adding an encoding declaration here, prepending a "b" prefix there and so forth. Asians understand encoding issues and I do not think that they will be confused by these changes. If you ask an Asian "what is Python's character set" they will either answer Latin 1 (which looks bad) or "Python has no native character set, only binary strings of bytes." If they think of strings as strings of bytes then what is the harm in prefixing a "b" to make that assumption explicit? > If in fact, as the proposal proposes, usage of open() without an > encoding, for example, is at some point deprecated, then if I am > manipulating non-Unicode data in "" strings, then I think I _do_ at some > point have to port them over. No, those would be two unrelated changes. In order to get open() to have its old behavior you would say something like: open( "filename", "raw") or open( "filename", "binary") > b"" then becomes > different from "", because "" > is now automatically being interpreted behind the scenes into an > internal Unicode representation. Yes, this is a separate proposal for some time down the road. Sometime down the road is likely at least two years because the deployment of new versions of Python is very slow and it would be wrong to quickly deprecate a usage which is "recommended practice" in Python 2.x. > If the blob of binary data actually > happened to be in Unicode, or some Unicode-favored representation (like > UTF-8), then I might be happy about this - but if it wasn't, I think > that this result would instead be rather dismaying. The vast majority of the world's encodings are "Unicode-favored" at some level. As long as the character set is compatible with Unicode and you add an encoding declaration, everything should just work. If you do NOT want to work with Unicode then you would have to prepend a "b" prefix to your literal strings. As I've described, you will have several years to choose which path you want to take. And the "fixups" are easy. I don't see why this is a cause for alarm. > The current Unicode support is more explicit about this - the meaning of > the string literal itself has not changed, so I can continue to ignore > Unicode in cases where it serves no useful purpose. Python is EXPLICIT about the fact that the character set is NOT Unicode. Python is NOT explicit about the fact that the character set is Latin 1 or "binary data" -- depending on your point of view. If you take the former point of view then Python is Western centric. If you take the latter point of view then it is just plain confusing to use the term "character string" as the name for your "binary data" container. 
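Here is a tiny illustration of the ambiguity I mean (the file name is made up, and the b"" literal is the proposed spelling, not current Python):

    data = open("report.dat").read()   # text in some encoding, or raw bytes?  The type doesn't say.
    t = u"some text"                   # unambiguously characters
    b = b"\x00\x01\xff"                # unambiguously bytes (proposed)
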
You acknowledge this below: > I realize that it > would be nicer from a design perspective, more consistent, to have > Python string mean only character data, but right now, it does sometimes > mean binary and sometimes mean characters. The only one who can > distinguish which is the programmer - if at some point "" means only > Unicode character strings, then the programmer _does_, I think, have to > go through all their programs looking for places where they are using > strings to hold non-Unicode character data, or binary data, and > explicitly convert them over. I have difficulty seeing how we would be > able to provide a smooth upgrade path - maybe a command-line backwards > compatibility option? It is my personal opinion that time itself is an "upgrade path." If you tell people where things are going then in the course of basic software maintenance they will change their software. This is how we managed the transition from K&R C to ANSI C to C++. Yes, a command-line backwards compatibility option is another way of extending the amount of "change-over" time people have. > Maybe defaults? I've heard a lot of people > voicing dislike for default encodings, but from my perspective, > something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are, > strictly speaking, not supersets of ASCII because the ASCII ranges are > usually interpreted as JIS-Roman, which contains about 4 different > characters) is functionally a default encoding... Requiring encoding > declarations, as the proposal suggests, is nice for people working in > the i18n domain, but is an unnecessary inconvenience for those who are > not. One of the things I like about Python is that it encourages me to write software in ways that allow my simple scripts to grow into complex programs. Perl programmers consider many of these "encouragements" to be unnecessary inconveniences. Similarly, I think Python should help me (and encourage me) to write software that works on computers that are configured differently than mine. Think of it also as an investment in the unification of the Python world. Wouldn't it be great if Chinese programmers could email Guido and say: "Here's a cool Python program I wrote. Give it a whirl?" Is it possible that we duplicate more code than we need to because it is too hard to share programs right now? Obviously spoken language barriers are not going away but at least our code can be portable. Also, think of all of the great software being written in Python. Maybe the next killer Python app will work better in Japan and China because we made it easier to internationalize code. And if Python itself can distinguish between textual and binary information then we can do a lot of things more intelligently: coercions, exceptions, concatenations, extension library integration etc. Explicit is better than implicit! Finally, I think it is in the best interests of even people who do not want i18n to have the Python language be more explicit and consistent. When Python is taught in a Japananese school they can say: "See, this character 'b' means that the string contains binary data. We choose to use a binary string for reason X, Y and Z." or "See, this string contains Unicode characters. That means len() works as you would expect on a per-character basis and the software works just as well with Chinese text as Japanese text and ..." > > I don't think it is posssible to say in the abstract that a move to > > Unicode would break code. Depending on implementation strategy it might. 
> > But I can't imagine there is really a ton of code that would break > > merely from widening the character. > See above. I think there is, at least outside of Europe. Note that we are discussing three or four or five different proposals as if they are one. I think it would be easy to demonstrate that there is little code that would break based ONLY on the change that Python strings could contain characters with ordinals greater than 255. If we added a single character to the range at position 256, would that break much Python code? Ignore Unicode. Just extend the range by one character. Now keep extending it until you get to the size of Unicode. The separate proposal that tries to clean up the interpretation of literals with non-Unicode bytes WOULD break code (if only some time far in the future and after a long changeover period). > ... > Maybe it would be instructive to take the current proposal and any > others that come out, and without actually implementing, pretend-apply > the changes to parts of the existing code base to try to see how big the > effect would be? That way, neither of us has to accept just on faith > that changing so-and-so would or would not break existing code... Python changes are always implemented as patches which are tested and then backed-out if they break things. Nevertheless, you are right that there are some of us with the goal of having string literals directly contain Unicode characters one day. Guido may or may not have an opinion on the issue. Either way, Guido wouldn't make the change if it were going to break a lot of code. So the immediate issue is whether the explicitness requirements of b"" strings and an encoding declaration are too onerous. Anyhow, at this point we are not even talking about adding any mandatory features or turning new features into recommended practice. We are just talking about ALLOWING people to be explicit about the distinction between binary and text data and ALLOWING people to directly enter Unicode text data. I haven't tried to hide where I think things should go but still these new features deserve to be evaluated on their own. They are good ideas even if we never deprecate the other ways of doing things. I know I started this discussion with my single big-bang proposal but I'd like to take a more incremental approach now. I don't think that the current proposals make anyone's life harder yet. Paul Prescod From paulp@ActiveState.com Sun Feb 11 08:16:50 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 00:16:50 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> Message-ID: <3A864A72.B18E5C31@ActiveState.com> "M.-A. Lemburg" wrote: > > ... 
> > Any reason why we cannot use a keyword argument for encoding > and put it at the end of the argument list ? The result is: > > 1. no ambiguity > 2. backward compatibility > 3. good visibility of what the argument stands for (without having > to look up the manual for e.g. the meaning of 'mbcs') I would like to have the option of one day making it a required argument without having to also make mode and bytes required. Mode would be a minor inconvenience but bytes would be major. Paul Prescod From andy@reportlab.com Sun Feb 11 08:22:44 2001 From: andy@reportlab.com (Andy Robinson) Date: Sun, 11 Feb 2001 08:22:44 -0000 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: <3A85C163.4CFAAE4@lemburg.com> Message-ID: [Marc-Andre] > Note that we are not moving to *one* new string type, but instead > make use of object orientation and fit the current use of strings > into different subclasses of a binary string type: > > binary data string *) > | > | > text data string > | | > | | > Unicode string encoded 8-bit string (with encoding > *) information !) > > *) these are implemented in Python 1.6-2.1. > > The basic idea here is to differentiate between text data and > binary data. > Thanks. It's finally starting to make sense to me. - Andy From tim.one@home.com Sun Feb 11 08:34:32 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 03:34:32 -0500 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A855B89.459A18E4@ActiveState.com> Message-ID: [Paul Prescod] > I'm not personally willing to design in such a limitiation. I have seen > a lot of code that mixes other languages with English. e.g.: > > http://starship.python.net/pipermail/python-de/2000q3/000597.html > > I don't think this guy is doing anything wrong. If a Japansese person > asks me if they could do the same I would say: "Not now, but hopefully > someday." But of course they could: "this guy" you point to as evidence used plain 7-bit ASCII, writing an approximation to German in that. *That's* certainly widespread, in and out of the Python world. But more than that isn't. Again, pick a language that already supports what you suggest and find some evidence that it's *used*. As I said before, I've seen no evidence that it is, and the evidence of languages designed by non-Euros suggests it's rare even for them to cater to these complications (and, yes, the Java Character class's .isIdentifierIgnorable(), .isUnicodeIdentifierPart(), .isUnicodeIdentifierStart() etc methods are indeed complications: write a regexp to match a valid Unicode identifier; write a UserDict that manages to collapse valid Unicode identifiers that differ only in ignorable characters into a single key; etc; explain to users that their little source-munging tools need to take all of that into account in the New World). > ... > People keep bringing up this issue of keywords. I've never disputed that > the keywords should always be English. What about the names of builtins and std library names and the names of classes and functions and methods and attributes in the std libraries? I mentioned keywords in the context of all of those. > There are a lot of people who write code that will never be > seen by a speaker of an ASCII-compatible language. Why should they be > forced to write it in ASCII? "Forced" presumes it's against their will. That's what I question. There is nothing more Eurocentric than to embark on unilateral crusades for the purported benefit of non-Euros who aren't asking for help <0.7 wink>. 
it's-a-programming-language-not-a-word-processor-ly y'rs - tim From tim.one@home.com Sun Feb 11 08:50:17 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 03:50:17 -0500 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: <3A8647C2.398822CB@ActiveState.com> Message-ID: [Paul Prescod] > ... > If you ask an Asian "what is Python's character set" they will either > answer Latin 1 (which looks bad) or "Python has no native character set, > only binary strings of bytes." The Python Reference Manual says (chapter 2, "Lexical analysis"): Python uses the 7-bit ASCII character set for program text and string literals. That was Guido's intent, and it's actually a bug that the parser uses isalpha() etc (it wasn't intended to vary according to locale; locale was an ANSI invention Guido didn't have in mind when that stuff was coded; and, e.g., in some locales even characters like "|" meet the isalpha() test). From andy@reportlab.com Sun Feb 11 09:18:57 2001 From: andy@reportlab.com (Andy Robinson) Date: Sun, 11 Feb 2001 09:18:57 -0000 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: <3A864A72.B18E5C31@ActiveState.com> Message-ID: > > Any reason why we cannot use a keyword argument for encoding > > and put it at the end of the argument list ? The result is: > > > > 1. no ambiguity > > 2. backward compatibility > > 3. good visibility of what the argument stands for (without having > > to look up the manual for e.g. the meaning of 'mbcs') > > I would like to have the option of one day making it a > required argument > without having to also make mode and bytes required. Mode would be a > minor inconvenience but bytes would be major. > > Paul Prescod I can see three separate proposals going on here. Here's what I think: (1) introduce b"whatever". I'm 100% in favour - breaks nothing, adds clarity, and having it early may ease the pain if we ever do break old code in a few years. (2) widen the string representation so they can hold single or multi-byte data but without implying their semantics. I'm not sure on this one - it goes further than any other language and the extra power may lead to new classes of errors. Alongside the proposal, we need a bunch of examples of how this could be used, and of how it could be abused, and then I think we all need to sit on it for a while. Which is what you've been saying too. (3) changing open(). This should be contingent on (2). As long as u"hello" and "hello" have a different type, our current solution is exactly right - we have wrappers classes around files which handle Unicode strings, but files themselves always do I/O in bytes. We've actually got the explicit position you favour right now - to write Unicode to a file, I need to explicitly create a wrapper with an encoding. If you go to (2), it becomes possible to write a string containing unicode straight to a file object, and therefore it is desirable to let the file object handle conversion, so you need a way to specify it etc. I am still not sure this is right. The stackable streams concept is well understood from Java and gives a lot of power. - Andy From andy@reportlab.com Sun Feb 11 09:23:52 2001 From: andy@reportlab.com (Andy Robinson) Date: Sun, 11 Feb 2001 09:23:52 -0000 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: Message-ID: > "Forced" presumes it's against their will. That's what I > question. 
There > is nothing more Eurocentric than to embark on unilateral > crusades for the > purported benefit of non-Euros who aren't asking for help > <0.7 wink>. > Beatifully put. This is the empirical question and one I am determined to get real answers to. - Andy From paulp@ActiveState.com Sun Feb 11 09:34:31 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 01:34:31 -0800 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? References: Message-ID: <3A865CA7.249910F1@ActiveState.com> Tim Peters wrote: > > > ... > > The Python Reference Manual says (chapter 2, "Lexical analysis"): > > Python uses the 7-bit ASCII character set for program text and > string literals. > > That was Guido's intent, That may be the rule but try enforcing it. It is so widely violated as to be irrelevant. I would love it if you did try to enforce it in Python 2.1. You would take the heat for breaking everyone's non-ASCII programs and then I could come in and propose the draconian rule be eased with the encoding declaration. The wide violation of this rule should inform our discussions about where Python source code is going in the future... Paul Prescod From paulp@ActiveState.com Sun Feb 11 09:46:18 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 01:46:18 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: Message-ID: <3A865F6A.6CC12CC3@ActiveState.com> Tim Peters wrote: > > ... > > Again, pick a language that already supports what you suggest and find some > evidence that it's *used*. We will see. Before Unicode it would have been very hard to do this and yet achieve source code portability between systems. Unicode and the tools and languages that use it are just being deployed. There is no need to move aggressively in that direction. But I'll say again that I think it would be a big mistake to add any further impedements to getting there. > it's-a-programming-language-not-a-word-processor-ly y'rs - tim I don't understand your fundamental point. We agree that German people want to use German variable names. If it was *just as easy* for them to use non-ASCII German characters, why wouldn't they? What's magical about ASCII? And if Japanese people are more like German people than they are different from them (carbon based, bipedal, etc.) then why wouldn't they want to write code using their special characters? Why would they choose to approximate and translate? I'm not claiming it's a burning need, but I don't see why a Japanese teenager learning to program for the first time would choose to use a language that requires English variable names over one that offered choice. There's nothing magical about ASCII. Hell, American teenagers would probably love to put happy faces and summation signs into their variable names. I use a teenager as an example of a person coming to the computer world fresh without ASCII brain-damage. Where's Greg Wilson when I need him? Paul Prescod From paulp@ActiveState.com Sun Feb 11 09:59:08 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 01:59:08 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: Message-ID: <3A86626C.AFFF32B0@ActiveState.com> Andy Robinson wrote: > > .... > > I can see three separate proposals going on here. Here's what I > think: > > (1) introduce b"whatever". > > I'm 100% in favour - breaks nothing, adds clarity, and having it early > may ease the pain if we ever do break old code in a few years. 
> > (2) widen the string representation so they can hold single or > multi-byte > data but without implying their semantics. This is not a short-term proposal because it involves more implementation work than the others. > I'm not sure on this one - it goes further than any other language > and the extra power may lead to new classes of errors. Actually, the way you describe it, it sounds a lot like wchar. > (3) changing open(). > > This should be contingent on (2). As long as u"hello" and "hello" > have a different type, our current solution is exactly right - > we have wrappers classes around files which handle Unicode strings, > but files themselves always do I/O in bytes. We've actually got > the explicit position you favour right now - to write Unicode to a > file, I need to explicitly create a wrapper with an encoding. I don't follow why this should be contingent on widening the basic string representation! Given a Unicode type, we need to read and write Unicode data today. In my personal opinion, wrappers are too obscure and too optional. The average programmer is not going to even know they exist. > If you go to (2), it becomes possible to write a string containing > unicode straight to a file object, and therefore it is desirable > to let the file object handle conversion, so you need a way to > specify it etc. We already have Unicode strings that we need to write to files! > I am still not sure this is right. The stackable > streams concept is well understood from Java and gives a lot of > power. The stackable streams will still exist. But Python is "flatter" than Java in general. Java's IO libraries are in my opinion almost incomprehensible. Yes, very powerful once you understand them, but a lot to learn to do basic things. I would not be embarrassed to tell a newbie Python programmer that they should write: file = open("/etc/passwd.txt", "ASCII") It's pretty clear what's going on and they don't need any understanding of Unicode. What's the Java equivalent? Paul Prescod From fredrik@effbot.org Sun Feb 11 10:14:25 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 11:14:25 +0100 Subject: [I18n-sig] Re: Pre-PEP: Python Character Model Message-ID: <012101c09413$a5b5d2e0$e46940d5@hagrid> (trying to catch up from the archives; just realized that I wasn't subscribed to i18n) > > I'm lost here. Let's say I'm using Python 1.5. I have some KOI8-R data > > in a string literal. PythonWin and Tk expect Unicode. How could they > > display the characters correctly? > > No, PythonWin and Tk both tell apart Unicode and byte strings > (although Tk uses quite a funny algorithm to do so). If they see a > byte string, they convert it using the platform encoding (which is > user-settable on both Windows and Unix) to a Unicode string, and > display that. Not quite true for Tk: Tcl's 8-bit to Unicode conversion expects UTF-8. When it sees a lead byte with not enough trail bytes, the lead byte is copied as is. Naked trail bytes are also copied as is. Under Latin-1, the following three Python strings all result in the same Tcl string value: str = "åäö" str = u"åäö".encode("utf-8") str = u"åäö" But under a hypothetical platform encoding where "å" looks like a UTF-8 lead byte, and "ä" like a trail byte, this will fail (if you think that's unlikely, feel free to replace "å" and "ä" with other characters...).
Cheers /F From fredrik@effbot.org Sun Feb 11 10:34:32 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 11:34:32 +0100 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model Message-ID: <013401c09416$881b0f40$e46940d5@hagrid> > > In my opinion there should be *no* encoding default. New code should > > always specify an encoding. Old code should continue to work the same. > > However, matter-of-factually, you propose that ISO-8859-1 is the > default encoding, as this is the encoding that is used when converting > character strings to char* in the C API. I'd certainly call it a > default. It's not an encoding. It's the subset of Unicode that you can store in an 8-bit character. (If you have a problem with that, complain to the Unicode designers) Cheers /F From fredrik@effbot.org Sun Feb 11 10:46:09 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 11:46:09 +0100 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model Message-ID: <013c01c09417$e11c49a0$e46940d5@hagrid> > I really like the idea of the > > b"..." prefix > > Is anyone opposed? yes. > 1. [file]?open(filename, encoding, ...) you mean (?:file)?open, right? I still think we can reuse the builtin "open" primitive (and don't forget the text vs. binary mode issue -- binary files never have encodings). > 2. b"..." -0 (I'm sceptical) > 3. an encoding declaration at the top of files +1 > 4. that concatenating Python strings and Unicode strings should do the > "obvious" thing for characters from 127-255 and nothing for characters > beyond. +1 > 5. a bytestring type that behaves in every way shape and form like our > current string type but has a different type() and repr(). almost: it shouldn't implement text-related methods. isupper, upper, etc. don't make sense here. (but like in SRE, the *source* code should be reused) Cheers /F From fredrik@effbot.org Sun Feb 11 10:51:29 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 11:51:29 +0100 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model Message-ID: <014d01c09419$07dfbad0$e46940d5@hagrid> > >I would want to avoid the need for a 2.0-style 'default encoding', so I > >suggest it shouldn't be possible to mix this type with other strings: > > > >>>> "1"+b"2" > >Traceback (most recent call last): > > File "", line 1, in ? > >TypeError: cannot add type "binary" to string > >>>> "3"==b"3" > >0 a more pragmatic approach would be to assume ASCII encodings for binary data, and choke on non-ASCII chars. >>> "1" + b"2" 12 >>> "1" + buffer("2") 12 >>> "1" + b"\xff" ValueError: ASCII decoding error: ordinal not in range(128) Cheers /F From fredrik@effbot.org Sun Feb 11 11:00:44 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:00:44 +0100 Subject: [I18n-sig] Re: Strawman Proposal: Binary Strings Message-ID: <017f01c0941f$4b13a6d0$e46940d5@hagrid> > About changing .encode() or the existing codecs to return binary > strings instead of normal strings: I'm -1 on this one since it > will break existing code. -1. core features shouldn't return binary data in text strings. foo.upper() shouldn't work if "foo" isn't known to contain text. if this breaks code (not sure it does), the binary data type needs more work. > Instead, strings should probably carry along the encoding > information in an additional attribute (it is not always useful, > but can help in a few situations) provided that it is known. -1. evil.
Cheers /F From fredrik@effbot.org Sun Feb 11 11:08:08 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:08:08 +0100 Subject: [I18n-sig] Re: Strawman Proposal (2): Encoding attributes Message-ID: <018001c0941f$4c1140b0$e46940d5@hagrid> > > > Ah, ok. The encoding information will only be applied to literal > > > Unicode strings (u"text"), right ? > > > > No, that's very different than what I am suggesting. > > > > The encoding is applied to the *text file*. > > -1 and -1 on your -1. MAL, you're stuck in a "unicode strings are something special" modus operandi. the goal should be to get rid of u"foo" strings, not continue to make Python more and more dependent on this artificial distinction. > The result would be way to much breakage. I doubt it. Cheers /F From fredrik@effbot.org Sun Feb 11 11:20:09 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:20:09 +0100 Subject: [I18n-sig] Re: Strawman Proposal (2): Encoding attributes Message-ID: <018801c0941f$4ce42110$e46940d5@hagrid> > "Forced" presumes it's against their will. That's what I question. There > is nothing more Eurocentric than to embark on unilateral crusades for the > purported benefit of non-Euros who aren't asking for help <0.7 wink>. if you think that ASCII is good enough for european languages, or that europeans like having to use an approximation of their own language just because american programmers are lazy, I'm not sure you should be on this list at all . Cheers /F From fredrik@effbot.org Sun Feb 11 11:22:58 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:22:58 +0100 Subject: [I18n-sig] Re: Strawman Proposal (2): Encoding attributes Message-ID: <018901c0941f$4d3e4f00$e46940d5@hagrid> > > If it works and it is easy, there should not be a problem! > > This is how I started into the Unicode debate (making UTF-8 the default > encoding). It doesn't work out... let's not restart that discussion. this is not the same discussion. Cheers /F From fredrik@effbot.org Sun Feb 11 11:29:27 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:29:27 +0100 Subject: [I18n-sig] Re: Strawman Proposal: Smart String Test Message-ID: <018a01c0941f$4d9b8a30$e46940d5@hagrid> > type(foo)==type("") any reason we cannot just make this work, whether foo contains 8-bit or 16-bit data? btw, the preferred syntax is: isinstance(foo, type("")) I think it's okay only the latter works, for now (which can be solved by a simple and stupid hack, while waiting for a real type hierarchy...) Cheers /F From fredrik@effbot.org Sun Feb 11 11:34:12 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:34:12 +0100 Subject: [I18n-sig] Re: Strawman Proposal: Encoding Declaration V2 Message-ID: <018b01c0941f$4dd733a0$e46940d5@hagrid> > A source file with an encoding declaration must only use non-ASCII bytes > in places that can legally support Unicode characters. In Python 2.x the > only place is within a Unicode literal make that "in a string literal". if an encoding directive is present, the *entire* file should be assumed to use that encoding. this applies to comments, 8-bit string literals, and 16-bit string literals. Cheers /F From fredrik@effbot.org Sun Feb 11 11:53:50 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 12:53:50 +0100 Subject: [I18n-sig] Re: Python and Unicode == Britain and the Euro? 
Message-ID: <019e01c09421$c402bfc0$e46940d5@hagrid> tim wrote: > The Python Reference Manual says (chapter 2, "Lexical analysis"): > > Python uses the 7-bit ASCII character set for program text and > string literals. ...and then says "8-bit characters may be used in string literals and comments but their interpretation is platform dependent". for a non-ASCII programmer, that pretty much means "no native character set". Cheers /F From mal@lemburg.com Sun Feb 11 13:13:23 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 11 Feb 2001 14:13:23 +0100 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> Message-ID: <3A868FF3.45EEC501@lemburg.com> [Paul, it would help if you wouldn't always remove important parts of the quoted messages... people who don't read the whole thread won't have a chance to follow up] Paul Prescod wrote: > > Guido van Rossum wrote: > > > > ... > > > open(filename, encoding, [[mode], bytes]) > > > > > > And the documentation would say: > > > > > > "There is an obsolete variant that does not require an encoding string. > > > This may cause a warning in future versions of Python and be removed > > > sometime after that." > > > > I am appalled at this lack of respect for existing conventions, > > You're the one who told everyone to move from string functions to string > methods. This is a move of similar scope but for a much more important > purpose than merely changing coding style. > > > when a > > simple and obvious alternative (see below) is easily available. I > > will have a hard time not to take this into account when I finally get > > to reading up on your proposals. > > There is an important reason that we did not use a keyword argument. > > We (at least some subset of the people in the i18n-sig) want every > single new instance of the "open" function to declare an encoding. This doesn't make sense: not all uses of open() target text information. What encoding information would you put into an open() which wants to read a JPEG image from a file ? > Right > now we allow a lot of "ambiguous data" into the system. We do not know > whether the user meant it to be binary or textual data and so we don't > know the correct/valid coercions, conversions and operations. We are > trying to retroactively make an open function that strongly encourages > (and perhaps finally forces) people to make their intent known. > > The open extension is a backwards compatible way to allow people to move > from the "old" ambiguous form to the new form. I considered it pretty > well thought out in terms of backwards and forwards compatibility. 
We > could also just invent a new function like "file" or "fileopen" but > upgrading "open" seemed to show the *most* respect for existing > conventions (and clutters up builtins the least). We cannot turn override the mode parameter with an encoding parameter... why do you believe that this is backwards compatible in any way ? (Note that mode is an optional parameter!) The keyword argument approach gives us a much better way to integrate a new argument into the open() call: f = open(filename, encoding='mbcs', mode='w') or f = open(filename, 'w', encoding='mbcs') There's a little more typing required, but the readability is unbeatable... -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sun Feb 11 13:22:53 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 11 Feb 2001 14:22:53 +0100 Subject: [I18n-sig] Random thoughts on Unicode and Python References: <14981.45051.945099.633730@cymru.basistech.com> <14982.4009.542031.914222@cymru.basistech.com> Message-ID: <3A86922D.AB5AB78E@lemburg.com> Tom Emerson wrote: > > Andy Robinson writes: > > (1) user defined characters: the big three Japanese encodings > > use the Kuten space of 94x94 characters. There are lots of slight > > venddor variations on the basic JIS0208 character set, as well > > as people adding new Gaiji in their office workgroups. Generic > > conversion routines from, say, EUC to Shift-JIS still work > > perfectly whether you use Shift-JIS, cp932, or cp932 plus > > ten extra in-house characters. Conversions to Unicode involve > > selecting new codecs, or even making new ones, for all these > > situations. > > There is no reason that we couldn't provide a set of unified codecs > for EUC-JP, Shift JIS, ISO-2022-JP, and CP932 that provide appropriate > mappings between the EUDC sections in the legacy character sets and > the PUA of Unicode, such that these conversions work. Right. > > (2) slightly corrupt data: Let's say you are dealing with files > > or database fields containing some truncated kanji. If you > > use 8-bit-clean strings and no conversion, the data will not > > be corrupted or changed; if you try to magically convert > > it to Unicode you will get error messages or possibly even > > more corruption. Maybe you're writing an app whose job is > > to get text from machine A to machine B without changing it; > > suddenly it will stop working. I know people who spent > > weeks debugging a VB print spooler which was cutting up > > Postscript files containing kanji. > > Yes, this is a problem that I cannot suggest a good answer to: reality > raises its ugly head. We won't be introducing new magic... > > Suddenly upgrading to a new version of Python where all > > your data undergoes invisible transformations to Unicode > > and back is going to cause trouble for quite a few people. > > Absolutely. ...and the move will be slow one for sure :-) I think that a lot of small steps are required to finally get there and I don't want to rush anything. Still, I believe that talking about all this now is not such a bad idea, even though it may cause some concern about the future direction of Python. Python's history has shown that the developers have always tried to maintain backward compatibility whereever possibleand feasable. 
This won't change, since it is one of the most important factors in Python's success story and there are enough people on python-dev who care about this a lot. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From fredrik@effbot.org Sun Feb 11 13:34:26 2001 From: fredrik@effbot.org (Fredrik Lundh) Date: Sun, 11 Feb 2001 14:34:26 +0100 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <3A868FF3.45EEC501@lemburg.com> Message-ID: <000701c0942f$5d08b780$e46940d5@hagrid> mal wrote: > > We (at least some subset of the people in the i18n-sig) want every > > single new instance of the "open" function to declare an encoding. > > This doesn't make sense: not all uses of open() target text > information. What encoding information would you put into an > open() which wants to read a JPEG image from a file ? how about: file = open("image.jpg", encoding="image/jpeg") image = file.read() # return a PIL image object or perhaps better: file = open("image.jpg", encoding="image/*") image = file.read() > We cannot turn override the mode parameter with an encoding > parameter... why do you believe that this is backwards compatible > in any way ? (Note that mode is an optional parameter!) instead of overriding, why not append the encoding to the mode parameter: "r" # default, read text file, unknown encoding "rb" # read binary file, no encoding" "r,utf-8" # read text file, utf-8 encoding "rb,ascii" # illegal mode (this is in line with C's fopen) Cheers /F From barry@digicool.com Sun Feb 11 14:30:13 2001 From: barry@digicool.com (Barry A. 
Warsaw) Date: Sun, 11 Feb 2001 09:30:13 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> Message-ID: <14982.41461.889514.547839@anthem.wooz.org> >>>>> "PP" == Paul Prescod writes: PP> We (at least some subset of the people in the i18n-sig) want PP> every single new instance of the "open" function to declare an PP> encoding. I've barely followed this discussion at all, but what you say here causes my greatest nagging concern to bubble to the surface. I write lots of programs for which i18n isn't a requirement, and may never be. It seems like you saying that you want me to have to confront issues like encodings, character sets, unicode, multiplicity of string types, etc. in even the simplest, most xenophobic programs I write. That would be, IMO, a loss of epic proportions to the simplicity and "brain fitting" nature of Python. I have no problems, and in fact encourage, facilities in Python to help me i18n-ify my programs when I'm ready and need to. But not before. I really hope I'm misunderstanding. -Barry From mal@lemburg.com Sun Feb 11 14:33:48 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 11 Feb 2001 15:33:48 +0100 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? References: <3A8593BF.8AFCEBB3@ActiveState.com> <20010211140545.49DF.BRIAN@tomigaya.shibuya.tokyo.jp> Message-ID: <3A86A2CC.BB64149B@lemburg.com> Brian Takashi Hooper wrote: > > Hi there, Brian in Tokyo again, > > On Sat, 10 Feb 2001 11:17:19 -0800 > Paul Prescod wrote: > > > Andy, I think that part of the reason that Westerners push harder for > > Unicode than Japanese is because we are pressured (rightly) to right > > software that works world-wide and it is simply not sane to try to do > > that by supporting multiple character sets. Multiple encodings maybe. > > Multiple character sets? Forget it. > I think this is a true and valid point (that Westerners are more likely > to want to make internationalized software), but it sounds here like > because Westerners want to make it easier to internationalize software, > that that is a valid reason to make it harder to make software that has > no particular need for internationalization, in non-Western languages, > and change the _meaning_ of such a basic data type as the Python string. > > If in fact, as the proposal proposes, usage of open() without an > encoding, for example, is at some point deprecated, then if I am > manipulating non-Unicode data in "" strings, then I think I _do_ at some > point have to port them over. b"" then becomes > different from "", because "" > is now automatically being interpreted behind the scenes into an > internal Unicode representation. 
If the blob of binary data actually > happened to be in Unicode, or some Unicode-favored representation (like > UTF-8), then I might be happy about this - but if it wasn't, I think > that this result would instead be rather dismaying. We are certainly not goind to make the encoding parameter mandatory for open(). What type the .read() method returns for a file opened using an encoding is dependent on the codec in use, e.g. a Unicode codec would return Unicod, but other codecs may choose to return an encoded 8-bit string instead (with encoding attribute set accordingly). There's still much to do down that road and I wouldn't take the current proposals too seriously yet. We are still in the idea gathering phase... > The current Unicode support is more explicit about this - the meaning of > the string literal itself has not changed, so I can continue to ignore > Unicode in cases where it serves no useful purpose. I realize that it > would be nicer from a design perspective, more consistent, to have > Python string mean only character data, but right now, it does sometimes > mean binary and sometimes mean characters. The only one who can > distinguish which is the programmer - if at some point "" means only > Unicode character strings, then the programmer _does_, I think, have to > go through all their programs looking for places where they are using > strings to hold non-Unicode character data, or binary data, and > explicitly convert them over. I have difficulty seeing how we would be > able to provide a smooth upgrade path - maybe a command-line backwards > compatibility option? Maybe defaults? I've heard a lot of people > voicing dislike for default encodings, but from my perspective, > something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are, > strictly speaking, not supersets of ASCII because the ASCII ranges are > usually interpreted as JIS-Roman, which contains about 4 different > characters) is functionally a default encoding... Requiring encoding > declarations, as the proposal suggests, is nice for people working in > the i18n domain, but is an unnecessary inconvenience for those who are > not. First, I think that most string literals in programs are in fact text data, so switching to a text data type for "" wouldn't be such a big change. For those few cases, where these literals are used for binary data, switching to b"" doesn't really hurt. Of course, the programmer will have to rethink text vs. binary data, but this is what we are aiming at after all. Since this step can be too much of a burden for the programmer, we'll have to come up with a way which allows Python to maintain the old style behaviour, e.g. by telling Python to use a codec which returns a normal 8-bit string object instead of Unicode... #?encoding="old-style-strings" at the top of the source code would then do the trick. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sun Feb 11 18:26:04 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 11 Feb 2001 19:26:04 +0100 Subject: [I18n-sig] Re: Strawman Proposal: Binary Strings References: <017f01c0941f$4b13a6d0$e46940d5@hagrid> Message-ID: <3A86D93C.66BB0233@lemburg.com> Fredrik Lundh wrote: > > > About changing .encode() or the existing codecs to return binary > > strings instead of normal strings: I'm -1 on this one since it > > will break existing code. > > -1. 
core features shouldn't return binary data in text strings. > foo.upper() shouldn't work if "foo" isn't known to contain text. > if this breaks code (not sure it does), the binary data type > needs more work. > > > Instead, strings should probably carry along the encoding > > information in an additional attribute (it is not always useful, > > but can help in a few situations) provided that it is known. > > -1. evil. Care to explain why ? (I think that such an attribute could be put to some good use in (re-)unifying strings and Unicode). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Sun Feb 11 18:33:19 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 11 Feb 2001 19:33:19 +0100 Subject: [I18n-sig] Re: Strawman Proposal (2): Encoding attributes References: <018001c0941f$4c1140b0$e46940d5@hagrid> Message-ID: <3A86DAEE.74F44655@lemburg.com> Fredrik Lundh wrote: > > > > > Ah, ok. The encoding information will only be applied to literal > > > > Unicode strings (u"text"), right ? > > > > > > No, that's very different than what I am suggesting. > > > > > > The encoding is applied to the *text file*. > > > > -1 > > and -1 on your -1. > > MAL, you're stuck in a "unicode strings are something special" modus > operandi. the goal should be to get rid of u"foo" strings, not continue > to make Python more and more dependent on this artificial distinction. Unicode strings *are* special: they can only be used for text data. I we were to decode the whole source code file using some encoding, then use of binary data in standard ""-literals could and probably would lead to decoding errors. Some encodings even play with ASCII-characters (just take a look at the codecs in encodings/), so these would break standard program text as well. > > The result would be way to much breakage. > > I doubt it. Anyway, the two bullets I suggested on this thread implement a subset of what you (Paul and Fredrik) have in mind, so I believe it's a good compromise. We can always extend this to full text file decoding at some later stage, if that should become necessary, which I doubt ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tim.one@home.com Sun Feb 11 21:32:07 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 16:32:07 -0500 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: <3A865CA7.249910F1@ActiveState.com> Message-ID: [Tim quotes the Ref Man] > Python uses the 7-bit ASCII character set for program text and > string literals. > > That was Guido's intent, ... [Paul Prescod] > That may be the rule but try enforcing it. It is so widely violated > as to be irrelevant. Not news -- why do you suppose it isn't enforced ? > I would love it if you did try to enforce it in Python 2.1. You > would take the heat for breaking everyone's non-ASCII programs > and then I could come in and propose the draconian rule be eased with > the encoding declaration. Your life would indeed be easier then. > The wide violation of this rule should inform our discussions about > where Python source code is going in the future... 
In theory you'd hope it would aid your case, but in practice I'm afraid it works against you: people with 8-bit character sets covered by C locale gimmicks seemed happier before Unicode was added. Also not news, of course -- Unicode irritates everyone, because it's nobody's national encoding scheme. From tim.one@home.com Sun Feb 11 21:32:08 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 16:32:08 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: <000701c0942f$5d08b780$e46940d5@hagrid> Message-ID: [/F] > ... > instead of overriding, why not append the encoding to > the mode parameter: Bingo. > "r" # default, read text file, unknown encoding > "rb" # read binary file, no encoding" > "r,utf-8" # read text file, utf-8 encoding > "rb,ascii" # illegal mode Don't know why the last should be illegal; whether I want line-end translation done, or want Ctrl-Z to signify EOF, or etc (all the goofy x-platform distinctions made by binary vs text modes) seems independent of how character data is encoded. From tim.one@home.com Sun Feb 11 21:43:41 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 16:43:41 -0500 Subject: [I18n-sig] Re: Python and Unicode == Britain and the Euro? In-Reply-To: <019e01c09421$c402bfc0$e46940d5@hagrid> Message-ID: >> The Python Reference Manual says (chapter 2, "Lexical analysis"): >> >> Python uses the 7-bit ASCII character set for program text and >> string literals. [/F] > ...and then says "8-bit characters may be used in string literals > ad comments but their interpretation is platform dependent". > > for a non-ASCII programmer, that pretty much means "no native > character set". Absolutely. That's why the Ref Man also says: the proper way to insert 8-bit characters in string literals is by using octal or hexadecimal escape sequences Note too that Python opens Python source files in C text mode, and C doesn't guarantee that high-bit characters can be faithfully written to or read back from text-mode files either. What's the point? As I said before, the *intent* was that Python source code use 7-bit ASCII. All we're demonstrating here is the various ways in which the Ref Man is consistent with that intent. Go beyond that, and if "it works" you're seeing a platform accident, albeit a reliable accident on the major Python platforms. From paulp@ActiveState.com Sun Feb 11 21:49:38 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 13:49:38 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <14982.41461.889514.547839@anthem.wooz.org> Message-ID: <3A8708F2.669B0A2C@ActiveState.com> "Barry A. Warsaw" wrote: > > ... 
> > I've barely followed this discussion at all, but what you say here > causes my greatest nagging concern to bubble to the surface. I write > lots of programs for which i18n isn't a requirement, and may never be. > It seems like you saying that you want me to have to confront issues > like encodings, character sets, unicode, multiplicity of string types, > etc. in even the simplest, most xenophobic programs I write. That > would be, IMO, a loss of epic proportions to the simplicity and "brain > fitting" nature of Python. file = open("/etc/passwd", "r", "ASCII") Surely that is not such a terrible burden in the interests of making the world a little bit less xenophobic! Once you do that, everything else "just works" and when your program encounters data it can't handle in a text file it will crash in a predictable way at a logical point (the read function) instead of in an unpredictable way at an illogical point (some random string coercion or API call). Paul Prescod From paulp@ActiveState.com Sun Feb 11 21:57:09 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 13:57:09 -0800 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model References: <013c01c09417$e11c49a0$e46940d5@hagrid> Message-ID: <3A870AB5.AE1BE6A2@ActiveState.com> Fredrik Lundh wrote: > > > I really like the idea of the > > > > b"..." prefix > > > > Is anyone opposed? > > yes. Could you please describe your problem? We almost had total agreement on this feature. It was a near miracle! As you probably know, the idea behind it is to allow people to continue to put binary data (especially native encoding data) in some form of string literal and to manipulate that data as binary "automatically." > almost: it shouldn't implement text-related method. isupper, upper, > etc doesn't make sense here. Agree. > (but like in SRE, the *source* code should be reused) Agree. Paul Prescod From paulp@ActiveState.com Sun Feb 11 22:05:00 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 14:05:00 -0800 Subject: [I18n-sig] Re: Python and Unicode == Britain and the Euro? References: Message-ID: <3A870C8C.78066BC@ActiveState.com> Tim Peters wrote: > > ... > > What's the point? As I said before, the *intent* was that Python source > code use 7-bit ASCII. All we're demonstrating here is the various ways in > which the Ref Man is consistent with that intent. Go beyond that, and if > "it works" you're seeing a platform accident, albeit a reliable accident on > the major Python platforms. I still don't understand the point... It's like saying that Vancouver doesn't need drug rehab clinics because drugs are illegal here. We can't move to all-ASCII text files at this point even if we have legal/historical justifications for doing so. The best we can do is try to limit the damage of having the non-ASCII stuff floating around without labels. Paul Prescod From tim.one@home.com Sun Feb 11 22:04:32 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 17:04:32 -0500 Subject: [I18n-sig] Re: Strawman Proposal (2): Encoding attributes In-Reply-To: <018801c0941f$4ce42110$e46940d5@hagrid> Message-ID: [Tim] > "Forced" presumes it's against their will. That's what I > question. There is nothing more Eurocentric than to embark > on unilateral crusades for the purported benefit of non- > Euros who aren't asking for help <0.7 wink>. [/F] > if you think that ASCII is good enough for european languages, No, but programming language identifiers are an artificial language. 
Python isn't it itself a word processor, and you may as well complain that Python requires "." in numeric literals (rather than ",", or an American Indian glyph meaning "sacred fork between the mighty Integer and Fractional rivers" <0.9 wink>). > or that europeans like having to use an approximation of their > own language Ditto. > just because american programmers are lazy, It's really that Euros are too lazy to learn English . > I'm not sure you should be on this list at all . Unclear whether you're arguing to allow full Unicode in Python identifiers (which is all I'm talking about). You really want getattr() to sort out Unicode in full generality (thinking specifically of "ignorable" characters -- if you don't ignore them, you're screwing somebody else's native tongue) at runtime? I don't want to see Python get anywhere that mess. If you're implementing a word processor *in* Python, fine, you can deal with it and Python should support you. It doesn't need to complicate its own artificial language to do so. From paulp@ActiveState.com Sun Feb 11 22:21:18 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 14:21:18 -0800 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model References: <014d01c09419$07dfbad0$e46940d5@hagrid> Message-ID: <3A87105E.BE3F41D5@ActiveState.com> Fredrik Lundh wrote: > > ... > > a more pragmatic approach would be to assume ASCII en- > codings for binary data, and choke on non-ASCII chars. > > >>> "1" + b"2" > 12 > >>> "1" + buffer("2") > 12 > >>> "1" + b"\xff" > ValueError: ASCII decoding error: ordinal not in range(128) I think that that is the most consistent approach. We should define a "string type" as one that has compatible with the regular expression engine, has some defined set of string-like methods and allows conversion of ordinals less than 128 according to ASCII rules. Paul Prescod From paulp@ActiveState.com Sun Feb 11 22:24:21 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 14:24:21 -0800 Subject: [I18n-sig] Re: Strawman Proposal: Smart String Test References: <018a01c0941f$4d9b8a30$e46940d5@hagrid> Message-ID: <3A871115.FF91B436@ActiveState.com> Fredrik Lundh wrote: > > ... > > isinstance(foo, type("")) > > I think it's okay only the latter works, for now (which can > be solved by a simple and stupid hack, while waiting for a > real type hierarchy...) I have two concerns. First I'm not thrilled with having isinstance have specific knowledge of string types. People will ask us: "How do I set up a type hierarchy like the string hierarchy?" And they can't...an isstring() function is clear about the fact that it is special. My second concern is that this might break a little bit of code. For instance something like this: if issinstance(foo, type("")): print foo elif issinstance(foo, type(u"")): print foo.encode("UTF-8") Paul Prescod From paulp@ActiveState.com Sun Feb 11 22:31:12 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 14:31:12 -0800 Subject: [I18n-sig] Re: Strawman Proposal: Encoding Declaration V2 References: <018b01c0941f$4dd733a0$e46940d5@hagrid> Message-ID: <3A8712B0.E45A503D@ActiveState.com> Fredrik Lundh wrote: > > > A source file with an encoding declaration must only use non-ASCII bytes > > in places that can legally support Unicode characters. In Python 2.x the > > only place is within a Unicode literal > > make that "in a string literal". Yes, I think you're right. 
If a person needs to get at a Latin 1 character in a string literal they should be able to do so using > if an encoding directive is present, the *entire* file should be > assumed to use that encoding. this applies to comments, 8-bit > string literals, and 16-bit string literals. I've backed off somewhat on having the file be pre-decoded in the short term. My major conceptual problem is if we decode to Unicode-escaped ASCII or something then we mess up the column numbers and the syntax errors will not be right. We might really need to have a Unicode-aware parser before we can do this... Paul Prescod From paulp@ActiveState.com Sun Feb 11 22:35:42 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 14:35:42 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <3A868FF3.45EEC501@lemburg.com> Message-ID: <3A8713BE.9AC80542@ActiveState.com> "M.-A. Lemburg" wrote: > > [Paul, it would help if you wouldn't always remove important parts > of the quoted messages... people who don't read the whole thread > won't have a chance to follow up] I think we have different interpretations of important... > Paul Prescod wrote: > > > > ... > > There is an important reason that we did not use a keyword argument. > > > > We (at least some subset of the people in the i18n-sig) want every > > single new instance of the "open" function to declare an encoding. > > This doesn't make sense: not all uses of open() target text > information. What encoding information would you put into an > open() which wants to read a JPEG image from a file ? "binary" or "raw" > f = open(filename, 'w', encoding='mbcs') > > There's a little more typing required, but the readability is > unbeatable... Why not go all the way: f = open(filename=filename, mode='w', encoding='mbcs') Keyword attributes are great for optional parameters. I don't see encoding as optional. Anyhow, I like Fredrick's idea of extending the mode string. Paul Prescod From tim.one@home.com Sun Feb 11 22:42:22 2001 From: tim.one@home.com (Tim Peters) Date: Sun, 11 Feb 2001 17:42:22 -0500 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes In-Reply-To: <3A865F6A.6CC12CC3@ActiveState.com> Message-ID: [Tim, continues to question whether Unicode identifiers are market-driven or head-driven] [Paul Prescod] > We will see. Before Unicode it would have been very hard to do this and > yet achieve source code portability between systems. Unicode and the > tools and languages that use it are just being deployed. Java has supported Unicode identifiers since its start, and is far more widely used than Python. 
If you can't find supporting evidence of actual user demand there (I failed to) ... > ... > But I'll say again that I think it would be a big mistake to add > any further impedements to getting there. Who has proposed adding an impediment? If someone did, I missed it. >> it's-a-programming-language-not-a-word-processor-ly y'rs - tim > I don't understand your fundamental point. Simplicity. I like the ECMAScript (nee JavaScript) rule: identifiers are Unicode. But only a subset of the first 128 Unicode characters are allowed <0.9 wink>. > ... > I'm not claiming it's a burning need, but I don't see why a Japanese > teenager learning to program for the first time would choose to use a > language that requires English variable names over one that offered > choice. Try asking one? For example, ask Yukihiro Matsumoto why Ruby's set of allowed identifiers is the same as Python's. If a Japanese language designer sees no need to support Japanese identifiers, I'm not going to presume I know Japanese programmer needs better than him -- or that you do either. > ... > Where's Greg Wilson when I need him? Doubt he's on this SIG; mailto:gvwilson@nevex.com. From mal@lemburg.com Sun Feb 11 22:47:46 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 11 Feb 2001 23:47:46 +0100 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <3A868FF3.45EEC501@lemburg.com> <3A8713BE.9AC80542@ActiveState.com> Message-ID: <3A871692.3370A468@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > > Paul Prescod wrote: > > > > > > ... > > > There is an important reason that we did not use a keyword argument. > > > > > > We (at least some subset of the people in the i18n-sig) want every > > > single new instance of the "open" function to declare an encoding. > > > > This doesn't make sense: not all uses of open() target text > > information. What encoding information would you put into an > > open() which wants to read a JPEG image from a file ? > > "binary" or "raw" I'm -1 on enforcing this. Encoding is optional and has to be, since 1. existing programs don't provide the parameter and would break 2. the user can't know in advance if the file to be opened is of a certain encoding or type (e.g. image or sound file) 3. not all files contain encoded data for which Python provides a codec for decoding 4. it may not be in the programers intent to have the file decoded even though it uses a certain encoding > > f = open(filename, 'w', encoding='mbcs') > > > > There's a little more typing required, but the readability is > > unbeatable... > > Why not go all the way: > > f = open(filename=filename, mode='w', encoding='mbcs') Now you're being silly... 
> Keyword attributes are great for optional parameters. I don't see > encoding as optional. Anyhow, I like Fredrick's idea of extending the > mode string. I'm not sure I like it -- it looks like a hack to me and I don't really see what's so bad about an optional keyword argument for open(). At least noone has yet convinced me of any problems with it. Note that Fredrik's idea doesn't make the encoding parameter a requirement either (and this is Goodness). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From paulp@ActiveState.com Sun Feb 11 22:56:22 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Sun, 11 Feb 2001 14:56:22 -0800 Subject: [I18n-sig] Strawman Proposal (2): Encoding attributes References: Message-ID: <3A871896.2246587B@ActiveState.com> Tim Peters wrote: > > ... > > Java has supported Unicode identifiers since its start, and is far more > widely used than Python. If you can't find supporting evidence of actual > user demand there (I failed to) ... Java is a programming language for professional programmers. They think it is natural to compare two strings with the "isequal" method. Anyone in that mindset would find romanji natural too! > > ... > > But I'll say again that I think it would be a big mistake to add > > any further impedements to getting there. > > Who has proposed adding an impediment? If someone did, I missed it. There was a suggestion of having the encoding declaration only apply to unicode strings. Special characters in comments and literal strings would be interpreted as Latin 1. Now several years from now we'd have to invent another encoding declaration for non-string stuff. > Try asking one? For example, ask Yukihiro Matsumoto why Ruby's set of > allowed identifiers is the same as Python's. If a Japanese language > designer sees no need to support Japanese identifiers, I'm not going to > presume I know Japanese programmer needs better than him -- or that you do > either. I don't presume to know what they want but I do know that people's needs change and anticipating that is part of systems design in general and language design in particular. > > ... > > Where's Greg Wilson when I need him? > > Doubt he's on this SIG; mailto:gvwilson@nevex.com. Twas more of a joke... Paul Prescod From brian@tomigaya.shibuya.tokyo.jp Mon Feb 12 01:27:23 2001 From: brian@tomigaya.shibuya.tokyo.jp (Brian Takashi Hooper) Date: Mon, 12 Feb 2001 10:27:23 +0900 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? In-Reply-To: <3A86A2CC.BB64149B@lemburg.com> References: <20010211140545.49DF.BRIAN@tomigaya.shibuya.tokyo.jp> <3A86A2CC.BB64149B@lemburg.com> Message-ID: <20010212100732.81C6.BRIAN@tomigaya.shibuya.tokyo.jp> Thanks for the clarifications, Marc-Andre. I have no problem with following new conventions, when they are decided upon, for new programs - I just don't want old programs to break _too_ much; measures like the encoding directives, if they are implemented properly, are needed I feel to ease a transition to a new paradigm. Also, I hadn't yet read Paul's "Guidelines for Language Evolution" PEP, which I just now _did_ read, if the change is gradual and provides warning messages for deprecated constructs, then that makes this proposal seem less scary (does this mean that it might also be time to start thinking about the workings of a "deprecation and warning facility" as described in that document, also?) 
--Brian On Sun, 11 Feb 2001 15:33:48 +0100 "M.-A. Lemburg" wrote: > Brian Takashi Hooper wrote: > > > > Hi there, Brian in Tokyo again, > > > > On Sat, 10 Feb 2001 11:17:19 -0800 > > Paul Prescod wrote: > > > > > Andy, I think that part of the reason that Westerners push harder for > > > Unicode than Japanese is because we are pressured (rightly) to right > > > software that works world-wide and it is simply not sane to try to do > > > that by supporting multiple character sets. Multiple encodings maybe. > > > Multiple character sets? Forget it. > > I think this is a true and valid point (that Westerners are more likely > > to want to make internationalized software), but it sounds here like > > because Westerners want to make it easier to internationalize software, > > that that is a valid reason to make it harder to make software that has > > no particular need for internationalization, in non-Western languages, > > and change the _meaning_ of such a basic data type as the Python string. > > > > If in fact, as the proposal proposes, usage of open() without an > > encoding, for example, is at some point deprecated, then if I am > > manipulating non-Unicode data in "" strings, then I think I _do_ at some > > point have to port them over. b"" then becomes > > different from "", because "" > > is now automatically being interpreted behind the scenes into an > > internal Unicode representation. If the blob of binary data actually > > happened to be in Unicode, or some Unicode-favored representation (like > > UTF-8), then I might be happy about this - but if it wasn't, I think > > that this result would instead be rather dismaying. > > We are certainly not goind to make the encoding parameter > mandatory for open(). What type the .read() method returns for > a file opened using an encoding is dependent on the codec in > use, e.g. a Unicode codec would return Unicod, but other codecs > may choose to return an encoded 8-bit string instead (with encoding > attribute set accordingly). > > There's still much to do down that road and I wouldn't take the > current proposals too seriously yet. We are still in the idea > gathering phase... > > > The current Unicode support is more explicit about this - the meaning of > > the string literal itself has not changed, so I can continue to ignore > > Unicode in cases where it serves no useful purpose. I realize that it > > would be nicer from a design perspective, more consistent, to have > > Python string mean only character data, but right now, it does sometimes > > mean binary and sometimes mean characters. The only one who can > > distinguish which is the programmer - if at some point "" means only > > Unicode character strings, then the programmer _does_, I think, have to > > go through all their programs looking for places where they are using > > strings to hold non-Unicode character data, or binary data, and > > explicitly convert them over. I have difficulty seeing how we would be > > able to provide a smooth upgrade path - maybe a command-line backwards > > compatibility option? Maybe defaults? I've heard a lot of people > > voicing dislike for default encodings, but from my perspective, > > something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are, > > strictly speaking, not supersets of ASCII because the ASCII ranges are > > usually interpreted as JIS-Roman, which contains about 4 different > > characters) is functionally a default encoding... 
Requiring encoding > > declarations, as the proposal suggests, is nice for people working in > > the i18n domain, but is an unnecessary inconvenience for those who are > > not. > > First, I think that most string literals in programs are > > in fact text data, so switching to a text data type for "" > > wouldn't be such a big change. For those few cases, where > > these literals are used for binary data, switching to b"" > > doesn't really hurt. > > > > Of course, the programmer will have to rethink text vs. binary > > data, but this is what we are aiming at after all. > > > > Since this step can be too much of a burden for the programmer, > > we'll have to come up with a way which allows Python to maintain the > > old style behaviour, e.g. by telling Python to use a codec which > > returns a normal 8-bit string object instead of Unicode... > > > > #?encoding="old-style-strings" > > > > at the top of the source code would then do the trick. > > > > -- > > Marc-Andre Lemburg > > ______________________________________________________________________ > > Company: http://www.egenix.com/ > > Consulting: http://www.lemburg.com/ > > Python Pages: http://www.lemburg.com/python/ -- Brian Takashi Hooper From tdickenson@geminidataloggers.com Mon Feb 12 08:01:58 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Mon, 12 Feb 2001 08:01:58 +0000 Subject: [I18n-sig] Strawman Proposal: Smart String Test In-Reply-To: References: <4o588t4683cpu32srcmp928b1m5dr003i3@4ax.com> Message-ID: On Fri, 9 Feb 2001 08:40:28 -0800 (PST), Paul Prescod wrote: >On Fri, 9 Feb 2001, Toby Dickenson wrote: >> Paul Prescod wrote: >> >Is there a practical problem with this solution? >> >> def isstring(obj): >> return type(obj) in (StringType, UnicodeType) or isinstance(obj, >> UserString) > >Are you saying that there is a problem with isstring? Or proposing a >slightly different formulation? At the moment we don't have a tight definition of the 'string interface'. While I think we can agree that old code which uses type(x)==StringType is probably wrong, I'm not sure we can agree what that code should be using without examining that code. Note that several similar interface-testing functions are very rarely used (operator.isNumberType, operator.isMappingType), and Python doesn't have functions for other more popular interfaces (no isFileType, for example). Toby Dickenson tdickenson@geminidataloggers.com From tim.one@home.com Mon Feb 12 08:10:01 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 12 Feb 2001 03:10:01 -0500 Subject: [I18n-sig] RE: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <01a401c090fd$5165b700$0900a8c0@SPIFF> Message-ID: [Neil Hodgson] > Matz: "We don't believe there can be any single character > encoding that encompasses all the world's languages. We want > to handle multiple encodings at the same time (if you want to). [/F] > neither do the unicode designers, of course: the point > is that unicode only deals with glyphs, not languages. > > most existing japanese encodings also include language info, > and if you don't understand the difference, it's easy to think > that unicode sucks... It would be helpful to read Matz's quote in context: http://www.deja.com/getdoc.xp?AN=705520466&fmt=text The "encompasses all the world's languages" business was taken verbatim from the question to which he was replying.
His concerns for Unicoded Japanese are about time efficiency for conversions from ubiquitous national encodings; relative (lack of) space efficiency for UTF-8 storage of Unicoded Japanese (unclear why he's hung up on UTF-8, though -- but it's an ongoing theme in c.l.ruby); and that Unicode (including surrogates) is too small and too late for parts of his market: I was thinking of applications that process big character set (e.g. Mojikyo set) which is not covered by Unicode. I don't know exactly how many code points it has. But I've heard it's pretty big, possibly consumes half of surrogate space. And they want to process them now. I think they don't want to wait Unicode consortium to assign code points for their characters. The first hit I found on Mojikyo was for a freely downloadable "Mojikyo Font Set", containing about 50,000 Chinese glyphs beyond those covered by Unicode, + about 20,000 more from other Asian languages. Python better move fast lest it lose the Oracle Bone market to Ruby . a-2-byte-encoding-space-was-too-small-the-day-unicode-was-conceived- and-20-bits-won't-last-either-ly y'rs - tim From tdickenson@geminidataloggers.com Mon Feb 12 08:11:55 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Mon, 12 Feb 2001 08:11:55 +0000 Subject: [I18n-sig] Strawman Proposal: Binary Strings In-Reply-To: <3A85568A.5B694917@lemburg.com> References: <3A830091.3D855EDD@ActiveState.com> <3A85568A.5B694917@lemburg.com> Message-ID: On Sat, 10 Feb 2001 15:56:10 +0100, "M.-A. Lemburg" wrote: >Note that changing e.g. .encode('latin-1') to return a binary string >doesn't really make sense, since here we know the encoding ! Instead, >strings should probably carry along the encoding information in an >additional attribute (it is not always useful, but can help in >a few situations) provided that it is known. To what use would that encoding attribute be put? surely not to provide automatic encoding when these tagged strings interact with unicode strings (Thats back towards the solution that I think we already ruled out) If .encode('latin1') or .encode('utf8') are going to return anything tagged with an encoding, then surely it should be a tagged binary string? Toby Dickenson tdickenson@geminidataloggers.com From andy@reportlab.com Mon Feb 12 08:19:09 2001 From: andy@reportlab.com (Andy Robinson) Date: Mon, 12 Feb 2001 08:19:09 -0000 Subject: [I18n-sig] Random thoughts on Unicode and Python In-Reply-To: <3A86922D.AB5AB78E@lemburg.com> Message-ID: > -----Original Message----- > From: M.-A. Lemburg [mailto:mal@lemburg.com] > Sent: 11 February 2001 13:23 > To: tree@basistech.com > Cc: Andy Robinson; i18n-sig@python.org > Subject: Re: [I18n-sig] Random thoughts on Unicode and Python > > > Tom Emerson wrote: > > > > Andy Robinson writes: > > > (1) user defined characters: the big three Japanese encodings > > > use the Kuten space of 94x94 characters. There are lots > of slight > > > venddor variations on the basic JIS0208 character set, as well > > > as people adding new Gaiji in their office workgroups. Generic > > > conversion routines from, say, EUC to Shift-JIS still work > > > perfectly whether you use Shift-JIS, cp932, or cp932 plus > > > ten extra in-house characters. Conversions to Unicode involve > > > selecting new codecs, or even making new ones, for all these > > > situations. 
> > > > There is no reason that we couldn't provide a set of > unified codecs > > for EUC-JP, Shift JIS, ISO-2022-JP, and CP932 that > provide appropriate > > mappings between the EUDC sections in the legacy > character sets and > > the PUA of Unicode, such that these conversions work. > > Right. Exactly. Both the problems I mentioned can and should be solved properly with Unicode. I'm just noting that a while bunch of people have solved them without Unicode in the past and that's where to look for code that will break. - Andy p.s. and yes, I'm working on those extended codecs now. From mal@lemburg.com Mon Feb 12 10:24:32 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 12 Feb 2001 11:24:32 +0100 Subject: [I18n-sig] Python and Unicode == Britain and the Euro? References: <20010211140545.49DF.BRIAN@tomigaya.shibuya.tokyo.jp> <3A86A2CC.BB64149B@lemburg.com> <20010212100732.81C6.BRIAN@tomigaya.shibuya.tokyo.jp> Message-ID: <3A87B9E0.D9A2A598@lemburg.com> Brian Takashi Hooper wrote: > > Thanks for the clarifications, Marc-Andre. > > I have no problem with following new conventions, when they are decided > upon, for new programs - I just don't want old programs to break _too_ much; > measures like the encoding directives, if they are implemented properly, > are needed I feel to ease a transition to a new paradigm. Good to have you back on board :-) > Also, I hadn't yet read Paul's "Guidelines for Language Evolution" PEP, > which I just now _did_ read, if the change is gradual and provides > warning messages for deprecated constructs, then that makes this > proposal seem less scary (does this mean that it might also be time to > start thinking about the workings of a "deprecation and warning facility" > as described in that document, also?) Right. The warning facility is already in place in 2.1: Guido added a complete warning framework which is currently used to warn about deprecated module usage like e.g. regex, regsub, etc. > --Brian > > On Sun, 11 Feb 2001 15:33:48 +0100 > "M.-A. Lemburg" wrote: > > > Brian Takashi Hooper wrote: > > > > > > Hi there, Brian in Tokyo again, > > > > > > On Sat, 10 Feb 2001 11:17:19 -0800 > > > Paul Prescod wrote: > > > > > > > Andy, I think that part of the reason that Westerners push harder for > > > > Unicode than Japanese is because we are pressured (rightly) to right > > > > software that works world-wide and it is simply not sane to try to do > > > > that by supporting multiple character sets. Multiple encodings maybe. > > > > Multiple character sets? Forget it. > > > I think this is a true and valid point (that Westerners are more likely > > > to want to make internationalized software), but it sounds here like > > > because Westerners want to make it easier to internationalize software, > > > that that is a valid reason to make it harder to make software that has > > > no particular need for internationalization, in non-Western languages, > > > and change the _meaning_ of such a basic data type as the Python string. > > > > > > If in fact, as the proposal proposes, usage of open() without an > > > encoding, for example, is at some point deprecated, then if I am > > > manipulating non-Unicode data in "" strings, then I think I _do_ at some > > > point have to port them over. b"" then becomes > > > different from "", because "" > > > is now automatically being interpreted behind the scenes into an > > > internal Unicode representation. 
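As a hedged illustration of how that framework is used (the message text and module name below are only examples, not the exact code shipped with 2.1):

import warnings

# A deprecated module can flag its own use on import:
warnings.warn("the regsub module is deprecated; use the re module instead",
              DeprecationWarning)

# Code that is not ready to migrate can silence the category explicitly:
warnings.filterwarnings("ignore", category=DeprecationWarning)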
If the blob of binary data actually > > > happened to be in Unicode, or some Unicode-favored representation (like > > > UTF-8), then I might be happy about this - but if it wasn't, I think > > > that this result would instead be rather dismaying. > > > > We are certainly not goind to make the encoding parameter > > mandatory for open(). What type the .read() method returns for > > a file opened using an encoding is dependent on the codec in > > use, e.g. a Unicode codec would return Unicod, but other codecs > > may choose to return an encoded 8-bit string instead (with encoding > > attribute set accordingly). > > > > There's still much to do down that road and I wouldn't take the > > current proposals too seriously yet. We are still in the idea > > gathering phase... > > > > > The current Unicode support is more explicit about this - the meaning of > > > the string literal itself has not changed, so I can continue to ignore > > > Unicode in cases where it serves no useful purpose. I realize that it > > > would be nicer from a design perspective, more consistent, to have > > > Python string mean only character data, but right now, it does sometimes > > > mean binary and sometimes mean characters. The only one who can > > > distinguish which is the programmer - if at some point "" means only > > > Unicode character strings, then the programmer _does_, I think, have to > > > go through all their programs looking for places where they are using > > > strings to hold non-Unicode character data, or binary data, and > > > explicitly convert them over. I have difficulty seeing how we would be > > > able to provide a smooth upgrade path - maybe a command-line backwards > > > compatibility option? Maybe defaults? I've heard a lot of people > > > voicing dislike for default encodings, but from my perspective, > > > something like ISO-Latin-1, or UTF-8, or even ASCII (EUC-JP and SJIS are, > > > strictly speaking, not supersets of ASCII because the ASCII ranges are > > > usually interpreted as JIS-Roman, which contains about 4 different > > > characters) is functionally a default encoding... Requiring encoding > > > declarations, as the proposal suggests, is nice for people working in > > > the i18n domain, but is an unnecessary inconvenience for those who are > > > not. > > > > First, I think that most string literals in programs are > > in fact text data, so switching to a text data type for "" > > wouldn't be such a big change. For those few cases, where > > these literals are used for binary data, switching to b"" > > doesn't really hurt. > > > > Of course, the programmer will have to rethink text vs. binary > > data, but this is what we are aiming at after all. > > > > Since this step can be too much of a burden for the programmer, > > we'll have to come up with a way which allows Python to maintain the > > old style behaviour, e.g. by telling Python to use a codec which > > returns a normal 8-bit string object instead of Unicode... > > > > #?encoding="old-style-strings" > > > > at the top of the source code would then do the trick. 
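(As a side note on the point above about what .read() returns: for Unicode codecs the codecs module already behaves this way today; a minimal sketch, assuming a UTF-8 encoded file -- the file name is invented:)

    import codecs

    f = codecs.open('data.txt', 'r', 'utf-8')   # wraps the file in a stream reader
    u = f.read()                                # a unicode object, decoded from UTF-8
    f.close()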
> > > > -- > > Marc-Andre Lemburg > > ______________________________________________________________________ > > Company: http://www.egenix.com/ > > Consulting: http://www.lemburg.com/ > > Python Pages: http://www.lemburg.com/python/ > > -- > Brian Takashi Hooper -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Mon Feb 12 10:39:13 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 12 Feb 2001 11:39:13 +0100 Subject: [I18n-sig] RE: [Python-Dev] Pre-PEP: Python Character Model References: Message-ID: <3A87BD51.3088DACA@lemburg.com> Tim Peters wrote: > > [Neil Hodgson] > > Matz: "We don't believe there can be any single characer- > > encoding that encompasses all the world's languages. We want > > to handle multiple encodings at the same time (if you want to). > > [/F] > > neither does the unicode designers, of course: the point > > is that unicode only deals with glyphs, not languages. > > > > most existing japanese encodings also include language info, > > and if you don't understand the difference, it's easy to think > > that unicode sucks... > > It would be helpful to read Matz's quote in context: > > http://www.deja.com/getdoc.xp?AN=705520466&fmt=text > > The "encompasses all the world's languages" business was taken verbatim from > the question to which he was replying. His concerns for Unicoded Japanese > are about time efficiency for conversions from ubiquitous national > encodings; relative (lack of) space efficiency for UTF-8 storage of Unicoded > Japanese (unclear why he's hung up on UTF-8, though -- but it's an ongoing > theme in c.l.ruby); and that Unicode (including surrogates) is too small and > too late for parts of his market: > > I was thinking of applications that process big character > set (e.g. Mojikyo set) which is not covered by Unicode. I > don't know exactly how many code points it has. But I've > heard it's pretty big, possibly consumes half of surrogate > space. And they want to process them now. I think they > don't want to wait Unicode consortium to assign code points > for their characters. > > The first hit I found on Mojikyo was for a freely downloadable "Mojikyo Font > Set", containing about 50,000 Chinese glyphs beyond those covered by > Unicode, + about 20,000 more from other Asian languages. Python better move > fast lest it lose the Oracle Bone market to Ruby . > > a-2-byte-encoding-space-was-too-small-the-day-unicode-was-conceived- > and-20-bits-won't-last-either-ly y'rs - tim Has anyone ever considered the problems this causes for type designers ? Who is going to do the job of designing 2^20 character glyphs to all match the same font design guidelines ? Perhaps I'm missing something here, but this sounds like Just is going to have a bright future ;-) -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Mon Feb 12 10:53:55 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 12 Feb 2001 11:53:55 +0100 Subject: [I18n-sig] string encoding attribute (Strawman Proposal: Binary Strings) References: <3A830091.3D855EDD@ActiveState.com> <3A85568A.5B694917@lemburg.com> Message-ID: <3A87C0C3.D27F6FF8@lemburg.com> Toby Dickenson wrote: > > On Sat, 10 Feb 2001 15:56:10 +0100, "M.-A. 
Lemburg" > wrote: > > >Note that changing e.g. .encode('latin-1') to return a binary string > >doesn't really make sense, since here we know the encoding ! Instead, > >strings should probably carry along the encoding information in an > >additional attribute (it is not always useful, but can help in > >a few situations) provided that it is known. > > To what use would that encoding attribute be put? The lack of encoding information is the cause of all the problems related to coercing 8-bit strings to Unicode. If we had this information on a per-string basis, then we could do a *much* better job. > surely not to > provide automatic encoding when these tagged strings interact with > unicode strings (Thats back towards the solution that I think we > already ruled out) Depends on who "we" is ;-) I believe that we should reconsider the idea on different grounds. Back when this was discussed on python-dev, the main argument against adding such an attribute was that the its value would be coerced to 'binary' much too fast to be of any value. That was certainly true at the time, but the current ideas tossed around on this list suggest that we are moving towards a clearer distinction between binary and text data. In the current context, the attribute could well be used to avoid using magic when it comes to guessing the encoding of 8-bit strings at coercion time. > If .encode('latin1') or .encode('utf8') are going to return anything > tagged with an encoding, then surely it should be a tagged binary > string? No. The encoding attribute would then return 'latin-1' and 'utf-8' resp. -- that's the point of the attribute: it should store the encoding information in case it is available. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From andy@reportlab.com Mon Feb 12 11:06:25 2001 From: andy@reportlab.com (Andy Robinson) Date: Mon, 12 Feb 2001 11:06:25 -0000 Subject: [I18n-sig] RE: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A87BD51.3088DACA@lemburg.com> Message-ID: > a-2-byte-encoding-space-was-too-small-the-day-unicode-was-conceived- > > and-20-bits-won't-last-either-ly y'rs - tim > > Has anyone ever considered the problems this causes for type > designers ? Who is going to do the job of designing 2^20 character > glyphs to all match the same font design guidelines ? Perhaps > I'm missing something here, but this sounds like Just is going > to have a bright future ;-) > Work has been going on on this glyph set for many years. And the font vendors for Japan can charge VERY high prices. Needless to say they are not big fans of Open Source. - Andy From walter@livinglogic.de Mon Feb 12 11:27:16 2001 From: walter@livinglogic.de (=?us-ascii?Q?=22Walter_D=F6rwald=22?=) Date: Mon, 12 Feb 2001 12:27:16 +0100 Subject: [I18n-sig] Random thoughts on Unicode and Python In-Reply-To: <3A85B8F9.1F494BF8@lemburg.com> References: <14981.45051.945099.633730@cymru.basistech.com> <3A85B8F9.1F494BF8@lemburg.com> Message-ID: <200102121227160015.004DA31F@mail.livinglogic.de> On 10.02.01 at 22:56 M.-A. Lemburg wrote: > [...] > We are trying to tell people that storing text data is better > done in Unicode than in a raw data buffer like Python's current > string data type. 
It's not enough to tell people, you actually have to make sure that storing unicode text data is better and more convenient than plain old strings; this means that Unicode text must be usable in: open(u"foo.txt") urllib.open(u"foo.txt") s = eval(u"\"\\u3042\"") exec(u"s = \"\\u3042\"") os.stat(u"foo.txt") os.system(u"foo -x \u3042") os.popen2(u"foo -x \u3042",u"r") and thousands of others. I think that the first step should be to make Unicode usable everywhere. As a first step this can be done by converting to the default encoding internally (as e.g. eval and exec do now). There may be OS services (e.g. file i/o) that are not Unicode aware. For these services converting to the default encoding is all that can be done, but when the OS supports Unicode, it should be used (for example Unicode filenames on NT/2000). The next step should be to switch to Unicode internally, i.e. use Unicode for Python variable names, module names, source code, etc. > [...] Just my $0.02! Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From tdickenson@geminidataloggers.com Mon Feb 12 13:28:29 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Mon, 12 Feb 2001 13:28:29 -0000 Subject: [I18n-sig] string encoding attribute (Strawman Proposal: Binary Strings) Message-ID: <9FC702711D39D3118D4900902778ADC83244AA@JUPITER> > > If .encode('latin1') or .encode('utf8') are going to return anything > > tagged with an encoding, then surely it should be a tagged binary > > string? > > No. The encoding attribute would then return 'latin-1' and 'utf-8' > resp. -- that's the point of the attribute: it should store the > encoding information in case it is available. I think you misread. I said.... "A tagged binary string" not "Tagged as a binary string" In other words, at the moment I don't see much distinction between a binary string, and a text string tagged with an encoding. Indeed the only distinction is the presence of a tag. Is that sufficient distinction to make them different types? Why can't I tag a binary string to say it contains a jpeg? From barry@digicool.com Mon Feb 12 14:14:23 2001 From: barry@digicool.com (Barry A. Warsaw) Date: Mon, 12 Feb 2001 09:14:23 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <14982.41461.889514.547839@anthem.wooz.org> <3A8708F2.669B0A2C@ActiveState.com> Message-ID: <14983.61375.407675.822695@anthem.wooz.org> >>>>> "PP" == Paul Prescod writes: PP> file = open("/etc/passwd", "r", "ASCII") PP> Surely that is not such a terrible burden in the interests of PP> making the world a little bit less xenophobic!
Once you do PP> that, everything else "just works" and when your program PP> encounters data it can't handle in a text file it will crash PP> in a predictable way at a logical point (the read function) PP> instead of in an unpredictable way at an illogical point (some PP> random string coercion or API call). Requiring the encoding imposes too much burden on the newbie learning the language, IMHO. It seems obvious that if you're going to open something, you've got to specify what your opening (i.e. open() makes no sense without the filename parameter). I think you can easily explain the difference between opening for reading and opening for writing, although the myriad other mode options are pushing it (e.g. the difference b/w r+, w+, and a+ are quite subtle and not described sufficiently). Now to require the encoding either forces you to ask the user to trust you ("most of you will just want `ascii' for the encoding parameter, don't worry about what that means"), or to go into /some/ explanation of what encodings are, what the possible legal values are, what the difference between "ascii" and "raw" are and when you want to use them, what can happen if you misspell an encoding, how to guess the encoding of the file you're about to open, What can happen if you guess incorrectly, etc. etc. If you care about file encodings, you've got to learn all that anyway. Fine, but that's a heavy burden to place on a new convert. I'm convinced Guido felt that open() would be used very early on in a newbie's experience and wanted to make it as simple as possible. That's why it's a built-in. Other messages in this thread seem to agree that /if/ open() were to grow an encoding argument, it should be optional. -Barry From tree@basistech.com Mon Feb 12 14:38:46 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 12 Feb 2001 09:38:46 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: <14983.61375.407675.822695@anthem.wooz.org> References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <14982.41461.889514.547839@anthem.wooz.org> <3A8708F2.669B0A2C@ActiveState.com> <14983.61375.407675.822695@anthem.wooz.org> Message-ID: <14983.62838.503700.118150@cymru.basistech.com> barry@digicool.com writes: [...] > Other messages in this thread seem to agree that /if/ open() were to > grow an encoding argument, it should be optional. What if it were possible to specify the "default" encoding at configure time, while keeping the argument to open() optional. Ruby does this, as does MySQL, so there *is* precedent. -tree -- Tom Emerson Basis Technology Corp. 
Stringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From barry@digicool.com Mon Feb 12 14:57:20 2001 From: barry@digicool.com (Barry A. Warsaw) Date: Mon, 12 Feb 2001 09:57:20 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A7FD69C.1708339C@lemburg.com> <3A800DBC.2BE8ECEF@ActiveState.com> <3A8013BA.2FF93E8B@lemburg.com> <3A801E49.F8DF70E2@ActiveState.com> <200102062100.f16L0xm01175@mira.informatik.hu-berlin.de> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <14982.41461.889514.547839@anthem.wooz.org> <3A8708F2.669B0A2C@ActiveState.com> <14983.61375.407675.822695@anthem.wooz.org> <14983.62838.503700.118150@cymru.basistech.com> Message-ID: <14983.63952.207294.647978@anthem.wooz.org> >>>>> "TE" == Tom Emerson writes: TE> What if it were possible to specify the "default" encoding at TE> configure time, while keeping the argument to open() TE> optional. Ruby does this, as does MySQL, so there *is* TE> precedent. That's a little scary because then Python programs may cease to be portable. Moderately better would be an API to select the default encoding at runtime, but that's still worrisome. -Barry From mal@lemburg.com Mon Feb 12 16:15:21 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 12 Feb 2001 17:15:21 +0100 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A7F9084.509510B8@ActiveState.com> <3A808702.5FF36669@ActiveState.com> <200102070000.f1700BV02437@mira.informatik.hu-berlin.de> <3A80951E.DF725F03@ActiveState.com> <200102070732.f177WrV00930@mira.informatik.hu-berlin.de> <3A81AC7C.3FFE73E5@ActiveState.com> <200102080037.f180bul01609@mira.informatik.hu-berlin.de> <3A820CD2.25C3F978@ActiveState.com> <200102081929.f18JTaa00798@mira.informatik.hu-berlin.de> <3A82FD60.EFB38FAD@ActiveState.com> <200102082046.f18KkGC01420@mira.informatik.hu-berlin.de> <3A831110.6AADE590@ActiveState.com> <3A85BBC6.BBAA8D70@lemburg.com> <200102102223.RAA28498@cj20424-a.reston1.va.home.com> <3A860AA3.655F4207@ActiveState.com> <14982.41461.889514.547839@anthem.wooz.org> <3A8708F2.669B0A2C@ActiveState.com> <14983.61375.407675.822695@anthem.wooz.org> <14983.62838.503700.118150@cymru.basistech.com> <14983.63952.207294.647978@anthem.wooz.org> Message-ID: <3A880C19.7DD8FB80@lemburg.com> "Barry A. Warsaw" wrote: > > >>>>> "TE" == Tom Emerson writes: > > TE> What if it were possible to specify the "default" encoding at > TE> configure time, while keeping the argument to open() > TE> optional. Ruby does this, as does MySQL, so there *is* > TE> precedent. > > That's a little scary because then Python programs may cease to be > portable. Moderately better would be an API to select the default > encoding at runtime, but that's still worrisome. A default value for encoding wouldn't work, since not all files you open are text files. 
The only reasonable default for the encoding parameter is 'binary'. -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From tim.one@home.com Mon Feb 12 20:38:00 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 12 Feb 2001 15:38:00 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: <14983.61375.407675.822695@anthem.wooz.org> Message-ID: [Barry A. Warsaw] > ... > Requiring the encoding imposes too much burden on the newbie learning > the language, IMHO. Indeed it does, no matter how strongly an advocate may believe users "should" be aware of i18n issues. By the same token, you could get yourself into a world of trouble by coding x = float(y) + z unless you're careful to first specify the hardware rounding mode, values for the 5 IEEE-754 exception masks, and the HW precision control setting if you're running on a Pentium (also HW range control if running on Itanium). And, someday, Python will probably grow ways to specify all that stuff. If, at that time, I suggest everyone *must* specify them before doing any fp arithmetic, I hope someone has the good taste to just shoot me . BTW, Python should drop C's text mode, because it's feeble and ill-defined across platforms. just-thought-i'd-liven-it-up-ly y'rs - tim From paulp@ActiveState.com Mon Feb 12 21:26:24 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Mon, 12 Feb 2001 13:26:24 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: Message-ID: <3A885500.68F9D984@ActiveState.com> Tim Peters wrote: > > [Barry A. Warsaw] > > ... > > Requiring the encoding imposes too much burden on the newbie learning > > the language, IMHO. > > Indeed it does, no matter how strongly an advocate may believe users > "should" be aware of i18n issues. It has nothing to do with awareness of il8n issues. The fundamental question is whether you expect to get text back from a read() or binary. If you open with ASCII you get text coercions, text conversions and other text semantics. If you open with binary you don't. I do not think it too much to ask for people to know the difference between text and binary data! Paul Prescod From tim.one@home.com Mon Feb 12 22:52:14 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 12 Feb 2001 17:52:14 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: <3A885500.68F9D984@ActiveState.com> Message-ID: [Paul Prescod] > It has nothing to do with awareness of il8n issues. The fundamental > question is whether you expect to get text back from a read() or binary. C already addresses that distinction ("r" vs "rb" open modes). From paulp@ActiveState.com Tue Feb 13 00:17:11 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Mon, 12 Feb 2001 16:17:11 -0800 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: Message-ID: <3A887D07.220D5904@ActiveState.com> Tim Peters wrote: > > [Paul Prescod] > > It has nothing to do with awareness of il8n issues. The fundamental > > question is whether you expect to get text back from a read() or binary. > > C already addresses that distinction ("r" vs "rb" open modes). Python is documented as only using the distinction to handle line ends. We want to create totally different object types based on the flag. 
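(To make the shape of the disagreement concrete, here is a sketch of the behaviour being proposed in this thread -- the three-argument open() is the proposal, not current Python, and the file names are invented:)

    f = open('notes.txt', 'r', 'ascii')   # hypothetical encoding argument
    text = f.read()                       # would return a character (Unicode) string

    g = open('photo.jpg', 'rb')           # no encoding given: raw bytes
    data = g.read()                       # would return a byte string, as today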
Paul Prescod From tim.one@home.com Tue Feb 13 01:45:30 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 12 Feb 2001 20:45:30 -0500 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) In-Reply-To: <3A887D07.220D5904@ActiveState.com> Message-ID: [Paul Prescod] > It has nothing to do with awareness of il8n issues. The fundamental > question is whether you expect to get text back from a read() > or binary. [Tim] > C already addresses that distinction ("r" vs "rb" open modes). [Paul] > Python is documented as only using the distinction to handle line > ends. Where? Not in the open() docs. They're uselessly vague about the differences between 'r' and 'rb' (and don't mention line ends at all -- you're hallucinating that), because C is too and Python's semantics *are* C's here. Nevertheless, "When opening a binary file, you should append 'b' to the mode value for improved portability. (It's useful even on systems which don't treat binary and text files differently, where it serves as documentation.)". True enough, and good enough for newbies. Although, as I said before, I think Python should drop C's notion of text mode in favor of its own (because C's notion is wildly platform-dependent). > We want to create totally different object types based on the flag. Which flag? "b"? Fine by me -- but that's what I said at the start. From tim.one@home.com Tue Feb 13 04:05:24 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 12 Feb 2001 23:05:24 -0500 Subject: [I18n-sig] RE: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: <3A87BD51.3088DACA@lemburg.com> Message-ID: [Tim] >> a-2-byte-encoding-space-was-too-small-the-day-unicode-was-conceived- >> and-20-bits-won't-last-either-ly y'rs - tim [MAL] > Has anyone ever considered the problems this causes for type > designers ? LOL! I'm picturing Guido going back a few thousand years in his time machine, to civilization after civilization on the verge of literacy, asking "Haven't you foolish people ever considered the problems this silly picture-writing will cause for type designers someday? Now grow up and use 7-bit ASCII." . The same inconsiderate bastards made computer speech recognition a lot harder than it could have been, too. Not to mention computerized inter-language translation, and whether or not it's polite or a mortal offense to point with your foot. > Who is going to do the job of designing 2^20 character glyphs > to all match the same font design guidelines ? No problem -- at Earth's current population, we can assign about 5,000 people to work on each glyph . size-is-a-relative-thing-ly y'rs - tim From mal@lemburg.com Tue Feb 13 08:01:48 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 13 Feb 2001 09:01:48 +0100 Subject: [I18n-sig] Modified open() builtin (Re: Python Character Model) References: <3A887D07.220D5904@ActiveState.com> Message-ID: <3A88E9EC.EADC6A3A@lemburg.com> Paul Prescod wrote: > > Tim Peters wrote: > > > > [Paul Prescod] > > > It has nothing to do with awareness of il8n issues. The fundamental > > > question is whether you expect to get text back from a read() or binary. > > > > C already addresses that distinction ("r" vs "rb" open modes). > > Python is documented as only using the distinction to handle line ends. > We want to create totally different object types based on the flag. Two things: 1. the difference between "r" and "rb" only exists on some non-Unix platforms (e.g. Windows) 2. 
the codec decides which type of object to return for .read() -- this has nothing to do with the file mode, but instead is dependent on the encoding used, e.g. encoding='binary' would return a binary string, encoding='ascii' results in Unicode and encoding='pil-image' could produce a PIL image object... Paul, you ought to write up a PEP about this subject discussing all the different issues with adding more optional parameters (encoding and errors, possibly more) to open(). It should also include a discussion about the implications using an encoding would have w/r to the applications relying on getting a real file object from the builtin open(). -- Marc-Andre Lemburg ______________________________________________________________________ Company: http://www.egenix.com/ Consulting: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Thu Feb 15 18:49:22 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 15 Feb 2001 19:49:22 +0100 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model In-Reply-To: <013401c09416$881b0f40$e46940d5@hagrid> (fredrik@effbot.org) References: <013401c09416$881b0f40$e46940d5@hagrid> Message-ID: <200102151849.f1FInMt02227@mira.informatik.hu-berlin.de> > > However, matter-of-factually, you propose that ISO-8859-1 is the > > default encoding, as this is the encoding that is used when converting > > character strings to char* in the C API. I'd certainly call it a > > default. > > It's not an encoding. It's the subset of Unicode that you can store > in an 8-bit character. No, it is not *the* subset of Unicode that you can store in an 8-bit character. You can store any subset of Unicode with a cardinality <256 in a single octet. Latin-1 is group 0, plane 0, row 0. Why is it any better than any other plane or row? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Feb 15 18:39:44 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 15 Feb 2001 19:39:44 +0100 Subject: [I18n-sig] Random thoughts on Unicode and Python In-Reply-To: References: Message-ID: <200102151839.f1FIdi002179@mira.informatik.hu-berlin.de> > That's my concern, and the thing I want to poll people on. > If Python "just works" for these users, and if we already offer > Unicode strings and a good codec library for people to use when they > want to, is there really a need to go further? My simple answer is: no, not at the moment. I can surely think of things that ought to work with the Unicode type and which currently don't, but most of them are a matter of fixing libraries. Regards, Martin From barry@scottb.demon.co.uk Sun Feb 18 13:01:06 2001 From: barry@scottb.demon.co.uk (Barry Scott) Date: Sun, 18 Feb 2001 13:01:06 -0000 Subject: [I18n-sig] Re: [Python-Dev] Pre-PEP: Python Character Model In-Reply-To: Message-ID: <001001c099aa$daebf240$060210ac@private> > Here's a thought. How about BinaryFile/BinarySocket/ByteArray which > do Files and sockets often contain a both string and binary data. Having StringFile and BinaryFile seems the wrong split. I'd think being able to write string and binary data is more useful for example having methods on file and socket like file.writetext, file.writebinary. NOw I can use the writetext to write the HTTP headers and writebinary to write the JPEG image say. 
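(A minimal sketch of how such methods could be layered over today's file objects; the names writetext/writebinary follow the message above, and the encoding choice and file names are assumptions:)

    class MixedModeFile:
        # wrapper sketch: explicit text vs. binary writes on one stream
        def __init__(self, fileobj, encoding='ascii'):
            self.fileobj = fileobj
            self.encoding = encoding
        def writetext(self, text):
            # accept a unicode (or plain) string and encode it explicitly
            self.fileobj.write(unicode(text).encode(self.encoding))
        def writebinary(self, data):
            # raw bytes pass through untouched
            self.fileobj.write(data)

    out = MixedModeFile(open('reply.http', 'wb'), 'latin-1')
    out.writetext(u'Content-Type: image/jpeg\r\n\r\n')
    out.writebinary(open('photo.jpg', 'rb').read())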
Barry From brian@tomigaya.shibuya.tokyo.jp Tue Feb 20 10:16:09 2001 From: brian@tomigaya.shibuya.tokyo.jp (Brian Takashi Hooper) Date: Tue, 20 Feb 2001 19:16:09 +0900 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) Message-ID: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> Here's the second message, from Tamito Kajiyama, contributor of the SJIS and EUC-JP codecs: ---- On Sun, 11 Feb 2001 20:18:51 +0900 Brian Takashi Hooper wrote: > Hi there, > > What does everyone think of the Proposed Character Model? I was also one of the people that Andy asked to contribute an opinion, so after reviewing the thread, here's what I have to say: I understand Paul's Pre-PEP as raising the following three points: 1. Deprecate the usage of the present string type as containing a sequence of bytes, and instead interpret string literals as containing Unicode characters. (Unify the present character strings and Unicode strings.) 2. Introduce a new data type (byte strings) for expressing an uninterpreted byte sequence. 3. Add a convention for specifying the encoding of a source file. In Python 2.0, there are separate data types for non-Unicode strings and Unicode character strings. The proposals 1. and 2. are essentially to replace these data types with the (Unicode) character sequence and byte sequence data types. Personally, I am opposed to the proposals 1. and 2. for the following two reasons: (1) The string types in Python 2.0 and the new string types proposed in the pre-PEP have a relationship something like this:

    Python 2.0                             Pre-PEP
    string ""            (byte sequence)   byte string b""
    Unicode string u""                     string ""  (Unicode character sequence)

In general, the before- and after-PEP Pythons above have essentially no difference in expressiveness, and therefore it's hard to see what merit there might be in swapping the data types. On the other hand, I believe that swapping byte sequence and character sequence data types as described above has several serious demerits for Japanese Python developers. Japanese programmers have a regular need to handle legacy encodings such as EUC-JP and Shift JIS in their programs. Regular conversion back-and-forth between Unicode and legacy encodings introduces a significant cost in terms of resource usage and performance. Moreover, there is the problem of incompatibilities between different Unicode conversion tables. Furthermore, Japanese programmers are accustomed to dealing with Japanese strings as byte sequences. Japanese users have a real motivation to manipulate Japanese character strings as sequences of bytes. Regardless of whether Unicode is supported or not, the byte sequence data type is necessary in order to represent Japanese characters. The present implementation of strings in Python, where a string represents a sequence of bytes, is one feature that makes Python easy for Japanese developers to use. Changing strings to contain Unicode character data would impose a heavy burden for development and maintenance on Japanese Python programmers. Therefore, I'm against swapping byte string and character (Unicode) string types. (2) It is not always possible to unambiguously interpret string literals as Unicode character data. As you know, in Japanese-encoded byte strings, 2 bytes often represent 1 character. Therefore, the position of characters is expressed in terms of bytes, not characters. Because of this, if a Japanese-encoded byte string is interpreted as-is as a Unicode character string, indexes into the string would no longer be interpreted the same way.
For example, in the below code snippet the substring is output differently depending on whether the string literal is interpreted as a byte sequence or Unicode character sequence: s = "これは日本語の文字列です。" print s[6:12] Hard coding of slices as with the above is a common practice, I believe. Paul has asserted that no serious problems will occur if existing byte sequences are interpreted as Unicode, but I disagree with him on this. Due to the above two reasons, I cannot agree with the pre-PEP's first two proposals (1. and 2.). However, I believe the 3rd proposal to explicitly specify source file encoding is a necessary improvement, leaving aside for the moment the question of implementation. In Python 2.0, if a program is written containing Japanese strings in Shift-JIS, Python may raise parser errors. As many of you may know, in Shift-JIS encoded strings the second byte of some Japanese characters may be a backslash (ASCII 0x5c), and this conflicts with the backslash escaping in the string literal. As far as I know, this is also the case with the Chinese encoding Big 5. One way to solve this problem is to apply Ishimoto-san's Shift-JIS patch [1] to Python, but I feel that a more desirable solution is to allow Python itself to handle files with different source encodings. However, the intent of Paul's 3rd suggestion seems directed at solving a different problem than that of allowing specification of an encoding for byte strings. On the other hand, Marc-Andre's proposal [2] is to use the source file encoding only for the decoding of non-Unicode characters in character strings, without touching the contents of byte strings. While I prefer Marc-Andre's proposal since it seems to be a straightforward extension of Python 2.0's current Unicode support, it doesn't address the aforementioned problem with the usage of Shift-JIS and Big 5 in Python programs. Concerning this point, I think there is a need to start another discussion aside from Paul's pre-PEP. [1] http://www.gembook.org/python/ http://www.gembook.org/python/python20-sjis-20001202.zip [2] http://mail.python.org/pipermail/i18n-sig/2001-February/000756.html ---------------------------------------------------------------------- -- KAJIYAMA, Tamito From brian@tomigaya.shibuya.tokyo.jp Tue Feb 20 10:16:07 2001 From: brian@tomigaya.shibuya.tokyo.jp (Brian Takashi Hooper) Date: Tue, 20 Feb 2001 19:16:07 +0900 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (1 of 4) Message-ID: <20010220185630.F948.BRIAN@tomigaya.shibuya.tokyo.jp> Hi there, this is Brian Hooper in Tokyo. The proposed character model thread seems to have simmered down so I don't know how interested people will be in this, but I gathered a few comments about the Pre-PEP from the Japanese Python mailing list, and translated the responses - I think there were some very good points brought up, and I'd like to add the messages I received (with the permission of their authors) to the discussion. I've got four messages to post; I'm not such a fast translator so I'll post the two I have now, and the other two as I finish them. Here is Atsuo Ishimoto's post - Ishimoto-san wrote and contributed the CP 932 codec. --- On Sun, 11 Feb 2001 20:18:51 +0900 Brian Takashi Hooper wrote: > Hi there, > > What does everyone think of the Proposed Character Model? I'm opposed to it in its present form. Putting aside for the moment any criticisms of Unicode itself, building extension modules for Python would become more difficult and problematic (as Suzuki also pointed out). 
For example, given: PyObject *simple(PyObject *o, PyObject *args) { char *filename; if (!PyArg_ParseTuple(args, "s", filename)) return NULL; File *f = fopen(filename, "w"); if (!f) return NULL; fprintf(f, "spam"); fclose(f); Py_INCREF(Py_None); return Py_None; } from Python you can write: sample.simple("日本語ファイル名") and it will work as is in almost any platform and language environment. It works because in the present implementation of CPython, the input data string is treated as simply data by the extension module, which simply passes it along to the underlying OS or library without interpreting the content of the data. However, consider the same extension module in the case where all character sequences are handled by Python internally as Unicode. PyArg_ParseTuple() has no way of automatically knowing how to change Unicode characters with an ordinal value greater than 0xff into the encoding currently supported on the platform. In this case, sample.simple("日本語ファイル名") becomes an error. At present, most of Python's extension modules can be used without having to explicitly add CJK support - however if this PEP is implemented then most of these modules will become unusable in their present form. So, is there any solution for this? Well, we could take care when writing our Python scripts only to use strings in such a way that PyArg_ParseTuple() does not cause an error. There are two ways to do this: a. Use byte strings Instead of using a character string, we could call our function as sample.simple(b"日本語ファイル名") and everything then works fine. However, if we always have to use byte strings when interacting with extension libraries, then we haven't really achieved any real improvement in terms of internationalization, and there's not much point to implementing the PEP in that case... b. We could use an 8-bit character encoding such as ISO-8859. Suppose we use ISO-8859-1 instead of Shift-JIS or EUC-JP when creating the character string. Since the value of ord() for each character in the string is always <= 255, PyArg_ParseTuple() will have no problem with it, but in having to treat legacy encoded data as a different encoding, we haven't really made it easier to write programs which handle CJK data, or improved the situation for i18n either. It could be argued that Unicode strings could be used everywhere else, and either a. or b. above only when calling legacy code through extension modules like with simple() above. However, in the above case, it becomes necessary for the programmer to be aware of whether the function they are calling is implemented in legacy C code or not, which isn't really an improvement on the current state of things. Moreover, because in converting to Unicode we lose information about the original string encoding, automatically converting back to the original string encoding (for example in order to make the distinction between Unicode supported and non-supported libraries -B) becomes impossible. Use of a default encoding is discouraged in the PEP, but this is one example of why it may be necessary. So, returning to the extension module example above, we've seen that managing the problem on the Python script side is difficult. Another approach might be to change our extension module to support Unicode: PyObject *simple(PyObject *o, PyObject *args) { Py_UNICODE *filename; if (!PyArg_ParseTuple(args, "u", filename)) return NULL; File *f = ... 
:-P If the platform being used has a version of fopen() which has Unicode support, then there's no problem, but if not, then it's necessary to first convert the Unicode string to an encoding which _is_ supported on the platform: PyObject *simple(PyObject *o, PyObject *args) { Py_UNICODE *filename; char native_filename[MAX_FILE]; if (!PyArg_ParseTuple(args, "u", filename)) return NULL; #IF SJIS /* SJISに変換 */ #ELSE /* EUCに変換 */ #ENDIF FILE *f = fopen(....) I don't think anyone really wants to write code like this. Besides adding complexity, it is also hard to ignore the additional processing cost added by having to convert incoming Unicode arguments. Furthermore, adding this kind of support isn't likely to be provided by European or American programmers, since the coincidence of the ISO-8859-1 with the <= 255 range of Unicode makes such explicit support unnecessary for applications which only use Latin-1 or ASCII. (So: Non-American/European programmers will have to add support for libraries they want to use) One of Python's strong points is that it makes it easy to wrap and use existing C libraries - however, the great majority of these C libraries are still not Unicode compliant. In that case, it becomes necessary to add Unicode->native encoding support for all such C modules one-by-one, as described above. It's difficult to see what would be good about that. Some might react to the above by insisting, "These are just transitional problems which will soon be solved. If we restrict things to just a few main platforms, then it won't become a big problem." This position is, however, flawed. For example, in Windows 95, to say nothing of UNIX-based OS's, Unicode support is only partial, and there is no Unicode version of fopen(). Considering the huge number of non-Unicode supported systems currently in use around the world, we cannot ignore the importance of continuing to support them. In conclusion, supposing that Python strings are made to hold only character data as proposed in the pre-PEP, use of extension modules from non-European languages becomes much more difficult, and explicit encoding support has to be added in many cases. Python's current string implementation has important implications for its use as a glue language in non-internationalized environments. -Atsuo Ishimoto The Japanese (original) version of this opinion is available at http://www.gembook.org/moin/moin.cgi/OpinionForPepPythonCharacterModel Comments / feedback appreciated. P.S. I wonder what Tcl does with this? From tdickenson@geminidataloggers.com Tue Feb 20 14:22:43 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Tue, 20 Feb 2001 14:22:43 +0000 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (1 of 4) In-Reply-To: <20010220185630.F948.BRIAN@tomigaya.shibuya.tokyo.jp> References: <20010220185630.F948.BRIAN@tomigaya.shibuya.tokyo.jp> Message-ID: <44t49tk2ea60f47uccfo8cll3a53lgtjmc@4ax.com> On Tue, 20 Feb 2001 19:16:07 +0900, Brian Takashi Hooper wrote: >Hi there, this is Brian Hooper in Tokyo. >The proposed character model thread seems to have simmered down so I >don't know how interested people will be in this, but I gathered a few >comments about the Pre-PEP from the Japanese Python mailing list, and >translated the responses - I think there were some very good points >brought up, and I'd like to add the messages I received (with the >permission of their authors) to the discussion. Thank you for this effort.
>For example, given: > >PyObject *simple(PyObject *o, PyObject *args) >{ > char *filename; > if (!PyArg_ParseTuple(args, "s", filename)) > return NULL; > File *f = fopen(filename, "w"); > if (!f) > return NULL; > fprintf(f, "spam"); > fclose(f); > Py_INCREF(Py_None); > return Py_None; >} > >from Python you can write: > >sample.simple("????????") > >and it will work as is in almost any platform and language environment. If those ??? are anything other than ASCII characters, then it doesn't work *predictably* today. (assuming the requirement that the file name is correct when viewed using the platform's native file browser) >Well, we could take care when writing our Python scripts only to use strings >in such a way that PyArg_ParseTuple() does not cause an error. Sticking with the fopen example, I had assumed it is desirable to get an error if a script tries to create a file whose name contains Japanese characters, on a filesystem that does not support that. >Use byte strings > >Instead of using a character string, we could call our function as > >sample.simple(b"????????") > >and everything then works fine. However, if we always have to use byte >strings when interacting with extension libraries, then we haven't really >achieved any real improvement in terms of internationalization, and there's >not much point to implementing the PEP in that case... If this is a legacy extension library then a byte string is all it expects. You could call this function as sample.simple(u"????????".encode('encoding_expected_by_sample_dot_simple')) I agree we need to provide a simpler interface to new extensions. >PyObject *simple(PyObject *o, PyObject *args) >{ > Py_UNICODE *filename; > char native_filename[MAX_FILE]; > > if (!PyArg_ParseTuple(args, "u", filename)) > return NULL; > >#IF SJIS > /* SJIS??? */ >#ELSE > /* EUC??? */ >#ENDIF > > FILE *f = fopen(....) > >I don't think anyone really wants to write code like this. I think those ifdefs could be replaced by one call to PyUnicode_Encode >Furthermore, adding this kind of support isn't likely to be provided by >European or American programmers, since the coincidence of the ISO-8859-1 >with the <= 255 range of Unicode makes such explicit support unnecessary >for applications which only use Latin-1 or ASCII. (So: Non-American/ >European programmers will have to add support for libraries they want to >use) As a European native-English speaker, I don't think this is true so long as we preserve the ASCII default encoding. An application that stores latin-1 data in a mix of unicode and plain strings will quickly trigger an exception (as soon as a unicode string mixes with a plain string containing a non-ASCII byte). A useful counterexample may be Mark Hammond's extensions for supporting win32 and com. They have always included explicit support for automatic encoding of unicode parameters on platforms where win32 uses 8-bit strings, and automatic decoding of plain strings when used with COM, which is always unicode. Toby Dickenson tdickenson@geminidataloggers.com From ishimoto@gembook.org Tue Feb 20 17:35:23 2001 From: ishimoto@gembook.org (Atsuo Ishimoto) Date: Wed, 21 Feb 2001 02:35:23 +0900 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (1 of 4) In-Reply-To: <44t49tk2ea60f47uccfo8cll3a53lgtjmc@4ax.com> References: <20010220185630.F948.BRIAN@tomigaya.shibuya.tokyo.jp> <44t49tk2ea60f47uccfo8cll3a53lgtjmc@4ax.com> Message-ID: <20010221023442.EE11.ISHIMOTO@gembook.org> Brian, Thanks for your effort to translate our comment.
On Tue, 20 Feb 2001 14:22:43 +0000 Toby Dickenson wrote: > > If those ??? are anything other than ASCII characters, then it doesnt > work *predictably* today. (assuming the requirement that the file name > is correct when viewed using the platforms native file browser) > If the filename is illegal for the platform, fopen() may returns error. Why should we check whether filename is valid or not? Current python doesn't check if filename contains illegal letters, such as ':' on Win32. This is because platform knows their job and character set. We don't have to bother them to work. > >Well, we could take care when writing our Python scripts only to use strings > >in such a way that PyArg_ParseTuple() does not cause an error. > > Sticking with the fopen example; I had assumed it is desirable to get > an error if a script tries to create a file whose name contains > japanse characters, on a filesystem that does not support that. > You can get an error from platform-depend fopen(). Python or extension module don't have to check this. > If this is a legacy extension library then a byte string is all it > expects. You could call this function as > > sample.simple(u"????????".encode('encoding_expected_by_sample_dot_simple')) > > I agree we need to provide a simpler interface to new extensions. I don't believe this make people happy, even if interface is simplified. It is hard work to remember given function is Python script, legacy extension or Unicode-aware extension. > >#IF SJIS > > /* SJIS??? */ > >#ELSE > > /* EUC??? */ > >#ENDIF > > > > FILE *f = fopen(....) > > > >I don't think anyone really wants to write code like this. > > I think those ifdefs could be replaced by one call to PyUnicode_Encode May be. But to encode, you need to know the possible character set of incoming Unicode string and it's encoding, and specify them explicitly. Platform depended default encoding may eliminate hard coded encoding name, but I'm afraid of performance penalty for really long strings. > > As a European native-English speaker, I dont think this is true so > long as we preserve the ASCII default encoding. An application that > stores latin-1 data in a mix of unicode and plain strings will quickly > trigger an exception (as soon as a unicode string mixes with a plain > string containing a non-ASCII byte). > This means a lot of existing extension modules should be updated. It is hard for me to believe this is good idea. > A useful counterexample may be Mark Hammond's extensions for > supporting win32 and com. They have always included explicit support > for automatic encoding of unicode parameters on platforms where win32 > uses 8-bit strings, and automatic decoding of plain strings when used > with COM, which is always unicode. win32com works fine because COM is the Unicode world. But Python should live in the Unicode hostile land, I believe. Wishing you can read my English.... -------------------------- Atsuo Ishimoto ishimoto@gembook.org Homepage:http://www.gembook.org From guido@digicool.com Tue Feb 20 19:36:35 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 20 Feb 2001 14:36:35 -0500 Subject: [I18n-sig] How does Python Unicode treat surrogates? Message-ID: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> On the XML sig the following exchange happened. I don't know enough about the issues to investigate, but I'm sure that someone here can provide insight? It seems to boil down to whether or not surrogates may get transposed when between platforms. 
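(For reference, the surrogate pairing at issue is fixed arithmetic, and UTF-16 specifies the order -- high surrogate first -- independently of platform byte order; a small sketch with an arbitrary example code point:)

    c = 0x10330                            # arbitrary code point outside the BMP
    hi = 0xD800 + ((c - 0x10000) >> 10)    # high (leading) surrogate
    lo = 0xDC00 + ((c - 0x10000) & 0x3FF)  # low (trailing) surrogate
    print hex(hi), hex(lo)                 # 0xd800 0xdf30 -- the order is part of the encoding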
--Guido van Rossum (home page: http://www.python.org/~guido/) ------- Forwarded Message Date: Tue, 20 Feb 2001 11:54:34 -0700 From: Uche Ogbuji To: Guido van Rossum cc: Lars Marius Garshol , xml-sig@python.org Subject: Re: [XML-SIG] DC DOM tests (Was: Roadmap document - finally!) > > > > - DOMString and text manipulating interface methods are not > > > > tested beyond ASCII text due to an implementation limitation > > > > of ParsedXML.DOM. So, implementations will not be tested if > > > > text is correctly treated when multi-byte UTF-16 characters > > > > are involved. > > > > > > By "multi-byte UTF-16 characters" I assume you mean Unicode > > > characters outside the BMP that are represented using two > > > surrogates? > > > > I wonder if that's what Martijn means. I've read that most Java > > implementations have trouble with characters outside the BMP. I > > wonder if Python handles these properly. > > Depends on what you call properly. Can you elaborate on what you > would call proper treatment here? Sure. I admit it's hearsay, but I thought I'd read that because Java Unicode is or was underspecified, that there was the possibility of transposition of the high-surrogate with the low-surrogate character between Java implementations or platforms. Now I don't exactly write XML dissertations on "Hello Kitty" , so I'm not likely to run into this myself, but I was wondering whether Python handles surrogate blocks appropriately across platforms and implementations (I guess including cpyhton -> Jpython). -- Uche Ogbuji Principal Consultant uche.ogbuji@fourthought.com +1 303 583 9900 x 101 Fourthought, Inc. http://Fourthought.com 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA Software-engineering, knowledge-management, XML, CORBA, Linux, Python ------- End of Forwarded Message From paulp@ActiveState.com Tue Feb 20 21:46:35 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 20 Feb 2001 13:46:35 -0800 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model References: <013401c09416$881b0f40$e46940d5@hagrid> <200102151849.f1FInMt02227@mira.informatik.hu-berlin.de> Message-ID: <3A92E5BB.38D4FB0B@ActiveState.com> "Martin v. Loewis" wrote: > > > ... > > > > It's not an encoding. It's the subset of Unicode that you can store > > in an 8-bit character. > > No, it is not *the* subset of Unicode that you can store in an 8-bit > character. You can store any subset of Unicode with a cardinality <256 > in a single octet. > > Latin-1 is group 0, plane 0, row 0. Why is it any better than any > other plane or row? I don't know. You tell me. >>> "a"==u"a"==chr(97) 1 It looks like we've already decided that group 0, plane 0, row 0 is special. A better question is why if the first half of group 0, plane 0, row 0 better than the last half? >>> unichr(160)==chr(160) Traceback (most recent call last): File "", line 1, in ? UnicodeError: ASCII decoding error: ordinal not in range(128) The Unicode guys made group 0, plane 0, row 0 Latin-1 for a reason. It's not just an accident. I don't think it makes sense for us to agree with them "halfway"...especially when this half-way agreement causes all kinds of nasty problems like forcing Python to raise exceptions in places that are really surprising like equality tests and sort functions. -- Vote for Your Favorite Python & Perl Programming Accomplishments in the first Active Awards! 
http://www.ActiveState.com/Awards From paulp@ActiveState.com Tue Feb 20 21:56:40 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 20 Feb 2001 13:56:40 -0800 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> Message-ID: <3A92E818.6FFACF04@ActiveState.com> Thanks for the translation Brian! That must have been a ton of work but it strikes me as very important work! > > ... > > Python 2.0 Pre-PEP > string "" (byte sequence) byte string b"" > Unicode string u"" (Unicode string "" > character sequence) > > In general, the before- and after-PEP Pythons above have essentially no > difference in expressiveness, and therefore it's hard to see what merit > there might be in swapping the data types. I think that there is an important issue here. Python is documented as having character strings. The minimal unit of a string is supposed to be a character. "Literal" strings are documented as being strings of characters. People expect this of a modern, high-level, user-centric language. Bytes are no more interesting to your average programmer than are DWORDs. We aren't going to start teaching people about bytes in introductory Python classes. More and more, people are going to find it bizarre to make a distinction between the 128 characters that happen to have lived in a quickly-becoming-obsolete American standard and the other 65,000 characters that we can use in word processors, web pages, search engines and so forth. You don't have to be Asian to see the distinction as arbitrary and historical. What if you want to insert a trademark (tm) or copyright (c) in your software? It is certainly too early for Python to abandon the one-byte centric view of the world. It is NOT too early to start putting into place a transition plan to the future world that we will all be forced to live in. Part of that transition is teaching people that literal strings may one day allow characters greater than 128 (perhaps directly, perhaps through an escape mechanism). > ... > Furthermore, Japanese programmers are accustomed to dealing with Japanese > strings as byte sequences. Japanese users have a real motivation to > manipulate Japanese character strings as sequences of bytes. Regardless > of whether Unicode is supported or not, the byte sequence data type is > necessary in order to represent Japanese characters. An explicit part of every proposal has been a continued support for rich, expressive byte-sequence manipulation. > The present implementation of strings in Python, where a string represents > a sequence of bytes, is one feature that makes Python easy for Japanese > developers to use. If Japanese programmers understand the difference between a byte and a character (which they must!), why would they be opposed to making that distinction explicit in code? > As you know, in Japanese-encoded byte strings, 2 bytes often represent > 1 character. Therefore, the position of characters is expressed in terms > of bytes, not characters. Because of this, if a Japanese-encoded byte > string is interpreted as-is as a Unicode character string, indexes into > the string would no longer be interpreted the same way. For example, in > the below code snippet the substring is output differently depending on > whether the string literal is interpreted as a byte sequence or Unicode > character sequence: > > s = "これは日本語の文字列です。" > print s[6:12] > > Hard coding of slices as with the above is a common practice, > I believe. 
Paul has asserted that no serious problems will occur if > existing byte sequences are interpreted as Unicode, but I disagree with > him on this. I still assert that the interpretation will not change. If you have no encoding declaration then the only rational choice is to treat each byte as a character. Therefore the indexes would work exactly as they do today. -- Vote for Your Favorite Python & Perl Programming Accomplishments in the first Active Awards! http://www.ActiveState.com/Awards From guido@digicool.com Tue Feb 20 21:54:25 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 20 Feb 2001 16:54:25 -0500 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model In-Reply-To: Your message of "Tue, 20 Feb 2001 13:46:35 PST." <3A92E5BB.38D4FB0B@ActiveState.com> References: <013401c09416$881b0f40$e46940d5@hagrid> <200102151849.f1FInMt02227@mira.informatik.hu-berlin.de> <3A92E5BB.38D4FB0B@ActiveState.com> Message-ID: <200102202154.QAA06554@cj20424-a.reston1.va.home.com> > "Martin v. Loewis" wrote: > > Latin-1 is group 0, plane 0, row 0. Why is it any better than any > > other plane or row? > > I don't know. You tell me. > > >>> "a"==u"a"==chr(97) > 1 > > It looks like we've already decided that group 0, plane 0, row 0 is > special. A better question is why if the first half of group 0, plane 0, > row 0 better than the last half? > > >>> unichr(160)==chr(160) > Traceback (most recent call last): > File "", line 1, in ? > UnicodeError: ASCII decoding error: ordinal not in range(128) > > The Unicode guys made group 0, plane 0, row 0 Latin-1 for a reason. It's > not just an accident. I don't think it makes sense for us to agree with > them "halfway"...especially when this half-way agreement causes all > kinds of nasty problems like forcing Python to raise exceptions in > places that are really surprising like equality tests and sort > functions. This has been hashed to death many times before. We have absolutely no guarantee that the files from which Python strings are read are encoded in Latin-1, but we do know pretty sure that they are an ASCII superset (if they represent characters at all). Using the locale module the user can (implicitly) indicate what the character set is, and this may not be Latin-1. Since s.islower() and other similar functions are locale-sensitive, it would be inconsistent to declare that 8-bit strings are always encoded in Latin-1. This is historical baggage that cannot easily be fixed without breaking lots of code handling character data using legacy encodings (and typically, such code is not served by a switch to Unicode). It's possible to change locales in mid-execution, but for various reasons it's bad to change the default encoding in mid-execution, so the best we can do is assume ASCII as the default encoding. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Tue Feb 20 22:02:01 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 20 Feb 2001 17:02:01 -0500 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) In-Reply-To: Your message of "Tue, 20 Feb 2001 13:56:40 PST." <3A92E818.6FFACF04@ActiveState.com> References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> <3A92E818.6FFACF04@ActiveState.com> Message-ID: <200102202202.RAA06643@cj20424-a.reston1.va.home.com> > I think that there is an important issue here. Python is documented as > having character strings. The minimal unit of a string is supposed to be > a character. 
"Literal" strings are documented as being strings of > characters. Sorry, you're reading way too much into the words here. When I wrote that, in my brain there was absolutely no difference between characters and bytes, and in C the type name for a byte is 'char', so I wrote 'character' -- but I was thinking '8-bit quantity'. [starry-eyed romantic idealism skipped] > It is certainly too early for Python to abandon the one-byte centric > view of the world. It is NOT too early to start putting into place a > transition plan to the future world that we will all be forced to live > in. Part of that transition is teaching people that literal strings may > one day allow characters greater than 128 (perhaps directly, perhaps > through an escape mechanism). No objection here. > > ... > > Furthermore, Japanese programmers are accustomed to dealing with Japanese > > strings as byte sequences. Japanese users have a real motivation to > > manipulate Japanese character strings as sequences of bytes. Regardless > > of whether Unicode is supported or not, the byte sequence data type is > > necessary in order to represent Japanese characters. > > An explicit part of every proposal has been a continued support for > rich, expressive byte-sequence manipulation. > > > The present implementation of strings in Python, where a string represents > > a sequence of bytes, is one feature that makes Python easy for Japanese > > developers to use. > > If Japanese programmers understand the difference between a byte and a > character (which they must!), why would they be opposed to making that > distinction explicit in code? Maybe because, like me, they're thinking in historical terms where 'char' is just another word for byte? --Guido van Rossum (home page: http://www.python.org/~guido/) From martin@loewis.home.cs.tu-berlin.de Tue Feb 20 22:11:22 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 20 Feb 2001 23:11:22 +0100 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model In-Reply-To: <3A92E5BB.38D4FB0B@ActiveState.com> (message from Paul Prescod on Tue, 20 Feb 2001 13:46:35 -0800) References: <013401c09416$881b0f40$e46940d5@hagrid> <200102151849.f1FInMt02227@mira.informatik.hu-berlin.de> <3A92E5BB.38D4FB0B@ActiveState.com> Message-ID: <200102202211.f1KMBMl01756@mira.informatik.hu-berlin.de> > A better question is why if the first half of group 0, plane 0, > row 0 better than the last half? Well, because it is ASCII, and because ASCII is a subset of most encodings - so assuming that an octet string is meant as ASCII when compared to a Unicode object has a high probability of being a good guess. The same is not true if there are octets >128. > > >>> unichr(160)==chr(160) > Traceback (most recent call last): > File "", line 1, in ? > UnicodeError: ASCII decoding error: ordinal not in range(128) > > The Unicode guys made group 0, plane 0, row 0 Latin-1 for a reason. Sure: to allow easy conversion between Latin-1 documents and Unicode. > It's not just an accident. I don't think it makes sense for us to > agree with them "halfway"... We agree with them all the way. The codec that deals with Latin-1 is hard-coded in _codecs, whereas the other single-byte encodings require dictionaries for operation. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Feb 20 22:21:25 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Tue, 20 Feb 2001 23:21:25 +0100 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) In-Reply-To: <3A92E818.6FFACF04@ActiveState.com> (message from Paul Prescod on Tue, 20 Feb 2001 13:56:40 -0800) References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> <3A92E818.6FFACF04@ActiveState.com> Message-ID: <200102202221.f1KMLPV01849@mira.informatik.hu-berlin.de> > I still assert that the interpretation will not change. If you have no > encoding declaration then the only rational choice is to treat each byte > as a character. Therefore the indexes would work exactly as they do > today. I'm not surprised that this assertion does not convince people too much. Again, I doubt that theoretical discussion of the issue does not bring it much further. What is needed is an actual patch to Python so people can see what exactly you are proposing, and in what way it would affect their code. I'm still pretty sure that any patch that changes string literals to be interpreted as wide strings, using the Unicode charset, would break loads of existing applications. Regards, Martin From guido@digicool.com Tue Feb 20 22:26:36 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 20 Feb 2001 17:26:36 -0500 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) In-Reply-To: Your message of "Tue, 20 Feb 2001 23:21:25 +0100." <200102202221.f1KMLPV01849@mira.informatik.hu-berlin.de> References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> <3A92E818.6FFACF04@ActiveState.com> <200102202221.f1KMLPV01849@mira.informatik.hu-berlin.de> Message-ID: <200102202226.RAA07034@cj20424-a.reston1.va.home.com> > Again, I doubt that theoretical discussion of the issue does not bring > it much further. What is needed is an actual patch to Python so people > can see what exactly you are proposing, and in what way it would > affect their code. Yes! > I'm still pretty sure that any patch that changes > string literals to be interpreted as wide strings, using the Unicode > charset, would break loads of existing applications. Note that this can already be approximated with the -U option. It might be a good idea to present the patch as an extension of what -U does (I believe -U currently *only* changes all string literals to Unicode -- but that's already very pervasive...). --Guido van Rossum (home page: http://www.python.org/~guido/) From paulp@ActiveState.com Tue Feb 20 23:04:17 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 20 Feb 2001 15:04:17 -0800 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model References: <013401c09416$881b0f40$e46940d5@hagrid> <200102151849.f1FInMt02227@mira.informatik.hu-berlin.de> <3A92E5BB.38D4FB0B@ActiveState.com> <200102202154.QAA06554@cj20424-a.reston1.va.home.com> Message-ID: <3A92F7F1.77AFE3FD@ActiveState.com> Guido van Rossum wrote: > > ... > > This has been hashed to death many times before. We have absolutely > no guarantee that the files from which Python strings are read are > encoded in Latin-1, but we do know pretty sure that they are an ASCII > superset (if they represent characters at all). Using the locale > module the user can (implicitly) indicate what the character set is, > and this may not be Latin-1. Since s.islower() and other similar > functions are locale-sensitive, it would be inconsistent to declare > that 8-bit strings are always encoded in Latin-1. So the problem is that s.islower() might in some circumstances not equal unicode(s).islower()? 
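Something like this, presumably -- a sketch, where the locale name is platform-specific and the Latin-1 locale has to be installed for the setlocale() call to succeed:

>>> import locale
>>> s = "\xe9"                        # e-acute in Latin-1, but just a byte to Python
>>> s.islower()                       # under the default "C" locale
0
>>> locale.setlocale(locale.LC_CTYPE, "de_DE.ISO8859-1")
'de_DE.ISO8859-1'
>>> s.islower()                       # the same byte now counts as a lower-case letter
1
>>> unicode(s, "latin-1").islower()   # the Unicode answer does not depend on the locale
1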
Is this really a bigger deal than the fact that in some circumstances comparisons between 8-bit strings and Unicode strings will cause an exception, depending on the contents of the 8-bit string. Or that sorts could throw exceptions? Or concatenations can fail? The only arguments I have heard for the need for the builtin function "unichr" are based on the danger of concatenation failures in the 127-255 range. The price of this consistency is very high IMO! -- Vote for Your Favorite Python & Perl Programming Accomplishments in the first Active Awards! http://www.ActiveState.com/Awards From paulp@ActiveState.com Tue Feb 20 23:20:35 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 20 Feb 2001 15:20:35 -0800 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> <3A92E818.6FFACF04@ActiveState.com> <200102202221.f1KMLPV01849@mira.informatik.hu-berlin.de> <200102202226.RAA07034@cj20424-a.reston1.va.home.com> Message-ID: <3A92FBC3.E8484C0B@ActiveState.com> Guido van Rossum wrote: > > > Again, I doubt that theoretical discussion of the issue does not bring > > it much further. What is needed is an actual patch to Python so people > > can see what exactly you are proposing, and in what way it would > > affect their code. > > Yes! The pre-PEP proposed roughly several month's work in terms of new types, extended functions, encoding changes and so forth to be implemented over several years. But if we don't agree on the direction of movement straight then we aren't going to move anywhere ever! The central proposal is that "Python strings" could allow characters with ordinal values higher than 255. I absolutely cannot see how this could break Python code. It is a loosening of a restriction! The trick (which may or may not be possible) is working with extension modules which have assumptions about the underlying bit-representation of strings. The only way out from under that weight is to start distinguishing between logical character strings and physical byte strings now, so that we do not have this same "legacy extension code" issue five years from now. -- Vote for Your Favorite Python & Perl Programming Accomplishments in the first Active Awards! http://www.ActiveState.com/Awards From guido@digicool.com Tue Feb 20 23:53:14 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 20 Feb 2001 18:53:14 -0500 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) In-Reply-To: Your message of "Tue, 20 Feb 2001 15:20:35 PST." <3A92FBC3.E8484C0B@ActiveState.com> References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> <3A92E818.6FFACF04@ActiveState.com> <200102202221.f1KMLPV01849@mira.informatik.hu-berlin.de> <200102202226.RAA07034@cj20424-a.reston1.va.home.com> <3A92FBC3.E8484C0B@ActiveState.com> Message-ID: <200102202353.SAA07769@cj20424-a.reston1.va.home.com> > The pre-PEP proposed roughly several month's work in terms of new types, > extended functions, encoding changes and so forth to be implemented over > several years. But if we don't agree on the direction of movement > straight then we aren't going to move anywhere ever! > > The central proposal is that "Python strings" could allow characters > with ordinal values higher than 255. I absolutely cannot see how this > could break Python code. It is a loosening of a restriction! It will probably require changes to C APIs, so it will break extensions. If some extensions aren't ported, that will in turn break 3rd party code. 
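For example, wrapper and glue code routinely type-checks for the 8-bit string type, and every such check breaks the moment literals become Unicode objects. A sketch (send_bytes is a made-up helper, not taken from any real extension):

>>> import types
>>> def send_bytes(s):
...     assert type(s) is types.StringType   # "must be an 8-bit string"
...     return len(s)
...
>>> send_bytes("abc")      # fine while literals are 8-bit strings
3
>>> send_bytes(u"abc")     # what every call site would pass once literals are Unicode
Traceback (most recent call last):
  ...
AssertionError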
Also, if you want to see what could break, try running the test suite with python -U. > The trick (which may or may not be possible) is working with extension > modules which have assumptions about the underlying bit-representation > of strings. The only way out from under that weight is to start > distinguishing between logical character strings and physical byte > strings now, so that we do not have this same "legacy extension code" > issue five years from now. Sorry, I don't understand what you're proposing here. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Wed Feb 21 00:00:55 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 20 Feb 2001 19:00:55 -0500 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model In-Reply-To: Your message of "Tue, 20 Feb 2001 15:04:17 PST." <3A92F7F1.77AFE3FD@ActiveState.com> References: <013401c09416$881b0f40$e46940d5@hagrid> <200102151849.f1FInMt02227@mira.informatik.hu-berlin.de> <3A92E5BB.38D4FB0B@ActiveState.com> <200102202154.QAA06554@cj20424-a.reston1.va.home.com> <3A92F7F1.77AFE3FD@ActiveState.com> Message-ID: <200102210000.TAA07907@cj20424-a.reston1.va.home.com> > Guido van Rossum wrote: > > > > ... > > > > This has been hashed to death many times before. We have absolutely > > no guarantee that the files from which Python strings are read are > > encoded in Latin-1, but we do know pretty sure that they are an ASCII > > superset (if they represent characters at all). Using the locale > > module the user can (implicitly) indicate what the character set is, > > and this may not be Latin-1. Since s.islower() and other similar > > functions are locale-sensitive, it would be inconsistent to declare > > that 8-bit strings are always encoded in Latin-1. > > So the problem is that s.islower() might in some circumstances not equal > unicode(s).islower()? > > Is this really a bigger deal than the fact that in some circumstances > comparisons between 8-bit strings and Unicode strings will cause an > exception, depending on the contents of the 8-bit string. Or that sorts > could throw exceptions? Or concatenations can fail? Yes, it is a bigger deal, because it is a clear indication that assuming Latin-1 is simply WRONG. > The only arguments I have heard for the need for the builtin function > "unichr" are based on the danger of concatenation failures in the > 127-255 range. The price of this consistency is very high IMO! --Guido van Rossum (home page: http://www.python.org/~guido/) From kajiyama@pseudo.grad.sccs.chukyo-u.ac.jp Wed Feb 21 05:30:15 2001 From: kajiyama@pseudo.grad.sccs.chukyo-u.ac.jp (Tamito Kajiyama) Date: Wed, 21 Feb 2001 14:30:15 +0900 (JST) Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) In-Reply-To: <3A92E818.6FFACF04@ActiveState.com> (message from Paul Prescod on Tue, 20 Feb 2001 13:56:40 -0800) References: <3A92E818.6FFACF04@ActiveState.com> <200102202202.RAA06643@cj20424-a.reston1.va.home.com> Message-ID: <200102210530.OAA11470@pseudo.grad.sccs.chukyo-u.ac.jp> Brian, thank you for the great translation! Paul Prescod wrote: | | It is certainly too early for Python to abandon the one-byte centric | view of the world. It is NOT too early to start putting into place a | transition plan to the future world that we will all be forced to live | in. Part of that transition is teaching people that literal strings may | one day allow characters greater than 128 (perhaps directly, perhaps | through an escape mechanism). I agree. 
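(A small sketch of what 2.0 already offers: the escape mechanism exists today, but only inside u"" literals, not in plain ones:)

>>> u"caf\u00e9"             # \u escapes name a character, independently of any encoding
u'caf\xe9'
>>> unichr(0xe9) == u"\u00e9"
1
>>> "caf\xe9"                # in a plain literal, \xe9 is just a byte with no character identity
'caf\xe9'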
| > The present implementation of strings in Python, where a string represents | > a sequence of bytes, is one feature that makes Python easy for Japanese | > developers to use. | | If Japanese programmers understand the difference between a byte and a | character (which they must!), why would they be opposed to making that | distinction explicit in code? They are not opposed to the distinction, I believe. In fact, Python 2.0 makes such a distinction since it has the byte string and Unicode string data types. The present two distinct data types are necessary and sufficient, I think. Guido van Rossum wrote: | | Maybe because, like me, they're thinking in historical terms where | 'char' is just another word for byte? Paul Prescod wrote: | | I still assert that the interpretation will not change. If you have no | encoding declaration then the only rational choice is to treat each byte | as a character. Therefore the indexes would work exactly as they do | today. As Guido pointed out, Japanese programmers are thinking that 'char' in Python (and C) is another word of 'byte'. Therefore, to treat each byte as a character is not rational at least in Japanese text processing. I'm quite sure that tons of existing programs will break if the semantics of the byte string and Unicode string are swapped. Regards, -- KAJIYAMA, Tamito From andy@reportlab.com Wed Feb 21 06:04:57 2001 From: andy@reportlab.com (Andy Robinson) Date: Wed, 21 Feb 2001 06:04:57 -0000 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (1 of 4) In-Reply-To: <20010220185630.F948.BRIAN@tomigaya.shibuya.tokyo.jp> Message-ID: > I've got four messages to post; I'm not such a fast > translator so I'll > post the two I have now, and the other two as I finish them. > Many thanks to everyone on python-ml-jp for your thoughtful answers. And Brian, thank you very much for these translations; I know you have put a lot of time and effort into them. - Andy Robinson From andy@reportlab.com Wed Feb 21 06:04:59 2001 From: andy@reportlab.com (Andy Robinson) Date: Wed, 21 Feb 2001 06:04:59 -0000 Subject: [I18n-sig] Re: Pre-PEP: Proposed Python Character Model In-Reply-To: <3A92E5BB.38D4FB0B@ActiveState.com> Message-ID: Paul Prescod wrote: > It looks like we've already decided that group 0, plane 0, row 0 is > special. A better question is why if the first half of > group 0, plane 0, row 0 better than the last half? Because the first half is compatible with just about every native encoding on the planet. The last half is just Latin-1, and byte values above 127 are different in just about every native encoding on the planet. - Andy Robinson From martin@loewis.home.cs.tu-berlin.de Wed Feb 21 09:13:30 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 21 Feb 2001 10:13:30 +0100 Subject: [I18n-sig] Japanese commentary on the Pre-PEP (2 of 4) In-Reply-To: <3A92FBC3.E8484C0B@ActiveState.com> (message from Paul Prescod on Tue, 20 Feb 2001 15:20:35 -0800) References: <20010220190538.F94A.BRIAN@tomigaya.shibuya.tokyo.jp> <3A92E818.6FFACF04@ActiveState.com> <200102202221.f1KMLPV01849@mira.informatik.hu-berlin.de> <200102202226.RAA07034@cj20424-a.reston1.va.home.com> <3A92FBC3.E8484C0B@ActiveState.com> Message-ID: <200102210913.f1L9DUh00845@mira.informatik.hu-berlin.de> > The pre-PEP proposed roughly several month's work in terms of new types, > extended functions, encoding changes and so forth to be implemented over > several years. 
But if we don't agree on the direction of movement > straight then we aren't going to move anywhere ever! > > The central proposal is that "Python strings" could allow characters > with ordinal values higher than 255. I absolutely cannot see how this > could break Python code. It is a loosening of a restriction! If you are convinced that your approach works, but cannot afford to implement it all, then specify it in a PEP. That might reduce the amount of work that you have to do, but will increase the amount of work that others have to do: I'd have to study it, try to understand it, then point out places where it is imprecise. After that, I'd have to figure out mentally how to implement it, and point to the places that are unimplementable. Finally, I'd have to look around for code and study how it would operate under your proposal. I seriously doubt that "direction of movement" is a meaningful term here. It all depends on the details, not the grand picture. Regards, Martin From mal@lemburg.com Wed Feb 21 12:39:26 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 21 Feb 2001 13:39:26 +0100 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> Message-ID: <3A93B6FE.842A4BAF@lemburg.com> Guido van Rossum wrote: > > On the XML sig the following exchange happened. I don't know enough > about the issues to investigate, but I'm sure that someone here can > provide insight? It seems to boil down to whether or not surrogates > may get transposed when between platforms. The Python Unicode implementation assumes that the internal storage is using UTF-16 *without* surrogates. As a result the storage scheme is the same as UCS2. This is per design since surrogates introduce a whole new can of worms (making UTF-16 a variable length encoding). Still, there are some codecs (utf-8, utf-16, unicode-escape) which try to handle surrogates properly. The support for surrogates is not complete though, so I wouldn't rely on it. Note that UTF-16 surrogates are only needed to reach Unicode code points beyond BMP. AFAIK, there are plans to fill this area in the next Unicode version, but the designers are very well aware of the issues this imposes on the existing implementations: Windows and Java are Unicode 2.0 based, which is not capable of handling character points outside BMP. Does this answer your question? > --Guido van Rossum (home page: http://www.python.org/~guido/) > > ------- Forwarded Message > > Date: Tue, 20 Feb 2001 11:54:34 -0700 > From: Uche Ogbuji > To: Guido van Rossum > cc: Lars Marius Garshol , xml-sig@python.org > Subject: Re: [XML-SIG] DC DOM tests (Was: Roadmap document - finally!) > > > > > > - DOMString and text manipulating interface methods are not > > > > > tested beyond ASCII text due to an implementation limitation > > > > > of ParsedXML.DOM. So, implementations will not be tested if > > > > > text is correctly treated when multi-byte UTF-16 characters > > > > > are involved. > > > > > > > > By "multi-byte UTF-16 characters" I assume you mean Unicode > > > > characters outside the BMP that are represented using two > > > > surrogates? > > > > > > I wonder if that's what Martijn means. I've read that most Java > > > implementations have trouble with characters outside the BMP. I > > > wonder if Python handles these properly. > > > > Depends on what you call properly. Can you elaborate on what you > > would call proper treatment here? > > Sure.
I admit it's hearsay, but I thought I'd read that because Java > Unicode is or was underspecified, that there was the possibility of > transposition of the high-surrogate with the low-surrogate character > between Java implementations or platforms. > > Now I don't exactly write XML dissertations on "Hello Kitty" , so > I'm not likely to run into this myself, but I was wondering whether > Python handles surrogate blocks appropriately across platforms and > implementations (I guess including cpyhton -> Jpython). > > -- > Uche Ogbuji Principal Consultant > uche.ogbuji@fourthought.com +1 303 583 9900 x 101 > Fourthought, Inc. http://Fourthought.com > 4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA > Software-engineering, knowledge-management, XML, CORBA, Linux, Python > > ------- End of Forwarded Message > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig -- Marc-Andre Lemburg ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Pages: http://www.lemburg.com/python/ From fw@deneb.enyo.de Thu Feb 22 16:38:26 2001 From: fw@deneb.enyo.de (Florian Weimer) Date: 22 Feb 2001 17:38:26 +0100 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3A93B6FE.842A4BAF@lemburg.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <3A93B6FE.842A4BAF@lemburg.com> Message-ID: <87hf1moepp.fsf@deneb.enyo.de> "M.-A. Lemburg" writes: > Note that UTF-16 surrogates are only needed to reach Unicode > code points beyond BMP. AFAIK, there are plans to fill this > area in the next Unicode version, but the designers are very > well aware of the issues this imposes on the existing implementations: > Windows and Java are Unicode 2.0 based which is not capable of > handling character points outside BMP. And so is Ada. However, a few useful extensions are planned for the next Unicode revisions: several mathematical alphabets and language tags come to my mind immediately. It's certainly no longer true that non-BMP characters are going to be used only by scholars (as it seemed a few years ago).
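To see what this means for the current implementation, whose internal storage is UCS-2 as described above: a character beyond the BMP (U+1D400 is used below purely as an arbitrary example) can only be written as a surrogate pair, i.e. two storage units for one logical character. A sketch of the arithmetic:

>>> cp = 0x1D400                             # hypothetical non-BMP code point
>>> hi = 0xD800 + ((cp - 0x10000) >> 10)     # high (leading) surrogate
>>> lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)   # low (trailing) surrogate
>>> hex(hi), hex(lo)
('0xd835', '0xdc00')
>>> s = unichr(hi) + unichr(lo)              # one logical character, two code units
>>> len(s)                                   # UCS-2 storage counts code units, not characters
2

Whether the utf-8 or utf-16 codecs join such a pair back into a single code point is exactly the incomplete surrogate support mentioned earlier, which is why the arithmetic is spelled out by hand here.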