[Python-checkins] CVS: python/nondist/peps pep-0223.txt,1.3,1.4
Tim Peters
python-dev@python.org
Wed, 23 Aug 2000 20:26:44 -0700
Update of /cvsroot/python/python/nondist/peps
In directory slayer.i.sourceforge.net:/tmp/cvs-serv9254
Modified Files:
pep-0223.txt
Log Message:
Completed, about to post.
Index: pep-0223.txt
===================================================================
RCS file: /cvsroot/python/python/nondist/peps/pep-0223.txt,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -r1.3 -r1.4
*** pep-0223.txt 2000/08/23 06:03:29 1.3
--- pep-0223.txt 2000/08/24 03:26:42 1.4
***************
*** 3,11 ****
Version: $Revision$
Author: tpeters@beopen.com (Tim Peters)
! Status: Draft
Type: Standards Track
Python-Version: 2.0
Created: 20-Aug-2000
! Post-History:
--- 3,11 ----
Version: $Revision$
Author: tpeters@beopen.com (Tim Peters)
! Status: Active
Type: Standards Track
Python-Version: 2.0
Created: 20-Aug-2000
! Post-History: 23-Aug-2000
***************
*** 18,21 ****
--- 18,199 ----
compatibility with Perl regular expressions, and with minimal risk
to existing code.
+
+
+ Syntax
+
+ The syntax of \x escapes, in all flavors of non-raw strings, becomes
+
+ \xhh
+
+ where h is a hex digit (0-9, a-f, A-F). The exact syntax in 1.5.2 is
+ not clearly specified in the Reference Manual; it says
+
+ \xhh...
+
+ implying "two or more" hex digits, but one-digit forms are also
+ accepted by the 1.5.2 compiler, and a plain \x is "expanded" to
+ itself (i.e., a backslash followed by the letter x). It's unclear
+ whether the Reference Manual intended either of the 1-digit or
+ 0-digit behaviors.
+
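A minimal sketch of the proposed syntax as a regular expression, run under a current Python 3 interpreter rather than 2.0. The names X_ESCAPE and is_valid_x_escape are hypothetical and do not appear in the PEP or in the Python sources.

```python
import re

# Hypothetical names, for illustration only.
# The pattern encodes the proposed form: a backslash, the letter x,
# then exactly two hex digits (0-9, a-f, A-F).
X_ESCAPE = re.compile(r"\\x[0-9a-fA-F]{2}")


def is_valid_x_escape(text):
    """Return True if text is exactly one well-formed \\xhh escape."""
    return X_ESCAPE.fullmatch(text) is not None
```

The zero-digit and one-digit forms that 1.5.2 tolerated are rejected by this pattern, as is anything with a third digit attached.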
+
+ Semantics
+
+ In an 8-bit non-raw string,
+ \xij
+ expands to the character
+ chr(int(ij, 16))
+ Note that, for exactly two hex digits, this is the same as in 1.6 and
+ before.
+
+ In a Unicode string,
+ \xij
+ acts the same as
+ \u00ij
+ i.e. it expands to the obvious Latin-1 character from the initial
+ segment of the Unicode space.
+
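The equivalence of \xij and \u00ij can be checked on a current Python 3 interpreter, whose "unicode_escape" codec follows the rule described here. This is only an illustrative check, not part of the proposal; the function name is hypothetical.

```python
def latin1_equivalence_holds():
    """Check that \\xij and \\u00ij decode to the same character.

    Hypothetical helper: it builds the escape text for every value
    0x00-0xff and decodes it with the standard "unicode_escape" codec.
    """
    for code in range(256):
        hh = "%02x" % code
        via_x = ("\\x" + hh).encode("ascii").decode("unicode_escape")
        via_u = ("\\u00" + hh).encode("ascii").decode("unicode_escape")
        if via_x != via_u or via_x != chr(code):
            return False
    return True
```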
+ An \x not followed by at least two hex digits is a compile-time error,
+ specifically ValueError in 8-bit strings, and UnicodeError (a subclass
+ of ValueError) in Unicode strings. Note that if an \x is followed by
+ more than two hex digits, only the first two are "consumed". In 1.6
+ and before all but the *last* two were silently ignored.
+
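The rule above can be sketched as a small expander over the source text of a string literal. This is an illustration under stated assumptions, not the compiler's actual code: expand_x_escapes and HEX_DIGITS are hypothetical names, only \x is interpreted, and every other backslash pair is copied through untouched.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def expand_x_escapes(source):
    """Expand \\xhh escapes using the proposed 2.0 rule.

    Exactly two hex digits are consumed after each \\x; anything less
    raises ValueError. Other backslash escapes are left as written.
    """
    out = []
    i = 0
    while i < len(source):
        if source.startswith("\\x", i):
            digits = source[i + 2:i + 4]
            if len(digits) < 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("invalid \\x escape at position %d" % i)
            out.append(chr(int(digits, 16)))
            i += 4  # backslash, x, and exactly two digits
        elif source[i] == "\\" and i + 1 < len(source):
            # Some other escape (including an escaped backslash):
            # copy both characters so a following "x" is not misread.
            out.append(source[i:i + 2])
            i += 2
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

Given the six source characters of \x1234, the sketch yields chr(0x12) followed by "34": the digits after the first two are ordinary text.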
+
+ Example
+
+ In 1.5.2:
+
+ >>> "\x123465" # same as "\x65"
+ 'e'
+ >>> "\x65"
+ 'e'
+ >>> "\x1"
+ '\001'
+ >>> "\x\x"
+ '\\x\\x'
+ >>>
+
+ In 2.0:
+
+ >>> "\x123465" # \x12 -> \022, "3465" left alone
+ '\0223465'
+ >>> "\x65"
+ 'e'
+ >>> "\x1"
+ [ValueError is raised]
+ >>> "\x\x"
+ [ValueError is raised]
+ >>>
+
+
+ History and Rationale
+
+ \x escapes were introduced in C as a way to specify variable-width
+ character encodings. Exactly which encodings those were, and how many
+ hex digits they required, was left up to each implementation. The
+ language simply stated that \x "consumed" *all* hex digits following,
+ and left the meaning up to each implementation. So, in effect, \x in C
+ is a standard hook to supply platform-defined behavior.
+
+ Because Python explicitly aims at platform independence, the \x escape
+ in Python (up to and including 1.6) has been treated the same way
+ across all platforms: all *except* the last two hex digits were
+ silently ignored. So the only actual use for \x escapes in Python was
+ to specify a single byte using hex notation.
+
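For contrast, the pre-2.0 behavior described above can be approximated as follows. This is a reconstruction from the text and the 1.5.2 examples in this document, not the historical C code; expand_x_escapes_legacy and HEX_DIGITS are hypothetical names.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def expand_x_escapes_legacy(source):
    """Approximate the 1.5.2 treatment of \\x in 8-bit strings.

    All hex digits following \\x are consumed, but only the last two
    determine the resulting character. A \\x with no digits is left as
    a backslash followed by the letter x.
    """
    out = []
    i = 0
    while i < len(source):
        if source.startswith("\\x", i):
            j = i + 2
            while j < len(source) and source[j] in HEX_DIGITS:
                j += 1
            digits = source[i + 2:j]
            if digits:
                out.append(chr(int(digits[-2:], 16)))
            else:
                out.append("\\x")
            i = j
        elif source[i] == "\\" and i + 1 < len(source):
            # Any other escape is copied through unchanged.
            out.append(source[i:i + 2])
            i += 2
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

Fed the source text of the 1.5.2 examples in this document, the sketch reproduces their results: \x123465 collapses to "e", \x1 yields chr(1), and \x\x comes back unchanged.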
+ Larry Wall appears to have realized that this was the only real use for
+ \x escapes in a platform-independent language, as the proposed rule for
+ Python 2.0 is in fact what Perl has done from the start (although you
+ need to run in Perl -w mode to get warned about \x escapes with fewer
+ than 2 hex digits following -- it's clearly more Pythonic to insist on
+ 2 all the time).
+
+ When Unicode strings were introduced to Python, \x was generalized so
+ as to ignore all but the last *four* hex digits in Unicode strings.
+ This caused a technical difficulty for the new regular expression engine:
+ SRE tries very hard to allow mixing 8-bit and Unicode patterns and
+ strings in intuitive ways, and it no longer had any way to guess what,
+ for example, r"\x123456" should mean as a pattern: is it asking to match
+ the 8-bit character \x56 or the Unicode character \u3456?
+
+ There are hacky ways to guess, but it doesn't end there. The ISO C99
+ standard also introduces 8-digit \U12345678 escapes to cover the entire
+ ISO 10646 character space, and it's also desired that Python 2 support
+ that from the start. But then what are \x escapes supposed to mean?
+ Do they ignore all but the last *eight* hex digits then? And if fewer
+ than 8 follow in a Unicode string, all but the last 4? And if fewer
+ than 4, all but the last 2?
+
+ This was getting messier by the minute, and the proposal cuts the
+ Gordian knot by making \x simpler instead of more complicated. Note
+ that the 4-digit generalization to \xijkl in Unicode strings was also
+ redundant, because it meant exactly the same thing as \uijkl in Unicode
+ strings. It's more Pythonic to have just one obvious way to specify a
+ Unicode character via hex notation.
+
+
+ Development and Discussion
+
+ The proposal was worked out among Guido van Rossum, Fredrik Lundh and
+ Tim Peters in email. It was subsequently explained and discussed on
+ Python-Dev under subject "Go \x yourself", starting 2000-08-03.
+ Response was overwhelmingly positive; no objections were raised.
+
+
+ Backward Compatibility
+
+ Changing the meaning of \x escapes does carry risk of breaking existing
+ code, although no instances of incompatibility have yet been discovered.
+ The risk is believed to be minimal.
+
+ Tim Peters verified that, except for pieces of the standard test suite
+ deliberately provoking end cases, there are no instances of \xabcdef...
+ with fewer or more than 2 hex digits following, in either the Python
+ CVS development tree, or in assorted Python packages sitting on his
+ machine.
+
+ It's unlikely there are any with fewer than 2, because the Reference
+ Manual implied they weren't legal (although this is debatable!). If
+ there are any with more than 2, Guido is ready to argue they were buggy
+ anyway <0.9 wink>.
+
+ Guido reported that the O'Reilly Python books *already* document that
+ Python works the proposed way, likely due to their Perl editing
+ heritage (as above, Perl worked (very close to) the proposed way from
+ its start).
+
+ Finn Bock reported that what JPython does with \x escapes is
+ unpredictable today. This proposal gives a clear meaning that can be
+ consistently and easily implemented across all Python implementations.
+
+
+ Effects on Other Tools
+
+ Believed to be none. The candidates for breakage would mostly be
+ parsing tools, but the author knows of none that worry about the
+ internal structure of Python strings beyond the approximation "when
+ there's a backslash, swallow the next character". Tim Peters checked
+ python-mode.el, the std tokenize.py and pyclbr.py, and the IDLE syntax
+ coloring subsystem, and believes there's no need to change any of
+ them. Tools like tabnanny.py and checkappend.py inherit their immunity
+ from tokenize.py.
+
+
+ Reference Implementation
+
+ The code changes are so simple that a separate patch will not be produced.
+ Fredrik Lundh is writing the code, is an expert in the area, and will
+ simply check the changes in before 2.0b1 is released.
+
+
+ BDFL Pronouncements
+
+ Yes, ValueError, not SyntaxError. "Problems with literal interpretations
+ traditionally raise 'runtime' exceptions rather than syntax errors."
+
+
+ Copyright
+
+ This document has been placed in the public domain.