[Python-checkins] CVS: python/nondist/peps pep-0223.txt,1.3,1.4

Wed, 23 Aug 2000 20:26:44 -0700

Update of /cvsroot/python/python/nondist/peps
In directory slayer.i.sourceforge.net:/tmp/cvs-serv9254

Modified Files:
	pep-0223.txt 
Log Message:
Completed, about to post.


Index: pep-0223.txt
===================================================================
RCS file: /cvsroot/python/python/nondist/peps/pep-0223.txt,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -r1.3 -r1.4
*** pep-0223.txt	2000/08/23 06:03:29	1.3
--- pep-0223.txt	2000/08/24 03:26:42	1.4
***************
*** 3,11 ****
  Version: $Revision$
  Author: tpeters@beopen.com (Tim Peters)
! Status: Draft
  Type: Standards Track
  Python-Version: 2.0
  Created: 20-Aug-2000
! Post-History:
  
  
--- 3,11 ----
  Version: $Revision$
  Author: tpeters@beopen.com (Tim Peters)
! Status: Active
  Type: Standards Track
  Python-Version: 2.0
  Created: 20-Aug-2000
! Post-History: 23-Aug-2000
  
  
***************
*** 18,21 ****
--- 18,199 ----
      compatibility with Perl regular expressions, and with minimal risk
      to existing code.
+ 
+ 
+ Syntax
+ 
+     The syntax of \x escapes, in all flavors of non-raw strings, becomes
+ 
+         \xhh
+ 
+     where h is a hex digit (0-9, a-f, A-F).  The exact syntax in 1.5.2 is
+     not clearly specified in the Reference Manual; it says
+ 
+         \xhh...
+ 
+     implying "two or more" hex digits, but one-digit forms are also
+     accepted by the 1.5.2 compiler, and a plain \x is "expanded" to
+     itself (i.e., a backslash followed by the letter x).  It's unclear
+     whether the Reference Manual intended either of the 1-digit or
+     0-digit behaviors.
+ 
+ 
+ Semantics
+ 
+     In an 8-bit non-raw string,
+         \xij
+     expands to the character
+         chr(int(ij, 16))
+     Note that this is the same as in 1.6 and before.
+ 
+     In a Unicode string,
+         \xij
+     acts the same as
+         \u00ij
+     i.e. it expands to the obvious Latin-1 character from the initial
+     segment of the Unicode space.
+ 
+     An \x not followed by at least two hex digits is a compile-time error,
+     specifically ValueError in 8-bit strings, and UnicodeError (a subclass
+     of ValueError) in Unicode strings.  Note that if an \x is followed by
+     more than two hex digits, only the first two are "consumed".  In 1.6
+     and before all but the *last* two were silently ignored.
+ 
+ 
+ Example
+ 
+     In 1.5.2:
+ 
+         >>> "\x123465"  # same as "\x65"
+         'e'
+         >>> "\x65"
+         'e'
+         >>> "\x1"
+         '\001'
+         >>> "\x\x"
+         '\\x\\x'
+         >>>
+ 
+     In 2.0:
+ 
+         >>> "\x123465" # \x12 -> \022, "3465" left alone
+         '\0223465'
+         >>> "\x65"
+         'e'
+         >>> "\x1"
+         [ValueError is raised]
+         >>> "\x\x"
+         [ValueError is raised]
+         >>>
+ 
+ 
+ History and Rationale
+ 
+     \x escapes were introduced in C as a way to specify variable-width
+     character encodings.  Exactly which encodings those were, and how many
+     hex digits they required, was left up to each implementation.  The
+     language simply stated that \x "consumed" *all* hex digits following,
+     and left the meaning up to each implementation.  So, in effect, \x in C
+     is a standard hook to supply platform-defined behavior.
+ 
+     Because Python explicitly aims at platform independence, the \x escape
+     in Python (up to and including 1.6) has been treated the same way
+     across all platforms:  all *except* the last two hex digits were
+     silently ignored.  So the only actual use for \x escapes in Python was
+     to specify a single byte using hex notation.
+ 
+     Larry Wall appears to have realized that this was the only real use for
+     \x escapes in a platform-independent language, as the proposed rule for
+     Python 2.0 is in fact what Perl has done from the start (although you
+     need to run in Perl -w mode to get warned about \x escapes with fewer
+     than 2 hex digits following -- it's clearly more Pythonic to insist on
+     2 all the time).
+ 
+     When Unicode strings were introduced to Python, \x was generalized so
+     as to ignore all but the last *four* hex digits in Unicode strings.
+     This caused a technical difficulty for the new regular expression engine:
+     SRE tries very hard to allow mixing 8-bit and Unicode patterns and
+     strings in intuitive ways, and it no longer had any way to guess what,
+     for example, r"\x123456" should mean as a pattern:  is it asking to match
+     the 8-bit character \x56 or the Unicode character \u3456?
+ 
+     There are hacky ways to guess, but it doesn't end there.  The ISO C99
+     standard also introduces 8-digit \U12345678 escapes to cover the entire
+     ISO 10646 character space, and it's also desired that Python 2 support
+     that from the start.  But then what are \x escapes supposed to mean?
+     Do they ignore all but the last *eight* hex digits then?  And if fewer
+     than 8 follow in a Unicode string, all but the last 4?  And if fewer
+     than 4, all but the last 2?
+ 
+     This was getting messier by the minute, and the proposal cuts the
+     Gordian knot by making \x simpler instead of more complicated.  Note
+     that the 4-digit generalization to \xijkl in Unicode strings was also
+     redundant, because it meant exactly the same thing as \uijkl in Unicode
+     strings.  It's more Pythonic to have just one obvious way to specify a
+     Unicode character via hex notation.
+ 
+ 
+ Development and Discussion
+ 
+     The proposal was worked out among Guido van Rossum, Fredrik Lundh and
+     Tim Peters in email.  It was subsequently explained and discussed on
+     Python-Dev under subject "Go \x yourself", starting 2000-08-03.
+     Response was overwhelmingly positive; no objections were raised.
+ 
+ 
+ Backward Compatibility
+ 
+     Changing the meaning of \x escapes does carry risk of breaking existing
+     code, although no instances of incompatibility have yet been discovered.
+     The risk is believed to be minimal.
+ 
+     Tim Peters verified that, except for pieces of the standard test suite
+     deliberately provoking end cases, there are no instances of \xabcdef...
+     with fewer or more than 2 hex digits following, in either the Python
+     CVS development tree, or in assorted Python packages sitting on his
+     machine.
+ 
+     It's unlikely there are any with fewer than 2, because the Reference
+     Manual implied they weren't legal (although this is debatable!).  If
+     there are any with more than 2, Guido is ready to argue they were buggy
+     anyway <0.9 wink>.
+ 
+     Guido reported that the O'Reilly Python books *already* document that
+     Python works the proposed way, likely due to their Perl editing
+     heritage (as above, Perl worked (very close to) the proposed way from
+     its start).
+ 
+     Finn Bock reported that what JPython does with \x escapes is
+     unpredictable today.  This proposal gives a clear meaning that can be
+     consistently and easily implemented across all Python implementations.
+ 
+ 
+ Effects on Other Tools
+ 
+     Believed to be none.  The candidates for breakage would mostly be
+     parsing tools, but the author knows of none that worry about the
+     internal structure of Python strings beyond the approximation "when
+     there's a backslash, swallow the next character".  Tim Peters checked
+     python-mode.el, the std tokenize.py and pyclbr.py, and the IDLE syntax
+     coloring subsystem, and believes there's no need to change any of
+     them.  Tools like tabnanny.py and checkappend.py inherit their immunity
+     from tokenize.py.
+ 
+ 
+ Reference Implementation
+ 
+     The code changes are so simple that a separate patch will not be produced.
+     Fredrik Lundh is writing the code, is an expert in the area, and will
+     simply check the changes in before 2.0b1 is released.
+ 
+ 
+ BDFL Pronouncements
+ 
+     Yes, ValueError, not SyntaxError.  "Problems with literal interpretations
+     traditionally raise 'runtime' exceptions rather than syntax errors."
+ 
+ 
+ Copyright
+ 
+     This document has been placed in the public domain.
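
The two behaviors contrasted in the Semantics and Example sections of the
new revision can be modeled in a few lines of Python.  This is only a rough
sketch for illustration, not the compiler code Fredrik Lundh is writing:
the function names expand_x_new and expand_x_old are hypothetical, the input
is assumed to be the raw source text of an 8-bit string literal, and every
escape other than \x (including a doubled backslash) is deliberately left
unprocessed.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def expand_x_new(raw):
    """Model of the proposed 2.0 rule: \\x takes exactly two hex digits."""
    out = []
    i = 0
    while i < len(raw):
        if raw.startswith("\\x", i):
            digits = raw[i + 2:i + 4]
            # Fewer than two hex digits is an error under the proposal.
            if len(digits) != 2 or any(c not in HEX_DIGITS for c in digits):
                raise ValueError("invalid \\x escape at position %d" % i)
            out.append(chr(int(digits, 16)))
            i += 4  # only the first two digits are consumed
        else:
            out.append(raw[i])
            i += 1
    return "".join(out)


def expand_x_old(raw):
    """Model of the 1.5.2 behavior described above (an approximation)."""
    out = []
    i = 0
    while i < len(raw):
        if raw.startswith("\\x", i):
            j = i + 2
            # All following hex digits are consumed ...
            while j < len(raw) and raw[j] in HEX_DIGITS:
                j += 1
            digits = raw[i + 2:j]
            if digits:
                # ... but only the last two (the low 8 bits) matter.
                out.append(chr(int(digits, 16) & 0xFF))
            else:
                # A plain \x "expands" to itself.
                out.append("\\x")
            i = j
        else:
            out.append(raw[i])
            i += 1
    return "".join(out)
```

Passing the raw text \x123465 to expand_x_old gives the single character
'e', while expand_x_new gives the character with code 0x12 followed by the
four characters "3465".  The raw text \x1 is accepted by expand_x_old and
raises ValueError in expand_x_new.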