[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Wed Sep 17 02:21:56 CEST 2014

R. David Murray writes:

 > > Do what, exactly? As I understand you, you treat the unknown bytes as
 > > completely opaque, not representing any characters at all. Which is
 > > what I'm saying: those are not characters.
 > 
 > Yes.  I thought you were saying that one could not treat the string with
 > smuggled bytes as if it were a string.

Guido's mantra is something like "Python's str doesn't contain
characters or even code points[1], it contains code units."  Implying
that dealing with characters (or the grapheme globs that occasionally
raise their ugly heads here) is an issue for higher-level facilities
than str to deal with.

The point being that

 > Basically, we are pretending that the each smuggled byte is single
 > character

is something of a misstatement (good enough for the present purpose of
discussing email, but not good enough for the general case of
understanding how this is supposed to work when porting the construct
to other Python implementations), while

 > for string parsing purposes...but they don't match any of our
 > parsing constants.

is precisely Pythonically correct.  You might want to add "because all
parsing constants contain only valid characters by construction."
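To make that concrete, here is a minimal sketch, assuming the smuggling
is done with the standard "surrogateescape" error handler.  The header
line and the SEPARATORS constant are made up for illustration; they are
not taken from the email package.

```python
import string

# Hypothetical header line; 0xE9 is a non-ASCII byte of unknown encoding.
raw = b"Subject: caf\xe9; menu"
text = raw.decode("ascii", errors="surrogateescape")

# A parsing constant built only from valid characters (illustrative).
SEPARATORS = ":;," + string.whitespace

# Ordinary str parsing works on the mixed string.
name, _, value = text.partition(":")

# surrogateescape maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF.
smuggled = [ch for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# None of the smuggled code units can equal a character in the constant.
matches = [ch for ch in smuggled if ch in SEPARATORS]

print(repr(name), repr(value), smuggled, matches)
```

The parse splits on the real colon and carries the opaque byte along
untouched; `matches` stays empty because no valid character lives in
the U+DC80..U+DCFF range.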

 > [*] I worried a lot that this was re-introducing the bytes/string
 > problem from python2.

It isn't, because the bytes/str problem was that given a str object
out of context you could not tell whether it was a binary blob or
text, and if text, you couldn't tell if it was external encoded text
or internal abstract text.

That ambiguity does not arise here, because the representations of
characters vs. smuggled bytes in str are disjoint sets.
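A rough sketch of what "disjoint" means in practice, again assuming
the "surrogateescape" error handler in CPython 3.  The helper
is_smuggled() is a hypothetical name introduced only for this example.

```python
def is_smuggled(ch):
    """True if ch is a lone surrogate that surrogateescape uses for a raw byte."""
    return 0xDC80 <= ord(ch) <= 0xDCFF

# Valid UTF-8 for "naïve", a space, then a stray 0xFF byte.
raw = b"na\xc3\xafve \xff"
text = raw.decode("utf-8", errors="surrogateescape")

# Each element of the str can be classified without any outside context.
kinds = [(ch, "byte" if is_smuggled(ch) else "char") for ch in text]

# The original bytes are recoverable exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")

# A strict UTF-8 decode refuses the would-be encoding of U+DCFF, so
# properly decoded text cannot land in the smuggling range.
try:
    b"\xed\xb3\xbf".decode("utf-8")
    strict_accepts_surrogate = True
except UnicodeDecodeError:
    strict_accepts_surrogate = False

print(kinds, roundtrip == raw, strict_accepts_surrogate)
```

Unlike a Python 2 str, the object itself tells you which positions are
text and which are opaque bytes.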

Footnotes: 
[1]  In Unicode terminology, a code unit is the smallest computer
object that can represent a character (this is uniquely and sanely
defined for all real Unicode transformation formats aka UTFs).  A code
point is an integer in the range 0 to 17*256*256-1 (0x10FFFF) that can
represent a character, but many code points, such as surrogates and
0xFFFF, are defined to be non-characters.  Characters are those code
points that may be assigned an interpretation as a character,
including undefined characters (private use and reserved).
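
The distinctions in this footnote can be checked from Python itself.
The following is only an illustration, assuming CPython 3.3 or later;
code_units() is a hypothetical helper, and the general categories come
from the standard unicodedata module.

```python
import sys
import unicodedata

# Largest code point: 17 planes of 256*256 code points each.
MAX_CODE_POINT = 17 * 256 * 256 - 1

def code_units(s, encoding, unit_size):
    """Number of code units needed for s in a UTF with the given unit size in bytes."""
    return len(s.encode(encoding)) // unit_size

s = "\U0001F600"  # one code point outside the Basic Multilingual Plane
units = {
    "utf-8": code_units(s, "utf-8", 1),
    "utf-16": code_units(s, "utf-16-le", 2),
    "utf-32": code_units(s, "utf-32-le", 4),
}

# General categories: Cs = surrogate, Cn = unassigned or noncharacter,
# Co = private use, Lu = uppercase letter.
categories = {
    hex(cp): unicodedata.category(chr(cp))
    for cp in (0xD800, 0xFFFF, 0xE000, 0x41)
}

print(hex(MAX_CODE_POINT), units, categories)
```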