[Python-ideas] Processing surrogates in

Stephen J. Turnbull stephen at xemacs.org
Thu May 7 12:04:34 CEST 2015


Nick Coghlan writes:

 > What "we're" working towards (where "we" ~= the Unicode consortium +
 > operating system designers + programming language designers) is a
 > world where everything "just works", and computers talk to humans in
 > each human's preferred language (or a collection of languages,
 > depending on what the human is doing), and to each other in Unicode.
 > There are then a whole host of technical and political reasons

And economic -- which really bites here because if it weren't for the
good ol' American greenback and that huge GDP and consumption
(especially of software) this thread would be all about why GB
18030[1] is so hard.  Think about *that* prospect the next time the
"complexity of Unicode" starts to bug you. :-)

 > We'll know we're done with that transition when Unicode becomes almost
 > transparently invisible, and the vast majority of programmers are once
 > again able to just deal with "text" without worrying too much about
 > how it's represented internally

That part after the "and" is a misstatement, isn't it?  Nobody using
Python 3 is concerned with how it's represented internally *at all*,
because for all the str class cares it *could* be GB 18030, and only
ord() (and esoteric features like memoryview) would ever tell you so.
And Python 3 programmers *can* treat str as "just text"[2] as long as
they stick to pure Python, and don't have to accept or generate
encoded text for *external* modules (such as Tcl/Tk) that don't know
about (all of) Unicode.  Even surrogateescapes only matter when you're
dealing with rather unruly input (or a mendacious OS).
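
To make the surrogateescape point concrete, here is a minimal sketch of what happens when undecodable bytes arrive at an I/O boundary. The byte string is a made-up example (a Latin-1 filename read as UTF-8), not something taken from this thread.

```python
# Hypothetical input: a Latin-1 encoded filename that is not valid UTF-8.
raw = b"caf\xe9.txt"

# With errors="surrogateescape", each undecodable byte is smuggled into the
# str as a lone surrogate in the range U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the lone surrogate, which is where the pain shows up.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(escaped, round_tripped == raw, strict_ok)
```

Inside pure Python the string behaves like any other str; the surrogate only becomes visible when the text has to be encoded again for something outside.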

So it's *still* all about I/O, viz. issue22555.  "Unicode" is just the
conventional curse word that programmers use when they're thinking
"HCI is hard and it sucks and I just wish it would go away!", even
though Unicode gets us 90% of the way to the solution.  (The other 10%
is where we humans have to contribute a little peace, love, and
understanding. :-)


Footnotes: 
[1]  The Chinese standard which has exactly the same character
repertoire as Unicode (because it tracks it by design), but instead of
grandfathering ISO 8859-1 code points as the first 256 code points of
Unicode, it grandfathers GB 2312 (Chinese) as the first few thousand,
and has a rather obnoxious variable width representation as a result.
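
A rough illustration of the variable width described in footnote [1], using the gb18030 and gb2312 codecs that ship with CPython. The three sample characters are arbitrary choices for the sketch.

```python
# Arbitrary samples: an ASCII letter, a common Chinese character,
# and a character outside the Basic Multilingual Plane.
samples = ["A", "\u4e2d", "\U0001F600"]

# Number of bytes each character takes in GB 18030 and in UTF-8.
gb_widths = [len(ch.encode("gb18030")) for ch in samples]
utf8_widths = [len(ch.encode("utf-8")) for ch in samples]

# The GB 2312 repertoire keeps its old two-byte codes in GB 18030.
same_as_gb2312 = "\u4e2d".encode("gb18030") == "\u4e2d".encode("gb2312")

# Everything still round-trips, since the repertoire matches Unicode.
round_trips = all(ch.encode("gb18030").decode("gb18030") == ch for ch in samples)

print(gb_widths, utf8_widths, same_as_gb2312, round_trips)
```

GB 18030 comes out as 1, 2 and 4 bytes for these samples, against 1, 3 and 4 for UTF-8.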

[2]  With a few exceptions such as dealing with Apple's icky NFD
filesystem encoding, and formatting bidirectional strings in
reStructuredText (which I haven't tried, but I bet doesn't work very
well in tables!)
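
As a sketch of the NFD exception in footnote [2]: the same visible name can be two different code point sequences, so a name returned by a filesystem that decomposes may not compare equal to the one the program wrote. The string below is a made-up example.

```python
import unicodedata

# Hypothetical filename as a program would typically write it (precomposed, NFC).
nfc_name = "caf\u00e9"

# The decomposed form (NFD): "e" followed by U+0301 COMBINING ACUTE ACCENT.
nfd_name = unicodedata.normalize("NFD", nfc_name)

# The two look identical when rendered but are different str values.
equal_raw = nfc_name == nfd_name
lengths = (len(nfc_name), len(nfd_name))

# Normalizing both sides to one form before comparing avoids the mismatch.
equal_normalized = unicodedata.normalize("NFC", nfd_name) == nfc_name

print(equal_raw, lengths, equal_normalized)
```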


