Newbie question about text encoding

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Mar 8 21:43:27 EDT 2015


Marko Rauhamaa wrote:

> Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
> 
>> Marko Rauhamaa wrote:
>>> '\udd00' is a valid str object:
>>
>> Is it though? Perhaps the bug is not UTF-8's inability to encode lone
>> surrogates, but that Python allows you to create lone surrogates in
>> the first place. That's not a rhetorical question. It's a genuine
>> question.
> 
> The problem is that no matter how you shuffle surrogates, encoding
> schemes, coding points and the like, a wrinkle always remains.

Really? Define your terms. Can you define "wrinkles", and prove that it is
impossible to remove them? What's so bad about wrinkles anyway?


> I'm reminded of number sets where you go from ℕ to ℤ to ℚ to ℝ to ℂ. But
> that's where the buck stops; traditional arithmetic functions are closed
> under ℂ.

That's simply incorrect. What's z/(0+0i)?
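
In Python terms (with j standing in for i), that division simply fails:

    >>> z = 1+2j
    >>> z / (0+0j)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ZeroDivisionError: complex division by zero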

There are many more number sets used by mathematicians, some going back to
the 1800s. Here are just a few:

* ℝ-overbar or [−∞, +∞], which adds a pair of infinities to ℝ.

* ℝ-caret or ℝ ∪ {∞}, which does the same but with a single 
  unsigned infinity.

* A similar extended version of ℂ with a single infinity.

* Split-complex or hyperbolic numbers, defined similarly to ℂ 
  except with i**2 = +1 (rather than the complex i**2 = -1).

* Dual numbers, which add a single infinitesimal number ε != 0 
  with the property that ε**2 = 0 (a small sketch follows this 
  list).

* Hyperreal numbers.

* John Conway's surreal numbers, which may be the largest 
  possible set, in the sense that it can construct all finite, 
  infinite and infinitesimal numbers. (The hyperreals and dual 
  numbers can be considered subsets of the surreals.)
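
Since dual numbers may be unfamiliar, here is a minimal sketch of their
arithmetic (the class name "Dual" is my own; nothing standard):

    class Dual:
        """A number a + b*eps, where eps != 0 but eps**2 == 0."""
        def __init__(self, a, b=0.0):
            self.a, self.b = a, b
        def __add__(self, other):
            return Dual(self.a + other.a, self.b + other.b)
        def __mul__(self, other):
            # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps + bd*eps**2,
            # and the last term vanishes because eps**2 == 0.
            return Dual(self.a*other.a, self.a*other.b + self.b*other.a)
        def __repr__(self):
            return "%g + %g*eps" % (self.a, self.b)

    eps = Dual(0.0, 1.0)
    print(eps * eps)    # prints "0 + 0*eps": eps squares to zero
                        # even though eps itself is not zero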

The process of extending ℝ to ℂ (and ℂ to ℍ, and so on) is formally known
as the Cayley–Dickson construction, and there are infinitely many algebras
(and hence number sets) which can be constructed this way; a rough sketch
of the doubling step follows the list below. The next few are:

* Hamilton's quaternions ℍ, very useful for dealing with rotations 
  in 3D space. They fell out of favour for some decades, but are now
  experiencing something of a renaissance.

* Octonions or Cayley numbers.

* Sedenions.
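
If anyone wants to see the doubling step concretely, here is a rough
sketch (the class name "CD" and the helper "conj" are my own, authors
differ on the exact multiplication convention, and I have only
exercised this as far as the quaternions):

    def conj(x):
        # Conjugate a CD pair or a plain number (int, float, complex).
        return x.conj() if isinstance(x, CD) else x.conjugate()

    class CD:
        """A pair (a, b) over a smaller algebra, multiplied by the
        Cayley-Dickson rule (a, b)(c, d) = (ac - d*b, da + bc*),
        where * denotes conjugation."""
        def __init__(self, a, b=0):
            self.a, self.b = a, b
        def conj(self):
            return CD(conj(self.a), -self.b)
        def __mul__(self, other):
            a, b, c, d = self.a, self.b, other.a, other.b
            return CD(a*c - conj(d)*b, d*a + b*conj(c))
        def __neg__(self):
            return CD(-self.a, -self.b)
        def __eq__(self, other):
            return self.a == other.a and self.b == other.b
        def __repr__(self):
            return "CD(%r, %r)" % (self.a, self.b)

    # Pairs of complex numbers behave like Hamilton's quaternions:
    i, j, k = CD(1j), CD(0j, 1+0j), CD(0j, 1j)
    assert i*j == k and j*i == -k    # multiplication is not commutative
    assert i*i == j*j == k*k == CD(-1+0j, 0j)

Repeating the same doubling on pairs of quaternions gives the
octonions, then the sedenions, and so on without end.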


> Unicode apparently hasn't found a similar closure.

Similar in what way? And why do you think this is important?

It is not a requirement for every possible byte sequence to be a valid
Unicode string, any more than it is a requirement for every possible byte
sequence to be a valid JPG image, zip archive, or ELF executable. Some byte
sequences simply are not JPG images, zip archives or ELF executables -- or
Unicode strings. So what?

Why do you think that is a problem that needs fixing by the Unicode
standard? It may be a problem that needs fixing by (for example)
programming languages, and Python invented the surrogateescape error
handler to smuggle such invalid bytes into strings. Other solutions may
exist as well. But that's not part of Unicode and it isn't a problem for
Unicode.
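
A quick demonstration in Python 3 (0xFF is just an arbitrary byte that
is not valid UTF-8):

    >>> b'\xff'.decode('utf-8', errors='surrogateescape')
    '\udcff'
    >>> '\udcff'.encode('utf-8', errors='surrogateescape')
    b'\xff'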


> That's why I think that while UTF-8 is a fabulous way to bring Unicode
> to Linux, Linux should have taken the tack that Unicode is always an
> application-level interpretation with few operating system tie-ins.

"Should have"? That is *exactly* the status quo, and while it was the only
practical solution given Linux's history, it's a horrible idea. That
Unicode is stuck on top of an OS which is unaware of Unicode is precisely
why we're left with problems like "how do you represent arbitrary bytes as
Unicode strings?".


> Unfortunately, the GNU world is busy trying to build a Unicode frosting
> everywhere. The illusion can never be complete but is convincing enough
> for application developers to forget to handle corner cases.
> 
> To answer your question, I think every code point from 0 to 1114111
> should be treated as valid and analogous. 

Your opinion isn't very relevant. What is relevant is what the Unicode
standard demands, and I think it requires that strings containing
surrogates be treated as illegal (rather like x/0 is illegal in the real
numbers). Wikipedia states:


    The Unicode standard permanently reserves these code point 
    values [U+D800 to U+DFFF] for UTF-16 encoding of the high 
    and low surrogates, and they will never be assigned a 
    character, so there should be no reason to encode them. The 
    official Unicode standard says that no UTF forms, including 
    UTF-16, can encode these code points.

    However UCS-2, UTF-8, and UTF-32 can encode these code points
    in trivial and obvious ways, and large amounts of software 
    does so even though the standard states that such arrangements
    should be treated as encoding errors. It is possible to 
    unambiguously encode them in UTF-16 by using a code unit equal
    to the code point, as long as no sequence of two code units can
    be interpreted as a legal surrogate pair (that is, as long as a
    high surrogate is never followed by a low surrogate). The 
    majority of UTF-16 encoder and decoder implementations translate
    between encodings as though this were the case.


http://en.wikipedia.org/wiki/UTF-16

So yet again we are left with the conclusion that *buggy implementations* of
Unicode cause problems, not the Unicode standard itself.
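
For what it's worth, Python's own UTF-8 codec follows the standard here
and refuses to encode a lone surrogate:

    >>> '\udd00'.encode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udd00'
    in position 0: surrogates not allowed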



> Thus Python is correct here: 
> 
>    >>> len('\udd00')
>    1
>    >>> len('\ufeff')
>    1
> 
> The alternatives are far too messy to consider.

Not at all. '\udd00' should be a SyntaxError.



-- 
Steven



