Newbie question about text encoding

Marko Rauhamaa marko at pacujo.net
Sun Mar 8 16:09:08 EDT 2015


Steven D'Aprano <steve+comp.lang.python at pearwood.info>:

> Marko Rauhamaa wrote:
>> '\udd00' is a valid str object:
>
> Is it though? Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in
> the first place. That's not a rhetorical question. It's a genuine
> question.

The problem is that no matter how you shuffle surrogates, encoding
schemes, code points and the like, a wrinkle always remains.

I'm reminded of number sets where you go from ℕ to ℤ to ℚ to ℝ to ℂ. But
that's where the buck stops; ℂ is closed under the traditional
arithmetic functions.

Unicode apparently hasn't found a similar closure.

That's why I think that while UTF-8 is a fabulous way to bring Unicode
to Linux, Linux should have taken the tack that Unicode is always an
application-level interpretation with few operating system tie-ins.
Unfortunately, the GNU world is busy trying to build a Unicode frosting
everywhere. The illusion can never be complete but is convincing enough
for application developers to forget to handle corner cases.
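
One such corner case, as a minimal sketch: Linux file names are byte
strings, and nothing forces them to be valid UTF-8. The file name below
is hypothetical. Python's "surrogateescape" error handler smuggles each
undecodable byte into the str as a lone surrogate so that the original
bytes can be recovered later:

```python
# Hypothetical file name bytes; 0xFF can never appear in valid UTF-8.
raw_name = b"report-\xff.txt"

# A strict decode refuses the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN to the lone
# surrogate U+DCNN, and maps it back to the same byte on encoding.
as_text = raw_name.decode("utf-8", errors="surrogateescape")
round_trip = as_text.encode("utf-8", errors="surrogateescape")

print(strict_ok)               # False
print(ascii(as_text))          # 'report-\udcff.txt'
print(round_trip == raw_name)  # True
```

The resulting str contains a lone surrogate, so an application that
later encodes it with the default strict handler fails.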

To answer your question, I think every code point from 0 to 1114111
should be treated as valid and analogous. Thus Python is correct here:

   >>> len('\udd00')
   1
   >>> len('\ufeff')
   1

The alternatives are far too messy to consider.
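
A short sketch of where the wrinkle shows up in practice, using only
built-in str and bytes methods: the lone surrogate is an ordinary
one-element str, but the strict UTF-8 codec rejects it, and only the
"surrogatepass" error handler will serialize it:

```python
import sys

lone = "\udd00"

# The str type accepts every code point from 0 to 1114111 (0x10FFFF).
top = chr(sys.maxunicode)
try:
    chr(sys.maxunicode + 1)
    beyond_ok = True
except ValueError:
    beyond_ok = False

# Strict UTF-8 refuses a lone surrogate.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# surrogatepass emits the three-byte form anyway and reads it back.
passed = lone.encode("utf-8", errors="surrogatepass")
restored = passed.decode("utf-8", errors="surrogatepass")

print(sys.maxunicode)        # 1114111
print(beyond_ok, strict_ok)  # False False
print(passed)                # b'\xed\xb4\x80'
print(restored == lone)      # True
```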


Marko


