Newbie question about text encoding
Marko Rauhamaa
marko at pacujo.net
Sun Mar 8 16:09:08 EDT 2015
Steven D'Aprano <steve+comp.lang.python at pearwood.info>:
> Marko Rauhamaa wrote:
>> '\udd00' is a valid str object:
>
> Is it though? Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in
> the first place. That's not a rhetorical question. It's a genuine
> question.
The problem is that no matter how you shuffle surrogates, encoding
schemes, code points and the like, a wrinkle always remains.
I'm reminded of number sets where you go from ℕ to ℤ to ℚ to ℝ to ℂ. But
that's where the buck stops; ℂ is closed under the traditional
arithmetic functions.
Unicode apparently hasn't found a similar closure.
That's why I think that while UTF-8 is a fabulous way to bring Unicode
to Linux, Linux should have taken the tack that Unicode is always an
application-level interpretation with few operating system tie-ins.
Unfortunately, the GNU world is busy trying to build a Unicode frosting
everywhere. The illusion can never be complete but is convincing enough
for application developers to forget to handle corner cases.
To answer your question, I think every code point from 0 to 1114111
should be treated as valid and analogous. Thus Python is correct here:
>>> len('\udd00')
1
>>> len('\ufeff')
1
The alternatives are far too messy to consider.
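A minimal sketch of the wrinkle under discussion, assuming CPython 3. The helper name try_utf8 is made up for illustration. It shows that the lone surrogate is a perfectly ordinary one-element str, that the strict UTF-8 codec refuses it, and that the standard 'surrogatepass' error handler lets it through and back:

```python
lone = '\udd00'   # a lone low surrogate; Python accepts it as a str


def try_utf8(s):
    """Return the strict UTF-8 encoding of s, or None if it cannot be encoded."""
    try:
        return s.encode('utf-8')
    except UnicodeEncodeError:
        return None


# Strict UTF-8 has no representation for a lone surrogate.
strict = try_utf8(lone)

# The 'surrogatepass' error handler encodes it anyway, as three bytes.
relaxed = lone.encode('utf-8', errors='surrogatepass')

# Decoding with the same handler restores the original str.
roundtrip = relaxed.decode('utf-8', errors='surrogatepass')

print(len(lone), strict, relaxed, roundtrip == lone)
```

Whether an application should ever let such bytes leave the process is a separate decision; the sketch only shows where the str type and the strict codec disagree.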
Marko