Newbie question about text encoding

Chris Angelico rosuav at gmail.com
Sun Mar 8 17:13:48 EDT 2015


On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano
<steve+comp.lang.python at pearwood.info> wrote:
> Perhaps the bug is not UTF-8's inability to encode lone
> surrogates, but that Python allows you to create lone surrogates in the
> first place. That's not a rhetorical question. It's a genuine question.

As to the notion of rejecting the construction of strings containing
these invalid code points, I'm not sure. Are there any languages out
there that have a Unicode string type that requires that all
code points be valid (no surrogates, no noncharacters such as U+FFFE,
etc)? This is the kind of thing that's usually done in an obscure
language before it hits a mainstream one.

Pike is similar to Python here. I can create a string with invalid
code points in it:

> "\uFFFE\uDD00";
(1) Result: "\ufffe\udd00"

but I can't UTF-8 encode that:

> string_to_utf8("\uFFFE\uDD00");
Character 0x0000dd00 at index 1 is in the surrogate range and therefore invalid.
Unknown program: string_to_utf8("\ufffe\udd00")
HilfeInput:1: HilfeInput()->___HilfeWrapper()

Or, using the streaming UTF-8 encoder instead of the shorthand:

> Charset.encoder("UTF-8")->feed("\uFFFE\uDD00")->drain();
Error encoding "\ufffe"[0xdd00] using utf8: Unsupported character 56576.
/usr/local/pike/8.1.0/lib/modules/_Charset.so:1:
    _Charset.UTF8enc()->feed("\ufffe\udd00")
HilfeInput:1: HilfeInput()->___HilfeWrapper()
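
For comparison, here is a rough sketch of the same experiment in Python 3
(assuming CPython; try_utf8 is a hypothetical helper written for this
example, not anything from the standard library). It suggests the two
languages agree: the string can be built, but strict UTF-8 encoding balks
at the lone surrogate. Note that U+FFFE on its own does encode, which
matches the Pike error above complaining only about index 1.

```python
# Sketch for CPython 3; try_utf8 is a hypothetical helper, not a stdlib function.
def try_utf8(text, errors="strict"):
    """Return (encoded_bytes, None) on success or (None, exception) on failure."""
    try:
        return text.encode("utf-8", errors), None
    except UnicodeEncodeError as exc:
        return None, exc

s = "\uFFFE\uDD00"            # constructing the string is allowed
data, err = try_utf8(s)       # strict UTF-8 refuses the lone surrogate
print(len(s), data, err)

# U+FFFE is a noncharacter, but UTF-8 itself can still represent it.
fffe_bytes, fffe_err = try_utf8("\uFFFE")
print(fffe_bytes)

# The "surrogatepass" error handler forces the surrogate through anyway.
forced, forced_err = try_utf8(s, "surrogatepass")
print(forced)
```

The "surrogatepass" handler is an explicit opt-in, and the bytes it
produces are not well-formed UTF-8, so other decoders may reject them.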

Does anyone know of a language where you can't even construct the string?

ChrisA


