Newbie question about text encoding

Fri Mar 6 11:45:35 EST 2015

On Sat, Mar 7, 2015 at 3:20 AM, Rustom Mody <rustompmody at gmail.com> wrote:
> C's string is not bug-prone its plain buggy as it cannot represent strings
> with nulls.
>
> I would not go that far for UTF-16.
> It is bug-inviting but it can also be implemented correctly

C's standard library string handling functions are restricted in that
they handle a 255-byte alphabet. They do not handle Unicode, they do
not handle NUL, that is simply how they are. But I never said I was
talking about the C standard library. If you type a text string into a
GUI entry field, or encode it quoted-printable and pass it to a web
server, or whatever, you shouldn't know or care about what language
the program is written in; and if that program barfs on a NUL, that's
a limitation. That limitation might be caused by its naive use of
strcpy() when it should have used memcpy(), but that's not your
problem.

It's exactly the same here: if your program chokes on an SMP
character, I don't care what your program was written in or what
library functions your program called on. All I care is that your
program - repeated for emphasis, *your* program - failed on that
input. It's up to you to choose your underlying functions
appropriately.

>> - If you are designing your own language, your implementation of Unicode
>> strings should use something like Python's FSR, or UTF-8 with tweaks to
>> make string indexing O(1) rather than O(N), or correctly-implemented
>> UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)
>
> FSR is possible in python for very specific pythonic reasons
> - dynamicness
> - immutable strings
>
> Drop either and FSR is impossible

I don't know what you mean by "dynamicness". What you do need is a
Unicode string type, such that the application program isn't aware of
the underlying bytes, but simply treats this string as a sequence of
code points. The immutability isn't technically a requirement, but it
does make the FSR much more manageable; in a language with mutable
strings, it's probably more efficient to use UTF-32 for simplicity,
but it's up to the language designer to figure that out. (It might be
best to use something like the FSR, but where strings are never
narrowed after being widened, so it'd be possible for an ASCII-only
string to be stored UTF-32. That has consequences for comparisons, but
might give a reasonable hybrid of storage and mutation performance.)

> _tkinter.TclError: character U+1f4a9 is above the range (U+0000-U+FFFF) allowed by Tcl
>
> So who/what is broken?

The exception is pretty clear on that point. Tcl can't handle SMP
characters. So it's Tcl that's broken. Unless there's evidence to the
contrary, that's what I would expect to be the case.

> Correct.
> Windows is broken for using UTF-16
> Linux is broken for conflating UTF-8 and byte string.
>
> Lot of breakage out here dont you think?
> May be related to the equation
>
> UTF-16 = UCS-2 + Duct-tape

UTF-16 is an encoding that was designed to be backward-compatible with
UCS-2, just as UTF-8 was designed to be compatible with ASCII. Call it
what you will, but backward compatibility is pretty important. Look at
things like DES3 - if you use the same key three times, it's
compatible with DES.

Linux isn't "broken" for conflating UTF-8 and byte strings. Linux is
flawed in that it defines file names to be byte strings, which means
that every file system could be different in what it actually uses as
the encoding. Since file names exist for the benefit of humans, they
should be treated as text, so we should work with them as text. But
for reasons of backward compatibility, Linux hasn't yet changed.

Windows isn't broken for using UTF-16. I think it's a poor trade-off,
given that so many file names are ASCII-only; and, of course, if any
program treats a Windows file name as UCS-2, then that program is
broken. But UTF-16 is not itself broken, any more than UTF-7 is. And
UTF-7 is a lot harder to work with.

ChrisA