python for everyday tasks

Mon Nov 25 19:09:40 EST 2013

On Tue, Nov 26, 2013 at 10:35 AM, Ben Finney <ben+python at benfinney.id.au> wrote:
> Chris Angelico <rosuav at gmail.com> writes:
>
>> (Fifteen years. It's seventeen years since Unicode 2.0, when 16-bit
>> characters were outmoded. It's about time _every_ modern language
>> followed Python's and Pike's lead and got its Unicode support right.)
>
> Most languages that already have some support for Unicode have a
> significant amount of legacy code to continue supporting, though. Python
> has the same problem: there're still heaps of Python 2 deployments out
> there, and more being installed every day, none of which do Unicode
> right.
>
> To fix Unicode support in Python, the developers and community had to
> initiate – and are still working through – a long, high-effort transition
> across a backward-incompatible change in order to get the community to
> Python 3, which finally does Unicode right.

Yes, but Python was able to start that process by creating Python 3; other
languages ought to be able to do something similar. Get the process
started. It's not going to get any easier by waiting.

And, more importantly: New languages are being developed. If their
designers look at Java, they'll see "UTF-16 is fine for them, so it'll
be fine for us", but if they look at Python, they'll see "The current
version of Python does it this way, everything else is just
maintenance mode, so this is obviously the way the Python designers
feel is right". Even if 99% of running Python code is Py2, that
message is still being sent, because Python 2.8 will never exist.

> Other language communities will likely have to do a similar huge effort,
> or forever live with nearly-right-but-fundamentally-broken Unicode
> support.
>
> See, for example, the enormous number of ECMAScript deployments in every
> user-facing browser, all with the false assumption (§2 of ECMA-262
> <URL:http://www.ecma-international.org/publications/standards/Ecma-262.htm>)
> that UTF-16 and Unicode are the same thing and nothing outside the BMP
> exists.
>
> And ECMAScript is near the front of the programming language pack in
> terms of Unicode support — most others have far more heinous flaws that
> need to be fixed by breaking backward compatibility. I wish their
> communities luck.

Yeah. I'm now of the opinion that JavaScript and ECMAScript can't be
fixed ("use strict" is entirely backward compatible, but changing
string handling wouldn't be), so it's time we had a new web browser
scripting language. Really, 1996 was long enough ago that using 16-bit
characters should be considered no less wrong than 8-bit characters.
If it weren't that we don't actually need the space any time soon, I
would consider the current limit of 1114112 code points to be a
problem too; there's really no reason to restrict ourselves based on
what UTF-16 is capable of encoding any more than we should define
Unicode based on what Code Page 437 can handle.
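
For anyone who wants to see the difference concretely, here's a rough
sketch, assuming Python 3.3 or later. The example character (U+1F600)
is just an arbitrary pick of something outside the BMP, and the
variable names are mine:

```python
import sys

# U+1F600 is an arbitrary example of a code point outside the BMP.
astral = "\U0001F600"

# Python 3.3+ counts code points, so this is a single character.
code_points = len(astral)

# UTF-16 needs a surrogate pair (two 16-bit code units) for the same
# character, which is where "one character == 16 bits" breaks down.
utf16_units = len(astral.encode("utf-16-le")) // 2

# Highest code point, and the size of the code space (0 through 0x10FFFF).
highest = sys.maxunicode
total = sys.maxunicode + 1

print(code_points, utf16_units, hex(highest), total)
```

A language that treats strings as sequences of 16-bit units would
report a length of 2 for that string; the code-space ceiling itself
comes from what surrogate pairs are able to address.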

>  \         “Nature hath given men one tongue but two ears, that we may |
>   `\          hear from others twice as much as we speak.” —Epictetus, |
> _o__)                                                      _Fragments_ |

One of my brothers just got married, and someone who's friended him on
Facebook was unaware of the invitations despite being a prolific
poster. I cited the modern equivalent of the above, namely that we
have ten fingers but only two eyes, so it's acceptable to write five
times as much as we actually bother to read...

ChrisA