[Python-ideas] Processing surrogates in

Chris Angelico rosuav at gmail.com
Fri May 8 03:40:09 CEST 2015


On Fri, May 8, 2015 at 1:31 AM, Steven D'Aprano <steve at pearwood.info> wrote:
> But the other 90% of the complexity is inherent to human languages. For
> example, you know what the lower case of "I" is, don't you? It's "i".
> But not in Turkey, which has both a dotted and dotless version:
>
>     I ı
>     İ i
>
> (Strangely, as far as I know, nobody has a dotted J or dotless j.)
>
> Consequently, Unicode has a bunch of complexity related to left-to-right
> and right-to-left writing systems, accents, joiners, variant forms, and
> other issues. But, unless you're actually writing in a language which
> needs that, or writing a word-processor application, you can usually
> ignore all of that and just treat them as "characters".
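The Turkish quirk is easy to demonstrate in Python 3, by the way: str.lower() and str.upper() apply the locale-independent default Unicode case mappings, so the Turkish tailoring never happens automatically. A quick sketch:

```python
# str.lower()/str.upper() use the default (locale-independent) Unicode
# case mappings, so "I" always maps to "i" regardless of locale:
print("I".lower())         # -> "i" (correct for English, wrong for Turkish)

# Dotless ı (U+0131) uppercases to a plain "I":
print("\u0131".upper())    # -> "I"

# Dotted İ (U+0130) lowercases to "i" + U+0307 COMBINING DOT ABOVE,
# per Unicode's SpecialCasing rules -- note the length changes:
print(len("\u0130".lower()))   # -> 2
```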

Or a transliteration script. Imagine you have a whole lot of videos
with text over them, and you'd like to transcribe that text into,
well, a text file. It's pretty easy with Latin-based scripts; just
come up with a notation for keying in diacriticals and the handful of
other characters (slashed O for Norwegian, D with bar for Vietnamese,
etc), then (optionally) perform an NFC transformation, and job's done.
Cyrillic, Greek, Elder Futhark, and even IPA can be handled fairly
readily by means of simple reversible transliterations (д becomes d, d
becomes д), with a handful of special cases (the Greek sigma has
medial (σ) and final (ς) forms, both of which translate into the Latin
letter 's'). Korean's hangul syllables are a slightly odd case,
because they can be NFC composed from individual letters, but the
decomposed forms take up more space on the page, which makes the NFC
transformation mandatory:

"hanguk" = "\u1112\u1161\u11ab\u1100\u116e\u11a8" = "\ud55c\uad6d" = "Korea"
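In Python this composition is just a unicodedata.normalize() call; a quick sketch of that "hanguk" example:

```python
import unicodedata

# Six decomposed conjoining jamo...
jamo = "\u1112\u1161\u11ab\u1100\u116e\u11a8"

# ...compose under NFC into two precomposed syllable blocks:
syllables = unicodedata.normalize("NFC", jamo)
print(syllables)                    # 한국 == "\ud55c\uad6d"
print(len(jamo), len(syllables))    # 6 code points down to 2
```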

Aside from that, all the complexities are, as Steven says, inherent to
human languages. Unicode isn't the problem; Unicode is just reflecting
the fact that people write stuff differently. Python also isn't the
problem; Python is one of my top two preferred languages for any sort
of international work (the other being Pike, and for all the same
reasons).
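The simple reversible transliteration described above (д becomes d, d becomes д), plus the sigma special case, can be sketched with str.translate. The tables here are a tiny hypothetical subset, purely for illustration; a real scheme needs the full alphabets plus rules for digraphs (e.g. ш becoming "sh"):

```python
# Tiny illustrative subset of a Cyrillic<->Latin transliteration table.
CYR_TO_LAT = str.maketrans("дка", "dka")
LAT_TO_CYR = str.maketrans("dka", "дка")

# Both medial and final sigma collapse to Latin "s" -- this direction
# is lossy, which is why it's listed as a special case, not a reversible pair.
SIGMA_TO_S = str.maketrans("\u03c3\u03c2", "ss")

print("да".translate(CYR_TO_LAT))      # da
print("da".translate(LAT_TO_CYR))      # да
print("\u03c3\u03c2".translate(SIGMA_TO_S))  # ss
```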

>> Imagine if we were starting to design the 21st century from scratch,
>> throwing away all the history?  How would we go about it?
>
> Well, for starters I would insist on re-introducing thorn þ and eth ð
> back into English :-)

Sure, that'll unify us with ancient texts, and with modern Icelandic.
But what about other languages with the same sound (IPA: θ)? European
Spanish (though not Mexican Spanish) spells it as "z" - English could
do the same, given that "s" is able to make the same sound "z" does in
English. :)

But seriously, the alphabetic languages aren't much of a problem.
Unicode can cope with European languages easily. What I'd want to
change is to use some form of phonetic system for Chinese and Japanese
languages - a system in which the written form does its best to
correspond to the spoken form, rather than the massively complex
pictorial system now in use. At the very least, I'd like to see an
alternative written form used for names, in which they're composed of
sounds; that way, there'd be a finite set of characters in use, and
it'd be far easier for us to cope with them. (The problem of a
collision would be no worse than already exists when names are said
aloud. Having multiple characters pronounced the same way is a benefit
only to the written form.) It's too late now, of course.

ChrisA
