Unicode name questions

Martin v. Loewis martin at v.loewis.de
Wed Apr 17 04:25:23 EDT 2002


Skip Montanaro <skip at pobox.com> writes:

> "Lambda" has been spelled with a "b" as long as I can remember.  

I think the Unicode charts use an ASCII transcription of the native
pronunciation of the letters, if possible. In any case, the Unicode
names for the characters are "official" in the context of Unicode.
The Unicode consortium refrains from renaming them. Between Unicode
2.0 and Unicode 3.2, no character was renamed.

Unicode 1.0 had different names, and \u039B was indeed called GREEK
CAPITAL LETTER LAMBDA in Unicode 1.0.
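
The official spelling can be checked from Python itself. This is only a
minimal sketch, assuming a current Python 3 interpreter; unicodedata is in
the standard library, but which Unicode version it carries depends on the
Python build.

```python
import unicodedata

# Ask the bundled Unicode database for the official name of U+039B.
current_name = unicodedata.name("\u039b")
print(current_name)

# The reverse direction: look the character up by its official name.
looked_up = unicodedata.lookup("GREEK CAPITAL LETTER LAMDA")
print(hex(ord(looked_up)))
```

The name comes back without the "b", matching the Unicode charts rather than
the Unicode 1.0 spelling.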

> I see both "LAMBDA" and "LAMDA" in the comments in the encodings
> modules though.

Apparently, some of the modules are based on tables which use the
Unicode 1.0 names.

> Note that
> 
>     http://www.w3.org/TR/REC-html40/sgml/entities.html
> 
> spells it with a "b".

Correct. However, this is what SGML calls this letter; apparently, the
Unicode consortium found reason to call it differently. I don't know
what the rationale was - most likely, both would be considered
"correct" by a Greek speaker.

>     Brian> And change "GREEK SMALL LETTER THETA SYMBOL" to "GREEK SMALL
>     Brian> LETTER THETA"
> 
> Note that there are both '&theta;' and '&thetasym;' HTML entities.  I found
> it here:
> 
>     http://www.htmlhelp.com/reference/html40/entities/symbols.html
> 
> but also saw it here:
> 
>     http://www.w3.org/TR/REC-html40/sgml/entities.html
> 
> where it is marked "new".  I suspect perhaps Python's codecs just haven't
> caught up yet.

No. Unicode has

0398;GREEK CAPITAL LETTER THETA;Lu;0;L;;;;;N;;;;03B8;
03B8;GREEK SMALL LETTER THETA;Ll;0;L;;;;;N;;;0398;;0398
03D1;GREEK THETA SYMBOL;Ll;0;L;<compat> 03B8;;;;N;GREEK SMALL LETTER SCRIPT THETA;;0398;;0398

The second field is the Unicode name; Python supports all three of
them. The eleventh field is the Unicode 1.0 name (if different); the
sixth field indicates that U+03D1 is a compatibility (deprecated)
character for U+03B8.
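
To make the field layout concrete, here is a rough sketch that splits the
three records above on semicolons. The helper name parse_record and the
dictionary keys are made up for illustration. Counting from zero, the name
is index 1, the decomposition is index 5, and the Unicode 1.0 name is
index 10.

```python
# The three UnicodeData.txt records quoted above, verbatim.
RECORDS = """\
0398;GREEK CAPITAL LETTER THETA;Lu;0;L;;;;;N;;;;03B8;
03B8;GREEK SMALL LETTER THETA;Ll;0;L;;;;;N;;;0398;;0398
03D1;GREEK THETA SYMBOL;Ll;0;L;<compat> 03B8;;;;N;GREEK SMALL LETTER SCRIPT THETA;;0398;;0398
"""


def parse_record(line):
    """Hypothetical helper: pick out the fields discussed in this mail."""
    fields = line.split(";")
    return {
        "code": int(fields[0], 16),      # code point, hexadecimal
        "name": fields[1],               # current Unicode name
        "decomposition": fields[5],      # e.g. "<compat> 03B8", or empty
        "unicode1_name": fields[10],     # Unicode 1.0 name, if different
    }


parsed = [parse_record(line) for line in RECORDS.splitlines()]
for rec in parsed:
    print(f"U+{rec['code']:04X}", rec["name"], "|",
          rec["decomposition"] or "-", "|", rec["unicode1_name"] or "-")
```

Only the U+03D1 record has a non-empty decomposition and a non-empty
Unicode 1.0 name.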

From the HTML definition, you'll see that &theta; is &#952; which is
&#x3b8; which is GREEK SMALL LETTER THETA. &thetasym; is &#x3d1; i.e.
GREEK THETA SYMBOL. This was called GREEK SMALL LETTER SCRIPT THETA in
the past, but never GREEK SMALL LETTER THETA SYMBOL. One should
perhaps point out this mistake to the W3C; they ought to use the
Unicode names throughout.
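
The entity-to-name mapping can be checked in a few lines. Again just a
sketch, assuming Python 3, where the HTML 4 entity table is exposed as
html.entities.name2codepoint; the variable names are mine.

```python
import unicodedata
from html.entities import name2codepoint

# Map each HTML entity name to its code point and official Unicode name.
entity_names = {}
for entity in ("theta", "thetasym"):
    codepoint = name2codepoint[entity]
    entity_names[entity] = (codepoint, unicodedata.name(chr(codepoint)))
    print(f"&{entity}; = &#{codepoint}; = U+{codepoint:04X}",
          entity_names[entity][1])
```

Neither result is GREEK SMALL LETTER THETA SYMBOL; that string appears only
in the HTML entity list's own comments.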

Unicode 3.1 adds

03F4;GREEK CAPITAL THETA SYMBOL;Lu;0;L;<compat> 0398;;;;N;;;;03B8;

as well as various MathML thetas; these are indeed not supported in
Python, yet.
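
Whether a given interpreter knows U+03F4 depends on the Unicode database it
was built with. A small check, assuming Python 3 (where
unicodedata.unidata_version reports the bundled version); on any recent
build the name should be present:

```python
import unicodedata

# Which Unicode database version does this interpreter carry?
print("Unicode database:", unicodedata.unidata_version)

# Pass a default so an unknown code point yields None instead of ValueError.
capital_theta_symbol = unicodedata.name("\u03f4", None)
print("U+03F4:", capital_theta_symbol)
```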

HTH,
Martin


