[Python-Dev] PEP 393 Summer of Code Project

Fri Sep 2 18:01:56 CEST 2011

On Sep 1, 2011, at 9:30 PM, Steven D'Aprano wrote:

> Antoine Pitrou wrote:
>> On Thursday, September 1, 2011 at 08:45 -0700, Guido van Rossum wrote:
>>> This is definitely thought of as a separate
>>> mark added to the e; ë is not a new letter. I have a feeling it's the same way for the French and Germans, but I really don't know.
>>> (Antoine? Georg?)
>> Indeed, they are not separate "letters" (they are considered the same in lexicographic order, and the French alphabet has 26 letters).
> 
> 
> On the other hand, the same doesn't necessarily apply to other languages. (At least according to Wikipedia.)
> 
> http://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics

For example, in Serbo-Croatian (Serbian, Croatian, Bosnian, Montenegrin, if you want), each of the following letters represents one distinct sound of the language.  In the Serbian Cyrillic alphabet, they are distinct symbols.  In the Latin alphabet, the corresponding letters are formed with diacritics because the base alphabet has fewer letters.

	Letter	Approximate pronunciation	Cyrillic
	------	-------------------------	--------
	č	tch in butcher			ч
	ć	ch in chapter, but softer	ћ
	dž	j in jump			џ
	đ	j in juice			ђ
	š	sh in ship			ш
	ž	s in pleasure, measure, ...	ж
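
A minimal Python sketch, offered as an illustration rather than anything from the thread, shows how the Latin letters in this table look to the standard `unicodedata` module. The helper name `describe` is hypothetical. Under NFD, č, ć, š and ž decompose into a base letter plus a combining mark, đ has no decomposition at all, and dž (assumed here to be typed as d followed by ž) is more than one code point, even though each of them is a single letter of the alphabet.

```python
import unicodedata

# Latin letters from the table, written as escapes so the code points are
# unambiguous: č, ć, dž (typed as d + ž), đ, š, ž.
LETTERS = ["\u010d", "\u0107", "d\u017e", "\u0111", "\u0161", "\u017e"]

def describe(letter):
    """Return the Unicode names of the code points in the NFD form of `letter`."""
    decomposed = unicodedata.normalize("NFD", letter)
    return [unicodedata.name(ch) for ch in decomposed]

for letter in LETTERS:
    print(letter, describe(letter))
```

So the number of code points says nothing about whether a speaker considers something one letter.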

The language has 30 sounds and 30 corresponding letters.
See the count of the letters in these tables:
- http://hr.wikipedia.org/wiki/Hrvatska_abeceda
- http://sr.wikipedia.org/wiki/Азбука

Diacritics are used in grammar books and in print (occasionally) to distinguish between four different accents of the language:

	- long rising: á,
	- short rising: à,
	- long falling: ȃ (inverted breve, *not* a circumflex â), and
	- short falling: ȁ,

especially when words that use the same sounds -- and are thus spelled with the same letters -- are next to each other.  The accents change the intonation of the whole word, not the sound of the letter.

For example: "Ja sam sȃm." -- "I am alone."

Both words "sam" contain the "a" sound, but the first one is pronounced short.  As a form of the verb "to be" it's an enclitic that takes the accent of the preceding word "Ja" ("I").  The second one is pronounced with a long falling accent.

The macron can be used to indicate the length of a *non-stressed* vowel,
e.g. ā, but is usually unnecessary in standard print.

Many languages use alphabets that are not well suited to their sound system.  The speakers of these languages have adapted alphabets to their sounds either by using letters with distinct shapes (Cyrillic letters above), or by adding diacritics to an existing shape (Latin letters above).

The new combined form is a distinct letter.  These letters have separate sections in dictionaries and their own place in the sorting order.
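
To make the sorting-order point concrete, here is a rough Python sketch of a sort key that follows the 30-letter Croatian Latin alphabet. The names `ALPHABET`, `letters_of` and `sort_key` are hypothetical. It assumes digraphs can be matched greedily, which is wrong for the few words where d+ž, l+j or n+j are separate letters, so treat it as an illustration and not as a real collation algorithm.

```python
import unicodedata

# Croatian Latin alphabet in dictionary order: 30 letters, digraphs included.
ALPHABET = [
    "a", "b", "c", "\u010d", "\u0107", "d", "d\u017e", "\u0111", "e", "f",
    "g", "h", "i", "j", "k", "l", "lj", "m", "n", "nj",
    "o", "p", "r", "s", "\u0161", "t", "u", "v", "z", "\u017e",
]
RANK = {letter: index for index, letter in enumerate(ALPHABET)}

def letters_of(word):
    """Split a word into alphabet letters, greedily matching digraphs."""
    text = unicodedata.normalize("NFC", word.lower())
    letters, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            letters.append(pair)
            i += 2
        else:
            letters.append(text[i])
            i += 1
    return letters

def sort_key(word):
    """Map each letter to its alphabet rank; unknown characters sort last."""
    return [RANK.get(letter, len(ALPHABET)) for letter in letters_of(word)]

words = ["žaba", "zima", "šuma", "sat", "đak", "džep", "dan", "ćup", "čaša", "cvet"]
print(sorted(words, key=sort_key))
```

A plain `sorted(words)` compares code points instead, which puts every word starting with č, ć, đ, š or ž after "zima".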

The diacritics that indicate an accent or length are used only above vowels and do *not* represent distinct letters.
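
The distinction matters in code too: stripping every combining mark would turn ć into c, which changes the letter. The following Python sketch (the names `ACCENT_MARKS`, `VOWELS` and `strip_accents` are hypothetical) removes only the accent and length marks listed above, and only when they sit on a vowel, which is the assumption stated in the previous sentence. Diacritics that form letters are left alone.

```python
import unicodedata

# Combining acute, grave, inverted breve, double grave, and macron.
ACCENT_MARKS = {"\u0301", "\u0300", "\u0311", "\u030f", "\u0304"}
VOWELS = set("aeiou")

def strip_accents(text):
    """Drop accent/length marks on vowels; keep letter-forming diacritics."""
    kept = []
    base = ""  # most recent non-combining character, lowercased
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) == 0:
            base = ch.lower()
        elif ch in ACCENT_MARKS and base in VOWELS:
            continue  # an accent or length mark on a vowel: not part of the letter
        kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept))

print(strip_accents("Ja sam s\u0203m."))   # the example sentence, accent removed
print(strip_accents("Petkovi\u0107"))      # the acute on c is part of the letter
```

Checking the base character is what protects ć here, since its NFD form uses the same combining acute as á.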

Best regards,

	Zvezdan Petković

P.S. Since I live in the USA, the last letter of my surname is *wrongly* spelled (ć -> c) and pronounced (ch -> k) most of the time.  :-)