How to convert 'ö' to 'oe' or 'o' (or other similar things) in a string?

Steve D'Aprano steve+python at pearwood.info
Sun Sep 25 07:06:07 EDT 2016


On Sun, 25 Sep 2016 09:08 am, Thomas 'PointedEars' Lahn wrote:

> Christian Gollwitzer wrote:
> 
>> Am 17.09.16 um 23:19 schrieb Thomas 'PointedEars' Lahn:
>>> Peng Yu wrote:
>>>> Hi, I want to convert strings in which the characters with accents
>>>> should be converted to the ones without accents.
>>> […]
>>>> […]
>>>> ./main.py Förstemann
>>>
>>> AFAIK, “ä”, “ö”, and “ü” are not accented characters in any natural
>>> language, but characters of their own (umlauts).
>>>
>>> In particular, I know for certain that they are not accented in Germanic
>>> languages.  Swedish has been mentioned; I can add my native language,
>>> German, to that list.
>> 
>> In German, they are letters,
> 
> If you read more carefully, my point was: In German, umlauts are not
> "accented characters".

The umlauts themselves are not. But the combination of vowel-plus-umlaut is
surely an "accented character", is it not? If not, what do you call it in
German?

My understanding is that both officially and popularly, native German
speakers consider that the alphabet has 26 letters (same as English), and
that "accented characters" including the vowels which take umlauts are not
distinct letters of the alphabet but mere variations of the standard
vowels.

That's to be contrasted to (say) Swedish, where ä and ö are *not* "a and o
with an accent/diacritic/umlaut/diaeresis/trema" but distinct letters of
the alphabet in their own right. That's different from ü (the "German Y")
in Swedish, which is only used for loan words and names of German origin,
and *is* considered to be a variant of u.

I use the term "accented character" here in the ignorant, non-linguist,
English-speaker sense of any letter of the alphabet with "funny dots and
squiggles" on it. To people who know what they are talking about, there is
a difference between an accent, umlaut, trema, diaeresis and other
diacritics, but for the purposes of my question, I'm not too worried about
the technical difference between these modifiers, only whether or not they
are considered a modifier on a standard letter.



[...]
> And as you have mentioned phone books, in all German-speaking phone books
> I have come across so far, “ä” does sort like “ae”, “ö” like “oe”, and “ü”
> like “ue” (this is specified in DIN 5007 as “variant 2”).
> 
> (That does not mean, however, that it is a good idea to *convert* those
> letters this way.  And there is no good reason to; all modern operating
> systems, filesystems and name schemes support Unicode.)

Alas, if we only needed to deal with modern operating systems, file systems
and naming schemes, life would be much easier. But sadly we also have to
deal with *old* operating systems, file systems and naming schemes; as well
as ASCII-only or other non-Unicode applications, plus keyboards that give
the user no obvious or easy way to add "accents" (diacritics etc.) to base
letters. See, for example:

http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/

As the author says:

"One of my clients gets address data from Europe, but most of their systems
cannot handle Latin-1 characters. With all due respect to the umlaut,
scharfes s, cedilla, and all the other fine accented characters of Europe,
all I needed to do was to prepare addresses for a shipping system."


Post offices and freight companies are used to dealing with misspelled
addresses. They can usually cope with a few missing accents.
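
For what it's worth, here is a minimal sketch of the sort of lossy
conversion being discussed, using only the standard library. The names
to_ascii and GERMAN_MAP are my own invention for illustration, not part of
any library or of the recipe linked above, and the mapping table is an
assumption covering only the German letters mentioned in this thread:

```python
import unicodedata

# Hypothetical table: German-style transliterations (ö -> oe and so on).
# Extend or replace it for other languages; it is not exhaustive.
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def to_ascii(text, german=False):
    """Return a lossy ASCII approximation of text."""
    # Compose first, so that 'o' + U+0308 is seen as the single letter 'ö'.
    text = unicodedata.normalize("NFC", text)
    if german:
        text = text.translate(GERMAN_MAP)
    # NFKD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Anything still outside ASCII becomes '?' rather than raising.
    return stripped.encode("ascii", "replace").decode("ascii")


print(to_ascii("Förstemann"))               # Forstemann
print(to_ascii("Förstemann", german=True))  # Foerstemann
```

Whether "oe" or plain "o" is the right answer depends on the language of the
text, which the string itself cannot tell you, so the caller has to decide.
Letters with no decomposition (ß without the German table, for instance)
come out as "?" in this sketch.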



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



