umlauts

Sat Oct 17 11:51:49 EDT 2009

Arian Kuschki wrote:
> Hi all
> 
> this has been bugging me for a long time and I do not seem to be able to 
> understand what to do. I always have problems when dealing with input text 
> that contains umlauts. Consider the following:
> 
> In [1]: import urllib
> 
> In [2]: f = urllib.urlopen("http://www.google.de/ig/api?weather=Muenchen")
> 
> In [3]: xml = f.read()
> 
> In [4]: f.close()
> 
> In [5]: print xml
> ------> print(xml)
> <?xml version="1.0"?><xml_api_reply version="1"><weather module_id="0" 
> tab_id="0" mobile_row="0" mobile_zipped="1" row="0" section="0" 
> ><forecast_information><cit
> y data="Munich, BY"/><postal_code data="Muenchen"/><latitude_e6 
> data=""/><longitude_e6 data=""/><forecast_date 
> data="2009-10-17"/><current_date_time data="2009-10
> -17 14:20:00 +0000"/><unit_system 
> data="SI"/></forecast_information><current_conditions><condition data="Meistens 
> bew�kt"/><temp_f data="43"/><temp_c data="6"/><h
> umidity data="Feuchtigkeit: 87�%"/><icon 
> data="/ig/images/weather/mostly_cloudy.gif"/><wind_condition data="Wind: W mit 
> Windgeschwindigkeiten von 13 km/h"/></curr
> ent_conditions><forecast_conditions><day_of_week data="Sa."/><low 
> data="1"/><high data="7"/><icon 
> data="/ig/images/weather/chance_of_rain.gif"/><condition data="V
> ereinzelt Regen"/></forecast_conditions><forecast_conditions><day_of_week 
> data="So."/><low data="-1"/><high data="8"/><icon 
> data="/ig/images/weather/chance_of_sno
> w.gif"/><condition data="Vereinzelt 
> Schnee"/></forecast_conditions><forecast_conditions><day_of_week 
> data="Mo."/><low data="-4"/><high data="8"/><icon data="/ig/i
> mages/weather/mostly_sunny.gif"/><condition data="Teils 
> sonnig"/></forecast_conditions><forecast_conditions><day_of_week 
> data="Di."/><low data="0"/><high data="8"
> /><icon data="/ig/images/weather/sunny.gif"/><condition 
> data="Klar"/></forecast_conditions></weather></xml_api_reply>
> 
> As you can see the umlauts in the XML are not displayed properly. When I want 
> to process this text (for example with xml.sax), I get error messages because 
> the parser can't read this.
> 
> I've tried to read up on this and there is a lot of information on the web, but 
> nothing seems to work for me. For example setting the coding to UTF like this: 
> # -*- coding: utf-8 -*- or using the decode() string method.

The encoding declaration of the Python source file has nothing to do with 
this. It only tells the interpreter how to decode the source file itself, 
so it is only relevant for unicode literals (in Python 2.x, that's u"...").
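
To illustrate the difference, here is a minimal sketch. The variable names 
are made up for the example, and the byte string stands in for data that 
would arrive from a file or socket. The coding line affects only how the 
literal is read; the runtime data still has to be decoded explicitly.

```python
# -*- coding: utf-8 -*-
# The line above only describes how this source file is encoded.
# It does not decode anything that the program reads at runtime.

# Hypothetical stand-in for bytes read from a file or socket.
raw = b"\xc3\xb6"

# Runtime data has to be decoded explicitly, naming its actual encoding.
text = raw.decode("utf-8")

# A unicode literal, written with an escape so it does not depend on
# the encoding of the source file at all.
literal = u"\u00f6"

print("%d bytes, %d character" % (len(raw), len(text)))
print(text == literal)
```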

> 
> I always have this kind of problem when input contains umlauts, not just in 
> this case. My locale (on Ubuntu) is en_GB.UTF-8.

If we assume the data on the website is correct (it appears to be when I 
open it in Firefox), then your problem is most probably your 
display/terminal.
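
One way to check that assumption without relying on a browser is to test 
whether the raw bytes decode as UTF-8. This is only a sketch: the function 
name is_valid_utf8 and the sample byte strings are made up for 
illustration and are not the actual response from the server.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical samples: the same word encoded in two different ways.
utf8_sample = u"bew\u00f6lkt".encode("utf-8")      # o-umlaut as \xc3\xb6
latin1_sample = u"bew\u00f6lkt".encode("latin-1")  # o-umlaut as \xf6

print(is_valid_utf8(utf8_sample))
print(is_valid_utf8(latin1_sample))
```

If such a check fails on the real data, the bytes are in some other 
encoding, and the terminal may not be the only thing to look at.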

What does this show you in your interactive interpreter?

 >>> print "\xc3\xb6"
ö

For me, it's o-umlaut, ö. This is because the above bytes are the UTF-8 
encoding of ö.

If this shows something else, you need to adjust your terminal settings.
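
As a starting point for that, you can ask Python which encodings it 
believes apply to its output. A rough sketch; the helper names 
terminal_encoding_report and can_encode are invented for this example. 
Note that this shows Python's view only, and the terminal emulator itself 
may still be configured differently.

```python
import locale
import sys


def terminal_encoding_report():
    """Collect the encodings Python believes apply to terminal output."""
    return {
        # May be None when output is redirected (notably in Python 2).
        "stdout": getattr(sys.stdout, "encoding", None),
        "locale": locale.getpreferredencoding(),
    }


def can_encode(text, encoding):
    """Return True if text can be represented in the named encoding."""
    try:
        text.encode(encoding)
    except (UnicodeEncodeError, LookupError):
        return False
    return True


report = terminal_encoding_report()
for name in sorted(report):
    print("%s: %s" % (name, report[name]))

# An o-umlaut fits in UTF-8 but not in plain ASCII.
print(can_encode(u"\u00f6", "utf-8"))
print(can_encode(u"\u00f6", "ascii"))
```

With a locale such as en_GB.UTF-8, both values would be expected to name 
UTF-8; if they do not, or if the terminal emulator is set to a different 
character set, that mismatch is a likely place to look.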

Diez