
Mark Tolonen metolone+gmane at
Sat Oct 17 16:16:45 EDT 2009

"Diez B. Roggisch" <deets at> wrote in message 
news:7jub5rF37divlU4 at
> This is weird. I looked at the site in Firefox - and it was displayed 
> correctly, including umlauts. Bringing up the info-dialog claims the page 
> is UTF-8, the XML itself says so as well (implicitly, through the missing 
> declaration of an encoding) - but it clearly is *not* UTF-8.
> One would expect Google to be better at this...
> Diez

According to the XML 1.0 specification:

"Although an XML processor is required to read only entities in the UTF-8 
and UTF-16 encodings, it is recognized that other encodings are used around 
the world, and it may be desired for XML processors to read entities that 
use them. In the absence of external character encoding information (such as 
MIME headers), parsed entities which are stored in an encoding other than 
UTF-8 or UTF-16 must begin with a text declaration..."

So UTF-8 and UTF-16 are the only encodings that may be used without an XML 
declaration when there is no external encoding information.  But here we do 
have external character encoding information, in the HTTP headers:

>>> import urllib
>>> f = urllib.urlopen("")
>>> f.headers.dict['content-type']
'text/xml; charset=ISO-8859-1'

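So the page seems correct: the Content-Type header declares ISO-8859-1, so 
the XML itself does not need an encoding declaration.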
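
For anyone who wants to decode such a response by hand, here is a minimal 
sketch of honoring the charset from the Content-Type header. It is written 
for Python 3 rather than the Python 2 session above. The function names and 
the sample body are made up for illustration, and falling back to UTF-8 
when no charset is given is a simplification (it ignores UTF-16 and any 
encoding declaration inside the XML).

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` when the header carries no charset
    (an assumption made for this sketch).
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default


def decode_body(raw_bytes, content_type):
    """Decode a response body using the externally declared charset."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Hypothetical sample data: an XML fragment with an umlaut, stored as
# ISO-8859-1 and served with the header value seen in the session above.
header = "text/xml; charset=ISO-8859-1"
body = "<name>Müller</name>".encode("iso-8859-1")

print(charset_from_content_type(header))
print(decode_body(body, header))
```

Decoding the same bytes as UTF-8 would raise UnicodeDecodeError, which 
matches the observation that the page is not UTF-8 even though it lacks an 
encoding declaration.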

