[Python-Dev] teaching the new urllib

Wed Feb 4 10:14:16 CET 2009

On Tue, Feb 03, 2009 at 06:50:44PM -0500, Tres Seaver wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
> The encoding information *is* available in the response headers, e.g.:
> 
> - ---------------------- %< ---------------------------------
> $ wget -S --spider http://knuth.luther.edu/test.html
> - --18:46:24--  http://knuth.luther.edu/test.html
>            => `test.html'
> Resolving knuth.luther.edu... 192.203.196.71
> Connecting to knuth.luther.edu|192.203.196.71|:80... connected.
> HTTP request sent, awaiting response...
>   HTTP/1.1 200 OK
>   Date: Tue, 03 Feb 2009 23:46:28 GMT
>   Server: Apache/2.0.50 (Linux/SUSE)
>   Last-Modified: Mon, 17 Sep 2007 23:35:49 GMT
>   ETag: "2fcd8-1d8-43b2bf40"
>   Accept-Ranges: bytes
>   Content-Length: 472
>   Keep-Alive: timeout=15, max=100
>   Connection: Keep-Alive
>   Content-Type: text/html; charset=ISO-8859-1
> Length: 472 [text/html]
> 200 OK
> - ---------------------- %< ---------------------------------
> 
> So, the OP's use case *could* be satisfied, assuming that the Py3K
> version of urllib sprouted a means of leveraging that header.  In this
> sense, fetching the resource over HTTP is *better* than loading it from
> a file:  information about the character set is explicit, and highly
> likely to be correct, at least for any resource people expect to render
> cleanly in a browser.

First of all, as was noted, Content-Type may have no charset parameter,
or may be omitted altogether.
But the most important problem, and the worst, is that the charset in
Content-Type may have no relation to the charset declared in the document.
Even worse, the charset declared in the document may have no relation to the
charset actually used to encode the document. :(
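
To illustrate, here is a minimal sketch of how client code might treat the
header charset as a hint rather than a fact. The names declared_charset and
decode_body and the fallback list are hypothetical, not part of urllib; the
only library call used is email.message.Message.get_content_charset().

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None  # header omitted entirely
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased name, or None if absent


def decode_body(body, content_type, fallbacks=("utf-8",)):
    """Decode bytes using the declared charset first, then the fallbacks.

    Returns (text, charset_used). The fallback order is an assumption.
    """
    candidates = []
    charset = declared_charset(content_type)
    if charset:
        candidates.append(charset)
    candidates.extend(fallbacks)
    for name in candidates:
        try:
            return body.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next candidate
    # latin-1 maps every byte to a character, so this never raises
    return body.decode("latin-1"), "latin-1"
```

Strict decoding only catches some mismatches: a wrong single-byte charset
usually decodes without any error and simply yields the wrong text.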

Remember that the headers are supplied by the HTTP server, and the server has
to read the document from a plain file, so there is no difference: there is no
magic in being an HTTP server. Of course it is right to give the web server some
hints about the charset of byte-encoded text documents, but the web server will
not stop working when the charset is unspecified or incorrect.

This use case is really important for those international segments of the
Internet that have two or more conflicting character sets for their (single)
alphabet. For example, every Russian Internet user can tell you that a browser
with no menu option to explicitly select the encoding for the current document
is completely unusable.
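
A small sketch of why the explicit override matters: the same bytes decode
without error under several Cyrillic charsets, so only the reader can tell
which result is right. The function names and the charset list are
hypothetical choices for illustration.

```python
# Common single-byte Cyrillic encodings (an illustrative, incomplete list)
RUSSIAN_CHARSETS = ("koi8-r", "cp1251", "cp866", "iso8859_5")


def decode_with_override(body, declared=None, override=None):
    """Decode bytes, letting an explicit user choice beat the declared charset."""
    return body.decode(override or declared or "utf-8", errors="replace")


def all_readings(body):
    """Decode the same bytes under every candidate charset for comparison."""
    return {name: body.decode(name, errors="replace") for name in RUSSIAN_CHARSETS}


if __name__ == "__main__":
    body = "Привет".encode("koi8-r")  # what the file on disk really contains
    # The declared charset is wrong, yet decoding raises no error
    print(decode_with_override(body, declared="cp1251"))
    # The user picks the right encoding from a menu
    print(decode_with_override(body, declared="cp1251", override="koi8-r"))
    for name, text in all_readings(body).items():
        print(name, text)
```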

-- 
Alexey Shpagin