urlencode with high characters

"Martin v. Löwis" martin at v.loewis.de
Wed Nov 2 17:23:42 EST 2005


Jim wrote:
> My understanding is that I am supposed to be able to urlencode anything
> up to the top half of latin-1 -- decimal 128-255.

I believe your understanding is incorrect. Without being able to quote
RFCs precisely, I think your understanding should be this:

- the URL literal syntax only allows for ASCII characters
- bytes with no meaning in ASCII can be quoted through %hh in URLs
- the precise meaning of such bytes in the URL is defined in the
   URL scheme, and may vary from URL scheme to URL scheme
- the http scheme does not specify any interpretation of the bytes,
   but apparently assumes that they denote characters, and follow
   some encoding - which encoding is something that the web server
   defines, when mapping URLs to resources.
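
To make the points above concrete, here is a minimal sketch of how the
same character yields different %hh sequences depending on the encoding
that is assumed. It uses urllib.parse from current Python 3 (an
assumption on my part; the module layout was different in Python 2), and
the string "Löwis" is just an illustrative value.

```python
from urllib.parse import quote, unquote

name = "L\u00f6wis"  # "Löwis"; U+00F6 is o with diaeresis

# The same character becomes different bytes, hence different %hh
# sequences, depending on the encoding chosen before quoting.
latin1_form = quote(name, encoding="latin-1")
utf8_form = quote(name, encoding="utf-8")

print(latin1_form)  # L%F6wis
print(utf8_form)    # L%C3%B6wis

# Decoding only round-trips when the same encoding is assumed on both sides.
assert unquote(latin1_form, encoding="latin-1") == name
assert unquote(utf8_form, encoding="utf-8") == name

# Reading the UTF-8 form as latin-1 silently produces two wrong characters.
decoded_wrong = unquote(utf8_form, encoding="latin-1")
```

Nothing in the URL itself says which of the two forms was intended, so
the receiving server has to know (or guess) the encoding.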

If you get the impression that this is underspecified: your impression
is correct; it is underspecified indeed.

There is a recent attempt to tighten the specification through IRIs.
The IRI RFC defines a mapping between IRIs and URIs, and it uses
UTF-8 as the encoding, not latin-1.
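
As a rough sketch of the core of that mapping, assuming Python 3: every
non-ASCII character is encoded as UTF-8 and each resulting byte is
written as %hh, while ASCII characters pass through unchanged. The
function name iri_to_uri_sketch and the example.org URL are hypothetical,
and this deliberately ignores details such as the special handling of
internationalized host names, so it is not a complete implementation of
the IRI RFC.

```python
def iri_to_uri_sketch(iri: str) -> str:
    """Percent-encode the non-ASCII characters of an IRI as UTF-8 bytes.

    Hypothetical helper for illustration only; ASCII characters,
    including existing %hh escapes, are left untouched.
    """
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # One character may become several UTF-8 bytes, each quoted as %hh.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


uri = iri_to_uri_sketch("http://example.org/L\u00f6wis")
print(uri)  # http://example.org/L%C3%B6wis
```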

Regards,
Martin


