[Python-Dev] bytes / unicode

Wed Jun 23 18:30:51 CEST 2010

On Wed, Jun 23, 2010 at 10:30 AM, Tres Seaver <tseaver at palladion.com> wrote:

>  Stephen J. Turnbull wrote:
>
> > We do need str-based implementations of modules like urllib.
>
>
> Why would that be?  URLs aren't text, and never will be.  The fact that
> to the eye they may seem to be text-ish doesn't make them text.  This
> *is* a case where "dont make me think" is a losing propsition:
> programmers who work with URLs in any non-opaque way as text are
> eventually going to be bitten by this issue no matter how hard we wave
> our hands.
>

HTML is text, and URLs are embedded in that text, so it's easy to get a URL
that is text.  Though, with a little testing, I notice that text alone can't
tell you what the right URL really is (at least the intended URL when unsafe
characters are embedded in HTML).

To test I created two pages, one in Latin-1 another in UTF-8, and put in the
link:

  ./test.html?param=Réunion

On a Latin-1 page it created a link to test.html?param=R%E9union and on a
UTF-8 page it created a link to test.html?param=R%C3%A9union (the second
link displays in the URL bar as test.html?param=Réunion but copies with
percent encoding).  Though if you link to ./Réunion.html then both pages
create UTF-8 links.  And both pages also link
http://Réunion.com<http://xn--runion-bva.com>to
http://xn--runion-bva.com/.  So really neither bytes nor text works
completely; query strings receive the encoding of the page, which would be
handled transparently if you worked on the page's bytes.  Path and domain
are consistently encoded with UTF-8 and punycode respectively and so would
be handled best when treated as text.  And of course if you are a page with
a non-ASCII-compatible encoding you really must handle encodings before the
URL is sensible.

Another issue here is that there's no "encoding" for turning a URL into
bytes if the URL is not already ASCII.  A proper way to encode a URL would
be:

(Totally as an aside, as I remind myself of new module names I notice it's
not easy to google specifically for Python 3 docs, e.g. "python 3 urlsplit"
gives me 2.6 docs)

from urllib.parse import urlsplit, urlunsplit
import encodings.idna

def encode_http_url(url, page_encoding='ASCII', errors='strict'):
    scheme, netloc, path, query, fragment = urlsplit(url)
    scheme = scheme.encode('ASCII', errors)
    auth = port = None
    if '@' in netloc:
        auth, netloc = netloc.split('@', 1)
    if ':' in netloc:
        netloc, port = netloc.split(':', 1)
    netloc = encodings.idna.ToASCII(netloc)
    if port:
        netloc = netloc + b':' + port.encode('ASCII', errors)
    if auth:
        netloc = auth.encode('UTF-8', errors) + b'@' + netloc
    path = path.encode('UTF-8', errors)
    query = query.encode(page_encoding, errors)
    fragment = fragment.encode('UTF-8', errors)
    return urlunsplit_bytes((scheme, netloc, path, query, fragment))

Where urlunsplit_bytes handles bytes (urlunsplit does not).  It's helpful
for me at least to look at that code specifically:

def urlunsplit(components):
    scheme, netloc, url, query, fragment = components
    if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
        if url and url[:1] != '/': url = '/' + url
        url = '//' + (netloc or '') + url
    if scheme:
        url = scheme + ':' + url
    if query:
        url = url + '?' + query
    if fragment:
        url = url + '#' + fragment
    return url

In this case it really would be best to have Python 2's system where things
are coerced to ASCII implicitly.  Or, more specifically, if all those string
literals in that routine could be implicitly converted to bytes using
ASCII.  Conceptually I think this is reasonable, as for URLs (at least with
HTTP, but in practice I think this applies to all URLs) the ASCII bytes
really do have meaning.  That is, '/' (*in the context of urlunsplit*)
really is \x2f specifically.  Or another example, making a GET request
really means sending the bytes \x47\x45\x54 and there is no other set of
bytes that has that meaning.  The WebSockets specification for instance
defines things like "colon":
http://tools.ietf.org/html/draft-hixie-thewebsocketprotocol-76#page-5 -- in
an earlier version they even used bytes to describe HTTP (
http://tools.ietf.org/html/draft-hixie-thewebsocketprotocol-54#page-13),
though this annoyed many people.

-- 
Ian Bicking  |  http://blog.ianbicking.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20100623/24e8c28a/attachment.html>