[Python-Dev] URL processing conformance and principles (was Re: urllib.urlopen...)

Thu Sep 16 17:50:24 CEST 2004

"Martin v. Löwis" wrote:
> You are right: URIs are meant to be written on paper. However, RFC 2396
> also acknowledges that the issue of non-ASCII characters is unresolved.
> It suggests (in 2.1) that the URI scheme should specify the
> interpretation of byte values.

Right. This part of the thread was just about how the argument to 
urllib.urlopen() should be handled when given as unicode vs str. You seemed to 
be saying it should be str because a URI is fundamentally bytes and should be 
analyzed as such, whereas I'm saying no, a URI is fundamentally characters and 
should be analyzed as such. I mentioned %-encoding and the quirk of the BNF 
just because those are aspects of the syntax that are byte-oriented and are the 
source of much confusion, and because they may have influenced your assertion.

Are we in agreement on these points?

 -  A URL/URI consists of a finite sequence of Unicode characters;

 -  urlopen(), and anything else that takes a URL/URI argument,
    must accept both str and unicode;

 -  If given unicode, each character in the string directly represents
    a character in the URL/URI and needs no interpretation;

 -  If given str, each byte in the string represents a character in
    the URL/URI according to US-ASCII interpretation;

 -  Characters or bytes outside the ASCII range, and even certain
    characters in the ASCII range, are not permitted in a URL/URI,
    and thus the interpretation of a string containing them may
    result in an exception or other unpredictable results.

If even these principles can be agreed upon, then I can submit a
documentation patch, at the very least.
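
As a rough illustration of the str/unicode principles above, here is a
minimal sketch. It assumes Python 3 naming, where bytes plays the role
that str plays in this discussion and str plays the role of unicode.
url_to_text is a hypothetical helper, not an existing urllib function,
and it only checks the ASCII-range rule, not the full URI grammar.

```python
def url_to_text(url):
    """Return a URL argument as text (hypothetical helper)."""
    if isinstance(url, bytes):
        # Each byte represents one character under US-ASCII; a byte
        # outside that range raises UnicodeDecodeError.
        return url.decode("ascii")
    if isinstance(url, str):
        # Each character directly represents a character of the URL.
        # Non-ASCII characters are not permitted, so this raises
        # UnicodeEncodeError if any are present.
        url.encode("ascii")
        return url
    raise TypeError("URL must be bytes or str")
```

Raising an exception is only one of the outcomes the last principle
allows; it is chosen here because it is the easiest to document.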

Furthermore, what about this principle?

 -  The urllib, urllib2, and urlparse modules currently do not
    claim to conform to any particular standards governing the
    interpretation of URLs; they merely acknowledge that some
    standards may be applicable. However, the intent is to provide
    standards-conformant behavior where possible, to the extent 
    that the module APIs overlap with functionality mandated by
    current standards.

    When the relevant standards become obsolete due to publication
    of updated standards (e.g. RFC 1630 -> 1738 -> 1808 -> 2396),
    the implementations *may* be updated accordingly, and users
    should expect behavior that conforms to either the current or
    obsoleted standards. Which standards are applicable to a
    particular implementation should be documented in the module
    and in its functions & classes where necessary.

And how about these?

 -  urlopen() is documented as accepting a 'url' argument that is
    the URL of 'a network object' that can be read; a file-like
    object, based on either a local file or a socket, is normally
    returned. This 'network object' may be a local file if the
    'file' scheme is used or if the URL's scheme component is omitted.

    For convenience, the 'url' argument is permitted to be given as
    a str or unicode, and may be 'absolute' or 'relative'.

    If RFC 2396 or rfc2396bis applies, then the argument is assumed to
    be what is defined in the grammar as a URI-reference. A fragment
    component, if present, is stripped (this requires a change to the
    implementation) and in all cases, the reference is resolved
    against a default base URI.

    If RFC 1808 applies (the current implementation is based largely
    on this spec, which did not clearly distinguish between a reference
    and a URI), it is what is defined in the grammar as a URL, and
    if it is relative (relativeURL in the grammar), it is considered
    to be relative to a default base URL.

    (This is essentially describing the current implementation in
    terms used by the standards).
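
    The RFC 2396 reading (strip the fragment, then resolve against a
    base) could be sketched as follows. This assumes Python 3, where
    the urlparse functions live in urllib.parse; resolve_reference is
    a hypothetical name, and the base URI is passed in explicitly
    rather than derived from the working directory.

```python
from urllib.parse import urldefrag, urljoin


def resolve_reference(reference, base):
    """Strip any fragment from a URI reference, then resolve it."""
    # urldefrag() returns the reference without its fragment, plus
    # the fragment itself, which is discarded here.
    without_fragment, _fragment = urldefrag(reference)
    # urljoin() resolves a relative reference against the base and
    # returns an absolute reference unchanged.
    return urljoin(base, without_fragment)
```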

 -  In urlopen() and the URLopener classes it depends on, the default
    base URI is obtained by converting the result of os.getcwd() to a
    URL, by some undocumented means, and resolving that URL against
    the base 'file:///'.

    (I don't think this would require a change to the implementation,
    but it is a principle that should be agreed upon and documented,
    and perhaps the nuances of getcwd vs getcwdu should be addressed).
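
    One way such a default base could be computed is sketched below.
    This is an assumption for illustration only: it uses Python 3's
    pathlib in place of the undocumented conversion mentioned above,
    and default_base_uri is a hypothetical name.

```python
import os
from pathlib import Path


def default_base_uri():
    """Return a 'file' URI for the current working directory."""
    # as_uri() percent-encodes the absolute path and prefixes the
    # 'file' scheme.
    uri = Path(os.getcwd()).as_uri()
    # A trailing slash makes relative references resolve inside the
    # directory rather than beside it.
    return uri if uri.endswith("/") else uri + "/"
```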

 -  The resolution of URIs having the 'file' scheme is undertaken on
    the local filesystem according to conventions that should be, but
    presently aren't, documented. A preferred mapping of filesystem
    paths to URIs and back should be documented for each platform.
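
    For reference, urllib already exposes a pair of conversion
    functions, pathname2url() and url2pathname(), whose results are
    platform-dependent. A minimal round-trip sketch, assuming Python 3
    (where they are in urllib.request) and a made-up file name:

```python
import os
from urllib.request import pathname2url, url2pathname

# Hypothetical file name containing a space, placed in the current
# working directory so the path is valid on any platform.
local_path = os.path.join(os.getcwd(), "a b.txt")

# Path -> URL path component (the space is percent-encoded).
url_path = pathname2url(local_path)

# URL path component -> path in the local platform's syntax.
round_tripped = url2pathname(url_path)
```

    Neither call touches the filesystem; the file does not need to
    exist.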

 -  In urlopen(), the processing of a 'url' argument that is
    syntactically absolute may be nonconformant on platforms
    that use ":" in their filesystem paths. On such platforms, if the
    first ":" in what is syntactically an absolute URL/URI appears to
    be intended for use other than as a scheme component delimiter,
    the path will be assumed to be relative. Furthermore, on Windows,
    '\', which is not allowed in a URL, or its equivalent percent-
    encoded sequence '%5C' (case-insensitive), will be interpreted as
    a '/' in the URL.

    Thus, on Windows, an argument such as r'C:\a\b\c.txt' will be
    treated as if it were 'file:///C:/a/b/c.txt' by the URLopeners.
    This is a convenience feature for the benefit of users who do
    not have the means to convert an OS path to a full 'file' URL.

    (This mostly describes current behavior, assuming we can reach
    agreement that the "C:" in the example above should be treated
    no differently than "C|").
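
    The treatment described in this item could be sketched as below.
    This is not the URLopener code; windows_arg_to_file_url is a
    hypothetical name, and the rule that a single letter followed by
    ':' or '|' is a drive rather than a scheme is an assumption made
    for the sketch. It does no percent-encoding of the path.

```python
import re


def windows_arg_to_file_url(arg):
    """Map a Windows-style path argument to a 'file' URL."""
    # Treat '\' and its percent-encoded form (%5C or %5c) as '/'.
    normalized = re.sub(r"\\|%5[Cc]", "/", arg)
    # A single letter followed by ':' or '|' is taken to be a drive.
    match = re.match(r"^([A-Za-z])[:|](/.*)?$", normalized)
    if match is None:
        # Not a drive-letter path; return the normalized argument.
        return normalized
    drive = match.group(1)
    rest = match.group(2) or "/"
    return "file:///" + drive + ":" + rest
```

    With this sketch, r'C:\a\b\c.txt' and 'C|/a/b/c.txt' both map to
    'file:///C:/a/b/c.txt', while 'http://example.org/' is unchanged.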

> As for using regular expressions in the standard library: It seems you
> believe this is discouraged. I don't know why you think so - I've never
> heard of such a constraint before (in general - in specific cases,
> submitters may have been told that alternatives are more efficient).

I was just surprised to find that regular expressions are not used much in
urllib, urllib2, and urlparse. The implementations seem to be going to a
lot of trouble to process URLs using find() and string slices. I thought
perhaps there was a good reason for this.
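
For comparison, RFC 2396 itself gives a regular expression (Appendix B)
for breaking a URI reference into components. A minimal sketch of using
it, assuming Python 3; split_uri_reference is a hypothetical name and
this is not how urlparse is implemented:

```python
import re

# Regular expression from RFC 2396, Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(reference):
    """Split a URI reference into its five components."""
    match = URI_REFERENCE.match(reference)
    # Components that are absent come back as None; the path is
    # always a string, possibly empty.
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }
```

Every group in the pattern is optional, so it matches any string,
including the empty one; it splits a reference but does not validate it.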

I must attend to other things right now; will comment on the other issues 
later.

-Mike