[Python-Dev] Re: URL processing conformance and principles (was Re: urllib.urlopen...)

Thu Sep 16 23:30:39 CEST 2004

Mike Brown wrote:
> Right. This part of the thread was just about how the argument to 
> urllib.urlopen() should be handled when given as unicode vs str. You seemed to 
> be saying it should be str because a URI is fundamentally bytes and should be 
> analyzed as such, whereas I'm saying no, a URI is fundamentally characters and 
> should be analyzed as such. I mentioned %-encoding and the quirk of the BNF 
> just because those are aspects of the syntax that are byte-oriented and are the 
> source of much confusion, and because they may have influenced your assertion.
> 
> Are we in agreement on these points?

I think I have to answer "no". The % notation is not a quirk of the BNF.
I.e. when the BNF states that an URI contains %AC (say), this does *not*
mean that the actual URI in-memory-or-on-the-wire contains the byte
\xAC. The spec actually says that the URI, in memory, on the wire, or
on paper, actually contains the three character '%', 'A', and 'C'. So
usage of that escape mechanism is *not* a result of the BNF notation;
it is the inherent desire that URIs contain only characters in ASCII.
URIs that contain non-ASCII characters have to escape them "somehow",
typically using the % notation.

>  -  A URL/URI consists of a finite sequence of Unicode characters;

No. An URI contains of a finite sequence of characters. Whether they
are Unicode or not is not specified. The assumption certainly is that
if the characters are coded (i.e. assigned to numbers), those numbers
don't have to match Unicode code points at all. An URI that consists
of KOI-8R characters would very well be possible.

>  -  urlopen(), and anything else that takes a URL/URI argument,
>     must accept both str and unicode;

Certainly.

>  -  If given unicode, each character in the string directly represents
>     a character in the URL/URI and needs no interpretation;

No. Only ASCII characters in the string need no interpretation. For
non-ASCII characters, urllib needs to assume some escaping mechanism.

>  -  If given str, each byte in the string represents a character in
>     the URL/URI according to US-ASCII interpretation;

Yes, if the bytes are meaningful in ASCII.

>  -  Characters or bytes outside the ASCII range, and even certain
>     characters in the ASCII range, are not permitted in a URL/URI,
>     and thus the interpretation of a string containing them may
>     result in an exception or other unpredictable results.

Yes.

>  -  The urllib, urllib2, and urlparse modules currently do not
>     claim to conform to any particular standards governing the
>     interpretation of URLs; they merely acknowledge that some
>     standards may be applicable. However, the intent is to provide
>     standards-conformant behavior where possible, to the extent 
>     that the module APIs overlap with functionality mandated by
>     current standards.

Yes. For input that is out of scope of existing standards, backwards

> 
>     When the relevant standards become obsolete due to publication
>     of updated standards (e.g. RFC 1630 -> 1738 -> 1808 -> 2396),
>     the implementations *may* be updated accordingly, and users
>     should expect behavior that conforms to either the current or
>     obsoleted standards. Which standards are applicable to a
>     particular implementation should be documented in the module
>     and in its functions & classes where necessary.
> 
> And how about these?
> 
>  -  urlopen() is documented as accepting a 'url' argument that is
>     the URL of 'a network object' that can be read; a file-like
>     object, based on either a local file or a socket, is normally
>     returned. This 'network object' may be a local file if the
>     'file' scheme is used or if the URL's scheme component is omitted.
> 
>     For convenience, the 'url' argument is permitted to be given as
>     a str or unicode, and may be 'absolute' or 'relative'.
> 
>     If RFC 2396 or rfc2396bis apply, then the argument is assumed to
>     be what is defined in the grammar as a URI-reference. A fragment
>     component, if present, is stripped (this requires a change to the
>     implementation) and in all cases, the reference is resolved
>     against a default base URI.
> 
>     If RFC 1808 applies (the current implementation is based largely
>     on this spec, which did not clearly distinguish between a reference
>     and a URI), it is what is defined in the grammar as a URL, and
>     if it is relative (relativeURL in the grammar), it is considered
>     to be relative to a default base URL.
> 
>     (This is essentially describing the current implementation in
>     terms used by the standards).
> 
>  -  In urlopen() and the URLOpener classes it depends on, the default
>     base URI is the result of resolving the result of os.getcwd(),
>     converted to a URL by some undocumented means, against the base
>     'file:///'. 
> 
>     (I don't think this would require a change to the implementation,
>     but it is a principle that should be agreed upon and documented,
>     and perhaps the nuances of getcwd vs getcwdu should be addressed).
> 
>  -  The resolution of URIs having the 'file' scheme is undertaken on
>     the local filesystem according to conventions that should be, but
>     presently aren't, documented. A preferred mapping of filesystem
>     paths to URIs and back should be documented for each platform.
> 
>  -  In urlopen(), the processing of a 'url' argument that is
>     syntactically absolute may be nonconformant on platforms
>     that use ":" in their filesystem paths. On such platforms, if the
>     first ":" in what is syntactically an absolute URL/URI appears to
>     be intended for use other than as a scheme component delimiter,
>     the path will assumed to be relative. Furthermore, on Windows,
>     '\', which is not allowed in a URL, or its equivalent percent-
>     encoded sequence '%5C' (case-insensitive), will be interpreted as
>     a '/' in the URL.
> 
>     Thus, on Windows, an argument such as r'C:\a\b\c.txt' will be
>     treated as if it were 'file:///C:/a/b/c.txt' by the URLOpeners.
>     This is a convenience feature for the benefit of users who do
>     not have the means to convert an OS path to full 'file' URL.
> 
>     (This mostly describes current behavior, assuming we can reach
>     agreement that the "C:" in the example above should be treated
>     no differently than "C|").
> 
> 
>>As for using regular expressions in the standard library: It seems you
>>believe this is discouraged. I don't know why you think so - I've never
>>heard of such a constraint before (in general - in specific cases,
>>submitters may have been told that alternatives are more efficient).
> 
> 
> I was just surprised to find that regular expressions are not used much in
> urllib, urllib2, and urlparse. The implementations seem to be going to a
> lot of trouble to process URLs using find() and string slices. I thought
> perhaps there was a good reason for this.
> 
> I must attend to other things right now; will comment on the other issues 
> later.
> 
> -Mike
> 
>