[Python-Dev] Re: URL processing conformance and principles (was Re: urllib.urlopen...)

Thu Sep 16 23:39:00 CEST 2004

{I hit send too early; here is the rest.}

Mike Brown wrote:
> Right. This part of the thread was just about how the argument to 
> urllib.urlopen() should be handled when given as unicode vs str. You seemed to 
> be saying it should be str because a URI is fundamentally bytes and should be 
> analyzed as such, whereas I'm saying no, a URI is fundamentally characters and 
> should be analyzed as such. I mentioned %-encoding and the quirk of the BNF 
> just because those are aspects of the syntax that are byte-oriented and are the 
> source of much confusion, and because they may have influenced your assertion.
> 
> Are we in agreement on these points?

I think I have to answer "no". The % notation is not a quirk of the BNF.
That is, when the BNF states that a URI contains %AC (say), this does
*not* mean that the actual URI in memory or on the wire contains the
byte \xAC. The spec says that the URI, in memory, on the wire, or
on paper, contains the three characters '%', 'A', and 'C'. So
usage of that escape mechanism is *not* a result of the BNF notation;
it follows from the inherent desire that URIs contain only ASCII
characters. URIs that contain non-ASCII characters have to escape them
"somehow", typically using the % notation.
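
To make the point concrete, here is a minimal sketch. It assumes a
present-day Python 3 with urllib.parse (not the urllib discussed in this
thread), and example.org is only a placeholder host.

```python
from urllib.parse import unquote_to_bytes

uri = "http://example.org/%AC"  # placeholder URI, for illustration only

# The escape occupies three ASCII characters in the URI itself.
escape = uri[-3:]
assert list(escape) == ["%", "A", "C"]

# The whole URI is therefore encodable as ASCII without error.
wire_form = uri.encode("ascii")

# The byte 0xAC only comes into existence when a consumer unescapes.
decoded = unquote_to_bytes(escape)
```

The URI as transmitted never contains the byte \xAC; only the result of
unescaping does.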

>  -  A URL/URI consists of a finite sequence of Unicode characters;

No. A URI consists of a finite sequence of characters. Whether they
are Unicode or not is not specified. The assumption certainly is that
if the characters are coded (i.e. assigned to numbers), those numbers
don't have to match Unicode code points at all. A URI that consists
of KOI8-R characters would very well be possible.
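
As an illustration only (a sketch assuming Python 3's urllib.parse, whose
unquote() takes an encoding argument), the same escaped octets name
different characters depending on the charset the reader assumes:

```python
from urllib.parse import unquote

escaped = "%CD%C9%D2"  # three escaped octets; their meaning is not fixed

# Read as KOI8-R, the octets spell a Russian word.
as_koi8r = unquote(escaped, encoding="koi8-r")

# Read as Latin-1, the very same octets are three accented capitals.
as_latin1 = unquote(escaped, encoding="latin-1")
```

Nothing in the escaped form says which reading was intended.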

>  -  urlopen(), and anything else that takes a URL/URI argument,
>     must accept both str and unicode;

Certainly.

>  -  If given unicode, each character in the string directly represents
>     a character in the URL/URI and needs no interpretation;

No. Only ASCII characters in the string need no interpretation. For
non-ASCII characters, urllib needs to assume some escaping mechanism.
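
One possible escaping mechanism, sketched under assumptions: the helper
name to_ascii_url and the choice of a default charset are hypothetical,
and the code uses Python 3's urllib.parse.quote rather than anything
urllib does today.

```python
from urllib.parse import quote

def to_ascii_url(url, encoding="utf-8"):
    """Leave ASCII characters untouched; %-escape everything else.

    The charset is an explicit assumption of the caller, because the
    URI syntax itself does not say which one to use.
    """
    return "".join(
        ch if ord(ch) < 128 else quote(ch, encoding=encoding)
        for ch in url
    )

utf8_form = to_ascii_url("http://example.org/caf\u00e9")
latin1_form = to_ascii_url("http://example.org/caf\u00e9", encoding="latin-1")
```

The two results differ, which is exactly why urllib would have to
assume (and document) one mechanism.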

>  -  If given str, each byte in the string represents a character in
>     the URL/URI according to US-ASCII interpretation;

Yes, if the bytes are meaningful in ASCII.
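
A minimal sketch of that interpretation, assuming Python 3 bytes in the
role of the str discussed here; url_from_bytes is a hypothetical helper,
not an existing urllib function.

```python
def url_from_bytes(raw):
    """Interpret each byte as a US-ASCII character, or fail loudly."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        # Bytes >= 0x80 have no meaning in ASCII, so reject them.
        raise ValueError(
            "URL contains a non-ASCII byte at offset %d" % exc.start
        ) from exc
```

Whether to raise or to guess at an encoding is the policy question; the
sketch simply picks the strict option.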

>  -  Characters or bytes outside the ASCII range, and even certain
>     characters in the ASCII range, are not permitted in a URL/URI,
>     and thus the interpretation of a string containing them may
>     result in an exception or other unpredictable results.

Yes.

>  -  The urllib, urllib2, and urlparse modules currently do not
>     claim to conform to any particular standards governing the
>     interpretation of URLs; they merely acknowledge that some
>     standards may be applicable. However, the intent is to provide
>     standards-conformant behavior where possible, to the extent 
>     that the module APIs overlap with functionality mandated by
>     current standards.

Yes. For input that is out of scope of existing standards, backwards
compatibility is desirable, unless there is a strong indication that
Python should have raised an exception for this input all along.

>     When the relevant standards become obsolete due to publication
>     of updated standards (e.g. RFC 1630 -> 1738 -> 1808 -> 2396),
>     the implementations *may* be updated accordingly, and users
>     should expect behavior that conforms to either the current or
>     obsoleted standards. Which standards are applicable to a
>     particular implementation should be documented in the module
>     and in its functions & classes where necessary.

Yes.

>  -  urlopen() is documented as accepting a 'url' argument that is
>     the URL of 'a network object' that can be read; a file-like
>     object, based on either a local file or a socket, is normally
>     returned. This 'network object' may be a local file if the
>     'file' scheme is used or if the URL's scheme component is omitted.

Yes.

>     If RFC 1808 applies (the current implementation is based largely
>     on this spec, which did not clearly distinguish between a reference
>     and a URI), it is what is defined in the grammar as a URL, and
>     if it is relative (relativeURL in the grammar), it is considered
>     to be relative to a default base URL.

This is troublesome. What is a meaningful base URL? This should be 
mentioned prominently.

>  -  In urlopen() and the URLOpener classes it depends on, the default
>     base URI is the result of resolving the result of os.getcwd(),
>     converted to a URL by some undocumented means, against the base
>     'file:///'. 
> 
>     (I don't think this would require a change to the implementation,
>     but it is a principle that should be agreed upon and documented,
>     and perhaps the nuances of getcwd vs getcwdu should be addressed).

Sounds good.
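
For the documentation, the principle could be sketched roughly as
follows. This assumes Python 3; pathlib's as_uri() stands in for the
"undocumented means" of converting the directory to a URL, and both
function names are hypothetical.

```python
from pathlib import Path
from urllib.parse import urljoin

def default_base_url():
    """Build a file: URL for the current working directory."""
    base = Path.cwd().as_uri()
    # A trailing slash makes relative references resolve inside the
    # directory rather than next to it.
    return base if base.endswith("/") else base + "/"

def resolve(reference):
    """Resolve a possibly relative URL against the default base."""
    return urljoin(default_base_url(), reference)
```

With this, resolve("data.txt") names a file in the current directory,
while an absolute URL such as "http://example.org/x" passes through
unchanged.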

>  -  The resolution of URIs having the 'file' scheme is undertaken on
>     the local filesystem according to conventions that should be, but
>     presently aren't, documented. A preferred mapping of filesystem
>     paths to URIs and back should be documented for each platform.

Ok.
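
What such a documented mapping might look like, as a sketch only: the
four helpers are hypothetical, they use Python 3's urllib.parse, and
they ignore UNC paths, drive-relative paths, and filename encodings.

```python
from urllib.parse import quote, unquote, urlsplit

def posix_path_to_file_url(path):
    # /tmp/my file.txt -> file:///tmp/my%20file.txt
    return "file://" + quote(path)

def file_url_to_posix_path(url):
    return unquote(urlsplit(url).path)

def windows_path_to_file_url(path):
    # C:\dir\file -> file:///C:/dir/file (the drive colon is kept as-is)
    return "file:///" + quote(path.replace("\\", "/"), safe="/:")

def file_url_to_windows_path(url):
    path = unquote(urlsplit(url).path).lstrip("/")
    return path.replace("/", "\\")
```

Each pair round-trips simple absolute paths; anything beyond that would
need to be spelled out per platform.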

>  -  In urlopen(), the processing of a 'url' argument that is
>     syntactically absolute may be nonconformant on platforms
>     that use ":" in their filesystem paths. On such platforms, if the
>     first ":" in what is syntactically an absolute URL/URI appears to
>     be intended for use other than as a scheme component delimiter,
>     the path will be assumed to be relative. Furthermore, on Windows,
>     '\', which is not allowed in a URL, or its equivalent percent-
>     encoded sequence '%5C' (case-insensitive), will be interpreted as
>     a '/' in the URL.

Ok.
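
The separator rule is easy to state in code. This is a sketch of the
behaviour as described above, not a copy of what urllib does;
normalize_separators is a hypothetical name.

```python
import re

# A literal backslash, or its percent-encoded form in either case.
_BACKSLASH = re.compile(r"\\|%5[Cc]")

def normalize_separators(url):
    """Treat backslash and %5C as '/' (Windows-only rule)."""
    return _BACKSLASH.sub("/", url)
```

For example, normalize_separators("C:\\temp%5Ca.txt") yields
"C:/temp/a.txt".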

>     (This mostly describes current behavior, assuming we can reach
>     agreement that the "C:" in the example above should be treated
>     no differently than "C|").

I have no problem with that. There are no one-letter URL schemata,
are there?
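
The RFC 2396 grammar does permit a one-letter scheme syntactically
(scheme = alpha *( alpha | digit | "+" | "-" | "." )), so a generic
parser will read a drive letter as one. A sketch of the heuristic,
assuming Python 3's urllib.parse; both helper names are hypothetical.

```python
import re
from urllib.parse import urlsplit

# One letter followed by ':' or '|' at the very start of the string.
_DRIVE = re.compile(r"^[A-Za-z][:|]")

def has_one_letter_scheme(url):
    """True if a generic URL parser sees a single-letter scheme."""
    return len(urlsplit(url).scheme) == 1

def looks_like_drive_path(url):
    """Heuristic: treat 'C:' and 'C|' prefixes alike, as drive letters."""
    return _DRIVE.match(url) is not None
```

So has_one_letter_scheme("C:/temp/a.txt") is true, and treating every
such input as a drive path is safe only as long as no one-letter scheme
is actually in use.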

> I must attend to other things right now; will comment on the other issues 
> later.

Take your time. This has been sitting around for many releases - one
more or less doesn't matter much in the global flow of things :-)

Regards,
Martin