[Python-Dev] URL processing conformance and principles (was Re: urllib.urlopen...)
Mike Brown
mike at skew.org
Thu Sep 16 17:50:24 CEST 2004
"Martin v. Löwis" wrote:
> You are right: URIs are meant to be written on paper. However, RFC 2396
> also acknowledges that the issue of non-ASCII characters is unresolved.
> It suggests (in 2.1) that the URI scheme should specify the
> interpretation of byte values.
Right. This part of the thread was just about how the argument to
urllib.urlopen() should be handled when given as unicode vs str. You seemed to
be saying it should be str because a URI is fundamentally bytes and should be
analyzed as such, whereas I'm saying no, a URI is fundamentally characters and
should be analyzed as such. I mentioned %-encoding and the quirk of the BNF
just because those are aspects of the syntax that are byte-oriented and are the
source of much confusion, and because they may have influenced your assertion.
Are we in agreement on these points?
- A URL/URI consists of a finite sequence of Unicode characters;
- urlopen(), and anything else that takes a URL/URI argument,
must accept both str and unicode;
- If given unicode, each character in the string directly represents
a character in the URL/URI and needs no interpretation;
- If given str, each byte in the string represents a character in
the URL/URI according to US-ASCII interpretation;
- Characters or bytes outside the ASCII range, and even certain
characters in the ASCII range, are not permitted in a URL/URI,
and thus the interpretation of a string containing them may
result in an exception or other unpredictable results.
If even these principles can be agreed upon, then I can submit a
documentation patch, at the very least.
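
To make the str/unicode principles above concrete, here is a minimal sketch of the interpretation step. It is written for Python 3, where bytes and str play the roles that str and unicode play in this message. The name as_uri_characters is hypothetical and is not part of urllib. Raising an exception is only one of the outcomes the last principle allows, and the sketch does not check for the ASCII-range characters that are also disallowed.

```python
def as_uri_characters(url):
    """Return `url` as text, reading a byte string as US-ASCII characters.

    Hypothetical helper for illustration; not part of urllib.
    """
    if isinstance(url, bytes):
        # Each byte stands for one character under US-ASCII interpretation.
        try:
            url = url.decode("ascii")
        except UnicodeDecodeError as exc:
            raise ValueError("non-ASCII byte in URL/URI") from exc
    elif not isinstance(url, str):
        raise TypeError("URL/URI must be given as bytes or str")

    # Text needs no interpretation, but characters outside the ASCII
    # range are not permitted in a URL/URI, so reject them here.
    for ch in url:
        if ord(ch) > 0x7F:
            raise ValueError("non-ASCII character %r in URL/URI" % ch)
    return url
```

With this, as_uri_characters(b"http://example.org/a") and as_uri_characters("http://example.org/a") return the same text, while a string containing "ö" raises ValueError.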
Furthermore, what about this principle?
- The urllib, urllib2, and urlparse modules currently do not
claim to conform to any particular standards governing the
interpretation of URLs; they merely acknowledge that some
standards may be applicable. However, the intent is to provide
standards-conformant behavior where possible, to the extent
that the module APIs overlap with functionality mandated by
current standards.
When the relevant standards become obsolete due to publication
of updated standards (e.g. RFC 1630 -> 1738 -> 1808 -> 2396),
the implementations *may* be updated accordingly, and users
should expect behavior that conforms to either the current or
obsoleted standards. Which standards are applicable to a
particular implementation should be documented in the module
and in its functions & classes where necessary.
And how about these?
- urlopen() is documented as accepting a 'url' argument that is
the URL of 'a network object' that can be read; a file-like
object, based on either a local file or a socket, is normally
returned. This 'network object' may be a local file if the
'file' scheme is used or if the URL's scheme component is omitted.
For convenience, the 'url' argument is permitted to be given as
a str or unicode, and may be 'absolute' or 'relative'.
If RFC 2396 or rfc2396bis applies, then the argument is assumed to
be what is defined in the grammar as a URI-reference. A fragment
component, if present, is stripped (this requires a change to the
implementation) and in all cases, the reference is resolved
against a default base URI.
If RFC 1808 applies (the current implementation is based largely
on this spec, which did not clearly distinguish between a reference
and a URI), it is what is defined in the grammar as a URL, and
if it is relative (relativeURL in the grammar), it is considered
to be relative to a default base URL.
(This is essentially describing the current implementation in
terms used by the standards).
- In urlopen() and the URLOpener classes it depends on, the default
base URI is the result of resolving the result of os.getcwd(),
converted to a URL by some undocumented means, against the base
'file:///'.
(I don't think this would require a change to the implementation,
but it is a principle that should be agreed upon and documented,
and perhaps the nuances of getcwd vs getcwdu should be addressed).
- The resolution of URIs having the 'file' scheme is undertaken on
the local filesystem according to conventions that should be, but
presently aren't, documented. A preferred mapping of filesystem
paths to URIs and back should be documented for each platform.
- In urlopen(), the processing of a 'url' argument that is
syntactically absolute may be nonconformant on platforms
that use ":" in their filesystem paths. On such platforms, if the
first ":" in what is syntactically an absolute URL/URI appears to
be intended for use other than as a scheme component delimiter,
the path will be assumed to be relative. Furthermore, on Windows,
'\', which is not allowed in a URL, or its equivalent percent-
encoded sequence '%5C' (case-insensitive), will be interpreted as
a '/' in the URL.
Thus, on Windows, an argument such as r'C:\a\b\c.txt' will be
treated as if it were 'file:///C:/a/b/c.txt' by the URLOpeners.
This is a convenience feature for the benefit of users who do
not have the means to convert an OS path to a full 'file' URL.
(This mostly describes current behavior, assuming we can reach
agreement that the "C:" in the example above should be treated
no differently than "C|").
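
The urlopen() principles above can be sketched roughly as follows, using Python 3's urllib.parse. The three function names are hypothetical, and the path-to-URI mapping shown is only one possible convention, since the preferred mapping is exactly what the principles say still needs documenting. This illustrates the proposed behavior; it is not a description of what the URLOpener classes actually do.

```python
import ntpath
import os
from urllib.parse import quote, urldefrag, urljoin


def default_base_uri(cwd=None):
    """Build a 'file' URI for a directory (default: os.getcwd()).

    Assumed convention: forward slashes, a leading and trailing '/',
    and percent-encoding of everything except '/' and ':'.
    """
    path = os.getcwd() if cwd is None else cwd
    path = path.replace(os.sep, "/")
    if not path.startswith("/"):
        path = "/" + path  # e.g. 'C:/a' becomes '/C:/a'
    if not path.endswith("/"):
        path += "/"  # so relative references resolve inside the directory
    return "file://" + quote(path, safe="/:")


def resolve_reference(ref, base):
    """Strip any fragment from `ref`, then resolve it against `base`."""
    ref_without_fragment, _fragment = urldefrag(ref)
    return urljoin(base, ref_without_fragment)


def windows_path_to_file_uri(path):
    """Map a Windows path such as r'C:\\a\\b\\c.txt' to a 'file' URI."""
    drive, rest = ntpath.splitdrive(path)  # ('C:', '\\a\\b\\c.txt')
    rest = rest.replace("\\", "/")  # '\' is not allowed in a URL
    return "file:///" + drive + quote(rest)
```

For example, resolve_reference("docs/a.txt#sec2", default_base_uri("/home/mike")) gives 'file:///home/mike/docs/a.txt', and windows_path_to_file_uri(r"C:\a\b\c.txt") gives 'file:///C:/a/b/c.txt', matching the example in the last principle.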
> As for using regular expressions in the standard library: It seems you
> believe this is discouraged. I don't know why you think so - I've never
> heard of such a constraint before (in general - in specific cases,
> submitters may have been told that alternatives are more efficient).
I was just surprised to find that regular expressions are not used much in
urllib, urllib2, and urlparse. The implementations seem to be going to a
lot of trouble to process URLs using find() and string slices. I thought
perhaps there was a good reason for this.
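
For comparison, this is roughly what the regular-expression approach looks like. The pattern is the one given in RFC 2396, Appendix B, for splitting a URI reference into components; the function name split_uri_reference is hypothetical, and this is not how urlparse is implemented.

```python
import re

# Pattern from RFC 2396, Appendix B. Groups 2, 4, 5, 7 and 9 are the
# scheme, authority, path, query and fragment components.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(ref):
    """Return (scheme, authority, path, query, fragment).

    A component that is absent is None; the path is always a string,
    possibly empty.
    """
    match = URI_REFERENCE.match(ref)  # every part is optional, so this always matches
    return (
        match.group(2),
        match.group(4),
        match.group(5),
        match.group(7),
        match.group(9),
    )
```

The RFC's own example, "http://www.ics.uci.edu/pub/ietf/uri/#Related", splits into scheme "http", authority "www.ics.uci.edu", path "/pub/ietf/uri/", no query, and fragment "Related". Note that r"C:\a\b\c.txt" splits with "C" as the scheme, which is the Windows ambiguity described in the principles above.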
I must attend to other things right now; will comment on the other issues
later.
-Mike