[Web-SIG] PEP 444 (aka Web3)

Thu Sep 16 20:00:52 CEST 2010

On Thu, Sep 16, 2010 at 12:35 PM, Guido van Rossum <guido at python.org> wrote:

> On Thu, Sep 16, 2010 at 10:01 AM, Ian Bicking <ianb at colorstudy.com> wrote:
> > Well, reiterating some things I've said before:
> >
> > * This is clearly just WSGI slightly reworked, why the new name?
> > * Why byte values in the environ?  No one has offered any real reason they
> > are better than native strings.  I keep asking people to offer a reason,
> > *and no one ever does*.  It's just hyperbole and distraction.  Frankly I'm
> > feeling annoyed.  So far my experience makes me believe using native strings
> > will make it easier to port and support libraries across 2 and 3.
>
> Hm. IIUC the proposal is to implicitly assume Latin1 when decoding the
> bytes to Unicode. I worry that this will just perpetuate mojibake and
> other atrocities committed in Python 2.
>

I was reading http://python.org/dev/peps/pep-0444/ -- is there another
revision under discussion?  This seems to explicitly say all environ values
will be bytes.  There have been other str-oriented proposals, including
mod_wsgi's implementation.

There is consensus that request and response bodies should be bytes.  So
really we're talking about whether headers and status are bytes or native
strings.  Most HTTP headers can only contain sensible characters in ASCII,
and while anyone can submit anything in a header I'm not aware of it being a
problem that, e.g., someone submits a Cache-Control header with non-ASCII
values.

There are a small number of headers that can reasonably contain Latin1
characters.  Latin1 is specified in HTTP, and in a few instances RFC2047
encoding is allowed, though I don't believe anyone proposes that servers
should try to handle RFC2047 (I think CherryPy does or did do this, but as
far as I know Robert Brewer, who is in charge of that project, supports
removing it).  Headers that can reasonably contain RFC2047 can be decoded
at the application level.
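
As a rough illustration of decoding RFC2047 at the application level, here is
a minimal Python 3 sketch using the standard library's email.header module.
The function name decode_rfc2047 is hypothetical and not part of any proposal;
it assumes the server hands the header value over as a native string.

```python
from email.header import decode_header, make_header


def decode_rfc2047(value):
    """Decode RFC 2047 encoded-words in a header value (hypothetical helper).

    `value` is assumed to be a native string as delivered by the server.
    Values without encoded-words pass through unchanged.
    """
    return str(make_header(decode_header(value)))


if __name__ == "__main__":
    print(decode_rfc2047("=?utf-8?q?caf=C3=A9?="))
    print(decode_rfc2047("plain text"))
```

The server itself never has to know about RFC2047 here; only an application
that cares about such a header pays the cost of decoding it.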

The Cookie header does frequently contain incorrect encodings, but to handle
this you have to parse the header as bytes or latin1 (all the meaningful
characters are the same in both cases) and then decode/transcode values
after parsing.  Latin1 imposes only a small speedbump for a header that
already has a bunch of speedbumps.
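
The parse-then-transcode step might look roughly like this in Python 3. This
is only a sketch under assumptions: the helper names are hypothetical, the
parsing is deliberately naive (no quoting or attribute handling), and UTF-8 is
just a guess at what the client really sent.

```python
def recover_cookie_value(latin1_value, encoding="utf-8"):
    """Transcode one cookie value that arrived as a latin1-decoded string.

    Encoding back to latin1 recovers the original bytes exactly, because
    latin1 maps every byte to exactly one code point.
    """
    raw = latin1_value.encode("latin-1")
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Not valid in the guessed encoding; keep the latin1 reading.
        return latin1_value


def parse_cookie_header(header):
    """Naive Cookie header parser (hypothetical, for illustration only)."""
    cookies = {}
    for part in header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep:
            cookies[name] = recover_cookie_value(value)
    return cookies


if __name__ == "__main__":
    # Simulate a server decoding UTF-8 bytes from the wire as latin1.
    header = "name=caf\u00e9; id=42".encode("utf-8").decode("latin-1")
    print(parse_cookie_header(header))
```

The delimiters (";" and "=") are ASCII, so the structure parses identically
whether the header is treated as bytes or as latin1; only the values need a
second pass.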

The other case where Latin1 is not appropriate is the URL-decoded path: WSGI
1's SCRIPT_NAME and PATH_INFO.  This proposal removes those.  The
URL-encoded values are ASCII-safe, or at least could be safely normalized to
ASCII at the server level.
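
To make that concrete, here is a small Python 3 sketch of the two halves: a
server-side normalization that percent-encodes stray non-ASCII bytes, and an
application-side decode of the still-encoded path. Both function names are
hypothetical, and the choice of UTF-8 in the application is an assumption, not
something the proposal specifies.

```python
from urllib.parse import quote, unquote


def normalize_to_ascii(raw_path_bytes):
    """Server side: percent-encode any byte that is not ASCII-safe.

    "%" is kept as safe so that existing escapes are not double-encoded.
    """
    return quote(raw_path_bytes, safe="/%")


def decode_path(encoded_path, encoding="utf-8"):
    """Application side: decode a URL-encoded, ASCII-only path."""
    encoded_path.encode("ascii")  # raises UnicodeEncodeError if not ASCII
    return unquote(encoded_path, encoding=encoding, errors="strict")


if __name__ == "__main__":
    encoded = normalize_to_ascii(b"/caf\xc3\xa9/menu")
    print(encoded)
    print(decode_path(encoded))
```

Because the encoded form is pure ASCII, it reads the same as bytes, latin1, or
a native string, and the application picks the real encoding when it unquotes.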

-- 
Ian Bicking  |  http://blog.ianbicking.org