[Baypiggies] urllib.urlencode and encoding

Thu Apr 19 20:22:57 CEST 2007

On 4/19/07, Tung Wai Yip <tungwaiyip at yahoo.com> wrote:
> > On Apr 18, 2007, at 5:08 PM, Keith Dart ♂ wrote:
> >> >    When a new URI scheme defines a component that represents textual
> >> >    data consisting of characters from the Universal Character Set
> >> >    [UCS], the data should first be encoded as octets according to the
> >> >    UTF-8 character encoding [STD63]; then only those octets that do not
> >> >    correspond to characters in the unreserved set should be
> >> >    percent-encoded.  For example, the character A would be represented
> >> >    as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be
> >> >    represented as "%C3%80", and the character KATAKANA LETTER A would
> >> >    be represented as "%E3%82%A2".
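
The three examples in the RFC text quoted above can be reproduced with a
short sketch. It assumes Python 3.7 or later, where the function is
urllib.parse.quote (the Python 2 urllib.quote discussed in this thread works
on byte strings instead). The helper name rfc3986_quote is made up for
illustration.

```python
from urllib.parse import quote


def rfc3986_quote(text):
    """Hypothetical helper: encode text as UTF-8 octets, then
    percent-encode every octet outside the unreserved set."""
    octets = text.encode("utf-8")
    # safe="" means "/" is escaped as well; ASCII letters, digits and
    # "-._~" are never escaped by quote() on Python 3.7+.
    return quote(octets, safe="")


print(rfc3986_quote("A"))        # A
print(rfc3986_quote("\u00c0"))   # %C3%80  (LATIN CAPITAL LETTER A WITH GRAVE)
print(rfc3986_quote("\u30a2"))   # %E3%82%A2  (KATAKANA LETTER A)
```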
>
> Thanks Keith for the heads up. One issue I regularly have is tracking down
> the lineage of RFCs. When I find RFC X, I am often not aware of an RFC Y
> that supersedes it. It doesn't help that historically many documents
> point to RFC X, while RFC X itself has no link to RFC Y. Try following
> the link at the bottom of the urlparse module documentation. It does not
> lead to RFC 3986.
>
>    http://docs.python.org/lib/module-urlparse.html
>
>
> On Wed, 18 Apr 2007 21:15:34 -0700, David Reid <dreid at dreid.org> wrote:
> > The key piece of information here is "When a new URI scheme": the RFC
> > (AFAICT) makes no mention of what to do about old schemes, such as
> > HTTP.  In fact the HTML4 spec makes its own claims about the %-encoding
> > that results from form submission:
> >
> > http://www.w3.org/TR/html4/interact/forms.html
> >
> >      accept-charset = charset list [CI]
> >          This attribute specifies the list of character encodings for
> > input data that is accepted by the server processing this form. The
> > value is a space- and/or comma-delimited list of charset values. The
> > client must interpret this list as an exclusive-or list, i.e., the
> > server is able to accept any single character encoding per entity
> > received.
> > The default value for this attribute is the reserved string
> > "UNKNOWN". User agents may interpret this value as the character
> > encoding that was used to transmit the document containing this FORM
> > element.
> >
> > So I think it's still incorrect for urllib to make any such
> > assumptions as to the data being UTF-8. (Though I hope it won't be in
> > the future.)
> >
> > - -David
> > http://dreid.org
>
> I think RFC 3986 says a character should be encoded in UTF-8 only if it is
> from the UCS. But it is also legitimate to use other character sets, for
> example as in the HTML4 spec David has pointed out. Say you are writing a
> screen scraper for a Japanese website: you should use the character
> encoding the website expects, which is not necessarily UTF-8.

Ok, thanks for all your comments guys.  David, thanks for the RFC
quotes.  If I understand things correctly, because the rest of my page
is all working correctly using UTF-8, I can call .encode('UTF-8') on my
parameters before passing them to urlencode.  However, it doesn't make
sense to put that .encode inside urlencode.
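
Roughly, the idea is that the caller picks the charset and urlencode only
ever sees bytes. Here is a minimal sketch, assuming Python 3's
urllib.parse.urlencode (spelled urllib.urlencode in Python 2); the names
encode_query and charset are invented for this example.

```python
from urllib.parse import urlencode  # Python 2: urllib.urlencode


def encode_query(params, charset="utf-8"):
    """Hypothetical helper: encode each key and value to bytes in
    `charset`, then let urlencode percent-encode the raw octets."""
    encoded = [(k.encode(charset), v.encode(charset)) for k, v in params]
    return urlencode(encoded)


# A page that is UTF-8 throughout:
print(encode_query([("name", "\u00c0")]))                # name=%C3%80
# A site that expects Latin-1 form data instead:
print(encode_query([("name", "\u00c0")], "iso-8859-1"))  # name=%C0
```

Keeping the charset as an explicit argument at the call site matches the
point above: urlencode itself makes no assumption about the encoding.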

>Welcome to the tower of babel!

I was reading <http://www.mozilla.org/docs/web-developer/faq.html#accept>
the other day, and I was pondering the fact that we can't even agree
on versions of HTML.  Mozilla *still* recommends HTML 4.01 over XHTML.
Since HTML is a language used to transport content, I recognized that
this too was a case of the Tower of Babel.  Upon realizing this, in my
head, I heard a little voice say, "Gotcha!"

*sigh*
-jj

-- 
"'Software Engineering' is something of an oxymoron.  It's very
difficult to have real engineering before you have physics, and there
isn't anything even close to a physics for software." -- L. Peter
Deutsch