urllib.urlencode wrongly encoding ± character

Fri Apr 7 04:04:08 EDT 2006

Evren Esat Ozkan wrote:

> Ok, I think this code snippet enough to show what i said;
>
> ===================================
>
> #!/usr/bin/env python
> #  -*- coding: utf-8 -*-
> #Change utf-8 to latin-1
> #Or move variable decleration to another file than import it
>
> val='00090±NO:±H±H±H±H±'
>
> from urllib import urlencode
>
> data={'key':val}
>
> print urlencode(data)
>
> ===================================

did you cut and paste this into your mail program?  because the file
I got was ISO-8859-1 encoded:

    Content-Type: text/plain; charset="iso-8859-1"

and uses a single byte to store each "±", and produces

    key=00090%B1NO%3A%B1H%B1H%B1H%B1H%B1

when I run it, which is the expected result.

I think you're still not getting what's going on here, so let's try again:

- the urlencode function doesn't care about encodings; it translates
the bytes it gets one by one.  if you pass in chr(0xB1), you get %B1
in the output.

- it's your editor that decides how that "±" you typed in the original
script are stored on disk; it may use one ISO-8859-1 bytes, two
UTF-8 bytes, or something else.

- the coding directive doesn't affect non-Unicode string literals in
Python.  in an 8-bit string, Python only sees a number of bytes.

- the urlencode function only cares about the bytes.

since you know that you want to use ISO-8859-1 encoding for your
URL, and you seem to insist on typing the "±" characters in your code,
the most portable (and editor-independent) way to write your code is
to use Unicode literals when building the string, and explicitly convert
to ISO-8859-1 on the way out.

    # build the URL as a Unicode string
    val = u'00090±NO:±H±H±H±H±'

    # encode as 8859-1 (latin-1)
    val = val.encode("iso-8859-1")

    from urllib import urlencode
    data={'key':val}
    print urlencode(data)

    key=00090%B1NO%3A%B1H%B1H%B1H%B1H%B1

this will work the same way no matter what character set you use to
store the Python source file, as long as the coding directive matches
what your editor is actually doing.

if you want to make your code 100% robust, forget the idea of putting
non-ascii characters in string literals, and use \xB1 instead:

    val = '00090\xb1NO:\xb1H\xb1H\xb1H\xb1H\xb1'

    # no need to encode, since the byte string is already iso-8859-1

    from urllib import urlencode
    data={'key':val}
    print urlencode(data)

    key=00090%B1NO%3A%B1H%B1H%B1H%B1H%B1

hope this helps!

</F>