[Python-Dev] (Not) delaying the 3.2 release

Thu Sep 16 19:21:33 CEST 2010

On Thu, Sep 16, 2010 at 8:42 AM, Toshio Kuratomi <a.badger at gmail.com> wrote:
> On Thu, Sep 16, 2010 at 09:52:48AM -0400, Barry Warsaw wrote:
>> On Sep 16, 2010, at 11:28 PM, Nick Coghlan wrote:
>>
>> >There are some APIs that should be able to handle bytes *or* strings,
>> >but the current use of string literals in their implementation means
>> >that bytes don't work. This turns out to be a PITA for some networking
>> >related code which really wants to be working with raw bytes (e.g.
>> >URLs coming off the wire).
>>
>> Note that email has exactly the same problem.  A general solution -- even if
>> embodied in *well documented* best-practices and convention -- would really
>> help make the stdlib work consistently, and I bet third party libraries too.
>>
> I too await a solution with bated breath :-) I've been working on
> documenting best practices for APIs and Unicode and for this type of
> function (take bytes or unicode and output the same type), knowing the
> encoding seems like a requirement in most cases:
>
> http://packages.python.org/kitchen/designing-unicode-apis.html#take-either-bytes-or-unicode-output-the-same-type
>
> I'd love to add another strategy there that shows how you can robustly
> operate on bytes without knowing the encoding but from writing that, I think
> that anytime you simplify your API you have to accept limitations on the
> data you can take in.  (For instance, some simplifications can handle
> anything except ASCII-incompatible encodings).

In all cases I can imagine where such polymorphic functions make
sense, the necessary and sufficient assumption should be that the
encoding is a superset of 7-bit(*) ASCII. This includes UTF-8, all
Latin-N variants, and AFAIK also the popular CJK encodings other than
UTF-16. This is the same assumption made by Python's bytes type when
you use "character-based" methods like lower().
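
A minimal sketch of what that assumption buys you, in Python 3. The
helper strip_scheme() is hypothetical, made up for illustration and not
part of the stdlib. It takes str or bytes and returns the same type, and
the bytes path relies only on the data being in some ASCII superset:

```python
def strip_scheme(url):
    """Hypothetical helper: drop a leading 'http://' from str or bytes.

    The bytes branch assumes the data is in an ASCII-superset encoding
    (UTF-8, Latin-N, ...); it never needs to know which one.
    """
    # Pick a literal of the same type as the input.
    if isinstance(url, bytes):
        prefix = b"http://"
    else:
        prefix = "http://"
    # On bytes, lower() only maps the ASCII letters A-Z; every byte
    # >= 0x80 passes through untouched.
    if url.lower().startswith(prefix):
        return url[len(prefix):]
    return url


e_acute = "\u00c9"  # LATIN CAPITAL LETTER E WITH ACUTE
utf8_url = ("HTTP://example.com/" + e_acute).encode("utf-8")
latin1_url = ("HTTP://example.com/" + e_acute).encode("latin-1")
utf16_url = "http://example.com/".encode("utf-16-le")

print(strip_scheme("HTTP://example.com/"))   # str in, str out
print(strip_scheme(utf8_url))                # bytes in, bytes out
print(strip_scheme(latin1_url))              # works without naming the encoding
print(strip_scheme(utf16_url) == utf16_url)  # True: prefix never matches
```

The UTF-16 case is left unchanged because every ASCII character is
followed by a NUL byte there, so the b"http://" prefix never matches.
That is the limitation you accept in exchange for the simpler API.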

--Guido

__________
(*) In my mind ASCII and 7-bit are synonymous, but unfortunately there
are droves of naive users who believe that ASCII includes all 256
possible 8-bit bytes using some encoding -- typically the default
encoding of their DOS or Windows box. :-(

-- 
--Guido van Rossum (python.org/~guido)