[Python-Dev] accept string in a2b and base64?

Tue Feb 21 03:51:08 CET 2012

On Tue, Feb 21, 2012 at 11:24 AM, R. David Murray <rdmurray at bitdance.com> wrote:
> If most people agree with Antoine I won't fight it, but it seems to me
> that accepting unicode in the binascii and base64 APIs is a bad idea.

I see it as essentially the same as the changes I made in
urllib.parse to support pure ASCII bytes->bytes in many of the APIs
(which work by doing an implicit ascii+strict decode at the beginning
of the function, and then reversing that at the end). For those, if
your byte sequence has non-ASCII data in it, they'll throw a
UnicodeDecodeError and it's up to you to figure out where those
non-ASCII bytes are coming from. Similarly, if one of these updated
APIs throws ValueError, then you'll have to figure out where the
non-ASCII code points are coming from.
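
As a rough sketch of that decode-then-reverse pattern (not the actual
urllib.parse code; the names _coerce_ascii and strip_fragment are
made up for illustration), something along these lines:

```python
def _coerce_ascii(arg):
    """Return (text, restore), where restore() maps a str result back
    to the type of the original argument. Hypothetical helper."""
    if isinstance(arg, str):
        # Already text: nothing to undo at the end.
        return arg, lambda s: s
    # Bytes in: strict ASCII decode, so non-ASCII data raises
    # UnicodeDecodeError right here instead of being guessed at.
    return arg.decode("ascii", "strict"), lambda s: s.encode("ascii", "strict")


def strip_fragment(url):
    """Toy API: drop everything from '#' onwards, str->str or bytes->bytes."""
    text, restore = _coerce_ascii(url)
    base, _, _ = text.partition("#")
    return restore(base)
```

With this, strip_fragment("http://a/b#c") gives back a str and
strip_fragment(b"http://a/b#c") gives back bytes, while bytes
containing non-ASCII data fail with UnicodeDecodeError at the
decoding step.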

Yes, it's a niggling irritation from a purist point of view, but it's
also an acknowledgement of the fact that whether a pure ASCII sequence
should be treated as a sequence of bytes or a sequence of code points
is going to be application and context dependent. Sometimes it will
make more sense to treat it as binary data, other times as text.

The key point is that any multimode support that depends on implicit
type conversion from bytes->str (or vice-versa) really needs to be
limited to *strict* ASCII only (if no other information on the
encoding is available). If something is pure 7-bit ASCII, then odds
are very good that it really *is* ASCII text. As soon as that
high-order bit gets set though, all bets are off and we have to push
the text encoding problem back on the API caller to figure out.
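
For the str->bytes direction, a minimal sketch of what "strict ASCII
only" could look like in a decoding helper. The wrapper name
b64decode_any and the error message are my own assumptions, not the
patch under discussion:

```python
import base64


def b64decode_any(data):
    """Accept bytes or pure-ASCII str; return the decoded bytes.
    Hypothetical wrapper around base64.b64decode."""
    if isinstance(data, str):
        try:
            # Strict ASCII only: no guessing at any other encoding.
            data = data.encode("ascii")
        except UnicodeEncodeError:
            raise ValueError(
                "non-ASCII str passed where ASCII was expected") from None
    return base64.b64decode(data)
```

Both b64decode_any("aGVsbG8=") and b64decode_any(b"aGVsbG8=") decode
the same payload, while a str containing any non-ASCII code point is
rejected with ValueError and the caller has to sort out the encoding.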

The reason Python 2's implicit str<->unicode conversions are so
problematic isn't just because they're implicit: it's because they
effectively assume *latin-1* as the encoding on the 8-bit str side.
That means reliance on implicit decoding can silently corrupt
non-ASCII data instead of triggering exceptions at the point of
implicit conversion. If you're lucky, some *other* part of the
application will detect the corruption and you'll have at least a
vague hope of tracking it down. Otherwise, the corrupted data may
escape the application and you'll have an even *thornier* debugging
problem on your hands.
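
To illustrate the failure mode (this is Python 3 code showing the
general hazard of a latin-1 assumption, not a reproduction of Python
2's behaviour), compare a latin-1 decode of UTF-8 data with a strict
ASCII decode:

```python
# UTF-8 encoded bytes for "cafe" with an acute accent on the final "e".
raw = "caf\u00e9".encode("utf-8")

# Every byte value is valid latin-1, so this never fails: it just
# silently produces the wrong text (mojibake).
mangled = raw.decode("latin-1")

# A strict ASCII decode refuses to guess and fails at the point of
# conversion instead.
try:
    raw.decode("ascii", "strict")
except UnicodeDecodeError as exc:
    strict_error = exc
else:
    strict_error = None
```

Here mangled is a four-letter word that has quietly become five
characters, and nothing complains until some later code notices.
The strict decode raises immediately instead.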

My one concern with the base64 patch is that it doesn't test that
mixing types triggers TypeError. While this shouldn't require any
extra code (the error should arise naturally from the method
implementation), it should still be tested explicitly to ensure type
mismatches fail as expected. Checking explicitly for mismatches in the
code would then just be a matter of wanting to emit nice error
messages explaining the problem rather than being needed for
correctness reasons (e.g. urllib.parse uses pre-checks in order to emit a
clear error message for type mismatches, but it has significantly
longer function signatures to deal with).
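
The kind of explicit test I mean would look something like the
following. It is only a sketch: I'm using urllib.parse.urljoin as a
stand-in since it is an existing API that rejects mixed str/bytes
arguments, and the test class name is made up. The base64 tests would
exercise whichever of its functions take more than one string-like
argument.

```python
import unittest
from urllib.parse import urljoin


class MixedTypeTests(unittest.TestCase):
    """Check that mixing str and bytes arguments fails loudly."""

    def test_str_base_bytes_url(self):
        with self.assertRaises(TypeError):
            urljoin("http://example.com/", b"path")

    def test_bytes_base_str_url(self):
        with self.assertRaises(TypeError):
            urljoin(b"http://example.com/", "path")
```

The tests don't care whether the TypeError comes from a pre-check or
falls out of the implementation naturally, which is the point: they
pin down the behaviour either way.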

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia