different encodings for unicode() and u''.encode(), bug?

mario mario at ruggier.org
Sat Jan 12 03:58:42 EST 2008


On Jan 4, 12:02 am, John Machin <sjmac... at lexicon.net> wrote:
> On Jan 4, 8:03 am, mario <ma... at ruggier.org> wrote:
> > On Jan 2, 2:25 pm, Piet van Oostrum <p... at cs.uu.nl> wrote:
>
> > > Apparently for the empty string the encoding is irrelevant as it will not
> > > be used. I guess there is an early check for this special case in the code.
>
> > In the module I am working on [*] I am remembering a failed encoding
> > to allow me, if necessary, to later re-process fewer encodings.
>
> If you were in fact doing that, you would not have had a problem. What
> you appear to have been doing is (a) remembering a NON-failing
> encoding, and assuming that it would continue not to fail

Yes, exactly. But it makes no difference which ones I remember, since the
two subsets always add up to the same thing. In this special case (empty
string!) the unicode() call does not fail...
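
To illustrate the special case (a rough Python 2 sketch, with a
deliberately non-existent codec name):

    s = ''

    # unicode() short-circuits on an empty string and never looks up
    # the codec, so this "succeeds" even with a bogus encoding name:
    print repr(unicode(s, 'no-such-codec'))    # u''

    # str.decode() always performs the codec lookup first, so the
    # equivalent call fails immediately:
    try:
        s.decode('no-such-codec')
    except LookupError, e:
        print e                                # unknown encoding: no-such-codec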

> (b) not
> differentiating between failure reasons (codec doesn't exist, input
> not consistent with specified encoding).

There is no failure in the first pass in this case... If I do as you
suggest further down, that is, use s.decode(encoding) instead of
unicode(s, encoding) to force the codec lookup, then I could remember the
failure reason and use it to decide how to proceed. However, I am aiming
at an automatic decision, so an in-context error message would need to be
replaced with more rigorous information about how the guessing should
proceed. I am also trying to keep this simple ;)
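
Something like this would be enough to keep the failure reasons apart
(a minimal Python 2 sketch, not the decodeh code; the helper name and
return convention are made up for illustration):

    def try_decode(s, encoding):
        """Return (unicode_result_or_None, failure_reason_or_None)."""
        try:
            return s.decode(encoding), None
        except LookupError:
            # the codec does not exist -- no point ever retrying it
            return None, 'unknown codec'
        except UnicodeDecodeError:
            # the codec exists, but the bytes do not fit this encoding
            return None, 'invalid bytes'

    print try_decode('\xff\xfe', 'no-such-codec')   # (None, 'unknown codec')
    print try_decode('\xff\xfe', 'utf-8')           # (None, 'invalid bytes')
    print try_decode('abc', 'utf-8')                # (u'abc', None)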

<snip>

> In any case, a pointless question (IMHO); the behaviour is extremely
> unlikely to change, as the chance of breaking existing code outvotes
> any desire to clean up a minor inconsistency that is easily worked
> around.

Yes, I would agree. The workaround may not even be worth it, though: what
I really want is a unicode object, so switching from unicode() to
s.decode() is not quite right and would in any case require a further
check. Less clear code, and a small unnecessary performance hit for the
99.9% majority of cases... Anyhow, I have further improved the
"post guess" checking/refining logic of the algorithm [*].

What I'd like to understand better is the "compatibility hierarchy" of
known encodings: in the positive sense that if a string decodes
successfully with encoding A, then it may also decode with encodings B
and C; and in the negative sense that if a string fails to decode with
encoding A, then it will certainly also fail to decode with encodings B
and C. Any idea whether such an analysis of the relationships between
encodings exists? (A toy illustration of the kind of relationship I mean
follows below.)
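
A toy Python 2 illustration with just three encodings (not meant as the
decodeh logic, only as an example of the relationships in question):

    samples = ['abc', 'caf\xc3\xa9', '\xff\xfe\x00']

    for s in samples:
        ok = []
        for enc in ('ascii', 'utf-8', 'iso-8859-1'):
            try:
                s.decode(enc)
                ok.append(enc)
            except UnicodeDecodeError:
                pass
        print repr(s), '->', ok

    # 'abc'          -> ['ascii', 'utf-8', 'iso-8859-1']
    # 'caf\xc3\xa9'  -> ['utf-8', 'iso-8859-1']
    # '\xff\xfe\x00' -> ['iso-8859-1']
    #
    # i.e. ASCII success implies UTF-8 and Latin-1 success, and Latin-1
    # (which maps every byte) never fails at all.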

Thanks! mario

[*] http://gizmojo.org/code/decodeh/
