[Python-Dev] bytes.from_hex()

Josiah Carlson jcarlson at uci.edu
Sun Feb 19 02:26:49 CET 2006


Ron Adam <rrr at ronadam.com> wrote:
> Josiah Carlson wrote:
> > Ron Adam <rrr at ronadam.com> wrote:
> >> Josiah Carlson wrote:
> > [snip]
> >>> Again, the problem is ambiguity; what does bytes.recode(something) mean?
> >>> Are we encoding _to_ something, or are we decoding _from_ something? 
> >> This was just an example of one way that might work, but here are my 
> >> thoughts on why I think it might be good.
> >>
> >> In this case, the ambiguity is reduced as far as the encoding and 
> >> decoding operations are concerned.
> >>
> >>       somestring = encodings.tostr( someunicodestr, 'latin-1')
> >>
> >> It's pretty clear what is happening to me.
> >>
> >>      It will encode to a string an object, named someunicodestr, with 
> >> the 'latin-1' encoder.
> > 
> > But now how do you get it back?  encodings.tounicode(..., 'latin-1')?,
> > unicode(..., 'latin-1')?
> 
> Yes, just do:
> 
>       someunicodestr = encodings.tounicode( somestring, 'latin-1')
> 
> > What about string transformations:
> >     somestring = encodings.tostr(somestr, 'base64')
> >
> > How do we get that back?  encodings.tostr() again is completely
> > ambiguous, str(somestring, 'base64') seems a bit awkward (switching
> > namespaces)?
> 
> In the case where a string is converted to another string, it would 
> probably be best to have a requirement that they all get converted to 
> unicode as an intermediate step.  By doing that it becomes an explicit 
> two step operation.
> 
>      # string to string encoding
>      u_string = encodings.tounicode(s_string, 'base64')
>      s2_string = encodings.tostr(u_string, 'base64')

Except that ambiguates it even further.

Is encodings.tounicode() encoding, or decoding?  According to everything
you have said so far, it would be decoding.  But if I am decoding binary
data, why should it be spending any time as a unicode string?  What do I
mean?

    x = f.read()  # x contains base-64 encoded binary data
    y = encodings.tounicode(x, 'base64')

y now contains BINARY DATA, except that it is a unicode string.

    z = encodings.tostr(y, 'latin-1')

Later you define a str_to_str function, which I (or someone else) would
use like:

    z = str_to_str(x, 'base64', 'latin-1')

But the trick is that I don't want some unicode string encoded in
latin-1, I want my binary data unencoded.  They may happen to be the
same in this particular example, but that doesn't mean that it makes any
sense to the user.
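
To make that concrete, here is a minimal sketch in a later Python (3.x,
where bytes and str are separate types; that split postdates this
thread).  The payload is made up, and since encodings.tounicode() and
encodings.tostr() are only proposed names, the standard base64 module
and bytes.decode()/str.encode() stand in for them here.

```python
import base64

# Hypothetical payload: arbitrary binary data, not text in any charset.
raw = bytes([0x00, 0xFF, 0x10, 0x80])
x = base64.b64encode(raw)  # stands in for what f.read() returned

# Direct route: base64 decoding hands back binary data immediately.
direct = base64.b64decode(x)

# Two-step route: force the binary data through a text type first.
y = base64.b64decode(x).decode('latin-1')  # text object holding binary data
z = y.encode('latin-1')                    # back to bytes

print(direct == raw, z == raw, type(y).__name__)
```

Both routes recover the same bytes only because latin-1 maps every byte
value to a code point; the intermediate text object never meant
anything as text.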

[...]

> >>> What about .reencode and .redecode?  It seems as
> >>> though the 're' added as a prefix to .encode and .decode makes it
> >>> clearer that you get the same type back as you put in, and it is also
> >>> unambiguous to direction.
> 
> ...
> 
> > I must not be expressing myself very well.
> >
> > Right now:
> >     s.encode() -> s
> >     s.decode() -> s, u
> >     u.encode() -> s, u
> >     u.decode() -> u
> > 
> > Martin et al's desired change to encode/decode:
> >     s.decode() -> u
> >     u.encode() -> s
> >
> > No others.
> 
> Which would be similar to the functions I suggested.  The main 
> difference is only whether it would be better to have them as methods or 
> separate factory functions and the spelling of the names.  Both have 
> their advantages I think.

While others would disagree, I personally am not a fan of to* or from*
style namings, for either function names (especially in the encodings
module) or methods.  Just a personal preference.

Of course, I don't find the current situation regarding
str/unicode.encode/decode to be confusing either, but maybe it's because
my unicode experience is strictly within the realm of GUI widgets, where
compartmentalization can be easier.


> >> The method bytes.recode(), always does a byte transformation which can 
> >> be almost anything.  It's the context bytes.recode() is used in that 
> >> determines what's happening.  In the above cases, it's using an encoding 
> >> transformation, so what it's doing is precisely what you would expect by 
> >> its context.

[THIS IS THE AMBIGUITY]
> > Indeed, there is a translation going on, but it is not clear as to
> > whether you are encoding _to_ something or _from_ something.  What does
> > s.recode('base64') mean?  Are you encoding _to_ base64 or _from_ base64? 
> > That's where the ambiguity lies.
> 
> Bengt didn't propose adding .recode() to the string types, but only the 
> bytes type.  The byte type would "recode" the bytes using a specific 
> transformation.  I believe his view is it's a lower level API than 
> strings that can be used to implement the higher level encoding API 
> with, not replace the encoding API.  Or that is the way I interpreted 
> the suggestion.

But again, what would the transformation be?  To something?  From
something?  'to_base64', 'from_base64', 'to_rot13' (which happens to be
identical to) 'from_rot13', ...  Saying it would "recode ... using a
specific transformation" is a cop-out; what would the translation be? 
How would it work?  How would it be spelled?
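
A small sketch of why direction matters for some transformations and
not others, written against a later Python (3.x) standard library
rather than anything proposed in this thread.  The sample values are
made up.

```python
import base64
import codecs

text = "Hello"
# rot13 is its own inverse, so "to" and "from" are the same operation.
rot_once = codecs.encode(text, "rot_13")
rot_twice = codecs.encode(rot_once, "rot_13")

data = b"Hello"
# base64 is not: applying the encoder a second time just encodes again.
b64_once = base64.b64encode(data)
b64_twice = base64.b64encode(b64_once)
# Only the decoder, i.e. the opposite direction, recovers the input.
b64_back = base64.b64decode(b64_once)

print(rot_twice == text, b64_twice == data, b64_back == data)
```

With a single recode('base64') there is no way to tell which of the two
base64 operations is being requested.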

That smells quite a bit like .encode() and .decode(), just spelled
differently, and without quite a clear path.  That is why I was offering...

> > >     s.reencode() -> s (you get encoded strings as strings)
> > >     s.redecode() -> s (you get decoded strings as strings)
> > >     u.reencode() -> u (you get encoded unicode as unicode)
> > >     u.redecode() -> u (you get decoded unicode as unicode)

You keep the encode and decode to be translating between types, you use
reencode and redecode to keep the type, and define whether you are
encoding or decoding your data/text.

While I have come to agree with Terry Reedy regarding the 're' prefix on
the 'encode' and 'decode', I think that having the name of the method
define the action and the argument of the method define the codec, is
the way to go (essentially the status quo).  It may make sense to
differentiate the cases of what an encoding/decoding process may return
(types change, types stay the same), but we then have a naming issue. 
So far, I've not seen _really_ good names for describing the
encoding/decoding process, except for what we already have: encode and
decode.

What if instead of using encode/decode for the following
transformations:

> > Martin et al's desired change to encode/decode:
> >     s.decode() -> u
> >     u.encode() -> s

We use some method name for inter-type transformations:
    s.transform() -> u
    u.transform() -> s

... or something better than 'transform', then we use the
.encode()/.decode() for intra-type transformations...

    s.encode() -> s (you get encoded strings as strings)
    s.decode() -> s (you get decoded strings as strings)
    u.encode() -> u (you get encoded unicode as unicode)
    u.decode() -> u (you get decoded unicode as unicode)

Probably DOA, but just a thought.
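
For what it's worth, a rough sketch of that split as free functions, in
a later Python (3.x) where bytes plays the role of s and str the role
of u.  The names transform(), encode(), decode() and the helper
_same_type() are hypothetical, taken from or invented for the idea
above; only the codecs module calls are real.

```python
import codecs

def transform(obj, charset):
    """Inter-type conversion: bytes -> str or str -> bytes."""
    if isinstance(obj, bytes):
        return obj.decode(charset)
    return obj.encode(charset)

def _same_type(func, obj, codec):
    """Run a codec and insist the result has the input's type."""
    result = func(obj, codec)
    if type(result) is not type(obj):
        raise TypeError("%r is not an intra-type codec for %s"
                        % (codec, type(obj).__name__))
    return result

def encode(obj, codec):
    """Intra-type encoding: bytes -> bytes or str -> str."""
    return _same_type(codecs.encode, obj, codec)

def decode(obj, codec):
    """Intra-type decoding: bytes -> bytes or str -> str."""
    return _same_type(codecs.decode, obj, codec)

text = transform(b"caf\xe9", "latin-1")   # bytes -> str
packed = encode(b"caf\xe9", "base64")     # bytes -> bytes
unpacked = decode(packed, "base64")       # bytes -> bytes
rotated = encode("abc", "rot_13")         # str -> str
print(repr(text), packed, unpacked, rotated)
```

Under this sketch, encode("abc", "latin-1") raises TypeError, because a
character set conversion changes the type and so belongs to
transform().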

> >> There isn't a bytes.decode(), since that's just another transformation. 
> >> So only the one method is needed.  Which makes it easier to learn.
> > 
> > But ambiguous.
> 
> What's ambiguous about it?

See the section above that I marked "[THIS IS THE AMBIGUITY]".

> It's no more ambiguous than any math 
> operation where you can do it one way with one operation and get your 
> original value back with the same operation by using an inverse value.
> 
>     n2=n+1; n3=n+(-1); n==n3
>     n2=n*2; n3=n*(.5); n==n3

Ahh, so you are saying 'to_base64' and 'from_base64'.  There is one
major reason why I don't like that kind of a system: I can't just say
encoding='base64' and use str.encode(encoding) and str.decode(encoding),
I necessarily have to use str.recode('to_'+encoding) and
str.recode('from_'+encoding).  Seems a bit awkward.
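
A sketch of the difference, assuming a later Python (3.x).  recode()
and its table of 'to_'/'from_' names are hypothetical stand-ins for the
proposed single method; codecs.encode() and codecs.decode() stand in
for str.encode(encoding) and str.decode(encoding).

```python
import base64
import binascii
import codecs

# Hypothetical single-method API: the direction is baked into the name.
_TRANSFORMS = {
    "to_base64": base64.b64encode,
    "from_base64": base64.b64decode,
    "to_hex": binascii.hexlify,
    "from_hex": binascii.unhexlify,
}

def recode(data, transformation):
    return _TRANSFORMS[transformation](data)

def roundtrip_recode(data, encoding):
    # The caller has to assemble the transformation names by hand.
    packed = recode(data, "to_" + encoding)
    return recode(packed, "from_" + encoding)

def roundtrip_codec(data, encoding):
    # One codec name; the function called supplies the direction.
    packed = codecs.encode(data, encoding)
    return codecs.decode(packed, encoding)

payload = b"\x00\x01binary\xff"
for encoding in ("base64", "hex"):
    print(encoding,
          roundtrip_recode(payload, encoding) == payload,
          roundtrip_codec(payload, encoding) == payload)
```

Both round trips work; the point is only that the second one can take
encoding as a plain parameter without string pasting.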


 - Josiah


