[Python-3000] PEP 3138- String representation in Python 3000

Thu May 15 12:34:32 CEST 2008

Greg Ewing wrote:
> Stephen J. Turnbull wrote:
>> This discussion isn't about whether it could be done or not, it's
>> about where people expect to find such functionality.  Personally, if
>> I can find .encode('euc-jp') on a string object, I would expect to
>> find .encode('gzip') on a bytes object, too.
> 
> What I'm not seeing is a clear rationale on where you
> draw the line. Out of all the possible transformations
> between a string and some other kind of data, which
> ones deserve to be available via this rather strange
> and special interface, and why?
> 

Where this kind of unified interface to binary and character transforms 
is incredibly handy is in a stacking IO model like the one used in Py3k. 
For example, suppose you're using a compressed XML stream to communicate 
over a network socket. What this approach allows you to do is have 
generic 'transformation' layers in your IO stack, so you can just build 
up your IO stack as something like:

XMLParserIO('myschema')
BufferedTextIO('utf-8')
BytesTransform('gzip')
RawSocketIO

To change to a different compression mechanism (e.g. bz2), you just 
change the codec used by the BytesTransform layer from 'gzip' to 'bz2'.
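
As a rough illustration of that swap, here is a minimal sketch using only the standard codecs module. The send() and receive() helpers are hypothetical stand-ins for the layers listed above, an io.BytesIO object stands in for the socket, and 'zlib_codec' is used in place of 'gzip' because the standard library does not ship a codec under that name.

```python
import codecs
import io


def send(text, compression, raw):
    """Push text down the stack: str -> UTF-8 bytes -> compressed bytes -> raw stream."""
    encoded = text.encode("utf-8")                   # text layer
    raw.write(codecs.encode(encoded, compression))   # bytes transform layer


def receive(raw, compression):
    """Pull data back up the stack in the reverse order."""
    packed = raw.getvalue()
    return codecs.decode(packed, compression).decode("utf-8")


message = "<greeting>h\u00e9llo</greeting>"

# Swapping the compression mechanism only changes the codec name.
results = {}
for codec_name in ("zlib_codec", "bz2_codec"):
    wire = io.BytesIO()  # stand-in for the raw socket
    send(message, codec_name, wire)
    results[codec_name] = receive(wire, codec_name)
```

Only the codec name differs between the two runs; the rest of the stack is untouched.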

As for how you choose what to provide as codecs... well, that's a major 
reason why the codec registry is extensible. The answer is that any 
binary or character transform which is useful to the application 
programmer can be accessed via the codec API - the only question will be 
whether the application programmer will have to write the codec 
themselves, or will find it already provided in the standard library.
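
A sketch of what writing the codec yourself might look like, assuming the codecs.register() and codecs.CodecInfo API of current CPython. The codec name 'x_gzip' and the helper functions are made up for this example; they are not part of the standard library.

```python
import codecs
import gzip


def _gzip_encode(data, errors="strict"):
    # Codec functions return (output, number of input items consumed).
    return gzip.compress(bytes(data)), len(data)


def _gzip_decode(data, errors="strict"):
    return gzip.decompress(bytes(data)), len(data)


def _search(name):
    # The registry calls each search function with the normalised codec name.
    if name == "x_gzip":  # hypothetical name, chosen for this sketch
        return codecs.CodecInfo(
            name="x_gzip", encode=_gzip_encode, decode=_gzip_decode
        )
    return None


codecs.register(_search)

payload = b"<doc>hello</doc>" * 10
packed = codecs.encode(payload, "x_gzip")
unpacked = codecs.decode(packed, "x_gzip")
```

Once registered, the codec is looked up by name like any built-in one.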

Cheers,
Nick.

P.S. My original tangential response that didn't actually answer your 
question, but may still be useful to some folks:

An actual codec that encodes a character string to a byte sequence, and 
decodes a byte sequence back to a character string would be invoked via 
the str.encode() and bytes.decode() methods. For example, 
mystr.encode('utf-8') to serialise a string using UTF-8, 
mybytes.decode('utf-8') to read it back.
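
For instance (the sample string is arbitrary):

```python
mystr = "na\u00efve caf\u00e9"        # a str containing two non-ASCII characters

mybytes = mystr.encode("utf-8")       # str -> bytes
roundtrip = mybytes.decode("utf-8")   # bytes -> str
```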

A text transform that converts a character string to a different 
character string would be invoked via the str.transform() and 
str.untransform() methods. For example, 
mystr.transform('unicode-escape') to convert Unicode characters to their 
\u or \U equivalents, mystr.untransform('unicode-escape') to convert 
them back to the actual Unicode characters.
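
str.transform() and str.untransform() are the methods proposed here and are not available on str in released Python, so the following sketch approximates the same round trip with the existing 'unicode_escape' codec. The two helper functions are hypothetical names for this example, and the untransform helper assumes its input is pure ASCII.

```python
def transform_unicode_escape(text):
    """str -> str: replace non-ASCII characters with backslash escapes."""
    return text.encode("unicode_escape").decode("ascii")


def untransform_unicode_escape(text):
    """str -> str: turn backslash escapes back into the actual characters."""
    return text.encode("ascii").decode("unicode_escape")


original = "snow \u2603 and \U0001F600"
escaped = transform_unicode_escape(original)
restored = untransform_unicode_escape(escaped)
```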

A binary transform that converts a byte sequence to a different byte 
sequence would be invoked via the bytes.transform() and 
bytes.untransform() methods. For example, mybytes.transform('gzip') to 
compress a byte sequence, mybytes.untransform('gzip') to decompress it.
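
bytes.transform() and bytes.untransform() are likewise the proposed spelling rather than an existing API. A minimal sketch of the same idea, assuming the bytes-to-bytes codecs that current CPython exposes through codecs.encode() and codecs.decode(); the transform() and untransform() functions are hypothetical wrappers, and 'zlib_codec' stands in for 'gzip'.

```python
import codecs


def transform(data, name):
    """bytes -> bytes via a named binary codec."""
    return codecs.encode(data, name)


def untransform(data, name):
    """Reverse of transform() for the same codec name."""
    return codecs.decode(data, name)


mybytes = b"spam and eggs " * 50

packed = transform(mybytes, "zlib_codec")
unpacked = untransform(packed, "zlib_codec")

# The same wrappers work for any other bytes-to-bytes codec.
hexed = transform(b"hi", "hex_codec")
```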

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://www.boredomandlaziness.org