PEP263 (Specifying encoding) and bytecode strings

Mike C. Fletcher mcfletch at rogers.com
Mon May 5 16:28:24 EDT 2003


Terry Reedy wrote:

>I am a little puzzled by some of the questions and comments in this
>thread.  Am I missing something?
>
Probably the purpose of resource-package :) , namely automatically 
embedding sets of binary resources in Python source-code files.

>"Tony Meyer" <ta-meyer at ihug.co.nz> wrote in message
>news:mailman.1052120948.3150.python-list at python.org...
>>>>Is there some way to specify that all strings are
>>>>bytecodes, and not encoded characters?
>
>The value of a Python string object *is* a sequence of bytecodes.
>Character encoding is in the eye of the interpreter/user of a string.
>
Sure, until someone (with the best of intentions) decides some day that 
all strings in the interpreter are Unicode (such things happen, and I'm 
pretty sure I've heard rumblings from very deep in the hierarchy along 
this path), and there will be a separate "buffer" type for byte-streams. 
When/if that happens, your source-file says that your binary data is 
latin-1-encoded Unicode data, which makes the binary data gibberish when 
the Unicode hits the fan.  In essence, by declaring the data as 
"latin-1", you're encoding garbage in the file so that future versions 
of Python won't be able to recognise that the data is actually a 
byte-stream.
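
To make the worry concrete, here is a minimal sketch written for a 
present-day Python 3 interpreter, where str is already Unicode and bytes 
is the separate byte-stream type.  The payload is just a stand-in for an 
embedded resource, not anything resource-package actually produces. 
latin-1 maps every byte value to the code point with the same number, so 
the round trip is lossless, but the intermediate text is meaningless and 
any other codec turns it into different bytes:

```python
import zlib

# Stand-in for an embedded binary resource (hypothetical payload).
payload = zlib.compress(b"example resource data" * 4)

# latin-1 maps byte value N to code point N, so decoding never fails
# and re-encoding with latin-1 gives the original bytes back.
as_text = payload.decode("latin-1")
round_tripped = as_text.encode("latin-1")

# The same text written out with another codec is no longer the payload:
# every code point >= 0x80 becomes two bytes in UTF-8.
as_utf8 = as_text.encode("utf-8")

print(round_tripped == payload)      # lossless only with latin-1
print(len(payload), len(as_utf8))    # the UTF-8 form is longer
```

The data only survives as long as everything that touches the file keeps 
agreeing to pretend it is latin-1 text.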

What I'd like from PEP 263 is a way to make the declaration "this file 
has no encoding" or "these strings are byte-sequences, *not* Unicode 
data encoded with some particular encoding".  Using a particular 1-byte 
encoding is fine for now, but you're encoding erroneous information in 
the file, which is not the best design practice.  Given that PEP 263 
already requires one rewrite of all old software to support itself, 
making another rewrite necessary some day (to change that declaration 
again) seems somewhat... intrusive... especially if the goal is to 
maintain customer confidence in Python's stability.
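
For comparison only: later Python 3 releases have a bytes literal 
(b'...') whose value does not depend on the coding declaration, because 
the literal may contain only ASCII characters and escapes.  The sketch 
below assumes such an interpreter; load_blob and the four-byte blob are 
made up for illustration.  It compiles the same source under two 
different declarations and gets identical bytes:

```python
def load_blob(coding: str) -> bytes:
    """Compile a tiny module with the given PEP 263 declaration and
    return the value of its bytes literal (hypothetical helper)."""
    source = (
        "# -*- coding: %s -*-\n" % coding
        + "blob = b'\\x78\\x9c\\x00\\xff'\n"
    ).encode("ascii")  # the source file itself is pure ASCII
    namespace = {}
    exec(compile(source, "<resource>", "exec"), namespace)
    return namespace["blob"]

latin1_blob = load_blob("latin-1")
utf8_blob = load_blob("utf-8")

print(latin1_blob == utf8_blob)  # the declaration does not change the bytes
```

Note that this relies on the escaped \xXX form, so it says "these are 
bytes" at the cost of the larger source file.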

>>I probably phrased my question poorly: what, then, is the correct
>>encoding for the output of zlib.compress()?  I know IANA has a list [1]
>>of encodings, but it's not really clear which is the right one.
>
>I think you are asking for *the* 'correct' fake declaration.  If there
>is not yet a way to say encoding = None or encoding = bytes, then any
>one that works should be ok.
>
Except for the GIGO principle, sure ;) .  PEP 263 just has this annoying 
habit of violating every aesthetic sense I have :) , from the inclusion 
of semantics in comments, to breaking old code, to requiring that all 
strings be converted to Unicode and back again during the parsing 
process, to the inability to specify a NULL/RAW encoding.  I know it's a 
messy problem, but eek what a solution :) !

>Imitating the Python interpretation of source code so as to see '\xXX'
>as one byte rather than a quoted string of four Ascii chars, correct.
>So?  This should only matter if you are putting compress() output or
>decompress() input into source code, such as for testing each function
>separately.  
>
Which is exactly what resource-package does (though for portable access 
to the files during package embedding/deployment, not necessarily 
testing).  For now I guess we'll use the latin-1 hack, but honestly, 
that kind of kludge is not something I'm happy about including in lots 
of people's files: every user of resource-package will likely need to 
run an upgrade script *on every embedded resource file* at some point in 
the future, which is no great joy to me.  The alternative, escaping the 
high bytes (a ~2x size explosion, since roughly 1/2 of the bytes become 
4 bytes each), really isn't that much better, and still requires 
rewrites when said Unicode hits said fan.
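
A back-of-the-envelope check of that size estimate, as a sketch only: it 
assumes that just the bytes >= 0x80 are written as 4-character \xXX 
escapes and everything else stays one character.  A real escaper would 
also have to handle control characters, quotes and backslashes, so this 
is a lower bound.  escaped_length is a made-up helper name:

```python
def escaped_length(data: bytes) -> int:
    """Characters needed if every byte >= 0x80 becomes a 4-character
    \\xXX escape and every other byte is written as-is (simplified)."""
    return sum(4 if b >= 0x80 else 1 for b in data)

# Every byte value once: half are below 0x80, half are at or above it.
sample = bytes(range(256))
ratio = escaped_length(sample) / len(sample)

print(escaped_length(sample), len(sample), ratio)
```

With exactly half the bytes escaped the ratio comes out at 2.5, so 
compressed data (which looks close to uniformly random) lands in the 
same neighbourhood as the rough figure above, or a bit worse.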

Sigh,
Mike

_______________________________________
  Mike C. Fletcher
  Designer, VR Plumber, Coder
  http://members.rogers.com/mcfletch/