PEP263 (Specifying encoding) and bytecode strings

Bengt Richter bokr at oz.net
Tue May 6 11:45:40 EDT 2003


On Mon, 5 May 2003 20:56:55 -0400, "Terry Reedy" <tjreedy at udel.edu> wrote:

>
>"Mike C. Fletcher" <mcfletch at rogers.com> wrote in message
>news:mailman.1052166614.10808.python-list at python.org...
>> Terry Reedy wrote:
>>
>> >I am a little puzzled by some of the questions and comments in this
>> >thread.  Am I missing something?
>> >
>> Probably the purpose of resource-package :) , namely automatically
>> embedding sets of binary resources in Python source-code files.
>
>You're right.  I've never done this (yet ;-) and did not think of it.
>
>> What I'd like from PEP 263 is a way to make the declaration "this
>> file has no encoding" or "these strings are byte-sequences, *not*
>> Unicode
>
>This is what I was asking about with
>
>> >.  If there is not yet a way to say encoding = None or encoding =
>> >bytes, ...
>
>> >So?  This should only matter if you are putting zlib.compress()
>> >output or decompress() input into source code, such as for testing
>> >each function separately.
>
>> Which is exactly what resource-package does (though for portable
>> access to the files during package embedding/deployment, not
>> necessarily testing).
>
>Now I understand better.  Using the Python interpreter beyond its
>defined limits is attractive but slightly dangerous.  Multiplying half
>the compressor output by 4 (for net expansion about 2) is certainly
>not inviting.  I see these possible courses of action.
>
>1. Get a null encoding option into PEP263 and its implementation.
>There should be a way to tell the interpreter to handle the sequence
>of bytes the same way it currently does after escape processing.
>
>2. Do nothing now and ignore the warnings, or adjust now to suppress
>them, and wait for future hammers to fall when they do.
>
>3. Write your own importer.  (One halfway tack would be to read in,
>expand 'illegal' bytes, and explicitly feed to eval or compile.)
>
>4. Investigate the new import-from-zipfile facility.  If you are zlib
>compressing large ascii texts, compress the entire file instead of
>just the quoted text.  If you have inherently binary data, expand the
>hi-bit bytes (expanding file by 2 if half are such) and, again,
>compress the whole file, not just the data strings.
>
How about an optional explicit __END__ by itself on a line to end program source
and the assumption of the declared encoding?

That way binary stuff can be appended. You could append multiple chunks
by prefixing each with a delimited ASCII integer giving its byte count
(you could do that in binary also, but it might be interesting to be able
to do an all-ASCII appendage). Alternatively, sections can carry
post-fixed lengths, which you read from the back of the file, seeking
backwards to find where each section begins. A [?] marker in place of a
length prefix delimits the start of the post-fixed-length sections.
E.g., three sections: one with a prefixed length, followed by a
two-section tail with post-fixed lengths:

__END__
[10]0123456789[?]012345[6]The End ;-)[11]
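
To make the layout concrete, here is a minimal sketch of a reader for
such a tail. The function name parse_tail is hypothetical, and the code
assumes that nothing (not even a newline) follows the last [n] marker
and that length markers are plain ASCII digits in square brackets.

```python
import re

# Matches a leading "[123]" length prefix or the "[?]" marker.
_PREFIX = re.compile(rb"\[(\d+|\?)\]")

def parse_tail(tail):
    """Split the bytes that follow the __END__ line into sections."""
    sections = []
    pos = 0
    # Forward pass: sections whose byte count comes first.
    while pos < len(tail):
        m = _PREFIX.match(tail, pos)
        if m is None:
            raise ValueError("expected [n] or [?] at offset %d" % pos)
        pos = m.end()
        if m.group(1) == b"?":
            break  # everything after [?] uses post-fixed lengths
        n = int(m.group(1))
        if pos + n > len(tail):
            raise ValueError("section runs past end of data")
        sections.append(tail[pos:pos + n])
        pos += n
    else:
        return sections  # no [?] marker: only prefixed sections

    # Backward pass: read each trailing [n], then step back n bytes.
    postfixed = []
    end = len(tail)
    while end > pos:
        if tail[end - 1:end] != b"]":
            raise ValueError("expected trailing [n] before offset %d" % end)
        bracket = tail.rfind(b"[", pos, end)
        if bracket < 0:
            raise ValueError("unterminated length marker")
        n = int(tail[bracket + 1:end - 1])
        start = bracket - n
        if start < pos:
            raise ValueError("section runs past the [?] marker")
        postfixed.append(tail[start:bracket])
        end = start
    postfixed.reverse()  # restore file order
    return sections + postfixed
```

Applied to the example line above, this yields the three sections
0123456789, 012345 and "The End ;-)" in file order. The backward search
for "[" is safe because the marker itself contains only digits.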

The easy way to append binary to a source would then be something like
(untested)

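    import os
    # Append the binary file as one post-fixed-length section.
    # Assumes the source file already ends with a newline.
    with open('the_binary', 'rb') as fb:
        b = fb.read()
    with open('the_source', 'ab') as fs:
        fs.write(b'__END__' + os.linesep.encode('ascii') + b'[?]')
        fs.write(b)
        fs.write(('[%d]' % len(b)).encode('ascii'))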
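
For several chunks, the tail could be assembled in memory first. This is
only a sketch of the layout described above; build_tail and its
parameter names are made up for illustration.

```python
def build_tail(prefixed=(), postfixed=()):
    """Return the bytes to write after the __END__ line.

    prefixed:  chunks written as [n] followed by the data
    postfixed: chunks written after a single [?] as data followed by [n]
    """
    def marker(data):
        # ASCII length marker such as b"[10]"
        return b"[" + str(len(data)).encode("ascii") + b"]"

    parts = [marker(data) + data for data in prefixed]
    if postfixed:
        parts.append(b"[?]")
        parts.extend(data + marker(data) for data in postfixed)
    return b"".join(parts)
```

One attraction of the post-fixed form is that a further section can be
added later by plain appending of the data and its [n], without
rewriting anything already in the file.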

For access, the whole binary tail after the __END__ line could be
made into a string bound to e.g., __MODULE_DATA__. You could parse
it yourself or make a utility function that would return a file-like
object for reading a selected section, e.g., f = moduledata(section_number).
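
A rough sketch of that access path, done outside the interpreter: find
the __END__ line in the raw file bytes, keep what follows, and hand out
sections as file-like objects. The name split_at_end_marker is
hypothetical, and moduledata here takes an already-parsed list of
sections as an extra argument, which is an assumption for the sake of a
self-contained example.

```python
import io

END_MARKER = b"__END__"

def split_at_end_marker(raw):
    """Return (source, tail); tail is None if there is no __END__ line.

    Caveat: a line consisting of __END__ inside a triple-quoted string
    would also match; a real tokenizer change would not have that issue.
    """
    pos = 0
    for line in raw.splitlines(keepends=True):
        if line.rstrip(b"\r\n") == END_MARKER:
            return raw[:pos], raw[pos + len(line):]
        pos += len(line)
    return raw, None

def moduledata(sections, section_number):
    """Return a read-only, file-like view of one section."""
    return io.BytesIO(sections[section_number])
```

For example, f = moduledata(sections, 0) followed by f.read(4) reads the
first four bytes of the first section.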

I guess you'd modify the tokenizer to recognize the __END__ line
and create a special string token from the rest of the bytes, that
could eventually get bound to __MODULE_DATA__.
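
Short of changing the tokenizer, the effect can be approximated by a
small loader that compiles only the part before the __END__ line and
places the rest in the module namespace. This is an emulation under
stated assumptions, not how the interpreter would do it: the function
name is hypothetical, and it assumes "\n" line endings and that __END__
is not the first line of the file.

```python
def load_with_module_data(raw, name="<module>"):
    """Execute the source part of raw; expose the tail as __MODULE_DATA__."""
    # Everything before the __END__ line is program source;
    # everything after it is opaque bytes.
    source, sep, tail = raw.partition(b"\n__END__\n")
    namespace = {
        "__name__": name,
        "__MODULE_DATA__": tail if sep else None,
    }
    # compile() accepts bytes and honours a PEP 263 coding line if present.
    code = compile(source + b"\n", name, "exec")
    exec(code, namespace)
    return namespace
```

The declared encoding then applies only to the bytes handed to
compile(), which is the point of the __END__ proposal.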

If the binary data really was e.g., a raw unicode file, you'd need to
apply appropriate decode/encode when you access a __MODULE_DATA__
section as a byte stream, depending on what you wanted to do.
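
As a small illustration of that last point, a section holding encoded
text can be wrapped so that reads return decoded characters. The helper
name is made up, and the caller has to know the encoding, since nothing
in the section format records it.

```python
import io

def open_section_as_text(section, encoding):
    """Wrap a bytes section in a text-mode, file-like reader."""
    return io.TextIOWrapper(io.BytesIO(section), encoding=encoding)

# Example: a section that happens to contain UTF-16 text.
section = "na\u00efve \u2603".encode("utf-16")
text = open_section_as_text(section, "utf-16").read()
```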

Regards,
Bengt Richter



