Comment on PEP 263 - Defining Python Source Code Encodings

Bengt Richter bokr at oz.net
Sun May 12 16:23:28 EDT 2002


On 11 May 2002 15:41:20 +0200, martin at v.loewis.de (Martin v. Loewis) wrote:
>Robin Becker <robin at jessikat.fsnet.co.uk> writes:
[...]
>
>> As for the PEP itself the only snag seems to me to be the BOM + comment
>> problem. If I change the BOM by hitting saveAs myWeirdEncoding the file
>> is a dead python unless I also change the comment (or is that an issue
>> only with utf8 at present?).
>
>I'm not sure I understand the problem. If you do saveAs
>myWeirdEncoding, there won't be a BOM in the file unless
>myWeirdEncoding is UTF-8. If there are multiple conflicting encoding
>specifications in a file, the file is in error.
>
I think Robin is alluding to something like the problem of an encoding-conversion
save-as (or export-filter) utility being fed a script that is in a given encoding
and contains a magic comment. If the utility is not magic-comment-syntax-aware, and
so cannot change the comment to reflect the new encoding, the result is a problem
to fix manually.
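To make the failure mode concrete, here is a rough sketch (untested; the
function name and details are mine, not anything proposed) of what a
magic-comment-aware converter would have to do:

    import codecs, re

    CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

    def reencode(path, old_enc, new_enc):
        f = codecs.open(path, "r", old_enc)
        lines = f.readlines()
        f.close()
        # the declaration may only appear on line one or two,
        # and must live in a comment
        for i in range(min(2, len(lines))):
            if lines[i].lstrip().startswith("#"):
                match = CODING_RE.search(lines[i])
                if match:
                    lines[i] = (lines[i][:match.start(1)] + new_enc
                                + lines[i][match.end(1):])
                    break
        f = codecs.open(path, "w", new_enc)
        f.writelines(lines)
        f.close()

Anything that rewrites source files (editors, export filters, etc.) would
need the same awareness, which is really the point.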

ISTM mixing meta-data and data in an ad-hoc way is not good, and encoding is
meta-data w.r.t. the file it describes. Even the UTF BOM is IMO really a compromise
forced by the fact that most file systems do not support user-extended meta-data storage.
Of course, file extensions have effectively become keys into a database of meta-data
for Windows files, but that is a kind of name-space abuse IMO, however expedient.

<tangential>
Of course, the UTF BOM can be looked on as a kind of packet header, and the rest
of the file as the packet body. If packet headers were standardized, this could
be a generic packet file and be decoded as such. Other packet headers would include
supporting meta-data and then be followed by the packet body. A file would be a
sequence of zero or more packets.

Perhaps Unicode could be a gateway to a universal format for packetized files,
if you reserved one code page for packet headers and their meta-data.
Detecting that code page would then be sufficient to trigger interpretation
of the file as packets, each composed of headers and data. Some packet types
would of course allow and expect nested packets in their data. Others might
contain XML or .exe or .so or whatever. Some kind of registry of packet types,
analogous to UPC codes, would have to exist to maintain standards. Life for
implementers of the 'file' command would be simplified a lot.
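For instance, a minimal reader for such a hypothetical packet file, assuming
a made-up header of a 4-byte type tag plus a 4-byte big-endian body length
(nested packets would just be packets inside a body), might look like:

    import struct

    HDR_FMT = ">4sI"    # hypothetical: 4-byte type tag, 4-byte body length
    HDR_SIZE = struct.calcsize(HDR_FMT)

    def read_packets(data):
        # walk a byte string as a sequence of (type, body) packets
        packets = []
        pos = 0
        while pos < len(data):
            ptype, length = struct.unpack(HDR_FMT, data[pos:pos + HDR_SIZE])
            pos = pos + HDR_SIZE
            packets.append((ptype, data[pos:pos + length]))
            pos = pos + length
        return packets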

You could do something like packet headers for Python scripts via a command-line
parameter that specifies n lines of 'packet-header' meta-data after the initial #! line.
(I think someone suggested something like that.)
</tangential>

When a single file encoding is replaced by multiple possible encodings in a system,
there is the problem of dealing with a mix, where it may not be feasible to modify
the files in old encodings to mark them. E.g., binary executable files are also encoded
files, and on Windows, .PIF files were (I assume) invented to carry the meta-data
necessary to run DOS programs, with default vs. special parameters defining the
virtual DOS environment to use.

It would be possible for new Python interpreters to look for, e.g., .pyf meta-data files
before using .py or .pyc files. Such files would be optional, and would serve a purpose
analogous to .pif files. They could also specify interpreter versions and special imports
that should be in effect, making a kind of script closure, to think of it another way.
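For example, a spam.pyf might be nothing fancier than key: value lines
(the format and keys here are pure invention on my part):

    # spam.pyf -- meta-data for spam.py
    encoding: iso-8859-1
    interpreter: 2.3
    requires: MyWeirdLib

and the interpreter's lookup could be as simple as this sketch:

    import os

    def read_pyf(script_path):
        # look for spam.pyf next to spam.py and parse simple
        # "key: value" lines into a dictionary of meta-data
        meta = {}
        pyf = os.path.splitext(script_path)[0] + ".pyf"
        if os.path.exists(pyf):
            for line in open(pyf):
                line = line.strip()
                if line and not line.startswith("#") and ":" in line:
                    key, value = line.split(":", 1)
                    meta[key.strip()] = value.strip()
        return meta

The encoding key would then override (or be cross-checked against) whatever
BOM or magic comment is found in the script itself.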

It is tempting to mandate UTF-8 encoding for .pyf files, but perhaps they should be
least-common-denominator-encoded, e.g., ASCII.

If/when a file system is available that supports user meta-data, the .pyf contents could
optionally be represented that way when using that file system.

If you want to use magic strings in comments, you are really altering the grammar of the
language IMO, and maybe it should be defined in the grammar and parsed. While you're at it,
you might as well define some syntax for external supporting-document linkage, and ways to
use it programmatically, e.g. via some generated module attribute, so you could kick off
external help from an exception handler, etc.
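As the PEP currently has it, the declaration is a comment on line one or two
matching something like the regular expression coding[:=]\s*([-\w.]+), so a
detector is essentially:

    import re

    CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

    def declared_encoding(lines):
        # only the first two lines may carry the declaration,
        # and it must live in a comment
        for line in lines[:2]:
            if line.lstrip().startswith("#"):
                match = CODING_RE.search(line)
                if match:
                    return match.group(1)
        return None

That is effectively grammar, just hiding in comment syntax.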

.pyf files could coexist with UTF-8 BOM detection, with a warning issued if a
conflicting encoding were specified. I'm sure someone could write an emacs macro
to write .pyf files along with .py files as required.

I mentioned some variant of this idea in the past, and an objection then was that the
meta-data could get separated from the associated script file too easily. But separation
also has virtues: you can have separate access permissions on separate files, and you
can describe a file to which you have only read-only network access (e.g. by including
remote file location info in the local .pyf).

Regards,
Bengt Richter


