PEP 263 comments

Fri Mar 1 04:30:13 EST 2002

On 28 Feb 2002 15:09:23 +0100, Martin von Loewis <loewis at informatik.hu-berlin.de> wrote:
[...]
>You may wonder why Python (the programming language) needs to worry
>about the encoding at all. The reason is that we allow Unicode
>literals, in the form
>
>   u"text"
>
>The question is what is the encoding of "text", on disk. In memory, it
>will be 2-byte Unicode, so the interpreter needs to convert. To do
>that, it must know what the encoding is, on disk. The choices are
>using either UTF-8, or allowing encoding cookies.
>
I'm not sure what you mean by 'encoding cookies', but I assume you
mean something analogous to browser cookies, where some data of
interest is stored separately from, but related to, some other data
and processing, like HTML form submissions etc.

Well, forget the cookie associations, but I think keeping meta-data
separate from data is a Good Thing(tm).

So is keeping it out of the names of things (i.e., don't encode file
types in name extensions ;-)

<the main idea of this post>
Perhaps we could just use a file to contain extra file metadata,
letting a file of metadata govern other files it names in the same
directory as itself. Probably a dot file in *nix.

For PEP 263 purposes, it would only need to be a text file with each
file name separated by a tab from its keyword=encoding-info, perhaps
with glob patterns on the first line(s) as a compact way of specifying
the encoding for a lot of files in a directory at once.

To provide international encoding for file-associated info, such as
a name in a local dialect or with special characters, on a system whose
native file naming is more restricted, perhaps this directory of
file attributes could be standardized on UTF-8 for its own encoding.

That way, you could have the first column represent the file name
the system sees, and an optional uname= keyword could provide an
alternate UTF-8 encoded name for the file that tools that knew of it
could display, and then encoding=whatever for the actual file data per se.
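
To make that concrete, here is a rough sketch of how a tool might read
such a file. Everything specific in it is my assumption for illustration
only: the function names, the sample entries, and the rule that a later
matching line overrides an earlier glob line. It is not anything PEP 263
specifies.

```python
import fnmatch

def parse_attrs(text):
    """Parse lines of the form: <name-or-glob> TAB key=value [TAB key=value ...]."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        pattern = fields[0]
        # Split each keyword=value pair on the first '=' only.
        attrs = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
        entries.append((pattern, attrs))
    return entries

def lookup(entries, filename):
    """Merge attributes from every matching line; later lines win (assumed rule)."""
    result = {}
    for pattern, attrs in entries:
        if fnmatch.fnmatchcase(filename, pattern):
            result.update(attrs)
    return result

# Hypothetical metadata file content, itself stored as UTF-8.
SAMPLE = (
    "*.py\tencoding=iso-8859-1\n"
    "strasse.py\tuname=straße.py\tencoding=utf-8\n"
)

entries = parse_attrs(SAMPLE)
print(lookup(entries, "foo.py"))
print(lookup(entries, "strasse.py"))
```

Putting the glob line first gives a directory-wide default, and a
per-file line further down refines it without touching the file itself.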

The nice thing is that you don't have to touch the original files
to describe them. By including a location= keyword you could even
have this work like a symbolic link to a network file or even
a URL-specified file, which could be read-only and burned onto a CD,
or a please-mount-backup-tape-x location, etc.

The actual file data would not have to be in the same directory at all.
</the main idea of this post>

I have more ideas, but I tend to overdo one post that way ;-)

Regards,
Bengt Richter

P.S. This discussion made me look for some more UTF info. For anyone
interested, I found a FAQ at

http://www.unicode.org/unicode/faq/utf_bom.html#2

and

http://www.unicode.org/unicode/reports/tr27/

has a nice table showing where the bits go for UTF-8 and UTF-16 encoding
of Unicode characters, and even 32-bit stuff.
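
For the UTF-8 part, the bit distribution can be sketched in a few lines.
This is my own illustration of the standard UTF-8 layout, not code taken
from either document, and the function name is made up. It deliberately
skips the legality checks (surrogates, out-of-range values), so treat it
as a picture of where the bits go and not as a validating encoder.

```python
def utf8_bytes(cp):
    """Encode one code point (an int) as UTF-8, with no legality checks."""
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Compare against Python's built-in codec for a few sample characters.
for ch in ("A", "\u00e9", "\u20ac", "\U00010348"):
    assert utf8_bytes(ord(ch)) == ch.encode("utf-8")
```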

These might make good reference links for the PEP.

Apparently there were some changes to the legality checks
as of last May. I'm wondering if this affects PEP 263
and/or the Unicode implementation in Python.