[Python-Dev] #pragmas in Python source code

Thu, 13 Apr 2000 13:50:17 +0200

M.-A. Lemburg wrote:
> The current need for #pragmas is really very simple: to tell
> the compiler which encoding to assume for the characters
> in u"...strings..." (*not* "...8-bit strings...").

why not?

why keep on pretending that strings and strings are two
different things?  it's an artificial distinction, and it only
causes problems all over the place.

> Could be that we don't need this pragma discussion at all
> if there is a different, more elegant solution to this...

here's one way:

1. standardize on *unicode* as the internal character set.  use
an encoding marker to specify what *external* encoding you're
using for the *entire* source file.  output from the tokenizer is
a stream of *unicode* strings.

2. if the user tries to store a unicode character larger than 255
in an 8-bit string, raise an OverflowError.

3. the default encoding is "none" (instead of XML's "utf-8"). in
this case, treat the script as an ascii superset, and store each
string literal as is (character-wise, not byte-wise).
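a rough sketch of how (2) and (3) could behave, not a real implementation. it is written against python 3's str/bytes split, and the helper names (decode_source, to_8bit) are made up for illustration. reading "character-wise, not byte-wise" as a latin-1 style mapping (byte value N becomes character N) is an assumption.

```python
def decode_source(raw, encoding=None):
    """Turn raw source bytes into a unicode string.

    encoding=None models the proposed "none" default: each byte is
    stored as the character with the same ordinal (assumed here to
    be equivalent to a latin-1 decode).
    """
    if encoding is None:
        return raw.decode("latin-1")
    return raw.decode(encoding)


def to_8bit(text):
    """Store a unicode string in an 8-bit string, as in item (2)."""
    for ch in text:
        if ord(ch) > 255:
            # item (2): characters above 255 do not fit in 8 bits
            raise OverflowError(
                "character U+%04X does not fit in an 8-bit string" % ord(ch)
            )
    # every character is <= 255 here, so latin-1 maps it to one byte
    return text.encode("latin-1")
```

with the default encoding every decoded character is below 256, so to_8bit can never fail on it; that is why (2) cannot happen until (1) is implemented.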

additional notes:

-- item (3) is for backwards compatibility only.  might be okay to
change this in Py3K, but not before that.

-- leave the implementation of (1) to 1.7.  for now, assume that
scripts have the default encoding, which means that (2) cannot
happen.

-- we still need an encoding marker for ascii supersets (how about
<?python encoding="utf-8" version="1.6"?> ;-).  however, it's up to
the tokenizer to detect that one, not the parser.  the parser only
sees unicode strings.
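
for illustration only, here is how a tokenizer front end might pick such a marker out of the first line of a script before anything reaches the parser. the marker syntax is the tongue-in-cheek one above, and detect_encoding is a hypothetical helper, not an existing api.

```python
import re

# matches the (hypothetical) <?python ... ?> marker and its attributes
MARKER = re.compile(r'<\?python\s+(.*?)\?>')
ATTR = re.compile(r'(\w+)="([^"]*)"')


def detect_encoding(first_line):
    """Return the declared encoding, or None if there is no marker."""
    match = MARKER.search(first_line)
    if match is None:
        return None  # no marker: fall back to the "none" default
    attrs = dict(ATTR.findall(match.group(1)))
    return attrs.get("encoding")
```

the tokenizer would then decode the rest of the file with the returned encoding and hand only unicode strings to the parser.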

</F>