[Python-Dev] #pragmas in Python source code

Thu, 13 Apr 2000 17:55:08 +0200

Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > The current need for #pragmas is really very simple: to tell
> > the compiler which encoding to assume for the characters
> > in u"...strings..." (*not* "...8-bit strings...").
> 
> why not?

Because plain old 8-bit strings should work just as before,
that is, existing scripts only using 8-bit strings should not break.

> why keep on pretending that strings and strings are two
> different things?  it's an artificial distinction, and it only
> causes problems all over the place.

Sure. The point is that we can't just drop the old 8-bit
strings... not until Py3K at least (and as Fred already
said, all standard editors will have native Unicode support
by then).

So for now we're stuck with Unicode *and* 8-bit strings
and have to make the two meet somehow -- which isn't all
that easy, since 8-bit strings carry no encoding information.
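
To make that concrete, here is a minimal sketch in present-day Python 3
syntax, where bytes stands in for the 8-bit string. The sample bytes are
only an illustration: the same three bytes are equally valid text in two
different 8-bit encodings, and nothing in the data says which was meant.

```python
# Illustration only: Python 3 `bytes` stands in for an 8-bit string.
raw = b"\xe4\xf6\xfc"  # three bytes, with no record of their encoding

as_latin1 = raw.decode("latin-1")  # 'äöü' if the author meant Latin-1
as_cp1251 = raw.decode("cp1251")   # 'дць' if the author meant Windows Cyrillic

# Both decodes succeed, so the bytes alone cannot settle the question.
print(as_latin1, as_cp1251)
```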

> > Could be that we don't need this pragma discussion at all
> > if there is a different, more elegant solution to this...
> 
> here's one way:
> 
> 1. standardize on *unicode* as the internal character set.  use
> an encoding marker to specify what *external* encoding you're
> using for the *entire* source file.  output from the tokenizer is
> a stream of *unicode* strings.

Yep, that would work in Py3K...

> 2. if the user tries to store a unicode character larger than 255
> in an 8-bit string, raise an OverflowError.

There are no 8-bit strings in Py3K -- only 8-bit data
buffers which don't have string methods ;-)
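
For an implementation that still has 8-bit strings, the check in item 2
could look roughly like the sketch below. The name to_8bit is
hypothetical, and Python 3 bytes again stands in for the 8-bit string.

```python
def to_8bit(text):
    """Hypothetical helper: narrow a Unicode string to one byte per character."""
    for ch in text:
        if ord(ch) > 255:
            # Item 2 of the proposal: refuse characters that do not fit.
            raise OverflowError("character %r does not fit in 8 bits" % ch)
    return bytes(ord(ch) for ch in text)


print(to_8bit("abc\xe4"))  # every character is <= 255, so this succeeds
try:
    to_8bit("\u20ac")      # EURO SIGN, code point 0x20AC
except OverflowError as exc:
    print("rejected:", exc)
```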

> 3. the default encoding is "none" (instead of XML's "utf-8"). in
> this case, treat the script as an ascii superset, and store each
> string literal as is (character-wise, not byte-wise).

Uhm. I think UTF-8 will be the standard for text file formats
by then... so why not make UTF-8 the default?
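
The difference between the two defaults shows up as soon as a literal
contains non-ASCII bytes. A small sketch, assuming that "none" means one
character per source byte (which is what a Latin-1 decode does) and using
Python 3 syntax:

```python
# Source bytes of a string literal's contents, as found in the file.
literal_bytes = b"\xc3\xa4"

# Default "none": store character-wise, one character per byte.
as_none = literal_bytes.decode("latin-1")

# Default "utf-8": the two bytes form a single character.
as_utf8 = literal_bytes.decode("utf-8")

print(len(as_none), len(as_utf8))  # 2 1
```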

> additional notes:
> 
> -- item (3) is for backwards compatibility only.  might be okay to
> change this in Py3K, but not before that.
> 
> -- leave the implementation of (1) to 1.7.  for now, assume that
> scripts have the default encoding, which means that (2) cannot
> happen.

I'd say, leave all this to Py3K.

> -- we still need an encoding marker for ascii supersets (how about
> <?python encoding="utf-8" version="1.6"?> ;-).  however, it's up to
> the tokenizer to detect that one, not the parser.  the parser only
> sees unicode strings.

Hmm, the tokenizer doesn't do any string -> object conversion.
That's a task done by the parser.
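
Whichever stage ends up doing it, detecting the marker and decoding the
file is a small step. A rough sketch in Python 3 syntax, using the
tongue-in-cheek marker quoted above: decode_source is a hypothetical name,
and the assumptions that the marker must sit on the first line and that
the fallback is a byte-per-character (Latin-1) decode are mine.

```python
import re

# Marker syntax taken from the proposal quoted above; only the encoding
# attribute is read here.
MARKER = re.compile(rb'<\?python\s+encoding="([A-Za-z0-9_.-]+)"')


def decode_source(data, default="latin-1"):
    """Hypothetical: decode raw source bytes according to a first-line marker."""
    first_line = data.split(b"\n", 1)[0]
    match = MARKER.search(first_line)
    encoding = match.group(1).decode("ascii") if match else default
    return data.decode(encoding)


marked = b'# <?python encoding="utf-8" version="1.6"?>\ns = u"\xc3\xa4"\n'
unmarked = b's = "\xc3\xa4"\n'

print(repr(decode_source(marked)))    # decoded as UTF-8
print(repr(decode_source(unmarked)))  # decoded one character per byte
```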

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/