[Python-Dev] #pragmas in Python source code
M.-A. Lemburg
mal@lemburg.com
Thu, 13 Apr 2000 17:55:08 +0200
Fredrik Lundh wrote:
>
> M.-A. Lemburg wrote:
> > The current need for #pragmas is really very simple: to tell
> > the compiler which encoding to assume for the characters
> > in u"...strings..." (*not* "...8-bit strings...").
>
> why not?
Because plain old 8-bit strings should work just as before,
that is, existing scripts that use only 8-bit strings should not break.
> why keep on pretending that strings and strings are two
> different things? it's an artificial distinction, and it only
> causes problems all over the place.
Sure. The point is that we can't just drop the old 8-bit
strings... not until Py3K at least (and as Fred already
said, all standard editors will have native Unicode support
by then).
So for now we're stuck with Unicode *and* 8-bit strings
and have to make the two meet somehow -- which isn't all
that easy, since 8-bit strings carry no encoding information.
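A minimal sketch of that problem, written in present-day Python 3
notation rather than the 1.6 syntax under discussion. The byte values
and the two encodings are illustrative assumptions, not taken from the
thread; the point is only that the bytes themselves do not say which
reading is correct.

```python
# Three bytes with no encoding attached, as an 8-bit string would hold them.
raw = b"\xe4\xf6\xfc"

# The reader has to guess an encoding, and each guess yields different text.
as_latin1 = raw.decode("latin-1")  # Western European reading
as_cp1251 = raw.decode("cp1251")   # Cyrillic reading

# Same bytes, different characters.
print(as_latin1 == as_cp1251)
```

Nothing in `raw` favours one result over the other, which is why mixing
such strings with Unicode needs an encoding supplied from outside.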
> > Could be that we don't need this pragma discussion at all
> > if there is a different, more elegant solution to this...
>
> here's one way:
>
> 1. standardize on *unicode* as the internal character set. use
> an encoding marker to specify what *external* encoding you're
> using for the *entire* source file. output from the tokenizer is
> a stream of *unicode* strings.
Yep, that would work in Py3K...
> 2. if the user tries to store a unicode character larger than 255
> in an 8-bit string, raise an OverflowError.
There are no 8-bit strings in Py3K -- only 8-bit data
buffers which don't have string methods ;-)
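For illustration, the check proposed in (2) could look roughly like
this. It is a sketch in present-day Python 3 notation; `to_8bit` is a
hypothetical helper name, not an existing or proposed API.

```python
def to_8bit(text):
    """Hypothetical helper: pack a Unicode string into an 8-bit string.

    Each character is stored as one byte with the same ordinal, so any
    character above 255 cannot be represented and is rejected.
    """
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code > 255:
            raise OverflowError(
                "character U+%04X does not fit in an 8-bit string" % code
            )
        out.append(code)
    return bytes(out)
```

Characters up to U+00FF pass through unchanged; anything larger, such
as the euro sign U+20AC, raises OverflowError.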
> 3. the default encoding is "none" (instead of XML's "utf-8"). in
> this case, treat the script as an ascii superset, and store each
> string literal as is (character-wise, not byte-wise).
Uhm. I think UTF-8 will be the standard for text file formats
by then... so why not make it UTF-8?
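The practical difference between the two defaults can be sketched as
follows, again in present-day Python 3 notation. Reading "none" as a
byte-for-character mapping (equivalent to a Latin-1 decode) is an
assumption about what item (3) intends, and the sample bytes are made
up for the example.

```python
# A string literal's bytes as a UTF-8 editor would save them ("cafe" with
# an accented e, which UTF-8 writes as the two bytes C3 A9).
source_bytes = b"caf\xc3\xa9"

# Default "none" (assumed meaning): each byte becomes the character with
# the same ordinal, so the literal is stored as is.
as_is = source_bytes.decode("latin-1")

# Default UTF-8: the two-byte sequence collapses into one character.
as_utf8 = source_bytes.decode("utf-8")

print(len(as_is), len(as_utf8))
```

With the "none" default an existing script sees exactly the bytes it
always saw; with a UTF-8 default the same file yields a shorter string
with different content.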
> additional notes:
>
> -- item (3) is for backwards compatibility only. might be okay to
> change this in Py3K, but not before that.
>
> -- leave the implementation of (1) to 1.7. for now, assume that
> scripts have the default encoding, which means that (2) cannot
> happen.
I'd say, leave all this to Py3K.
> -- we still need an encoding marker for ascii supersets (how about
> <?python encoding="utf-8" version="1.6"?> ;-). however, it's up to
> the tokenizer to detect that one, not the parser. the parser only
> sees unicode strings.
Hmm, the tokenizer doesn't do any string -> object conversion.
That's a task done by the parser.
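To make the marker idea concrete, here is a rough sketch of a
pre-processing step that looks for such a marker and hands decoded text
to whatever comes next. It is written in present-day Python 3 notation;
the marker syntax is the tongue-in-cheek one quoted above, and `MARKER`,
`decode_source` and the Latin-1 fallback are assumptions made for the
example, not part of any actual proposal.

```python
import re

# Hypothetical marker, expected somewhere on the first line of the file
# (for instance inside a comment).
MARKER = re.compile(rb'<\?python\s+encoding="([A-Za-z0-9_.\-]+)"')

def decode_source(data, default="latin-1"):
    """Return the source as Unicode text, honouring a first-line marker.

    Falls back to `default` (assumed here to be a byte-for-character
    mapping) when no marker is present.
    """
    first_line = data.split(b"\n", 1)[0]
    match = MARKER.search(first_line)
    encoding = match.group(1).decode("ascii") if match else default
    return data.decode(encoding)
```

Whether this step belongs in the tokenizer or in front of it is exactly
the open question; the sketch only shows that the marker can be found
before any string literal is turned into an object.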
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/