[Python-Dev] #pragmas in Python source code

M.-A. Lemburg mal@lemburg.com
Fri, 14 Apr 2000 10:46:15 +0200


Fredrik Lundh wrote:
> 
> M.-A. Lemburg wrote:
> > Fredrik Lundh wrote:
> > >
> > > M.-A. Lemburg wrote:
> > > > The current need for #pragmas is really very simple: to tell
> > > > the compiler which encoding to assume for the characters
> > > > in u"...strings..." (*not* "...8-bit strings...").
> > >
> > > why not?
> >
> > Because plain old 8-bit strings should work just as before,
> > that is, existing scripts only using 8-bit strings should not break.
> 
> but they won't -- if you don't use an encoding directive, and
> don't use 8-bit characters in your string literals, everything
> works as before.
> 
> (that's why the default is "none" and not "utf-8")
> 
> if you use 8-bit characters in your source code and wish to
> add an encoding directive, you need to add the right encoding
> directive...

Fair enough, but this would render all the auto-coercion
code currently in 1.6 useless -- all string to Unicode
conversions would have to raise an exception.

> > > why keep on pretending that strings and strings are two
> > > different things?  it's an artificial distinction, and it only
> > > causes problems all over the place.
> >
> > Sure. The point is that we can't just drop the old 8-bit
> > strings... not until Py3K at least (and as Fred already
> > said, all standard editors will have native Unicode support
> > by then).
> 
> I discussed that in my original "all characters are unicode
> characters" proposal.  in my proposal, the standard string
> type will have two roles: a string either contains unicode
> characters, or binary bytes.
> 
> -- if it contains unicode characters, python guarantees that
> methods like strip, lower (etc), and regular expressions work
> as expected.
> 
> -- if it contains binary data, you can still use indexing, slicing,
> find, split, etc.  but they then work on bytes, not on chars.
> 
> it's still up to the programmer to keep track of what a certain
> string object is (a real string, a chunk of binary data, an
> encoded string, a jpeg image, etc).  if the programmer wants
> to convert between a unicode string and an external encoding
> to use a certain unicode encoding, she needs to spell it out.
> the codecs are never called "under the hood".
> 
> (note that if you encode a unicode string into some other
> encoding, the result is a binary buffer.  operations like strip,
> lower et al do *not* work on encoded strings).

Huh? If the programmer already knows that a certain
string uses a certain encoding, then he can just as well
convert it to Unicode by hand using the right encoding
name. The point under discussion here is that when the
implementation converts a string to Unicode all by itself,
it needs to know which encoding to use. This is the case
for which we decided long ago that UTF-8 should be used.
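
To make the distinction concrete, here is a minimal sketch in
present-day Python 3 syntax, where `bytes` stands in for the 8-bit
string type discussed here. That mapping and the helper name
`implicit_convert` are assumptions for illustration only, not the
1.6 implementation.

```python
# Sketch only: Python 3 `bytes` stands in for the 8-bit string type.
# These bytes are Latin-1 encoded text, and the programmer knows that.
raw = "Gr\u00fc\u00dfe".encode("latin-1")

# Explicit conversion: the programmer names the encoding by hand.
text = raw.decode("latin-1")


def implicit_convert(data, assumed="utf-8"):
    """Hypothetical helper: convert with a fixed default encoding,
    as an implementation must when nobody tells it the encoding."""
    return data.decode(assumed)


# Pure ASCII data converts fine under the UTF-8 assumption...
ascii_ok = implicit_convert(b"abc")

# ...but Latin-1 data is not valid UTF-8, so the conversion fails
# instead of silently producing wrong characters.
try:
    implicit_convert(raw)
    failed = False
except UnicodeDecodeError:
    failed = True
```

With a known encoding the explicit call is all that is needed; the
default only matters for the conversions nobody spelled out.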

The pragma discussion is about a totally different
issue: pragmas could make it possible for the programmer
to tell the *compiler* which encoding to use for literal
u"unicode" strings -- nothing more. Since "8-bit" strings
currently don't have an encoding attached to them we store
them as-is.
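
As a rough sketch of what such a pragma would control (modern
Python 3 syntax; both function names are hypothetical and do not
describe any real compiler interface), the same bytes between the
quotes yield different Unicode literals depending on the declared
source encoding, while an 8-bit literal is stored untouched:

```python
# The raw bytes an editor wrote between the quotes of a literal.
literal_bytes = b"\xe4\xf6\xfc"


def compile_unicode_literal(body, source_encoding):
    """Hypothetical: decode the body of a u"..." literal using the
    encoding named by the pragma."""
    return body.decode(source_encoding)


def compile_8bit_literal(body):
    """Hypothetical: an 8-bit literal has no encoding attached,
    so its bytes are stored as-is."""
    return body


# Same bytes, two different declared encodings, two different results.
as_latin1 = compile_unicode_literal(literal_bytes, "latin-1")
as_cp1251 = compile_unicode_literal(literal_bytes, "cp1251")
stored = compile_8bit_literal(literal_bytes)
```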

I don't want to get into designing a completely new
character container type here... this can all be done for Py3K,
but not now -- it breaks things at too many ends (even though
it would solve the issues with strings being used in different
contexts).
 
> > > -- we still need an encoding marker for ascii supersets (how about
> > > <?python encoding="utf-8" version="1.6"?> ;-).  however, it's up to
> > > the tokenizer to detect that one, not the parser.  the parser only
> > > sees unicode strings.
> >
> > Hmm, the tokenizer doesn't do any string -> object conversion.
> > That's a task done by the parser.
> 
> "unicode string" meant Py_UNICODE*, not PyUnicodeObject.
> 
> if the tokenizer does the actual conversion doesn't really matter;
> the point is that once the code has passed through the tokenizer,
> it's unicode.

The tokenizer would have to know which parts of the
input string to convert to Unicode and which not... plus there
are different encodings to be applied, e.g. UTF-8, Unicode-Escape,
Raw-Unicode-Escape, etc.
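
The difference between these encodings can be sketched with the
codec names as they exist in present-day Python 3 ("utf-8",
"unicode_escape", "raw_unicode_escape"). Whether the 1.6 tokenizer
would call them this way is an assumption; the sketch only shows
that one input needs different treatment per encoding.

```python
# The Euro sign as raw UTF-8 bytes.
utf8_bytes = b"\xe2\x82\xac"

# The ASCII text \u20ac\n, i.e. literal backslashes followed by letters.
escaped = b"\\u20ac\\n"

# UTF-8: multi-byte sequences are decoded to code points.
from_utf8 = utf8_bytes.decode("utf-8")

# Unicode-Escape: both \uXXXX and \n are interpreted.
from_escape = escaped.decode("unicode_escape")

# Raw-Unicode-Escape: only \uXXXX is interpreted; the \n stays as a
# backslash followed by the letter n.
from_raw_escape = escaped.decode("raw_unicode_escape")
```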

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/