Re: [Python-Dev] #pragmas in Python source code

Fredrik Lundh <effbot@telia.com>
Fri, 14 Apr 2000 21:17:23 +0200


M.-A. Lemburg <mal@lemburg.com> wrote:
> > but they won't -- if you don't use an encoding directive, and
> > don't use 8-bit characters in your string literals, everything
> > works as before.
> >
> > (that's why the default is "none" and not "utf-8")
> >
> > if you use 8-bit characters in your source code and wish to
> > add an encoding directive, you need to add the right encoding
> > directive...
>
> Fair enough, but this would render all the auto-coercion
> code currently in 1.6 useless -- all string to Unicode
> conversions would have to raise an exception.

I thought it was rather clear by now that I think the auto-
conversion stuff *is* useless...

but no, that doesn't mean that all string to unicode conversions
need to raise exceptions -- any 8-bit unicode character obviously
fits into a 16-bit unicode character, just like any integer fits in a
long integer.

if you convert the other way, you might get an OverflowError, just
like converting from a long integer to an integer may give you an
exception if the long integer is too large to be represented as an
ordinary integer.  after all,

    i = int(long(v))

doesn't always raise an exception...
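(to make the analogy concrete -- a rough sketch in 1.6-era Python;
the exact numbers don't matter, only which direction can raise:)

    n = long(42)            # "widening" never fails, like 8-bit -> unicode

    try:
        i = int(10L ** 20)  # "narrowing" can fail if the value doesn't fit
    except OverflowError:
        pass                # like encoding a character the target set lacks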

> > > > why keep on pretending that strings and strings are two
> > > > different things?  it's an artificial distinction, and it only
> > > > causes problems all over the place.
> > >
> > > Sure. The point is that we can't just drop the old 8-bit
> > > strings... not until Py3K at least (and as Fred already
> > > said, all standard editors will have native Unicode support
> > > by then).
> >
> > I discussed that in my original "all characters are unicode
> > characters" proposal.  in my proposal, the standard string
> > type will have two roles: a string either contains unicode
> > characters, or binary bytes.
> >
> > -- if it contains unicode characters, python guarantees that
> > methods like strip, lower (etc), and regular expressions work
> > as expected.
> >
> > -- if it contains binary data, you can still use indexing, slicing,
> > find, split, etc.  but they then work on bytes, not on chars.
> >
> > it's still up to the programmer to keep track of what a certain
> > string object is (a real string, a chunk of binary data, an en-
> > coded string, a jpeg image, etc).  if the programmer wants
> > to convert between a unicode string and an external encoding
> > to use a certain unicode encoding, she needs to spell it out.
> > the codecs are never called "under the hood".
> >
> > (note that if you encode a unicode string into some other
> > encoding, the result is a binary buffer.  operations like strip,
> > lower et al do *not* work on encoded strings).
>
> Huh ? If the programmer already knows that a certain
> string uses a certain encoding, then he can just as well
> convert it to Unicode by hand using the right encoding
> name.

I thought that was what I said, but the text was garbled.  let's
try again:

    if the programmer wants to convert between a unicode
    string and a buffer containing encoded text, she needs
    to spell it out.  the codecs are never called "under the
    hood"

> The whole point we are talking about here is that when
> having the implementation convert a string to Unicode all
> by itself it needs to know which encoding to use. This is
> where we have decided long ago that UTF-8 should be
> used.

does "long ago" mean that the decision cannot be
questioned?  what's going on here?

face it, I don't want to guess when and how the interpreter
will convert strings for me.  after all, this is Python, not Perl.

if I want to convert from a "string of characters" to a byte
buffer using a certain character encoding, let's make that
explicit.

Python doesn't convert between other data types for me, so
why should strings be a special case?
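
(for what it's worth, a minimal sketch of the explicit spelling, using
the 1.6 codec machinery; the encoding names are just examples:)

    data = "bonjour"                  # an 8-bit byte string from somewhere
    u = unicode(data, "iso-8859-1")   # decode: bytes + named encoding -> unicode
    buf = u.encode("utf-8")           # encode: unicode + named encoding -> bytes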

> The pragma discussion is about a totally different
> issue: pragmas could make it possible for the programmer
> to tell the *compiler* which encoding to use for literal
> u"unicode" strings -- nothing more. Since "8-bit" strings
> currently don't have an encoding attached to them we store
> them as-is.

what do I have to do to make you read my proposal?

shout?

okay, I'll try:

    THERE SHOULD BE JUST ONE INTERNAL CHARACTER
    SET IN PYTHON 1.6: UNICODE.

for consistency, let this be true for both 8-bit and 16-bit
strings (as well as Py3K's 31-bit strings ;-).

there are many possible external string encodings, just like there
are many possible external integer encodings.   but for integers,
that's not something that the core implementation cares much
about.  why are strings different?
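
(the integer parallel, spelled out: external representations exist, but
you ask for one explicitly, and the int object itself never carries an
"encoding" around:)

    import struct

    n = 1000
    big = struct.pack(">i", n)      # one external encoding of an integer
    little = struct.pack("<i", n)   # another one; n itself is unchanged
    assert struct.unpack(">i", big)[0] == n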

> I don't want to get into designing a completely new
> character container type here... this can all be done for Py3K,
> but not now -- it breaks things at too many ends (even though
> it would solve the issues with strings being used in different
> contexts).

you don't need to -- you only need to define how the *existing*
string type should be used.  in my proposal, it can be used in two
ways:

-- as a string of unicode characters (restricted to the
   0-255 subset, for obvious reasons).  given a string 's',
   len(s) is always the number of characters, s[i] is the
   i'th character, etc.

or

-- as a buffer containing binary bytes. given a buffer 'b',
   len(b) is always the number of bytes, b[i] is the i'th
   byte, etc.
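
(a quick illustration of the two roles -- the file name is made up, and
the point is that the *programmer* knows which role a given string plays:)

    text = "hello world"                    # role 1: unicode characters (0-255 subset)
    print text.upper()                      # character operations behave as expected

    data = open("photo.jpg", "rb").read()   # role 2: binary bytes
    print len(data)                         # counts bytes; upper() is meaningless here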

this is one flavour less than in the 1.6 alphas -- where strings sometimes
contain UTF-8 (and methods like upper etc don't work), sometimes an
8-bit character set (and upper works), and sometimes binary buffers (for
which upper doesn't work).

(hmm.  I've said all this before, haven't I?)

> > > > -- we still need an encoding marker for ascii supersets (how about
> > > > <?python encoding="utf-8" version="1.6"?> ;-).  however, it's up to
> > > > the tokenizer to detect that one, not the parser.  the parser only
> > > > sees unicode strings.
> > >
> > > Hmm, the tokenizer doesn't do any string -> object conversion.
> > > That's a task done by the parser.
> >
> > "unicode string" meant Py_UNICODE*, not PyUnicodeObject.
> >
> > whether the tokenizer does the actual conversion doesn't really matter;
> > the point is that once the code has passed through the tokenizer,
> > it's unicode.
>
> The tokenizer would have to know which parts of the
> input string to convert to Unicode and which not...  plus there
> are different encodings to be applied, e.g. UTF-8, Unicode-Escape,
> Raw-Unicode-Escape, etc.

sigh.  why do you insist on taking a very simple thing and making
it very very complicated?  will anyone out there ever use an editor
that supports different encodings for different parts of the file?

why not just assume that the *ENTIRE SOURCE FILE* uses a single
encoding, and let the tokenizer (or more likely, a conversion stage
before the tokenizer) convert the whole thing to unicode.

let the rest of the compiler work on Py_UNICODE* strings only, and
all your design headaches will just disappear.
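
(a toy sketch of that conversion stage; read_source and the "ascii"
default are made up, but the idea is just: decode the whole file once,
up front:)

    def read_source(filename, encoding="ascii"):
        # hypothetical stage in front of the tokenizer: decode the
        # *entire* source file in one go, using the declared encoding
        raw = open(filename, "rb").read()
        return unicode(raw, encoding)    # everything downstream sees unicode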

...

frankly, I'm beginning to feel like John Skaller.  do I have to write my
own interpreter to get this done right? :-(

</F>