Re: [Python-Dev] #pragmas in Python source code

Andrew M. Kuchling akuchlin@mems-exchange.org
Fri, 14 Apr 2000 15:37:01 -0400 (EDT)


Fredrik Lundh writes:
>    if the programmer wants to convert between a unicode
>    string and a buffer containing encoded text, she needs
>    to spell it out.  the codecs are never called "under the
>    hood"

Watching the successive weekly Unicode patchsets, each one fixing some
obscure corner case that turned out to be buggy -- '%s' % ustr,
concatenating literals, int()/float()/long(), comparisons -- I'm
beginning to agree with Fredrik.  Automatically making Unicode strings
and regular strings interoperate looks like it requires many changes
all over the place, and I worry whether it's possible to catch them
all in time.
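For concreteness, the mixing cases at issue look like this (a small sketch in today's Python, where the question was ultimately settled Fredrik's way: mixing encoded bytes with Unicode text raises TypeError instead of silently invoking a default codec -- the helper name here is my own, not anything from the patchsets):

```python
def mixes_silently(a, b):
    """Return True if a + b succeeds via implicit coercion,
    False if the interpreter refuses to mix the two types."""
    try:
        a + b
        return True
    except TypeError:
        return False

# Two text strings concatenate fine; bytes + text does not.
same_type = mixes_silently("abc", "def")
cross_type = mixes_silently(b"abc", "def")
```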

Maybe we should consider being more conservative, and just having the
Unicode built-in type, the unicode() built-in function, and the u"..."
notation, and then leaving all responsibility for conversions up to
the user.  On the other hand, *some* default conversion seems needed,
because it seems draconian to make open(u"abcfile") fail with a
TypeError.
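To make the "explicit conversions only" idea concrete, here is what spelling everything out looks like (shown in modern str/bytes terms rather than 1.6's unicode()/u"..." spelling; the variable names are illustrative):

```python
# The programmer names the encoding at every boundary;
# no codec is ever invoked "under the hood".
data = b"caf\xc3\xa9"            # encoded text: UTF-8 bytes
text = data.decode("utf-8")      # bytes -> Unicode, encoding spelled out
back = text.encode("utf-8")      # Unicode -> bytes, encoding spelled out
assert back == data
```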

(While I want to see Python 1.6 expedited, I'd also not like to see it
saddled with a system that proves to have been a mistake, or one
that's a maintenance burden.  If forced to choose between shipping on
schedule and getting it right, the latter wins.)

>why not just assume that the *ENTIRE SOURCE FILE* uses a single
>encoding, and let the tokenizer (or more likely, a conversion stage
>before the tokenizer) convert the whole thing to unicode.

To reinforce Fredrik's point here, note that XML only supports
encodings at the level of an entire file (or external entity). You
can't tell an XML parser that a file is in UTF-8, except for this one
element whose contents are in Latin1.  
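Fredrik's scheme amounts to a pre-tokenizer pass: decode the entire source file with one declared encoding, and hand the tokenizer pure Unicode.  A hypothetical sketch (the function and its default encoding are my assumptions, not an actual proposal detail):

```python
def read_source(path, encoding="utf-8"):
    """Hypothetical conversion stage before the tokenizer:
    decode the whole file with a single encoding, so the
    tokenizer only ever sees Unicode."""
    with open(path, "rb") as f:
        raw = f.read()
    # One encoding for the entire file -- no per-string,
    # per-element switching as in the XML counterexample.
    return raw.decode(encoding)
```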

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
Dream casts a human shadow, when it occurs to him to do so.
  -- From SANDMAN: "Season of Mists", episode 0