[I18n-sig] Re: Unicode debate

M.-A. Lemburg mal@lemburg.com
Fri, 28 Apr 2000 14:13:56 +0200


Tom Emerson wrote:
> 
> Just van Rossum writes:
>  > How will other parts of a program know which encoding was used for
>  > non-unicode string literals?
> 
> This is the exact reason that Unicode should be used for all string
> literals: from a language design perspective I don't understand the
> rationale for providing both "traditional" and "unicode" strings.
> 
>  > It seems to me that an encoding attribute for 8-bit strings solves this
>  > nicely. The attribute should only be set automatically if the encoding of
>  > the source file was specified or when the string has been encoded from a
>  > unicode string. The attribute should *only* be used when converting to
>  > unicode. (Hm, it could even be used when calling unicode() without the
>  > encoding argument.) It should *not* be used when comparing (or adding,
>  > etc.) 8-bit strings to each other, since they still may contain binary
>  > goop, even in a source file with a specified encoding!
> 
> In Dylan there is an explicit split between 'characters' (which are
> always Unicode) and 'bytes'.
> 
> What are the compelling reasons to not use UTF-8 as the (source)
> document encoding? In the past the usual response was, "the tools
> aren't there for authoring UTF-8 documents". This argument becomes
> more specious as more OSes move towards Unicode. I firmly believe this can
> be done without Java's bloat.
> 
> One off-the-cuff solution is this:
> 
> All character strings are Unicode (UTF-8 encoding). Language
> terminals and operators are restricted to US-ASCII, which is
> encoded identically in UTF-8. The contents of comments are not
> interpreted in any way.

That would be an option... albeit one that would probably break
many existing programs (I believe that many people have baked
their local charset into their programs, either by entering
locale-dependent strings directly in the source code or by
making assumptions about the encoding of those strings).
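
To illustrate the breakage, here is a minimal sketch in present-day
Python 3 syntax (not the syntax available at the time of this post).
The byte string and the choice of Latin-1 and cp1251 are hypothetical
examples, not taken from any real program: the same bytes read
differently under two local charsets and are not valid UTF-8 at all,
while pure US-ASCII text is unaffected.

```python
# Hypothetical example: the bytes of a German string literal as they
# would sit in a source file saved on a Latin-1 system.
raw = b"Gr\xfc\xdfe"

# Decoded with the charset the author had in mind.
as_latin1 = raw.decode("latin-1")

# The very same bytes interpreted on a system using a Cyrillic charset.
as_cp1251 = raw.decode("cp1251")

# Under a "all source is UTF-8" rule, these bytes are simply invalid.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Pure US-ASCII content decodes identically as ASCII and as UTF-8.
ascii_only = b"print"
same_in_both = ascii_only.decode("ascii") == ascii_only.decode("utf-8")
```

Only sources restricted to US-ASCII would survive such a switch
unchanged; anything carrying local 8-bit characters would need to be
converted first.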
 
>  > >- We need a way to indicate the encoding of input and output data
>  > >files, and we need shortcuts to set the encoding of stdin, stdout and
>  > >stderr (and maybe all files opened without an explicit encoding).
>  >
>  > Can you open a file *with* an explicit encoding?
> 
> If you cannot, you lose. You absolutely must be able to specify the
> encoding of a file when opening it, so that the runtime can transcode
> into the native encoding as you read it. Otherwise, this should be
> transparent to the user.

You can: codecs.open(). The interface needs some further
refinement though.
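
A minimal sketch of what this looks like, written in present-day
Python 3 syntax. The helper name roundtrip and the sample text are
hypothetical; only codecs.open() itself comes from the codecs module.
The file object encodes on write and decodes on read, so the program
only ever handles Unicode strings.

```python
import codecs
import os
import tempfile


def roundtrip(text, encoding):
    """Write text to a temp file in `encoding`, then read it back.

    Returns the raw bytes found on disk and the decoded text.
    (Hypothetical helper, for illustration only.)
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Unicode in, encoded bytes on disk.
        with codecs.open(path, "w", encoding=encoding) as f:
            f.write(text)
        # Look at what actually landed in the file.
        with open(path, "rb") as f:
            raw = f.read()
        # Encoded bytes on disk, Unicode out.
        with codecs.open(path, "r", encoding=encoding) as f:
            decoded = f.read()
    finally:
        os.remove(path)
    return raw, decoded


latin1_bytes, latin1_text = roundtrip("Gr\u00fc\u00dfe", "latin-1")
utf8_bytes, utf8_text = roundtrip("Gr\u00fc\u00dfe", "utf-8")
```

Both calls return the same text, while the bytes on disk differ with
the chosen encoding.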

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/