[I18n-sig] Pre-PEP: Proposed Python Character Model

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Thu, 8 Feb 2001 02:08:56 +0100


> > Just try
> > 
> >   reader = codecs.lookup("ISO-8859-2")[2]
> >   charfile = reader(file)
> > 
> > There could be a convenience function, but that also is a detail.
> 
> Usability is not a detail in this particular case. We are trying to
> change people's behavior and help them make more robust code.

Ok, just propose a specific patch; I'd recommend adding another
function to the codecs module rather than adding another built-in.
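
Something along these lines would do (a sketch only; the name
open_text is invented, not a proposed spelling):

  import codecs

  def open_text(filename, mode="rb", encoding=None):
      # Wrap the file in the codec's StreamReader, so that read()
      # returns decoded characters instead of raw bytes.
      stream = open(filename, mode)
      if encoding is None:
          return stream
      # codecs.lookup returns (encoder, decoder, reader, writer)
      reader = codecs.lookup(encoding)[2]
      return reader(stream)

  charfile = open_text("data.txt", encoding="ISO-8859-2")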

> That's fine. I'll change the document to be more explicit. Would you
> agree that: "Unicode is the only *character set* that supports *all of
> the world's major written languages.*"

That is certainly the case.

> > > Not under my proposal. file.read returns a character string. Sometimes
> > > the character string contains characters between 0 and 255 and is
> > > indistinguishable from today's string type. Sometimes the file object
> > > knows that you want the data decoded and it returns large characters.
> > 
> > I guess we have to defer this until I see whether it is feasible
> > (which I believe it is not - it was the mistake Sun made in the early
> > JDKs).
> 
> What was the mistake?

Early versions of Java had methods that treated Strings and byte
arrays interchangeably as long as the strings had character values
below 256. One left-over from that is

  public String(byte[] ascii, int hibyte); // in class java.lang.String

It takes each byte of the ascii array as the low byte of a character
and fills in hibyte as the high byte; hibyte was typically 0. The
documentation now says

# Deprecated. This method does not properly convert bytes into
# characters. As of JDK 1.1, the preferred way to do this is via the
# String constructors that take a character-encoding name or that use
# the platform's default encoding.

The reverse operation of that is getBytes(int srcBegin, int srcEnd,
byte[] dst, int dstBegin):

# Deprecated. This method does not properly convert characters into
# bytes. As of JDK 1.1, the preferred way to do this is via the
# getBytes(String enc) method, which takes a character-encoding name,
# or the getBytes() method, which uses the platform's default
# encoding.

I'd say your proposal is in the direction of repeating this mistake.
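
The explicit spelling in Python, for comparison (a sketch; the byte
value is arbitrary):

  data = "\xf5"                        # one byte, read from somewhere
  text = unicode(data, "latin-1")      # explicit: U+00F5
  text2 = unicode(data, "iso-8859-2")  # same byte, different character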

> You and I agree that streams can change encoding mid-stream. You
> probably think that should be handled by passing the stream to various
> codecs as you read (or by doing double-buffer reads). I think that it
> should be possible right in the read method.

Please take it as a fact that it is impossible to do that at an
arbitrary point in the stream; codecs that need to maintain state
will produce strange results.
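
A sketch of the simplest failure mode (UTF-8 and the Euro sign serve
only as an example; shift-state codecs such as ISO-2022 are worse,
since their decoders carry state across reads):

  data = u"\u20ac".encode("utf-8")   # three bytes: "\xe2\x82\xac"
  # An arbitrary switch point can fall inside that sequence:
  try:
      unicode(data[:2], "utf-8")     # sequence cut off mid-character
  except UnicodeError:
      pass  # the bytes before the switch point cannot be decoded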

> It's funny how we switch back and forth. If I say that Python reads byte
> 245 into character 245 and thus uses Latin 1 as its default encoding I'm
> told I'm wrong. Python has no native encoding. If I claim that in
> passing data to C we should treat character 245 as the C "char" with the
> value 245 you tell me that I'm proposing Latin 1 as the default
> encoding.

Python has no default character set *in its byte string type*. Once
you have Unicode objects, it becomes meaningful to talk about a
character set specified by the language.

> Python has a concept of character that extends from 0 to 255. C has a
> concept of character that extends from 0 to 255. There is no issue of
> "encoding" as long as you stay within those ranges.

C supports various character sets, depending on context. Encodings do
matter here already, e.g. when selecting fonts. Some character sets
supported in C have more than 256 characters, even if they are stored
in char* (in particular, multi-byte character sets (MBCS) have this
property).
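
In Python terms (a sketch; UTF-8 stands in for any multi-byte
encoding, and a Shift-JIS example would look the same):

  data = "\xe2\x82\xac"          # three char-sized bytes in a C string
  text = unicode(data, "utf-8")  # they decode to a single character
  assert len(data) == 3
  assert len(text) == 1
  assert ord(text) > 255         # the character does not fit in a char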

> > That Chinese Python programmer should use his editor of choice, and
> > put _() around strings that are meant as text (as opposed to strings
> > that are protocol). 
> 
> I don't know what you mean by "protocol" here. 

If you do

print "GET "+url+" HTTP/1.0"

then the strings are really not meant to be human-readable; they are
part of some machine-to-machine communication protocol.

> But nevertheless, you are saying that the Chinese programmer must do
> more than the English programmer does and I consider that a problem.

It just works for the English programmer by coincidence; that
programmer should really distinguish text strings from byte strings
in source as well.
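
With the standard gettext module that distinction looks like this (a
sketch; the function and the socket are invented):

  import gettext
  _ = gettext.gettext

  def fetch(sock, url):
      # Protocol: bytes on the wire, never shown to a human.
      sock.send("GET " + url + " HTTP/1.0\r\n\r\n")
      # Text: meant for a human, hence marked for translation.
      print _("Fetching %s...") % url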

Following the Unicode path, source files should be UTF-8, but that
won't work in practice because of missing editor support.

Regards,
Martin