differences between 22 and python 23

Thu Dec 4 14:22:52 EST 2003

bokr at oz.net (Bengt Richter) writes:

> Still, the actual characters used in the _source_ representation will have to
> be whatever the -*- xxx -*- thing says, right? -- including the characters
> in the source representation of a string that might wind up utf-8 internally?

Yes, and no. Yes, characters in the source code have to follow the
source representation. No, they will not wind up utf-8
internally. Instead, (byte) string objects have the same byte
representation that they originally had in the source code.

The source declaration only matters in the following respects:
- the source may be erroneous, if the bytes form illegal encodings
  in the declared source encoding.
- a unicode object will be created based upon the source encoding,
  by decoding the bytes in the unicode literal.
- the meaning of certain bytes might not be what it would be in
  ASCII. In particular, byte 92 does not denote a backslash (\)
  in every encoding. As a result, if byte 92 appears in a string
  literal, the end of the string literal might depend on the
  encoding.
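
The three points above can be sketched roughly as follows. This is only
an illustration, written for a current Python 3 interpreter, where
bytes plays the role of the byte string type and str that of the
unicode object. The helper is_valid and the sample data are
assumptions made for the example, not part of any Python API.

```python
# Sketch only: Python 3 syntax (bytes = byte string, str = Unicode).
# is_valid and the sample data are hypothetical, chosen for illustration.

def is_valid(data, encoding):
    """Return True if the bytes decode cleanly in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True

# 1. The same bytes can be legal in one encoding and illegal in another.
lone_byte = b"\xe9"                      # a full character in Latin-1
valid_latin1 = is_valid(lone_byte, "latin-1")
valid_utf8 = is_valid(lone_byte, "utf-8")    # truncated sequence in UTF-8

# 2. Decoding the same bytes yields different Unicode text,
#    depending on the encoding that is applied.
raw = b"\xc3\xa9"
as_utf8 = raw.decode("utf-8")            # one character, U+00E9
as_latin1 = raw.decode("latin-1")        # two characters, U+00C3 U+00A9

# 3. Byte 92 (0x5C) is not always a backslash: in Shift_JIS it can be
#    the second byte of a two-byte character such as U+30BD.
sjis = "\u30bd".encode("shift_jis")      # bytes 0x83 0x5C
has_byte_92 = 92 in sjis

print(valid_latin1, valid_utf8, len(as_utf8), len(as_latin1), has_byte_92)
```

A tokenizer that looked for byte 92 without knowing the encoding would
see an escape character in the middle of that Shift_JIS character.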

> >The byte string type is not going away. It is a useful type, e.g. when
> >reading or writing to or from a byte stream.
> >
> Is this moving towards a single 8-bit str base type with various
> encoding-specifying subtypes?

I don't think so. If byte strings were tagged with an encoding, you
would have to answer many difficult questions, like "what is the result
of adding strings with different encodings?", or "what encoding tag
does a string returned from a socket read have?", etc.
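
To illustrate the first of these questions, here is a small sketch
(again Python 3 syntax, with sample data chosen for the example) of
what happens when byte strings in different encodings are simply
added: the result is not valid text in either encoding.

```python
# Sketch only: the same character, encoded two different ways.
latin1_part = "\u00e9".encode("latin-1")   # one byte: 0xE9
utf8_part = "\u00e9".encode("utf-8")       # two bytes: 0xC3 0xA9

# Adding the byte strings just concatenates the bytes.
mixed = latin1_part + utf8_part

# The result is not well-formed UTF-8 ...
try:
    mixed.decode("utf-8")
    mixed_is_utf8 = True
except UnicodeDecodeError:
    mixed_is_utf8 = False

# ... and read as Latin-1 it is not the intended two-character text.
mixed_as_latin1 = mixed.decode("latin-1")

# Decoding each part first gives a well-defined result.
text = latin1_part.decode("latin-1") + utf8_part.decode("utf-8")

print(mixed_is_utf8, len(mixed_as_latin1), text == "\u00e9\u00e9")
```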

Instead, applications should apply encodings wherever needed, using
Unicode strings for character data, and byte strings for binary data.
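
A minimal sketch of that division of labour, assuming Python 3 syntax
and using io.BytesIO as a stand-in for a socket or file: bytes are
decoded once on input, the program works on Unicode text, and the text
is encoded once on output. The names read_text and write_text are
hypothetical helpers, and the choice of Latin-1 and UTF-8 is arbitrary.

```python
import io

def read_text(stream, encoding):
    """Read all bytes from a binary stream and decode them to text."""
    return stream.read().decode(encoding)

def write_text(stream, text, encoding):
    """Encode text and write the resulting bytes to a binary stream."""
    stream.write(text.encode(encoding))

# Incoming binary data; the application has to know its encoding.
incoming = io.BytesIO("na\u00efve caf\u00e9".encode("latin-1"))

# Decode at the input boundary, then work on character data.
text = read_text(incoming, "latin-1")
shouted = text.upper()

# Encode at the output boundary, here to a different encoding.
outgoing = io.BytesIO()
write_text(outgoing, shouted, "utf-8")

print(outgoing.getvalue())
```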

Regards,
Martin