[Python-Dev] String encoding

M.-A. Lemburg mal@lemburg.com
Tue, 23 May 2000 18:48:08 +0200


"Fred L. Drake" wrote:
> 
> On Tue, 23 May 2000, M.-A. Lemburg wrote:
>  > The problem is that "s" and "t" return C pointers to some
>  > internal data structure of the object. It has to be assured
>  > that this data remains intact at least as long as the object
>  > itself exists.
>  >
>  > AFAIK, this cannot be fixed without creating a memory leak.
>  >
>  > The "es" parser marker uses a different strategy, BTW: the
>  > data is copied into a buffer, thus detaching the object
>  > from the data.
>  >
>  > > > C APIs which want to support Unicode should be fixed to use
>  > > > "es" or query the object directly and then apply proper, possibly
>  > > > OS dependent conversion.
>  > >
>  > > for convenience, it might be a good idea to have a "wide system
>  > > encoding" too, and special parser markers for that purpose.
>  > >
>  > > or can we assume that all wide system API's use unicode all the
>  > > time?
>  >
>  > At least in all references I've seen (e.g. ODBC, wchar_t
>  > implementations, etc.) "wide" refers to Unicode.
> 
>   On Linux, wchar_t is 4 bytes; that's not just Unicode.  Doesn't ISO
> 10646 require a 32-bit space?

It does; Unicode is definitely moving in the 32-bit direction.

>   I recall a fair bit of discussion about wchar_t when it was introduced
> to ANSI C, and the character set and encoding were specifically not made
> part of the specification.  Making a requirement that wchar_t be Unicode
> doesn't make a lot of sense, and opens up potential portability issues.
> 
> -1 on any assumption that wchar_t is usefully portable.

Ok... so it could be that Fredrik has a point there, but I'm
not deep enough into this to be able to comment.
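
For what it's worth, the size difference Fred mentions is easy to
check on a given platform. The following is only a minimal sketch,
assuming a Python that ships the ctypes module (not something
discussed in this thread); the helper name wchar_info is made up for
illustration. It reports the width of wchar_t and nothing more: it
says nothing about which character set or encoding the platform
stores in it, which is exactly the part the C standard leaves open.

```python
import ctypes


def wchar_info():
    """Return (size_in_bytes, fits_32_bit) for the platform's wchar_t.

    Hypothetical helper for illustration only. It measures the width
    of the C type; it cannot tell whether the values stored in it are
    Unicode code points.
    """
    size = ctypes.sizeof(ctypes.c_wchar)
    # A wchar_t of 4 or more bytes has room for a full 32-bit code
    # value; a 2-byte wchar_t only has room for 16-bit units.
    return size, size >= 4


if __name__ == "__main__":
    size, fits_32_bit = wchar_info()
    print("sizeof(wchar_t) = %d bytes" % size)
    print("room for a 32-bit code value: %s" % fits_32_bit)
```

The result differs between platforms, so code that needs a known
encoding is probably better off asking for one explicitly than
relying on wchar_t.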

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/