[I18n-sig] Re: Unicode debate

Just van Rossum just@letterror.com
Fri, 28 Apr 2000 12:58:28 +0100


At 12:28 PM +0200 28-04-2000, M.-A. Lemburg wrote:
[ encoding attr for 8 bit strings ]
>This would indeed solve some issues... it would cost sizeof(short)
>per string object though (the integer would map into a table
>of encoding names).
>
>I'm not sure what to do with the attribute when strings with
>differing encodings meet. UTF-8 + ASCII will still be UTF-8,
>but e.g. UTF-8 + Latin will not result in meaningful data. Two
>ideas for coercing strings with different encodings:
>
> 1. the encoding of the resulting string is set to 'undefined'
>
> 2. coerce both strings to Unicode and then apply the action

Option 1, because option 2 can lead to surprises when two strings containing
binary goop are added and only one of them was a literal in a source file with
an explicit encoding.

(Would "undefined" be the same as "default"? It would still be nice to be
able to set the global default encoding.)
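
A minimal sketch of option 1, to make the coercion rule concrete. Nothing here is an existing API: the class name TaggedBytes, the helper combine(), and the spelling 'undefined' are all hypothetical, and the sketch uses present-day Python bytes to stand in for 8-bit strings. The tag is read-only, matching the observation below that the encoding attr can't be mutable.

```python
class TaggedBytes:
    """Hypothetical 8-bit string that carries a read-only encoding tag."""

    __slots__ = ("_data", "_encoding")

    def __init__(self, data, encoding="undefined"):
        self._data = bytes(data)
        self._encoding = encoding

    @property
    def data(self):
        return self._data

    @property
    def encoding(self):
        # No setter: the tag cannot be changed after creation.
        return self._encoding

    def __add__(self, other):
        if not isinstance(other, TaggedBytes):
            return NotImplemented
        return TaggedBytes(self._data + other._data,
                           combine(self._encoding, other._encoding))


def combine(a, b):
    """Pick the tag for a concatenation (option 1 from the thread)."""
    if a == b:
        return a
    # ASCII is a subset of UTF-8, so the result is still valid UTF-8.
    if {a, b} == {"ascii", "utf-8"}:
        return "utf-8"
    # Any other mix is not meaningful text in a single encoding.
    return "undefined"


utf8 = TaggedBytes("\u00e9".encode("utf-8"), "utf-8")
ascii_ = TaggedBytes(b"abc", "ascii")
latin1 = TaggedBytes("\u00e9".encode("latin-1"), "latin-1")

print((utf8 + ascii_).encoding)
print((utf8 + latin1).encoding)
```

Under this rule the bytes are always concatenated untouched, so binary goop survives; only the tag degrades when the encodings disagree.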

>Also, how would one create a string having a specific encoding ?
>str(object, encname) would match unicode(object, encname)...

Dunno. Is such a high level interface needed? I'm not proposing to make
8-bit strings almost as powerful as unicode strings: unicode strings are
just fine for those kinds of operations... Hm, I just realized that the
encoding attr can't be mutable (doh!), so maybe your suggestion isn't so
bad at all.

Off-topic: what's the idea behind this behavior?
>>> unicode(u"abc")
u'\000a\000b\000c'

>> Can you open a file *with* an explicit encoding?
>
>You can specify the encoding by means of using codecs.open()
>instead of open(), but the interface will currently only
>accept (.write) and return (.read) Unicode objects.

Thanks, I wasn't aware of that. Can't the builtin open() function get an
additional encoding argument?
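
For illustration, this is roughly what such an interface looks like. The sketch assumes Python 3, where the builtin open() does accept an encoding argument; it says nothing about the interpreter under discussion in this thread. The helper name roundtrip() is made up for the example.

```python
import os
import tempfile


def roundtrip(text, encoding):
    """Write text with an explicit encoding, then read it back.

    Returns the raw bytes found on disk and the decoded text.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Text mode with an explicit encoding: write() takes text
        # and the file object encodes it on the way out.
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
        # Binary mode shows the bytes that actually hit the disk.
        with open(path, "rb") as f:
            raw = f.read()
        # Reading with the same encoding gives the text back.
        with open(path, "r", encoding=encoding) as f:
            decoded = f.read()
    finally:
        os.remove(path)
    return raw, decoded


raw_latin1, text_latin1 = roundtrip("caf\u00e9", "latin-1")
raw_utf8, text_utf8 = roundtrip("caf\u00e9", "utf-8")
print(raw_latin1, raw_utf8)
```

The same text produces different bytes on disk depending on the encoding given, while the reading side gets identical text back in both cases.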

Just