[I18n-sig] Re: Unicode debate

M.-A. Lemburg mal@lemburg.com
Tue, 02 May 2000 17:55:40 +0200


[Guido going ASCII]

Do you mean going ASCII all the way (using it in all
cases where Unicode gets converted to a string and in all
cases where strings get converted to Unicode), or just
for some aspects of conversion, e.g. only for the silent
conversions from strings to Unicode?

[BTW, I'm pretty sure that the Latin-1 folks won't like
ASCII, for the same reason they don't like UTF-8: neither
is a convenient way to write strings in their favorite
encoding directly in Python source code. My feeling in this
whole discussion is that it's more about convenience than
anything else. Still, it's very amusing ;-) ]

FYI, here's a table of (potentially) all the conversions
currently done by the implementation:

Python:
-------
string + unicode:       unicode(string,'utf-8') + unicode
string.method(unicode): unicode(string,'utf-8').method(unicode)
print unicode:          print unicode.encode('utf-8'); with stdout
                        redirection this can be changed to any
                        other encoding
str(unicode):           unicode.encode('utf-8')
repr(unicode):          repr(unicode.encode('unicode-escape'))
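
The implicit coercions in the table above can be written out as
explicit steps. This is only a sketch, assuming a current Python 3
interpreter, where bytes plays the role of the 8-bit string and str
the role of unicode; the helper names concat, to_str and to_repr are
hypothetical and not part of the implementation discussed here.

```python
# Sketch only: each helper spells out one row of the table explicitly.
# Assumption: Python 3, with bytes standing in for the 8-bit "string"
# type and str standing in for "unicode".

def concat(string, uni):
    """string + unicode  ->  unicode(string, 'utf-8') + unicode"""
    return string.decode('utf-8') + uni


def to_str(uni):
    """str(unicode)  ->  unicode.encode('utf-8')"""
    return uni.encode('utf-8')


def to_repr(uni):
    """repr(unicode)  ->  repr(unicode.encode('unicode-escape'))"""
    return repr(uni.encode('unicode-escape'))
```

Note that under this scheme an 8-bit string holding Latin-1 data with
non-ASCII characters is generally not valid UTF-8, so the first step
raises UnicodeDecodeError instead of silently producing wrong text.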


C (PyArg_ParseTuple):
---------------------
"s" + unicode:          same as "s" + unicode.encode('utf-8')
"s#" + unicode:         same as "s#" + unicode.encode('unicode-internal')
"t" + unicode:          same as "t" + unicode.encode('utf-8')
"t#" + unicode:         same as "t#" + unicode.encode('utf-8')

This affects all C modules and builtins. If a C module
wants to receive a particular predefined encoding, it can
use the new "es" and "es#" parser markers.


Ways to enter Unicode:
----------------------
u'' + string            same as unicode(string,'utf-8')
unicode(string,encname) any supported encoding
u'...unicode-escape...' unicode-escape currently accepts
                        Latin-1 chars as single-char input; using
                        escape sequences any Unicode char can be
                        entered (*)
codecs.open(filename,mode,encname)
                        opens an encoded file for
                        reading and writing Unicode directly
raw_input() + stdin redirection (see one of my earlier posts for code)
                        returns UTF-8 strings based on the input
                        encoding
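
A few of the entry paths listed above, sketched for comparison. This
assumes a current Python 3 interpreter (bytes for the 8-bit string,
str for unicode); enter_unicode and read_as_utf8 are hypothetical
helper names, and the sample bytes are made up for illustration. The
second helper uses codecs.EncodedFile, which returns a StreamRecoder,
to mimic the stdin redirection idea on an in-memory stream.

```python
import codecs
import io

# Assumption: Python 3 names; bytes stands in for the 8-bit string type.

def enter_unicode(string, encname):
    """unicode(string, encname): decode an 8-bit string in a named encoding."""
    return string.decode(encname)


def read_as_utf8(stream, input_encoding):
    """Wrap a byte stream so that reads return UTF-8 encoded data,
    whatever encoding the underlying stream uses."""
    return codecs.EncodedFile(stream, 'utf-8', input_encoding)


latin1_bytes = b'Gr\xfc\xdfe'                # hypothetical Latin-1 sample input
from_codec = enter_unicode(latin1_bytes, 'latin-1')
from_escapes = 'Gr\u00fc\u00dfe'             # same text via escape sequences
utf8_input = read_as_utf8(io.BytesIO(latin1_bytes), 'latin-1').read()
```

Both from_codec and from_escapes hold the same text, and utf8_input
holds its UTF-8 encoding.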

IO:
---
open(file,'w').write(unicode)
        same as open(file,'w').write(unicode.encode('utf-8'))
open(file,'wb').write(unicode)
        same as open(file,'wb').write(unicode.encode('unicode-internal'))
codecs.open(file,'wb',encname).write(unicode)
        same as open(file,'wb').write(unicode.encode(encname))
codecs.open(file,'rb',encname).read()
        same as unicode(open(file,'rb').read(),encname)
stdin + stdout
        can be redirected using StreamRecoders to handle any
        of the supported encodings
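
The two codecs.open() equivalences above can be checked with a small
round trip. Again only a sketch, assuming a current Python 3
interpreter; roundtrip is a hypothetical helper and the temporary
file exists just for the demonstration.

```python
import codecs
import os
import tempfile


def roundtrip(text, encname):
    """Write text through codecs.open(), then read it back both raw and
    decoded. Returns (raw_bytes_on_disk, decoded_text)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # codecs.open(file,'wb',encname).write(unicode)
        with codecs.open(path, 'wb', encname) as f:
            f.write(text)
        # What actually landed on disk: text.encode(encname)
        with open(path, 'rb') as f:
            raw = f.read()
        # codecs.open(file,'rb',encname).read()
        with codecs.open(path, 'rb', encname) as f:
            decoded = f.read()
    finally:
        os.remove(path)
    return raw, decoded
```

For example, roundtrip('Gr\u00fc\u00dfe', 'latin-1') gives back the
Latin-1 bytes as the first item and the original text as the second.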

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/