[Python-Dev] Unicode Proposal: Version 0.7

M.-A. Lemburg mal@lemburg.com
Fri, 19 Nov 1999 00:41:32 +0100


Skip Montanaro wrote:
> 
> I haven't been following this discussion closely at all, and have no
> previous experience with Unicode, so please pardon a couple stupid questions
> from the peanut gallery:
> 
>     1. What does U+0061 mean (other than 'a')?  That is, what is U?

U+XXXX denotes the Unicode character whose ordinal, written in hex,
is XXXX. It is basically just another way of saying "I want the
Unicode character at position 0xXXXX in the Unicode spec".
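
As a small sketch of the notation, written against present-day
Python 3 (an assumption: there the built-in str type plays the role
of the Unicode object, and the helper name u_plus is made up for
illustration only):

```python
def u_plus(ch):
    """Format a single character in U+XXXX notation (hypothetical helper)."""
    # ord() gives the character's ordinal; %04X prints it as 4+ hex digits.
    return "U+%04X" % ord(ch)

print(u_plus("a"))          # U+0061
print(chr(0x0061) == "a")   # True: position 0x0061 is the letter 'a'
print("\u0061" == "a")      # True: the same character via a string escape
```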
 
>     2. I saw nothing about encodings in the Codec/StreamReader/StreamWriter
>        description.  Given a Unicode object with encoding e1, how do I write
>        it to a file that is to be encoded with encoding e2?  Seems like I
>        would do something like
> 
>            u1 = unicode(s, encoding=e1)
>            f = open("somefile", "wb")
>            u2 = unicode(u1, encoding=e2)
>            f.write(u2)
> 
>        Is that how it would be done?  Does this question even make sense?

The unicode() constructor converts its input to Unicode, which then
serves as the basis for all other conversions. In the above example,
s would be converted to Unicode on the assumption that the bytes in
s represent characters encoded using the encoding given in e1.
The line with u2 would raise a TypeError, because u1 is already a
Unicode object and not a string. To convert a Unicode object u1 to
another encoding, you would have to call its .encode() method with
the intended new encoding. The Unicode object will then take care of
converting its internal Unicode data into a string using the given
encoding, e.g. you'd write:

f.write(u1.encode(e2))
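
Putting the pieces together, here is a minimal sketch of the whole
round trip in present-day Python 3 terms. The assumptions are mine,
not the proposal's: the 8-bit string is bytes, the Unicode object is
str, bytes.decode() stands in for the unicode(s, encoding=e1) call,
and the sample data, encodings and temporary file path are made up
for illustration.

```python
import os
import tempfile

e1 = "latin-1"   # encoding of the incoming bytes (example value)
e2 = "utf-8"     # encoding wanted in the output file (example value)

s = b"caf\xe9"       # the word "cafe" with an accented e, as Latin-1 bytes
u1 = s.decode(e1)    # bytes -> Unicode, interpreting s as e1

# Passing an object that is already Unicode to the constructor together
# with an encoding is an error, much like the u2 line in the question.
try:
    str(u1, encoding=e2)
    raised = False
except TypeError:
    raised = True

path = os.path.join(tempfile.mkdtemp(), "somefile")
with open(path, "wb") as f:
    f.write(u1.encode(e2))   # Unicode -> bytes in the target encoding

with open(path, "rb") as f:
    data = f.read()

print(raised)   # True
print(data)     # b'caf\xc3\xa9'
```

The file is opened in binary mode on purpose, so that only the
explicit .encode(e2) call decides which bytes end up on disk.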
 
>     3. What will the impact be on programmers such as myself currently
>        living with blinders on (that is, writing in plain old 7-bit ASCII)?

If you don't want your scripts to know about Unicode, nothing
will really change. In case you do use e.g. Latin-1 characters
in your scripts for strings, you are asked to include a pragma
in the comment lines at the beginning of the script (so that
programmers viewing your code using other encodings have a chance
to figure out what you've written).

Here's the text from the proposal:
"""
Note that you should provide some hint to the encoding you used to
write your programs as a pragma line in one of the first few comment lines
of the source file (e.g. '# source file encoding: latin-1'). If you
only use 7-bit ASCII then everything is fine and no such notice is
needed, but if you include Latin-1 characters not defined in ASCII, it
may well be worthwhile including a hint since people in other
countries will want to be able to read your source strings too.
"""

Other than that you can continue to use normal strings like
you always have.

Hope that clarifies things at least a bit,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    43 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/