[XML-SIG] Python 1.6a2 Unicode experiences?

Andy Robinson andy@reportlab.com
Thu, 27 Apr 2000 11:28:38 +0100


----- Original Message ----- 
From: Andrew Kuchling <akuchlin@mems-exchange.org>
To: <xml-sig@python.org>
Sent: 27 April 2000 03:08
Subject: [XML-SIG] Python 1.6a2 Unicode experiences?


> Has anyone here tried to use the new Unicode support in Python 1.6a2?
> The XML-SIG's readership seems likely to have a use for Unicode, so
> perhaps people here have tried the new code.
> 
> I'm asking because there are lengthy debates about the 1.6a2 Unicode
> support on the python-dev list at the moment, and I thought that
> hearing from actual users would be helpful in resolving the issues.
> 
> --amk

I've played with it, but my job took a turn away from dealing with 
Asian data in February so I have not used it in anger.  However, XML
is going to matter a lot to ReportLab and I played a part in shaping
the spec.

I think (it's hard to follow) that the current discussion is about whether 
literal strings and source files should be treated as UTF-8 or Latin-1.
Some people with little practical i18n experience note that Tcl
and Perl have only one string type, and it just became UTF-8, so
we should do the same thing.  I disagree.

I think our proposal is BETTER than Java, Tcl, Visual Basic, etc., for 
the following reasons:
- you can work with old-fashioned strings, which are understood
by everyone to be arrays of bytes, and there is no magic
conversion going on.  The bytes in literal strings in your script file 
are the bytes that end up in the program.
- you can work with Unicode strings if you want
- you are in explicit control of conversions between them
- both types have similar methods so there isn't much to learn or 
remember
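
A minimal sketch of those four points, written in present-day Python 3
terms, where `bytes` stands in for the old 8-bit string and `str` for the
Unicode string (in 1.6a2 the decode step would be spelled
unicode(raw, 'latin-1')). The sample data is made up for illustration:

```python
# 8-bit string: just an array of bytes, no magic conversion.
raw = b"caf\xe9"                  # illustrative data: 'cafe' with e-acute, Latin-1

# Explicit conversion to a Unicode string; the caller names the encoding.
text = raw.decode("latin-1")

# Both types offer similar methods.
raw_upper = raw.upper()           # byte strings only touch ASCII letters
text_upper = text.upper()         # Unicode strings know about e-acute

# Explicit conversion back out, to whichever encoding is wanted.
utf8 = text.encode("utf-8")

print(len(raw), len(text), len(utf8))
```

Nothing is converted unless the script asks for it, and the encoding is
always named at the point of conversion.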

These give us all the tools we need.  If you are writing an XML
parser you would only need one line of code to say "this file is
UTF-8, so let's treat it accordingly, and work in Unicode internally".  

I'm also convinced that the majority of Python scripts won't need
to work in Unicode.  Even working with exotic languages,
there is always a native 8-bit encoding.  I have only used Unicode
when:
(a) working with data that is in several languages
(b) doing conversions, which requires a 'central point'
(c) wanting to do per-character operations safely on multi-byte data
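
To illustrate the conversion and per-character cases, here is a small
sketch in present-day Python 3 terms. The `convert` helper is hypothetical
and the Japanese sample text is made up; Shift-JIS and EUC-JP are simply two
multi-byte encodings picked for the example.

```python
def convert(data, source, target):
    """Re-encode bytes by passing through Unicode as the central point."""
    return data.decode(source).encode(target)

text = "\u65e5\u672c\u8a9e"             # three characters of sample Japanese
sjis = text.encode("shift_jis")         # six bytes, two per character
eucjp = convert(sjis, "shift_jis", "euc_jp")

# Byte-wise operations are unsafe on multi-byte data: the second byte of
# the middle character equals ASCII '{', so a byte search finds a brace
# that is not in the text at all.
false_hit = b"{" in sjis
real_hit = "{" in sjis.decode("shift_jis")

# Per-character slicing is safe once the data is Unicode.
first_char = sjis.decode("shift_jis")[:1]
```

With N encodings, routing through Unicode needs N decoders and N encoders
rather than a converter for every pair.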

I still haven't sorted out in my head whether the default encoding 
question is a big red herring or is important; I already have a safe way
to construct Unicode strings in my source files if I want to, using
    unicode('rawdata','myencoding').  
But if there has to be a default encoding, I'd say the following:
- strict ASCII is an option
- Latin-1 is the more generous option that is right for the most people,
and has a 'special status' among 8-bit encodings
- UTF-8 is not one byte per character and will confuse people
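
The three options can be compared with a short sketch in present-day
Python 3 terms. It assumes that the 'special status' of Latin-1 refers to
its byte values coinciding with the first 256 Unicode code points; the
sample word is made up.

```python
all_bytes = bytes(range(256))

# Latin-1: every byte value maps to the code point with the same number,
# so decoding can never fail.
latin1_text = all_bytes.decode("latin-1")
latin1_is_identity = [ord(c) for c in latin1_text] == list(range(256))

# Strict ASCII: anything above 127 is rejected outright.
try:
    all_bytes.decode("ascii")
    ascii_accepts_all = True
except UnicodeDecodeError:
    ascii_accepts_all = False

# UTF-8: character count and byte count diverge once you leave ASCII.
word = "na\u00efve"                       # five characters
utf8_len = len(word.encode("utf-8"))      # i-diaeresis takes two bytes
latin1_len = len(word.encode("latin-1"))  # one byte per character
```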

Just my 2p worth,

Andy