Prothon should not borrow Python strings!

gabriele renzi surrender_it at remove.yahoo.it
Tue May 25 04:51:34 EDT 2004


On Mon, 24 May 2004 21:43:12 -0700, "Mark Hahn" <mark at prothon.org>
wrote:

>Roger Binns wrote:
>
>> In addition to all the excellent notes from Paul, I would recommend
>> you consult with someone familiar with the locale and encoding
>> issues for Hebrew, Arabic and various oriental languages such
>> as Japanese, Korean, Vietnamese and Tibetan.  Bonus points for
>> Tamil :-)
>
>I sure hope you are kidding.  If not you are scaring me away from doing
>anything.

Sorry if this is a bit OT, but a huge thread about this appeared in
comp.lang.ruby some time ago.
Quoting a little of it for you:

"""
|As far as I can see, currently 20 bits are sufficient :-)
|http://www.unicode.org/charts/
|
|And anything after "Special" looks really quite special to me.  At least
|western languages as well as Kanji, Hiragana and Katakana are supported.
|IMHO pragmatically 16 bits are good enough.

I assume you're saying that there are no more than 65536 characters on
earth in daily use, even including Asian ideograms (Kanji).

You would be right, if we lived in an ideal world.

The problems are:

  * Japan, China, Korea and Taiwan have characters of the same origin,
    but with different glyphs (appearance).  Due to Han unification,
    Unicode assigns the same code number to those characters.
    We used to use encodings to switch country information (script) in
    internationalized applications.  Unicode does not allow this
    approach.  We need to implement another layer to switch script.

  * Due to historical reasons and unification, some characters do not
    round trip through conversion from/to Unicode.  Sometimes we lose
    information through implicit Unicode conversion.

  * Asian people have used multibyte encodings (EUC-JP for example) for
    a long time.  We have gigabytes of files in legacy encodings.  The
    cost of code conversion is not negligible.  We also have to care
    about the round trip problem.

  * There are some huge sets of characters little known to the western
    world.  For example, the TRON code contains 170,000 characters.
    They are important to researchers, novelists, and people who care
    about characters.
"""
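
To make the round trip and "16 bits" points a bit more concrete, here is
a minimal Python sketch. It is not from the quoted thread: the function
names round_trips and utf16_code_units are made up for illustration, and
it only uses the standard codecs. It checks whether some legacy-encoded
bytes survive a trip through Unicode unchanged, and counts how many
16-bit units a character needs in UTF-16.

```python
def round_trips(data: bytes, encoding: str) -> bool:
    """Return True if data survives legacy -> Unicode -> legacy unchanged.

    Hypothetical helper: bytes that cannot be decoded or re-encoded
    with the given codec are treated as not round-tripping.
    """
    try:
        return data.decode(encoding).encode(encoding) == data
    except UnicodeError:
        return False


def utf16_code_units(ch: str) -> int:
    """Return the number of 16-bit code units the text needs in UTF-16."""
    # utf-16-be is used so that no byte order mark is added to the output.
    return len(ch.encode("utf-16-be")) // 2


if __name__ == "__main__":
    legacy = "日本語".encode("euc_jp")      # sample EUC-JP data
    print(round_trips(legacy, "euc_jp"))    # does it survive the trip?
    print(utf16_code_units("A"))            # a character inside the 16-bit range
    print(utf16_code_units("\U00020000"))   # a character outside the 16-bit range
```

Running something like round_trips over a pile of legacy files before
converting them would at least tell you which ones need a closer look.
It says nothing about the Han unification problem, which is about
glyphs rather than code numbers.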


