Prothon should not borrow Python strings!

Roger Binns rogerb at rogerbinns.com
Mon May 24 19:40:16 EDT 2004


> Choosing an internal encoding is actually pretty tricky because there
> are space versus time tradeoffs and you need to make some guesses about
> how often particular characters are likely to be useful to your users.

There are two ways to deal with it.  One is to convert to an internal
"Unicode" format such as UTF-8, or arrays of 16- or 32-bit integers.
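
To make the space side of that tradeoff concrete, here is a rough sketch
in Python 3 (my choice for illustration, not anything Prothon does).  The
helper name encoded_sizes and the sample strings are made up for the
example; it only measures how many bytes the same text needs in each
candidate representation.

```python
# Sketch: storage cost of the same text under three candidate
# internal encodings. encoded_sizes is a hypothetical helper.
samples = {
    "ascii": "hello",
    "greek": "\u03b1\u03b2\u03b3",   # three Greek letters
    "cjk": "\u65e5\u672c\u8a9e",     # three CJK characters
}

def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

for name, text in samples.items():
    print(name, encoded_sizes(text))
```

UTF-8 wins for ASCII-heavy text, UTF-16 wins for the CJK sample, and
UTF-32 is the largest in all three cases but gives fixed-width indexing.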

You also have to decide if you are going to normalise the string.
For example you can have characters followed by a combining accent.
On display they are one character, but often there is a codepoint
for the single character combined with the accent, so you could
reduce the two down to one.  There are also other characters such as
those that specify the direction of the following text which are
considered noise in some contexts.
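
As an illustration, Python's unicodedata module can do the composing
step.  This is a sketch in Python 3; the strip_direction_marks helper
and its list of direction characters are my own assumptions, and the
list is deliberately partial.

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE

def nfc(text):
    """Compose base characters with combining marks where a single codepoint exists."""
    return unicodedata.normalize("NFC", text)

# A partial list of direction-control characters (assumption: which
# ones count as "noise" depends entirely on the context).
DIRECTION_MARKS = {
    "\u200e", "\u200f",                                 # LRM, RLM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",   # embeddings and overrides
}

def strip_direction_marks(text):
    """Drop the direction-control characters listed above."""
    return "".join(ch for ch in text if ch not in DIRECTION_MARKS)

print(len(decomposed), len(nfc(decomposed)))
print(nfc(decomposed) == composed)
```

The two spellings compare unequal until one of them is normalised,
which is exactly the decision the language has to make for its users.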

The other way of dealing with things is to keep the text as it
was given, and not do any conversion or normalisation on it.
This is generally more future proof, but does burden other code
with having to deal with conversion issues (for example NT/2K/XP
only use 16 bits for codepoints, which no longer covers the full
Unicode range).
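
The 16 bit limitation is easy to demonstrate.  In this Python 3 sketch
(utf16_units is a hypothetical helper name), a character outside the
first 65536 codepoints needs two 16 bit units, a surrogate pair, so code
that assumes one unit per character miscounts.

```python
def utf16_units(text):
    """Number of 16 bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp = "\u20ac"          # EURO SIGN, fits in 16 bits
astral = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, does not fit in 16 bits

print(len(bmp), utf16_units(bmp))
print(len(astral), utf16_units(astral))
```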

If you want to score extra bonus points, you should also store
the locale of the string along with the encoding.  I won't elaborate
on why here.

Another design to consider is to allow tags that cover character
ranges and then assign properties to those tags (such as locale,
encoding), but importantly allow multiple tags per character.
(If you have used the Tk text widget you'll understand what I
am thinking of).
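
A minimal sketch of that idea in Python 3 follows.  The TaggedText class
is entirely hypothetical; the method names tag_add and tag_configure are
loosely borrowed from the Tk text widget, but nothing here is Tk's actual
implementation.

```python
class TaggedText:
    """Text plus named tags over half-open character ranges [start, end)."""

    def __init__(self, text):
        self.text = text
        self.ranges = []       # list of (start, end, tag_name)
        self.properties = {}   # tag_name -> dict of properties

    def tag_configure(self, name, **props):
        """Attach properties such as locale or encoding to a tag."""
        self.properties.setdefault(name, {}).update(props)

    def tag_add(self, name, start, end):
        """Apply a tag to the characters in [start, end)."""
        self.ranges.append((start, end, name))

    def tags_at(self, index):
        """All tags covering one character, in the order they were added."""
        return [name for start, end, name in self.ranges if start <= index < end]

    def properties_at(self, index):
        """Merge the properties of every tag covering one character."""
        merged = {}
        for name in self.tags_at(index):
            merged.update(self.properties.get(name, {}))
        return merged


t = TaggedText("Hello monde")
t.tag_configure("latin1", encoding="iso-8859-1")
t.tag_configure("french", locale="fr_FR")
t.tag_add("latin1", 0, 11)   # whole string
t.tag_add("french", 6, 11)   # just "monde"
print(t.tags_at(0), t.tags_at(7))
print(t.properties_at(7))
```

The point is that character 7 carries two tags at once, so locale and
encoding can vary independently over different ranges.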

> By the way, if you have the courage to distance yourself from every
> other language under the sun, I would propose that you throw an
> exception on unknown escape sequences.

Perl did that first :-)  It didn't distinguish between arrays of
bytes and arrays of characters so you easily end up with humongous
amounts of warnings about invalid UTF-8 stuff when dealing with
bytes.  (I have no idea what goes on under the hood - you just
see it when installing Perl stuff like SpamAssassin).

In addition to all the excellent notes from Paul, I would recommend
you consult with someone familiar with the locale and encoding
issues for Hebrew, Arabic and various oriental languages such
as Japanese, Korean, Vietnamese and Tibetan.  Bonus points for
Tamil :-)

Just to make life even more interesting, you should realise that
there is more than one system of digits.  You can see how Java
handles the issue here:

http://java.sun.com/j2se/1.4.2/docs/api/java/awt/font/NumericShaper.html
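
For comparison, here is a small Python 3 sketch going in the opposite
direction from NumericShaper: it reads digits from other scripts rather
than rendering ASCII digits in them.  The to_ascii_digits helper is a
made-up name for the example.

```python
import unicodedata

arabic_indic = "\u0661\u0662\u0663"   # ARABIC-INDIC DIGITS ONE, TWO, THREE
devanagari = "\u0967\u0968\u0969"     # DEVANAGARI DIGITS ONE, TWO, THREE

def to_ascii_digits(text):
    """Replace any Unicode decimal digit with its ASCII equivalent."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )

print(to_ascii_digits(arabic_indic))
print(int(devanagari))   # int() accepts any Unicode decimal digits
```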

Since you are doing new language design, I also think there would
be great value in forcing things so that you do not have
strings embedded in the program, and they have to come from
external resource files.  This also gives you the opportunity to
deal with string interpolation issues and get them right.
(It also means that "Hello, World" remains one line, but also
requires an external file with the message, or some other
mechanism).
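
Roughly what I have in mind, sketched in Python 3.  The CATALOGS dict
stands in for external resource files (an assumption to keep the example
self-contained), and the message function and key names are hypothetical.
Named placeholders matter because translations often need to reorder
the interpolated values.

```python
from string import Template

# Stand-in for per-locale resource files loaded at startup.
CATALOGS = {
    "en": {
        "greeting": "Hello, $name",
        "summary": "$count files in $folder",
    },
    "fr": {
        "greeting": "Bonjour, $name",
        "summary": "$folder contient $count fichiers",   # values reordered
    },
}

def message(locale, key, **values):
    """Look up a message by key and fill in its named placeholders."""
    return Template(CATALOGS[locale][key]).substitute(values)

print(message("en", "greeting", name="World"))
print(message("fr", "summary", count=3, folder="docs"))
```

The program text then only ever contains keys, never the strings shown
to the user.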

The other Java i18n pages make for interesting reading:

http://java.sun.com/j2se/corejava/intl/index.jsp

Roger




