Prothon should not borrow Python strings!

Tue May 25 14:24:26 EDT 2004

Paul Prescod wrote:
> Agree: space versus time.

It is somewhat more subtle than that.  The choice is between keeping
all strings as you get them (unnormalised) and consuming CPU every
time you have to deal with them, versus consuming CPU up front, when
originally presented with a string, to normalise it and munge it
into a universal storage encoding.

> First, it is probably too much work to normalize for a 1.0 language
> designer (even Python doesn't). Second, it is quite possibly the wrong
> thing to do at a programming language level. Just as sometimes you want
> to work with the raw bits of a file, sometimes you will want to work
> with the un-normalized representation of a string.

The language implementor does, however, need to take a stance.
If they decide that normalisation will never happen, then all
other code may have to deal with normalisation issues (for
example, what is len on an unnormalised string?).
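
To make the len question concrete, here is a small sketch using
present-day Python and its standard unicodedata module, purely as an
illustration.  Prothon's string type may behave differently, and the
variable names are my own.

```python
import unicodedata

composed = "\u00e9"      # e-acute as one precomposed code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# Both render as the same character, yet len and == see two different strings.
lengths = (len(composed), len(decomposed))
equal_raw = composed == decomposed

# Normalising to one form (NFC here) makes them agree again.
nfc = unicodedata.normalize("NFC", decomposed)
equal_nfc = composed == nfc
```

If the language never normalises, every caller that compares or
measures strings has to remember to make that normalize call itself.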

Conversely, they could decide to always normalise, which means that
other code doesn't have to worry about it.

The worst thing to do is not make any decision, since that
is equivalent to making both decisions, and code will always
have to worry about whether a string is or isn't normalised.

> > Another design to consider is to allow tags that cover character
> > ranges and then assign properties to those tags (such as locale,
> > encoding), but importantly allow multiple tags per character.
> > (If you have used the Tk text widget you'll understand what I
> > am thinking of).
>
> I'd say that's also beyond 1.0!

It could be implemented beyond 1.0, but should be designed before
that.  We have already seen the email pointing out some of the
issues with Han unification and how you really need to know
the character's origin to render it correctly even though the
codepoint is the same.
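
As a rough sketch of the tagged-range idea, in Python for illustration
only: the TaggedText class, its method names and the property names
(locale, source_encoding) are all hypothetical, not an existing API
in Python, Prothon or Tk.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedText:
    text: str
    # Each tag is (start, end, properties); end is exclusive.
    tags: list = field(default_factory=list)

    def add_tag(self, start, end, **properties):
        """Attach properties to the characters in [start, end)."""
        self.tags.append((start, end, properties))

    def properties_at(self, index):
        """Merge the properties of every tag covering this character."""
        merged = {}
        for start, end, properties in self.tags:
            if start <= index < end:
                merged.update(properties)
        return merged


doc = TaggedText("abc \u76f4")               # ends with a Han character
doc.add_tag(0, 5, source_encoding="utf-8")   # tag covering the whole string
doc.add_tag(4, 5, locale="ja")               # second, overlapping tag
```

Here doc.properties_at(4) carries both the encoding and the locale,
so a renderer could choose a glyph by origin even though the
codepoint alone does not say which one is wanted.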

> > In addition to all the excellent notes from Paul, I would recommend
> > you consult with someone familiar with the locale and encoding
> > issues for Hebrew, Arabic and various oriental languages such
> > as Japanese, Korean, Vietnamese and Tibetan.  Bonus points for
> > Tamil :-)
>
> That's probably a little daunting for 1.0. The question is what is the
> minimum possible he can get away with in the next few months.

For design you need to get it right at the beginning.  For implementation
you can wait a while.  The moment you start taking short cuts, they
turn into arbitrary design decisions and you tie yourself to things
you wouldn't want to be tied to.

> Seems a little over-strict for me. If I'm writing an HTML handling
> program I have to keep the HTML tags in a separate file?

Why not?  See my earlier response to Mark for some ideas on how
to handle that.

Roger