Re: encoding problems (é and è)

Sat Mar 25 00:24:05 EST 2006

Jean-Paul Calderone wrote:
> On Fri, 24 Mar 2006 09:33:19 +1100, John Machin <sjmachin at lexicon.net> wrote:
> >On 24/03/2006 8:36 AM, Peter Otten wrote:
> >> John Machin wrote:
> >>
> >>>You can replace ALL of this upshifting and accent removal in one blow by
> >>>using the string translate() method with a suitable table.
> >>
> >> Only if you convert to unicode first or if your data maintains 1 byte == 1
> >> character, in particular it is not UTF-8.
> >>
> >
> >I'm sorry, I forgot that there were people who are unaware that
> >variable-length gizmos like UTF-8 and various legacy CJK encodings are
> >for storage & transmission, and are better changed to a
> >one-character-per-storage-unit representation before *ANY* data
> >processing is attempted.
>
> Unfortunately, unicode only appears to solve this problem in a sane manner.

What problem do you mean? Unicode handles loose matching in a sane
manner: it is described in the Unicode Collation Algorithm (UTS #10).
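
To make the point concrete, here is a rough sketch of accent- and
case-insensitive matching done on decoded text rather than on UTF-8 bytes.
It assumes a current Python where str is already Unicode, and it is only an
approximation built on the standard unicodedata module, not an
implementation of the collation algorithm itself. The name loose_key is
made up for this illustration.

```python
import unicodedata

def loose_key(text):
    """Hypothetical helper: a comparison key that ignores case and accents."""
    # Decompose each character into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (the accents), keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case so upper and lower case compare equal.
    return stripped.casefold()

# "é and è" as it would be stored or transmitted in UTF-8.
raw = "\u00e9 and \u00e8".encode("utf-8")
# Decode first; only then is one str element one character.
text = raw.decode("utf-8")

print(len(raw), len(text))
print(loose_key(text) == loose_key("E AND E"))
```

The byte string is longer than the decoded text because each accented
letter takes two bytes in UTF-8, which is why byte-level translation
tables go wrong on such data.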

  Serge.