[Python-Dev] [Fwd: Re: Case sensitivity]

Paul Prescod paul@prescod.net
Thu, 08 Jun 2000 23:32:27 +0200


Here's a good summary of how XML's case sensitivity came to be.

-------- Original Message --------
Subject: Re: Case sensitivity
Date: Mon, 3 Apr 2000 12:44:37 -0400
From: Steve DeRose <Steven_DeRose@brown.edu>
To: xml-dev@lists.oasis-open.org
References: <B50E2EFA.1B57%soord@vda.nl>

Languages with no need for case folding are not much of a problem: the
case-folding table or program would merely have no effect on characters
belonging to such languages: It would change 26 of our 26 alphabetic
code points, and no others. After all, in English we already use lots of
characters that don't get case-folded (like digits).
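
As a rough illustration of the point above, the following Python sketch
uses the built-in str.lower() as a stand-in for "the case-folding table
or program" (an assumption; the message does not name any particular
implementation). The helper name fold is made up for this example.

```python
import string

def fold(text):
    # Stand-in for a case-folding program; str.lower() is used here only
    # for illustration.
    return text.lower()

# Digits and caseless scripts pass through untouched.
unchanged = [s for s in ["2000", "日本語", "1.0"] if fold(s) == s]

# Among the ASCII letters, only the 26 upper-case code points are altered.
changed_letters = [c for c in string.ascii_letters if fold(c) != c]

print(unchanged)
print(len(changed_letters))
```

Text in a caseless script simply comes out the same as it went in, so
folding costs such languages nothing in terms of correctness.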

The serious problems are subtler:

The practical problem is that with Unicode your folding table gets
really big; on the order of 128Kbytes instead of 256 bytes (barring
compression): this is a pain on small devices (like a cell-phone
browser), a pain to load, a pain to implement compression for, etc.
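
One way to arrive at a figure of that order is a flat, uncompressed
table with one 16-bit entry per 16-bit code point. That layout is an
assumption made for this sketch, not something the message specifies,
and build_fold_table is a hypothetical helper. It also assumes the
platform's unsigned short is 2 bytes wide, which is typical.

```python
from array import array

BMP_SIZE = 0x10000  # number of 16-bit code points

def build_fold_table():
    # Start with the identity mapping: most code points fold to themselves.
    table = array("H", range(BMP_SIZE))
    for cp in range(BMP_SIZE):
        lowered = chr(cp).lower()
        # Keep only simple one-to-one mappings that fit in 16 bits;
        # anything more complicated is left as identity in this sketch.
        if len(lowered) == 1 and ord(lowered) < BMP_SIZE:
            table[cp] = ord(lowered)
    return table

table = build_fold_table()
size_bytes = len(table) * table.itemsize
print(size_bytes // 1024, "Kbytes")
```

By comparison, a table for an 8-bit character set needs only 256
one-byte entries.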

The theoretical problem is more important: it's not the caseless
languages that pose problems, but the complicated case-folding ones.
For example, lots of languages only apply diacritical marks to
lower-case letters: for example, "a" may come with 6 different accents,
but "A" takes none -- which makes case-folding irreversible. If there
are languages that operate the other way as well, then neither
fold-to-upper nor fold-to-lower can work for all languages: either way
would trash some languages.
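
The loss of information can be sketched in Python. The mapping below is
hypothetical: it models the writing convention described above (accents
only on lower-case letters) and is not Unicode's own case mapping, which
does define accented capitals. The names HYPOTHETICAL_UPPER, fold_upper
and preimages are invented for this example.

```python
# Hypothetical convention: six accented forms of "a", plus plain "a",
# all share the single unaccented capital "A".
HYPOTHETICAL_UPPER = {
    "a": "A", "à": "A", "á": "A", "â": "A", "ã": "A", "ä": "A", "å": "A",
}

def fold_upper(text):
    # Use the hypothetical table where it applies, str.upper() otherwise.
    return "".join(HYPOTHETICAL_UPPER.get(c, c.upper()) for c in text)

def preimages(upper_char):
    # Every lower-case character that folds to the given capital.
    return sorted(k for k, v in HYPOTHETICAL_UPPER.items() if v == upper_char)

# Two distinct spellings collapse to one string, so the fold cannot be
# undone without outside knowledge.
print(fold_upper("la"), fold_upper("là"))
print(len(preimages("A")))

# A real case from Python's built-in mappings: German sharp s does not
# survive an upper/lower round trip.
print("straße".upper().lower())
```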


That said, I think it incumbent on XML *search engines* to support
case-folding (as well as decent treatment of accents, types of hyphens,
etc) for text content searches: Making English speakers search for

  "the" | "thE" | "tHe" | "tHE" | "The" | "ThE" | "THe" | "THE"
or
  "[tT][hH][eE]"

is patently absurd; and besides, there is no user cost to (say) a
Japanese speaker if an engine *does* case-fold. Also, many languages use
Roman characters occasionally, as for acronyms; so their speakers also
pay a price if searches aren't smart enough. And the primary problems
with case-folding do not weigh so heavily in the search engine world
(for example, AltaVista isn't likely to run their main servers on cell
phones for a while yet).