[Python-Dev] Re: Re: Alternative Implementation for PEP 292: SimpleString Substitutions

Mon Sep 13 08:53:53 CEST 2004

Stephen J. Turnbull wrote:

> But I worry that it's an exceptional example, when you use assumptions
> like "real-life text uses characters drawn from a small number of short
> contiguous regions in the alphabet."

The problem is that I cannot tell if you've studied search issues, or if you're
just applying general "but wait, it's different for asian languages" arguments
here.

There are many issues here, all pulling in different directions:

- If you look at usage statistics, you'll find that the absolute majority
of all searches are for a single character (usually separators, like colons,
spaces, commas).  The second largest category is computer-level
keywords (usually pure ASCII, also in localized programs), used to
process network protocols, file formats, message headers, etc.  Searches
for "human text" are not that common, really, and search terms are usually
limited to only a few words.

- This means that most searches have exactly the same characteristics,
independent of the locale.  Even if a new algorithm would only be better
for pure-ASCII text, everyone would benefit.

- As for non-ASCII search terms, the "human text" search terms are
usually shorter in languages with many ideographs (my non-scientific
tests indicate that chinese text uses about 4 times less symbols than
english; I'm sure someone can dig up better figures).

- This means that even if you are more likely to get collisions in the
compressed skip table, there are fewer characters in the table.

- This means that you'll probably be able to make long skips as often
as for non-Asian text.

- On the other hand, the long skips are shorter than for non-Asian text,
so you may have to make more of them.

- On the other hand, the target strings are also likely to be shorter, so
that might not matter.

- And so on.

The only way to know for sure is if anyone has the time and energy to carry
out tests on real-life datasets.  (or at least prepare some datasets; I can run
the tests if someone provides me with a file with search terms and a number
of files containing texts to apply them to, preferrably using UTF-8 encoding).

</F>