[Python-Dev] Re: Re: Alternative Implementation for PEP 292: Simple String Substitutions

Stephen J. Turnbull stephen at xemacs.org
Mon Sep 13 16:00:57 CEST 2004


>>>>> "Fredrik" == Fredrik Lundh <fredrik at pythonware.com> writes:

    Fredrik> Stephen J. Turnbull wrote:

    >> But I worry that it's an exceptional example, when you use
    >> assumptions like "real-life text uses characters drawn from a
    >> small number of short contiguous regions in the alphabet."

    Fredrik> The problem is that I cannot tell if you've studied
    Fredrik> search issues,

Enough to understand Boyer-Moore and how the proposed algorithm
differs, and to recognize that your statements about the distribution
of search applications are true.  Not that I want to argue about
search; I'm all in favor of better search.  I was startled to read
that Python still uses a brute-force algorithm for searching.
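
For readers who want the contrast made concrete, here is a minimal
sketch of a brute-force search next to the Horspool simplification of
Boyer-Moore.  It is illustrative only: it is neither CPython's code
nor the algorithm proposed in this thread, and the function names are
made up for the example.  The shift table is a dict, which is one way
to keep it small whether the text is 8-bit or Unicode.

```python
def brute_force_find(text, pattern):
    """Try every alignment; O(len(text) * len(pattern)) in the worst case."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1


def horspool_find(text, pattern):
    """Boyer-Moore-Horspool: skip ahead based on the text character
    aligned with the last position of the pattern."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # For each pattern character except the last, record its distance
    # to the end of the pattern; later occurrences overwrite earlier
    # ones, so the smallest safe shift wins.
    shift = {ch: m - 1 - k for k, ch in enumerate(pattern[:-1])}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        # Characters that never occur in the pattern allow a full skip of m.
        i += shift.get(text[i + m - 1], m)
    return -1
```

With a large alphabet such as ideographs, most text characters are
absent from the pattern, so the full skip of m is the common case in
this sketch; how that plays out on real data is the open question.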

My point about distribution of ideographs was simply that you made an
unjustified assumption in the context of what is (to me, anyway) an
important subdomain of text processing.  Here, it is "obviously
harmless," but that's because brute force search is so bad.  In other
applications, or with a better status quo, there very well may be real
tradeoffs between what's good for 8-bit text and what's good for
Unicode.

    Fredrik> or if you're just applying general "but wait, it's
    Fredrik> different for asian languages" arguments here.

No, I know that ostrich won't fly.

    Fredrik> Searches for "human text" are not that common, really,
    Fredrik> and search terms are usually limited to only a few words.

In the context of PEP 292, is a focus on "human text" unwarranted?
After all, what motivated the PEP and the implementation was evidently
"human text" processing.  In my experience, the notation for
interpolation it uses would have much bigger advantages over the
format string style for "human text" than for the "non-human text"
applications I know of.  Not that it's useless for the latter, just
that it's much more of a luxury there.
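
To illustrate the two notations being compared, here is a small
sketch.  It assumes the PEP 292 interface in the form of
string.Template from the standard library; the message and the values
are invented for the example.

```python
from string import Template

values = {"name": "Guido", "count": 3}

# Format-string style: each placeholder needs a type code after the
# closing parenthesis, which is easy to omit or mangle when the string
# is edited by hand.
old_style = "%(name)s has %(count)d new messages" % values

# PEP 292 style: plain $-placeholders with no type codes.
new_style = Template("$name has $count new messages").substitute(values)
```

Both expressions produce the same string here; the difference is in
how much syntax the person writing or translating the message has to
get right.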

If that's valid, there's a point where it makes sense for people who
develop human-text-oriented features based on Unicode strings to say
"pick the features you really want for 8-bit strings, because you have
to support them yourselves."

    Fredrik> The only way to know for sure is if anyone has the time
    Fredrik> and energy to carry out tests on real-life datasets.  (or
    Fredrik> at least prepare some datasets;

I can prepare datasets and do some statistical work for Japanese, but
it probably won't happen this month.  Sounds like a worthwhile thing
to have around, though.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.

