[Python-Dev] Why Foo is better than Baz

Andrew M. Kuchling akuchlin at cnri.reston.va.us
Mon May 3 17:56:46 CEST 1999


Guido van Rossum writes:
>Hmm...  I looked when Tcl 8.1 was in alpha, and I *think* that at that 
>point the regex engine was compiled twice, once for 8-bit chars and
>once for 16-bit chars.  But this may have changed.

	That doesn't seem to be the case currently; the code in
tclRegexp.c looks like this:

    /* Remember the UTF-8 string so Tcl_RegExpRange() can convert the
     * matches from character to byte offsets.
     */
    regexpPtr->string = string;
    Tcl_DStringInit(&stringBuffer);
    uniString = Tcl_UtfToUniCharDString(string, -1, &stringBuffer);
    numChars = Tcl_DStringLength(&stringBuffer) / sizeof(Tcl_UniChar);
    /* Perform the regexp match. */
    result = TclRegExpExecUniChar(interp, re, uniString, numChars, -1,
            ((string > start) ? REG_NOTBOL : 0));
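
	To illustrate why the UTF-8 string has to be remembered, here
is a rough Python sketch (not Tcl's code; the helper names
char_to_byte_offset and match_byte_range are made up for this example,
and it assumes a Python where str holds Unicode characters).  The
engine matches on characters, so the offsets it reports have to be
mapped back to byte positions in the UTF-8 buffer:

```python
def char_to_byte_offset(text, char_index):
    """Return the UTF-8 byte offset of the character at char_index.

    Hypothetical helper: it encodes the prefix and counts the bytes,
    which is simple but linear in the prefix length.
    """
    return len(text[:char_index].encode("utf-8"))


def match_byte_range(text, char_start, char_end):
    """Convert a (start, end) match in characters to byte offsets."""
    return (char_to_byte_offset(text, char_start),
            char_to_byte_offset(text, char_end))


# "é" takes two bytes in UTF-8, so offsets after it shift by one.
print(match_byte_range("héllo", 1, 3))
```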

	ISTR the Spencer engine does, however, define a small and
large representation for NFAs and have two versions of the engine, one
for each representation.  Perhaps that's what you're thinking of.

>I've noticed that Perl is taking the same position (everything is
>UTF-8 internally).  On the other hand, Java distinguishes 16-bit chars 
>from 8-bit bytes.  Python is currently in the Java camp.  This might
>be a good time to make sure that we're still convinced that this is
>the right thing to do!

	I don't know.  There's certainly the fundamental dichotomy
that strings are sometimes used to represent characters, where
changing encodings on input and output is reasonable, and sometimes
used to hold chunks of binary data, where any changes are incorrect.
Perhaps Paul Prescod is right, and we should try to get some other
data type (array.array()) for holding binary data, as distinct from
strings.
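
	A small sketch of that dichotomy, assuming a Python in which
str is Unicode text and array.array('B') stands in for the separate
binary type (the sample values are arbitrary):

```python
import array

# Character data: the encoding may change at the I/O boundary
# without changing the meaning of the text.
text = "na\u00efve"
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")
assert as_utf8 != as_latin1
assert as_utf8.decode("utf-8") == as_latin1.decode("latin-1") == text

# Binary data: held in an array of unsigned bytes, it must round-trip
# byte for byte.
blob = array.array("B", [0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00])
raw = blob.tobytes()
assert array.array("B", raw) == blob

# Treating the same bytes as text and re-encoding them corrupts them.
mangled = raw.decode("latin-1").encode("utf-8")
assert mangled != raw
```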

>I'm sure that if it's good code, we'll find a way.  Perhaps a more
>interesting question is whether it is Perl5 compatible.  I contacted
>Henry Spencer at the time and he was willing to let us use his code.

	Mostly Perl-compatible, though it doesn't look like the 5.005
features are there, and I haven't checked for every single 5.004
feature.  Adding missing features might be problematic, because I
don't really understand what the code is doing at a high level.  Also,
is there a user community for this code?  Do any other projects use
it?  Philip Hazel has been quite helpful with PCRE, an important thing
when making modifications to the code.
 
	Should I make a point of looking at what using the Spencer
engine would entail?  It might not be too difficult (an evening or
two, maybe?) to write a re.py that sat on top of the Spencer code;
that would at least let us do some benchmarking.
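
	The benchmarking half could look roughly like the sketch
below.  It assumes only that each engine is a module exposing
compile() and a search() method on the compiled object, as the
standard re module does; the bench function and the engines table are
hypothetical, and a Spencer-backed module does not exist yet, so only
the stdlib engine is listed:

```python
import re
import timeit


def bench(engine, pattern, text, repeat=3, number=200):
    """Return the best observed time per search, in seconds.

    `engine` is any module with an re-style compile() function.
    """
    compiled = engine.compile(pattern)
    timer = timeit.Timer(lambda: compiled.search(text))
    return min(timer.repeat(repeat=repeat, number=number)) / number


# A Spencer-based module would be added to this table for comparison.
engines = {"stdlib re": re}

sample = "x " * 50 + "spam@example.org"
for name, engine in engines.items():
    seconds = bench(engine, r"\w+@\w+\.\w+", sample)
    print("%-10s %.2e s/search" % (name, seconds))
```

	Taking the minimum over several repeats is a deliberate
choice: it reduces noise from other activity on the machine.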

-- 
A.M. Kuchling			http://starship.python.net/crew/amk/
In Einstein's theory of relativity the observer is a man who sets out in quest
of truth armed with a measuring-rod. In quantum theory he sets out with a
sieve.
    -- Sir Arthur Eddington





