"in" operator for strings

Alex Martelli aleaxit at yahoo.com
Sat Feb 3 04:12:20 EST 2001


"Magnus Lie Hetland" <mlh at idi.ntnu.no> wrote in message
news:95ftl4$o6b$1 at tyfon.itea.ntnu.no...
    [snip]
> > > only allow contiguous subsequences, or any subsequence?
> > > If we ever get built-in sets, contiguity would be meaningless,
> > > as it would be if we get "in" tests for dictionaries.)
> >
> > Dictionaries aren't sequences, nor will sets be, so the
> > issue doesn't apply to either.
>
> Sure it does. It will then become the question of whether to
> allow the same operator do check membership and subset-ness.

Subsetness is one thing, subsequencehood another (actually
two others, given contiguity may be required or not); the
difference is ordering -- 'ab' is a subset but not a subsequence
of 'obak'.  The whole issue of subsequence is meaningless
where no sequence (ordering) exists.
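
For instance, a quick sketch in current Python (made-up
helper names, purely to make the distinction concrete):

def is_subset(small, big):
    # membership only -- ordering is ignored entirely
    return set(small) <= set(big)

def is_subsequence(small, big):
    # items of small must occur in big in the same relative order
    it = iter(big)
    return all(ch in it for ch in small)

print(is_subset('ab', 'obak'))        # True: both letters occur
print(is_subsequence('ab', 'obak'))   # False: 'a' only comes after 'b'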

> > with bisection (binary search) over the
> > 'places' sequences, O(M log N) appears like an easy
> > upperbound, and maybe it can be cut down further.
>
> Well... If you could store an array of size N for each
> letter you could easily do it in O(1) time... But then
> you'd need O(M*N) time to initialize the stuff :-)

You don't know M at initialization time -- it's the size
of the subsequence you're going to match (in the
normal notation I've seen used), while N is the size
of the big sequence in which you're looking for matches.

Alphabet size is generally taken as a fixed constant in most
published literature, though I suspect an alphabet size on the
order of N is more realistic for generic sequences.

You still can't do a match in O(1) time anyway, even
with the auxiliary structure you suggest; O(M) seems
the lower bound, except by precomputing ALL the
possible subsequences and indexing them into some
totally hypothetical O(1)-lookup-time structure (but
in fact computing the hash IS still O(M), so there is
no big-O advantage here).
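
To flesh out the bisection idea a bit (illustrative code in
current Python; 'places' holds, for each item of the big
sequence, the sorted list of its positions, built once in
O(N) time, after which each of the M pattern items costs
one binary search):

from bisect import bisect_right

def build_places(big):
    # O(N) preprocessing: sorted positions of each item of big
    places = {}
    for i, item in enumerate(big):
        places.setdefault(item, []).append(i)
    return places

def has_subsequence(places, pattern):
    # O(M log N): one bisection per item of the pattern
    pos = -1
    for item in pattern:
        idxs = places.get(item)
        if not idxs:
            return False
        j = bisect_right(idxs, pos)
        if j == len(idxs):
            return False
        pos = idxs[j]
    return True

places = build_places('obak')
print(has_subsequence(places, 'bk'))   # True: 'b' at 1, 'k' at 3
print(has_subsequence(places, 'ab'))   # False: no 'b' after the 'a'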


> If you add a requirement that the matched letters must
> be relatively close to each other (within a constant
> distance limit) you can get a speedup. (A reasonable
> requirement in some cases.) Actually, I know of some
> hardware that does this very quickly.

I'm curious about the algorithm and its auxiliary data
structures here.  The O(M) match still seems pretty
elusive to me.

> To me the module doesn't even have to be small. Just think
> about all the other function and class names you define in
> a module... Why should "split" specifically have to have
> a module prefix? If I choose to import it that may be as
> thought through as defining a function myself. Only if I

I like the clarity that comes from use of explicit prefixes --
no wondering 'WHAT split function is this one' any more
(string or re or ...?).  Calling a method of an object, as in
'foo bar'.split(), is clearest -- totally local context, zero
wondering-factor.  Calling a prefixed function is second
best.  Calling a non-prefixed function sends me looking for
it throughout the not-necessarily-small module -- and
I won't find a '^def split' there either (so I have to look at
the from statements and the assignments -- sigh).
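
A small illustration (os.path.split and re.split just stand
in for whatever clashing names a real module might import):

import re
import os.path

s = "foo bar/baz"

print(s.split())              # a method: unambiguously THIS string's split
print(re.split(r'\s+', s))    # prefixed: clearly the regex split
print(os.path.split(s))       # prefixed: clearly the path split

from os.path import split
print(split(s))               # bare name: now go hunt for the import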

> (OK... So I know you dislike the word "convenient". Well...
> I find it prettier too ;-)

The (modest) convenience of the code author becomes
the (rather larger) IN-convenience of the maintainer.

When one has spent a few years doing maintenance on
large modules written by others, one's priorities tend to
change -- as does one's outlook on "prettiness" when it
interferes with clarity (the typical C++ scope obscurity
due to implicit-this, vs. Python's clarity with explicit
self, being a good example).

> Why not write somesequence.extend([1]) or
> somesequence.join(" ") (not both, of course) so that they would
> be consistent with each other? Oh, well. I'm sure it's just
> that I don't see the Greatness of it all.

I've discussed the "what object should join be a method
of" question in a couple of lengthy postings 2 or 3 months ago --
it boils down to "where do I get the most leverage for
polymorphism", and the current architecture wins hands
down.  extend is similar -- which object needs the
polymorphism most, the one being modified or the
sequence just being 'read'?  And of course the answer
is the same.

So, the architecture DOES turn out to be fully consistent
here, because exactly the same forces are being
resolved!  In each case, the subsequence that is
just being read (for the purpose of building a string
out of it, or for that of extending another object) is
the argument -- all the polymorphism we need on it
is that it obeys the sequence-protocol, after all.  The
'joiner' (aka separator) object, and the object being
modified (extended), are in each case the target of
the method and get full benefit of polymorphism.
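
A sketch of where the polymorphism lives (the Ruler class is
purely hypothetical, made up for this example):

class Ruler:
    # a joiner-like object supplying its own idea of 'join'
    def join(self, seq):
        return '\n---\n'.join(str(item) for item in seq)

def report(joiner, items):
    # the joiner, as the method's target, decides HOW to join;
    # the items argument only needs to be a readable sequence
    return joiner.join(items)

print(report(' ', ['a', 'b', 'c']))      # plain str.join
print(report(Ruler(), ['a', 'b', 'c']))  # Ruler's own join

items = [1, 2]
items.extend((3, 4))                  # extend: any iterable argument will do
items.extend(x * x for x in range(3))
print(items)                          # [1, 2, 3, 4, 0, 1, 4]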


> The " ".join([1,2,3]) doesn't seem like an accessor to me.
> It seems like a plain function with a weird syntax.
>
> But, hey, since this is the way it *is*, I'm the one with
> a problem here :-)

And we're here to help you come to terms with it!-)

What I'm doing is asking a 'joiner object' "please read
this sequence of strings and join it up into one big
string, will you".  Function syntax does not give you
polymorphism directly in Python -- it can at best be
syntax sugar for a method call that does do the poly
thing:

def myJoin(joiner, sequence):
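    # just dispatch to whatever 'join' the joiner object supplies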
    return joiner.join(sequence)

which is basically what string.join now does.  We do
not have multimethods in Python (one generic function
dispatched directly to the appropriate method), so
the main Python approach to polymorphism is calling
a method on an object (sometimes with weird syntax
sugar, such as a+b for a.__add__(b) or maybe for
b.__radd__(a), but that's another issue:-).  It only
remains to determine which object gets real benefit
from a given case of polymorphism.
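
(For the record, that sugar works roughly as below -- Steps
is a made-up class, only there to show the dispatch:)

class Steps:
    def __init__(self, n):
        self.n = n
    def __add__(self, other):
        return Steps(self.n + other)
    def __radd__(self, other):
        # reached when the left operand can't handle the addition itself
        return Steps(other + self.n)

print((Steps(2) + 3).n)   # 5, via Steps.__add__
print((3 + Steps(2)).n)   # 5, via Steps.__radd__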


Alex





