efficiently splitting up strings based on substrings

Rhodri James rhodri at wildebst.demon.co.uk
Sat Sep 5 18:42:16 EDT 2009


On Sat, 05 Sep 2009 22:54:41 +0100, per <perfreem at gmail.com> wrote:

> I'm trying to efficiently "split" strings based on what substrings
> they are made up of.
> i have a set of strings that are comprised of known substrings.
> For example, a, b, and c are substrings that are not identical to each
> other, e.g.:
> a = "0" * 5
> b = "1" * 5
> c = "2" * 5
>
> Then my_string might be:
>
> my_string = a + b + c
>
> i am looking for an efficient way to solve the following problem.
> suppose i have a short
> string x that is a substring of my_string.  I want to "split" the
> string x into blocks based on
> what substrings (i.e. a, b, or c) chunks of s fall into.
>
> to illustrate this, suppose x = "00111". Then I can detect where x
> starts in my_string
> using my_string.find(x).  But I don't know how to partition x into
> blocks depending
> on the substrings.  What I want to get out in this case is: "00",
> "111".  If x were "001111122",
> I'd want to get out "00","11111", "22".
>
> is there an easy way to do this?  i can't simply split x on a, b, or c
> because these might
> not be contained in x.  I want to avoid doing something inefficient
> like looking at all substrings
> of my_string etc.
>
> i wouldn't mind using regular expressions for this but i cannot think
> of an easy regular
> expression for this problem.  I looked at the string module in the
> library but did not see
> anything that seemd related but i might have missed it.

I'm not sure I understand your question exactly.  You seem to imply
that the order of the substrings of x is consistent.  If that's the
case, this ought to help:

>>> import re
>>> x = "001111122"
>>> m = re.match(r"(0*)(1*)(2*)", x)
>>> m.groups()
('00', '11111', '22')
>>> y = "00111"
>>> m = re.match(r"(0*)(1*)(2*)", y)
>>> m.groups()
('00', '111', '')

You'll have to filter out the empty groups for yourself, but that's
no great problem.

-- 
Rhodri James *-* Wildebeest Herder to the Masses



More information about the Python-list mailing list