efficiently splitting up strings based on substrings

Sat Sep 5 19:07:09 EDT 2009

On Sat, 05 Sep 2009 23:54:08 +0100, per <perfreem at gmail.com> wrote:

> On Sep 5, 6:42 pm, "Rhodri James" <rho... at wildebst.demon.co.uk> wrote:
>> On Sat, 05 Sep 2009 22:54:41 +0100, per <perfr... at gmail.com> wrote:
>> > I'm trying to efficiently "split" strings based on what substrings
>> > they are made up of.
>> > i have a set of strings that are comprised of known substrings.
>> > For example, a, b, and c are substrings that are not identical to each
>> > other, e.g.:
>> > a = "0" * 5
>> > b = "1" * 5
>> > c = "2" * 5
>>
>> > Then my_string might be:
>>
>> > my_string = a + b + c
>>
>> > i am looking for an efficient way to solve the following problem.
>> > suppose i have a short
>> > string x that is a substring of my_string.  I want to "split" the
>> > string x into blocks based on
>> > what substrings (i.e. a, b, or c) chunks of s fall into.
>>
>> > to illustrate this, suppose x = "00111". Then I can detect where x
>> > starts in my_string
>> > using my_string.find(x).  But I don't know how to partition x into
>> > blocks depending
>> > on the substrings.  What I want to get out in this case is: "00",
>> > "111".  If x were "001111122",
>> > I'd want to get out "00","11111", "22".
>>
>> > is there an easy way to do this?  i can't simply split x on a, b, or c
>> > because these might
>> > not be contained in x.  I want to avoid doing something inefficient
>> > like looking at all substrings
>> > of my_string etc.
>>
>> > i wouldn't mind using regular expressions for this but i cannot think
>> > of an easy regular
>> > expression for this problem.  I looked at the string module in the
>> > library but did not see
>> > anything that seemd related but i might have missed it.
>>
>> I'm not sure I understand your question exactly.  You seem to imply
>> that the order of the substrings of x is consistent.  If that's the
>> case, this ought to help:
>>
>> >>> import re
>> >>> x = "001111122"
>> >>> m = re.match(r"(0*)(1*)(2*)", x)
>> >>> m.groups()
>>
>> ('00', '11111', '22')>>> y = "00111"
>> >>> m = re.match(r"(0*)(1*)(2*)", y)
>> >>> m.groups()
>>
>> ('00', '111', '')
>>
>> You'll have to filter out the empty groups for yourself, but that's
>> no great problem.
>
> The order of the substrings is consistent but what if it's not 0, 1, 2
> but a more complicated string? e.g.
>
> a = 1030405, b = 1babcf, c = fUUIUP
>
> then the substring x might be 4051ba, in which case using a regexp
> with (1*) will not work since both a and b substrings begin with the
> character 1.

Right.  This looks approximately nothing like what I thought your
problem was.  Would I be right in thinking that you want to match
substrings of your potential "substrings" against the string x?

I'm sufficiently confused that I think I'd like to see what your
use case actually is before I make more of a fool of myself.

-- 
Rhodri James *-* Wildebeest Herder to the Masses