efficiently splitting up strings based on substrings

per perfreem at gmail.com
Sat Sep 5 19:29:14 EDT 2009


On Sep 5, 7:07 pm, "Rhodri James" <rho... at wildebst.demon.co.uk> wrote:
> On Sat, 05 Sep 2009 23:54:08 +0100, per <perfr... at gmail.com> wrote:
> > On Sep 5, 6:42 pm, "Rhodri James" <rho... at wildebst.demon.co.uk> wrote:
> >> On Sat, 05 Sep 2009 22:54:41 +0100, per <perfr... at gmail.com> wrote:
> >> > I'm trying to efficiently "split" strings based on what substrings
> >> > they are made up of.
> >> > i have a set of strings that are comprised of known substrings.
> >> > For example, a, b, and c are substrings that are not identical to each
> >> > other, e.g.:
> >> > a = "0" * 5
> >> > b = "1" * 5
> >> > c = "2" * 5
>
> >> > Then my_string might be:
>
> >> > my_string = a + b + c
>
> >> > i am looking for an efficient way to solve the following problem.
> >> > suppose i have a short
> >> > string x that is a substring of my_string.  I want to "split" the
> >> > string x into blocks based on
> >> > what substrings (i.e. a, b, or c) chunks of s fall into.
>
> >> > to illustrate this, suppose x = "00111". Then I can detect where x
> >> > starts in my_string
> >> > using my_string.find(x).  But I don't know how to partition x into
> >> > blocks depending
> >> > on the substrings.  What I want to get out in this case is: "00",
> >> > "111".  If x were "001111122",
> >> > I'd want to get out "00","11111", "22".
>
> >> > is there an easy way to do this?  i can't simply split x on a, b, or c
> >> > because these might
> >> > not be contained in x.  I want to avoid doing something inefficient
> >> > like looking at all substrings
> >> > of my_string etc.
>
> >> > i wouldn't mind using regular expressions for this but i cannot think
> >> > of an easy regular
> >> > expression for this problem.  I looked at the string module in the
> >> > library but did not see
> >> > anything that seemd related but i might have missed it.
>
> >> I'm not sure I understand your question exactly.  You seem to imply
> >> that the order of the substrings of x is consistent.  If that's the
> >> case, this ought to help:
>
> >> >>> import re
> >> >>> x = "001111122"
> >> >>> m = re.match(r"(0*)(1*)(2*)", x)
> >> >>> m.groups()
>
> >> ('00', '11111', '22')>>> y = "00111"
> >> >>> m = re.match(r"(0*)(1*)(2*)", y)
> >> >>> m.groups()
>
> >> ('00', '111', '')
>
> >> You'll have to filter out the empty groups for yourself, but that's
> >> no great problem.
>
> > The order of the substrings is consistent but what if it's not 0, 1, 2
> > but a more complicated string? e.g.
>
> > a = 1030405, b = 1babcf, c = fUUIUP
>
> > then the substring x might be 4051ba, in which case using a regexp
> > with (1*) will not work since both a and b substrings begin with the
> > character 1.
>
> Right.  This looks approximately nothing like what I thought your
> problem was.  Would I be right in thinking that you want to match
> substrings of your potential "substrings" against the string x?
>
> I'm sufficiently confused that I think I'd like to see what your
> use case actually is before I make more of a fool of myself.
>
> --
> Rhodri James *-* Wildebeest Herder to the Masses

it's exactly the same problem, except there are no constraints on the
strings.  so the problem is, like you say, matching the substrings
against the string x. in other words, finding out where x "aligns" to
the ordered substrings abc, and then determine what chunk of x belongs
to a, what chunk belongs to b, and what chunk belongs to c.

so in the example i gave above, the substrings are: a = 1030405, b =
1babcf, c = fUUIUP, so abc = 10304051babcffUUIUP

given a substring like 4051ba, i'd want to split it into the chunks a,
b, and c. in this case, i'd want the result to be: ["405", "1ba"] --
i.e. "405" is the chunk of x that belongs to a, and "1ba" the chunk
that belongs to be. in this case, there are no chunks of c.  if x
instead were "4051babcffUU", the right output is: ["405", "1babcf",
"fUU"], which are the corresponding chunks of a, b, and c that make up
x respectively.

i'm not sure how to approach this. any ideas/tips would be greatly
appreciated. thanks again.



More information about the Python-list mailing list