NewB question on text manipulation

Wed May 3 15:16:57 EDT 2006

On Wed, 03 May 2006 10:29:55 -0700, ProvoWallis wrote:
> I only have one issue that I can't figure out. When I print the new
> string I'm getting all of the values in the lt list rather than just
> the one that corresponds to the original entry.

I did not realize that each entry would have its own LT value.  I had
thought that there were several sets of <SC> and <XC> with one <LT>.  You
only showed one example...

I have modified the program to collect LT values at the same time it
collects SC and XC values. Also, it now collects whatever code appears
before the first SC code.  I don't know what this code is for so I just
called the variable "before".
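
To make the collected pieces concrete, here is a minimal sketch that
applies the same pat_sc pattern used in the program below to a single
entry from the test data.  The variable "entry" is made up for this
example, and the pattern is written as a raw string.

```python
import re

# Same pattern as pat_sc in the program, written as a raw string.
pat_sc = re.compile(r"\s*(<[^<]+)<SC>([^<]+)<XC>([^<]+)")

# One entry copied from the sample data ("entry" is a name made up here).
entry = "<1><SC>PROC GUIDE<XC>92<LT>1(b)(1)"
m = pat_sc.search(entry)

before = m.group(1).strip()  # whatever code comes before <SC>
title = m.group(2).strip()   # text between <SC> and <XC>
xc = m.group(3).strip()      # text after <XC>, up to the next '<'

print("%s | %s | %s" % (before, title, xc))
```

Each group stops at the next '<', which is why the <LT> code is left
behind in the string for the LT patterns to deal with.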

Notes on the code:

* Instead of doing this:

title = m.group(2)
title = title.strip()

I just do this:

title = m.group(2).strip()

m.group(2) returns a string, so you can call a string method on it
directly, and it's convenient to do it all in one line.  There are
several lines like that.

* There are two patterns to detect the LT code.  The first one is for
finding it, and the second one is only for removing it.  The second one
uses '^' to anchor the pattern, so it will only remove the LT code if the
LT code is the first thing in the string.  The first pattern does not have
the '^' anchor, so it will search forward, past any number of <SC> codes,
to find the next <LT> code.  An entry with no <LT> of its own therefore
picks up the <LT> of the next entry that has one.

* Otherwise this is pretty much like the first version.  It collects data,
saves it in a list, and then prints its output from the list.
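
The difference between the two LT patterns described above can be seen
in a short sketch.  The two "rest" strings are made up to resemble what
is left of the input at two different points in the loop; the patterns
are the same as in the program, written as raw strings.

```python
import re

pat_lt = re.compile(r"<LT>([^<]+)")          # finds the next LT anywhere
pat_lt_remove = re.compile(r"^<LT>([^<]+)")  # matches only at the start

# Case 1: the entry just removed had no LT; another entry comes first.
rest1 = "<1><SC>PROC GUIDE<XC>92<LT>1(b)(1)"
found1 = pat_lt.search(rest1).group(1)           # LT is found further on
after1 = pat_lt_remove.sub("", rest1, count=1)   # nothing at start to remove

# Case 2: the entry just removed is followed directly by its LT.
rest2 = "<LT>1(b)(1)<1><SC>FAM LAW ENF<XC>259-232<LT>-687"
found2 = pat_lt.search(rest2).group(1)           # first LT in the string
after2 = pat_lt_remove.sub("", rest2, count=1)   # leading LT is removed

print(found1)
print(after1)
print(found2)
print(after2)
```

In case 1 the string comes back unchanged, so the LT stays in place for
the entry it actually belongs to.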

I am busy now, so I won't have any time to make any more versions of this
for you. I hope you can study what I have done and understand how to apply
the ideas to your problems.  Good luck!

-- cut here -- cut here -- cut here -- cut here -- cut here --
import re

s = "<1><SC>APPEAL<XC>40-24; 40-46; 42-46; 42-48; 42-62; 42-63 " + \
    "<1><SC>PROC GUIDE<XC>92<LT>1(b)(1)" + \
    "<1><SC>FAM LAW ENF<XC>259-232<LT>-687" + \
    "<1><SC>APPEAL<XC>40-38; 40-44; 44-18; 45-15<LT>1"

s_space = " "  # a single space
s_empty = ""  # empty string

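# Raw strings keep backslashes such as \s intact for the regex engine.
pat_sc = re.compile(r"\s*(<[^<]+)<SC>([^<]+)<XC>([^<]+)")
pat_lt = re.compile(r"<LT>([^<]+)")          # find the next LT anywhere
pat_lt_remove = re.compile(r"^<LT>([^<]+)")  # remove LT only at the start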

lst = []
lt = None

while True:
    m = pat_sc.search(s)
    if not m:
        break

    before = m.group(1).strip()
    title = m.group(2).strip()
    xc = m.group(3).replace(s_space, s_empty)

    s = pat_sc.sub(s_empty, s, 1)

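    # Find the next LT anywhere in what is left.  If there is none,
    # lt keeps the value from the previous entry.
    m = pat_lt.search(s)
    if m:
        lt = m.group(1).strip()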

    s = pat_lt_remove.sub(s_empty, s, 1)

    tup = (before, title, xc, lt)
    lst.append(tup)

for before, title, xc, lt in lst:
    # One output line per XC value.
    for pp in xc.split(";"):
        # print(...) with one argument runs under Python 2 and Python 3.
        print("%s<SC>%s<XC>%s<LT>%s" % (before, title, pp, lt))
-- cut here -- cut here -- cut here -- cut here -- cut here --
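
As a small check on the final loop, here is a sketch that expands one
collected tuple into output lines and keeps them in a list instead of
printing them directly.  The tuple is typed in by hand to match what the
program should collect for the last entry of the sample data, and the
name "out_lines" is made up for this example.

```python
# One tuple as the collection loop would store it: (before, title, xc, lt).
# Typed in by hand for illustration.
tup = ("<1>", "APPEAL", "40-38;40-44;44-18;45-15", "1")

before, title, xc, lt = tup

out_lines = []
for pp in xc.split(";"):
    # Each XC value gets its own line with the shared title and LT.
    out_lines.append("%s<SC>%s<XC>%s<LT>%s" % (before, title, pp, lt))

for line in out_lines:
    print(line)
```

Building a list first makes it easy to compare the result against what
you expect before writing anything out.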

-- 
Steve R. Hastings    "Vita est"
steve at hastings.org    http://www.blarg.net/~steveha