NewB question on text manipulation
Steve R. Hastings
steve at hastings.org
Wed May 3 15:16:57 EDT 2006
On Wed, 03 May 2006 10:29:55 -0700, ProvoWallis wrote:
> I only have one issue that I can't figure out. When I print the new
> string I'm getting all of the values in the lt list rather than just
> the one that corresponds to the original entry.
I did not realize that each entry would have its own LT value. I had
thought that there were several sets of <SC> and <XC> with one <LT>. You
only showed one example...
I have modified the program to collect LT values at the same time it
collects SC and XC values. Also, it now collects whatever code appears
before the first SC code. I don't know what this code is for so I just
called the variable "before".
Notes on the code:
* Instead of doing this:
    title = m.group(2)
    title = title.strip()
I just do this:
    title = m.group(2).strip()
You can call a string method directly on any expression that returns a
string, so it's convenient to do it all in one line. There are several
lines like that.
* There are two patterns to detect the LT code. The first one is for
finding it, and the second one is only for removing it. The second one
uses '^' to anchor the pattern, so it will only remove the LT code if the
LT code is the first thing in the string. The first pattern does not have
the '^' anchor so it will look ahead, past any number of <SC> codes, to
find the next <LT> code.
* Otherwise this is pretty much like the first version. It collects data,
saves it in a list, and then prints its output from the list.
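Here is a small sketch of the difference between the two LT patterns. The two sample strings and the variable names ahead, at_start, found_ahead, unchanged and removed are made up for illustration; only the patterns come from the program.

```python
import re

pat_lt = re.compile(r"<LT>([^<]+)")          # no anchor: finds the next LT anywhere
pat_lt_remove = re.compile(r"^<LT>([^<]+)")  # '^' anchor: matches only at the start

# Illustrative sample: an SC entry comes before the LT code.
ahead = "<1><SC>PROC GUIDE<XC>92<LT>1(b)(1)"
# Illustrative sample: the LT code is the first thing in the string.
at_start = "<LT>1(b)(1)<1><SC>FAM LAW ENF<XC>259-232"

# The unanchored pattern looks past the SC entry and still finds the LT value.
found_ahead = pat_lt.search(ahead).group(1)

# The anchored pattern leaves the string alone when LT is not at the front...
unchanged = pat_lt_remove.sub("", ahead, count=1)

# ...and strips the LT code when it is the first thing in the string.
removed = pat_lt_remove.sub("", at_start, count=1)

print(found_ahead)
print(unchanged)
print(removed)
```

This is why the program can read an LT value while it is still several entries ahead, but only deletes it once the entries in front of it have been consumed.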
I am busy now, so I won't have any time to make any more versions of this
for you. I hope you can study what I have done and understand how to apply
the ideas to your problems. Good luck!
-- cut here -- cut here -- cut here -- cut here -- cut here --
import re

s = "<1><SC>APPEAL<XC>40-24; 40-46; 42-46; 42-48; 42-62; 42-63 " + \
    "<1><SC>PROC GUIDE<XC>92<LT>1(b)(1)" + \
    "<1><SC>FAM LAW ENF<XC>259-232<LT>-687" + \
    "<1><SC>APPEAL<XC>40-38; 40-44; 44-18; 45-15<LT>1"

s_space = " "  # a single space
s_empty = ""  # empty string

# Raw strings keep the backslash in \s from being read as a string escape.
pat_sc = re.compile(r"\s*(<[^<]+)<SC>([^<]+)<XC>([^<]+)")
pat_lt = re.compile(r"<LT>([^<]+)")
pat_lt_remove = re.compile(r"^<LT>([^<]+)")

lst = []
lt = None

while True:
    m = pat_sc.search(s)
    if not m:
        break

    before = m.group(1).strip()
    title = m.group(2).strip()
    xc = m.group(3).replace(s_space, s_empty)
    # Remove the entry we just matched from the front of s.
    s = pat_sc.sub(s_empty, s, 1)

    # Look ahead for the next LT; remove it only if it is now first in s.
    m = pat_lt.search(s)
    if m:
        lt = m.group(1)
        lt = lt.strip()
        s = pat_lt_remove.sub(s_empty, s, 1)

    tup = (before, title, xc, lt)
    lst.append(tup)

for before, title, xc, lt in lst:
    lst_pp = xc.split(";")
    for pp in lst_pp:
        print("%s<SC>%s<XC>%s<LT>%s" % (before, title, pp, lt))
-- cut here -- cut here -- cut here -- cut here -- cut here --
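To look at the output step by itself, here is a minimal sketch of the final loop applied to one tuple. The helper name expand is hypothetical and not part of the program above; the values are the ones the last entry of the sample string should produce.

```python
def expand(before, title, xc, lt):
    """Return one output line per ';'-separated page reference in xc."""
    # expand is an illustrative helper, not a function from the original program.
    return ["%s<SC>%s<XC>%s<LT>%s" % (before, title, pp, lt)
            for pp in xc.split(";")]

lines = expand("<1>", "APPEAL", "40-38;40-44;44-18;45-15", "1")
for line in lines:
    print(line)
```

Each page reference gets its own line, and every line repeats the same before, title and lt values.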
--
Steve R. Hastings "Vita est"
steve at hastings.org http://www.blarg.net/~steveha