NewB question on text manipulation

Steve R. Hastings steve at hastings.org
Wed May 3 02:34:02 EDT 2006


On Tue, 02 May 2006 22:37:04 -0700, ProvoWallis wrote:
> I have a file that looks like this:
> 
> <SC>APPEAL<XC>40-24; 40-46; 42-46; 42-48; 42-62; 42-63 <SC>PROC
> GUIDE<XC>92<LT>1(b)(1)
> 
> (i.e., <<SC>[chapter name]<XC>[multiple or single book page
> ranges]<SC>[chapter name]<XC>[multiple or single book page
> ranges]<LT>[code]
> 
> but I want to change it so that it looks like this
> 
> <1><SC>APPEAL<XC>40-24<LT>1(b)(1)
> <1><SC>APPEAL<XC>40-46<LT>1(b)(1)
> <1><SC>APPEAL<XC>42-46<LT>1(b)(1)
> <1><SC>APPEAL<XC>42-48<LT>1(b)(1)
> <1><SC>APPEAL<XC>42-62<LT>1(b)(1)
> <1><SC>APPEAL<XC>42-63<LT>1(b)(1)
> <1><SC>PROC GUIDE<XC>92<LT>1(b)(1)

I'll show my code first, then explain it.

-- cut here -- cut here -- cut here -- cut here -- cut here --
import re

s = "<SC>APPEAL<XC>40-24; 40-46; 42-46; 42-48; 42-62; 42-63 " + \
    "<SC>PROC GUIDE<XC>92<LT>1(b)(1)"

s_space = " "  # a single space
s_empty = ""  # empty string

# raw string, so that \s reaches the regex engine as written
pat = re.compile(r"\s*<SC>([^<]+)<XC>([^<]+)")

lst = []

while True:
    m = pat.search(s)
    if not m:
        break

    title = m.group(1).strip()
    xc = m.group(2)
    xc = xc.replace(s_space, s_empty)
    tup = (title, xc)
    lst.append(tup)
    s = pat.sub(s_empty, s, 1)

lt = s.strip()

for title, xc in lst:
    lst_pp = xc.split(";")
    for pp in lst_pp:
        # parenthesized so it runs as a statement or as a function call
        print("<1><SC>%s<XC>%s%s" % (title, pp, lt))
-- cut here -- cut here -- cut here -- cut here -- cut here --

My strategy here is to divide the problem into two separate parts: first,
I collect all the data we need; then, I reformat the collected data and
print it in the desired format.

"pat" is a compiled regular expression.  It recognizes the SC and XC
codes, and collects the strings enclosed by those codes:

([^<]+)

In the above regular expression, [^<] means "any character that is not a
'<'", the + means "one or more of them", and since it's in parentheses the
matched text is remembered as a group so we can collect it later.
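
As a quick illustration of what the two groups capture, here is a minimal
sketch on a shortened sample string (the names "sample" and "m" are made up
for this example; the pattern is the one from the listing):

```python
import re

pat = re.compile(r"\s*<SC>([^<]+)<XC>([^<]+)")

# Shortened, made-up sample in the same shape as the real data.
sample = "<SC>APPEAL<XC>40-24; 40-46 <SC>PROC GUIDE<XC>92<LT>1(b)(1)"
m = pat.search(sample)

# group(1) is the text between <SC> and <XC>;
# group(2) runs from <XC> up to, but not including, the next '<'.
print(repr(m.group(1)))  # 'APPEAL'
print(repr(m.group(2)))  # '40-24; 40-46 '
```

Note the trailing space in the second group: [^<]+ keeps going until it hits
a '<', so any white space before the next tag comes along for the ride.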

So we collect the title and the XC page ranges.  We tidy them up a bit:
title.strip() will remove any leading or trailing white space from the
title.  The replace() on the XC string gets rid of any spaces; I'm
assuming that the spaces are optional and the semicolons are the real
separators here.
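
A small sketch of that tidying step, using made-up values ("raw_title" and
"raw_xc" are illustrative names, not from the listing):

```python
raw_title = "  PROC GUIDE "
raw_xc = "40-24; 40-46; 42-46 "

# strip() only touches the ends, so the space inside the title survives.
title = raw_title.strip()
# replace() removes every space, leaving the semicolons as separators.
xc = raw_xc.replace(" ", "")

print(repr(title))  # 'PROC GUIDE'
print(repr(xc))     # '40-24;40-46;42-46'
```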

Now, we could save the title and XC string in two lists, but that would be
silly in Python.  It's easier to pair them up in a tuple, and save the
tuple in a single list.  You can do it in one line, but I made the tuple
explicit ("tup").
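
For reference, the one-line form might look like this (a sketch with made-up
values; note the doubled parentheses, since the tuple is the single argument
to append()):

```python
lst = []

title, xc = "APPEAL", "40-24;40-46"
lst.append((title, xc))           # build the tuple and append it in one step
lst.append(("PROC GUIDE", "92"))

# Each tuple unpacks back into two names when looping.
for title, xc in lst:
    print("%s -> %s" % (title, xc))
```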

After we collect them, we call sub() with a count of 1 to chop just that
match out of the source string, so the next search() finds the next pair.
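
To see the effect of the count argument, here is a minimal sketch on a
made-up string ("s1" and "s2" are illustrative names):

```python
import re

pat = re.compile(r"\s*<SC>([^<]+)<XC>([^<]+)")

s = "<SC>A<XC>1; 2 <SC>B<XC>3<LT>code"

# The third argument limits sub() to the first match only.
s1 = pat.sub("", s, 1)
print(s1)  # <SC>B<XC>3<LT>code

# A second pass removes the next pair, leaving only the tail.
s2 = pat.sub("", s1, 1)
print(s2)  # <LT>code
```

Without the count, sub() would remove every SC/XC pair in a single call.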

A while loop runs until all the SC and XC values are collected; anything
left over is assumed to be the LT.

Now, we have all the data; it's easy enough to rearrange it.

We can convert the XC string into a list of page ranges just by calling
.split(";"), which will split on semicolons.  Loop over this list,
printing each time, and there you go.

I'll leave packaging these up into tidy functions, reading the data from
the file, etc. as exercises for the reader. :-)
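
For anyone who wants a starting point, one possible packaging is sketched
below.  The function name "reformat_entry" is hypothetical, and the sketch
assumes each entry has already been read into a single string; how entries
are delimited in the real file is not stated in this thread, so the file
reading is left out.

```python
import re

pat = re.compile(r"\s*<SC>([^<]+)<XC>([^<]+)")


def reformat_entry(s):
    """Return the list of output lines for one input entry (hypothetical helper)."""
    pairs = []
    while True:
        m = pat.search(s)
        if not m:
            break
        title = m.group(1).strip()
        xc = m.group(2).replace(" ", "")
        pairs.append((title, xc))
        s = pat.sub("", s, 1)  # chop out the pair just collected

    lt = s.strip()  # whatever is left over is assumed to be the LT code

    lines = []
    for title, xc in pairs:
        for pp in xc.split(";"):
            lines.append("<1><SC>%s<XC>%s%s" % (title, pp, lt))
    return lines


entry = ("<SC>APPEAL<XC>40-24; 40-46; 42-46; 42-48; 42-62; 42-63 "
         "<SC>PROC GUIDE<XC>92<LT>1(b)(1)")
result = reformat_entry(entry)
for line in result:
    print(line)
```

Returning a list instead of printing directly makes it easier to write the
lines to a file or to check them in a test.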

If you have any questions on how this works or why I did things the way I
did, ask away.

Good luck!
-- 
Steve R. Hastings    "Vita est"
steve at hastings.org    http://www.blarg.net/~steveha



