Counting and replacing strings in a text file

Wed Jan 24 05:28:12 EST 2001

"Greg Jorgensen" <gregj at pobox.com> wrote in message
news:94llph$8dl$1 at nnrp1.deja.com...
> Try this:
>
> ----
> # read all text from the file
> f = open(filename, "r")
> s = f.read()
> f.close()
>
> # split text into a list at every occurrence of %id%
> t = s.split("%id%")
> n = len(t)      # number of list elements
> result = t[0]   # start with first element in list
> # iterate over list, appending counter and next text chunk
> for i in range(1,n):
>         result += str(i) + t[i]
>
> print("%s occurrences replaced" % (n-1))
> print(result)

I like this general approach better than the RE-based
ones; however, building up the 'result' string by
successive concatenations is apt to be pretty slow --
remember the typical start string was said to be over
120k and to contain 'several' occurrences of '%id%'.

In general, building a big string by successive + or
+= of many small pieces is O(N squared), since each
concatenation may copy everything accumulated so far.

A potentially better variation, roughly O(N), collects
the pieces in a list and joins them once at the end:

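def numberIDstring(input_string):
    # split at every '%id%'; k markers yield k+1 pieces
    input_pieces = input_string.split('%id%')
    pieces_number = len(input_pieces)
    # preallocate: pieces at even slots, counters at odd slots
    output_pieces = ['']*(pieces_number*2-1)
    output_pieces[0] = input_pieces[0]
    for i in range(1,pieces_number):
        output_pieces[i+i] = input_pieces[i]
        output_pieces[i+i-1] = str(i)
    # a single join copies each character only once
    return ''.join(output_pieces)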

It takes some reflection (and, even better, some
testing of the boundary cases!!!) to check that this
works for input strings containing %id% at the start,
at the end, or two or three of them right next to
each other -- but then, one MUST, of course, ALWAYS
test what one writes (and MOST particularly test
boundary/anomalous cases!!!).
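
The boundary cases just mentioned can be spot-checked
by hand before writing a real test. A rough sketch
(the function is repeated only so the snippet runs on
its own; the sample inputs and the use of str.count
to report the number of replacements are illustrative
assumptions, not part of the code above):

```python
def numberIDstring(input_string):
    # same function as above, repeated so this runs on its own
    input_pieces = input_string.split('%id%')
    pieces_number = len(input_pieces)
    output_pieces = [''] * (pieces_number * 2 - 1)
    output_pieces[0] = input_pieces[0]
    for i in range(1, pieces_number):
        output_pieces[i + i] = input_pieces[i]
        output_pieces[i + i - 1] = str(i)
    return ''.join(output_pieces)

# assumed sample inputs: marker at start, at end, and adjacent markers
samples = ('%id%xy', 'xy%id%', '%id%%id%', '%id%%id%%id%', 'a%id%%id%b')
for s in samples:
    # s.count('%id%') is the number of markers that were replaced
    print("%r -> %r (%d replaced)"
          % (s, numberIDstring(s), s.count('%id%')))
```

Eyeballing printed output like this is no substitute
for an automated test, but it is a quick way to see
what the split/join arithmetic does at the edges.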

Which leads me right into a digression about
unit-testing, a subject which is discussed FAR
too rarely in proportion to its importance...!

A decent unit-test here might be something like
(assuming one has no unit-testing framework in
use -- it WOULD be much better to use one!!!):

def testNumberIDstring():
    testData = (
        ('', ''),      # no input -> no output
        ('xy','xy'),   # no IDs -> no change
        ('%id%','1'),  # just an ID
        ('%id%%id%', '12'),   # just two IDs
        ('xy%id%', 'xy1'),
        ('%id%xy', '1xy'),
        ('xy%id%zt', 'xy1zt'),
        ('%id%xy%id%', '1xy2'),
        ('ax%id%by%id%cz', 'ax1by2cz'),
    )
    errors = 0
    tests = 0
    for input, expected in testData:
        tests += 1
        output = numberIDstring(input)
        if output!=expected:
            errors += 1
            reportTestFailure(tests, errors,
                input, expected, output,
                "numberIDstring")
    if errors==0:
        reportSuccess(tests,
                "numberIDstring")
    else:
        reportFailures(tests, errors,
                "numberIDstring")

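The three report functions called above are not
defined in this post. A minimal sketch of what they
might look like: the names and argument order are
taken from the calls above, while the bodies, the
message wording, and the choice of stdout/stderr are
purely assumptions for illustration:

```python
import sys

def reportTestFailure(tests, errors, input, expected, output, name):
    # hypothetical helper: describe one failing case on stderr
    print("%s: test #%d failed (failure #%d): "
          "input %r, expected %r, got %r"
          % (name, tests, errors, input, expected, output),
          file=sys.stderr)

def reportSuccess(tests, name):
    # hypothetical helper: summary line when every case passed
    print("%s: all %d tests passed" % (name, tests))

def reportFailures(tests, errors, name):
    # hypothetical helper: summary line when some case failed
    print("%s: %d of %d tests failed" % (name, errors, tests),
          file=sys.stderr)
```

With something like these in scope, calling
testNumberIDstring() should print a single summary
line when all the cases pass.
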
Yep, there *IS* a lot of code in such a special
purpose test -- which is why using a unit
testing framework is SO useful: by greatly
reducing the repetitious work of writing test
code, it correspondingly motivates you to do
more and better unit testing!

Personally, I find that Tim Peters' deliciously
simple "doctest.py" framework matches A LOT of
my typical unit-testing needs, and really minimizes
my work in constructing unit-test suites.  But
tastes and needs vary, and one might be well
advised to look around at other Python unit test
frameworks -- each has some strong point and
might be just the ticket for YOUR own use!-)
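
To give a flavor of the doctest style, here is a
rough sketch of the same function with a few of the
cases from the test data above moved into its
docstring. Which cases to include, and running them
through doctest.testmod(), are assumptions about one
reasonable way to set this up:

```python
import doctest

def numberIDstring(input_string):
    """Replace each '%id%' with a running counter, starting at 1.

    >>> numberIDstring('')
    ''
    >>> numberIDstring('xy')
    'xy'
    >>> numberIDstring('%id%%id%')
    '12'
    >>> numberIDstring('ax%id%by%id%cz')
    'ax1by2cz'
    """
    input_pieces = input_string.split('%id%')
    pieces_number = len(input_pieces)
    output_pieces = [''] * (pieces_number * 2 - 1)
    output_pieces[0] = input_pieces[0]
    for i in range(1, pieces_number):
        output_pieces[i + i] = input_pieces[i]
        output_pieces[i + i - 1] = str(i)
    return ''.join(output_pieces)

if __name__ == '__main__':
    # runs every docstring example; silent unless one fails
    doctest.testmod()
```

The examples double as documentation, and there is no
separate reporting code to write at all.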

Alex