Count and replacing strings a texfile
Alex Martelli
aleaxit at yahoo.com
Wed Jan 24 05:28:12 EST 2001
"Greg Jorgensen" <gregj at pobox.com> wrote in message
news:94llph$8dl$1 at nnrp1.deja.com...
> Try this:
>
> ----
> # read all text from the file
> f = open(filename, "r")
> s = f.read()
> f.close()
>
> # split text into a list at every occurrence of %id%
> t = s.split("%id%")
> n = len(t) # number of list elements
> result = t[0] # start with first element in list
> # iterate over list, appending counter and next text chunk
> for i in range(1,n):
>     result += str(i) + t[i]
>
> print("%s occurrences replaced" % (n-1))
> print(result)
I like this general approach better than the RE-based
ones; however, building up the 'result' string by
successive concatenations is apt to be pretty slow --
remember the typical start string was said to be over
120k and to contain 'several' occurrences of '%id%'.
In general, building a big string by successive + or
+= of many small pieces is O(N squared), since each
concatenation may have to copy everything that has
been accumulated so far.
A potentially better variation, roughly O(N)...:
def numberIDstring(input_string):
    input_pieces = input_string.split('%id%')
    pieces_number = len(input_pieces)
    output_pieces = ['']*(pieces_number*2-1)
    output_pieces[0] = input_pieces[0]
    for i in range(1,pieces_number):
        output_pieces[i+i] = input_pieces[i]
        output_pieces[i+i-1] = str(i)
    return ''.join(output_pieces)
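To put the two strategies side by side, here is a minimal
sketch, assuming Python 3 (which this post predates). The
names number_by_concat and number_by_join are made up for
the illustration, and the join version uses list.append
rather than a preallocated list. CPython can sometimes grow
a string in place on +=, so the quadratic cost is a worst
case rather than a guarantee; no timings are claimed here.

```python
def number_by_concat(text, marker="%id%"):
    """Replace each marker with 1, 2, 3... by repeated concatenation."""
    pieces = text.split(marker)
    result = pieces[0]
    for i in range(1, len(pieces)):
        # each += may copy everything accumulated so far
        result += str(i) + pieces[i]
    return result


def number_by_join(text, marker="%id%"):
    """Same replacement, but collect the pieces and join them once."""
    pieces = text.split(marker)
    output = [pieces[0]]
    for i in range(1, len(pieces)):
        output.append(str(i))      # the counter that replaces the marker
        output.append(pieces[i])   # the text that followed the marker
    return "".join(output)


sample = "ax%id%by%id%cz"
# both strategies must produce the same numbered string
assert number_by_concat(sample) == number_by_join(sample) == "ax1by2cz"
```

Both functions do the same amount of splitting; they differ
only in how the output string is assembled.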
It takes some reflection (and, even better, some
testing of the boundary cases!!!) to check this
works for input-strings containing %id% at start,
at end, or two or three of them right next to
each other, of course -- but then, one MUST, of
course, ALWAYS test what one writes (and MOST
particularly test boundary/anomalous cases!!!).
Which leads me right into a digression about
unit-testing, a subject which is discussed FAR
too rarely in proportion to its importance...!
A decent unit-test here might be something like
(assuming one has no unit-testing framework in
use -- it WOULD be much better to use one!!!):
def testNumberIDstring():
    testData = (
        ('', ''),            # no input -> no output
        ('xy','xy'),         # no IDs -> no change
        ('%id%','1'),        # just an ID
        ('%id%%id%', '12'),  # just two IDs
        ('xy%id%', 'xy1'),
        ('%id%xy', '1xy'),
        ('xy%id%zt', 'xy1zt'),
        ('%id%xy%id%', '1xy2'),
        ('ax%id%by%id%cz', 'ax1by2cz'),
    )
    errors = 0
    tests = 0
    for input, expected in testData:
        tests += 1
        output = numberIDstring(input)
        if output!=expected:
            errors += 1
            reportTestFailure(tests, errors,
                input, expected, output,
                "numberIDstring")
    if errors==0:
        reportSuccess(tests,
            "numberIDstring")
    else:
        reportFailures(tests, errors,
            "numberIDstring")
Yep, there *IS* a lot of code in such a special
purpose test -- which is why using a unit
testing framework is SO useful: by greatly
reducing the repetitious work of writing test
code, it correspondingly motivates you to do
more and better unit testing!
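As an illustration of how much of that bookkeeping a
framework absorbs, here is a rough sketch of the same
table-driven test using the unittest module from today's
standard library, assuming Python 3. The choice of unittest
and the class name NumberIDstringTest are assumptions made
for this example; the post does not name a framework at
this point. The function is repeated only so the block
runs on its own.

```python
import unittest


def numberIDstring(input_string):
    # same logic as the function earlier in this post
    input_pieces = input_string.split('%id%')
    pieces_number = len(input_pieces)
    output_pieces = [''] * (pieces_number * 2 - 1)
    output_pieces[0] = input_pieces[0]
    for i in range(1, pieces_number):
        output_pieces[i + i] = input_pieces[i]
        output_pieces[i + i - 1] = str(i)
    return ''.join(output_pieces)


class NumberIDstringTest(unittest.TestCase):
    test_data = (
        ('', ''),
        ('xy', 'xy'),
        ('%id%', '1'),
        ('%id%%id%', '12'),
        ('xy%id%', 'xy1'),
        ('%id%xy', '1xy'),
        ('xy%id%zt', 'xy1zt'),
        ('%id%xy%id%', '1xy2'),
        ('ax%id%by%id%cz', 'ax1by2cz'),
    )

    def test_table(self):
        # subTest reports each failing pair separately
        for text, expected in self.test_data:
            with self.subTest(text=text):
                self.assertEqual(numberIDstring(text), expected)


# run the suite explicitly so the result object can be inspected
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NumberIDstringTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The counting of tests and errors and the reporting of
failures are all handled by the framework, which leaves
only the data table and one assertion to write.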
Personally, I find that Tim Peters' deliciously
simple "doctest.py" framework matches A LOT of
my typical unit-testing needs, and really minimizes
my work in constructing unit-test suites. But
tastes and needs vary, and one might be well
advised to look around at other Python unit test
frameworks -- each has some strong point and
might be just the ticket for YOUR own use!-)
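For a flavour of the doctest style, here is a small sketch,
assuming Python 3 and the doctest module that now ships in
the standard library. The examples live in the docstring as
if pasted from an interactive session. The usual entry
point is doctest.testmod(); this sketch instead runs the
one docstring explicitly so the outcome can be inspected.
The docstring wording is made up for the example.

```python
import doctest


def numberIDstring(input_string):
    """Replace each '%id%' with a running counter, starting at 1.

    >>> numberIDstring('')
    ''
    >>> numberIDstring('xy')
    'xy'
    >>> numberIDstring('%id%%id%')
    '12'
    >>> numberIDstring('ax%id%by%id%cz')
    'ax1by2cz'
    """
    input_pieces = input_string.split('%id%')
    pieces_number = len(input_pieces)
    output_pieces = [''] * (pieces_number * 2 - 1)
    output_pieces[0] = input_pieces[0]
    for i in range(1, pieces_number):
        output_pieces[i + i] = input_pieces[i]
        output_pieces[i + i - 1] = str(i)
    return ''.join(output_pieces)


# parse the docstring into a DocTest and run its examples
parser = doctest.DocTestParser()
test = parser.get_doctest(numberIDstring.__doc__,
                          {'numberIDstring': numberIDstring},
                          'numberIDstring', None, 0)
results = doctest.DocTestRunner(verbose=False).run(test)
```

Here results.failed and results.attempted give the counts
of failing and executed examples, and the docstring doubles
as documentation for the function.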
Alex