Text Manipulation in Python

Bernhard Reiter bernhard at alpha1.csd.uwm.edu
Wed Nov 10 14:44:16 EST 1999


On Wed, 10 Nov 1999 08:46:42 GMT, Edward Hasted <edward at corpex.com> wrote:
>I am completely new to Python.
Welcome to a land of programming pleasure.

>We want to use it to alter specific lines in template files, typically 
>something like changing:-
>
>Variable = 1234
>
>Variable = Company Name
>
>The text manipulation strings within Python seem to work sequentially 
>rather than on a line basis.
>
>1. Is this correct?
If I understand you correctly, yes.
But (of course there is a but) you can make Python work on a
per-line basis relatively easily.

>2. What is the best way to do text searching and manipulation within 
>Python.

Some other posters suggested looking into other tools, but I would refrain
from that. For one thing, Python is a unified language and a lot easier to
learn. The syntax of awk and sed can be tricky at times. You might
be able to pull it off with bash scripting or vim replacement rules. Don't.

Use the nice fileinput module to get lines:

-----------
#!/usr/bin/env python
""" Per line processing example. """

import sys
import fileinput

def process(line):
    """ Process one input line and spit out what we want. """
    # get rid of the possible newline and the whitespace around chars
    # (spelled string.strip(line) in Python 1.5; a string method later on)
    line=line.strip()
    # put the newline back, so the output lines do not run together
    sys.stdout.write(line+"\n")

for line in fileinput.input():
    process(line)
-----------

Now you can use the string methods (the string module functions, in older
Python versions) to deal with one line, play with Python's "slicing"
capabilities on the strings, or even pull out the big guns and play the
regular expression card with the re module.
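
To make that concrete for the question above, here is a minimal sketch
(written for current Python 3, not part of the original post) that rewrites
a "Variable = 1234" line with the re module. The name set_variable and its
parameters are made up for illustration, and the sketch assumes the template
uses a plain "name = value" layout, one setting per line.

```python
import re

def set_variable(line, name, value):
    """Return line with the value of "name = ..." replaced.

    Hypothetical helper: lines that do not set `name` pass through unchanged.
    """
    # Anchor at the line start so "OtherVariable = 1" is not touched.
    pattern = r"^(\s*" + re.escape(name) + r"\s*=\s*).*$"
    # A function as replacement avoids backslash surprises in `value`.
    return re.sub(pattern, lambda m: m.group(1) + value, line)

print(set_variable("Variable = 1234", "Variable", "Company Name"))
# prints: Variable = Company Name
```

A trailing newline on the line survives, because "." does not match it, so
the function can be dropped straight into a per-line loop.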

The fileinput module can even do in-place editing of files for you
(optionally keeping the original as a backup).
Oh, and you call that little program with just the filename(s).
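
A minimal sketch of the in-place mode, assuming current Python 3 (where
fileinput.input() can be used in a with statement). The function name
strip_lines_in_place and the ".bak" suffix are my own choices for
illustration.

```python
import fileinput

def strip_lines_in_place(filenames, backup=".bak"):
    """Strip surrounding whitespace from every line of the given files."""
    # With inplace=True, fileinput redirects standard output into the file
    # being read; the original is kept under its name plus `backup`.
    with fileinput.input(files=filenames, inplace=True, backup=backup) as f:
        for line in f:
            print(line.strip())
```

Calling strip_lines_in_place(["template.txt"]) would rewrite template.txt
and leave the untouched original in template.txt.bak (the filename is only
an example).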

In time you will reach the point where you want to pull off some
more complicated tricks. Here is a rough example in which I tried
to imitate the way awk unfolds its power on all sorts of text processing tasks.
Don't read it right now, but look into it later... if you feel like it.
It contains a bunch of Python-related conventions and some variables that are
not used in this example, but are ready for other text processing tasks.


Input files look like:

### Opening logfile (channel #whatever), [Mon Nov  2 09:22:49 1999]
[Mon Nov  2 09:23:04 1999]<willi7,0;> Gut.
[Mon Nov  2 09:23:11 1999]<willi7,0;> Wer leitet ? 
[Mon Nov  2 09:23:37 1999]<Bob7,0;> immer der Protokolleur
[Mon Nov  2 09:23:37 1999]#intevation> Lasst mich doch leiten, dann kann jemand anders Protokoll schreiben. :)
[Mon Nov  2 09:24:19 1999]<willi7,0;> Scheint ja geklärt. 

Output files look like:

### Opening logfile (channel #whatever), [Mon Nov  2 09:22:49 1999]
[09:23:04] <willi> Gut.
[09:23:11] <willi> Wer leitet ? 
[09:23:37] <Bob> immer der Protokolleur
[09:23:37] <Cliff> Lasst mich doch leiten, dann kann jemand anders Protokoll schreiben. :)
[09:24:19] <willi> Scheint ja geklärt. 


-----------
#!/usr/bin/env python
"""Format and clean tirc normal logfile for IRC session logs.  v%(version)s

USAGE:
 %(progname)s [inputfilename(s)] 

Input files will be replaced if filenames are given.

The default is to use <Cliff> as user.
The time entries on each line will be cut down to include only the time
and not the day or year, which can easily be seen from the beginning
and end of a consecutive session.

Of course this is dependent on the time format tirc uses.
"""
__version__="0.1"
# initial 2.6.1999 Bernhard Reiter


import fileinput
import sys
import string
import re


nick="Cliff"

def process(line,fileinputobject,write):
    """Process one inputline and spit out what is wanted."""

    # share some things between subsequent calls
    global channel
    global nickre

    #f=string.split(line,";")
    #f=map(string.strip,f)

    # imitate gawk a bit   ;-)
    #NF=len(f)
    FILENAME=fileinputobject.filename()
    NR=fileinputobject.lineno()
    FNR=fileinputobject.filelineno()


    if line[0:3]=="###":
        # opening or closing of logfile

        # grab channelname (raw string keeps the backslashes for re)
        matchobj=re.search(r'\(channel (#[^)]+)\)',line)
        if matchobj:
            channel=matchobj.group(1)
            sys.stderr.write("Found channel: " + channel + "\n" )

            # prepare nick replacement procedure
            nickre=re.compile(re.escape("]" +channel+">"))

    if line[0:1]=="[":
        # normal line

        # replace funny control string
        line=line.replace(chr(3)+"7,0;","",1)
        line=nickre.sub("]<"+nick+">",line,count=1)

        # distance between time and normal text
        if line[25:27]=="]<":
            line=line[:26]+" "+line[26:]

        # cut date and year out
        line="["+line[12:20]+"]"+line[26:]

    write(line)
	


def main():

    if len(sys.argv)==1:
        sys.stderr.write(__doc__ %
            {"progname":sys.argv[0], "version":__version__})

        if sys.platform=="win32":
            import msvcrt
            # prompt first, then wait for the key press
            sys.stderr.write("\nPress any key to start reading from stdin.\n")
            msvcrt.getch()
        else:
            sys.stderr.write("Now reading from stdin.\n")

    # edit in place; the originals are kept with a ".org" suffix
    fileinputobject=fileinput.input(sys.argv[1:],inplace=1,backup=".org")
    outputwritefunction=sys.stdout.write

    try:
        for line in fileinputobject:
            # look up sys.stdout on every call: in in-place mode fileinput
            # swaps it for the output file once the first file is opened
            process(line,fileinputobject,sys.stdout.write)
    finally:
        fileinputobject.close()


if __name__=="__main__":
    main()
-----------
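
The fixed column slices above depend on the exact width of tirc's time
stamp. As a rough alternative, not taken from the script itself, the same
shortening can be sketched with a regular expression in current Python 3.
The names STAMP and shorten_stamp are invented here, and the pattern assumes
stamps shaped like the sample input. Unlike the script, this variant always
puts one space after the shortened stamp.

```python
import re

# Matches a leading "[Day Mon DD HH:MM:SS YYYY]" stamp and captures the time.
STAMP = re.compile(r"^\[\w{3} \w{3} [ \d]\d (\d\d:\d\d:\d\d) \d{4}\]")

def shorten_stamp(line):
    """Replace the full date stamp with "[HH:MM:SS] "; leave other lines alone."""
    return STAMP.sub(lambda m: "[" + m.group(1) + "] ", line, count=1)

print(shorten_stamp("[Mon Nov  2 09:23:04 1999]<willi> Gut."))
# prints: [09:23:04] <willi> Gut.
```

Lines that do not start with such a stamp, like the "###" header lines,
come back unchanged.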

<cranking-out-python-with-hopes-not-to-confuse-you-too-much>ly,
	Bernhard

-- 
Research Assistant, Geog Dept UM-Milwaukee, USA.    (www.uwm.edu/~bernhard)
Free Software Projects and Consulting 			   (intevation.net)  
Association for a Free Informational Infrastructure              (ffii.org)



