Parsing text

Bengt Richter bokr at oz.net
Tue Dec 20 18:43:10 EST 2005


On 20 Dec 2005 08:06:39 -0800, "sicvic" <morange.victor at gmail.com> wrote:

>Not homework...not even in school (do any universities even teach
>classes using python?). Just not a programmer. Anyways I should
>probably be more clear about what I'm trying to do.
Ok, not homework.

>
>Since I can't show the actual output file, let's say I had an output file
>that looked like this:
>
>aaaaa bbbbb Person: Jimmy
>Current Location: Denver
>Next Location: Chicago
>----------------------------------------------
>aaaaa bbbbb Person: Sarah
>Current Location: San Diego
>Next Location: Miami
>Next Location: New York
>----------------------------------------------
>
>Now I want to put the following (and all recurrences of "Person: Jimmy")
>
>Person: Jimmy
>Current Location: Denver
>Next Location: Chicago
>
>in a file called jimmy.txt
>
>and the same for Sarah in sarah.txt
>
>The code I currently have looks something like this:
>
>import re
>import sys
>
>person_jimmy = open('jimmy.txt', 'w') #creates jimmy.txt
>person_sarah = open('sarah.txt', 'w') #creates sarah.txt
>
>f = open(sys.argv[1]) #opens output file
>#loop that goes through all lines and parses specified text
>for line in f.readlines():
>    if re.search(r'Person: Jimmy', line):
>        person_jimmy.write(line)
>    elif re.search(r'Person: Sarah', line):
>        person_sarah.write(line)
>
>#closes all files
>
>person_jimmy.close()
>person_sarah.close()
>f.close()
>
>However, this would only produce output files that look like this:
>
>jimmy.txt:
>
>aaaaa bbbbb Person: Jimmy
>
>sarah.txt:
>
>aaaaa bbbbb Person: Sarah
>
>My question is what else do I need to add (such as an embedded loop
>where the if statements are?) so the files look like this
>
>aaaaa bbbbb Person: Jimmy
>Current Location: Denver
>Next Location: Chicago
>
>and
>
>aaaaa bbbbb Person: Sarah
>Current Location: San Diego
>Next Location: Miami
>Next Location: New York
>
>
>Basically I need to add statements that, after finding that line, copy
>all the lines following it, stopping when they see
>'----------------------------------------------'
>
>Any help is greatly appreciated.
>
Ok, I generalized on your theme of extracting file chunks to named files,
where the beginning line has the file name. I hardcoded '.txt' as the extension.
I provided a way to direct the output to a directory (I guess not necessarily a
subdirectory); the script does not create it, so it must already exist.
Not tested beyond what you see. Tweak to suit.

----< extractfilesegs.py >--------------------------------------------------------
"""
Usage: [python] extractfilesegs [source [outdir [startpat [endpat]]]]
    where source is -tf for test file, a file name, or an open file
          outdir is a directory prefix that will be joined to output file names
          startpat is a regular expression with group 1 giving the extracted file name
          endpat is a regular expression whose match line is excluded and ends the segment
"""
import re, os

def extractFileSegs(linesrc, outdir='extracteddata', start=r'Person:\s+(\w+)', stop='-'*30):
    rxstart = re.compile(start)
    rxstop = re.compile(stop)
    if isinstance(linesrc, basestring): linesrc = open(linesrc)
    lineit = iter(linesrc)
    files = []
    fout = None # so the check after the loop works even if no start line matched
    for line in lineit:
        match = rxstart.search(line)
        if not match: continue
        name = match.group(1)
        filename = name.lower() + '.txt'
        filename = os.path.join(outdir, filename)
        #print 'opening file %r'%filename
        files.append(filename)
        fout = open(filename, 'a') # append in case repeats?
        fout.write(match.group(0)+'\n') # did you want aaa bbb stuff?
        for data_line in lineit:
            if rxstop.search(data_line):
                #print 'closing file %r'%filename
                fout.close() # don't write line with ending mark
                fout = None
                break
            else:
                fout.write(data_line)
    if fout:
        fout.close()
        print 'file %r ended with source file EOF, not stop mark'%filename
    return files
    
def get_testfile():
    from StringIO import StringIO
    return StringIO("""\
...irrelevant leading
stuff ...
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
irrelevant
trailing stuff ...

with a blank line
""")

if __name__ == '__main__':
    import sys
    args = sys.argv[1:]
    if not args: raise SystemExit(__doc__)
    tf = args.pop(0)
    if tf=='-tf': fin = get_testfile()
    else: fin = tf
    if not args:
        files = extractFileSegs(fin)
    elif len(args)==1:
        files = extractFileSegs(fin, args[0])
    elif len(args)==2:
        files = extractFileSegs(fin, args[0], args[1], '^$') # stop on blank line?
    else:
        files = extractFileSegs(fin, args[0], '|'.join(args[1:-1]), args[-1])
    print '\nFiles created:'
    for fname in files:
        print '    "%s"'% fname
    if tf == '-tf':
        for fpath in files:
            print '====< %s >====\n%s============'%(fpath, open(fpath).read())
----------------------------------------------------------------------------------
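
The part of the script that does the real work is that both for loops draw
from the same iterator (lineit): the inner loop consumes the lines of one
segment, and the outer loop then resumes right after the stop mark. Below is a
minimal sketch of just that idea, written for Python 3 rather than the
Python 2.4 used in this post. The name split_segments is hypothetical, and it
collects segments in memory instead of writing files, purely to keep the
control flow visible.

```python
import re

def split_segments(lines, start=r'Person:\s+(\w+)', stop='-' * 30):
    """Return a list of (name, segment_lines) pairs.

    Hypothetical helper for illustration: same start/stop defaults as the
    script, but segments are returned instead of written to files.
    """
    rxstart = re.compile(start)
    rxstop = re.compile(stop)
    lineit = iter(lines)              # one iterator shared by both loops
    segments = []
    for line in lineit:               # outer loop: look for a start line
        match = rxstart.search(line)
        if not match:
            continue
        body = [match.group(0)]       # matched text only, as in the script
        for data_line in lineit:      # inner loop: continues where the outer one stopped
            if rxstop.search(data_line):
                break                 # stop mark is consumed but not stored
            body.append(data_line.rstrip('\n'))
        segments.append((match.group(1), body))
    return segments

sample = """\
aaaaa bbbbb Person: Jimmy
Current Location: Denver
Next Location: Chicago
----------------------------------------------
aaaaa bbbbb Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
----------------------------------------------
"""

for name, body in split_segments(sample.splitlines(True)):
    print(name, body)
```

The full script above adds the file output and the command-line handling on
top of this loop.
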

Running on your test data:

[15:19] C:\pywk\clp>md extracteddata

[15:19] C:\pywk\clp>py24 extractfilesegs.py -tf

Files created:
    "extracteddata\jimmy.txt"
    "extracteddata\sarah.txt"
====< extracteddata\jimmy.txt >====
Person: Jimmy
Current Location: Denver
Next Location: Chicago
============
====< extracteddata\sarah.txt >====
Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
============

[15:20] C:\pywk\clp>md xd

[15:20] C:\pywk\clp>py24 extractfilesegs.py -tf xd (Jimmy) ----

Files created:
    "xd\jimmy.txt"
====< xd\jimmy.txt >====
Jimmy
Current Location: Denver
Next Location: Chicago
============

[15:21] C:\pywk\clp>py24 extractfilesegs.py -tf xd "Person: (Sarah)" ----

Files created:
    "xd\sarah.txt"
====< xd\sarah.txt >====
Person: Sarah
Current Location: San Diego
Next Location: Miami
Next Location: New York
============

[15:22] C:\pywk\clp>py24 extractfilesegs.py -tf xd "^(irrelevant)"

Files created:
    "xd\irrelevant.txt"
====< xd\irrelevant.txt >====
irrelevant
trailing stuff ...
============
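
If you would rather stay close to your original single loop, the same effect
can be had without a nested loop by remembering which output file (if any) is
currently being copied to. The following is only a sketch under assumptions:
it targets Python 3, the name copy_person_blocks is hypothetical, it keeps the
whole start line (including the "aaaaa bbbbb" prefix, as in your desired
output), and unlike the script above it creates the output directory itself.

```python
import os
import re

def copy_person_blocks(lines, outdir, separator='-' * 30):
    """Append each 'Person: <name>' block to <outdir>/<name>.txt.

    Hypothetical helper for illustration; returns the paths it wrote to.
    """
    os.makedirs(outdir, exist_ok=True)
    person = re.compile(r'Person:\s+(\w+)')
    out = None                        # file for the block being copied, if any
    written = []
    try:
        for line in lines:
            if out is None:
                match = person.search(line)
                if match:             # start of a block: open the target file
                    path = os.path.join(outdir, match.group(1).lower() + '.txt')
                    out = open(path, 'a')   # append, so repeated persons accumulate
                    if path not in written:
                        written.append(path)
                    out.write(line)   # keep the whole line, prefix included
            elif separator in line:   # end of block: drop the separator line
                out.close()
                out = None
            else:
                out.write(line)       # ordinary line inside a block
    finally:
        if out is not None:           # input ended without a separator
            out.close()
    return written
```

Called as copy_person_blocks(open(sys.argv[1]), 'extracteddata'), it should
leave one file per person in extracteddata. Because the files are opened in
append mode, rerunning it adds to existing files instead of replacing them.
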

HTH, NO WARRANTIES ;-)


Regards,
Bengt Richter


