Extracting Text file contents using Python

Mike C. Fletcher mfletch at vrtelecom.com
Mon Jun 28 09:26:19 EDT 1999


Note that this loop tests for the separator in the same pass that opens the
html table, so a table can get closed as soon as it opens. Not sure if
that's the problem, but you probably want an elif in here...
>         if begin_table:
>                 out.write('<table border = "1">')
>                 begin_table = 0
>         elif line[:5] == "=====":
>                 out.write('</table>')
>                 out.write('<br>')
>                 begin_table = 1
>                 continue
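
For comparison, here is a minimal sketch of the whole per-line loop in
current Python. It is not the original poster's program: the name
records_to_html, the use of html.escape, and the decision to skip blank or
colon-less lines are all my assumptions, and I can't say for sure that it
cures the symptom described. It opens a table only when a real field line
arrives and closes it only if one is open.

```python
from html import escape

def records_to_html(lines):
    """Build one HTML table per record from 'key:value' lines."""
    out = []
    in_table = False
    for line in lines:
        line = line.strip()
        if line[:5] == '=====':
            # separator: close the current table, if one is open
            if in_table:
                out.append('</table>\n<br>\n')
                in_table = False
            continue
        if ':' not in line:
            # blank or malformed line: skip it instead of failing on unpack
            continue
        if not in_table:
            out.append('<table border="1">\n')
            in_table = True
        # split on the first colon only, so values may contain ':'
        name, value = line.split(':', 1)
        out.append('\t<tr><td>%s</td><td>%s</td></tr>\n'
                   % (escape(name), escape(value)))
    if in_table:
        # input ended without a trailing separator
        out.append('</table>\n')
    return ''.join(out)

page = records_to_html([
    'Title:Meet Me\n',
    'Rating:8\n',
    '=================================\n',
    '\n',
    'Title:Meet Me\n',
    'Rating:8\n',
])
print(page)
```

The blank line after the separator is ignored here; in the quoted loop a
line without a colon would make the two-name unpacking of the split fail.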

Here's a function that seems to do pretty much what you want. It's not
very clearly written, but that's mostly because I prefer maps and filters
and such; a per-line looping function with counters would, of course,
work too.

import string
DEFRECSEP = '================================='
DEFFIELDSEP = '\n'
DEFKEYSEP = ':'
def loadfile( filename,
        recordsep = DEFRECSEP,
        fieldsep= DEFFIELDSEP,
        keysep=DEFKEYSEP
    ):
    # read in whole file
    tempdata = open( filename).read()
    # break into records
    tempdata = string.split( tempdata, recordsep )
    # Get rid of extra whitespace
    # and null records...
    tempdata = filter( None, map( string.strip, tempdata ))
    # for each record, get the field values
    for i in range( len( tempdata)):
        # split into the lines
        fields = string.split( tempdata[i], fieldsep )
        # strip trailing/leading whitespace
        fields = map( string.strip, fields )
        # get the key-value pairs
        fields = map( string.split, fields, [ keysep ]*len(fields),
                [1]*len(fields) )
        # should probably do a strip here too...
        tempdata[i] = fields
    return tempdata

So, you get a structure that looks like this:
[
    [
        [key, value],
        [key, value],
        ...
    ],
    [
        [key, value],
        [key, value],
        ...
    ]
]

which you can then walk like this:

dataset = loadfile( filename )
for record in dataset:
    for key, value in record:
        dosomething_with_key_and_value(key, value)
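
The loadfile function above leans on the string module functions
(string.split and friends), which Python 3 no longer provides, and on map
and filter returning lists. A rough Python 3 sketch of the same idea
follows. The names parse_records and sample are mine, it takes the text
rather than a filename so it can be tried without a file, and it uses a
plain loop instead of maps; treat it as an illustration, not a drop-in
replacement.

```python
DEFRECSEP = '================================='

def parse_records(text, recordsep=DEFRECSEP, fieldsep='\n', keysep=':'):
    """Return a list of records, each a list of [key, value] pairs."""
    records = []
    for chunk in text.split(recordsep):
        chunk = chunk.strip()
        if not chunk:
            # skip the empty piece left after the final separator
            continue
        fields = []
        for line in chunk.split(fieldsep):
            line = line.strip()
            if keysep not in line:
                # ignore blank or malformed lines rather than failing
                continue
            # split on the first separator only, so values may contain ':'
            key, value = line.split(keysep, 1)
            fields.append([key.strip(), value.strip()])
        records.append(fields)
    return records

# a shortened stand-in for the text file in the original post
sample = (
    "Title:Meet Me\nRating:8\nTime:17.15.28\n"
    + DEFRECSEP + "\n"
    + "Title:Meet Me\nRating:8\n"
    + DEFRECSEP + "\n"
)

dataset = parse_records(sample)
for record in dataset:
    for key, value in record:
        print(key, '=', value)
```

To read from disk, something like parse_records(open(filename).read())
should give the same nested lists.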

Incidentally, you'd probably find this kind of thing easier with a simple
database module.  Wouldn't be plain-text, but much easier (and faster) to
use.  If you really wanted convenience, you could even use shelve and just
dump instances and/or data structures directly to disk so you don't need to
do any parsing at all.
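
A small sketch of the shelve idea, in current Python. The record layout
(a list of dictionaries), the key name 'reviews' and the temporary path
are assumptions made for the example; the point is only that the structure
goes to disk and comes back without any parsing code.

```python
import os
import shelve
import tempfile

# hypothetical records; dictionaries are just one convenient shape
reviews = [
    {'Title': 'Meet Me', 'Rating': '8', 'Date': '06/25/99'},
    {'Title': 'Meet Me', 'Rating': '8', 'Date': '06/25/99'},
]

# a throwaway location so the example does not touch real files
path = os.path.join(tempfile.mkdtemp(), 'reviews')

# store the whole list under one key; shelve pickles it for us
with shelve.open(path) as db:
    db['reviews'] = reviews

# later (or in another program): read it back, no parsing needed
with shelve.open(path) as db:
    loaded = db['reviews']

for review in loaded:
    print(review['Title'], review['Rating'])
```

The trade-off is the one mentioned above: the file on disk is no longer
plain text you can edit by hand.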

Hope this helps,
Mike




john <john at mediamanager.com.sg> wrote in message
news:37782f9a.0 at news.smartnet.com.sg...
> Hi,
>
> I am trying to extract the following text file using python and the program
> & text file are as shown below.
>
> But unfortunately I am unable to get every separate data displayed except
> for the first data and after the separation line which is
> "====================" the next data is not getting picked up by the
> program.
> Could anyone kindly let me know what could be wrong with the program.
>
> Example Text File - But will be similar
> ----------------------------------------------------
>
> Title:Meet Me
> Http Host:www.www.www
> Remote Address:10.2.0.1
> Remote Host:10.2.0.1
> Rating:8
> Date:06/25/99
> Time:17.15.28
> =================================
> Title:Meet Me
> Http Host:www.www.www
> Remote Address:10.2.0.1
> Remote Host:10.2.0.1
> Rating:8
> Date:06/25/99
> Time:17.15.28
> =================================
> Title:Meet Me
> Http Host:www.www.www
> Remote Address:10.2.0.1
> Remote Host:10.2.0.1
> Rating:8
> Date:06/25/99
> Time:17.15.28
> =================================
> Title:Meet Me
> Http Host:www.www.www
> Remote Address:10.2.0.1
> Remote Host:10.2.0.1
> Rating:8
> Date:06/25/99
> Time:17.15.28
> =================================
> Title:Meet Me
> Http Host:www.www.www
> Remote Address:10.2.0.1
> Remote Host:10.2.0.1
> Rating:8
> Date:06/25/99
> Time:17.15.28
> =================================
>
> The python program is as follows:
>
> #!c:/progra~1/python
>
> import string
> import sys
> import cgi
> #import urllib
>
> data_file = open('c:/xitami/cgi-bin/textfiles/FilmFestReview.txt', 'r')
> form = cgi.FieldStorage()
> header = \
> """
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
> <html>
> <head>
> <title>Extract & Write</title>
> </head>
> <body>
> """
>
> footer = \
> """
> </body>
> </html>
> """
>
> out = sys.stdout
>
> begin_table = 1
> out.write(header)
> for line in data_file.readlines():
>         if begin_table:
>                 out.write('<table border = "1">')
>                 begin_table = 0
>         if line[:5] == "=====":
>                 out.write('</table>')
>                 out.write('<br>')
>                 begin_table = 1
>                 continue
>         field_name, field_value = string.split(line, ':')
>         out.write('\t<tr>\n')
>         out.write('\t\t<td>\n')
>         out.write('\t\t\t%s\n' % field_name)
>         out.write('\t\t</td>\n')
>         out.write('\t\t<td>\n')
>         out.write('\t\t\t%s\n' % field_value)
>         out.write('\t\t</td>\n')
>         out.write('\t</tr>\n')
>
>
>
> out.write(footer)
>
>
>
>





