[Tutor] complex file parsing

Magnus Lycka magnus@thinkware.se
Wed, 25 Sep 2002 14:02:48 +0200


At 20:11 2002-09-24 -0500, Tim Wilson wrote:
>What I'd like to know is if there's a decent chance of creating a parser
>that can pull out the data from this record. I'm particularly interested
>in the Descriptor field which consists of one or more major descriptors
>and one or more minor descriptors. Of course, separating all the useful
>bits would be handy. I wonder if this would be an application where
>creating an XML file would be useful?

Doesn't seem to be very difficult, does it?

I'm not absolutely sure that the file came through
to us the way it looked in the original though. Are
there really line breaks in the data fields? Unless
we know the names of all headers, we would have trouble
with lines like:

'''
publication. Libraries that Own Item: 149Connect to the catalog at
'''

How do we know that "publication. Libraries that Own Item" isn't
a field heading, like "Author(s)"? And what if the title were:

Title:          Going back to the
Source: Remembering our ancestry.

or something like that... How would we know that the second line
of the Title data wasn't the Source field?

I'd say the format is broken if it really looks the way it does in
your mail. Not even whitespace at the beginning of continuation lines?
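
If the field names are known up front, the first of these ambiguities
goes away: any line that does not start with a known heading can be
treated as a continuation of the previous field. The sketch below
assumes exactly that. KNOWN_HEADERS and SKIP_LINES are hypothetical
lists copied from the sample record, and parse_record is a made-up
helper name. It does nothing for the second problem, where a
continuation line happens to start with a real heading such as
"Source:".

```python
# Hypothetical list of headings, copied from the sample record.
KNOWN_HEADERS = [
    "Ownership", "Accession No", "Author(s)", "Title", "Source",
    "Standard No", "Clearinghouse", "Language", "Abstract", "Descriptor",
    "Document Type", "Record Type", "Announcement", "Provider", "Database",
]
# Bare section headings that carry no data (assumption).
SKIP_LINES = ["SUBJECT(S)"]


def parse_record(text):
    """Return a dict mapping each known heading to its joined value."""
    fields = {}
    current = None
    for line in text.split("\n"):
        line = line.strip()
        if not line or line in SKIP_LINES:
            continue
        key, sep, rest = line.partition(":")
        if sep and key.strip() in KNOWN_HEADERS:
            # A new field starts here.
            current = key.strip()
            fields[current] = rest.strip()
        elif current is not None:
            # Not a known heading, so treat it as a continuation line.
            fields[current] = (fields[current] + " " + line).strip()
    return fields
```

With this, "publication. Libraries that Own Item: 149..." ends up as
part of the Ownership value instead of becoming a field of its own.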

But never mind, we can still do things:

import re
author_title = re.compile(r'Author\(s\):\s*(.+?)\nTitle:\s*(.+?)\nSource:',
                          re.DOTALL)
for record in author_title.findall(text):
    print("Author: %s\nTitle: %s\n\n" % record)

Assuming that the text is in the variable "text", those lines will
print all Authors and Titles. For Descriptors, you might have to
solve it in two steps.
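
As a rough sketch of what the two steps for Descriptors could look
like: first cut out the whole Descriptor block with a regular
expression, then walk its lines and sort them into major and minor
lists. This assumes that "Document Type:" always follows the
descriptors, as it does in the sample record. The names
descriptor_block, split_descriptors and sample are made up for the
illustration.

```python
import re

# Step 1: grab everything between "Descriptor:" and the next field.
# Assumes "Document Type:" always comes right after the descriptors.
descriptor_block = re.compile(r'Descriptor:\s*(.+?)\nDocument Type:',
                              re.DOTALL)


def split_descriptors(block):
    """Step 2: sort the lines of one block into major and minor lists."""
    result = {'Major': [], 'Minor': []}
    current = None
    for line in block.split('\n'):
        line = line.strip()
        m = re.match(r'\((Major|Minor)\):\s*(.*)', line)
        if m:
            # A "(Major):" or "(Minor):" marker switches the target list.
            current = m.group(1)
            line = m.group(2).strip()
        if line and current:
            result[current].append(line)
    return result


# Sample data taken from the first record in the original mail.
sample = """Descriptor:    (Major):       Educational Technology
Evaluation Methods
Teacher Education
(Minor):       Higher Education
Science Education
Science Teachers
Document Type: Journal Article (CIJE)
"""

for block in descriptor_block.findall(sample):
    print(split_descriptors(block))
```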

If (as I suspect) the data fields are actually one long line,
it's really very simple. You won't even need the re module.

# N.B. Untested code follows.
# First, read all your data as one long string.
all_data = open('whatever', 'rt').read()
# Then split on record breaks.
rb = ("---------------------------------------------------------------------------"
      "-----+----------------------------------------------------------------------"
      "---------+-----------------------------------------")
records = all_data.split(rb)
# Put the stuff in a list of dicts
l = []
for record in records:
    d = {}
    for line in record.split('\n'):
        if ':' not in line:
            continue  # blank lines and bare headings like SUBJECT(S) have no key
        key, value = line.split(':', 1)
        d[key.strip()] = value.strip()
    l.append(d)
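
On the XML question: once the records sit in a list of dicts, writing
them out as XML is a small extra step. The following is only a sketch
using the standard library's xml.etree.ElementTree. The element names
records, record and field are invented here, as is the function name
records_to_xml, and the one-record list is a stand-in for the real
data.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records):
    """Build an XML string from a list of {field name: value} dicts."""
    root = ET.Element('records')
    for d in records:
        rec = ET.SubElement(root, 'record')
        for key, value in d.items():
            # Field names such as "Author(s)" are not valid XML element
            # names, so the name goes into an attribute instead.
            field = ET.SubElement(rec, 'field', name=key)
            field.text = value
    return ET.tostring(root, encoding='unicode')


# Stand-in for the list of dicts built above.
l = [{'Accession No': 'EJ646012', 'Language': 'English'}]
print(records_to_xml(l))
```

Keeping the field name in an attribute avoids having to invent a
mapping from headings like "Accession No" to legal tag names.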

>Database: ERIC
>
>
>Ownership:     FirstSearch indicates your institution subscribes to this
>publication. Libraries that Own Item: 149Connect to the catalog at
>University of Minnesota Libraries
>Accession No:  EJ646012
>Author(s):     Kumar, David D. ; Altschuld James W.
>Title:         Complementary Approaches to Evaluation of Technology in
>Science Education.
>Source:        Journal of Science Education and Technology v11 n2
>p179-191 Jun 2002
>Standard No:   ISSN:          1059-0145
>Clearinghouse: SE566729
>Language:      English
>Abstract:      Discusses an interesting and relevant case involving two
>distinct systematic evaluations, traditional as well as somewhat
>nontraditional, of a science teacher education project with a
>heavy technology emphasis. Reports the complexity of evaluating
>technology projects and the multifaceted ways in which the evaluation
>endeavor could be approached. (Contains 23 references.)
>(Author/YDS)
>SUBJECT(S)
>Descriptor:    (Major):       Educational Technology
>Evaluation Methods
>Teacher Education
>(Minor):       Higher Education
>Science Education
>Science Teachers
>Document Type: Journal Article (CIJE)
>Record Type:   080 Journal Articles; 143 Reports--Research
>Announcement:  CIJSEP2002
>Provider:        OCLC
>Database: ERIC
>--------------------------------------------------------------------------------+-------------------------------------------------------------------------------+-----------------------------------------
>
>Ownership:     FirstSearch indicates your institution subscribes to this
>publication. Libraries that Own Item: 149Connect to the catalog at
>University of Minnesota Libraries
>Accession No:  EJ646006
>Author(s):     Marbach-Ad, Gili ; Sokolove, Phillip G.
>Title:         The Use of E-Mail and In-Class Writing To Facilitate
>Student-Instructor Interaction in Large-Enrollment Traditional and
>Active Learning Classes.
>Source:        Journal of Science Education and Technology v11 n2
>p109-119 Jun 2002
>Standard No:   ISSN:          1059-0145
>Clearinghouse: SE566723
>Language:      English



-- 
Magnus Lyckå, Thinkware AB
Älvans väg 99, SE-907 50 UMEÅ
tel: 070-582 80 65, fax: 070-612 80 65
http://www.thinkware.se/  mailto:magnus@thinkware.se