Newbie ? -- SGML metadata extraction

ProvoWallis gshepherd281281 at yahoo.com
Mon Jan 16 18:15:14 EST 2006


Hi,

I'm trying to write a script that will extract the value of an
attribute from an element using the attribute value of another element
as the basis for extraction.

For example, I have a pre-defined list of main sections, and I want to
extract the id attribute of each form element and build a dictionary of
graphic ID and section number pairs, but only for the sections on that
list. The id values from any section that is not on my list should be
excluded. I.e., I want to know the id values for the forms that appear
in sections 1 and 3 but not in 2.

Boiled down, my SGML looks something like this:

<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
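
To make the goal concrete: for the sample above, with sections 1 and 3
on the pre-defined list, the desired dictionary would look roughly like
the following. This is a hand-written illustration, not output from any
script; the name `expected` is hypothetical, and the section numbers are
assumed to be strings because both csv.reader and re return strings.

```python
# Hand-written illustration of the desired result for the sample SGML,
# assuming sections "1" and "3" are on the pre-defined list.
# Keys are the form id values, values are the enclosing section numbers.
expected = {
    "graphic_1.tif": "1",
    "graphic_2.tif": "1",
    "graphic_4.tif": "3",
    "graphic_5.tif": "3",
    "graphic_6.tif": "3",
}
# graphic_3.tif is deliberately absent: it belongs to section 2.
```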

The script below is what I have come up with on my own so far. My
problem is that I can't seem to pick up the value of the id attribute.

Any advice appreciated.

Greg

###

import os, re, csv

root = raw_input("Enter the path where the program should run: ")
fname = raw_input("Enter name of the CSV file containing the section numbers: ")
sgmlname = raw_input("Enter name of the SGML file to search: ")
print

given,ext = os.path.splitext(fname)
root_name = os.path.join(root,fname)
n = given + '.new'
outputName = os.path.join(root,n)

reader = csv.reader(open(root_name, 'r'), delimiter=',')

sections = []

for row in reader:
     sections.append(row[0])


inputFile = open(os.path.join(root,sgmlname), 'r')

illoList ={}

while 1:
     lines = inputFile.readlines()
     if not lines:
          break
     for line in lines:

               main = re.search(r'(?i)(?m)(?s)<main-section no=\"(\w+)\"', line)
               id = re.search(r'(?i)id=\"(.*?tif)\"', line)
               if main is not None and main.group(1) in sections:

                    if id is not None:

                         illoList[illo.group(1)] = main.group(1)
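
###

One possible reason the id is never picked up, offered as a minimal
sketch rather than a definitive fix: in the sample, the <main-section>
tag and the <form> tags sit on different lines, so a per-line test that
needs both matches on the same line never succeeds. (The script also
refers to `illo`, which is never defined; `id` was presumably meant.)
The sketch below remembers the most recent section number and files
each form id under it. The function name `map_graphics_to_sections`,
the inline `sample` string, and the slightly tightened patterns are
assumptions for illustration, not part of the original script.

```python
import re

# Patterns adapted from the script above (assumption: ids never contain
# a double quote, so [^"]* is used instead of .*?).
SECTION_RE = re.compile(r'<main-section\s+no="(\w+)"', re.IGNORECASE)
FORM_ID_RE = re.compile(r'<form\s+id="([^"]*tif)"', re.IGNORECASE)


def map_graphics_to_sections(lines, sections):
    """Return {graphic id: section number} for forms in wanted sections."""
    wanted = set(sections)
    current = None  # number of the most recent <main-section> seen so far
    result = {}
    for line in lines:
        section = SECTION_RE.search(line)
        if section is not None:
            current = section.group(1)
        # A line may hold several form tags, so collect every id on it.
        for graphic in FORM_ID_RE.findall(line):
            if current in wanted:
                result[graphic] = current
    return result


# Inline copy of the boiled-down SGML from the post, used as test input.
sample = """<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""

illo_list = map_graphics_to_sections(sample.splitlines(), ["1", "3"])
```

With a real file, the open file object could be passed in place of
`sample.splitlines()`, and the `sections` list read from the CSV file
could be passed as the second argument.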



