Newbie Text Processing Question

Mike Meyer mwm at mired.org
Wed Oct 5 00:38:08 EDT 2005


gshepherd281 at earthlink.net writes:
> I'm a total newbie to Python so any and all advice is greatly
> appreciated.

Well, I've got some for you.

> I'm trying to use regular expressions to process text in an SGML file
> but only in one section.

This is generally a bad idea. SGML family languages aren't easy to
parse - even the ones that were designed to be easy to parse - and
generally require very complex regular expressions to get right. It may
be that your SGML data can be parsed by the re you use, but there
are almost certainly valid SGML documents that your parser will not
properly parse.

In general, it's better to use a parser for the language in question.

> So the input would look like this:
>
> <ch-part no="I"><title>RESEARCH GUIDE
> <sec-main no="1.01"><title>content
> <para>content
>
> <sec-main no="2.01"><title>content
> <para>content
>
>
> <ch-part no="II"><title>FORMS
> <sec-main no="3.01"><title>content
>
> <sec-sub1 no="1"><title>content
> <para>content
>
> <sec-sub2 no="1"><title>content
> <para>content


This is funny-looking SGML. Are the end tags really optional for
all the tag types?

> But no matter what I try I end up changing the entire file rather than
> just one part.

Others have explained why you can't do that, so I'll skip it.

> Here's what I've come up with so far but I can't think of anything
> else.
>
> ***
>
> import os, re
> setpath = raw_input("Enter the path where the program should run: ")
> print
>
> for root, dirs, files in os.walk(setpath):
>      fname = files
>      for fname in files:
>           inputFile = file(os.path.join(root,fname), 'r')
>           line = inputFile.read()
>           inputFile.close()
>
>
>           chpart_pattern = re.compile(r'<ch-part
> no=\"[A-Z]{1,4}\"><title>(RESEARCH)', re.IGNORECASE)

This makes a number of assumptions that are invalid about SGML in
general, but may be valid for your sample text - how attributes are
quoted, the lack of line breaks, which can be added without changing
the content, and the format of the "no" attribute.
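
To make those assumptions concrete, here is a small sketch that runs the
pattern from the quoted code (with the wrapped line rejoined by a single
space) against a few variants of the same start tag. The variant strings
and the names "variants" and "results" are invented for illustration;
only the first string comes from the posted sample.

```python
import re

# The pattern from the quoted code, rejoined onto one line.
chpart_pattern = re.compile(
    r'<ch-part no="[A-Z]{1,4}"><title>(RESEARCH)', re.IGNORECASE)

# Hypothetical variants of the same start tag; only "as posted" is
# taken from the sample input.
variants = {
    "as posted": '<ch-part no="I"><title>RESEARCH GUIDE',
    "single quotes": "<ch-part no='I'><title>RESEARCH GUIDE",
    "line break in tag": '<ch-part\nno="I"><title>RESEARCH GUIDE',
    "line break between tags": '<ch-part no="I">\n<title>RESEARCH GUIDE',
    "longer numeral": '<ch-part no="XVIII"><title>RESEARCH GUIDE',
}

# True where the pattern finds the chapter, False where it misses it.
results = dict(
    (name, bool(chpart_pattern.search(text)))
    for name, text in variants.items())
```

Only the "as posted" entry comes out True. Each of the other variants
breaks one of the assumptions listed above, so the pattern silently
skips the chapter.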

>           while 1:
>                if chpart_pattern.search(line):
>                     line = re.sub(r"<sec-main
> no=(\"[0-9]*.[0-9]*\")><title>(.*)", r"<sec-main
> no=\1><title>\2\n<biblio>", line)

Ditto.
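
If you stay with regular expressions anyway, one way to keep the change
inside a single chapter is to cut the text at each <ch-part start tag
and run the substitution only on the piece that begins with the chapter
you want. This is a rough sketch, not a robust solution: it keeps every
assumption criticized above, the function name add_biblio and the
pattern names are invented here, and the dot in the section number is
escaped, which the posted pattern does not do.

```python
import re

# Same assumptions as the posted patterns: double-quoted attributes,
# no line breaks inside or between the tags.
RESEARCH = re.compile(
    r'<ch-part no="[A-Z]{1,4}"><title>RESEARCH', re.IGNORECASE)
SEC_TITLE = re.compile(r'(<sec-main no="[0-9]*\.[0-9]*"><title>.*)')

def add_biblio(text):
    """Add a <biblio> line after each sec-main title, RESEARCH chapter only."""
    # Offsets where each chapter starts, plus the two ends of the text.
    starts = [m.start() for m in re.finditer(r'<ch-part\b', text)]
    bounds = [0] + starts + [len(text)]
    pieces = []
    for begin, end in zip(bounds, bounds[1:]):
        chunk = text[begin:end]
        if RESEARCH.match(chunk):
            # '.' does not match a newline, so \1 is the whole title line.
            chunk = SEC_TITLE.sub(r'\1\n<biblio>', chunk)
        pieces.append(chunk)
    return ''.join(pieces)

sample = ('<ch-part no="I"><title>RESEARCH GUIDE\n'
          '<sec-main no="1.01"><title>content\n'
          '<para>content\n\n'
          '<ch-part no="II"><title>FORMS\n'
          '<sec-main no="3.01"><title>content\n')
changed = add_biblio(sample)
```

Everything outside the matching chapter is passed through unchanged.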

Here's an sgmllib solution that does what you do above, except that
it writes the result to standard output:

#!/usr/bin/env python

import sys
from sgmllib import SGMLParser

datain = """
<ch-part no="I"><title>RESEARCH GUIDE
<sec-main no="1.01"><title>content
<para>content

<sec-main no="2.01"><title>content
<para>content


<ch-part no="II"><title>FORMS
<sec-main no="3.01"><title>content

<sec-sub1 no="1"><title>content
<para>content

<sec-sub2 no="1"><title>content
<para>content
"""

class Parser(SGMLParser):

    def __init__(self):
        # install the handlers with funny names
        setattr(self, "start_ch-part", self.handle_ch_part)

        # And start with chapter 0
        self.ch_num = 0

        SGMLParser.__init__(self)

    def format_attributes(self, attributes):
        return ['%s="%s"' % pair for pair in attributes]

    def unknown_starttag(self, tag, attributes):
        taglist = self.format_attributes(attributes)
        taglist.insert(0, tag)
        sys.stdout.write('<%s>' % ' '.join(taglist))

    def handle_data(self, data):
        sys.stdout.write(data)

    def handle_ch_part(self, attributes):
        """This should be called start_ch-part, but, well, you know."""

        self.unknown_starttag('ch-part', attributes)
        for name, value in attributes:
            if name == 'no':
                self.ch_num = value

    def start_para(self, attributes):
        if self.ch_num == 'I':
            sys.stdout.write('<biblio>\n')
        self.unknown_starttag('para', attributes)


parser = Parser()
parser.feed(datain)
parser.close()


sgmllib isn't a very good SGML parser - it was written to support
htmllib, and really only handles that subset of SGML well. In
particular, it doesn't really understand DTDs, so it can't handle the
missing end tags in your example. You may be able to work around that.

If you can coerce this to XML, then the XML tools in the standard
library will work well. For HTML, I like BeautifulSoup, but that's
mostly because it deals with all the crud on the net that is passed
off as HTML. For SGML - well, I don't have a good answer. Last time I
had to deal with real SGML, I used a C parser that spat out a parse
tree that could be parsed properly.
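
For the XML route, a minimal sketch using xml.etree.ElementTree might
look like the following. It assumes the data has already been made
well-formed: the end tags, the <doc> root element, and the nesting of
<para> inside <sec-main> are all invented for illustration and are not
taken from the original file.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed version of the sample: end tags added and a
# <doc> root element invented so there is a single top-level element.
xml_in = """<doc>
<ch-part no="I"><title>RESEARCH GUIDE</title>
<sec-main no="1.01"><title>content</title>
<para>content</para></sec-main>
</ch-part>
<ch-part no="II"><title>FORMS</title>
<sec-main no="3.01"><title>content</title>
<para>content</para></sec-main>
</ch-part>
</doc>"""

root = ET.fromstring(xml_in)
for chapter in root.findall("ch-part"):
    if chapter.get("no") != "I":
        continue  # leave every other chapter untouched
    # Take a snapshot of the sections before modifying the tree.
    for section in list(chapter.iter("sec-main")):
        title = section.find("title")
        biblio = ET.Element("biblio")
        biblio.tail = "\n"
        # Insert the new element directly after the section's <title>.
        section.insert(list(section).index(title) + 1, biblio)

xml_out = ET.tostring(root, encoding="unicode")
```

Because the parser knows where each chapter begins and ends, limiting
the change to one chapter is just a test on the "no" attribute.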

     <mike
-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.


