Parsing SGML document in Python program

Martin v. Loewis martin at v.loewis.de
Mon Oct 21 17:59:10 EDT 2002


pinard at iro.umontreal.ca (François Pinard) writes:

> This was a while ago, but _if_ I remember well, the documentation was very
> clear when I needed it.  The trick is to read slowly and carefully! :-)

In any case, I think we can safely post the code in question.

Regards,
Martin

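import os
from gettext import gettext as _  # assumed: the original relied on a gettext-style _()


def read_sgml_file(name):
    """Run nsgmls on `name` and return the document element as nested tuples.

    Each element becomes (tag, [attrs_dict,] child, ...); text children are strings.
    """
    stack = []      # partially built ancestors of the current element
    current = []    # tag, optional attributes and children of the open element
    attrs = {}      # attributes collected for the next start tag
    # Avoid docbk30, which raises some unanalysed interference.
    # Also request UTF-8 processing
    command = 'SGML_CATALOG_FILES= SP_ENCODING=UTF-8 SP_CHARSET_FIXED=YES nsgmls %s' % name
    for line in os.popen(command).readlines():
        if line[0] == '(':
            # Start tag: push the parent and open a new element.
            stack.append(current)
            current = [line[1:-1].lower()]
            if attrs:
                current.append(attrs)
                attrs = {}
            continue
        if line[0] == ')':
            # End tag: freeze the element and attach it to its parent.
            element = tuple(current)
            current = stack.pop()
            current.append(element)
            continue
        if line[0] == '-':
            # Character data: undo the nsgmls escapes for newline and tab.
            line = line[1:-1]
            line = line.replace('\\n', '\n')
            line = line.replace('\\011', '\t')
            line = line.rstrip()
            current.append(line)
            continue
        if line[0] == 'A':
            # Attribute line: "Aname type value", emitted before the start tag.
            attr = line[1:].split()
            if attr[1] == "IMPLIED":
                continue
            if attr[1] == "TOKEN":
                attrs[attr[0].lower()] = attr[2].lower()
                continue
            raise ValueError(_("Unsupported attribute %r") % (attr,))
        if line[0] == 'C':
            # nsgmls prints "C" last when the document is conforming.
            return current[0]
    raise ValueError(_("SGML in `%s' is not conformant.\n") % name)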
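
To see what shape the function returns without having nsgmls installed, here is a rough sketch that runs the same line-by-line logic over a hand-written sample. The helper name parse_esis and the sample lines are my own assumptions for illustration, not part of the posted code, and the sample is only meant to resemble nsgmls output.

```python
def parse_esis(lines):
    """Build nested tuples from nsgmls-style output lines.

    Hypothetical helper: same logic as read_sgml_file, minus the subprocess.
    """
    stack, current, attrs = [], [], {}
    for line in lines:
        kind, rest = line[0], line[1:].rstrip('\n')
        if kind == '(':                      # start tag
            stack.append(current)
            current = [rest.lower()]
            if attrs:
                current.append(attrs)
                attrs = {}
        elif kind == ')':                    # end tag
            element = tuple(current)
            current = stack.pop()
            current.append(element)
        elif kind == '-':                    # character data
            text = rest.replace('\\n', '\n').replace('\\011', '\t')
            current.append(text.rstrip())
        elif kind == 'A':                    # attribute for the next start tag
            attr = rest.split()
            if attr[1] == 'TOKEN':
                attrs[attr[0].lower()] = attr[2].lower()
            elif attr[1] != 'IMPLIED':
                raise ValueError('Unsupported attribute %r' % (attr,))
        elif kind == 'C':                    # document was conforming
            return current[0]
    raise ValueError('input is not conformant')


# Made-up sample, roughly what a tiny document with a title and a
# paragraph might produce; attribute lines come before their start tag.
sample = [
    '(DOC\n',
    'AALIGN TOKEN LEFT\n',
    '(TITLE\n',
    '-Hello\n',
    ')TITLE\n',
    'AID IMPLIED\n',
    '(P\n',
    '-One\\nTwo\n',
    ')P\n',
    ')DOC\n',
    'C\n',
]
tree = parse_esis(sample)
print(tree)
```

For this sample the result is ('doc', ('title', {'align': 'left'}, 'Hello'), ('p', 'One\nTwo')): each element is a tuple starting with the lowercased tag name, followed by an attribute dictionary when one was collected, then the children in document order.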



