Parsing SGML document in Python program
Martin v. Loewis
martin at v.loewis.de
Mon Oct 21 17:59:10 EDT 2002
pinard at iro.umontreal.ca (François Pinard) writes:
> This was a while ago, but _if_ I remember well, the documentation was very
> clear when I needed it. The trick is to read slowly and carefully! :-)
In any case, I think we can safely post the code in question.
Regards,
Martin
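import os
import string

# `_` is assumed to be a gettext-style translation function defined
# elsewhere in the program.

def read_sgml_file(name):
    stack = []      # enclosing elements that are still open
    current = []    # element being built: [name, optional attrs, children...]
    attrs = {}      # attributes collected for the next start tag
    # Avoid docbk30, which raises some unanalysed interference.
    # Also request UTF-8 processing
    for line in os.popen('SGML_CATALOG_FILES= SP_ENCODING=UTF-8 SP_CHARSET_FIXED=YES nsgmls %s' % name).readlines():
        # '(' starts an element; pending attributes are attached to it.
        if line[0] == '(':
            stack.append(current)
            current = [string.lower(line[1:-1])]
            if attrs:
                current.append(attrs)
                attrs = {}
            continue
        # ')' ends an element: freeze it and append it to its parent.
        if line[0] == ')':
            element = tuple(current)
            current = stack[-1]
            del stack[-1]
            current.append(element)
            continue
        # '-' is character data, with newline and tab escaped by nsgmls.
        if line[0] == '-':
            line = line[1:-1]
            line = string.replace(line, '\\n', '\n')
            line = string.replace(line, '\\011', '\t')
            line = string.rstrip(line)
            current.append(line)
            continue
        # 'A' is an attribute of the element that starts next.
        if line[0] == 'A':
            attr = line[1:].split()
            if attr[1] == "IMPLIED":
                continue
            if attr[1] == "TOKEN":
                attrs[attr[0].lower()] = attr[2].lower()
                continue
            raise ValueError,_("Unsupported attribute %s") % `attr`
        # 'C' means the document was conforming; return the root element.
        if line[0] == 'C':
            return current[0]
    raise ValueError,_("SGML in `%s' is not conformant.\n") % name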
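
For anyone trying this on a current Python 3, here is a rough sketch of the same approach, not a tested replacement for the function above. It assumes the indentation reconstructed from the logic of the posted code. The parsing is split into a hypothetical helper, parse_esis, so it can be tried on a list of lines without nsgmls installed. The subprocess-based read_sgml_file wrapper and the plain (untranslated) error messages are likewise assumptions, not part of the original post.

```python
import os
import subprocess


def parse_esis(lines):
    """Build a nested tuple tree from nsgmls ESIS output lines."""
    stack = []      # enclosing elements that are still open
    current = []    # element being built: [name, optional attrs, children...]
    attrs = {}      # attributes collected for the next start tag
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        kind, rest = line[0], line[1:]
        if kind == "(":
            # Start of element: pending attributes are attached to it.
            stack.append(current)
            current = [rest.lower()]
            if attrs:
                current.append(attrs)
                attrs = {}
        elif kind == ")":
            # End of element: freeze it and append it to its parent.
            element = tuple(current)
            current = stack.pop()
            current.append(element)
        elif kind == "-":
            # Character data, with newline and tab escaped by nsgmls.
            text = rest.replace("\\n", "\n").replace("\\011", "\t")
            current.append(text.rstrip())
        elif kind == "A":
            attr = rest.split()
            if attr[1] == "IMPLIED":
                continue
            if attr[1] == "TOKEN":
                attrs[attr[0].lower()] = attr[2].lower()
                continue
            raise ValueError("Unsupported attribute %r" % (attr,))
        elif kind == "C":
            # 'C' marks a conforming document; return the root element.
            return current[0]
    raise ValueError("ESIS stream did not end with a 'C' line")


def read_sgml_file(name):
    """Run nsgmls on `name` (assumed to be on PATH) and parse its output."""
    env = dict(os.environ, SGML_CATALOG_FILES="",
               SP_ENCODING="UTF-8", SP_CHARSET_FIXED="YES")
    proc = subprocess.run(["nsgmls", name], env=env,
                          stdout=subprocess.PIPE, encoding="utf-8")
    return parse_esis(proc.stdout.splitlines(keepends=True))


# Hand-written sample lines in the shape nsgmls emits, for illustration only.
sample = [
    "ATYPE TOKEN NOTE\n",
    "(DOC\n",
    "(PARA\n",
    "-Hello\\nworld\n",
    ")PARA\n",
    ")DOC\n",
    "C\n",
]
tree = parse_esis(sample)
```

With the sample lines, tree comes out as a tuple holding the lowercased element name, the attribute dictionary, and then the children. As in the posted code, line types other than the five handled are silently skipped, and attribute types other than IMPLIED and TOKEN raise ValueError.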