[Tutor] making a custom file parser?

Lie Ryan lie.1296 at gmail.com
Sat Jan 7 22:51:58 CET 2012


On 01/08/2012 04:53 AM, Alex Hall wrote:
> Hello all,
> I have a file with xml-ish code in it, the definitions for units in a
> real-time strategy game. I say xml-ish because the tags are like xml,
> but no quotes are used and most tags do not have to end. Also,
> comments in this file are prefaced by an apostrophe, and there is no
> multi-line commenting syntax. For example:
>
> <unit>
> <number=1>
> <name=my unit>
> <canMove=True>
> <canCarry=unit2, unit3, unit4>
> 'this line is a comment
> </unit>
>

The format is closer to SGML than to XML, except that a tag can carry a 
value. You would probably have a better chance of transforming this into 
SGML than into XML.

Try this regular expression substitution (it needs "import re" first):

s = re.sub('<([a-zA-Z]+)=([^>]+)>', r'<\1 __attribute__="\2">', s)

and use an SGML parser to parse the result. I find Fredrik Lundh's 
sgmlop easier to use for this one; you can install it with easy_install 
or pip.
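
To see what the substitution produces before the parser gets involved, here is a small standalone check. It is only a sketch: the helper name to_sgml and the short sample string are made up for illustration, and the pattern is the same one as above.

```python
import re

# Same pattern as above: a tag name, '=', then everything up to the closing '>'.
TAG_WITH_VALUE = re.compile(r'<([a-zA-Z]+)=([^>]+)>')

def to_sgml(text):
    """Rewrite <tag=value> as <tag __attribute__="value"> (hypothetical helper)."""
    return TAG_WITH_VALUE.sub(r'<\1 __attribute__="\2">', text)

sample = "<unit>\n<number=1>\n<name=my unit>\n</unit>\n"
converted = to_sgml(sample)
print(converted)
# <unit>
# <number __attribute__="1">
# <name __attribute__="my unit">
# </unit>
```

Tags without an "=" (such as <unit> and </unit>) do not match the pattern, so they pass through unchanged. The full sgmlop example follows.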

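import re      # needed for the re.sub call further down
import sgmlop

class Unit(object): pass   # plain container; attributes are set while parsing

class handler(object):
    def __init__(self):
        self.units = {}    # finished units, keyed by unit name

    def finish_starttag(self, tag, attrs):
        # attrs arrives as (name, value) pairs; the value we injected
        # with re.sub sits under the '__attribute__' key
        attrs = dict(attrs)
        if tag == 'unit':
            self.current = Unit()
        elif tag == 'number':
            self.current.number = int(attrs['__attribute__'])
        elif tag == 'canmove':
            self.current.canmove = attrs['__attribute__'] == 'True'
        elif tag in ('name', 'cancarry'):
            setattr(self.current, tag, attrs['__attribute__'])
        else:
            print 'unknown tag', tag, attrs

    def finish_endtag(self, tag):
        # </unit> closes the current unit: store it under its name
        if tag == 'unit':
            self.units[self.current.name] = self.current
            del self.current

    def handle_data(self, data):
        # any non-blank text between tags is just printed
        if not data.isspace(): print data.strip()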

s = '''
<unit>
<number=1>
<name=my unit>
<canMove=True>
<canCarry=your unit, her unit, his unit>
'this line is a comment
</unit>
<unit>
<number=2>
<name=your unit>
<canMove=False>
<canCarry=her unit, his unit>
'this line is a comment
</unit>
<unit>
<number=3>
<name=her unit>
<canMove=True>
<canCarry=her unit>
'this line is a comment
</unit>
<unit>
<number=4>
<name=his unit>
<canMove=True>
<canCarry=his unit, her unit>
'this line is a comment
</unit>
'''
s = re.sub('<([a-zA-Z]+)=([^>]+)>', r'<\1 __attribute__="\2">', s)
parser = sgmlop.SGMLParser()
h = handler()
parser.register(h)
parser.parse(s)
print h.units
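
Two loose ends in the example above: the apostrophe comments are plain text to the parser, so they will most likely end up in handle_data and get printed, and cancarry is stored as one comma-separated string. Below is a minimal sketch of how both could be handled, assuming comments always occupy a whole line (as in the sample) and unit names never contain commas. strip_comments and split_names are hypothetical helper names, not part of sgmlop.

```python
def strip_comments(text):
    """Drop lines whose first non-blank character is an apostrophe."""
    kept = [line for line in text.splitlines()
            if not line.lstrip().startswith("'")]
    return "\n".join(kept)

def split_names(value):
    """Turn 'your unit, her unit' into ['your unit', 'her unit']."""
    return [name.strip() for name in value.split(",") if name.strip()]

cleaned = strip_comments("<unit>\n'this line is a comment\n</unit>")
names = split_names("your unit, her unit, his unit")
```

You would call strip_comments(s) before the re.sub line, and use split_names(attrs['__attribute__']) in the cancarry branch of the handler.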


