[Tutor] making a custom file parser?

Devin Jeanpierre jeanpierreda at gmail.com
Mon Jan 9 02:19:46 CET 2012


> Parsing XML with regular expressions is generally a very bad idea. In
> the general case, it's actually impossible. XML is not what is called
> a regular language, and therefore cannot be parsed with regular
> expressions. You can use regular expressions to grab a limited amount
> of data from a limited set of XML files, but this is dangerous, hard,
> and error-prone.

Python regexes aren't regular, and this isn't XML.

A working XML parser has been written using .NET regexes (sorry, no
citation -- can't find it), and they only have one extra feature
(recursion, of course). And it was dreadfully ugly and nasty and
probably terrible to maintain -- that's the real cost of regexes.

In particular, his data actually does look regular.
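
To illustrate the flat case, here is a minimal sketch. The sample text and
the "name = ..." lines are made up (I haven't seen the real file), and it
assumes units never nest. re.DOTALL is there because "." does not match
newlines by default, so a unit spanning several lines would otherwise never
match.

```python
import re

# Hypothetical sample: flat, non-nested <unit> blocks spanning several lines.
sample = """<unit>
name = footman
</unit>
<unit>
name = archer
</unit>"""

# Non-greedy body; re.DOTALL lets "." cross line boundaries.
unit_reg = re.compile(r"<unit>(.*?)</unit>", re.DOTALL)

# findall returns the captured body of every match, in order.
units = [body.strip() for body in unit_reg.findall(sample)]
print("number of units: %d" % len(units))
```

Note that findall (or finditer) is what gives you a count; search only ever
returns the first match.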

> I'll assume that said "(.*)". There are still a few problems: < and >
> shouldn't be escaped, which is why you're not getting any matches.
> Also you shouldn't use * because it is greedy, matching as much as
> possible. So it would match everything in between the first <unit> and
> the last </unit> tag in the file, including other <unit></unit> tags
> that might show up.

On the "can you work with this using regexes" angle: if units can be
nested, then neither greedy nor non-greedy matching will work. That's
a particular case where regular expressions can't work for your data.
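
A small demonstration of the nesting problem, using a made-up string (the
real file format is an assumption here). Neither pattern pairs each opening
tag with its own closing tag:

```python
import re

# Hypothetical input: one unit nested inside another, then a sibling unit.
nested = "<unit>a<unit>b</unit>c</unit><unit>d</unit>"

# Greedy: runs from the first <unit> to the very last </unit>.
greedy = re.findall(r"<unit>(.*)</unit>", nested)

# Non-greedy: stops at the first </unit>, which belongs to the inner unit.
lazy = re.findall(r"<unit>(.*?)</unit>", nested)

print(greedy)
print(lazy)
```

The greedy version swallows both top-level units in one match, and the
non-greedy version cuts the outer unit off at the inner closing tag.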

> Test it carefully, ditch elementtree, use as few regexes as
> possible (string functions are your friends! startswith, split, strip,
> et cetera) and you might end up with something that is only slightly
> ugly and mostly works. That said, I'd still advise against it. Turning
> the files into valid XML and then using whatever XML parser you fancy
> will probably be easier.

He'd probably do that using regexes.

Easiest way is probably to write a real parser using some PEG or CFG
thingy. Less error-prone.
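
To give a rough idea, here is a sketch of a tiny hand-written parser that
tracks nesting with a stack. It is not a PEG or CFG library, just the
simplest thing that handles nested tags. The tag syntax (<name> and </name>
with no attributes) and the tree shape are assumptions made for
illustration; the real format will need more token types. The regex is
only used to split the input into tokens.

```python
import re

# A token is an opening tag, a closing tag, or a run of text without "<".
TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")

def parse(text):
    """Return a list of nodes; a node is a str or a (tag, children) tuple."""
    root = []
    stack = [(None, root)]  # (open tag name, list collecting its children)
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected character at offset %d" % pos)
        closing, tag, chars = m.groups()
        if chars is not None:
            # Plain text goes to the innermost open tag.
            stack[-1][1].append(chars)
        elif closing:
            # A closing tag must match the innermost open tag.
            open_tag, _ = stack.pop()
            if open_tag != tag:
                raise ValueError("mismatched </%s> at offset %d" % (tag, pos))
        else:
            # An opening tag starts a new node and becomes the innermost one.
            children = []
            stack[-1][1].append((tag, children))
            stack.append((tag, children))
        pos = m.end()
    if len(stack) != 1:
        raise ValueError("unclosed <%s>" % stack[-1][0])
    return root

tree = parse("<unit>a<unit>b</unit>c</unit><unit>d</unit>")
print(tree)
```

Unlike the regex-only approach, this raises an error on unclosed or
mismatched tags instead of silently returning the wrong text.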

Overall agree with advice, though. Just being picky. Sorry.

-- Devin


On Sat, Jan 7, 2012 at 3:15 PM, Hugo Arts <hugo.yoshi at gmail.com> wrote:
> On Sat, Jan 7, 2012 at 8:22 PM, Alex Hall <mehgcap at gmail.com> wrote:
>> I had planned to parse myself, but am not sure how to go about it. I
>> assume regular expressions, but I couldn't even find the amount of
>> units in the file by using:
>> unitReg=re.compile(r"\<unit\>(*)\</unit\>")
>> unitCount=unitReg.search(fileContents)
>> print "number of units: "+unitCount.len(groups())
>>
>> I just get an exception that "None type object has no attribute
>> groups", meaning that the search was unsuccessful. What I was hoping
>> to do was to grab everything between the opening and closing unit
>> tags, then read it one at a time and parse further. There is a tag
>> inside a unit tag called AttackTable which also terminates, so I would
>> need to pull that out and work with it separately. I probably just
>> have misunderstood how regular expressions and groups work...
>>
>
> Parsing XML with regular expressions is generally a very bad idea. In
> the general case, it's actually impossible. XML is not what is called
> a regular language, and therefore cannot be parsed with regular
> expressions. You can use regular expressions to grab a limited amount
> of data from a limited set of XML files, but this is dangerous, hard,
> and error-prone.
>
> As long as you realize this, though, you could possibly give it a shot
> (here be dragons, you have been warned).
>
>> unitReg=re.compile(r"\<unit\>(*)\</unit\>")
>
> This is probably not what you actually did, because it fails with a
> different error:
>
>>>> a = re.compile(r"\<unit\>(*)\</unit\>")
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py",
> line 188, in compile
>  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py",
> line 243, in _compile
> sre_constants.error: nothing to repeat
>
> I'll assume that said "(.*)". There are still a few problems: < and >
> shouldn't be escaped, which is why you're not getting any matches.
> Also you shouldn't use * because it is greedy, matching as much as
> possible. So it would match everything in between the first <unit> and
> the last </unit> tag in the file, including other <unit></unit> tags
> that might show up. What you want is more like this:
>
> unit_reg = re.compile(r"<unit>(.*?)</unit>")
>
> Test it carefully, ditch elementtree, use as few regexes as
> possible (string functions are your friends! startswith, split, strip,
> et cetera) and you might end up with something that is only slightly
> ugly and mostly works. That said, I'd still advise against it. Turning
> the files into valid XML and then using whatever XML parser you fancy
> will probably be easier. Adding quotes and closing tags and removing
> comments with regexes is still bad, but easier than parsing the whole
> thing with regexes.
>
> HTH,
> Hugo

