[Tutor] module to parse XMLish text?

Terry Carroll carroll at tjc.com
Fri Jan 14 03:55:54 CET 2011


Does anyone know of a module that can parse out text with XML-like tags as 
in the example below?  I emphasize the "-like" in "XML-like".  I don't 
think I can parse this as XML (can I?).

Sample text between the dashed lines::

---------------------------------
Blah, blah, blah
<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
---------------------------------

I'd like to be able to have a dictionary (or any other structure, really, 
as long as I can get to the parsed-out pieces) that would look something 
like:

  {"BING" : "ZEBRA",
   "BANG" : "ROOSTER",
   "BOOM" : "GARBONZO BEAN",
   "BLIP" : "SOMETHING ELSE",
   "BASH" : "SOMETHING DIFFERENT"}

The "Blah, blah, blah" can be tossed away, for all I care.

The basic rule is that the tag either has an operand (e.g., <BING ZEBRA>), 
in which case the name is the first word and the content is everything 
else that follows in the tag; or else the tag has no operand, in which 
case it is matched to a corresponding closing tag (e.g., <BLIP>SOMETHING 
ELSE</BLIP>), and the content is the material between the two tags.

I think I can assume there are no nested tags.
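
For illustration, the rule above can be sketched with nothing more than 
the standard re module.  This is only a rough sketch under the stated 
assumptions (no nested tags, tag names made of word characters); the 
names TAG_RE and parse_xmlish are made up for this example and do not 
come from any existing module.  A wrapper such as <AAA> is skipped 
because its content contains other tags.

```python
import re

# One alternative per tag form described in the rule (hypothetical names).
TAG_RE = re.compile(
    r"<(?P<pname>\w+)>(?P<pbody>[^<]*)</(?P=pname)>"   # <BLIP>text</BLIP>
    r"|<(?P<oname>\w+)\s+(?P<operand>[^<>]*?)\s*>"     # <BING ZEBRA>
)


def parse_xmlish(text):
    """Return a dict mapping each tag name to its content.

    Assumes no nested tags; text outside of tags is ignored.
    """
    result = {}
    for match in TAG_RE.finditer(text):
        if match.group("pname"):
            # Paired form: content is the material between the two tags.
            # Collapse runs of whitespace so wrapped lines become one line.
            result[match.group("pname")] = " ".join(match.group("pbody").split())
        else:
            # Operand form: first word is the name, the rest is the content.
            result[match.group("oname")] = match.group("operand")
    return result
```

Calling parse_xmlish() on the sample text should give the dictionary 
shown earlier.  If a tag name appears more than once, the last 
occurrence wins, which may or may not be what is wanted.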

I could write a state machine to do this, I suppose, but life's short, and 
I'd rather not re-invent the wheel, if there's a wheel lying around 
somewhere.


