[Tutor] Re: handling string!!

Thu Oct 23 16:53:54 EDT 2003

Daniel Ehrenberg wrote:

<snip>
> I have a somewhat related question. I am trying to
> write a program to parse the simple markup language
> used at Wikipedia.org. For this specific question, the
> markup is the same as in MoinMoin.

Kirk Bailey (who's around here too) has an open source Wiki at 
tinylist.org, written in Python:

http://www.tinylist.org/cgi-bin/wikinehesaed2.py

It handles this kinda thing quite well, I just tested it at the bottom of 
http://www.tinylist.org/cgi-bin/wikinehesaed2.py/SandBox. Perhaps you 
should look at its code.

> '''bold''' -> <strong>bold</strong>
> ''italics'' -> <em>italics</em>
> '''''bold and italics''''' -> <strong><em>bold and
> italics</em></strong>
> '''''b & i'' b''' -> <strong><em>b & i</em> b</strong>
> '''''b & i''' i'' -> <em><strong>b & i</strong> i</em>

<snip>

> would parse the bold parts of the text. It would be
> similar for the code processing italics and the
> combination of bold and italics, doing the ones with
> the most apostrophes first and the least apostrophes
> last (i.e. first bold and italics, then bold, then
> italics). However, I don't see how I could do the same
> with the fourth and fifth examples. Could you help me
> with that?

You just have to keep track of what you have open and apply the first open, 
last to close principle (use a list to which you append tags when you open 
them and then delete them when you close them starting from the last). In 
your 5th example:

 > '''''b & i''' i'' -> <em><strong>b & i</strong> i</em>

your parser would, for example, first hit ''' (open <strong> and append it 
to the OpenTags list), then the '' (open <em> and append it to the OpenTags 
list). When it finds the closing ''', it tries to close the <strong>, but it 
notices in the OpenTags list that other tags were opened after it and are 
still open. It closes those first (in this case, the last tag in OpenTags is 
<em>, so it closes it first, but places it in a different list, say 
RestoreTags), then it closes the <strong> and reopens the ones in 
RestoreTags. Obviously, these end up on the OpenTags list again. The 
generated code is then:

<strong><em>b & i</em></strong><em> i</em>

Which is not perfect, but it's valid XHTML :). Making it really intelligent 
would be quite a bit harder, especially if you consider you might be 
nesting more tags.
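
Here is a minimal sketch of that bookkeeping in Python, in case it helps to 
see it spelled out. The names render and MARKERS are made up for this 
example, and open_tags / restore_tags play the role of the OpenTags and 
RestoreTags lists described above. It assumes that a run of five apostrophes 
is read as ''' followed by '', and it does not escape characters such as &, 
so treat it as an illustration rather than a finished parser:

```python
import re

# Hypothetical mapping from wiki markers to tag names (an assumption for
# this sketch, not taken from any existing wiki engine).
MARKERS = {"'''": "strong", "''": "em"}


def render(text):
    """Convert ''' and '' markers to <strong>/<em>, keeping tags nested."""
    open_tags = []   # tags currently open, in the order they were opened
    out = []
    # The capturing group keeps the markers in the result; ''' is listed
    # first so that it wins over '' when both could match.
    for token in re.split(r"('''|'')", text):
        tag = MARKERS.get(token)
        if tag is None:
            out.append(token)              # ordinary text, copied as-is
        elif tag not in open_tags:
            open_tags.append(tag)          # marker opens a new tag
            out.append("<%s>" % tag)
        else:
            # Marker closes a tag: first close everything opened after it,
            # remembering those tags so they can be reopened afterwards.
            restore_tags = []
            while open_tags[-1] != tag:
                inner = open_tags.pop()
                out.append("</%s>" % inner)
                restore_tags.append(inner)
            open_tags.pop()
            out.append("</%s>" % tag)
            # Reopen the remembered tags in their original order.
            for inner in reversed(restore_tags):
                open_tags.append(inner)
                out.append("<%s>" % inner)
    # Close anything the input left open, innermost first.
    while open_tags:
        out.append("</%s>" % open_tags.pop())
    return "".join(out)


print(render("'''''b & i''' i''"))
# prints: <strong><em>b & i</em></strong><em> i</em>
```

For the 4th example the closing '' matches the last tag in open_tags, so 
nothing has to be restored and the output is simply 
<strong><em>b & i</em> b</strong>.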

I'm not sure this is the way Kirk's Wiki does it, but I know it would work 
because I use this same principle in my regular expression tool to 
highlight parentheses.

I'm wondering how you'd handle '''''' though (can be two bolds or three 
italics).

-- 
Yours,

Andrei
