Help with regular expression patterns

Michel Perez opsbat at infomed.sld.cu
Fri Nov 28 10:46:07 EST 2008


Hi:
 I'm so new to Python that I don't yet have the right idea about regular
expressions. This is what I want to do:
 Extract some information with Python and then replace those expressions
with others. I use wikitext as a base, and this is what I do:

<code file="parse.py">
paragraphs = """
= Test '''wikitest'''=
[[Image:image_link.jpg|rigth|thumbnail|200px|"PREMIER"]]

[http://www.google.com.cu]
::''Note: This is just an example to test some regular expressions
stuffs.''

The ''wikitext'' is a text format that helps a lot. In concept is a
simple [[markup]] [[programming_language|language]]. That helps to make
simple create documentations texts.

==Wikitext==

Created by Warn as a ...

<nowiki>[</nowiki> this is a normal <nowiki>sign]</nowiki>
""".split('\n\n')

import re
wikipatterns = {
    'a_nowiki' : re.compile(r"<nowiki>(.\S+)</nowiki>"), # nowiki
    'section' : re.compile(r"\=(.*)\="),        # section one tags
    'sectiontwo' : re.compile(r"\=\=(.*?)\=\="),# section two tags
    'wikilink': re.compile(r"\[\[(.*?)\]\]"),   # links tags
    'link': re.compile(r"\[(.*?)\]"),           # external links tags
    'italic': re.compile(r"\'\'(.*?)\'\'"),     # italic text tags
    'bold' : re.compile(r"\'\'\'(.*?)\'\'\'"),  # bold text tags
}

for pattern in wikipatterns:
    print "===> processing pattern :", pattern, "<=============="
    for paragraph in paragraphs:
        print  wikipatterns[pattern].findall(paragraph)

</code>

But when I run it the result is not what I want; it's something like:

<code>
michel at cerebellum:/local/python$python parse.py
===> processing pattern : bold <============== 
['braille']
[]
[]
[]
[]
[]
===> processing pattern : section <==============
[" Test '''wikitest'''"]
[]
[]
['=Wikitext=']
[]
[]
===> processing pattern : sectiontwo <==============
[]
[]
[]
['Wikitext']
[]
[]
===> processing pattern : link <==============
['[Image:image_link.jpg|rigth|thumbnail|200px|"PREMIER"']
['http://www.google.com.cu']
['[markup', '[programming_language|language']
[]
[]
['</nowiki> this is a normal <nowiki>sign']
===> processing pattern : italic <==============
["'wikitest"]
['Note: This is just an example to test some regular expressions
stuffs.']
['wikitext']
[]
[]
[]
===> processing pattern : wikilink <==============
['Image:image_link.jpg|rigth|thumbnail|200px|"PREMIER"']
[]
['markup', 'programming_language|language']
[]
[]
[]
===> processing pattern : a_nowiki <==============
[]
[]
[]
[]
[]
['sign]']
</code>

In the first case the result is OK.
In the second, the first result is OK, but the second is not, because it
is a level two section, not a level one.
In the third, the results are OK.
In the fourth, the first and third results are wrong because they are
level two links, but the second is OK.
The fifth is OK.
The sixth shows only one result and it should show two.
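
To make the expected results concrete, here is a minimal sketch of patterns that
would behave as described. It assumes the last complaint is about the a_nowiki
pattern, and it only covers 'a_nowiki', 'section' and 'link'. The names
fixed_patterns, samples and find_all are made up for this illustration and are
not part of parse.py. The idea is to use lookarounds so that a single "=" or "["
is not matched when it is really half of "==" or "[[".

```python
import re

# Hypothetical replacements for three of the patterns in parse.py.
fixed_patterns = {
    # Non-greedy, so a single character between the tags is accepted too.
    'a_nowiki': re.compile(r"<nowiki>(.*?)</nowiki>"),
    # A lone "=" on each side: not preceded or followed by another "=".
    'section': re.compile(r"(?<!=)=(?!=)(.+?)(?<!=)=(?!=)"),
    # A lone "[" ... "]" pair with no brackets inside it.
    'link': re.compile(r"(?<!\[)\[([^\[\]]+)\](?!\])"),
}

# Lines taken from the test text above.
samples = [
    "= Test '''wikitest'''=",
    "==Wikitext==",
    "[http://www.google.com.cu]",
    "a simple [[markup]] [[programming_language|language]].",
    "<nowiki>[</nowiki> this is a normal <nowiki>sign]</nowiki>",
]

def find_all(name, text):
    """Return every captured group of the named pattern in text."""
    return fixed_patterns[name].findall(text)

for name in sorted(fixed_patterns):
    for text in samples:
        print("%s %r" % (name, find_all(name, text)))
```

Note that the 'link' pattern still matches the bracket pair spread across the two
nowiki tags in the last sample; the nowiki spans would probably have to be taken
out of the text before the other patterns are applied.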

Please help.

PS: I am really sorry about my technical English.


-- 
Michel Perez                                  )\._.,--....,'``.    
Ulrico Software Group                        /,   _.. \   _\  ;`._ ,.
Nihil est tam arduo et difficile human      `._.-(,_..'--(,_..'`-.;.'
mens vincat.                   Séneca.   ============================= 


---------------------------------------
    Red Telematica de Salud - Cuba
    	  CNICM - Infomed


