extract Infobox contents

Anish Chapagain anishchapagain at gmail.com
Mon Apr 6 18:12:14 EDT 2009


Hi,
I am trying to extract the contents of a Wikipedia Infobox, which is in the
format shown below, from a page opened by URL in Python.

{{ Infobox Software
| name                   = Bash
| logo                   = [[Image:bash-org.png|165px]]
| screenshot             = [[Image:Bash demo.png|250px]]
| caption                = Screenshot of bash and [[Bourne shell|sh]] sessions demonstrating some features
| developer              = [[Chet Ramey]]
| latest release version = 4.0
| latest release date    = {{release date|mf=yes|2009|02|20}}
| programming language   = [[C (programming language)|C]]
| operating system       = [[Cross-platform]]
| platform               = [[GNU]]
| language               = English, multilingual ([[gettext]])
| status                 = Active
| genre                  = [[Unix shell]]
| source model           = [[Free software]]
| license                = [[GNU General Public License]]
| website                = [http://tiswww.case.edu/php/chet/bash/bashtop.html Home page]
}} //upto this line

I need to extract everything between "{{ Infobox" and its closing "}}".

Thanks in advance if anyone can help.
I am trying with:

s1='{{ Infobox'
s2=len(s1)
pos1=data.find("{{ Infobox")
pos2=data.find("\n",pos1)

pat1=data.find("}}")

but I end up getting only the first line of the Infobox.
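
One thing worth noting about the attempt above: data.find("}}") returns the
first "}}" after the start of the text, and in the sample that would be the
one closing the nested {{release date|...}} template, not the Infobox itself.
A minimal sketch of one possible approach is to count opening and closing
brace pairs until the depth returns to zero. It assumes `data` is the page's
wikitext as a string; the function name extract_infobox and the sample string
are hypothetical, introduced only for illustration.

```python
import re


def extract_infobox(data):
    """Return the text from '{{ Infobox' through its matching '}}', or None.

    Assumption: `data` is raw wikitext held in a string. Nested templates
    such as {{release date|...}} are handled by tracking brace depth.
    """
    # Tolerate optional whitespace between '{{' and 'Infobox'.
    match = re.search(r"\{\{\s*Infobox", data)
    if match is None:
        return None

    start = match.start()
    depth = 0
    i = start
    while i < len(data) - 1:
        pair = data[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                # Matching close of the Infobox template found.
                return data[start:i]
        else:
            i += 1
    return None  # braces never balanced


# Hypothetical, shortened sample in the same shape as the Infobox above.
sample = """intro text
{{ Infobox Software
| name                   = Bash
| latest release date    = {{release date|mf=yes|2009|02|20}}
| status                 = Active
}}
rest of article"""

box = extract_infobox(sample)
print(box)
```

Counting depth rather than searching for the first "}}" is what lets the
nested template on the "latest release date" line pass through intact.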

Thank you,



More information about the Python-list mailing list