extract Infobox contents

Rhodri James rhodri at wildebst.demon.co.uk
Mon Apr 6 18:41:48 EDT 2009


On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain  
<anishchapagain at gmail.com> wrote:

> Hi,
> I was trying to extract wikipedia Infobox contents which is in format
> like given below, from the opened URL page in Python.
>
> {{ Infobox Software
> | name                   = Bash
> | logo                   = [[Image:bash-org.png|165px]]
> | screenshot             = [[Image:Bash demo.png|250px]]
> | caption                = Screenshot of bash and [[Bourne shell|sh]]
> sessions demonstrating some features
> | developer              = [[Chet Ramey]]
> | latest release version = 4.0
> | latest release date    = {{release date|mf=yes|2009|02|20}}
> | programming language   = [[C (programming language)|C]]
> | operating system       = [[Cross-platform]]
> | platform               = [[GNU]]
> | language               = English, multilingual ([[gettext]])
> | status                 = Active
> | genre                  = [[Unix shell]]
> | source model           = [[Free software]]
> | license                = [[GNU General Public License]]
> | website                = [http://tiswww.case.edu/php/chet/bash/
> bashtop.html Home page]
> }} //upto this line
>
> I need to extract all data between {{ Infobox ...to }}
>
> Thanks if anyone can help,
> am trying with
>
> s1='{{ Infobox'
> s2=len(s1)
> pos1=data.find("{{ Infobox")
> pos2=data.find("\n",pos2)
>
> pat1=data.find("}}")
>
> but am ending up getting one line at top only.

How are you getting your data?  Assuming that you can arrange to get
it one line at a time, here's a quick and dirty way to extract the
infoboxes on a page.

infoboxes = []
infobox = []
reading_infobox = False

# feed_me_lines_somehow() is a placeholder for however you read the page
for line in feed_me_lines_somehow():
    if line.startswith("{{ Infobox"):
        reading_infobox = True
    if reading_infobox:
        infobox.append(line)
        # Only a "}}" seen while inside an infobox closes it
        if line.startswith("}}"):
            reading_infobox = False
            infoboxes.append(infobox)
            infobox = []

You end up with 'infoboxes' holding a list of all the infoboxes
on the page, each stored as a list of its lines (including the
opening "{{ Infobox" line and the closing "}}" line).
For safety's sake you really should be using regular expressions
rather than 'startswith', since the spacing in the markup can vary
(e.g. "{{Infobox" with no space), but I leave that as an exercise
for the reader :-)
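
For anyone who wants a head start on that exercise, here is a minimal
sketch of the same line-by-line approach using regular expressions.
The names START, END and extract_infoboxes, the exact patterns, and
the sample text are all assumptions made for illustration; they are
not taken from the original question. It still assumes the closing
"}}" sits at the start of its own line, as in the example above.

```python
import re

# Assumed patterns: allow leading whitespace, optional spaces after "{{",
# and any letter case in "Infobox".
START = re.compile(r"^\s*\{\{\s*Infobox\b", re.IGNORECASE)
END = re.compile(r"^\s*\}\}")


def extract_infoboxes(lines):
    """Return a list of infoboxes, each held as a list of its lines."""
    infoboxes = []
    infobox = None  # None means we are not currently inside an infobox
    for line in lines:
        if infobox is None and START.match(line):
            infobox = []
        if infobox is not None:
            infobox.append(line)
            # A line beginning with "}}" closes the current infobox
            if END.match(line):
                infoboxes.append(infobox)
                infobox = None
    return infoboxes


# Made-up sample text, loosely modelled on the infobox in the question
sample = """Intro text
{{Infobox Software
| name   = Bash
| latest release date = {{release date|mf=yes|2009|02|20}}
| status = Active
}}
Trailing text
"""
boxes = extract_infoboxes(sample.splitlines())
```

Templates nested inside a line, such as the release date above, are
left alone because their "}}" is not at the start of the line. An
infobox written entirely on one line, or a nested template whose
closing "}}" is on a line of its own, would confuse this sketch.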

-- 
Rhodri James *-* Wildebeeste Herder to the Masses


