extract Infobox contents
Rhodri James
rhodri at wildebst.demon.co.uk
Mon Apr 6 18:41:48 EDT 2009
On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain
<anishchapagain at gmail.com> wrote:
> Hi,
> I was trying to extract wikipedia Infobox contents which is in format
> like given below, from the opened URL page in Python.
>
> {{ Infobox Software
> | name = Bash
> | logo = [[Image:bash-org.png|165px]]
> | screenshot = [[Image:Bash demo.png|250px]]
> | caption = Screenshot of bash and [[Bourne shell|sh]]
> sessions demonstrating some features
> | developer = [[Chet Ramey]]
> | latest release version = 4.0
> | latest release date = {{release date|mf=yes|2009|02|20}}
> | programming language = [[C (programming language)|C]]
> | operating system = [[Cross-platform]]
> | platform = [[GNU]]
> | language = English, multilingual ([[gettext]])
> | status = Active
> | genre = [[Unix shell]]
> | source model = [[Free software]]
> | license = [[GNU General Public License]]
> | website = [http://tiswww.case.edu/php/chet/bash/
> bashtop.html Home page]
> }} //upto this line
>
> I need to extract all data between {{ Infobox ...to }}
>
> Thank's if anyone can help,
> am trying with
>
> s1='{{ Infobox'
> s2=len(s1)
> pos1=data.find("{{ Infobox")
> pos2=data.find("\n",pos2)
>
> pat1=data.find("}}")
>
> but am ending up getting one line at top only.
How are you getting your data? Assuming that you can arrange to get
it one line at a time, here's a quick and dirty way to extract the
infoboxes on a page.
infoboxes = []
infobox = []
reading_infobox = False
for line in feed_me_lines_somehow():
    if line.startswith("{{ Infobox"):
        reading_infobox = True
    if reading_infobox:
        infobox.append(line)
        if line.startswith("}}"):
            reading_infobox = False
            infoboxes.append(infobox)
            infobox = []
You end up with 'infoboxes' containing a list of all the infoboxes
on the page, each held as a list of the lines of their content.
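
If the whole page is already sitting in a string (the 'data' of your
snippet), str.splitlines() is one way to feed the loop. Here is a rough,
self-contained sketch of that; the function name extract_infoboxes and
the sample text are made up for illustration only.

```python
def extract_infoboxes(data):
    """Return a list of infoboxes, each held as a list of lines."""
    infoboxes = []
    infobox = []
    reading_infobox = False
    # splitlines() stands in for feed_me_lines_somehow()
    for line in data.splitlines():
        if line.startswith("{{ Infobox"):
            reading_infobox = True
        if reading_infobox:
            infobox.append(line)
            # only a "}}" at the very start of a line closes the box, so
            # inline templates like {{release date|...}} are left alone
            if line.startswith("}}"):
                reading_infobox = False
                infoboxes.append(infobox)
                infobox = []
    return infoboxes


# Hypothetical sample text, cut down from the infobox in the question
sample = """Some text before
{{ Infobox Software
| name = Bash
| latest release date = {{release date|mf=yes|2009|02|20}}
| status = Active
}}
Some text after"""

boxes = extract_infoboxes(sample)
for line in boxes[0]:
    print(line)
```

Joining each inner list with "\n" gives you the infobox back as a
single string if that is more convenient.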
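For safety's sake you really should be using regular expressions
rather than 'startswith', since the markup is not guaranteed to be
spaced or capitalised exactly like your example ("{{Infobox software"
would slip straight past), but I leave that as an exercise for the
reader :-)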
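
For anyone who wants a head start on the regular expression route, one
possible sketch follows. The two patterns are my assumptions about what
counts as the start and end of an infobox (optional leading whitespace,
any capitalisation of "Infobox", and a closing line that begins with
"}}"), and extract_infoboxes_re is just an illustrative name.

```python
import re

# Assumed patterns, not a full wikitext grammar
START = re.compile(r"^\s*\{\{\s*infobox\b", re.IGNORECASE)
END = re.compile(r"^\s*\}\}")


def extract_infoboxes_re(lines):
    """Collect infoboxes from an iterable of lines, one list per box."""
    infoboxes = []
    infobox = []
    reading_infobox = False
    for line in lines:
        if not reading_infobox and START.match(line):
            reading_infobox = True
        if reading_infobox:
            infobox.append(line)
            if END.match(line):
                reading_infobox = False
                infoboxes.append(infobox)
                infobox = []
    return infoboxes
```

This still works a line at a time, so a nested template whose closing
"}}" happens to sit at the start of its own line would end the infobox
early. Handling that properly means counting "{{" and "}}" pairs
instead.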
--
Rhodri James *-* Wildebeeste Herder to the Masses