extract Infobox contents

Rhodri James rhodri at wildebst.demon.co.uk
Tue Apr 7 20:57:35 EDT 2009


On Tue, 07 Apr 2009 12:46:18 +0100, J. Clifford Dyer  
<jcd at sdf.lonestar.org> wrote:

> On Mon, 2009-04-06 at 23:41 +0100, Rhodri James wrote:
>> On Mon, 06 Apr 2009 23:12:14 +0100, Anish Chapagain
>> <anishchapagain at gmail.com> wrote:
>>
>> > Hi,
>> > I was trying to extract wikipedia Infobox contents which is in format
>> > like given below, from the opened URL page in Python.
>> >
>> > {{ Infobox Software
>> > | name                   = Bash
[snip]
>> > | latest release date    = {{release date|mf=yes|2009|02|20}}
>> > | programming language   = [[C (programming language)|C]]
>> > | operating system       = [[Cross-platform]]
>> > | platform               = [[GNU]]
>> > | language               = English, multilingual ([[gettext]])
>> > | status                 = Active
[snip some more]
>> > }} //upto this line
>> >
>> > I need to extract all data between {{ Infobox ...to }}

[snip still more]

>> You end up with 'infoboxes' containing a list of all the infoboxes
>> on the page, each held as a list of the lines of their content.
>> For safety's sake you really should be using regular expressions
>> rather than 'startswith', but I leave that as an exercise for the
>> reader :-)
>>
>
> I agree that startswith isn't the right option, but for matching two
> constant characters, I don't think re is necessary.  I'd just do:
>
> if '}}' in line:
>     pass
>
> Then, as the saying goes, you only have one problem.

That would be the problem of matching lines like:

  | latest release date    = {{release date|mf=yes|2009|02|20}}

would it? :-)
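
To make the objection concrete, here is a minimal sketch using the field line quoted above. The helper name naive_is_end is made up for illustration; it only wraps the substring test from the quoted suggestion.

```python
# Sample lines taken from the infobox quoted earlier in the thread.
field_line = "| latest release date    = {{release date|mf=yes|2009|02|20}}"
closing_line = "}}"

def naive_is_end(line):
    # Hypothetical helper: the plain substring test from the quoted reply.
    return "}}" in line

print(naive_is_end(field_line))    # True: a false positive, this is a field
print(naive_is_end(closing_line))  # True: the real end of the infobox
```

Both lines match, so the substring test cannot tell a nested template inside a field apart from the line that closes the infobox.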

A quick bit of timing suggests that:

    if line.lstrip().startswith("}}"):
        pass

is what we actually want.
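
For what it's worth, a rough sketch of how that test might slot into the extraction loop. This is not the code snipped from earlier in the thread; the function name extract_infoboxes and the sample text are assumptions for illustration, and it assumes each infobox closes with a line that begins with "}}" (ignoring leading whitespace).

```python
def extract_infoboxes(text):
    """Return a list of infoboxes, each a list of its content lines."""
    infoboxes = []
    current = None  # None means we are not inside an infobox
    for line in text.splitlines():
        stripped = line.lstrip()
        if current is None:
            # Opening line: "{{", optional spaces, then "Infobox".
            if stripped.startswith("{{") and stripped[2:].lstrip().startswith("Infobox"):
                current = [line]
        else:
            current.append(line)
            # Closing line: "}}" at the start, ignoring leading whitespace.
            if stripped.startswith("}}"):
                infoboxes.append(current)
                current = None
    return infoboxes

# Assumed sample page text, cut down from the infobox quoted above.
sample = """Some intro text
{{ Infobox Software
| name                   = Bash
| latest release date    = {{release date|mf=yes|2009|02|20}}
| status                 = Active
}}
More article text"""

boxes = extract_infoboxes(sample)
print(len(boxes))     # number of infoboxes found
print(len(boxes[0]))  # number of lines in the first one
```

The nested {{release date ...}} template no longer ends the infobox early, because its "}}" is not at the start of the line. A nested template that spans several lines and closes with "}}" on a line of its own would still fool this, so treat it as a starting point only.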

-- 
Rhodri James *-* Wildebeeste Herder to the Masses


