Mining strings from an HTML document.

Thu Jan 26 14:46:22 EST 2006

>    def extract(text,s1,s2):
>        ''' Extract strings wrapped between s1 and s2.
>
>        >>> t="""this is a <span>test</span> for <span>extract()</span>
>        ... that <span>does multiple extract</span> """
>        >>> extract(t,'<span>','</span>')
>        ['test', 'extract()', 'does multiple extract']
>
>        '''
>        beg = [1,0][text.startswith(s1)]
>        tmp = text.split(s1)[beg:]
>        end = [len(tmp), len(tmp)+1][ text.endswith(s2)]
>        return [ x.split(s2)[0] for x in tmp if len(x.split(s2))>1][:end]

> Could you/anyone explain the 4 lines of code to me though? A crash
> course in Python shorthand? What does it mean when you use two sets of
> brackets as in : beg = [1,0][text.startswith(s1)] ?

The idea is to use .split() to cut the string twice, first at s1 and then at s2.
For a string like:

     -----AderickB----ArunsunB------

The first cut, at A, gives you  ['-----', 'derickB----', 'runsunB------']   (line-1,2)
The second cut, at B, gives you ['derick', 'runsun']                        (line-3,4)
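
The two cuts can be tried directly in the interpreter. This is only a minimal
sketch using the sample string above; the names raw, first_cut and second_cut
are made up for illustration and do not appear in extract().

```python
# Sample string from above: 'A' plays the role of s1, 'B' the role of s2.
raw = "-----AderickB----ArunsunB------"

# First cut: split at 'A'. The chunk before the first 'A' is not wanted.
first_cut = raw.split("A")
print(first_cut)    # ['-----', 'derickB----', 'runsunB------']

# Second cut: split each remaining chunk at 'B' and keep the part before it.
second_cut = [chunk.split("B")[0] for chunk in first_cut[1:]]
print(second_cut)   # ['derick', 'runsun']
```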

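The function relies on list indexing, slicing and a list comprehension. As
Magnus already explained, line-1 is just a switch: text.startswith(s1) returns
False or True, which Python treats as 0 or 1, so it picks one of the two items
in [1,0]. Line-3 works the same way. These two lines exist to handle the
difference between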

     -----AderickB----ArunsunB------
     AderickB----ArunsunB------

or

     -----AderickB----ArunsunB------
     -----AderickB----ArunsunB

That is, if the original raw string starts with s1 or ends with s2, it needs
special handling.
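
Here is a small sketch of the "two sets of brackets" trick on its own. The
helper name pick_start() is hypothetical and only wraps line-1 of extract() so
it can be called by itself. It works because a Python bool is also an integer
(False == 0, True == 1), so it can be used as a list index.

```python
def pick_start(text, s1):
    # [value_if_False, value_if_True][condition]
    return [1, 0][text.startswith(s1)]

# Text does not start with s1: index False (0) selects 1.
print(pick_start("-----AderickB----", "A"))   # 1

# Text starts with s1: index True (1) selects 0.
print(pick_start("AderickB----", "A"))        # 0

# Why the two cases differ: a leading s1 leaves an empty first chunk.
print("AderickB----".split("A"))              # ['', 'derickB----']

# The same switch written as a conditional expression (Python 2.5 and later).
text, s1 = "AderickB----", "A"
beg = 0 if text.startswith(s1) else 1
print(beg)                                    # 0
```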

Line-2 and line-4 are just ordinary list slicing (line-4 also contains the list
comprehension), which you should be able to find in any Python tutorial.
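
For reference, a short sketch of the slicing and filtering forms that line-2
and line-4 use, again on the sample chunks from above. The variable names are
made up for illustration.

```python
chunks = ["-----", "derickB----", "runsunB------"]

# seq[start:] keeps everything from index start onward (as in line-2).
tail = chunks[1:]
print(tail)        # ['derickB----', 'runsunB------']

# seq[:stop] keeps everything before index stop (as in line-4).
head = chunks[:2]
print(head)        # ['-----', 'derickB----']

# A stop value past the end is not an error; the slice simply ends early.
whole = chunks[:10]
print(whole)       # ['-----', 'derickB----', 'runsunB------']

# The comprehension in line-4 keeps only chunks that actually contain s2.
found = [c.split("B")[0] for c in chunks if len(c.split("B")) > 1]
print(found)       # ['derick', 'runsun']
```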

Let us know if it's still not clear.

--
~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~
Runsun Pan, PhD
python.pan at gmail.com
Nat'l Center for Macromolecular Imaging
http://ncmi.bcm.tmc.edu/ncmi/
~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~