Mining strings from an HTML document.
Runsun Pan
python.pan at gmail.com
Thu Jan 26 14:46:22 EST 2006
> def extract(text,s1,s2):
>     ''' Extract strings wrapped between s1 and s2.
>
>     >>> t="""this is a <span>test</span> for <span>extract()</span>
>     that <span>does multiple extract</span> """
>     >>> extract(t,'<span>','</span>')
>     ['test', 'extract()', 'does multiple extract']
>
>     '''
>     beg = [1,0][text.startswith(s1)]
>     tmp = text.split(s1)[beg:]
>     end = [len(tmp), len(tmp)+1][ text.endswith(s2)]
>     return [ x.split(s2)[0] for x in tmp if len(x.split(s2))>1][:end]
>
> Could you/anyone explain the 4 lines of code to me though? A crash
> course in Python shorthand? What does it mean when you use two sets of
> brackets as in : beg = [1,0][text.startswith(s1)] ?
The idea is to use .split() to cut the string twice.
For a string:
-----AderickB----ArunsunB------
the first cut, at A, gives you [-----, derickB----, runsunB------] (line-1,2)
the 2nd cut, at B, gives you [derick, runsun] (line-3,4)
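
Here is a minimal sketch of the two cuts on that sample string, using "A" and "B" as stand-ins for s1 and s2. The variable names (raw, first_cut, candidates, names) are made up for illustration and are not part of extract().

```python
raw = "-----AderickB----ArunsunB------"

# First cut: split at the opening marker "A".
first_cut = raw.split("A")
print(first_cut)  # ['-----', 'derickB----', 'runsunB------']

# Drop the leading chunk, which sits before the first "A".
candidates = first_cut[1:]

# Second cut: split each remaining chunk at "B" and keep what precedes it.
names = [chunk.split("B")[0] for chunk in candidates]
print(names)  # ['derick', 'runsun']
```
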
The function relies on a list comprehension (line-4). As Magnus already
explained, line-1 is just a switch: a boolean used as an index to pick one of
two list items. Line-3 works the same way. These two lines exist to handle the
difference between
-----AderickB----ArunsunB------
AderickB----ArunsunB------
or
-----AderickB----ArunsunB------
-----AderickB----ArunsunB
That is, if the original raw string starts with s1 or ends with s2, it needs
special handling.
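
To make the "two sets of brackets" concrete, here is a small sketch of the switch on line-1. The helper names start_index and start_index_explicit are hypothetical, written only for this illustration. The first pair of brackets builds a two-item list, and the second pair indexes it with the boolean.

```python
def start_index(text, s1):
    # [1, 0][flag] picks item 0 when flag is False and item 1 when flag is True,
    # because bool is a subclass of int (False == 0, True == 1).
    return [1, 0][text.startswith(s1)]

def start_index_explicit(text, s1):
    # The same choice written as a conditional expression.
    return 0 if text.startswith(s1) else 1

for sample in ("-----AderickB----ArunsunB------", "AderickB----ArunsunB------"):
    # When the text starts with "A", split() yields a leading '' chunk.
    print(sample.split("A"), start_index(sample, "A"))
```

For the first sample the switch gives 1, so the leading "-----" chunk is skipped. For the second it gives 0, because the first chunk is already just an empty string.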
Line-2 and line-4 use ordinary list slicing, which you should be able to
find in any Python tutorial.
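
A quick sketch of the two slice forms the function uses, [beg:] and [:end], on a made-up list (the names tail, head and clamped are only for illustration):

```python
tmp = ['-----', 'derickB----', 'runsunB------']

tail = tmp[1:]                 # index 1 through the end
head = tmp[:2]                 # start up to, but not including, index 2
clamped = tmp[:len(tmp) + 1]   # a stop past the end is clamped to len(tmp)

print(tail)     # ['derickB----', 'runsunB------']
print(head)     # ['-----', 'derickB----']
print(clamped)  # ['-----', 'derickB----', 'runsunB------']
```

Note that a slice never raises an error for a stop beyond the end of the list, so a stop of len(tmp) and a stop of len(tmp)+1 select the same items.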
Let us know if it's still not clear.
--
~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~
Runsun Pan, PhD
python.pan at gmail.com
Nat'l Center for Macromolecular Imaging
http://ncmi.bcm.tmc.edu/ncmi/
~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~