String parsing

Gabriel Genellina gagsl-py2 at yahoo.com.ar
Tue May 8 21:43:44 EDT 2007


On Tue, 08 May 2007 22:09:52 -0300, HMS Surprise <john at datavoiceint.com>
wrote:

> The string below is a piece of a longer string of about 20000
> characters returned from a web page. I need to isolate the number at
> the end of the line containing 'LastUpdated'. I can find
> 'LastUpdated'  with .find but not sure about how to isolate the
> number. 'LastUpdated' is guaranteed to occur only once. Would
> appreciate it if one of you string parsing whizzes would take a stab
> at it.

> <input type="hidden" name="RFP"		 		value="-1"/>
> <!--<input type="hidden" name="EnteredBy"		value="johnxxxx"/>-->
> <input type="hidden" name="EnteredBy"		value="john"/>
> <input type="hidden" name="ServiceIndex"	value="1"/>
> <input type="hidden" name="LastUpdated" 	value="1178658863"/>
> <input type="hidden" name="NextPage" 		value="../active/active.php"/>
> <input type="hidden" name="ExistingStatus"	value="10" ?>
> <table width="98%" cellpadding="0" cellspacing="0" border="0"
> align="center"

You really should use an HTML parser here. But assuming the page's
structure will not change much, you could use a regular expression like
this:

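import re

# Capture the value attribute that directly follows name="LastUpdated".
expr = re.compile(r'name\s*=\s*"LastUpdated"\s+value\s*=\s*"(.*?)"',
                  re.IGNORECASE)
number = expr.search(text).group(1)  # text holds the page source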
(Handling of "not found" and "duplicate" cases is left as an exercise for  
the reader)
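
For what it's worth, one way that exercise might be done is sketched below. It reuses the same pattern; the function name find_last_updated and the choice to raise ValueError are illustrative assumptions, not something the original question asked for.

```python
import re

# Same pattern as above. The names LAST_UPDATED and find_last_updated
# are hypothetical, chosen only for this sketch.
LAST_UPDATED = re.compile(
    r'name\s*=\s*"LastUpdated"\s+value\s*=\s*"(.*?)"', re.IGNORECASE)


def find_last_updated(text):
    """Return the single LastUpdated value found in text.

    Raises ValueError if the field is missing or appears more than once.
    """
    # With one capturing group, findall returns a list of the captured strings.
    values = LAST_UPDATED.findall(text)
    if not values:
        raise ValueError("LastUpdated not found")
    if len(values) > 1:
        raise ValueError("LastUpdated found %d times" % len(values))
    return values[0]
```

The value comes back as a string; wrap the call in int() if a number is needed.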

Note that <input value="1178658863" type="hidden" name="LastUpdated" /> is
as valid as your HTML, but won't match the expression, because the pattern
expects the name attribute to come immediately before value.
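
A parser-based version does not care about attribute order. The following is only a minimal sketch, assuming Python 3's standard html.parser module; the class InputValueFinder and the helper input_values are names made up for the example.

```python
from html.parser import HTMLParser


class InputValueFinder(HTMLParser):
    """Collect the value attribute of every <input> with a given name."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <input ... />,
        # because the default handle_startendtag calls handle_starttag.
        if tag != "input":
            return
        attributes = dict(attrs)  # attribute names arrive lowercased
        if attributes.get("name") == self.wanted_name:
            self.values.append(attributes.get("value"))


def input_values(text, name):
    """Return a list of values for <input> tags whose name matches."""
    finder = InputValueFinder(name)
    finder.feed(text)
    finder.close()
    return finder.values
```

Calling input_values(text, "LastUpdated") returns a list, so an empty list means "not found" and more than one item means "duplicate". Tags inside HTML comments are not reported, since the parser hands comments to a separate handler.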

-- 
Gabriel Genellina



