Beginner Q. interrogate html object OR file search?

r0g aioe.org at technicalbloke.com
Thu Dec 3 02:24:03 EST 2009


Mark G wrote:
> Hi all,
> 
> I am new to python and don't yet know the libraries well. What would
> be the best way to approach this problem: I have a html file parsing
> script - the file sits on my harddrive. I want to extract the date
> modified from the meta-data. Should I read through lines of the file
> doing a string.find to look for the character patterns of the meta-
> tag, or should I use a DOM type library to retrieve the html element I
> want? Which is best practice? which occupies least code?
> 
> Regards, Mark


Beautiful Soup is the HTML parser of choice, partly because it handles
badly formed HTML well.

http://www.crummy.com/software/BeautifulSoup/
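
If installing a third-party library is not an option, the same "ask a parser for the element" idea can be sketched with the standard library's html.parser instead of Beautiful Soup. This is only a rough illustration: the meta tag name "date" and the helper names MetaContentParser and find_meta_content are assumptions made for the example, not something taken from Mark's file.

```python
from html.parser import HTMLParser


class MetaContentParser(HTMLParser):
    """Remember the content attribute of the first <meta> tag with a given name."""

    def __init__(self, meta_name):
        super().__init__()
        self.meta_name = meta_name.lower()
        self.value = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta" or self.value is not None:
            return
        attr_map = dict(attrs)
        if (attr_map.get("name") or "").lower() == self.meta_name:
            self.value = attr_map.get("content")


def find_meta_content(html_text, meta_name):
    """Return the content of <meta name=meta_name ...>, or None if absent."""
    parser = MetaContentParser(meta_name)
    parser.feed(html_text)
    return parser.value
```

Calling find_meta_content(open("page.html").read(), "date") would then return the raw date string, or None when the tag is missing ("page.html" is a placeholder file name).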


If the date info occurs at a consistent offset from the start of the tag
then you can use simple string slicing to snip out the date. If not
then, as others suggest, regex is your friend.
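
A rough sketch of both options follows. The tag layout in SAMPLE and the date format are assumptions made up for the example, so the prefix and the pattern would need adjusting to whatever the real file contains.

```python
import re

# Hypothetical tag layout; adjust to match the real file.
SAMPLE = '<meta name="date" content="2009/12/03 02:24:03">'

# Slicing: works only when the date always starts right after a known prefix.
PREFIX = '<meta name="date" content="'
DATE_LEN = len("YYYY/MM/DD hh:mm:ss")  # 19 characters


def date_by_slicing(line):
    start = line.find(PREFIX)
    if start == -1:
        return None
    start += len(PREFIX)
    return line[start:start + DATE_LEN]


# Regex: tolerates extra whitespace, either quoting style and any letter case.
DATE_RE = re.compile(
    r"""<meta\s+name=["']date["']\s+content=["'](\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})["']""",
    re.IGNORECASE,
)


def date_by_regex(text):
    match = DATE_RE.search(text)
    return match.group(1) if match else None
```

The slicing version is shorter but breaks as soon as the spacing or quoting in the tag changes; the regex version costs one pattern and survives those small variations.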

If you need to convert a date/time string back into a Unix-style
timestamp, chop the string into bits, convert each bit to an int, put
them into a tuple of length 9 and give that to time.mktime()...

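import time

def time_to_timestamp( t ):
    # t is fixed-width, e.g. "YYYY/MM/DD hh:mm:ss"; result is local-time seconds since the epoch
    return time.mktime( (int(t[0:4]), int(t[5:7]), int(t[8:10]),
        int(t[11:13]), int(t[14:16]), int(t[17:19]), 0, 0, 0) )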

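Note the last 3 values are hardcoded to 0. They are the weekday, day of
year and DST flag; mktime() ignores the first two, and a DST flag of 0
means the string is read as standard (non-DST) local time. Use -1 there
if you want the library to work out DST itself. The slices assume a
fixed-width string with zero-padded fields, i.e.
 YYYY/MM/DD hh:mm:ss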
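
As an alternative to slicing by hand, time.strptime() can do the chopping. The sketch below assumes the slash-separated layout shown above; DATE_FORMAT and parse_to_timestamp are names invented for the example, and the format string would need changing for other layouts.

```python
import time

DATE_FORMAT = "%Y/%m/%d %H:%M:%S"  # assumed layout; change to match your strings


def parse_to_timestamp(text, fmt=DATE_FORMAT):
    # strptime() returns a struct_time whose DST flag is -1,
    # so mktime() decides for itself whether DST applies to that local time.
    return time.mktime(time.strptime(text, fmt))
```

Unlike the fixed slices, this raises ValueError when the string does not match the format, which tends to surface malformed dates earlier.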


Roger.


