Extracting patterns after matching a regex

Tue Sep 8 11:06:20 EDT 2009

On Sep 8, 3:53 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> Mart. wrote:
> > On Sep 8, 3:14 pm, "Andreas Tawn" <andreas.t... at ubisoft.com> wrote:
> >>>>> Hi,
> >>>>> I need to extract a string after matching a regular expression. For
> >>>>> example I have the string...
> >>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> >>>>> and once I match "FTPHOST" I would like to extract
> >>>>> "e4ftl01u.ecs.nasa.gov". I am not sure as to the best approach to the
> >>>>> problem, I had been trying to match the string using something like
> >>>>> this:
> >>>>> m = re.findall(r"FTPHOST", s)
> >>>>> But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov"
> >>>>> part. Perhaps I need to find the string and then split it? I had some
> >>>>> help with a similar problem, but now I don't seem to be able to
> >>>>> transfer that to this problem!
> >>>>> Thanks in advance for the help,
> >>>>> Martin
> >>>> No need for regex.
> >>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> >>>> if "FTPHOST" in s:
> >>>>     return s[9:]
> >>>> Cheers,
> >>>> Drea
> >>> Sorry perhaps I didn't make it clear enough, so apologies. I only
> >>> presented the example  s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
> >>> thought this easily encompassed the problem. The solution presented
> >>> works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
> >>> when I used this on the actual file I am trying to parse I realised it
> >>> is slightly more complicated as this also pulls out other information,
> >>> for example it prints
> >>> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
> >>> 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
> >>> 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
> >>> etc. So I need to find a way to stop it before the \r.
> >>> Slicing the string wouldn't work in this scenario, as I can envisage a
> >>> situation where the string length increases and I would prefer not to
> >>> keep having to change the string.
> >> If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(":")[1].strip() will work.
>
> > It is an email which contains information before and after the main
> > section I am interested in, namely...
>
> > FINISHED: 09/07/2009 08:42:31
>
> > MEDIATYPE: FtpPull
> > MEDIAFORMAT: FILEFORMAT
> > FTPHOST: e4ftl01u.ecs.nasa.gov
> > FTPDIR: /PullDir/0301872638CySfQB
> > Ftp Pull Download Links:
> >ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
> > Down load ZIP file of packaged order:
> >ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
> > FTPEXPR: 09/12/2009 08:42:31
> > MEDIA 1 of 1
> > MEDIAID:
>
> > I have been doing this to turn the email into a string
>
> > email = sys.argv[1]
> > f = open(email, 'r')
> > s = str(f.readlines())
>
> To me that seems a strange thing to do. You could just read the entire
> file as a string:
>
>      f = open(email, 'r')
>      s = f.read()
>
> > so FTPHOST isn't the first element, it is just part of a larger
> > string. When I turn the email into a string it looks like...
>
> > 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
> > 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
> > 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
> > \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
> > load ZIP file of packaged order:\r\n',
>
> > So not sure splitting it like you suggested works in this case.
>
>

Within the file is a list of files, e.g.

TOTAL FILES: 2
		FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
		FILESIZE: 11028908

		FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
		FILESIZE: 18975

and what I want to do is get the FTP address from the file and collect
the names of these files so I can pull them down from the web, e.g.

MOD13A2.A2007033.h17v08.005.2007101023605.hdf
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
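
Since each of these names sits on its own FILENAME: line, one option might be to key on that label rather than on the shape of the name. The following is only a rough sketch in Python 3, assuming the email has been read into a single string with f.read() as suggested above; `sample` and `find_filenames` are names made up here for illustration.

```python
import re

# Hypothetical sample, copied from the file listing in the email.
sample = (
    "TOTAL FILES: 2\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\r\n"
    "\t\tFILESIZE: 11028908\r\n"
    "\r\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\r\n"
    "\t\tFILESIZE: 18975\r\n"
)


def find_filenames(text):
    """Return every value that follows a 'FILENAME:' label, in order."""
    # [ \t]* skips spaces or tabs after the colon without crossing a line
    # ending; \S+ stops at the first whitespace character, so the trailing
    # \r\n is never captured.
    return re.findall(r"FILENAME:[ \t]*(\S+)", text)


for name in find_filenames(sample):
    print(name)
```

This returns both the .hdf and the .hdf.xml names directly, so there is nothing to slice off afterwards, and it does not depend on the names keeping a fixed width.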

Thus far I have

#!/usr/bin/env python

import sys
import re
import urllib

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())
m = re.findall(r"MOD....\.........\.h..v..\.005\..............\....\....", s)

ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
ftpdir  = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
url = 'ftp://' + ftphost + ftpdir

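for i in xrange(len(m)):
    print i, ':', len(m)
    file1 = m[i][:-4]    # the .hdf file: strip the trailing ".xml"
    file2 = m[i]         # the matching .hdf.xml metadata file

    # url is only the directory, so append the file name to fetch each file
    urllib.urlretrieve(url + '/' + file1, file1)
    urllib.urlretrieve(url + '/' + file2, file2)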

which works. Clearly my match for the MOD13A2* files isn't ideal, but
the file names will always have the same field widths, so it should
work. Any suggestions on how to improve this are appreciated.
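
One direction I could imagine, following the advice to read the file with f.read(), is to match FTPHOST and FTPDIR against the raw text and stop at the line ending, instead of matching a literal \r in the str() of a list. This is a minimal sketch in Python 3 under that assumption; `sample`, `get_field` and `build_urls` are hypothetical names, and nothing is downloaded here.

```python
import re

# Hypothetical sample of the raw email text, as f.read() would return it.
sample = (
    "MEDIAFORMAT: FILEFORMAT\r\n"
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
    "Ftp Pull Download Links: \r\n"
)


def get_field(label, text):
    """Return the value after 'label:' at the start of a line, or None."""
    # re.MULTILINE lets ^ match at the start of every line; [^\r\n]* stops
    # at the line ending, so no trailing \r ends up in the result.
    pattern = r"^%s:[ \t]*([^\r\n]*)" % re.escape(label)
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1).strip() if match else None


def build_urls(text, filenames):
    """Join host, directory and each file name into a full ftp:// URL."""
    host = get_field("FTPHOST", text)
    directory = get_field("FTPDIR", text)
    return ["ftp://%s%s/%s" % (host, directory, name) for name in filenames]


urls = build_urls(sample, ["MOD13A2.A2007033.h17v08.005.2007101023605.hdf"])
print(urls[0])
```

Each resulting URL could then be passed to urllib.request.urlretrieve (the Python 3 location of urlretrieve) together with the local file name. A real script would probably also want to check that get_field did not return None before building the URLs.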

Thanks.