Extracting patterns after matching a regex

Fri Sep 11 15:50:14 EDT 2009

On Sep 9, 4:58 pm, Al Fansome <al_fans... at hotmail.com> wrote:
> Mart. wrote:
> > On Sep 8, 4:33 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> >>Mart. wrote:
> >>> On Sep 8, 3:53 pm, MRAB <pyt... at mrabarnett.plus.com> wrote:
> >>>>Mart. wrote:
> >>>>> On Sep 8, 3:14 pm, "Andreas Tawn" <andreas.t... at ubisoft.com> wrote:
> >>>>>>>>> Hi,
> >>>>>>>>> I need to extract a string after matching a regular expression. For
> >>>>>>>>> example I have the string...
> >>>>>>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> >>>>>>>>> and once I match "FTPHOST" I would like to extract
> >>>>>>>>> "e4ftl01u.ecs.nasa.gov". I am not sure as to the best approach to the
> >>>>>>>>> problem, I had been trying to match the string using something like
> >>>>>>>>> this:
> >>>>>>>>> m = re.findall(r"FTPHOST", s)
> >>>>>>>>> But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov"
> >>>>>>>>> part. Perhaps I need to find the string and then split it? I had some
> >>>>>>>>> help with a similar problem, but now I don't seem to be able to
> >>>>>>>>> transfer that to this problem!
> >>>>>>>>> Thanks in advance for the help,
> >>>>>>>>> Martin
> >>>>>>>> No need for regex.
> >>>>>>>> s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
> >>>>>>>> if "FTPHOST" in s:
> >>>>>>>>     return s[9:]
> >>>>>>>> Cheers,
> >>>>>>>> Drea
> >>>>>>> Sorry perhaps I didn't make it clear enough, so apologies. I only
> >>>>>>> presented the example  s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
> >>>>>>> thought this easily encompassed the problem. The solution presented
> >>>>>>> works fine for this i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
> >>>>>>> when I used this on the actual file I am trying to parse I realised it
> >>>>>>> is slightly more complicated as this also pulls out other information,
> >>>>>>> for example it prints
> >>>>>>> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
> >>>>>>> 'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
> >>>>>>> 0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',
> >>>>>>> etc. So I need to find a way to stop it before the \r.
> >>>>>>> Slicing the string wouldn't work in this scenario, as I can envisage a
> >>>>>>> situation where the string length increases and I would prefer not to
> >>>>>>> keep having to change the slice index.
> >>>>>> If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(":")[1].strip() will work.
> >>>>> It is an email which contains information before and after the main
> >>>>> section I am interested in, namely...
> >>>>> FINISHED: 09/07/2009 08:42:31
> >>>>> MEDIATYPE: FtpPull
> >>>>> MEDIAFORMAT: FILEFORMAT
> >>>>> FTPHOST: e4ftl01u.ecs.nasa.gov
> >>>>> FTPDIR: /PullDir/0301872638CySfQB
> >>>>> Ftp Pull Download Links:
> >>>>> ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
> >>>>> Down load ZIP file of packaged order:
> >>>>> ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
> >>>>> FTPEXPR: 09/12/2009 08:42:31
> >>>>> MEDIA 1 of 1
> >>>>> MEDIAID:
> >>>>> I have been doing this to turn the email into a string
> >>>>> email = sys.argv[1]
> >>>>> f = open(email, 'r')
> >>>>> s = str(f.readlines())
> >>>> To me that seems a strange thing to do. You could just read the entire
> >>>> file as a string:
> >>>>      f = open(email, 'r')
> >>>>      s = f.read()
> >>>>> so FTPHOST isn't the first element, it is just part of a larger
> >>>>> string. When I turn the email into a string it looks like...
> >>>>> 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
> >>>>> 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
> >>>>> 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r
> >>>>> \n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n', 'Down
> >>>>> load ZIP file of packaged order:\r\n',
> >>>>> So not sure splitting it like you suggested works in this case.
> >>> Within the file is a list of files, e.g.
> >>> TOTAL FILES: 2
> >>>            FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
> >>>            FILESIZE: 11028908
> >>>            FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
> >>>            FILESIZE: 18975
> >>> and what I want to do is get the ftp address from the file and collect
> >>> these files to pull down from the web e.g.
> >>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf
> >>> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
> >>> Thus far I have
> >>> #!/usr/bin/env python
> >>> import sys
> >>> import re
> >>> import urllib
> >>> email = sys.argv[1]
> >>> f = open(email, 'r')
> >>> s = str(f.readlines())
> >>> m = re.findall(r"MOD....\.........\.h..v..\.005\..............\....\....", s)
> >>> ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
> >>> ftpdir  = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
> >>> url = 'ftp://' + ftphost + ftpdir
> >>> for i in xrange(len(m)):
> >>>    print i, ':', len(m)
> >>>    file1 = m[i][:-4]               # remove xml bit.
> >>>    file2 = m[i]
> >>>    urllib.urlretrieve(url, file1)
> >>>    urllib.urlretrieve(url, file2)
> >>> which works. Clearly my match for the MOD13A2* files isn't ideal, I
> >>> guess, but they will always occupy those dimensions, so it should
> >>> work. Any suggestions on how to improve this are appreciated.
> >> Suppose the file contains your example text above. Using 'readlines'
> >> returns a list of the lines:
>
> >>  >>> f = open(email, 'r')
> >>  >>> lines = f.readlines()
> >>  >>> lines
> >> ['TOTAL FILES: 2\n', '\t\tFILENAME:
> >> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE:
> >> 11028908\n', '\n', '\t\tFILENAME:
> >> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE:
> >> 18975\n']
>
> >> Using 'str' on that list then converts it to a string _representation_
> >> of that list:
>
> >>  >>> str(lines)
> >> "['TOTAL FILES: 2\\n', '\\t\\tFILENAME:
> >> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n', '\\t\\tFILESIZE:
> >> 11028908\\n', '\\n', '\\t\\tFILENAME:
> >> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\\n', '\\t\\tFILESIZE:
> >> 18975\\n']"
>
> >> That just makes parsing a lot more difficult.
>
> >> It's much easier to just read the entire file as a single string and
> >> then parse that:
>
> >>  >>> f = open(email, 'r')
> >>  >>> s = f.read()
> >>  >>> s
> >> 'TOTAL FILES: 2\n\t\tFILENAME:
> >> MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n\t\tFILESIZE:
> >> 11028908\n\n\t\tFILENAME:
> >> MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n\t\tFILESIZE: 18975\n'
> >>  >>> import re
> >>  >>> re.findall(r"FILENAME: (.+)", s)
> >> ['MOD13A2.A2007033.h17v08.005.2007101023605.hdf',
> >> 'MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml']
>
> > If I do it this way I can't seem to not extract the \r at the end of
> > the line.
>
> > In [26]: m = re.search(r"FTPHOST: (.+)", s)
>
> > In [27]: m.group(1)
> > Out[27]: 'e4ftl01u.ecs.nasa.gov\r'
>
> > but if I insert \\r at the end as was previously suggested.
>
> > In [28]: m = re.search(r"FTPHOST: (.+)\\r", s)
>
> > In [29]: m.group(1)
>
> > AttributeError: 'NoneType' object has no attribute 'group'
>
> > Any thoughts?
>
> > Thanks
>
> Just use \r at the end, not \\r. \r is the carriage return character,
> which ends the line. \\r becomes two characters, the character backslash
> "\", followed by the character "r".
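
A minimal sketch of the difference described above, written for Python 3 (the thread itself uses Python 2). The sample string is made up from the lines quoted earlier in the thread and the variable names are only illustrative. It also shows why the earlier str(f.readlines()) version needed \\r: the string representation of the list really does contain a backslash followed by "r".

```python
import re

# Sample text with CRLF line endings, modelled on the email in the thread.
s = "FTPHOST: e4ftl01u.ecs.nasa.gov\r\nFTPDIR: /PullDir/0301872638CySfQB\r\n"

# In a raw pattern, \r is read by the regex engine as a carriage return,
# so the capture group stops just before the line ending.
host = re.search(r"FTPHOST: (.+)\r", s).group(1)

# \\r asks for a literal backslash followed by "r", which s does not contain,
# so re.search returns None (hence the AttributeError on .group).
literal = re.search(r"FTPHOST: (.+)\\r", s)

# str() of a list of lines holds the *repr* of each line, where the carriage
# return is spelled out as the two characters backslash and "r".
as_list_repr = str(s.splitlines(True))
host_from_repr = re.search(r"FTPHOST: (.*?)\\r", as_list_repr).group(1)

print(repr(host))
print(literal)
print(repr(host_from_repr))
```

So the same host name comes out both ways, but only because the pattern has to match whichever form of the text it is given.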

Excellent, thanks. Sorry, I thought I had to escape it to access it. If
it helps anyone, the script is as follows. Many thanks to all for the
help.

#!/usr/bin/env python
import sys
import re
import urllib

email = sys.argv[1]
f = open(email, 'r')
s = f.read()

# match the modis files...
m = re.findall(r"FILENAME: (.+)\r", s)

# get the ftp locations?
ftphost = re.search(r"FTPHOST: (.+)\r", s).group(1)
ftpdir  = re.search(r"FTPDIR: (.+)\r", s).group(1)
url = 'ftp://' + ftphost + ftpdir

for i, modis_file in enumerate(m):
	print i, ':', len(m)	# counter
	# fetch the file itself, not just the directory URL
	urllib.urlretrieve(url + '/' + modis_file, modis_file)
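
For anyone on Python 3, the same idea might be sketched roughly as below. This is not a tested replacement for the script above: parse_order and download_order are hypothetical names, and it assumes each FILENAME entry sits directly inside the FTPDIR directory on FTPHOST. The character class [^\r\n]+ stops at either kind of line ending, so it should not matter whether the file is read with \r\n or plain \n.

```python
import re
import urllib.request


def parse_order(text):
    """Return (url, filename) pairs for every FILENAME entry in the email text."""
    # [^\r\n]+ captures up to, but not including, the end of the line.
    ftphost = re.search(r"FTPHOST: ([^\r\n]+)", text).group(1).strip()
    ftpdir = re.search(r"FTPDIR: ([^\r\n]+)", text).group(1).strip()
    names = [n.strip() for n in re.findall(r"FILENAME: ([^\r\n]+)", text)]
    # Assumption: the files live directly under ftp://<host><dir>/
    base = "ftp://" + ftphost + ftpdir
    return [(base + "/" + name, name) for name in names]


def download_order(path):
    """Read the order email at `path` and download each listed file."""
    with open(path) as f:
        pairs = parse_order(f.read())
    for i, (url, name) in enumerate(pairs, 1):
        print(i, ":", len(pairs))  # counter
        urllib.request.urlretrieve(url, name)
```

Calling download_order("order_email.txt") (a made-up file name) would then fetch the files one by one. Keeping the parsing in its own function means it can be checked on a sample string without touching the network.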