string, not re (was Re: Re and Me.)

Alex Martelli aleax at aleax.it
Mon Aug 27 07:57:43 EDT 2001


"epoch7" <epoch7 at nocharmailter.net> wrote in message
news:toevhkp1hlem92 at corp.supernews.com...
> This is day 2 of my programming experience, so please bear with my
> newbie-ishness.
> I'm trying to search for a string inside a file.  The string will be

It's easier if you have the file's contents in memory, but that's easy
as long as it's not so huge as to overwhelm your RAM (so, some tens of
megabytes should be OK, but many hundreds of megabytes probably not):
    thestring = thefile.read()
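
In case it helps, here is a minimal sketch of getting thestring bound in
the first place.  The name path and the temporary file are my own
assumptions, made up only so the snippet runs by itself; in real use you
would open your own file instead.

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained;
# in real use, path would name your own file (hypothetical here).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as f:
    f.write('a=1&url=http://example.com/x&b=2')

# Read the whole file into one string; the with statement closes it.
with open(path) as thefile:
    thestring = thefile.read()

os.remove(path)           # clean up the throwaway file
print(len(thestring))     # number of characters read
```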

Alternatively, look into the mmap module (on Windows and Unix-like
systems, only) to get a "memory mapping" of your file, so you can,
again, treat it basically like a big in-memory string (this works
for file sizes up to the space available in your *virtual* memory:
hundreds of megabytes will be OK, though gigabytes may start being
trouble, depending on your platform).  mmap doesn't QUITE give you
a string (you don't get string methods, sigh...), but...
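
A rough sketch of the mmap route, for what it's worth.  The temporary
file and the names path, themap and value are assumptions of mine, just
for illustration.  Note that in current Python 3 a memory map deals in
bytes rather than text, so the substrings are written as b'...'; the
mmap object does offer its own find method, which is all we need here.

```python
import mmap
import os
import tempfile

# Throwaway file so the sketch runs on its own (hypothetical data).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'a=1&url=http://example.com/x&b=2')

with open(path, 'rb') as thefile:
    # Length 0 means "map the whole file"; read-only access.
    themap = mmap.mmap(thefile.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        where = themap.find(b'url=')       # mmap's own find method
        end = themap.find(b'&', where)     # first '&' after the match
        value = themap[where + 4:end]      # slicing a map gives bytes
    finally:
        themap.close()

os.remove(path)
print(value)
```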

Anyway, I'll assume in the following that name thestring is bound
to one big string containing the whole file's contents.

> different
> but the characters marking its beginning and end are the same (url= and
> &).
> I've spent all day and night just learning the ins and outs of python so

An excellent way to spend time:-).

> this is still a little beyond me.
> re.split() would return the values between just one pattern correct? So

Yep -- avoid module re at the start, unless you really can't do without
it; and often, like here, you can.

All you need is one method of string objects, and I quote from
the library manual (an indispensable friend!)...:

find(sub[, start[, end]])
    Return the lowest index in the string where substring sub
    is found, such that sub is contained in the range [start, end).
    Optional arguments start and end are interpreted as in slice
    notation. Return -1 if sub is not found.

So basically:

where = thestring.find('url=', start)

binds name 'where' to an integer that is the lowest index >= start
such that the substring of thestring starting there equals 'url=',
or -1 if there is no such index (no occurrence of the substring
'url=' within thestring[start:]).
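
A tiny illustration of how the start argument moves the search along;
the sample string is made up for the purpose:

```python
thestring = 'a=1&url=first&b=2&url=second&c=3'

first = thestring.find('url=', 0)              # lowest index >= 0
second = thestring.find('url=', first + 1)     # resume just past the first hit
missing = thestring.find('url=', second + 1)   # nothing left to find: -1

print(first, second, missing)
```

Each call hands the previous result (plus a bit) back in as start, which
is exactly what a loop over all occurrences has to do.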

You want all occurrences of strings bracketed between a starting
'url=' and an ending '&', so you need a loop.  You need a variable
to keep track of the 'start', too, of course -- and you know it
will have to begin as 0, and will be -1 when no more occurrences
are found.

So, putting this all together:

results = []        # empty list, at the beginning
start = 0           # start from the start of thestring
while start >= 0:
    start = thestring.find('url=', start)
    if start >= 0:
        start += 4     # skip string 'url=' itself
        end = thestring.find('&', start)
        if end < 0:    # no ending-'&', what now?
            # presumably means "all the rest of thestring"...?
            results.append(thestring[start:])
            start = -1 # nothing more to do
        else:          # found the ending
            results.append(thestring[start:end])
            start = end+1

Now you have in list results the substrings of interest,
i.e. those that were bracketed between a leading 'url='
and a trailing '&'.  Most of the complication comes from
the issue of "what if there IS no trailing & after the
last match for 'url='...?" -- if that's no issue for
you, this can be simplified a lot, e.g. to:

results = []        # empty list, at the beginning
start = 0           # start from the start of thestring
while 1:
    start = thestring.find('url=', start)
    if start<0: break
    start += 4      # skip string 'url=' itself
    end = thestring.find('&', start)
    if end<0: break # should never happen, but...
    results.append(thestring[start:end])
    start = end+1

The while 1:/break is a more natural way to code the
loop, and ignoring a match for 'url=' that is not
followed by any '&' is much simpler than accounting
for it -- as you can notice, since there is now no
nesting.  Or you could use method .index() instead
of method .find(): it works the same, except that
when no substring is found, it raises a ValueError
exception rather than returning -1.  You can then
save the "if ...<0" tests in exchange for one try
statement:

results = []        # empty list, at the beginning
start = 0           # start from the start of thestring
try:
    while 1:
        start = 4+thestring.index('url=', start)
        end = thestring.index('&', start)
        results.append(thestring[start:end])
        start = end+1
except ValueError:
    pass

Since you save the "if start < 0" test, you can
now inline the "4+" part that's needed to skip
the leading occurrence of 'url=' itself on all
matches.  Or you could have the try inside the
while, rather than vice versa:

results = []        # empty list, at the beginning
start = 0           # start from the start of thestring
while 1:
    try:
        start = 4+thestring.index('url=', start)
        end = thestring.index('&', start)
        results.append(thestring[start:end])
        start = end+1
    except ValueError:
        break

depending on whatever looks clearer to you (I have
a slight suspicion that having the try outside may be
slightly faster, but, not having *measured*, I would
not dare hazard a guess -- clarity and simplicity
are MUCH more important than speed in most cases,
anyway -- optimize only after measuring and *IF* you
know you really have to make the code faster).
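
If you ever do want to measure, here is one possible sketch using the
standard timeit module.  The function names try_outside and try_inside,
the sample data and the repeat count are all assumptions of mine, and
the numbers you get will depend on your machine and Python version, so
treat this as a way to find out rather than as an answer.

```python
import timeit

def try_outside(thestring):
    # The try statement wraps the whole loop.
    results = []
    start = 0
    try:
        while 1:
            start = 4 + thestring.index('url=', start)
            end = thestring.index('&', start)
            results.append(thestring[start:end])
            start = end + 1
    except ValueError:
        pass
    return results

def try_inside(thestring):
    # The try statement is entered afresh on every iteration.
    results = []
    start = 0
    while 1:
        try:
            start = 4 + thestring.index('url=', start)
            end = thestring.index('&', start)
            results.append(thestring[start:end])
            start = end + 1
        except ValueError:
            break
    return results

# Made-up test data: 10000 bracketed values in one big string.
sample = 'x=1&url=http://example.com/page&' * 10000

# Both variants must agree before timing them means anything.
assert try_outside(sample) == try_inside(sample)

for func in (try_outside, try_inside):
    seconds = timeit.timeit(lambda: func(sample), number=20)
    print(func.__name__, seconds)
```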


Alex





