prefix search on a large file

Thu Oct 12 06:08:57 EDT 2006

js  wrote:
> Hello, list.
>
>  I have a list of sentence in text files that I use to filter-out some data.
> I managed the list so badly that now it's become literally a mess.
>
> Let's say the list has a sentence below
>
> 1. "Python has been an important part of Google since the beginning,
> and remains so as the system grows and evolves. "
>
> 2. "Python has been an important part of Google"
>
> 3. "important part of Google"
>
> As you see sentence 2 is a subset of sentence 1
> so I don't need to have sentence 1 on the list.

Are you 100% sure that you wnat to throw away the *longer* sentence?

> (For some reason, it's no problem to have sentence 3.
> Only sentence that has the "same prefix part" is the one I want to remove)
>
> So I decided to clean up the list.
>
> I tried to do this simple brute-force manner,  like

Please don't waste time and bandwith with "like"; post the exact code
that you executed.

>
> ---------------------------------------------------------------------------
> sorted_list = sorted(file('thelist'), key=len)
> for line in sorted_list[:]
# won't compile, missing a colon at end
>   unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
>   sorted_list = list(set(sorted_list) - (unneeded))
#  can't do set - list
# Presuming you mean - set(unneeded), but then produces an empty list
(see later).
# Naming the output "sorted_list" is misleading advertising.
# With sorted_list[:] in an inner loop, it's no wonder that it runs
slowly.
# Test your code on small samples of data and ensure that it is working
correctly before you run it on huge amounts of data.
# Be careful of end cases; note that in my solutions below, the first
or last item needs special handling.

> ....
> ---------------------------------------------------------------------------
>
> This is so slow and not so helpful because the list is
> so big(more than 100M bytes and has about 3 million lines)
> and I have more than 100 lists.

Here's one way of doing it, tested to the extent shown:

C:\junk>type prefixdel.py
from pprint import pprint as pp
data = [
    'foo bar baz',
    'foo bar',
    'foo',
    'food',
    'food', # duplicate
    'xyzzy',
    'plugh',
    'xyzzy and plugh are magic',
    'zzzz',
    ]

"""
> sorted_list = sorted(file('thelist'), key=len)
> for line in sorted_list[:]
>   unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
>   sorted_list = list(set(sorted_list) - set(unneeded))
"""
slist = sorted(data, key=len)
print "OP trial 1 input"; pp(slist)
for line in slist[:]:
    unneeded = [line2 for line2 in slist[:] if line2.startswith(line)]
    slist = list(set(slist) - set(unneeded))
print "OP trial 1 output"; pp(slist)

ilist = sorted(data)
print "sorted input"; pp(ilist)

olist = []
for i in xrange(len(ilist)-1, 0, -1):
    if ilist[i].startswith(ilist[i-1]):
        continue
    olist.append(ilist[i])
olist.append(ilist[0])
print "output (keeping shorter)"; pp(olist)

olist = []
for i in xrange(len(ilist)-1):
    if ilist[i+1].startswith(ilist[i]):
        continue
    olist.append(ilist[i])
olist.append(ilist[-1])
print "output (keeping longer)"; pp(olist)

C:\junk>prefixdel.py
OP trial 1 input
['foo',
 'food',
 'food',
 'zzzz',
 'xyzzy',
 'plugh',
 'foo bar',
 'foo bar baz',
 'xyzzy and plugh are magic']
OP trial 1 output
[]
sorted input
['foo',
 'foo bar',
 'foo bar baz',
 'food',
 'food',
 'plugh',
 'xyzzy',
 'xyzzy and plugh are magic',
 'zzzz']
output (keeping shorter)
['zzzz', 'xyzzy', 'plugh', 'food', 'foo']
output (keeping longer)
['foo bar baz', 'food', 'plugh', 'xyzzy and plugh are magic', 'zzzz']

HTH,
John