prefix search on a large file

js ebgssth at gmail.com
Thu Oct 12 04:45:27 EDT 2006


 Hello, list.

 I have a list of sentences in text files that I use to filter out some data.
I managed the list so badly that it has become a real mess.

Let's say the list contains the sentences below:

1. "Python has been an important part of Google since the beginning,
and remains so as the system grows and evolves. "

2. "Python has been an important part of Google"

3. "important part of Google"

As you can see, sentence 2 is a prefix of sentence 1,
so I don't need to have sentence 1 on the list.
(For some reason, it's no problem to have sentence 3.
The only sentences I want to remove are the ones that begin with
another sentence on the list.)
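
To make the rule concrete, here is a minimal sketch applied to the three example sentences. The helper name `has_shorter_prefix` is hypothetical and only illustrates the rule; it is not part of the original script.

```python
# Hypothetical helper illustrating the rule; not part of the original script.
def has_shorter_prefix(sentence, sentences):
    """Return True if some other entry in `sentences` is a prefix of `sentence`."""
    return any(
        other != sentence and sentence.startswith(other)
        for other in sentences
    )

examples = [
    "Python has been an important part of Google since the beginning, "
    "and remains so as the system grows and evolves. ",
    "Python has been an important part of Google",
    "important part of Google",
]

# Sentence 1 starts with sentence 2, so it is dropped.
# Sentence 3 appears inside the others but is not a prefix of them, so nothing
# is removed on its account.
kept = [s for s in examples if not has_shorter_prefix(s, examples)]
```

Under this reading of the rule, `kept` holds sentences 2 and 3 only.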

So I decided to clean up the list.

I tried to do this in a simple brute-force manner, like this:

---------------------------------------------------------------------------
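# Shortest lines first; strip newlines so a short line can match as a prefix
sorted_list = sorted((l.rstrip('\n') for l in open('thelist')), key=len)
for line in sorted_list[:]:
  # every *other* line that starts with this one is unneeded
  unneeded = [ line2 for line2 in sorted_list
               if line2 != line and line2.startswith(line) ]
  sorted_list = list(set(sorted_list) - set(unneeded))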
....
---------------------------------------------------------------------------

This is very slow and not very helpful because the list is
so big (more than 100 MB, about 3 million lines),
and I have more than 100 such lists.
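
For comparison with the brute-force attempt above, one possible direction is to sort the lines lexicographically instead of by length. In that order every line that starts with a given prefix sits in one contiguous run directly after the prefix, so a single pass that remembers the last kept line is enough. The following is only a sketch under some assumptions: the function name `drop_prefixed_lines` is hypothetical, blank lines are ignored, exact duplicates are treated as redundant, the whole file is assumed to fit in memory, and the output comes back in sorted order rather than the original order.

```python
def drop_prefixed_lines(lines):
    """Keep only lines that do not start with another line in `lines`."""
    kept = []
    last = None
    # In lexicographic order, all lines sharing a prefix directly follow it.
    for line in sorted(l.rstrip("\n") for l in lines):
        if not line:
            continue  # assumption: blank lines are ignored
        if last is not None and line.startswith(last):
            continue  # covered by a kept prefix (or an exact duplicate)
        kept.append(line)
        last = line
    return kept


sample = [
    "Python has been an important part of Google since the beginning, "
    "and remains so as the system grows and evolves. \n",
    "Python has been an important part of Google\n",
    "important part of Google\n",
]
result = drop_prefixed_lines(sample)
```

The cost is dominated by the sort, roughly O(n log n) string comparisons, instead of comparing every line against every other line. For a real file, something like `drop_prefixed_lines(open('thelist'))` would be the equivalent call, with `'thelist'` being the file name used in the snippet above.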

I'm not familiar with algorithms, data structures, or large-scale data processing,
so any advice, suggestions, or recommendations will be appreciated.

Thank you in advance.


