"Newbie" questions - "unique" sorting ?
Cousin Stanley
CousinStanley at hotmail.com
Mon Jun 23 23:35:59 EDT 2003
John ....
{ 1. Good News | 2. Bad News | 3. Good News } ....
1. Good News ....
The last version of word_list.py that I uploaded
works as expected with your input file, producing
an indexed word list with no duplicates ...
2. Bad News ....
It was S L O W E R than the proverbial turtle
in the tar pit ...
Console Output follows ....
python word_list.py JF_In.txt JF_Out.txt
word_list.py
Indexing Words .... . . . . . . . . .
Writing Output File ....
Complete .................
Total Words .... 467381
Unique Words .... 47122
Process Time ........ 23611.15 Seconds
That's 6.56 HOURS, which is unacceptable performance !!!!
word_list.py works quickly on smaller files,
but as coded, is an absolute dog for indexing
larger files ....
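A common cause of this kind of slowdown is checking every word against an ever-growing list, which makes the run time grow roughly with the square of the input size. That may or may not be what word_list.py does, so take the following only as a rough sketch of the usual alternative: collect the words in a set and sort once at the end. The function name unique_sorted_words and the plain whitespace split are assumptions for illustration, not code from the zip file ....

```python
def unique_sorted_words(lines):
    """Return the distinct whitespace-separated words found in lines, sorted."""
    seen = set()
    for line in lines:
        # set.update() silently ignores words that are already present,
        # so each word costs roughly constant time instead of a list scan
        seen.update(line.split())
    # Sort once at the end rather than keeping the collection ordered
    return sorted(seen)


if __name__ == "__main__":
    sample = ["spam eggs spam", "ham eggs"]
    print(unique_sorted_words(sample))
```

Sets need Python 2.4 or later as a built-in; on older versions a dictionary with the words as keys gives the same effect.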
3. Good News ....
Since I FINALLY figured out that you're mostly interested
in just the URLs and not a general word list,
I coded a pre-process script to extract just the URLs
from the original input file ....
python url_list.py JF_In.txt JF_URLs.txt
Then, use the generated output file from url_list.py
as input to word_list.py to produce the final sorted file
with no dups ....
python word_list.py JF_URLs.txt JF_URLs_Indexed.txt
These two steps worked quickly ...
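For anyone who wants the general idea without downloading anything, the two steps above can be sketched roughly as follows. This is not the code from url_list.py or word_list.py; the regular expression, the function names, and the one-URL-per-line output format are all assumptions made for illustration ....

```python
import re

# Assumed pattern: an http, https, or ftp scheme followed by characters
# that are not whitespace, angle brackets, or quotes. Trailing punctuation
# such as a sentence-ending period would still be captured.
URL_PATTERN = re.compile(r"(?:https?|ftp)://[^\s<>\"']+")


def extract_urls(lines):
    """Step 1: pull every URL-looking string out of the input lines."""
    urls = []
    for line in lines:
        urls.extend(URL_PATTERN.findall(line))
    return urls


def unique_sorted(items):
    """Step 2: drop duplicates and sort what is left."""
    return sorted(set(items))


def write_unique_urls(in_path, out_path):
    """Read in_path, write the sorted unique URLs to out_path, one per line."""
    with open(in_path) as infile:
        urls = unique_sorted(extract_urls(infile))
    with open(out_path, "w") as outfile:
        for url in urls:
            outfile.write(url + "\n")
    return len(urls)
```

Extracting the URLs first keeps the second step small, since only the URLs have to be de-duplicated and sorted instead of every word in the file.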
Download ....
http://fastq.com/~sckitching/Python/word_list.zip
Contains ....
[ url_list.py | word_list.py | JF_URLs_Indexed.txt ]
Let me know if this output looks closer to what you are after ....
--
Cousin Stanley
Human Being
Phoenix, Arizona