"Newbie" questions - "unique" sorting ?

Wed Jun 25 08:44:50 EDT 2003

On Mon, 23 Jun 2003 20:35:59 -0700, "Cousin Stanley"
<CousinStanley at hotmail.com> wrote:

Hi Cousin Stanley,

>{ 1. Good News | 2. Bad News | 3. Good News } ....

>  1. Good News ....

>     The last version of word_list.py that I up-loaded
>     works as expected with your input file producing
>     an indexed word list with no duplicates ...

< snip >

>      That's 6.56 HOURS and un-acceptable performance !!!!

I agree.  :-)  Very clever of you to have worked out how long it would
take. I hope you didn't wait over 6 hours to find out !!!

>      word_list.py works quickly on smaller files,
>      but as coded, is an absolute dog for indexing
>      larger files ....

Good. I was hoping it wasn't something that I had done wrong.  :-)
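
For readers following along: the source of word_list.py is not shown in this thread, so the cause of the slowdown is unknown. One common reason a "no duplicates" pass crawls on big files is testing each word against a growing list. The sketch below assumes that is the situation and contrasts it with a set-based version; both function names are hypothetical.

```python
# Hypothetical sketch; this is NOT the code from word_list.py.

def unique_sorted_slow(words):
    """Remove duplicates by scanning a list: quadratic in the number of words."""
    seen = []
    for word in words:
        if word not in seen:      # linear scan of 'seen' for every word
            seen.append(word)
    seen.sort()
    return seen


def unique_sorted(words):
    """Remove duplicates with a set: one hash lookup per word, then one sort."""
    return sorted(set(words))


words = ["pear", "apple", "pear", "fig", "apple"]
print(unique_sorted(words))
```

Both functions return the same list; only the cost differs as the input grows.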

>  3. Good News ....

>     Since I  FINALLY  figured out that you're mostly interested
>     in just the URLs and not a general word list,
>     I coded a pre-process script to extract just the URLs
>     from the original input file ....

>         python url_list.py JF_In.txt JF_URLs.txt

Unless I missed something, it picks up lines starting with ftp and
http, BUT not lines that start with www. Is that correct? Or did I
give you a file with no lines starting with www?
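
A minimal sketch of the kind of filter being discussed, in case it helps: it keeps lines that start with ftp, http or www. This is not the actual url_list.py (its source is not in this message); the function name `extract_urls` and the prefix tuple are assumptions made for illustration.

```python
# Hypothetical sketch; the real url_list.py may work quite differently.
PREFIXES = ("ftp", "http", "www")   # assumed prefixes, with "www" included


def extract_urls(lines, prefixes=PREFIXES):
    """Return the stripped lines that start with one of the prefixes.

    The comparison ignores case and leading/trailing whitespace.
    It is a loose prefix check, not a real URL validator.
    """
    urls = []
    for line in lines:
        candidate = line.strip()
        if candidate.lower().startswith(prefixes):
            urls.append(candidate)
    return urls


sample = [
    "http://www.python.org",
    "  WWW.Example.com\n",
    "ftp://ftp.example.com/pub",
    "just some text",
]
print(extract_urls(sample))
```

If the www lines are missing from the output, the first thing to check would be whether "www" is among the prefixes the script tests for.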

< snip >

>Let me know if this output looks closer to what you are after ....

Very, very good... and fast. If I can work out what happened to the
www lines, and fix it, then everything will be great. I then hope to
try this exercise using a different method to see if the numbers come
out the same.
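
One way to check whether two methods agree is to compare their outputs as sets. The sketch below assumes each method yields a list of URL strings; the function name `compare_results` and the shape of its return value are made up for this example.

```python
# Hypothetical sketch for comparing the URL lists from two different methods.

def compare_results(first, second):
    """Report which entries appear in only one of the two result lists."""
    a, b = set(first), set(second)
    return {
        "only_in_first": sorted(a - b),
        "only_in_second": sorted(b - a),
        "same_count": len(a) == len(b),
    }


report = compare_results(
    ["http://a.example", "www.b.example", "ftp://c.example"],
    ["www.b.example", "ftp://c.example", "http://d.example"],
)
print(report)
```

Note that equal counts alone do not prove the lists match, which is why the sketch also lists the entries unique to each side.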

Thank you for such excellent programming.  :-)

Regards, John.