"Newbie" questions - "unique" sorting ?

Cousin Stanley CousinStanley at hotmail.com
Mon Jun 23 23:35:59 EDT 2003


John ....

{ 1. Good News | 2. Bad News | 3. Good News } ....

  1. Good News ....

     The last version of word_list.py that I uploaded
     works as expected with your input file producing
     an indexed word list with no duplicates ...


  2. Bad News ....

     It was  S L O W E R  than the proverbial turtle
     in the tar pit ...

     Console Output follows ....

         python word_list.py JF_In.txt JF_Out.txt

             word_list.py

                 Indexing Words ....  . . . . . . . . .
                 Writing Output File ....

             Complete .................

                Total  Words .... 467381
                Unique Words .... 47122


        Process Time ........ 23611.15 Seconds

      That's 6.56 HOURS and unacceptable performance !!!!

      word_list.py works quickly on smaller files,
      but as coded, is an absolute dog for indexing
      larger files ....
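
      For what it's worth, here is a minimal sketch of one way
      to keep the indexing fast on large files. It assumes the
      slowdown comes from checking each word against a growing
      list, which is only a guess and not something confirmed
      by profiling word_list.py. The function name
      unique_sorted_words is made up for illustration and is
      not taken from the uploaded script ....

```python
def unique_sorted_words(lines):
    """Return (total_word_count, sorted list of unique words).

    Hypothetical helper, not the code in word_list.py.
    A set gives roughly constant-time membership checks, so the
    cost grows with the number of words instead of with
    (number of words) x (number of unique words so far).
    """
    seen = set()
    total = 0
    for line in lines:
        # split() with no arguments breaks on any run of whitespace
        for word in line.split():
            total += 1
            seen.add(word)
    # Sort once at the end rather than keeping the list sorted as we go
    return total, sorted(seen)
```

      Since a file object yields its lines one at a time,
      an open file can be passed in directly as  lines ....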


  3. Good News ....

     Since I  FINALLY  figured out that you're mostly interested
     in just the URLs and not a general word list,
     I coded a pre-process script to extract just the URLs
     from the original input file ....

         python url_list.py JF_In.txt JF_URLs.txt

     Then, use the generated output file from url_list.py
     as input to word_list.py to produce the final sorted file
     with no duplicates ....

         python word_list.py JF_URLs.txt JF_URLs_Indexed.txt

     These two steps worked quickly ...
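
     The two steps above can also be sketched as one small
     function. This is an illustration only and not the contents
     of url_list.py or word_list.py. The regular expression is
     an assumption about what counts as a URL, a run of
     non-blank characters starting with http:// or https://,
     and the name unique_sorted_urls is hypothetical ....

```python
import re

# Assumed pattern: http:// or https:// followed by anything that is
# not whitespace, an angle bracket, or a quote character.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def unique_sorted_urls(text):
    """Return the distinct URLs found in text, in sorted order.

    Hypothetical helper for illustration only.
    """
    # findall() returns every match; set() drops the duplicates
    return sorted(set(URL_PATTERN.findall(text)))
```

     A pattern this simple will also pick up any punctuation
     that trails directly after a URL, so real input may need
     a little clean-up afterwards ....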


Download ....

    http://fastq.com/~sckitching/Python/word_list.zip

    Contains ....

        [ url_list.py | word_list.py | JF_URLs_Indexed.txt ]

Let me know if this output looks closer to what you are after ....

-- 
Cousin Stanley
Human Being
Phoenix, Arizona





