for in benchmark interested

Thu Apr 15 19:35:51 EDT 1999

Friedrich Dominicus <Friedrich.Dominicus at inka.de> said:
> Some time ago I have opened a thread about optimization in Python. I
> have now done a simple benchmark and used Python for that too. If
> s.o is interested, take a look at: http://edt.vol.cz:81/bench

How about a unix command line implementation, like

  cat file | tr " \t\r" "\n\n\n" | sort | uniq -c | tail +2

It's a bit slower than the C version, but only by a factor of
two or so, and a lot easier to write.

Another solution might be:
  echo "" | spellin > myhlist
  spell -d myhlist < file | sort | uniq -c 

but that strips punctuation so doesn't give the same results.

A third is:
  deroff -w file | sort | uniq -c

which also doesn't give exactly what you want.  Still, it is
arguable that both these cases are better than the other solutions
given different definitions of the word "word."

BTW, the C code has some problems:

*** add (char *)

"count_words.l", line 34: error(3232): a value of type "void *"
cannot be used to initialize an entity of type "char *"
      char* str = malloc(yyleng+1); strcpy(str, yytext);
                  ^

There are a few problems with "inline" that my non-gcc C compiler
doesn't like.  I'm not sure that inline is a C keyword.

I also get a core dump when I run it:

bioreason8> ./count_words_c ./count_words.c 
Bus error (core dumped)
bioreason8> dbx ./count_words_c core
dbx version 7.1 Dec  3 1996 17:03:19
Core from signal SIGBUS: Bus error
(dbx) where
>  0 hash(string = 0x100170e9 = "line") ["/home/u05/dalke/tmp/hash.c":104, 0x100037c8]
   1 search(table = 0x10017010, key = 0x100170e8 = "#line")
["/home/u05/dalke/tmp/hash.c":181, 0x10003b00]
   2 run() ["/home/u05/dalke/tmp/count_words.l":35, 0x100033e0]
   3 yylex() ["/home/u05/dalke/tmp/count_words.l":27, 0x1000198c]
   4 main(argc = 2, argv = 0x7fff2f24) ["/home/u05/dalke/tmp/count_words.l":59,
0x100034fc]
   5 __start()
["/vince/6.2-mar09/work/irix/lib/libc/libc_n32_M3/csu/crt1text.s":166,
0x100015e8]
(dbx) list 92
    92  /*
    93   Hashes a string to produce an unsigned short, which should be
    94   sufficient for most purposes.
    95  */
    96  
    97  unsigned hash(char *string)
    98  {
    99        unsigned int ret_val = 0;
   100        int i;
   101  
(dbx) l
   102        while (*string)
   103        {
 * 104              i = *( int *)string;
   105              ret_val ^= i;
   106              ret_val <<= 1;
   107              string ++;
   108        }
   109        return ret_val;
   110  }
   111  

The conversion on line 104 is really wrong.  It should likely be

  i = (int)(*string)

Also, I tried it (with the fix) against a core dump:
bioreason8> ls -l /usr/tmp/core
-rw-r--r--    1 root     sys      5357568 Apr  8 00:45 /usr/tmp/core
bioreason8> ./count_words_c /usr/tmp/core > /dev/null

After about 6 minutes I killed the job.  Probably doesn't like the
embedded NULs.  This is an example of "fuzz" testing.  See 
http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-95-1268

Finally, I see the lex code is processed with:

count_words.c: count_words.l
        flex -B -ocount_words.c count_words.l

You should add a "-f" to that.  That gave me a 15% speedup over just -B, against
the code at http://caml.inria.fr/caml-list/0899.html, from whence
you based this thread originally.

						Andrew
						dalke at acm.org