Optimizing a text statistics function

Neil Benn benn at cenix-bioscience.com
Wed Apr 21 12:18:09 EDT 2004


Hello,

    I don't know Python as well as most people on this list, so this is 
a leading question.

    In other languages I've used (mainly Java, although some C, C# and VB 
<wash out your mouth>), the way I would look at speeding this up is to 
avoid loading all the words into memory in one go and then working upon 
them.  I'd create one stream which reads through the file and passes each 
word it finds from the lexing (breaking the input into tokens) on to a 
listener, then another stream listening to this which sorts out the 
detail from those tokens (parsing), and finally an output stream which 
puts the data wherever it needs to be (DB, screen, file, etc.).  This 
means the program would scale better (if you passed the European voting 
register through your system it would take far longer, as you must scan 
the information twice).
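    For what it's worth, here is roughly what I mean as a minimal sketch 
in Python using generators (the command-line argument and the output 
format are just made up for illustration, not taken from the original 
program):

import string
import sys

def words(stream):
    # "lexing" stage: read the file a line at a time and hand each
    # lowercased, punctuation-stripped word on to whoever is listening
    for line in stream:
        for token in line.split():
            word = token.strip(string.punctuation).lower()
            if word:
                yield word

def count(tokens):
    # "parsing" stage: consume the word stream and keep only the counts,
    # never the whole text
    counts = {}
    for word in tokens:
        counts[word] = counts.get(word, 0) + 1
    return counts

def report(counts, out=sys.stdout):
    # output stage: put the result wherever it needs to be (screen here)
    pairs = [(n, w) for (w, n) in counts.items()]
    pairs.sort()
    pairs.reverse()
    for n, w in pairs:
        out.write("%6d %s\n" % (n, w))

if __name__ == "__main__":
    report(count(words(open(sys.argv[1]))))

Each stage only ever holds the current word (plus the dictionary of 
counts), so the memory used doesn't grow with the size of the text.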

    However, as the more experienced Python programmers here have not 
suggested this, is that because:

a.  there is something I'm not getting about Python text handling, or
b.  it's not easy/possible in Python?

    I ask this because I've been looking at (C)StringIO and it is OK for 
this kind of behaviour using strings (reading from a serial port and 
passing on to middleware), but it doesn't seem to have the ability to 
remove characters from the buffer once they have been read, so the 
buffer will grow and grow as the process runs.  For now I'm having to 
use strings, which is less than ideal because they are immutable (I was 
thinking of writing my own StringBuffer class which will discard 
characters once they have been read from it), and therefore my program 
scales badly.
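
    Something like the following is the sort of StringBuffer I mean - 
just a rough sketch of my own, not an existing module, and the names are 
made up:

class StringBuffer:
    # A crude buffer that forgets characters once they have been read,
    # so it does not keep growing for the lifetime of the process.

    def __init__(self):
        self._chunks = []       # pieces written but not yet read

    def write(self, data):
        # appending to a list avoids copying the whole buffer on every write
        self._chunks.append(data)

    def read(self, size):
        # hand back up to `size` characters and throw them away afterwards
        data = "".join(self._chunks)
        out = data[:size]
        rest = data[size:]
        if rest:
            self._chunks = [rest]
        else:
            self._chunks = []
        return out

So after buf.write("hello world"), buf.read(5) hands back "hello" and 
the buffer no longer holds those characters.  Joining the chunks on 
every read isn't terribly efficient, but the memory held is bounded by 
how far the writer has got ahead of the reader, rather than by 
everything that has ever passed through.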

    However I do agree with the earlier poster - don't optimise for 
speed unless you need to (I'm assuming this is an academic exercise and 
I'm waiting to go to the pub)!!!  Simplicity of design is usually a 
better win.

Cheers,

Neil

Dennis Lee Bieber wrote:

>On Wed, 21 Apr 2004 16:51:56 +0200, Nickolay Kolev <nmkolev at uni-bonn.de>
>declaimed the following in comp.lang.python:
>
>  
>
>>It is really simple - it reads the file in memory, splits it on 
>>whitespace, strips punctuation characters and transforms all remaining 
>>elements to lowercase. It then looks through what has been left and 
>>creates a list of tuples (count, word) which contain each unique word 
>>and the number of time it appears in the text.
>>
>>    
>>
>	Without looking at the code, I'd probably drop the use of the
>tuples for a dictionary keyed by the word, with the data being the
>count. Granted, the output would not be easily sorted...
>
>	Okay, you do have a dictionary...
>
>	I suggest dropping the initialization pass -- for common words
>it's going to be redundant... Dictionaries have a method for supplying a
>default value if a key doesn't exist. See the following:
>
>
>##	for x in strippedWords:
>##		unique[x] = 0
>##
>##	for x in strippedWords:
>##		unique[x] += 1
>
>	for x in strippedWords:
>		unique[x] = unique.get(x, 0) + 1
>
>
>--  
> > ============================================================== <
> >   wlfraed at ix.netcom.com  | Wulfraed  Dennis Lee Bieber  KD6MOG <
> >      wulfraed at dm.net     |       Bestiaria Support Staff       <
> > ============================================================== <
> >           Home Page: <http://www.dm.net/~wulfraed/>            <
> >        Overflow Page: <http://wlfraed.home.netcom.com/>        <
>  
>

-- 

Neil Benn
Senior Automation Engineer
Cenix BioScience
PfotenhauerStrasse 108
D-01307
Dresden
Germany

Tel : +49 (351) 210 1300
e-mail : benn at cenix-bioscience.com
Cenix Website : http://www.cenix-bioscience.com




