[Edu-sig] counting lexemes...

Patrick K. O'Brien pobrien@orbtech.com
Mon, 1 Apr 2002 21:21:49 -0600


The tokenize module might do what you want.

"""Tokenization help for Python programs.

generate_tokens(readline) is a generator that breaks a stream of
text into Python tokens.  It accepts a readline-like method which is called
repeatedly to get the next line of input (or "" for EOF).  It generates
5-tuples with these members:

    the token type (see token.py)
    the token (a string)
    the starting (row, column) indices of the token (a 2-tuple of ints)
    the ending (row, column) indices of the token (a 2-tuple of ints)
    the original line (string)

It is designed to match the working of the Python tokenizer exactly, except
that it produces COMMENT tokens for comments and gives type OP for all
operators.

Older entry points
    tokenize_loop(readline, tokeneater)
    tokenize(readline, tokeneater=printtoken)
are the same, except instead of generating tokens, tokeneater is a callback
function to which the 5 fields described above are passed as 5 arguments,
each time a new token is found."""
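
As a rough sketch of how generate_tokens might be used to count lexemes in
Python source, something like the following could work. It assumes a modern
Python 3 (collections.Counter and io.StringIO are not part of the docstring
above), and the count_lexemes name and the set of token types it skips are
illustrative choices rather than anything the module prescribes. It only
handles Python; C++ and Java sources would need a different tokenizer.

```python
import io
import tokenize
from collections import Counter


def count_lexemes(source):
    """Count the lexemes (token strings) in a string of Python source.

    Hypothetical helper: the name and the skip set are assumptions made
    for this example.
    """
    # Token types that describe layout or comments rather than program text.
    skip = {
        tokenize.COMMENT,
        tokenize.NL,
        tokenize.NEWLINE,
        tokenize.INDENT,
        tokenize.DEDENT,
        tokenize.ENDMARKER,
    }
    counts = Counter()
    # generate_tokens wants a readline-like callable that returns strings.
    readline = io.StringIO(source).readline
    for tok_type, tok_string, start, end, line in tokenize.generate_tokens(readline):
        if tok_type not in skip:
            counts[tok_string] += 1
    return counts


sample = "x = 1  # set x\ny = x + 1\n"
counts = count_lexemes(sample)
for lexeme, n in counts.most_common():
    print(n, lexeme)
```

Docstrings come through as ordinary STRING tokens, so they are counted here
unless that type is also filtered out. To run it on a file, read the file's
text and pass it to count_lexemes.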

---
Patrick K. O'Brien
Orbtech

> -----Original Message-----
> From: edu-sig-admin@python.org [mailto:edu-sig-admin@python.org] On
> Behalf Of Jeffrey Elkner
> Sent: Monday, April 01, 2002 8:29 PM
> To: edu-sig@python.org
> Subject: [Edu-sig] counting lexemes...
> 
> 
> hi all!
> 
> i got such a great response to my last query that i'm trying another one
> ;-)  is there anything out there already that i can use to parse python,
> c++, and java source files to get a listing and count of the lexemes
> that occur in each?
> 
> i spent the better part of an afternoon writing python scripts to remove
> comments and docstrings so that i could compare line numbers, and i'm
> afraid parsing to get at the lexemes is beyond my ability within the
> time i have left to prepare my thesis.
> 
> any suggestions?
> 
> thanks again!
> 
> jeff elkner
> yorktown high school
> arlington, va