Looking for very simple general purpose tokenizer

Paul McGuire ptmcg at users.sourceforge.net
Mon Jan 19 08:59:23 EST 2004


"Maarten van Reeuwijk" <maarten at remove_this_ws.tn.tudelft.nl> wrote in
message news:bug9ij$30k$1 at news.tudelft.nl...
> Hi group,
>
> I need to parse various text files in Python. I was wondering if there is a
> general purpose tokenizer available. I know about split(), but this
> (otherwise very handy) method does not allow me to specify a list of
> splitting characters, only one at a time, and it removes my splitting
> operators (OK for spaces and \n's, but not for =, / etc.). Furthermore, I
> tried tokenize, but this is specifically for Python and is way too heavy
> for me. I am looking for something like this:
>
>
> splitchars = [' ', '\n', '=', '/', ....]
> tokenlist = tokenize(rawfile, splitchars)
>
> Is there something like this available inside Python or did anyone already
> make this? Thank you in advance
>
> Maarten
> -- 
> ===================================================================
> Maarten van Reeuwijk                        Heat and Fluid Sciences
> Phd student                             dept. of Multiscale Physics
> www.ws.tn.tudelft.nl                 Delft University of Technology
Maarten -
Please give my pyparsing module a try. You can download it from SourceForge
at http://pyparsing.sourceforge.net. I wrote it for just this purpose: it
lets you define your own parsing patterns for any text data file, and
the tokenized results are returned in a dictionary or list, as you prefer.
The download also includes several examples; one especially difficult file
parsing solution is shown in the dictExample.py script. If you get
stuck, send me a sample of what you are trying to parse, and I can try to
give you some pointers (or even tell you if pyparsing isn't the
most appropriate tool for your job, which happens sometimes!).
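
For the very simple case sketched in the question, the standard re module
may already be enough. What follows is only a minimal illustration, not
pyparsing and not an existing library function: the tokenize name and the
splitchars argument are borrowed from the pseudocode in the question, and
the discard parameter is an assumption added here so that whitespace
delimiters can be dropped while operators such as = and / are kept.

```python
import re

def tokenize(text, splitchars, discard=(" ", "\n")):
    """Split text on any single character in splitchars.

    Delimiters are kept as tokens of their own unless they are listed
    in discard (hypothetical parameter, not part of any library API).
    """
    # Build a character class from the delimiters; re.escape protects
    # characters such as ']' or '^' that are special inside a class.
    char_class = "".join(re.escape(c) for c in splitchars)
    # The capturing group makes re.split return the delimiters as well.
    pieces = re.split("([" + char_class + "])", text)
    # Drop the empty strings produced by adjacent delimiters, and any
    # delimiter the caller asked to discard.
    return [p for p in pieces if p and p not in discard]

tokens = tokenize("x = 10/ 2\ny=3", [" ", "\n", "=", "/"])
print(tokens)  # ['x', '=', '10', '/', '2', 'y', '=', '3']
```

This only handles single-character delimiters and has no notion of quoted
strings or nesting; once the input needs that kind of structure, a grammar
based tool is likely the better fit.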

-- Paul McGuire

Austin, Texas, USA




