Looking for very simple general purpose tokenizer

Eric Brunel eric.brunel at N0SP4M.com
Mon Jan 19 05:38:52 EST 2004


Maarten van Reeuwijk wrote:
> Hi group,
> 
> I need to parse various text files in Python. I was wondering if there was a
> general purpose tokenizer available. I know about split(), but this
> (otherwise very handy) method does not allow me to specify a list of
> splitting characters, only one at a time, and it removes my splitting
> operators (OK for spaces and \n's, but not for =, /, etc.). Furthermore, I
> tried tokenize, but this is specific to Python and is way too heavy for me.
> I am looking for something like this:
> 
> 
> splitchars = [' ', '\n', '=', '/', ....]
> tokenlist = tokenize(rawfile, splitchars)
> 
> Is there something like this available inside Python or did anyone already
> make this? Thank you in advance

You may use re.findall for that:

 >>> import re
 >>> s = "a = b+c; z = 34;"
 >>> pat = " |=|;|[^ =;]+"
 >>> re.findall(pat, s)
['a', ' ', '=', ' ', 'b+c', ';', ' ', 'z', ' ', '=', ' ', '34', ';']

The pattern basically says: match either a space, an '=', a ';', or a run of
one or more characters that are none of these (the '+' avoids the empty match
you would otherwise get at the end of the string with '*'). You may have to
take care beforehand of characters like \n or \, which are special in regular
expressions; re.escape can do that escaping for you.
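
If you want exactly the tokenize(rawfile, splitchars) interface from your
example, you can build the pattern from the character list and let re.escape
handle the escaping. Here's a minimal sketch; the function name and the choice
to drop whitespace tokens are just one possible design, not a standard API:

import re

def tokenize(text, splitchars):
    # re.escape takes care of characters that are special in
    # regular expressions (\, -, ], ...)
    cls = "".join(re.escape(c) for c in splitchars)
    # Match either a single splitting character, or a run of
    # anything that is not a splitting character
    pat = "[%s]|[^%s]+" % (cls, cls)
    # Keep non-whitespace separators as tokens, drop the rest
    return [t for t in re.findall(pat, text) if not t.isspace()]

 >>> tokenize("a = b+c; z = 34;", [' ', '\n', '=', '/'])
['a', '=', 'b+c;', 'z', '=', '34;']

Note that ';' stays attached to the preceding token here because it is not in
splitchars; add it to the list if you want it split out as well.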

HTH
-- 
- Eric Brunel <eric dot brunel at pragmadev dot com> -
PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com