Looking for very simple general purpose tokenizer

Alan Kennedy alanmk at hotmail.com
Mon Jan 19 09:38:50 EST 2004


Maarten van Reeuwijk wrote:
> I need to parse various text files in python. I was wondering if
> there was a general purpose tokenizer available. 

Indeed there is: Python comes with batteries included. Try the shlex
module.

http://www.python.org/doc/lib/module-shlex.html
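For background (my summary, not quoted from that page): out of the box, a
shlex lexer splits on whitespace and hands back other punctuation as
one-character tokens, which is why the code below folds the extra split
characters into the lexer's whitespace attribute. For example:

```python
import shlex

# default behaviour: '=' is neither whitespace nor a word character,
# so it comes back as a single-character token between the two words
tokens = list(shlex.shlex("and=this"))
print(tokens)   # ['and', '=', 'this']
```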

Try the following code: it seems to do what you want. If it doesn't,
please be more specific about your tokenisation rules.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
splitchars = [' ', '\n', '=', '/',]

source = """
thisshouldcome inthree parts
thisshould comeintwo
andso/shouldthis
and=this
"""

import shlex
import StringIO

def prepareToker(toker, splitters):
  for s in splitters: # resists People's Front of Judea joke ;-D
    if toker.whitespace.find(s) == -1: # not already a separator
      toker.whitespace = "%s%s" % (s, toker.whitespace) # make it one
  return toker

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker = prepareToker(toker, splitchars)
for num, tok in enumerate(toker):
  print "%s:%s" % (num, tok)
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Note that the iteration-based interface used in the code above
requires Python 2.3. If you need it to run on an earlier version,
specify which one.
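In the meantime, here is a rough sketch of the same loop without the
iterator, using shlex's explicit get_token() method, which returns the
lexer's eof marker (the empty string in non-POSIX mode) once the input
is exhausted. Note that passing a string straight to shlex.shlex also
needs 2.3; on earlier versions wrap the string in StringIO.StringIO as
above:

```python
import shlex

# same idea as prepareToker above: make '=' and '/' act as separators
toker = shlex.shlex("and=this andso/shouldthis")
toker.whitespace = toker.whitespace + "=/"

tokens = []
while 1:
    tok = toker.get_token()
    if tok == toker.eof:   # eof marker: '' in non-POSIX mode
        break
    tokens.append(tok)

print(tokens)   # ['and', 'this', 'andso', 'shouldthis']
```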

regards,

-- 
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan:              http://xhaus.com/contact/alan


