Looking for very simple general purpose tokenizer
Alan Kennedy
alanmk at hotmail.com
Mon Jan 19 09:38:50 EST 2004
Maarten van Reeuwijk wrote:
> I need to parse various text files in python. I was wondering if
> there was a general purpose tokenizer available.
Indeed there is: Python comes with batteries included. Try the shlex
module.
http://www.python.org/doc/lib/module-shlex.html
Try the following code: it seems to do what you want. If it doesn't,
then please be more specific about your tokenisation rules.
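(If plain whitespace splitting with shell-style quoting is all you need, the one-shot shlex.split() helper, also new in Python 2.3, is even simpler; a quick sketch, not tailored to your custom split characters:)

```python
import shlex

# Splits on whitespace only, but honours shell quoting rules,
# so the quoted phrase stays together as one token.
print(shlex.split('say "hello world" twice'))
```

Note shlex.split() won't split on extra characters like '=' or '/'; for that you need the whitespace-tweaking approach below.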
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
splitchars = [' ', '\n', '=', '/',]

source = """
thisshouldcome inthree parts
thisshould comeintwo
andso/shouldthis
and=this
"""

import shlex
import StringIO

def prepareToker(toker, splitters):
    for s in splitters: # resists People's Front of Judea joke ;-D
        if toker.whitespace.find(s) == -1:
            toker.whitespace = "%s%s" % (s, toker.whitespace)
    return toker

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker = prepareToker(toker, splitchars)
for num, tok in enumerate(toker):
    print "%s:%s" % (num, tok)
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Note that the iteration-based interface used in the above code
requires Python 2.3. If you need it to run on an earlier version,
please say which one.
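In the meantime, here is a rough sketch of the same idea using the explicit get_token() interface, which predates the 2.3 iteration support (the tokenize function name is mine, and note that get_token() returns the empty string when the stream is exhausted in the default non-POSIX mode):

```python
import shlex

def tokenize(source, splitters):
    # shlex wraps a plain string in a StringIO for us (2.3+);
    # on earlier Pythons, wrap it yourself: StringIO.StringIO(source)
    toker = shlex.shlex(source)
    # Add each extra split character to the whitespace set, as before
    for s in splitters:
        if toker.whitespace.find(s) == -1:
            toker.whitespace = "%s%s" % (s, toker.whitespace)
    tokens = []
    tok = toker.get_token()       # explicit, pre-2.3 interface
    while tok != '':              # empty string signals end-of-stream
        tokens.append(tok)
        tok = toker.get_token()
    return tokens

print(tokenize("andso/shouldthis and=this", ['=', '/']))
```

Untested on anything older than 2.3 here, so treat it as a starting point.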
regards,
--
alan kennedy
------------------------------------------------------
check http headers here: http://xhaus.com/headers
email alan: http://xhaus.com/contact/alan