Looking for very simple general purpose tokenizer
JanC
usenet_spam at janc.invalid
Tue Jan 20 23:37:51 EST 2004
Maarten van Reeuwijk <maarten at remove_this_ws.tn.tudelft.nl> schreef:
> I found a complication with the shlex module. When I execute the
> following fragment you'll notice that doubles are split. Is there any way
> to avoid splitting numbers like this?
From the docs at <http://www.python.org/doc/current/lib/shlex-objects.html>
wordchars
The string of characters that will accumulate into multi-character
tokens. By default, includes all ASCII alphanumerics and underscore.
> source = """
> $NAMRUN
> Lz = 0.15
> nu = 1.08E-6
> """
>
> import shlex
> import StringIO
>
> buf = StringIO.StringIO(source)
> toker = shlex.shlex(buf)
> toker.commenters = ""
> toker.whitespace = " \t\r"
toker.wordchars = toker.wordchars + ".-$" # etc.
> print [tok for tok in toker]
Output:
['\n', '$NAMRUN', '\n', 'Lz', '=', '0.15', '\n', 'nu', '=', '1.08E-6', '\n']
Is this what you want?
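For anyone reading this today: the fragment above is Python 2 (the StringIO
module was merged into io in Python 3). Here is a minimal, self-contained
Python 3 sketch of the same idea; io.StringIO replaces StringIO.StringIO,
and the attribute for comment characters is spelled `commenters`:

```python
import io
import shlex

source = """
$NAMRUN
Lz = 0.15
nu = 1.08E-6
"""

toker = shlex.shlex(io.StringIO(source))
toker.commenters = ""       # don't treat '#' as starting a comment
toker.whitespace = " \t\r"  # leave '\n' out so newlines come back as tokens
# Extend wordchars so '.', '-' and '$' stay inside a token; without this,
# a float like 1.08E-6 is split into several pieces.
toker.wordchars += ".-$"

tokens = [tok for tok in toker]
print(tokens)
# ['\n', '$NAMRUN', '\n', 'Lz', '=', '0.15', '\n', 'nu', '=', '1.08E-6', '\n']
```

The key point is the last attribute: shlex only accumulates characters
listed in `wordchars` into multi-character tokens, so any character the
grammar should treat as part of a word has to be added there.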
--
JanC
"Be strict when sending and tolerant when receiving."
RFC 1958 - Architectural Principles of the Internet - section 3.9