Looking for very simple general purpose tokenizer

JanC usenet_spam at janc.invalid
Tue Jan 20 23:37:51 EST 2004


Maarten van Reeuwijk <maarten at remove_this_ws.tn.tudelft.nl> wrote:

> I found a complication with the shlex module. When I execute the 
> following fragment you'll notice that doubles are split. Is there any way 
> to avoid this?

From the docs at <http://www.python.org/doc/current/lib/shlex-objects.html>:

wordchars
    The string of characters that will accumulate into multi-character 
    tokens. By default, includes all ASCII alphanumerics and underscore.
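To see why the doubles get split, here is a small sketch of the default behaviour. It assumes Python 3 (so io.StringIO instead of the StringIO module) and uses a made-up one-line input; the names lexer and default_tokens are only for illustration.

```python
import io
import shlex

# Hypothetical one-line input, just to show the default splitting.
lexer = shlex.shlex(io.StringIO("nu = 1.08E-6"))

# With the default wordchars, neither '.' nor '-' can be part of a word.
print("." in lexer.wordchars, "-" in lexer.wordchars)

# Each of them ends the current token and comes back as a token of its own.
default_tokens = list(lexer)
print(default_tokens)
```

Every character that is neither whitespace nor in wordchars is returned as a single-character token, which is what cuts the number into pieces.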

> source = """
>  $NAMRUN
>      Lz      =  0.15
>      nu      =  1.08E-6
> """
> 
> import shlex
> import StringIO
> 
> buf = StringIO.StringIO(source)
> toker = shlex.shlex(buf)
> toker.commenters = ""
> toker.whitespace = " \t\r"

toker.wordchars = toker.wordchars + ".-$"   # etc.

> print [tok for tok in toker]


Output:

['\n', '$NAMRUN', '\n', 'Lz', '=', '0.15', '\n', 'nu', '=', '1.08E-6', '\n']

Is this what you want?
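For anyone on Python 3, a rough port of the whole fragment might look like the sketch below. It assumes io.StringIO in place of the StringIO module and uses commenters, which is the attribute name the shlex docs give for the comment characters; the variable tokens is mine.

```python
import io
import shlex

source = """
 $NAMRUN
     Lz      =  0.15
     nu      =  1.08E-6
"""

toker = shlex.shlex(io.StringIO(source))
toker.commenters = ""        # treat no character as a comment starter
toker.whitespace = " \t\r"   # newline is left out, so it comes back as a token
toker.wordchars += ".-$"     # keep 0.15, 1.08E-6 and $NAMRUN in one piece

tokens = list(toker)
print(tokens)
```

Note that with "-" in wordchars something like a-b also becomes a single token, and a number written as 1.08E+6 would still be split unless "+" is added as well.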

-- 
JanC

"Be strict when sending and tolerant when receiving."
RFC 1958 - Architectural Principles of the Internet - section 3.9


