generic tokenizer

Alex Martelli aleaxit at yahoo.com
Wed Sep 1 07:26:22 EDT 2004


Angus Mackay <yeah at right.com> wrote:

> I remember Python having a generic tokenizer in the library. all I want
> is to set a list of token separators and then read tokens out of a
> stream, the token separators should be returned as themselves.
> 
> is there anything like this?

Not as such in the standard library: the functions in module tokenize
do not let you 'set a list of token separators'.  If what you're
tokenizing can fit in a string in memory, module re can help:

>>> import re
>>> x = re.compile(r'(\s+|,|;)')
>>> for w in x.split('a,b, c;d; e'): print repr(w),'+',
... 
'a' + ',' + 'b' + ',' + '' + ' ' + 'c' + ';' + 'd' + ';' + '' + ' ' + 'e' +


Note that you get empty-string items when two separators abut.
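
If the empty strings are unwanted, they can be filtered out after the
split.  A minimal sketch, written in Python 3 syntax; the name
split_keeping_separators is made up for illustration and is not a
library function:

```python
import re

# Same pattern as in the session above: the capturing group makes
# re.split return the separators along with the tokens.
SEPARATORS = re.compile(r'(\s+|,|;)')

def split_keeping_separators(text):
    """Return tokens and separators in order, dropping empty strings."""
    return [piece for piece in SEPARATORS.split(text) if piece]

print(split_keeping_separators('a,b, c;d; e'))
# prints ['a', ',', 'b', ',', ' ', 'c', ';', 'd', ';', ' ', 'e']
```

The separators still come back as themselves; only the '' items
produced by abutting separators are discarded.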

If the limitations of re.split (stuff must fit in memory, etc.) are a
problem, then the lex-like solutions I see somebody else suggested may
be more appropriate for your needs.
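
Short of a full lexer, one way around the memory limit is to read the
stream in pieces.  A rough sketch in Python 3 syntax, under the
assumption that every separator is a single character (so a run of
whitespace comes back one character at a time, unlike \s+ above); the
names iter_tokens, separators and chunk_size are illustrative only:

```python
import io

def iter_tokens(stream, separators=' \t\n,;', chunk_size=4096):
    """Yield tokens and single-character separators from a text stream."""
    pending = []  # characters of the token currently being built
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream
            break
        for ch in chunk:
            if ch in separators:
                if pending:
                    # A separator ends the current token.
                    yield ''.join(pending)
                    pending = []
                yield ch  # the separator is returned as itself
            else:
                pending.append(ch)
    if pending:
        # Flush the last token if the stream did not end on a separator.
        yield ''.join(pending)

# A small chunk_size shows that tokens spanning chunk boundaries survive.
print(list(iter_tokens(io.StringIO('a,b, c;d; e'), chunk_size=4)))
# prints ['a', ',', 'b', ',', ' ', 'c', ';', 'd', ';', ' ', 'e']
```

Only one chunk plus the token in progress is held in memory at a time,
and no empty strings are produced when separators abut.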


Alex


