generic tokenizer
Alex Martelli
aleaxit at yahoo.com
Wed Sep 1 07:26:22 EDT 2004
Angus Mackay <yeah at right.com> wrote:
> I remember python having a generic tokenizer in the library. All I want
> is to set a list of token separators and then read tokens out of a
> stream; the token separators should be returned as themselves.
>
> is there anything like this?
Not as such in the standard library: the functions in module tokenize
do not let you 'set a list of token separators'. If what you're
tokenizing can fit in a string in memory, module re can help:
>>> import re
>>> x = re.compile(r'(\s+|,|;)')
>>> for w in x.split('a,b, c;d; e'): print repr(w),'+',
...
'a' + ',' + 'b' + ',' + '' + ' ' + 'c' + ';' + 'd' + ';' + '' + ' ' +
'e' +
Note that you get empty-string items when two separators abut.
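If those empty strings are unwanted, a tiny wrapper filters them out -- a sketch of mine, not part of the standard library (the name `tokens` is made up):

```python
import re

x = re.compile(r'(\s+|,|;)')

def tokens(s):
    # re.split keeps the separators because the pattern is parenthesized;
    # drop the empty strings that appear when two separators abut
    return [w for w in x.split(s) if w]
```

So tokens('a,b, c;d; e') gives back the words and the separators, with no empty-string items in between.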
If the limitations of re.split (stuff must fit in memory, &c) are a
problem, then the lex-like solutions I see somebody else suggested may
be more appropriate for your needs.
Alex