Tokenize a string or split on steroids

Bengt Richter bokr at oz.net
Sat Mar 9 18:42:58 EST 2002


On Sat, 09 Mar 2002 18:10:14 +0100, Fernando Rodríguez <frr at wanadoo.es> wrote:

>On Sat, 09 Mar 2002 11:30:40 GMT, Bob Follek <b.follek at verizon.net> wrote:
>
>
>>If you're unfamiliar with regular expressions, here's a good starting
>>point: http://py-howto.sourceforge.net/regex/regex.html
>
>Thanks. :-) BTW, the strings that must be tokenized contain other
>non-alphanumeric characters (parentheses, for example), so I tried another
>regex: [{}].
>
>The result, although usable, is sort of weird:
>
>>>> s = "{one}{two}"
>>>> x1 = re.compile('[{}]')
>>>> x1.split(s)
>['', 'one', '', 'two', '']
>
>Where are those empty strings coming from??? :-?
>I can filter() them out, but I wonder where they come from.... O:-)

Think of the commas in the list as places where your pattern matched.
There was nothing in front of the leading match, so that's represented
by ''. Note that there are two commas between 'one' and 'two' -- your
pattern matched twice, because it was a single character. If you change
the pattern to '[{}]+' you will get a single comma between 'one' and 'two',
but the end matches will be unchanged. Of course with that change,
"{one}{}{}{two}" will give the same result as "{one}{two}", so caveat
<Latin-for-changer>.
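
For example, here's a quick interactive sketch (continuing your x1 example
under a new name, x2 -- I believe this is what you'd see):

 >>> import re
 >>> x2 = re.compile('[{}]+')   # one or more braces per match
 >>> x2.split("{one}{two}")
 ['', 'one', 'two', '']
 >>> x2.split("{one}{}{}{two}")   # the empty groups vanish, per the caveat
 ['', 'one', 'two', '']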

You can think of join as replacing the commas with the join string, e.g.,
note what happens when you join the result above with '!':

 >>> '!'.join(['', 'one', '', 'two', ''])
 '!one!!two!'

This comma talk is just a mnemonic of course, don't take it literally ;-)
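
And if you just want the empty strings gone, as you mentioned, filter does
it in one line (a sketch, reusing the result list from above):

 >>> filter(None, ['', 'one', '', 'two', ''])
 ['one', 'two']

With None as the function, filter keeps only the items that are true, and
empty strings are false.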

Regards,
Bengt Richter



