String splitting with exceptions

Wed Aug 28 14:31:17 EDT 2013

Neil Cerutti wrote:

> On 2013-08-28, John Levine <johnl at iecc.com> wrote:
>> I have a crufty old DNS provisioning system that I'm rewriting and I
>> hope improving in python.  (It's based on tinydns if you know what
>> that is.)
>>
>> The record formats are, in the worst case, like this:
>>
>> foo.[DOM]::[IP6::4361:6368:6574]:600::
>>
>> What I would like to do is to split this string into a list like this:
>>
>> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
>>
>> Colons are separators except when they're inside square
>> brackets.  I have been messing around with re.split() and
>> re.findall() and haven't been able to come up with either a
>> working separator pattern for split() or a working field
>> pattern for findall().  I came pretty close with findall() but
>> can't get it to reliably match the nothing between two adjacent
>> colons not inside brackets.
>>
>> Any suggestions? I realize I could do it in a loop where I pick
>> stuff off the front of the string, but yuck.
> 
> A little parser, as Skip suggested, is a good way to go.
> 
> The brackets make your string context-sensitive, a difficult
> concept to cleanly parse with a regex.
> 
> I initially hoped a csv module dialect could work, but the quote
> character is (currently) hard-coded to be a single, simple
> character, i.e., I can't tell it to treat [xxx] as "xxx".
> 
> What about Skip's suggestion? A little parser. It might seem
> crass or something, but it really is easier than muscling a
> regex into a context-sensitive grammar.
> 
> def dns_split(s):
>     in_brackets = False
>     b = 0 # index of beginning of current string
>     for i, c in enumerate(s):
>         if not in_brackets:
>             if c == "[":
>                 in_brackets = True
>             elif c == ':':
>                 yield s[b:i]
>                 b = i+1
>         elif c == "]":
>             in_brackets = False

I think you need one more yield outside the loop.
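
As written, the text after the last separator is never yielded. Here is a minimal sketch of the fix; the name dns_split_fixed is made up here only to keep it apart from the quoted original, and like the original it assumes the brackets are not nested:

```python
def dns_split_fixed(s):
    """Split s on colons that are not inside square brackets (no nesting)."""
    in_brackets = False
    b = 0  # index where the current field begins
    for i, c in enumerate(s):
        if not in_brackets:
            if c == "[":
                in_brackets = True
            elif c == ":":
                yield s[b:i]
                b = i + 1
        elif c == "]":
            in_brackets = False
    # The added line: emit whatever follows the last separator,
    # which may be the empty string.
    yield s[b:]
```

With that extra yield, an input with no colon at all comes back as a single field instead of an empty list.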

> >>> print(list(dns_split(s)))
> ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']
> 
> It'll gag on nested brackets (fixable with a counter) and has no
> error handling (requires thought), but it's a start.
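
The counter fix might look like the following sketch. The name dns_split_nested and the choice of ValueError for unbalanced brackets are my own assumptions, not part of Neil's code:

```python
def dns_split_nested(s):
    """Split s on colons at bracket depth 0; brackets may nest."""
    depth = 0   # number of currently open "[" brackets
    start = 0   # index where the current field begins
    for i, c in enumerate(s):
        if c == "[":
            depth += 1
        elif c == "]":
            if depth == 0:
                raise ValueError("unmatched ']' at index %d" % i)
            depth -= 1
        elif c == ":" and depth == 0:
            yield s[start:i]
            start = i + 1
    if depth:
        raise ValueError("unclosed '[' in %r" % s)
    yield s[start:]
```

Because this is a generator, the errors only surface once the result is consumed, for example by list().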

Something similar on top of regex:

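>>> import re
>>> def split(s):
...     start = level = 0
...     # match "[", ":" or "]"; the "[" is escaped to avoid the
...     # "possible nested set" FutureWarning in newer Pythons
...     for m in re.compile(r"[\[:\]]").finditer(s):
...         if m.group() == "[":
...             level += 1
...         elif m.group() == "]":
...             assert level  # a "]" without a matching "["
...             level -= 1
...         elif level == 0:
...             yield s[start:m.start()]
...             start = m.end()
...     yield s[start:]
... 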
>>> list(split("a[b:c:]:d"))
['a[b:c:]', 'd']
>>> list(split("a[b:c[:]]:d"))
['a[b:c[:]]', 'd']
>>> list(split(""))
['']
>>> list(split(":"))
['', '']
>>> list(split(":x"))
['', 'x']
>>> list(split("[:x]"))
['[:x]']
>>> list(split(":[:x]"))
['', '[:x]']
>>> list(split(":[:[:]:x]"))
['', '[:[:]:x]']
>>> list(split("[:::]"))
['[:::]']
>>> s = "foo.[DOM]::[IP6::4361:6368:6574]:600::"
>>> list(split(s))
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']

Note that the result ends with one more empty string than the OP's expected list. The input ends in two colons, so I believe the OP forgot it.
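
As for the separator pattern the OP was after: if the brackets are balanced and never nested, a negative lookahead seems to be enough for re.split(). This is a sketch under that assumption only; TOP_LEVEL_COLON and split_flat are names invented for the example, and nested input will be split in the wrong places.

```python
import re

# A colon counts as a separator unless a "]" follows it with no "["
# in between, which (for balanced, non-nested brackets) means the
# colon sits inside a bracket pair.
TOP_LEVEL_COLON = re.compile(r":(?![^\[]*\])")

def split_flat(s):
    """Split s on colons outside [...]; assumes no nested brackets."""
    return TOP_LEVEL_COLON.split(s)
```

On the OP's sample string this gives the same six fields as split() above.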