String splitting with exceptions
Peter Otten
__peter__ at web.de
Wed Aug 28 14:31:17 EDT 2013
Neil Cerutti wrote:
> On 2013-08-28, John Levine <johnl at iecc.com> wrote:
>> I have a crufty old DNS provisioning system that I'm rewriting and I
>> hope improving in python. (It's based on tinydns if you know what
>> that is.)
>>
>> The record formats are, in the worst case, like this:
>>
>> foo.[DOM]::[IP6::4361:6368:6574]:600::
>>
>> What I would like to do is to split this string into a list like this:
>>
>> [ 'foo.[DOM]','','[IP6::4361:6368:6574]','600','' ]
>>
>> Colons are separators except when they're inside square
>> brackets. I have been messing around with re.split() and
>> re.findall() and haven't been able to come up with either a
>> working separator pattern for split() or a working field
>> pattern for findall(). I came pretty close with findall() but
>> can't get it to reliably match the nothing between two adjacent
>> colons not inside brackets.
>>
>> Any suggestions? I realize I could do it in a loop where I pick
>> stuff off the front of the string, but yuck.
>
> A little parser, as Skip suggested, is a good way to go.
>
> The brackets make your string context-sensitive, a difficult
> concept to cleanly parse with a regex.
>
> I initially hoped a csv module dialect could work, but the quote
> character is (currently) hard-coded to be a single, simple
> character, i.e., I can't tell it to treat [xxx] as "xxx".
>
> What about Skip's suggestion? A little parser. It might seem
> crass or something, but it really is easier than muscling a
> regex into a context-sensitive grammar.
>
> def dns_split(s):
>     in_brackets = False
>     b = 0 # index of beginning of current string
>     for i, c in enumerate(s):
>         if not in_brackets:
>             if c == "[":
>                 in_brackets = True
>             elif c == ':':
>                 yield s[b:i]
>                 b = i+1
>         elif c == "]":
>             in_brackets = False
I think you need one more yield outside the loop (yield s[b:]) to emit
the field after the last separator.
> >>> print(list(dns_split(s)))
> ['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '']
>
> It'll gag on nested brackets (fixable with a counter) and has no
> error handling (requires thought), but it's a start.
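The counter fix mentioned above, combined with the extra yield, might
look roughly like the sketch below. The name dns_split_nested and the
choice of ValueError for unbalanced brackets are my own assumptions, not
part of Neil's code.

```python
def dns_split_nested(s):
    """Split s on colons that are not inside square brackets."""
    depth = 0   # current bracket nesting level
    begin = 0   # index where the current field starts
    for i, c in enumerate(s):
        if c == "[":
            depth += 1
        elif c == "]":
            if depth == 0:
                # assumption: a stray "]" is treated as an input error
                raise ValueError("unmatched ']' at index %d" % i)
            depth -= 1
        elif c == ":" and depth == 0:
            yield s[begin:i]
            begin = i + 1
    if depth:
        # assumption: an unclosed "[" is treated as an input error
        raise ValueError("unclosed '[' in %r" % s)
    yield s[begin:]  # the final field, possibly empty


print(list(dns_split_nested("foo.[DOM]::[IP6::4361:6368:6574]:600::")))
```

Because the final yield always runs, a string ending in a colon produces
a trailing empty field, the same way str.split(":") would.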
Something similar on top of regex:
>>> import re
>>> def split(s):
...     start = level = 0
...     for m in re.compile(r"[[:\]]").finditer(s):
...         if m.group() == "[": level += 1
...         elif m.group() == "]":
...             assert level
...             level -= 1
...         elif level == 0:
...             yield s[start:m.start()]
...             start = m.end()
...     yield s[start:]
...
>>> list(split("a[b:c:]:d"))
['a[b:c:]', 'd']
>>> list(split("a[b:c[:]]:d"))
['a[b:c[:]]', 'd']
>>> list(split(""))
['']
>>> list(split(":"))
['', '']
>>> list(split(":x"))
['', 'x']
>>> list(split("[:x]"))
['[:x]']
>>> list(split(":[:x]"))
['', '[:x]']
>>> list(split(":[:[:]:x]"))
['', '[:[:]:x]']
>>> list(split("[:::]"))
['[:::]']
>>> s = "foo.[DOM]::[IP6::4361:6368:6574]:600::"
>>> list(split(s))
['foo.[DOM]', '', '[IP6::4361:6368:6574]', '600', '', '']
Note that there is one more empty string at the end, which I believe the
OP forgot: the input ends in two colons, so two empty fields follow '600'.
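If the brackets are never nested and always balanced, a single
separator pattern for re.split() is also possible. The sketch below is
only valid under that assumption; FIELD_SEP and split_flat are names
made up for illustration. The negative lookahead refuses a colon when
only non-bracket characters stand between it and a following "]", which
is the case exactly when the colon is inside a [...] group.

```python
import re

# Split on a colon unless only non-bracket characters separate it from a
# following "]", i.e. unless the colon sits inside a [...] group.
# Assumption: brackets are balanced and never nested.
FIELD_SEP = re.compile(r":(?![^\[\]]*\])")


def split_flat(s):
    """Return the colon-separated fields of s, keeping [...] groups whole."""
    return FIELD_SEP.split(s)


print(split_flat("foo.[DOM]::[IP6::4361:6368:6574]:600::"))
```

For the sample record this gives the same six fields as the generator
above. It splits in the wrong place for nested input such as
"a[b:c[:]]:d", so the level counter is still needed there.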