a splitting headache

Thu Oct 22 13:56:05 EDT 2009

On Oct 22, 8:17 am, David C. Ullrich <dullr... at sprynet.com> wrote:
> On Wed, 21 Oct 2009 14:43:48 -0700 (PDT), Mensanator
> <mensana... at aol.com> wrote:
> >On Oct 21, 2:46 pm, David C Ullrich <dullr... at sprynet.com> wrote:
> >> On Tue, 20 Oct 2009 15:22:55 -0700, Mensanator wrote:
> >> > On Oct 20, 1:51 pm, David C Ullrich <dullr... at sprynet.com> wrote:
> >> >> On Thu, 15 Oct 2009 18:18:09 -0700, Mensanator wrote:
> >> >> > All I wanted to do is split a binary number into two lists, a list of
> >> >> > blocks of consecutive ones and another list of blocks of consecutive
> >> >> > zeroes.
>
> >> >> > But no, you can't do that.
>
> >> >> >>>> c = '0010000110'
> >> >> >>>> c.split('0')
> >> >> > ['', '', '1', '', '', '', '11', '']
>
> >> >> > Ok, the consecutive delimiters appear as empty strings for reasons
> >> >> > unknown (except for the first one). Except when they start or end the
> >> >> > string in which case the first one is included.
>
> >> >> > Maybe there's a reason for this inconsistent behaviour but you won't
> >> >> > find it in the documentation.
>
> >> >> Wanna bet? I'm not sure whether you're claiming that the behavior is
> >> >> not specified in the docs or the reason for it. The behavior certainly
> >> >> is specified. I conjecture you think the behavior itself is not
> >> >> specified,
>
> >> > The problem is that the docs give a single example
>
> >> >>>> '1,,2'.split(',')
> >> > ['1','','2']
>
> >> > ignoring the special case of leading/trailing delimiters. Yes, if you
> >> > think it through, ',1,,2,'.split(',') should return ['','1','','2','']
> >> > for exactly the reasons you give.
>
> >> > Trouble is, we often find ourselves doing ' 1  2  '.split() which
> >> > returns ['1','2'].
>
> >> > I'm not saying either behaviour is wrong, it's just not obvious that the
> >> > one behaviour doesn't follow from the other and the documentation could
> >> > be a little clearer on this matter. It might make a bit more sense to
> >> > actually mention the split(sep) behavior that split() doesn't do.
>
> >> Have you _read_ the docs?
>
> >Yes.
>
> >> They're quite clear on the difference
> >> between no sep (or sep=None) and sep=something:
>
> >I disagree that they are "quite clear". The first paragraph makes no
> >mention of leading or trailing delimiters and they show no example
> >of such usage. An example would at least force me to think about it
> >if it isn't specifically mentioned in the paragraph.
>
> >One could infer from the second paragraph that, as it doesn't return
> >empty strings from leading and trailing whitespace, split(sep) does
> >for leading/trailing delimiters. Of course, why would I even be
> >reading this paragraph when I'm trying to understand split(sep)?
>
> A slightly less sarcastic answer than what I just posted:

And a slightly less sarcastic reply.

>
> I don't see why you _should_ need to read the second paragraph
> to infer that leading delimiters will return empty strings when
> you do split(sep). That's exactly what one would expect!

Yes, AFTER you read the docs. But prior to opening them, coupled
with a long history of using split(), there is no reason to expect
such behaviour at all.
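
To make the contrast concrete, here is a small sketch that runs the same
string through both forms. The variable names are only for illustration.

```python
# The same string, split with and without an explicit separator.
text = ' 1  2 '

# No separator: runs of whitespace count as one delimiter, and
# leading/trailing whitespace produces no empty strings.
no_sep = text.split()

# Explicit separator: every single space delimits, so the leading space,
# the doubled space and the trailing space each yield an empty string.
with_sep = text.split(' ')

print(no_sep)    # ['1', '2']
print(with_sep)  # ['', '1', '', '2', '']
```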

> As I pointed out the other day, if you're splitting ',,p' with
> sep = ',' that means you're looking for strings _separated by_
> commas. That means you're asking for [s1, s2, ...] where
> s1 is the part of the string preceding the first comma,
> s2 is the part of the string after the first comma but
> before the second comma, etc. And that means s1 = ''
> here.

It behaves much like the CSV module, which I'm very familiar
with from Excel. But when importing into Excel, you have the
option of treating consecutive delimiters as one, but unlike
split(), a single leading delimiter will NOT be thrown away.
I would wager that the body of Excel users is vastly greater
than the body of Python programmers. It doesn't hurt to
explicitly point out the obvious, because what's obvious may
differ from people's experience.
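
As a rough illustration of that "treat consecutive delimiters as one"
behaviour, a regular expression can be used as the separator. This is only
an approximation of the import option described above, not a claim about
how Excel itself works.

```python
import re

line = ',1,,2,'

# str.split with an explicit separator: nothing is merged or dropped.
strict = line.split(',')

# Splitting on runs of commas merges consecutive delimiters, but the
# leading and trailing delimiters still produce empty strings.
merged = re.split(',+', line)

print(strict)  # ['', '1', '', '2', '']
print(merged)  # ['', '1', '2', '']
```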

>
> That's what "split on commas" _means_. It's also exactly
> what you want in typical applications, for example
> parsing comma-separated data. The fact that s.split()
> does _not_ include an empty string at the start if s
> begins with whitespace is the counterintuitive part;
> that's why it's specified in the second paragraph
> (whether you believe it or not, _that's_ what
> confused _me_ once. At which point I read the docs...)
> I suppose it makes sense given a typical use case of
> s.split(), where s is text and we want to find a list of
> the words in s.

Right, what I wanted was to extract the 'words' consisting
of blocks of contiguous 1-bits from a binary number and simply
discard the 0's. I was then going to do the same process only
delimiting on 1's to get blocks of 0's. What I was expecting
was split(sep) to work similarly to split() as it is somewhat
unusual for the algorithm to change. I still think the
documentation could do a better job explaining this.
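
For the task itself, here are two possible sketches: one filters the empty
strings out of the split(sep) result, the other uses itertools.groupby to
collect both lists in a single pass. The function name split_blocks is made
up for this example.

```python
from itertools import groupby

bits = '0010000110'

# Option 1: split on one digit and discard the empty strings.
ones_via_split = [block for block in bits.split('0') if block]
zeros_via_split = [block for block in bits.split('1') if block]

# Option 2: group runs of identical characters in a single pass.
def split_blocks(bits):
    """Return (blocks of ones, blocks of zeros) from a string of 0s and 1s."""
    ones, zeros = [], []
    for digit, run in groupby(bits):
        block = ''.join(run)
        (ones if digit == '1' else zeros).append(block)
    return ones, zeros

ones, zeros = split_blocks(bits)
print(ones)   # ['1', '11']
print(zeros)  # ['00', '0000', '0']
```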

>
> Really. I can't understand why you would _expect_
> s.split(sep) to do anything other than
>
> def split(s, sep):
>   res = []
>   acc = ''
>   for c in s:
>     if c in sep:
>       res.append(acc)
>       acc = ''
>     else:
>       acc = acc + c
>   res.append(acc)
>   return res

A very good example, it should be in the docs. Have a look at
the itertools module docs. There they do a wonderful job of
explaining with numerous cases of "itertools.x is equivalent
to ..."
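
In that "is equivalent to" spirit, here is one possible sketch for
split(sep). The quoted loop tests `c in sep`, so it agrees with str.split
only for a single-character separator. The version below uses str.find so
that multi-character separators work too. The name split_equiv is
hypothetical, and maxsplit is deliberately left out.

```python
def split_equiv(s, sep):
    """Rough equivalent of s.split(sep) for a non-empty sep, no maxsplit."""
    if not sep:
        raise ValueError('empty separator')
    pieces = []
    start = 0
    while True:
        hit = s.find(sep, start)
        if hit == -1:
            # No more separators: the remainder is the last piece.
            pieces.append(s[start:])
            return pieces
        # Everything between the previous separator and this one,
        # which is '' when two separators are adjacent.
        pieces.append(s[start:hit])
        start = hit + len(sep)

print(split_equiv('010000110', '0'))  # ['', '1', '', '', '', '11', '']
print(split_equiv('1<>2<>3', '<>'))   # ['1', '2', '3']
```

Joining the result with the same separator gives back the original string,
which is the property the empty strings are there to preserve.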

>
> Really. You're used to the idea that sum_{j=1}^0 c_j
> should be 0, right? That's for exactly the same reason -
> the obvious thing for sum_{j=a}^b c_j to return is
> given by
>
> def sum(c, lower, upper):
>   res = 0
>   j = lower
>   while j <= upper:
>     res = res + c[j]
>     j = j + 1
>   return res
>
> >The splitting of real strings is just as important, if not more so,
> >than the behaviour of splitting empty strings. Especially when the
> >behaviour is radically different.
>
> >>>> '010000110'.split('0')
> >['', '1', '', '', '', '11', '']
>
> >is a perfect example. It shows the empty strings generated from the
> >leading and trailing delimiters, and also that you get 3 empty
> >strings between the '1's, not 4. When creating documentation, it is
> >always a good idea to document such cases.
>
> >And you'll then want to compare this to the equivalent whitespace
> >case:
> >>>> ' 1    11 '.split()
> >['1', '11']
>
> >And it wouldn't hurt to point this out:
> >>>> c = '010000110'.split('0')
> >>>> '0'.join(c)
> >'010000110'
>
> >and note that it won't work with the whitespace version.
>
> >No, I have not submitted a request to change the documentation, I was
> >looking for some feedback here. And it seems that no one else
> >considers the documentation wanting.
>
> >> "If sep is given, consecutive delimiters are not grouped together and are
> >> deemed to delimit empty strings (for example, '1,,2'.split(',') returns
> >> ['1', '', '2']). The sep argument may consist of multiple characters (for
> >> example, '1<>2<>3'.split('<>') returns ['1', '2', '3']). Splitting an
> >> empty string with a specified separator returns [''].
>
> >> If sep is not specified or is None, a different splitting algorithm is
> >> applied: runs of consecutive whitespace are regarded as a single
> >> separator, and the result will contain no empty strings at the start or
> >> end if the string has leading or trailing whitespace. Consequently,
> >> splitting an empty string or a string consisting of just whitespace with
> >> a None separator returns []."
>
> >> >> because your description of what's happening,
>
> >> >> "consecutive delimiters appear as empty strings for reasons
>
> >> >> > unknown (except for the first one). Except when they start or end the
> >> >> > string in which case the first one is included"
>
> >> >> is at best an awkward way to look at it. The delimiters are not
> >> >> appearing as empty strings.
>
> >> >> You're asking to split  '0010000110' on '0'. So you're asking for
> >> >> strings a, b, c, etc such that
>
> >> >> (*) '0010000110' = a + '0' + b + '0' + c + '0' + etc
>
> >> >> The sequence of strings you're getting as output satisfies (*) exactly;
> >> >> the first '' is what appears before the first delimiter, the second ''
> >> >> is what's between the first and second delimiters, etc.
>
> David C. Ullrich
>
> "Understanding Godel isn't about following his formal proof.
> That would make a mockery of everything Godel was up to."
> (John Jones, "My talk about Godel to the post-grads."
> in sci.logic.)