Fastest way to calculate leading whitespace

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Sat May 8 14:46:26 EDT 2010


On Sat, 08 May 2010 10:19:16 -0700, dasacc22 wrote:

> Hi
> 
> This is a simple question. I'm looking for the fastest way to calculate
> the leading whitespace (as a string, ie '    ').

Is calculating the amount of leading whitespace really the bottleneck in 
your application? If not, then trying to shave off microseconds from 
something which is a trivial part of your app is almost certainly a waste 
of your time.


[...]
> a = '    some content\n'
> b = a.strip()
> c = ' '*(len(a)-len(b))


I take it that you haven't actually tested this code for correctness, 
because it's buggy. Let's test it:

>>> leading_whitespace = " "*2 + "\t"*2
>>> a = leading_whitespace + "some non-whitespace text\n"
>>> b = a.strip()
>>> c = " "*(len(a)-len(b))
>>> assert c == leading_whitespace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError


Not only doesn't it get the whitespace right, but it doesn't even get the 
*amount* of whitespace right:

>>> assert len(c) == len(leading_whitespace)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

It doesn't even work correctly if you limit "whitespace" to mean spaces 
and nothing else, because strip() also removes the trailing newline and 
that gets counted as well. It's simply wrong in every possible way.
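
A minimal sketch of the spaces-only failure, using the string from the original post. The comments are mine; the three assignments are the quoted code unchanged.

```python
a = '    some content\n'        # four leading spaces, one trailing newline
b = a.strip()                   # strip() trims BOTH ends of the string
c = ' ' * (len(a) - len(b))     # so the trailing newline is counted too

print(repr(c), len(c))

# The string really starts with four spaces, not five.
assert a.startswith(' ' * 4) and not a.startswith(' ' * 5)
```

The computed string is one character longer than the actual run of leading spaces.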

This is why people say that premature optimization is the root of all 
(programming) evil. Instead of wasting time and energy trying to optimise 
code, you should make it correct first.

Your solutions 2 and 3 are also buggy. And solution 3 can easily be 
rewritten to be more straightforward. Instead of the complicated:

> def get_leading_whitespace(s):
>     def _get():
>         for x in s:
>             if x != ' ':
>                 break
>             yield x
>     return ''.join(_get())

try this version:

def get_leading_whitespace(s):
    accumulator = []
    for c in s:
        if c in ' \t\v\f\r\n':
            accumulator.append(c)
        else:
            break
    return ''.join(accumulator)

Once you're sure this is correct, then you can optimise it:

def get_leading_whitespace(s):
    t = s.lstrip()
    return s[:len(s)-len(t)]

>>> c = get_leading_whitespace(a)
>>> assert c == leading_whitespace
>>>
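
If you want more reassurance than a single assertion, a rough sketch like the following compares the two versions on a handful of inputs. The names leading_ws_loop and leading_ws_lstrip and the sample strings are made up here so that both functions can coexist. One assumption to be aware of: on Python 3, lstrip() with no argument also removes Unicode whitespace such as '\xa0', which the explicit character list does not, so the samples stick to ASCII whitespace.

```python
def leading_ws_loop(s):
    # Straightforward version: collect characters up to the first
    # non-whitespace one.
    accumulator = []
    for c in s:
        if c in ' \t\v\f\r\n':
            accumulator.append(c)
        else:
            break
    return ''.join(accumulator)

def leading_ws_lstrip(s):
    # Optimised version: whatever lstrip() removed is the leading whitespace.
    t = s.lstrip()
    return s[:len(s) - len(t)]

# Hypothetical sample inputs, ASCII whitespace only.
samples = [
    "",                          # empty string
    "no indent\n",               # no leading whitespace
    "    four spaces\n",
    "  \t\tspaces then tabs\n",
    " \t \n",                    # whitespace only
    "\n\n  blank lines first",
]

for s in samples:
    assert leading_ws_loop(s) == leading_ws_lstrip(s), repr(s)
```

If the loop finishes without an AssertionError, the two versions agree on every sample.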

Unless your strings are very large (lstrip() has to build a copy of the 
rest of the string), this is likely to be faster than any other 
pure-Python solution you can come up with.
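
Rather than take that on faith, measure it. Below is a rough timeit sketch, assuming a short sample line; the function names and the sample string are made up for illustration, and the timings will vary with the machine and the Python version.

```python
import timeit

def leading_ws_loop(s):
    # Character-by-character version.
    accumulator = []
    for c in s:
        if c in ' \t\v\f\r\n':
            accumulator.append(c)
        else:
            break
    return ''.join(accumulator)

def leading_ws_lstrip(s):
    # lstrip()-based version.
    t = s.lstrip()
    return s[:len(s) - len(t)]

# Hypothetical sample: eight leading spaces and a short line of text.
line = " " * 8 + "some non-whitespace text\n"

for func in (leading_ws_loop, leading_ws_lstrip):
    # Time 100000 calls of each version on the same input.
    seconds = timeit.timeit(lambda: func(line), number=100000)
    print("%-20s %.4f s per 100000 calls" % (func.__name__, seconds))
```

Try it with strings of the size your application actually handles before deciding anything.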


-- 
Steven


