Debugging the reason for Python running unreasonably slowly when adding numbers

Alexander Nestorov alexandernst at gmail.com
Tue Mar 14 03:48:24 EDT 2023


I'm working on an NLP project and got bitten by unreasonably slow behaviour in Python while operating on small batches of numbers.

I have the following code:

```python
import random, time
from functools import reduce

def trainPerceptron(perceptron, data):
    learningRate = 0.002
    weights = perceptron['weights']
    error = 0
    for chunk in data:
        input = chunk['input']
        output = chunk['output']

        # 12x slower than equivalent JS
        sum_ = 0
        for key in input:
            v = weights[key]
            sum_ += v

        # 20x slower than equivalent JS
        #sum_ = reduce(lambda acc, key: acc + weights[key], input, 0)

        actualOutput = sum_ if sum_ > 0 else 0

        expectedOutput = 1 if output == perceptron['id'] else 0
        currentError = expectedOutput - actualOutput
        if currentError:
            error += currentError ** 2
            change = currentError * learningRate
            for key in input:
                weights[key] += change

    return error

# Build mock data structure
data = [{
    'input': random.sample(range(0, 5146), 10),
    'output': 0
} for _ in range(11514)]
perceptrons = [{
    'id': i,
    'weights': [0.0] * 5147,
} for i in range(60)]  # simulate 60 perceptrons

# Simulate NLU
for i in range(151):  # 151 epochs (0-150)
    hrstart = time.perf_counter()
    for perceptron in perceptrons:
        trainPerceptron(perceptron, data)
    hrend = time.perf_counter()
    print(f'Epoch {i} - Time for training: {int((hrend - hrstart) * 1000)}ms')
```

Running it on my M1 MBP, I get the following numbers:

```
Epoch 0 - Time for training: 199ms
Epoch 1 - Time for training: 198ms
Epoch 2 - Time for training: 199ms
Epoch 3 - Time for training: 197ms
Epoch 4 - Time for training: 199ms
...
Epoch 146 - Time for training: 198ms
Epoch 147 - Time for training: 200ms
Epoch 148 - Time for training: 198ms
Epoch 149 - Time for training: 198ms
Epoch 150 - Time for training: 198ms
```

Each epoch takes around 200ms, which is unreasonably slow for the small amount of work being done: 11514 chunks x 10 lookups x 60 perceptrons is roughly 6.9 million additions per epoch. I profiled the code with `cProfile` to find out what is going on:
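
For reference, I invoked it roughly like this (`-s cumulative` just sorts the output the way shown below):

```
python3 -m cProfile -s cumulative poc-small.py
```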

```
         655306 function calls (655274 primitive calls) in 59.972 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      3/1    0.000    0.000   59.972   59.972 {built-in method builtins.exec}
        1    0.005    0.005   59.972   59.972 poc-small.py:1(<module>)
     9060   59.850    0.007   59.850    0.007 poc-small.py:4(trainPerceptron)
        1    0.006    0.006    0.112    0.112 poc-small.py:34(<listcomp>)
    11514    0.039    0.000    0.106    0.000 random.py:382(sample)
   115232    0.034    0.000    0.047    0.000 random.py:235(_randbelow_with_getrandbits)
    11548    0.002    0.000    0.012    0.000 {built-in method builtins.isinstance}
    11514    0.002    0.000    0.010    0.000 <frozen abc>:117(__instancecheck__)
   183616    0.010    0.000    0.010    0.000 {method 'getrandbits' of '_random.Random' objects}
    11514    0.002    0.000    0.008    0.000 {built-in method _abc._abc_instancecheck}
    11514    0.002    0.000    0.006    0.000 <frozen abc>:121(__subclasscheck__)
   115140    0.005    0.000    0.005    0.000 {method 'add' of 'set' objects}
    11514    0.003    0.000    0.004    0.000 {built-in method _abc._abc_subclasscheck}
   115232    0.004    0.000    0.004    0.000 {method 'bit_length' of 'int' objects}
      151    0.003    0.000    0.003    0.000 {built-in method builtins.print}
```

This wasn't too helpful, so I tried with [line_profiler][1]:
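
For this I decorated `trainPerceptron` with `@profile` (visible in the listing below) and ran the script under kernprof, roughly like this:

```
kernprof -l -v poc-small.py
```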

```
Timer unit: 1e-06 s

Total time: 55.2079 s
File: poc-small.py
Function: trainPerceptron at line 4

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     4                                           @profile
     5                                           def trainPerceptron(perceptron, data):
     6      1214        301.0      0.2      0.0    learningRate = 0.002
     7      1214        255.0      0.2      0.0    weights = perceptron['weights']
     8      1214        114.0      0.1      0.0    error = 0
     9  13973840    1742427.0      0.1      3.2    for chunk in data:
    10  13973840    1655043.0      0.1      3.0        input = chunk['input']
    11  13973840    1487543.0      0.1      2.7        output = chunk['output']
    12
    13                                                 # 12x slower than equivalent JS
    14  13973840    1210755.0      0.1      2.2        sum_ = 0
    15 139738397   13821056.0      0.1     25.0        for key in input:
    16 139738397   13794656.0      0.1     25.0            v = weights[key]
    17 139738396   14942692.0      0.1     27.1            sum_ += v
    18
    19                                                 # 20x slower than equivalent JS
    20                                                 #sum_ = reduce(lambda acc, key: acc + weights[key], input)
    21
    22  13973839    1618273.0      0.1      2.9        actualOutput = sum_ if sum_ > 0 else 0
    23
    24  13973839    1689194.0      0.1      3.1        expectedOutput = 1 if output == perceptron['id'] else 0
    25  13973839    1365346.0      0.1      2.5        currentError = expectedOutput - actualOutput
    26  13732045    1211916.0      0.1      2.2        if currentError:
    27    241794      38375.0      0.2      0.1            error += currentError ** 2
    28    241794      25377.0      0.1      0.0            change = currentError * learningRate
    29   2417940     271237.0      0.1      0.5            for key in input:
    30   2417940     332890.0      0.1      0.6                weights[key] += change
    31
    32      1213        405.0      0.3      0.0    return error
```

This shows that the three lines doing the summation consume roughly 77% of the total runtime:

```python
for key in input:
    v = weights[key]
    sum_ += v
```
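
For completeness, one variant I can think of pushes the loop into C-level builtins instead of interpreter bytecode (a sketch, not carefully benchmarked):

```python
# Let map() and sum() drive the iteration in C; this skips the
# per-iteration bytecode that loads and stores the temporary v.
sum_ = sum(map(weights.__getitem__, input))
```

Even then, every lookup still yields a boxed Python float, so I doubt this alone closes a 12x gap.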

I thought about throwing numpy at the problem to speed it up, but `input` is very small, which means the overhead of calling into numpy would hurt performance more than the vectorized math would gain.
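
For reference, the numpy variant I have in mind would look roughly like this (a sketch; the per-chunk `np.asarray` call on a 10-element list is exactly the overhead I expect to dominate):

```python
import numpy as np

def trainPerceptronNumpy(perceptron, data):
    learningRate = 0.002
    # Convert the weight list to an ndarray once per call so the hot
    # loop can use fancy indexing instead of a Python-level loop.
    weights = np.asarray(perceptron['weights'], dtype=np.float64)
    error = 0.0
    for chunk in data:
        idx = np.asarray(chunk['input'])   # ~10 indices: conversion overhead
        sum_ = weights[idx].sum()          # vectorized version of the hot loop
        expectedOutput = 1 if chunk['output'] == perceptron['id'] else 0
        currentError = expectedOutput - (sum_ if sum_ > 0 else 0)
        if currentError:
            error += currentError ** 2
            weights[idx] += currentError * learningRate  # sampled indices are unique
    perceptron['weights'] = weights        # keep the updated array around
    return error
```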

Anyhow, adding numbers shouldn't be that slow, which makes me believe something weird is going on in Python. To confirm my theory, I ported the code to JavaScript. This is the result:

```js
function trainPerceptron(perceptron, data) {
  const learningRate = 0.002;
  const weights = perceptron['weights'];
  let error = 0;
  for (const chunk of data) {
    const input = chunk['input'];
    const output = chunk['output'];

    const sum = input.reduce((acc, key) => acc + weights[key], 0);
    const actualOutput = sum > 0 ? sum : 0;

    const expectedOutput = output === perceptron['id'] ? 1 : 0;
    const currentError = expectedOutput - actualOutput;
    if (currentError) {
      error += currentError ** 2;
      const change = currentError * learningRate;
      for (const key of input) { // 'of', not 'in': iterate the sampled weight indices, not array positions
        weights[key] += change;
      }
    }
  }
  return error;
}

// Build mock data structure
const data = new Array(11514);
for (let i = 0; i < data.length; i++) {
  const inputSet = new Set();
  while (inputSet.size < 10) {
    inputSet.add(Math.floor(Math.random() * 5146));
  }
  const input = Array.from(inputSet);
  data[i] = { input: input, output: 0 };
}

const perceptrons = Array.from({ length: 60 }, (_, i) => ({
  id: i,
  weights: Array.from({ length: 5147 }, () => 0.0),
})); // simulate 60 perceptrons

// Simulate NLU
for (let i = 0; i < 151; i++) { // 151 epochs (0-150)
  const hrstart = performance.now();
  for (const perceptron of perceptrons) {
    trainPerceptron(perceptron, data);
  }
  const hrend = performance.now();
  console.log(`Epoch ${i} - Time for training: ${Math.floor(hrend - hrstart)}ms`);
}
```

When I run the JS code, I get the following numbers:

```
Epoch 0 - Time for training: 30ms
Epoch 1 - Time for training: 18ms
Epoch 2 - Time for training: 17ms
Epoch 3 - Time for training: 17ms
Epoch 4 - Time for training: 17ms
...
Epoch 147 - Time for training: 17ms
Epoch 148 - Time for training: 17ms
Epoch 149 - Time for training: 17ms
Epoch 150 - Time for training: 17ms
```

These numbers confirm my theory: the same workload runs about 12x faster in JavaScript. Any idea why, or what exactly is making Python perform so poorly?

Runtime details:

macOS Ventura 13.2.1 (22D68)
MacBook Pro, M1 Pro, 32GB
Python 3.11.0 (native Apple Silicon)

Regards!

  [1]: https://github.com/pyutils/line_profiler

