[Ironpython-users] Hashing a directory is magnitudes slower than in cPython

Markus Schaber m.schaber at codesys.com
Tue Feb 25 13:38:36 CET 2014


Hi,

A coworker just consulted me on a performance problem of IronPython vs. cPython.

The attached test script reproduces the problem. 

On our test directory, once the OS disk cache is warm, cPython 2.7.6 needs about 1.5 seconds and cPython 3.3 about 1.7 seconds, while IronPython 2.7 needs more than 10 minutes(!).

C:\Users\m.schaber>c:\Python33\python.exe d:\crc-fixed.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 1.721932103156255

C:\Users\m.schaber>c:\Python33\python.exe d:\crc-fixed.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 1.7523154039322837

C:\Users\m.schaber>python d:\crc-fixed.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 1.44541429616

C:\Users\m.schaber>python d:\crc-fixed.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 1.40604227074

C:\Users\m.schaber>"c:\Program Files (x86)\IronPython 2.7\ipy.exe" d:\crc.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 602.745100044

C:\Users\m.schaber>"c:\Program Files (x86)\IronPython 2.7\ipy.exe" d:\crc.py "c:\Test Specifications\AutotestRepository"
Examining: c:\Test Specifications\AutotestRepository
Checksum: f7ff573eb219b0ce79bd204e3625b5e2
Seconds: 607.252915722
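
The script itself is only attached, not quoted in this message. For readers without the attachment, here is a minimal sketch of a script that would produce this kind of output. The function name hash_directory, the chunk size and the sorted traversal order are assumptions made for illustration; the real crc.py may well differ (for example, it may pass each file to update() in one piece).

```python
import hashlib
import os
import sys
import time


def hash_directory(root, chunk_size=64 * 1024):
    """Return one MD5 hex digest over the contents of all files below root."""
    md5 = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # fix the traversal order so the checksum is reproducible
        for name in sorted(filenames):
            with open(os.path.join(dirpath, name), "rb") as f:
                # Feed the file to the hash through repeated update() calls.
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    md5.update(chunk)
    return md5.hexdigest()


def main(argv):
    start = time.time()
    print("Examining: " + argv[1])
    print("Checksum: " + hash_directory(argv[1]))
    print("Seconds: " + str(time.time() - start))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

The sketch runs unchanged on Python 2.7 and 3.x, so the same file can be timed under all three interpreters.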


My first guess was that the problem lies in the difference between cPython's 8-bit strings and .NET strings, which causes expensive conversions. (I also guess that a Python 3 based IronPython will fix this issue.)

One idea to fix this may be to add an overload to MD5Type.update() which directly accepts strings (and maybe one accepting byte arrays), to avoid the call to the conversion functions.

On a closer look, there's the additional (and IMHO much worse) problem that the update() method seems not to work incrementally:

private void update(IList<byte> newBytes) {
    byte[] updatedBytes = new byte[_bytes.Length + newBytes.Count];
    Array.Copy(_bytes, updatedBytes, _bytes.Length);
    newBytes.CopyTo(updatedBytes, _bytes.Length);
    _bytes = updatedBytes;
    _hash = GetHasher().ComputeHash(_bytes);
}

In our use case, this means that every file that is read leads to a reallocation, a copy, and a recalculation of the MD5 sum over all the data read so far. The total work therefore grows roughly quadratically with the number of files, which is suboptimal for both memory and performance.
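
To make the cost concrete, here is a small Python model of the update() logic quoted above, compared against hashlib's own incremental update(). The class name RehashingMD5, the bytes_hashed counter and the compare() helper are illustrative assumptions, not IronPython code.

```python
import hashlib


class RehashingMD5(object):
    """Python model of the quoted update(): keep all bytes, re-hash on every call."""

    def __init__(self):
        self._bytes = b""
        self._hash = hashlib.md5(b"").digest()
        self.bytes_hashed = 0  # instrumentation only, counts the work done

    def update(self, new_bytes):
        self._bytes = self._bytes + new_bytes           # reallocate and copy
        self._hash = hashlib.md5(self._bytes).digest()  # re-hash everything so far
        self.bytes_hashed += len(self._bytes)

    def digest(self):
        return self._hash


def compare(chunks):
    """Return (bytes hashed by the re-hashing model, bytes hashed incrementally)."""
    model = RehashingMD5()
    incremental = hashlib.md5()
    for chunk in chunks:
        model.update(chunk)
        incremental.update(chunk)
    assert model.digest() == incremental.digest()  # same result, different cost
    return model.bytes_hashed, sum(len(c) for c in chunks)


print(compare([b"x" * 1000] * 100))  # prints (5050000, 100000)
```

For 100 chunks of 1000 bytes each, the model hashes 1000 * (1 + 2 + ... + 100) = 5,050,000 bytes, while the incremental hash touches each of the 100,000 bytes once. The gap widens as the number of update() calls grows.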

I'm not an expert on the .NET crypto APIs, but I guess there should be some incremental API available there which could be exploited.

If not, we could try to find a suitable pure .NET implementation like http://archive.msdn.microsoft.com/SilverlightMD5.

A less intrusive workaround may be to collect the bytes in a MemoryStream and feed them to ComputeHash() only on demand, when someone actually requests the hash result via digest() or hexdigest().
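
The shape of that workaround can be sketched in Python, with io.BytesIO standing in for the .NET MemoryStream. The class name LazyMD5 and the digest cache are assumptions for illustration only; this is not a patch against the IronPython sources.

```python
import binascii
import hashlib
import io


class LazyMD5(object):
    """Buffer input cheaply and compute the MD5 only when a digest is requested."""

    def __init__(self):
        self._buffer = io.BytesIO()  # stand-in for a .NET MemoryStream
        self._cached = None          # digest of the current buffer, if known

    def update(self, data):
        self._buffer.write(data)     # append only, no hashing here
        self._cached = None          # any cached digest is now stale

    def digest(self):
        if self._cached is None:
            # One pass over all collected bytes, done only on demand.
            self._cached = hashlib.md5(self._buffer.getvalue()).digest()
        return self._cached

    def hexdigest(self):
        return binascii.hexlify(self.digest()).decode("ascii")


h = LazyMD5()
h.update(b"hello ")
h.update(b"world")
print(h.hexdigest() == hashlib.md5(b"hello world").hexdigest())  # prints True
```

This removes the repeated hashing, but all input still has to be held in memory until a digest is requested, so a truly incremental API would remain preferable if one is available.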


PS: Our use of MD5 is purely for technical data integrity, not for protection against malicious users; cryptographic security is not required.

Best regards

Markus Schaber



