[Ironpython-users] Hashing a directory is orders of magnitude slower than in CPython

Jeff Hardy jdhardy at gmail.com
Thu Feb 27 13:27:48 CET 2014


On Thu, Feb 27, 2014 at 11:11 AM, Markus Schaber <m.schaber at codesys.com> wrote:
> Hi,
>
> I'm just trying to sum it up:
>
> 1) The current code:
>    - High memory usage.
>    - High load on the large object heap.
>    - Limited by the amount of available memory (which might be considered a violation of the Python API).
>    - High CPU usage when used incrementally (quadratic in the number of blocks added).
>
> 2) Optimizing with MemoryStream and lazy calculation:
>    - High memory usage.
>    - High load on the large object heap.
>    - Limited by the amount of available memory (which might be considered a violation of the Python API).
>    + Optimal CPU usage when the hash is only fetched once.
>    ± Better than the current code, but still not optimal when the hash is fetched repeatedly with more data added in between.
>
> 3) Optimizing with jagged arrays and lazy calculation:
>    - High memory usage.
>    + Improved or no impact on the large object heap (depending on the exact implementation).
>    - Limited by the amount of available memory (which might be considered a violation of the Python API).
>    + Optimal CPU usage when the hash is only fetched once.
>    ± Better than the current code, but still not optimal when the hash is fetched repeatedly with more data added in between.
>
> 4) Using the existing .NET incremental APIs:
>    + Low, constant memory usage.
>    + No impact on the large object heap.
>    + Data length is not limited by the amount of available memory.
>    + Optimal CPU usage when the hash is only fetched once.
>    - Breaks when the hash is fetched and more data is added afterwards (which is likely a violation of the Python API).
>
> 5) Finding or porting a different hash implementation in C#:
>    + Low, constant memory usage.
>    + No impact on the large object heap.
>    + Data length is not limited by the amount of available memory.
>    + Optimal CPU usage when the hash is only fetched once.
>    + Optimal CPU usage when the hash is incrementally fetched several times.
>
> I have a local prototype of 2) implemented, but I'm not sure whether that's the best way to go...
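>
> To make 2) concrete, roughly this (an illustrative sketch run under
> IronPython, with made-up names - not my actual prototype code):
>
>     from System import Array, Byte
>     from System.IO import MemoryStream
>     from System.Security.Cryptography import MD5
>
>     class LazyMD5(object):
>         """Buffers all input; hashes lazily on digest()."""
>         def __init__(self):
>             self._buffer = MemoryStream()  # retains every byte fed in
>             self._digest = None
>
>         def update(self, data):
>             buf = Array[Byte](bytearray(data))
>             self._buffer.Write(buf, 0, buf.Length)
>             self._digest = None  # new data invalidates the cached digest
>
>         def digest(self):
>             if self._digest is None:
>                 # Hash the whole buffer on demand: optimal if fetched
>                 # once, but re-hashes everything when fetched again
>                 # after further updates.
>                 raw = MD5.Create().ComputeHash(self._buffer.ToArray())
>                 self._digest = bytes(bytearray(raw))
>             return self._digest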

Good analysis!

My preference would be for (4), raising an exception if .update() is
called after .digest(), or if .copy() is called at all. As a fallback,
an extra parameter to hashlib.new (&c) could trigger (2) for the cases
where it's needed - I can't say for sure, but I would think calling
.update() after .digest() would be rare, and so would .copy() (damn
you, Google, for shutting down Code Search). At least then the common
case is fast and the edge cases are (usually) still possible.
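
To be concrete, (4) could look roughly like this - an illustrative
IronPython sketch with made-up names, not the actual hashlib code:

    from System import Array, Byte
    from System.Security.Cryptography import MD5

    class StreamingMD5(object):
        """Feeds data straight into .NET's incremental hashing API."""
        def __init__(self):
            self._algo = MD5.Create()
            self._finalized = False

        def update(self, data):
            if self._finalized:
                # .NET can't resume hashing once TransformFinalBlock
                # has run, so fail loudly instead of returning garbage.
                raise ValueError("update() after digest() is not supported")
            buf = Array[Byte](bytearray(data))
            # TransformBlock consumes the chunk immediately: constant
            # memory, no large object heap traffic.
            self._algo.TransformBlock(buf, 0, buf.Length, None, 0)

        def digest(self):
            if not self._finalized:
                self._algo.TransformFinalBlock(Array[Byte]([]), 0, 0)
                self._finalized = True
            return bytes(bytearray(self._algo.Hash))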

>
> Maybe we should google for purely managed implementations of the hash algorithms with a sensible license...
>

There seem to be managed implementations for MD5 and SHA1, but not for
SHA2 or RIPEMD160. Those could be ported from the public-domain
Crypto++ library, but that seems like a lot of work for an edge case.

- Jeff

