[Ironpython-users] Hashing a directory is magnitudes slower than in cPython

Tue Feb 25 17:53:33 CET 2014

On Tue, Feb 25, 2014 at 12:38 PM, Markus Schaber <m.schaber at codesys.com> wrote:
> Hi,
>
> A coworker just consulted me on a performance problem of IronPython vs. cPython.
>
> ... snip ...
>
> On a closer look, there's the additional (and IMHO much worse) problem that the update() method seems not to work incrementally:
>
> private void update(IList<byte> newBytes) {
>     byte[] updatedBytes = new byte[_bytes.Length + newBytes.Count];
>     Array.Copy(_bytes, updatedBytes, _bytes.Length);
>     newBytes.CopyTo(updatedBytes, _bytes.Length);
>     _bytes = updatedBytes;
>     _hash = GetHasher().ComputeHash(_bytes);
> }
>
> In our use case, this means that every file read triggers a reallocation, a copy, and a recalculation of the MD5 sum over all the data read so far. This is suboptimal from both a memory and a performance perspective.
>
> I'm not an expert on the .NET crypto APIs, but I guess there should be some incremental API available there which could be exploited.
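To put rough numbers on the cost described above: if every update() rehashes everything seen so far, n calls of b bytes each feed b*(1+2+...+n) = b*n*(n+1)/2 bytes to the hasher, versus b*n for an incremental hash. A minimal Python sketch of that arithmetic follows; the function names are made up for illustration and none of this is IronPython code.

```python
def bytes_hashed_rehashing(chunk_sizes):
    """Total bytes fed to the hasher if every update() rehashes all data so far."""
    total = 0
    seen = 0
    for size in chunk_sizes:
        seen += size      # the buffer grows by the new chunk
        total += seen     # the whole buffer is hashed again
    return total


def bytes_hashed_incremental(chunk_sizes):
    """Total bytes fed to the hasher if each chunk is hashed exactly once."""
    return sum(chunk_sizes)


# Example workload (an assumption): 1000 updates of 64 KiB each.
sizes = [64 * 1024] * 1000
print(bytes_hashed_rehashing(sizes))
print(bytes_hashed_incremental(sizes))
```

The same quadratic growth applies to the bytes copied by the repeated reallocation.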

http://ironpython.codeplex.com/workitem/34022

I've also CC'd Emmanuel Chomarat, who was investigating a fix for
this. Unfortunately I don't think there's an easy solution based on
how the .NET APIs are constructed. Quoting from Emmanuel's email to me
a while back:

"I am now using TransformBlock / TransformFinalBlock to compute the
current hash with linear complexity (whereas before it was n**2),
but I am still facing an issue.
First, we need a copy operation. This is not possible because we
cannot share the hash instance between two objects in .NET; the only
way to stay consistent with what Python does is to keep a copy of
the full data in MEMORY, so that a new instance can be created from
that data when copy is called.
The second thing is that digest can be called several times in Python,
with new data added to the hash in between; in C#, as soon as
TransformFinalBlock has been called once, we can no longer add
data to the stream. Once again I have been able to use the same
technique as before, but at a memory cost plus a computation cost if
we call digest/hexdigest several times.

I don't see any way to escape this problem with the MS API, which
does not expose internal state as the underlying md5 lib in Python
does."
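
For reference, the two Python behaviours described in the quote can be seen with the standard hashlib module in CPython. This is only a small illustration of the semantics a compatible implementation would need to match; the byte strings are arbitrary.

```python
import hashlib

h = hashlib.md5()
h.update(b"first file")

# digest() does not finalize the object: it can be called mid-stream...
partial = h.digest()

# ...and more data can still be added afterwards.
h.update(b"second file")
full = h.digest()
assert partial == hashlib.md5(b"first file").digest()
assert full == hashlib.md5(b"first filesecond file").digest()

# copy() forks the internal state; the two objects then diverge independently.
fork = h.copy()
fork.update(b"extra")
assert h.digest() == full
assert fork.digest() == hashlib.md5(b"first filesecond fileextra").digest()
```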

Basically, there's a mismatch between what .NET provides and what
Python needs for perfect compatibility. Keeping all data in memory is
not desirable, but neither is failing some operations. And I would
*really* prefer not to have to reimplement all of the cryptographic
hash functions Python has.

One option is to default to not buffering and failing on certain
operations, and offer a constructor flag that enables buffering to
allow the otherwise-impossible operations. Not my favourite idea, but
workable.
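
A rough model of that option, written in Python with hashlib standing in for a .NET hasher that cannot accept data after it has been finalized. The names OneShotHash, FlaggedHash and the buffered flag are invented for this sketch; it is not IronPython's implementation, only one way the trade-off could look.

```python
import hashlib


class OneShotHash:
    """Stand-in for a backend that cannot be updated once finalized."""

    def __init__(self, name):
        self._h = hashlib.new(name)
        self._final = None

    def update(self, data):
        if self._final is not None:
            raise RuntimeError("hash already finalized")
        self._h.update(data)

    def finalize(self):
        if self._final is None:
            self._final = self._h.digest()
        return self._final


class FlaggedHash:
    """Hypothetical wrapper: streaming by default, buffering on request."""

    def __init__(self, name, buffered=False):
        self._name = name
        # Buffered mode keeps a copy of all data instead of a live backend.
        self._data = bytearray() if buffered else None
        self._backend = None if buffered else OneShotHash(name)

    def update(self, data):
        if self._data is not None:
            self._data += data
        else:
            self._backend.update(data)  # raises once digest() was called

    def digest(self):
        if self._data is not None:
            # Rehash the whole buffer with a fresh backend on every call.
            fresh = OneShotHash(self._name)
            fresh.update(bytes(self._data))
            return fresh.finalize()
        return self._backend.finalize()

    def copy(self):
        if self._data is None:
            raise NotImplementedError("copy() requires buffered=True")
        clone = FlaggedHash(self._name, buffered=True)
        clone._data = bytearray(self._data)
        return clone


streaming = FlaggedHash("md5")
streaming.update(b"abc")
streaming.digest()
try:
    streaming.update(b"def")   # not possible without the buffer
except RuntimeError as exc:
    print("unbuffered:", exc)

buffered = FlaggedHash("md5", buffered=True)
buffered.update(b"abc")
buffered.digest()
buffered.update(b"def")        # fine: the buffer is rehashed on demand
assert buffered.digest() == hashlib.md5(b"abcdef").digest()
```

In this model the buffered mode pays the memory cost up front and a full rehash on every digest() call, while the default mode stays linear but rejects update-after-digest and copy().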

- Jeff