[Ironpython-users] Hashing a directory is magnitudes slower than in cPython

Emmanuel Chomarat emmanuel.chomarat at gmail.com
Tue Feb 25 18:08:43 CET 2014


The main issue, from what I remember, is that CPython lets us interleave
update and digest calls. This is impossible in .NET (at least with the
provided API). The workaround is to keep a local buffer of everything that
has already been hashed, so that it can be rehashed if someone asks for a
digest in the middle of the operation. Execution time was very good, with
no degradation; the problem was that the buffer grew linearly with the
amount of data. If we do not buffer the data, there is a way to make it
fast and light on memory, but it then no longer follows the Python API
(http://docs.python.org/2/library/hashlib.html).
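
To make the trade-off concrete, here is a minimal Python sketch of the buffering workaround. OneShotHasher is a hypothetical stand-in for a hash API that refuses data once it has been finalized; its method names only echo the .NET ones and it is not the real .NET class. BufferedHash is likewise an illustrative name, not the IronPython implementation.

```python
import hashlib


class OneShotHasher:
    """Hypothetical model of a hasher that cannot accept data after finalizing."""

    def __init__(self, name="md5"):
        self._inner = hashlib.new(name)
        self._finalized = False

    def transform_block(self, data):
        if self._finalized:
            raise ValueError("hasher already finalized")
        self._inner.update(data)

    def transform_final_block(self):
        self._finalized = True
        return self._inner.digest()


class BufferedHash:
    """Python-style hash object layered on OneShotHasher by replaying a buffer."""

    def __init__(self, name="md5", data=b""):
        self._name = name
        # Everything ever passed to update(); grows linearly with the input.
        self._buffer = bytearray(data)

    def update(self, data):
        self._buffer += data

    def digest(self):
        # Each call rehashes the whole buffer with a fresh one-shot hasher,
        # so update() can still be called afterwards.
        hasher = OneShotHasher(self._name)
        hasher.transform_block(bytes(self._buffer))
        return hasher.transform_final_block()

    def copy(self):
        # Copying is only possible because the raw data was kept.
        return BufferedHash(self._name, bytes(self._buffer))
```

With this shape, update() is cheap and digest() costs one pass over the buffered data, but memory use equals the total input size, which is the linear growth described above.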


On Tue, Feb 25, 2014 at 5:58 PM, Curt Hagenlocher <curt at hagenlocher.org> wrote:

> "Basically, there's a mismatch between what .NET provides and what Python
> needs for perfect compatibility."
>
> Yes. I think I remember implementing this and that's exactly the problem I
> ran into. I think we looked into incorporating a modified version of the
> BCL code directly into IronPython, but at least in those days, that was a
> pretty hard thing to get done. We ran into a similar issue when
> implementing the compression API.
>
> You could get around the problem in the client code with an "if
> sys.platform == 'cli'" and then use the .NET classes directly.
>
>
>
> On Tue, Feb 25, 2014 at 8:53 AM, Jeff Hardy <jdhardy at gmail.com> wrote:
>
>> On Tue, Feb 25, 2014 at 12:38 PM, Markus Schaber <m.schaber at codesys.com>
>> wrote:
>> > Hi,
>> >
>> > A coworker just consulted me on a performance problem of IronPython vs.
>> > CPython.
>> >
>> > ... snip ...
>> >
>> > On a closer look, there's the additional (and IMHO much worse) problem
>> > that the update() method seems not to work incrementally:
>> >
>> > private void update(IList<byte> newBytes) {
>> >     byte[] updatedBytes = new byte[_bytes.Length + newBytes.Count];
>> >     Array.Copy(_bytes, updatedBytes, _bytes.Length);
>> >     newBytes.CopyTo(updatedBytes, _bytes.Length);
>> >     _bytes = updatedBytes;
>> >     _hash = GetHasher().ComputeHash(_bytes);
>> > }
>> >
>> > In our use-case, this means that every file which is read leads to a
>> > reallocation, a copy, and a recalculation of the MD5 sum of all the data
>> > read so far. This is suboptimal in both memory use and performance.
>> >
>> > I'm not an expert on the .NET crypto APIs, but I guess there should be
>> > some incremental API available there which could be exploited.
>>
>> http://ironpython.codeplex.com/workitem/34022
>>
>> I've also CC'd Emmanuel Chomarat, who was investigating a fix for
>> this. Unfortunately I don't think there's an easy solution based on
>> how the .NET APIs are constructed. Quoting from Emmanuel's email to me
>> a while back:
>>
>> "I am now using TransformBlock / TransformFinalBlock to compute the
>> current hash with linear complexity (whereas before we had n**2),
>> but I am still facing an issue.
>> First, we need a copy operation. This is not possible because we
>> cannot share the hash instance between two objects in .NET; the only
>> way to make it consistent with what Python does is to keep a
>> copy of the full data in memory, in order to create a new instance
>> from that data when copy is called.
>> Second, digest can be called several times in Python, with new
>> data added to the hash in between; in C#, as soon as
>> TransformFinalBlock has been called once, we can no longer add
>> data to the stream. Once again I have been able to use the same
>> technique as before, but at a memory cost plus a computation cost
>> if we call digest/hexdigest several times.
>>
>> I don't see any way to escape this problem with the MS API, which
>> does not expose internal state the way the underlying md5 lib in
>> Python does."
>>
>> Basically, there's a mismatch between what .NET provides and what
>> Python needs for perfect compatibility. Keeping all data in memory is
>> not desirable, but neither is failing some operations. And I would
>> *really* prefer not to have to reimplement all of the cryptographic
>> hash functions Python has.
>>
>> One option is to default to not buffering and failing on certain
>> operations, and offer a constructor flag that enables buffering to
>> allow the otherwise-impossible operations. Not my favourite idea, but
>> workable.
>>
>> - Jeff
>>
>
>
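
Curt's suggestion of branching on sys.platform in client code could look roughly like the sketch below. The function name md5_hex_of_file and the chunk size are assumptions made for illustration. The 'cli' branch relies on the .NET calls File.OpenRead, MD5.Create, HashAlgorithm.ComputeHash(Stream) and BitConverter.ToString; it only runs under IronPython and should be treated as an untested outline rather than a verified fix.

```python
import hashlib
import sys


def md5_hex_of_file(path, chunk_size=64 * 1024):
    """Return the hex MD5 digest of the file at `path` (hypothetical helper)."""
    if sys.platform == "cli":
        # IronPython: bypass the hashlib emulation and let .NET hash the
        # stream in a single pass, without buffering the whole file.
        from System import BitConverter
        from System.IO import File
        from System.Security.Cryptography import MD5

        stream = File.OpenRead(path)
        try:
            raw = MD5.Create().ComputeHash(stream)
        finally:
            stream.Close()
        # BitConverter.ToString yields "AB-CD-..."; normalize to hexdigest form.
        return BitConverter.ToString(raw).replace("-", "").lower()

    # CPython: hashlib's update() is already incremental.
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

This computes one digest per file. A single running digest across a whole directory would still need an incremental API on the .NET side, so the per-file results would have to be combined in some other way.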