[Ironpython-users] Hashing a directory is magnitudes slower than in cPython

Pawel Jasinski pawel.jasinski at gmail.com
Thu Feb 27 21:10:07 CET 2014


I just checked: the part in question is covered by the MIT license.

On Thu, Feb 27, 2014 at 8:48 PM, Slide <slide.o.mix at gmail.com> wrote:
> I had asked this question before, but there was some hesitation based on the
> licensing of Mono. I'm not sure if that is an issue anymore.
>
>
> On Thu, Feb 27, 2014 at 12:16 PM, Pawel Jasinski <pawel.jasinski at gmail.com>
> wrote:
>>
>> Is there any reason not to use code from Mono?
>> It looks like it supports SHA2 and RIPEMD160.
>> https://bugzilla.xamarin.com/show_bug.cgi?id=11703
>>
>> On Thu, Feb 27, 2014 at 2:10 PM, Markus Schaber <m.schaber at codesys.com>
>> wrote:
>> > Hi,
>> >
>> > Von: Jeff Hardy [mailto:jdhardy at gmail.com]
>> >> On Thu, Feb 27, 2014 at 11:11 AM, Markus Schaber
>> >> <m.schaber at codesys.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I'm just trying to sum it up:
>> >> >
>> >> > 1) The current code:
>> >> >    - High memory usage.
>> >> >    - High load on the large object heap.
>> >> >    - Limited by the available amount of memory (which might be
>> >> >      considered a violation of the Python API).
>> >> >    - High CPU usage when used incrementally (quadratic in the
>> >> >      number of blocks added).
>> >> >
>> >> > 2) Optimizing with MemoryStream and lazy calculation:
>> >> >    - High memory usage.
>> >> >    - High load on the large object heap.
>> >> >    - Limited by the available amount of memory (which might be
>> >> >      considered a violation of the Python API).
>> >> >    + Optimal CPU usage when the hash is only fetched once.
>> >> >    ± Better than the current code, but still not optimal when the
>> >> >      hash is incrementally fetched several times.
>> >> >
>> >> > 3) Optimizing with jagged arrays and lazy calculation:
>> >> >    - High memory usage.
>> >> >    + Improved or no impact on the large object heap (depending on
>> >> >      the exact implementation).
>> >> >    - Limited by the available amount of memory (which might be
>> >> >      considered a violation of the Python API).
>> >> >    + Optimal CPU usage when the hash is only fetched once.
>> >> >    ± Better than the current code, but still not optimal when the
>> >> >      hash is incrementally fetched several times.
>> >> >
>> >> > 4) Using the existing .NET incremental APIs (see the sketch below):
>> >> >    + Low, constant memory usage.
>> >> >    + No impact on the large object heap.
>> >> >    + Data length not limited by the amount of memory.
>> >> >    + Optimal CPU usage when the hash is only fetched once.
>> >> >    - Breaks when the hash is incrementally fetched several times
>> >> >      (which is likely a violation of the Python API).
>> >> >
>> >> > 5) Finding or porting a different hash implementation in C#:
>> >> >    + Low, constant memory usage.
>> >> >    + No impact on the large object heap.
>> >> >    + Data length not limited by the amount of memory.
>> >> >    + Optimal CPU usage when the hash is only fetched once.
>> >> >    + Optimal CPU usage when the hash is incrementally fetched
>> >> >      several times.
>> >> >
>> >> > I've a local prototype implemented for 2), but I'm not sure
>> >> > whether that's the best way to go...
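For illustration only (not code from the thread or from IronPython's hashlib module): a rough IronPython sketch of what option (4) boils down to. The current code, as described in (1), keeps the accumulated data around and re-hashes it, which is presumably where the quadratic cost comes from; the .NET incremental API below keeps only the fixed-size hash state instead. SHA256Managed, TransformBlock and TransformFinalBlock are the real System.Security.Cryptography names; everything else in the snippet is made up for the example.

    # Sketch of option (4): incremental hashing through the built-in .NET API.
    from System.Security.Cryptography import SHA256Managed
    from System.Text import Encoding

    sha = SHA256Managed()

    # Feed data chunk by chunk. TransformBlock only updates the fixed-size
    # internal hash state, so memory usage stays constant however much data
    # is hashed.
    for chunk in ("hello ", "world"):
        data = Encoding.UTF8.GetBytes(chunk)
        sha.TransformBlock(data, 0, data.Length, None, 0)

    # Finalize exactly once. After TransformFinalBlock the object has to be
    # re-initialized before it can hash again, which is why a Python-level
    # .digest() followed by another .update() is the problem case under (4).
    sha.TransformFinalBlock(Encoding.UTF8.GetBytes(""), 0, 0)
    print("".join("%02x" % int(b) for b in sha.Hash))

Run under IronPython, this should print the same SHA-256 digest that hashlib.sha256(b"hello world").hexdigest() produces in CPython.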
>> >>
>> >> Good analysis!
>> >>
>> >> My preference would be for (4), raising an exception if .update() is
>> >> called after .digest(), or if .copy() is called at all. As a fallback,
>> >> an extra parameter to hashlib.new (&c) that triggers (2), for cases
>> >> where it's needed - I can't say for sure, but I would think calling
>> >> .update() after .digest() would be rare, and so would .copy() (damn
>> >> you Google for shutting down code search). At least then the common
>> >> case is fast and the edge cases are (usually) still possible.
>> >
>> > Do you think asking on some CPython lists could give usable feedback on
>> > how common it is to call copy() or to continue feeding data after
>> > calling digest()?
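For reference, the usage patterns being debated here, shown against CPython's hashlib rather than anything IronPython-specific, look roughly like this; option (4) as proposed above would have to reject both:

    import hashlib

    h = hashlib.sha256()
    h.update(b"first chunk")
    print(h.hexdigest())          # intermediate digest

    # Pattern 1: keep feeding data after a digest has been fetched.
    # CPython allows this; under option (4) it would raise an exception.
    h.update(b"second chunk")
    print(h.hexdigest())

    # Pattern 2: copy() a partially fed hash object to fork its state.
    # Also allowed by CPython, and also ruled out by a finalize-once API.
    h2 = h.copy()
    h2.update(b"only in the copy")
    print(h.hexdigest(), h2.hexdigest())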
>> >
>> >> > Maybe we should google for purely managed implementations of the
>> >> > hash codes with a sensible license...
>> >>
>> >> There seem to be some for MD5 and SHA1, but not for SHA2 or RIPEMD160.
>> >> They could be ported from the public-domain Crypto++ library, but that
>> >> seems like a lot of work for an edge case.
>> >
>> > Yes, that seems to be a lot of work.
>> >
>> > On the other hand, it's the 100% solution. :-)
>> >
>> > Best regards
>> >
>> > Markus Schaber
>> >
>> > CODESYS® a trademark of 3S-Smart Software Solutions GmbH
>> >
>> > Inspiring Automation Solutions
>> >
>> > 3S-Smart Software Solutions GmbH
>> > Dipl.-Inf. Markus Schaber | Product Development Core Technology
>> > Memminger Str. 151 | 87439 Kempten | Germany
>> > Tel. +49-831-54031-979 | Fax +49-831-54031-50
>> >
>> > E-Mail: m.schaber at codesys.com | Web: http://www.codesys.com | CODESYS
>> > store: http://store.codesys.com
>> > CODESYS forum: http://forum.codesys.com
>> >
>> > Managing Directors: Dipl.Inf. Dieter Hess, Dipl.Inf. Manfred Werner |
>> > Trade register: Kempten HRB 6186 | Tax ID No.: DE 167014915
>> >
>
> --
> Website: http://earl-of-code.com

