[Catalog-sig] hash tags

PJ Eby pje at telecommunity.com
Fri Mar 8 23:50:37 CET 2013


On Fri, Mar 8, 2013 at 4:32 PM, Donald Stufft <donald at stufft.io> wrote:
> Here's some more information pulled straight from Wikipedia:

Trust me, I've read a LOT of Wikipedia (and even more from other
sites, including at least the conclusions of a number of cryptography
papers) about hashing attacks recently, because I was seeing
inconsistencies in what people are saying about hashes and their
weaknesses and so forth.  99.9% of the discussion about attacks on
hashes has to do with collision attacks, prefix attacks, and length
extension attacks, all of which are extremely relevant for
*cryptographic* purposes.  Specifically, the use of hashes to verify
identity, authority, repudiability, etc...  which emphatically do
*not* apply to the use of an MD5 as a checksum to verify a correct
download.

All of these attacks depend on *something else* being at stake besides
the integrity of the original message.  For example, length-extension
attacks bypass the need to know a "secret" used in a naive hash-based
signature scheme (which is why you're supposed to use HMAC for such
things), while collision attacks let you trick a signer into signing
something that you can later replace with something altered.
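
To make the distinction concrete, here is a minimal Python sketch that
contrasts a naive secret-prefix tag with the standard library's hmac
module.  The secret, the message, and the helper names are hypothetical
and purely illustrative; nothing here is something PyPI actually does.

```python
import hashlib
import hmac

SECRET = b"hypothetical-shared-secret"   # illustrative value only
MESSAGE = b"amount=10&to=alice"          # illustrative value only

def naive_tag(secret: bytes, message: bytes) -> str:
    # Naive scheme: hash(secret || message).  Merkle-Damgard hashes such
    # as MD5 allow the message to be extended without knowing the secret.
    return hashlib.md5(secret + message).hexdigest()

def hmac_tag(secret: bytes, message: bytes) -> str:
    # HMAC wraps the hash in a keyed construction that is not subject
    # to length extension.
    return hmac.new(secret, message, hashlib.md5).hexdigest()

def verify(secret: bytes, message: bytes, tag: str) -> bool:
    # compare_digest avoids leaking information through timing.
    return hmac.compare_digest(hmac_tag(secret, message), tag)
```

Both schemes need a shared secret to be worth attacking, which is
exactly what a public download checksum does not have.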

The current use of #md5 tags isn't subject to either of these kinds of
attack, because:

1. There is no "secret" to be revealed, and
2. The author and signer are the same person

So the only type of attack I've found out about thus far, in my
(admittedly few) hours of study on the subject, that is relevant to
the way we use MD5 on PyPI at present is the so-called "second
pre-image" attack, which is when you're given an existing message and
a hash, and have to create a new message with the same hash...  while
also incorporating something useful in the new message.

The most recent report I saw on second pre-image attacks against full
MD5 estimated a 2**127 strength, meaning that even if you could
process a great many billion tries per second, it would take you
thousands of years to come up with a file that could masquerade as an
existing download.  (And most people's computers and/or internet
connections would choke on the massive file sizes needed for the
still-theoretical Kelsey-Schneier generalized preimage attack, which
in any case would apply equally to just about any other hash we could
currently put out in the field; i.e., it's not specific to a
particular hash algorithm, it just relies on certain properties of the
algorithm.)
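
As a back-of-the-envelope check on that estimate, the sketch below
converts a 2**127 work factor into years.  The rate of 10**12 tries per
second is an assumption picked for illustration, not a measured figure.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_to_exhaust(work_bits: int, tries_per_second: float) -> float:
    """Years needed to perform 2**work_bits tries at the given rate."""
    return (2 ** work_bits) / tries_per_second / SECONDS_PER_YEAR

# Assumed rate: one trillion (10**12) tries per second.
years = years_to_exhaust(127, 1e12)
print(f"{years:.2e} years")
```

At that assumed rate the result is on the order of 10**18 years, so
"thousands of years" is, if anything, a considerable understatement.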

So, yeah, MD5 is *cryptographically* broken, sure.  But it's not
broken for *data integrity*.  And in the PyPI use case, the
"cryptographic" part is all in the SSL being used to fetch the MD5
link in the first place.
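
For reference, the integrity check itself amounts to something like the
following Python sketch.  The URL, the payload, and the function names
are hypothetical; the only thing taken from the discussion is the
"#md5=" fragment convention on download links.

```python
import hashlib
from urllib.parse import urlparse

def expected_md5(link: str):
    """Return the hex digest from a '#md5=...' URL fragment, or None."""
    fragment = urlparse(link).fragment
    if fragment.startswith("md5="):
        return fragment[len("md5="):].lower()
    return None

def matches_md5(data: bytes, link: str) -> bool:
    """True if the downloaded bytes hash to the digest named in the link."""
    want = expected_md5(link)
    return want is not None and hashlib.md5(data).hexdigest() == want

# Hypothetical payload and link, for illustration only.
payload = b"example package contents"
link = ("https://pypi.example/packages/demo-1.0.tar.gz#md5="
        + hashlib.md5(payload).hexdigest())
```

The check only says the bytes match the link; whether the link itself
can be trusted depends on how it was fetched.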


> Here's the important highlights:
>
>     - specifically, a group of researchers described how to create a pair of files that share the same MD5 checksum

Right, that's what's called a "collision attack".  It means that you
can go out *ahead of time*, and make two files with the same checksum,
one good, one evil.  It does *not* mean you get to take an existing
file, and then make a second file with the same checksum.  (The latter
is a "second preimage" attack, against which MD5 is *not* broken.)
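
The gap between the two attacks can be seen on a toy scale.  The Python
sketch below truncates MD5 to 16 bits (an assumption made only so the
searches finish quickly) and brute-forces both a collision and a second
preimage.  It is not an attack on real MD5, and the message names are
made up.

```python
import hashlib
from itertools import count

BITS = 16  # toy digest size, chosen so the searches finish quickly

def toy_hash(data: bytes) -> int:
    """First BITS bits of MD5, standing in for a deliberately weak hash."""
    return int.from_bytes(hashlib.md5(data).digest()[:BITS // 8], "big")

def find_collision():
    """Attacker picks BOTH messages: expect roughly 2**(BITS/2) tries."""
    seen = {}
    for i in count():
        msg = b"msg-%d" % i
        h = toy_hash(msg)
        if h in seen:
            return seen[h], msg, i + 1
        seen[h] = msg

def find_second_preimage(target: bytes):
    """Attacker must match a GIVEN message: expect roughly 2**BITS tries."""
    want = toy_hash(target)
    for i in count():
        msg = b"forged-%d" % i
        if msg != target and toy_hash(msg) == want:
            return msg, i + 1

a, b, collision_tries = find_collision()
forged, preimage_tries = find_second_preimage(b"existing release file")
print(collision_tries, preimage_tries)
```

On a typical run the collision search needs a few hundred tries while
the second-preimage search needs tens of thousands, and that gap widens
exponentially as the digest grows.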

Hash collision attacks in PyPI would basically require an author to
upload a special version of their package that looked innocent, and
then they could later switch that version out with one that's harmful.
And the *way* that this works is that you specially generate *both*
files, in advance.  Which means that the author themselves is
compromised, so the threat is moot.  The author can already upload
compromised code (either through being evil or having their PC
hijacked), and what #md5 it has is 100% irrelevant.

That is, there's nothing stopping an evil author or an author with a
compromised PC from simply uploading a new file with a new MD5,
because PyPI will pass it along in exactly the same way.  Changing
hash algorithms will not affect this threat vector in the slightest.

Given these facts, it makes no sense to fuss over the hash algorithm
in current use, since a concurrent goal here is to switch to file
formats that can be directly signed using, you know, *actual*
cryptography.  ;-)

The new .wheel format makes provisions for modern signature
techniques.  It'd be good if sdists also did.  Then the #md5 tag can
die a natural death, hopefully replaced within the year by a hashtag
that, say, fingerprints the author's public key as registered with
PyPI, or something of that sort.  In the meantime, there's no actual
threat here, so bikeshedding what to replace it with *while keeping
the current system* is like rearranging office furniture in a building
that's about to have demolition charges set underneath it.  ;-)

