PyPI password rules

Chris Angelico rosuav at gmail.com
Thu Aug 28 01:08:22 EDT 2014


On Thu, Aug 28, 2014 at 2:28 PM, Skip Montanaro <skip at pobox.com> wrote:
>
> On Wed, Aug 27, 2014 at 10:32 PM, Chris Angelico <rosuav at gmail.com> wrote:
>>
>> I'm not sure I understand how your 'common' value works, though. Does
>> the default 0.6 mean you take the 60% most common words? Those above
>> the 60th percentile of frequency? Something else?
>
>
> Yes, basically. A word has to pass the following hurdles before being deemed
> "common":
>
> * length >= 4
> * all lower case
> * no punctuation
> * not already "emitted" (made it to the common list)
> * seen this word at least 10 times
> * have seen at least 100 words
>
> Then and only then, if its word count places it in the top T percent of all
> seen words (T defaults to 60%), is it added to the "emitted" or common word
> list. Only words in that list are chosen as password material. Further, the
> dict command allows you to identify words in the common list which aren't in
> your computer's words file. You can give any of them (or any other word you
> don't like) as arguments to the "bad" command.
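[As a rough sketch, the hurdles described above might look like the following - all names and signatures here are hypothetical, not taken from Skip's actual code:]

```python
import string

PUNCTUATION = set(string.punctuation)

def is_candidate(word):
    """Per-word hurdles: length >= 4, all lower case, no punctuation."""
    return (len(word) >= 4
            and word == word.lower()
            and not any(ch in PUNCTUATION for ch in word))

def promote_common(counts, emitted, total_seen, threshold=0.6,
                   min_count=10, min_total=100):
    """Promote words into the 'emitted' (common) set once their count
    places them in the top `threshold` fraction of all seen words,
    provided enough words have been seen overall."""
    if total_seen < min_total:
        return
    ranked = sorted(counts, key=counts.get, reverse=True)
    cutoff = int(len(ranked) * threshold)
    for word in ranked[:cutoff]:
        if word not in emitted and counts[word] >= min_count:
            emitted.add(word)
```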

Interesting. I suspect this may have issues, since you're applying these
checks progressively: a word that's common in the early posts gets
weighted without regard to subsequent posts (you do require 100 words to
have been seen before recording anything, but that's still not all that
many).

Minstrel Hall's Polly doesn't care about dictionaries at all (in fact,
she's set herself up as the sole authority on word and letter
frequencies, for the sake of a Scrabble variant called "Pollylogy"; I
guess the awesomeness of being a parrot went to her head); I'm not
100% sure from looking at your code, but I think your use of the
dictionary is purely advisory? So both systems are quite capable of
returning non-words (flaskapp came up in your example).

> I won't pretend to understand all that entropy stuff, and I realize that
> given my 35k+ messages and my somewhat severe constraints, I have only
> deemed 1057 words from my corpus as "worthy" so far. That's about 10 bits of
> entropy per word? That obviously improves the chances my passwords can be
> guessed, but I suspect I can lower my T value sufficiently to increase the
> pool of candidate words to whatever amount of entropy you require. I agree
> though, it is a bit backwards from how the XKCD 936 thing works.

In this case, "entropy" is just the number of possible passwords it
could generate. With 1057 words, yes, that's about 10 bits per word.
If you simply pick one word and use it as your password, that's
equivalent to a 10-bit number: someone who got hold of your code and
set out to crack the password could try every word in succession, and
exhaust the whole space just by counting up to 1057. If you take two
words, chosen completely independently, a brute-force attack has to
try every word paired with every possible second word, so you've
squared the number of possibilities (there are now 1117249, or 1116192
if you exclude doubled words) - that's about 20 bits of entropy,
because it's equivalent to a 20-bit number (which would give 1048576
possibilities).
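[Spelled out in code, as a quick sanity check of the arithmetic above:]

```python
import math

WORDS = 1057  # size of the common-word pool from the example

# One word: roughly 10 bits (2**10 = 1024 possibilities).
one_word_bits = math.log2(WORDS)

# Two independently chosen words square the search space.
two_words = WORDS ** 2            # all ordered pairs
two_words_distinct = two_words - WORDS  # excluding doubled words
two_word_bits = math.log2(two_words)    # roughly 20 bits
```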

> I just realized something. To keep it from taking forever to start up before
> I had a pickle save file, I limited the messages to those since 2014-08-22.
> Not too many. Not sure how to deal with that, but for the moment, I
> initialize Polly.latest to 2014-05-01 in my sandbox (not checked in). That
> will considerably increase the number of messages scanned. While it's doing
> that (in a separate thread), I can watch the progress with the stat command
> at the ? prompt:

I like the threaded model. Minstrel Hall is already a threaded server,
with every connected client having one thread; Polly adds to her
corpus on the thread of everyone who speaks, and generates passwords
on the requestor's thread, so it comes to the same thing. My data
structures are fairly simple, and there's just one big shared mapping
(dictionary) that maintains a half-cooked data set; the 936 generator
builds everything from that, and since the data set is only 77,106
elements in size, it actually doesn't need to worry about performance
or anything. Although I think Python's dict isn't designed to cope
with *any* key mutation during iteration, so that may still require
the semaphore.
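[To illustrate the point: CPython's dict raises RuntimeError if it
changes size while being iterated, so a lock around writes and a copy
before iteration is one simple fix. This is a minimal sketch - the
names are made up, not from either codebase:]

```python
import threading

# Demonstrate the failure mode: adding a key mid-iteration.
d = {"word": 1}
try:
    for key in d:
        d["another"] = 1  # size change during iteration
except RuntimeError:
    pass  # "dictionary changed size during iteration"

# A lock (the "semaphore" mentioned above) coordinates the threads:
lock = threading.Lock()

def add_word(counts, word):
    """Writer thread: update a count under the lock."""
    with lock:
        counts[word] = counts.get(word, 0) + 1

def snapshot(counts):
    """Reader thread: copy under the lock, then iterate the copy freely."""
    with lock:
        return dict(counts)
```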

> Hmmm... I realize now that I'm not seeing all messages, at least I don't
> think so. So much to learn about IMAP...

Hmm, can't help there, sorry.

ChrisA



More information about the Python-list mailing list