Template language for random string generation

Paul Wolf paulwolf333 at gmail.com
Mon Aug 11 01:06:39 EDT 2014


On Sunday, 10 August 2014 17:31:01 UTC+1, Steven D'Aprano  wrote:
> Devin Jeanpierre wrote:
>
> > On Fri, Aug 8, 2014 at 2:01 AM, Paul Wolf <paulwolf333 at gmail.com> wrote:
> >> This is a proposal with a working implementation for a random string
> >> generation template syntax for Python. `strgen` is a module for
> >> generating random strings in Python using a regex-like template language.
> >> Example:
> >>
> >>     >>> from strgen import StringGenerator as SG
> >>     >>> SG("[\l\d]{8:15}&[\d]&[\p]").render()
> >>     u'F0vghTjKalf4^mGLk'
> >
> > Why aren't you using regular expressions? I am all for conciseness,
> > but using an existing format is so helpful...
>
> You've just answered your own question:
>
> > Unfortunately, the equivalent regexp probably looks like
> > r'(?=.*[0-9])(?=.*[A-Z])(?=.*[a-z])[a-zA-Z0-9]{8:15}'
>
> Apart from being needlessly verbose, regex syntax is not appropriate because
> it specifies too much, specifies too little, and specifies the wrong
> things. It specifies too much: regexes like ^ and $ are meaningless in this
> case. It specifies too little: there's no regex for the "shuffle operator".
> And it specifies the wrong things: regexes like (?= ...) as used in your
> example are for matching, not generating strings, and it isn't clear
> what "match any character but don't consume any of the string" means when
> generating strings.
>
> Personally, I think even the OP's specified language is too complex. For
> example, it supports literal text, but given the use-case (password
> generators) do we really want to support templates like "password[\d]"? I
> don't think so, and if somebody did, they can trivially say "password" +
> SG('[\d]').render().
>
> Larry Wall (the creator of Perl) has stated that one of the mistakes with
> Perl's regular expression mini-language is that the Huffman coding is
> wrong. Common things should be short, uncommon things can afford to be
> longer. Since the most common thing for password generation is to specify
> character classes, they should be short, e.g. d rather than [\d] (one
> character versus four).
>
> The template given could potentially be simplified to:
>
> "(LD){8:15}&D&P"
>
> where the round brackets () are purely used for grouping. Character codes
> are specified by a single letter. (I use uppercase to avoid the problem
> that l & 1 look very similar. YMMV.) The model here is custom format codes
> from spreadsheets, which should be comfortable to anyone who is familiar
> with Excel or OpenOffice. If you insist on having the facility to including
> literal text in your templates, might I suggest:
>
> "'password'd"  # Literal string "password", followed by a single digit.
>
> but personally I believe that for the use-case given, that's a mistake.
>
> Alternatively, date/time templates use two-character codes like %Y %m etc,
> which is better than
>
> > (I've been working on this kind of thing with regexps, but it's still
> > incomplete.)
> >
> >> * Uses SystemRandom class (if available, or falls back to Random)
> >
> > This sounds cryptographically weak. Isn't the normal thing to do to
> > use a cryptographic hash function to generate a pseudorandom sequence?
>
> I don't think that using a good, but not cryptographically-strong, random
> number generator to generate passwords is a serious vulnerability. What's
> your threat model? Attacks on passwords tend to be one of a very few:
>
> - dictionary attacks (including tables of common passwords and
>   simple transformations of words, e.g. 'pas5w0d');
>
> - brute force against short and weak passwords;
>
> - attacking the hash function used to store passwords (not the password
>   itself), e.g. rainbow tables;
>
> - keyloggers or some other way of stealing the password (including
>   phishing sites and the ever-popular "beat them with a lead pipe
>   until they give up the password");
>
> - other social attacks, e.g. guessing that the person's password is their
>   date of birth in reverse.
>
> But unless the random number generator is *ridiculously* weak ("9, 9, 9, 9,
> 9, 9, ...") I can't see any way to realistically attack the password
> generator based on the weakness of the random number generator. Perhaps I'm
> missing something?
>
> > Someone should write a cryptographically secure pseudorandom number
> > generator library for Python. :(
>
> Here, let me google that for you :-)
>
> https://duckduckgo.com/html/?q=python+crypto
>
> -- 
> Steven

I should clarify that password generation is only one of several use cases that strgen is intended to support. It is also for:

Test data generation: 

    [\l]{1:20}&[._]{0:1}@[\l]{15}.(com|net|org)

This yields email addresses that use word characters and might have a period or an underscore in the first part. Or

	((john|robert|harry)|(mary|agnes|shelly)) (smith|jones|taylor)
	
produce names with a roughly equal distribution of female and male first names. I contemplated, but did not implement, a feature where you can give strgen named functions that generate the required string (using whatever selection process each implementation chooses):

	($malefirstname|$femalefirstname) $lastname

where

	def malefirstname():
		# get a name from the database at random
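
A minimal sketch of how such named functions might plug in, purely as an illustration: strgen does not implement this, and `expand_placeholders` and the lambda stand-ins for the database lookups are hypothetical names. It assumes each `$name` is replaced by the result of calling a registered function before the rest of the template is processed.

```python
import re

def expand_placeholders(template, functions):
    """Replace each $name in template with the result of calling functions[name]."""
    def call(match):
        name = match.group(1)
        return functions[name]()  # KeyError if the name was never registered
    return re.sub(r"\$(\w+)", call, template)

# Hypothetical stand-ins for "get a name from the database at random".
functions = {
    "malefirstname": lambda: "john",
    "femalefirstname": lambda: "mary",
    "lastname": lambda: "smith",
}

expanded = expand_placeholders("($malefirstname|$femalefirstname) $lastname", functions)
print(expanded)  # (john|mary) smith
```

The expanded string is still a template containing an alternation, so it could then be handed to the generator as usual.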

Voucher generation:

	[\d]{10}
	
10-digit voucher numbers. 
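
For comparison, here is roughly what that template amounts to in plain Python. This is a sketch and not strgen's code: `make_rng` and `voucher` are made-up names, and the fallback only mirrors the "SystemRandom if available, otherwise Random" behaviour mentioned in the original proposal.

```python
import random
import string

def make_rng():
    """Prefer the OS entropy source; fall back to the default generator."""
    try:
        rng = random.SystemRandom()
        rng.random()  # raises NotImplementedError when os.urandom is unavailable
        return rng
    except NotImplementedError:
        return random.Random()

def voucher(length=10, rng=None):
    """Return a string of `length` random decimal digits."""
    if rng is None:
        rng = make_rng()
    return "".join(rng.choice(string.digits) for _ in range(length))

print(voucher())
```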

Note that security is not a concern in any of the foregoing.
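
To show why the nested grouping in the name template above gives a roughly even split, here is a plain-Python sketch. It assumes that an alternation picks uniformly among its alternatives (an assumption about the semantics, not something verified against the strgen source); `random_name` and the list names are illustrative.

```python
import random
from collections import Counter

MALE = ["john", "robert", "harry"]
FEMALE = ["mary", "agnes", "shelly"]
LAST = ["smith", "jones", "taylor"]

def random_name(rng=random):
    group = rng.choice([MALE, FEMALE])  # outer alternation: pick a group, 50/50
    first = rng.choice(group)           # inner alternation: uniform within the group
    return first + " " + rng.choice(LAST)

rng = random.Random(2014)  # seeded only so the tally is repeatable
counts = Counter(random_name(rng).split()[0] in FEMALE for _ in range(10000))
print(counts)
```

Because the group is chosen first, the split stays near 50/50 even if one list is longer than the other, which a single flat alternation over all first names would not guarantee.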

> Since the most common thing for password generation is to specify 
> character classes, they should be short, e.g. d rather than [\d] (one 
> character versus four).

But that assumes only standard character classes and not custom ones like "[aeiuy]", not to mention Unicode ranges beyond the English alphabet.
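
To make the point concrete: a single-letter code can only name a class that was agreed on in advance, whereas an explicit class is an arbitrary set of characters. A small Python 3 sketch (not strgen code; `pick` and the variable names are illustrative, and the Greek range is just one example of a non-English alphabet):

```python
import random

vowels = "aeiuy"  # the custom class from the text, taken verbatim
# U+03B1 (alpha) through U+03C9 (omega), which includes the final sigma
greek_lower = "".join(chr(cp) for cp in range(0x03B1, 0x03CA))

def pick(alphabet, count, rng=random):
    """Draw `count` characters uniformly from an arbitrary alphabet."""
    return "".join(rng.choice(alphabet) for _ in range(count))

print(pick(vowels, 5), pick(greek_lower, 5))
```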

> If you insist on having the facility to including
> literal text in your templates,

I do :-), as per above.

> might I suggest:
> "'password'd"  # Literal string "password", followed by a single digit.

As per above, I think the more verbose notation for character classes is necessary, although your suggestion is not a bad one. I could have taken a route where you define aliases for the character classes and then construct a very lean template. That is effectively what the (unimplemented) function expressions do in the example above.
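
A rough sketch of that alias route, under stated assumptions: the alias table and `render` are hypothetical, the template is given as a list of tuples instead of being parsed from a string, "{8:15}" is read as an inclusive length range, and "&" is read as "shuffle the pieces together".

```python
import random
import string

# Hypothetical aliases: one letter per predefined character class.
ALIASES = {
    "L": string.ascii_letters,
    "D": string.digits,
    "P": string.punctuation,
}

def render(spec, rng=None):
    """spec is a list of (codes, min_len, max_len); all pieces are shuffled together."""
    if rng is None:
        rng = random.SystemRandom()
    chars = []
    for codes, lo, hi in spec:
        alphabet = "".join(ALIASES[c] for c in codes)
        chars.extend(rng.choice(alphabet) for _ in range(rng.randint(lo, hi)))
    rng.shuffle(chars)
    return "".join(chars)

# Roughly "(LD){8:15}&D&P": 8 to 15 letters or digits, one digit, one punctuation mark.
password = render([("LD", 8, 15), ("D", 1, 1), ("P", 1, 1)])
print(password)
```

A custom class such as "[aeiuy]" would need its own alias entry here, which is the trade-off: the template gets shorter, but every class has to be declared up front.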

Preventing weak passwords (such as '[abc]{3}') is something I chose not to take up in the strgen module, because it should be (mostly) agnostic about what constitutes good security and should support a broader set of use cases, as per above.
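
For a sense of scale, the size of the space a template draws from is simple arithmetic. The sketch below assumes each character is chosen independently and uniformly; `bits` is an illustrative helper, not part of strgen.

```python
import math
import string

def bits(alphabet_size, length):
    """Entropy in bits of `length` independent uniform draws from the alphabet."""
    return length * math.log2(alphabet_size)

weak = bits(3, 3)  # '[abc]{3}': 3**3 = 27 possible strings
eight = bits(len(string.ascii_letters + string.digits), 8)  # 62**8 possible strings
print(round(weak, 2), round(eight, 2))  # 4.75 47.63
```

The module can report or ignore such numbers; deciding what counts as "too weak" is left to the caller.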


