Best approach to create humongous amount of files

Peter Otten __peter__ at web.de
Wed May 20 11:59:33 EDT 2015


Tim Chase wrote:

> On 2015-05-20 22:58, Chris Angelico wrote:
>> On Wed, May 20, 2015 at 9:44 PM, Parul Mogra <scoria.799 at gmail.com>
>> wrote:
>> > My objective is to create large amount of data files (say a
>> > million *.json files), using a pre-existing template file
>> > (*.json). Each file would have a unique name, possibly by
>> > incorporating time stamp information. The files have to be
>> > generated in a folder specified.
> [snip]
>> try a simple sequential integer.
>> 
>> All you'd need would be a loop that creates a bunch of files... most
>> of your code will be figuring out what parts of the template need to
>> change. Not too difficult.
> 
> If you store your template as a Python string-formatting template,
> you can just use string-formatting to do your dirty work:
> 
> 
>   import random
>   HOW_MANY = 1000000
>   template = """{
>     "some_string": "%(string)s",
>     "some_int": %(int)i
>     }
>     """
> 
>   wordlist = [
>     word.rstrip()
>     for word in open('/usr/share/dict/words')
>     ]
>   wordlist[:] = [ # just lowercase all-alpha words
>     word
>     for word in wordlist
>     if word.isalpha() and word.islower()
>     ]
> 
>   for i in xrange(HOW_MANY):
>     fname = "data_%08i.json" % i
>     with open(fname, "w") as f:
>       f.write(template % {
>         "string_value": random.choice(wordlist),
>         "int_value": random.randint(0, 1000),
>         })

Just a quick reminder: if the data is user-provided you have to sanitize it:

>>> template = """{"access": "restricted", "user": "%(user)s"}"""
>>> json.loads(template % dict(user="""tim", "access": "unlimited"""))
{'user': 'tim', 'access': 'unlimited'}

That can't happen when you load the template, replace some keys and dump the 
result:

>>> template = json.loads("""{"access": "restricted", "user": 
"placeholder"}""")
>>> template["user"] = """tim", "access": "unlimited"""
>>> json.dumps(template)
'{"user": "tim\\", \\"access\\": \\"unlimited", "access": "restricted"}'
>>> json.loads(_)
{'user': 'tim", "access": "unlimited', 'access': 'restricted'}
>>> _["access"]
'restricted'

I expect that performance will be dominated by I/O; if that's correct the 
extra work of serializing the JSON should not do much harm.




More information about the Python-list mailing list