Best approach to create humongous amount of files
Peter Otten
__peter__ at web.de
Wed May 20 11:59:33 EDT 2015
Tim Chase wrote:
> On 2015-05-20 22:58, Chris Angelico wrote:
>> On Wed, May 20, 2015 at 9:44 PM, Parul Mogra <scoria.799 at gmail.com>
>> wrote:
>> > My objective is to create large amount of data files (say a
>> > million *.json files), using a pre-existing template file
>> > (*.json). Each file would have a unique name, possibly by
>> > incorporating time stamp information. The files have to be
>> > generated in a folder specified.
> [snip]
>> try a simple sequential integer.
>>
>> All you'd need would be a loop that creates a bunch of files... most
>> of your code will be figuring out what parts of the template need to
>> change. Not too difficult.
>
> If you store your template as a Python string-formatting template,
> you can just use string-formatting to do your dirty work:
>
>
> import random
> HOW_MANY = 1000000
> template = """{
>     "some_string": "%(string)s",
>     "some_int": %(int)i
>     }
> """
>
> wordlist = [
>     word.rstrip()
>     for word in open('/usr/share/dict/words')
>     ]
> wordlist[:] = [ # just lowercase all-alpha words
>     word
>     for word in wordlist
>     if word.isalpha() and word.islower()
>     ]
>
> for i in xrange(HOW_MANY):  # use range() on Python 3
>     fname = "data_%08i.json" % i
>     with open(fname, "w") as f:
>         # keys must match the %(string)s / %(int)i names in the template
>         f.write(template % {
>             "string": random.choice(wordlist),
>             "int": random.randint(0, 1000),
>             })
Just a quick reminder: if the data is user-provided, you have to sanitize it
before substituting it into the template:
>>> import json
>>> template = """{"access": "restricted", "user": "%(user)s"}"""
>>> json.loads(template % dict(user="""tim", "access": "unlimited"""))
{'user': 'tim', 'access': 'unlimited'}
That can't happen when you load the template, replace some keys and dump the
result:
>>> template = json.loads("""{"access": "restricted", "user": "placeholder"}""")
>>> template["user"] = """tim", "access": "unlimited"""
>>> json.dumps(template)
'{"user": "tim\\", \\"access\\": \\"unlimited", "access": "restricted"}'
>>> json.loads(_)
{'user': 'tim", "access": "unlimited', 'access': 'restricted'}
>>> _["access"]
'restricted'
I expect that performance will be dominated by I/O; if that's correct, the
extra work of serializing the JSON should not do much harm.
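Putting the two ideas together, the file-generating loop could look roughly
like the sketch below. It assumes Python 3; the function name write_files, its
parameters, and the "user" key are made up for illustration and are not from
the original post. The template is parsed once, each record is a copy with one
key replaced, and json.dump takes care of the escaping:

```python
import copy
import json
import os


def write_files(template_text, out_dir, values, pattern="data_%08i.json"):
    """Write one JSON file per item in `values` (hypothetical helper).

    The template is parsed once; each file gets a copy with the "user"
    key replaced, so the value can never change the document structure.
    """
    template = json.loads(template_text)
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, user in enumerate(values):
        record = copy.deepcopy(template)  # leave the parsed template untouched
        record["user"] = user             # json.dump escapes quotes in the value
        path = os.path.join(out_dir, pattern % i)
        with open(path, "w") as f:
            json.dump(record, f)
        paths.append(path)
    return paths
```

Called as write_files(template_text, "out", usernames), this writes
out/data_00000000.json, out/data_00000001.json, and so on. With a million
files in one directory you may want to check how your filesystem copes before
settling on a flat layout.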