Reading in large logfiles, and processing lines in batches - maximising throughput?

Victor Hooi victorhooi at gmail.com
Wed Sep 16 05:27:44 EDT 2015


I'm using Python to parse metrics out of logfiles.

The logfiles are fairly large (multiple GBs), so I'm keen to do this in a reasonably performant way.

The metrics are being sent to an InfluxDB database, so it's better if I can batch multiple metrics together, rather than sending them individually.

Currently, I'm using the grouper() recipe from the itertools documentation to process multiple lines in "chunks" - I then send the collected points to the database:

    from itertools import zip_longest

    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return zip_longest(fillvalue=fillvalue, *args)

    with open(args.input_file, 'r') as f:
        line_counter = 0
        for chunk in grouper(f, args.batch_size):
            json_points = []
            for line in chunk:
                if line is None:  # grouper pads the final chunk with fillvalue=None
                    continue
                line_counter += 1
                # Do some processing
                json_points.append(some_metrics)
            if json_points:
                write_points(logger, client, json_points, line_counter)

However, not every line will produce metrics - so I'm currently batching on the number of input lines read, rather than on the number of points I actually send to the database.

My question is: would it make sense to simply have a json_points list that accumulates metrics, check its size on each iteration, and send the points off once it reaches a certain size? E.g.:

    BATCH_SIZE = 1000

    with open(args.input_file, 'r') as f:
        json_points = []
        for line_number, line in enumerate(f, 1):
            # Do some processing
            json_points.append(some_metrics)
            if len(json_points) >= BATCH_SIZE:
                write_points(logger, client, json_points, line_number)
                json_points = []
        # Flush any points left over after the last full batch
        if json_points:
            write_points(logger, client, json_points, line_number)

Also, I originally used grouper because I thought it would be better to process lines in batches, rather than individually. However, is there actually any throughput advantage to doing it this way in Python? Or is there a better way to get more throughput?
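
One option I'm considering (a rough sketch, not tested against the real data): push the per-line parsing into a generator and batch the *emitted points* with itertools.islice, so each batch is always BATCH_SIZE points no matter how many lines it took to produce them. It assumes a hypothetical parse_line() helper that returns a point dict (or None for lines with no metric), and reuses write_points(), client, logger and args from above:

    from itertools import islice

    def iter_points(lines):
        """Yield (line_number, point) for every line that actually produces a metric."""
        for line_number, line in enumerate(lines, 1):
            point = parse_line(line)   # hypothetical parser; returns None for non-metric lines
            if point is not None:
                yield line_number, point

    with open(args.input_file, 'r') as f:
        pairs = iter_points(f)
        while True:
            # Pull at most BATCH_SIZE points, however many input lines that takes
            batch = list(islice(pairs, BATCH_SIZE))
            if not batch:
                break
            json_points = [point for _, point in batch]
            write_points(logger, client, json_points, batch[-1][0])   # last line number seen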

We can assume for now that the CPU load of the processing is fairly light (mainly string splitting, and date parsing).
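
For concreteness, the kind of processing I mean looks roughly like the sketch below - the log format shown is made up purely for illustration, but the actual work is just str.split() plus datetime.strptime(). The returned dict is shaped like what the influxdb-python client accepts; adjust it to whatever write_points() expects:

    from datetime import datetime

    def parse_line(line):
        """Hypothetical parser: '2015-09-16 05:27:44 some_metric 42.0' -> point dict."""
        fields = line.split()
        if len(fields) != 4:
            return None                        # not a metric line
        date_part, time_part, name, value = fields
        try:
            timestamp = datetime.strptime(date_part + ' ' + time_part,
                                          '%Y-%m-%d %H:%M:%S')
            return {
                'measurement': name,
                'time': timestamp.isoformat(),
                'fields': {'value': float(value)},
            }
        except ValueError:                     # malformed date or value
            return None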


