Reading in large logfiles, and processing lines in batches - maximising throughput?

Victor Hooi victorhooi at gmail.com
Wed Sep 16 05:27:44 EDT 2015


I'm using Python to parse metrics out of logfiles.

The logfiles are fairly large (multiple GBs), so I'm keen to do this in a reasonably performant way.

The metrics are being sent to an InfluxDB database, so it's better if I can batch multiple metrics together, rather than sending them individually.

Currently, I'm using the grouper() recipe from the itertools documentation to process multiple lines in "chunks" - I then send the collected points to the database:

    from itertools import zip_longest

    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return zip_longest(fillvalue=fillvalue, *args)

    with open(args.input_file, 'r') as f:
        line_counter = 0
        for chunk in grouper(f, args.batch_size):
            json_points = []
            for line in chunk:
                if line is None:  # grouper pads the final chunk with fillvalue=None
                    continue
                line_counter += 1
                # Do some processing
                json_points.append(some_metrics)
            if json_points:
                write_points(logger, client, json_points, line_counter)

However, not every line will produce metrics - so I'm currently batching on the number of input lines read, rather than on the number of points I actually send to the database.

My question is: would it make sense to simply have a json_points list that accumulates metrics, check its size on each iteration, and send the points off once it reaches a certain size? E.g.:

    BATCH_SIZE = 1000

    with open(args.input_file, 'r') as f:
        json_points = []
        for line_number, line in enumerate(f, 1):
            # Do some processing
            json_points.append(some_metrics)
            if len(json_points) >= BATCH_SIZE:
                write_points(logger, client, json_points, line_number)
                json_points = []
        # Flush any points left over after the last full batch
        if json_points:
            write_points(logger, client, json_points, line_number)

Also, I originally used grouper because I thought it would be better to process lines in batches, rather than individually. However, is there actually any throughput advantage to doing it this way in Python? Or is there a better way to get more throughput?
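
One option I'm considering (a rough sketch, not tested against the real data): push the per-line parsing into a generator and batch the *emitted points* with itertools.islice, so each batch is always BATCH_SIZE points no matter how many lines it took to produce them. It assumes a hypothetical parse_line() helper that returns a point dict (or None for lines with no metric), and reuses write_points(), client, logger and args from above:

    from itertools import islice

    def iter_points(lines):
        """Yield (line_number, point) for every line that actually produces a metric."""
        for line_number, line in enumerate(lines, 1):
            point = parse_line(line)   # hypothetical parser; returns None for non-metric lines
            if point is not None:
                yield line_number, point

    with open(args.input_file, 'r') as f:
        pairs = iter_points(f)
        while True:
            # Pull at most BATCH_SIZE points, however many input lines that takes
            batch = list(islice(pairs, BATCH_SIZE))
            if not batch:
                break
            json_points = [point for _, point in batch]
            write_points(logger, client, json_points, batch[-1][0])   # last line number seen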

We can assume for now that the CPU load of the processing is fairly light (mainly string splitting, and date parsing).
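
For concreteness, the kind of processing I mean looks roughly like the sketch below - the log format shown is made up purely for illustration, but the actual work is just str.split() plus datetime.strptime(). The returned dict is shaped like what the influxdb-python client accepts; adjust it to whatever write_points() expects:

    from datetime import datetime

    def parse_line(line):
        """Hypothetical parser: '2015-09-16 05:27:44 some_metric 42.0' -> point dict."""
        fields = line.split()
        if len(fields) != 4:
            return None                        # not a metric line
        date_part, time_part, name, value = fields
        try:
            timestamp = datetime.strptime(date_part + ' ' + time_part,
                                          '%Y-%m-%d %H:%M:%S')
            return {
                'measurement': name,
                'time': timestamp.isoformat(),
                'fields': {'value': float(value)},
            }
        except ValueError:                     # malformed date or value
            return None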


