Accumulating points in batch for sending off

Victor Hooi victorhooi at gmail.com
Fri Sep 4 18:09:00 EDT 2015


Hi,

I'm using Python to parse metrics out of logfiles and ship them off to a database called InfluxDB, using their Python driver (https://github.com/influxdb/influxdb-python).

With InfluxDB, it's more efficient to pack more points into each message.

Hence, I'm using the grouper() recipe from the itertools documentation (https://docs.python.org/3.6/library/itertools.html) to process the data in chunks, shipping off the points at the end of each chunk:

  from itertools import zip_longest

  def grouper(iterable, n, fillvalue=None):
      "Collect data into fixed-length chunks or blocks"
      # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
      args = [iter(iterable)] * n
      return zip_longest(*args, fillvalue=fillvalue)
  ....
  for chunk in grouper(parse_iostat(f), 500):
      json_points = []
      for block in chunk:
          if block:
              try:
                  for i, line in enumerate(block):
                      # DO SOME STUFF
                      pass  # placeholder so the elided loop body is valid Python
              except ValueError as e:
                  print("Bad output seen - skipping")
      client.write_points(json_points)
      print("Wrote in {} points to InfluxDB".format(len(json_points)))


However, for some parsers, not every line will yield a datapoint, so a chunk of 500 input blocks can produce fewer than 500 points.
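
To illustrate the problem, here is a small sketch with made-up data. The sample lines and the "only lines containing cat yield a point" rule are assumptions for the example, not my real parser. Chunking the input by 4 gives batches of uneven, smaller size:

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Made-up input: only lines containing 'cat' count as datapoints.
lines = ["cat", "dog", "dog", "cat", "dog", "dog", "dog", "cat", "cat", "cat"]

batch_sizes = []
for chunk in grouper(lines, 4):
    # Skip the None padding added by grouper() and any non-matching lines.
    points = [line for line in chunk if line is not None and "cat" in line]
    batch_sizes.append(len(points))

print(batch_sizes)
```

Each chunk covers 4 input lines (the last one is padded with None), but the number of points actually sent per batch varies with the input.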

I'm wondering whether, rather than chunking the input, it would be better to check len() on the points list after each line and send the batch off once it's full. E.g.:

    #!/usr/bin/env python3

    json_points = []
    _BATCH_SIZE = 2

    # Use a context manager so the file is closed when the loop ends.
    with open('blah.txt', 'r') as f:
        for line_number, line in enumerate(f):
            if 'cat' in line:
                print('Found cat on line {}'.format(line_number + 1))
                json_points.append(line_number)
                print("json_points contains {} points".format(len(json_points)))
            if len(json_points) >= _BATCH_SIZE:
                print('Sending off points!')
                json_points = []

    print("Loop finished. json_points contains {} points".format(len(json_points)))
    # Flush whatever is left over, but only if there is something to send.
    if json_points:
        print('Sending off points!')

Does the above seem reasonable? Any issues you see? Or are there any other more efficient approaches to doing this?
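
For comparison, here is a rough sketch of the same accumulate-and-flush idea factored into a generator, so the batching is separate from the parsing. The names batch_points and find_cats and the sample data are made up for illustration, and the list append stands in for the real client.write_points() call:

```python
def batch_points(points, batch_size):
    """Yield lists of up to batch_size points; the last one may be shorter."""
    batch = []
    for point in points:
        batch.append(point)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the leftover points, skipping the flush if there are none.
        yield batch

def find_cats(lines):
    """Hypothetical parser: yield the 1-based number of each line containing 'cat'."""
    for line_number, line in enumerate(lines, start=1):
        if 'cat' in line:
            yield line_number

sample_lines = ["cat", "dog", "cat", "cat", "dog", "cat", "cat"]

sent = []
for batch in batch_points(find_cats(sample_lines), 2):
    sent.append(batch)  # stand-in for client.write_points(batch)

print(sent)
```

Since the batching happens on the parser's output rather than on the input lines, every batch except possibly the last is full, regardless of how many lines get skipped.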


