[BangPypers] Question on making hive python UDF object persistent

Wed Jan 23 12:07:40 EST 2019

Hi All

I'm trying to write a Hive UDF in Python that loads a pickled object
(basically a set of linear model weights). The weights read from the pickle
are used to score a set of observations from a Hive table. Once I have
computed a score, I also want to update the weights based on the truth value
that I receive from the same Hive table, so that the next observation is
scored with the updated weights.

Something like this:

Python UDF code:

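import pickle
import sys

import numpy as np


def sigmoid(z):
    # Logistic function (not defined in the original snippet).
    return 1.0 / (1.0 + np.exp(-z))


# Load the initial model weights once, before reading any rows.
with open('B.pkl', 'rb') as f:
    betas = pickle.load(f)

for line in sys.stdin:
    # Hive streams each row as tab-separated text; the last field is the label.
    data = line.strip().split('\t')
    X = np.array(data[:-1], dtype=float)
    y = float(data[-1])
    ycap = sigmoid(np.dot(betas, X))
    # Normal-equation update from this one row. pinv replaces inv because
    # X.T X has rank 1 for a single observation, so inv would fail.
    Xr = X.reshape(1, -1)
    new_beta = np.dot(np.dot(np.linalg.pinv(np.dot(Xr.T, Xr)), Xr.T), [y])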
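
For reference, a per-row update of the kind described could be sketched as a
single stochastic gradient step on the logistic loss. This is a swapped-in
technique, not the normal-equation formula in the snippet, which is not well
defined for one row because X.T X is singular whenever there is more than one
feature. The names sgd_step and lr are hypothetical, and the learning rate of
0.1 is an arbitrary choice.

```python
import numpy as np


def sigmoid(z):
    # Logistic function.
    return 1.0 / (1.0 + np.exp(-z))


def sgd_step(betas, x, y, lr=0.1):
    """Score one row, then take one gradient step on the logistic loss.

    Hypothetical helper: returns the score computed with the old weights
    together with the updated weights.
    """
    score = sigmoid(np.dot(betas, x))
    # For one observation the gradient of the log loss is (score - y) * x.
    new_betas = betas - lr * (score - y) * x
    return score, new_betas


# Example: start from zero weights and see one labelled observation.
betas = np.zeros(2)
score, betas = sgd_step(betas, np.array([1.0, 2.0]), 1.0)
```

The score is returned before the weights change, so each observation is
scored with the weights learned from the rows that came before it.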

I have read that a Python object in a Hive UDF can be made persistent across
all the cores (a stateful UDTF). Can anyone help me with some sample code?
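
On the persistence part: a script run through Hive's TRANSFORM clause reads
rows from stdin for as long as its task runs, so an object held by the script
stays alive from row to row within that one process. Each map or reduce task
starts its own copy of the script, though, so this state is not shared across
cores. A minimal sketch under those assumptions follows; load_weights,
save_weights, stream_rows and the update callable are hypothetical names, not
part of any Hive API.

```python
import pickle


def load_weights(path):
    # Read the pickled weights once, at process start.
    with open(path, 'rb') as f:
        return pickle.load(f)


def save_weights(weights, path):
    # Write the final weights so a later run can start from them.
    with open(path, 'wb') as f:
        pickle.dump(weights, f)


def stream_rows(lines, weights, update, out):
    """Score each tab-separated row and carry the weights to the next row.

    `update(weights, x, y)` is a caller-supplied function (an assumption of
    this sketch) that returns a (score, new_weights) pair.
    """
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        x = [float(v) for v in fields[:-1]]
        y = float(fields[-1])
        score, weights = update(weights, x, y)
        out.write('%s\n' % score)  # one score per input row
    return weights
```

In the script itself this would be called with sys.stdin as lines and
sys.stdout as out, followed by save_weights on the returned value. Routing all
rows through a single reducer may be one way to keep one consistent set of
weights, at the cost of parallelism.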

Thanks in advance!

Pramod