[scikit-learn] merging the predicted labels with original dataframe

Ruchika Nayyar ruchika.work at gmail.com
Thu Jul 20 11:23:12 EDT 2017


Hi Scikit-learn Users,

I am analyzing proxy logs and want to use machine learning to classify the
recorded events as either "OBSERVED" or "BLOCKED". The input file is a CSV
with tokenized string fields. Here is a snippet of my code:

**************
# imports used by the snippet (tts is an alias for train_test_split)
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split as tts
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# load the file
M = pd.read_csv("output100k.csv").fillna('')

# define the fields to use
min_df = 0.001
max_df = .7
TxtCols = ['request__tokens', 'requestClientApplication__tokens',
           'destinationZoneURI__tokens', 'cs-categories__tokens',
           'fileType__tokens', 'requestMethod__tokens', 'tcp_status1',
           'app', 'tcp_status2', 'dhost']
NumCols = ['rt', 'out', 'in', 'time-taken', 'rt_length', 'dt_length']

# vectorize the fields: one TF-IDF model per text column
TfidfModels = [TfidfVectorizer(min_df=min_df, max_df=max_df).fit(M[t])
               for t in TxtCols]

# define the columns of the sparse matrix:
# TF-IDF blocks first, then one column per numeric field
X = hstack(
    [m.transform(M[n].fillna('')) for m, n in zip(TfidfModels, TxtCols)]
    + [csr_matrix(pd.to_numeric(M[n]).fillna(-1).values).T for n in NumCols]
)

# target variable
Y = M.act.values

# define train/test parts and scale them
X_train, X_test, y_train, y_test = tts(X, Y, test_size=0.2)
scaler = StandardScaler(with_mean=False, with_std=True)
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# define the model and train
clf = MLPClassifier(activation='logistic',
                    solver='lbfgs').fit(X_train, y_train)

# use the model to predict on X_test and convert into a data frame
df = pd.DataFrame(clf.predict(X_test))

**

199845  OBSERVED
199846  OBSERVED

[199847 rows x 1 columns]>

**

Now at the end I have a DataFrame with 20K entries and just one column,
"Label". How do I connect it to the main DataFrame M? I want to do some
investigations on this outcome.

Any help?

Thanks,
Ruchika
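
For the question above, one possible approach (a minimal sketch, not a
confirmed answer from the list) is to pass the row labels of M through
train_test_split as an extra array, so the test rows can be matched back to
M afterwards. The toy frame, its values, and the DecisionTreeClassifier
stand-in for the MLPClassifier are assumptions made only to keep the example
small and runnable; the names pred, M_test, and M_all are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for M; the values are made up for illustration.
M = pd.DataFrame({
    "rt": np.arange(20, dtype=float),
    "out": np.arange(20, dtype=float)[::-1],
    "act": ["OBSERVED", "BLOCKED"] * 10,
})
X = M[["rt", "out"]].values
Y = M["act"].values

# Pass the row labels of M as a third array so they are split
# in exactly the same way as X and Y.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, Y, M.index.values, test_size=0.2, random_state=0
)

# Stand-in classifier; any fitted estimator with predict() works the same way.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Give the predictions the original row labels as their index.
pred = pd.Series(clf.predict(X_test), index=idx_test, name="Label")

# Only the test rows of M, with the predicted label attached.
M_test = M.loc[idx_test].assign(Label=pred)

# The whole frame; rows that were used for training get NaN in "Label".
M_all = M.join(pred)
```

Because pred carries the original index, pandas aligns it with M by row
label, so the shuffling done by the split does not matter. M_test can then
be filtered, for example on rows where "act" and "Label" disagree.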