[scikit-learn] Text classification of large dataset

Roman Yurchak rth.yurchak at gmail.com
Wed Dec 20 13:32:35 EST 2017


Ranjana,

Have a look at this example:
http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html
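The core of that example is roughly this pattern (an untested sketch;
get_minibatches() is a placeholder for a chunked CSV reader, e.g. pandas
read_csv with chunksize, yielding (texts, labels) per chunk):

    import numpy as np
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    # HashingVectorizer is stateless, so each chunk can be transformed
    # independently and the classifier updated incrementally.
    vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)
    clf = SGDClassifier(loss='log')
    all_classes = np.arange(7000)  # your ~7k label-encoded categories

    for texts, y in get_minibatches('trainset.csv', batch_size=10000):
        X = vectorizer.transform(texts)
        clf.partial_fit(X, y, classes=all_classes)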

Since you have a lot of RAM, you may not need to make the whole 
classification pipeline out-of-core. A starting point with your current 
code could be to write a generator that loads and pre-processes the text 
in chunks, then feeds it one document at a time to CountVectorizer.fit 
(it accepts any iterable). To reduce memory usage, filtering out the too 
frequent tokens (instead of the infrequent ones) could help too. Make 
sure you L2-normalize your data before the classifier. You could use 
SGDClassifier(loss='log') or LogisticRegression with the sag or saga 
solver. The multi_class='multinomial' parameter might also be worth 
trying, particularly since you have so many classes.
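
A rough sketch of that approach (untested; the file name, the max_df /
min_df values and the clean() helper are placeholders for your own
loading and preprocessing):

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import normalize
    from sklearn.linear_model import SGDClassifier

    def iter_documents(path, chunksize=100000):
        # Yield one preprocessed document at a time so the raw CSV is
        # never fully loaded into memory. clean() stands for your
        # lowercasing / stopword removal / lemmatization.
        for chunk in pd.read_csv(path, encoding='ISO-8859-1',
                                 chunksize=chunksize):
            for doc in chunk['ProductDescription'].dropna():
                yield clean(doc)

    # max_df drops the very frequent tokens; a generator can only be
    # consumed once, so use fit_transform (a single pass) here.
    countvec = CountVectorizer(max_df=0.5, min_df=5)
    X = countvec.fit_transform(iter_documents('trainset.csv'))

    X = normalize(X, norm='l2')  # L2-normalize before the linear model

    clf = SGDClassifier(loss='log')
    # or LogisticRegression(solver='saga', multi_class='multinomial'),
    # then clf.fit(X, y) with your label-encoded targets

Keeping the document-term matrix sparse throughout (avoiding anything 
that converts it to a dense array) also matters a lot for memory here.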

-- 
Roman

On 19/12/17 15:38, Ranjana Girish wrote:
> Hi all,
>
> I am doing text classification. I have around 10 million documents to be
> classified into around 7k categories.
>
> Below is the code I am using:
>
> # Importing the libraries
> import pandas as pd
> import nltk
> from nltk.corpus import stopwords
> from nltk.tokenize import word_tokenize
> from nltk.stem.wordnet import WordNetLemmatizer
> from nltk.stem.porter import PorterStemmer
> import re
> from sklearn.feature_extraction.text import CountVectorizer
> import random
> from sklearn.naive_bayes import MultinomialNB, GaussianNB
> from sklearn.metrics import accuracy_score
> from sklearn.metrics import precision_recall_curve
> from sklearn.metrics import average_precision_score
> from sklearn import feature_selection
> from scipy.sparse import csr_matrix
> from scipy import sparse
> import sys
> from sklearn import preprocessing
> import numpy as np
> import pickle
>
> sys.setrecursionlimit(200000000)
>
> random.seed(20000)
>
> trainset1 = pd.read_csv("trainsetgrt500sample10.csv", encoding="ISO-8859-1")
> trainset2 = pd.read_csv("trainsetlessequal500.csv", encoding="ISO-8859-1")
>
> dataset = pd.concat([trainset1, trainset2])
>
> dataset = dataset.dropna()
>
> dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[^a-zA-Z]', ' ')
> dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('[\d]', ' ')
> dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()
>
> del trainset1
> del trainset2
>
> stop = stopwords.words('english')
> lemmatizer = WordNetLemmatizer()
>
> dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
> dataset['ProductDescription'] = dataset['ProductDescription'].str.replace('\s\s+', ' ')
> dataset['ProductDescription'] = dataset['ProductDescription'].apply(word_tokenize)
> ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
> POS_LIST = [NOUN, VERB, ADJ, ADV]
> for tag in POS_LIST:
>     dataset['ProductDescription'] = dataset['ProductDescription'].apply(
>         lambda x: list(set([lemmatizer.lemmatize(item, tag) for item in x])))
> dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: " ".join(x))
>
> countvec = CountVectorizer(min_df=0.00008)
> documenttermmatrix = countvec.fit_transform(dataset['ProductDescription'])
> documenttermmatrix.shape
> column = countvec.get_feature_names()
> filename1 = 'columnnamessample10mastermerge.sav'
> pickle.dump(column, open(filename1, 'wb'))
>
> y_train = dataset['classpath']
> y_train = dataset['classpath'].tolist()
> labels_train = preprocessing.LabelEncoder()
> labels_train.fit(y_train)
> y1_train = labels_train.transform(y_train)
>
> del dataset
> del countvec
> del column
>
> clf = MultinomialNB()
> model = clf.fit(documenttermmatrix, y_train)
>
> filename2 = 'modelnaivebayessample10withfs.sav'
> pickle.dump(model, open(filename2, 'wb'))
>
> I am using a system with 128 GB RAM.
>
> As I was unable to train on all 10 million documents, I did stratified
> sampling, and the training set was reduced to 2.3 million.
>
> Still, I was unable to train on the 2.3 million documents.
>
> I got a memory error when I used random forest (n_estimators=30), Naive
> Bayes and SVM.
>
> I am stuck.
>
> Can anyone please tell me whether there is a memory leak in my code and
> how to use a system with 128 GB RAM effectively?
>
>
> Thanks
> Ranjana
>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>


