The goal of multi-class classification is to assign an instance to one of a set of classes. scikit-learn uses a strategy called one-vs.-all, or one-vs.-the-rest, to support multi-class classification: one binary classifier is trained for each class to distinguish it from all the others.
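To make the one-vs.-the-rest strategy concrete, here is a minimal sketch. It assumes a synthetic three-class data set generated with `make_blobs` and wraps `LogisticRegression` in scikit-learn's `OneVsRestClassifier`; neither choice comes from the movie-review example, they are only for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class data (an assumption, for illustration only).
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# One binary logistic regression is fitted per class: class k vs. the rest.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)

print(len(ovr.estimators_))  # number of underlying binary classifiers
print(ovr.predict(X[:5]))    # predicted class labels for five instances
```

With three classes the wrapper holds three binary classifiers, and the predicted class is the one whose classifier is most confident.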
We can download the data from https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data, then use the following steps:
Import the libraries:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    # train_test_split and GridSearchCV live in sklearn.model_selection;
    # the older cross_validation and grid_search modules have been removed.
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import Pipeline
Load the train file:

    df = pd.read_csv('train.tsv', header=0, delimiter='\t')
    df.head(10)
| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage … | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage … | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |
| 5 | 6 | 1 | of escapades demonstrating the adage that what… | 2 |
| 6 | 7 | 1 | of | 2 |
| 7 | 8 | 1 | escapades demonstrating the adage that what is… | 2 |
| 8 | 9 | 1 | escapades | 2 |
| 9 | 10 | 1 | demonstrating the adage that what is good for … | 2 |
Check the ratio of sentiments:

    # Number of phrases per sentiment label
    df['Sentiment'].value_counts()
    # Fraction of phrases per sentiment label
    df['Sentiment'].value_counts() / df['Sentiment'].count()
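The class ratios are worth checking because they set a floor for accuracy: a classifier that always predicts the most frequent label scores exactly that label's share. The sketch below shows the calculation on a small made-up label series; the values are hypothetical and are not the counts from `train.tsv`.

```python
import pandas as pd

# Toy labels standing in for df['Sentiment'] (hypothetical values).
sentiment = pd.Series([2, 2, 2, 3, 1, 2, 4, 0, 3, 2])

# normalize=True returns fractions, equivalent to counts / total.
ratios = sentiment.value_counts(normalize=True)

majority_class = ratios.idxmax()   # most frequent label
baseline_accuracy = ratios.max()   # accuracy of always predicting it
print(majority_class, baseline_accuracy)
```

Any accuracy reported by a trained model is best read against this baseline rather than against zero.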
Now run the classification:

    def Classify():
        # TF-IDF features feed a logistic regression classifier.
        pipeline = Pipeline([
            ('vect', TfidfVectorizer(stop_words='english')),
            ('clf', LogisticRegression()),
        ])
        # Hyperparameter grid; keys are '<step name>__<parameter>'.
        parameters = {
            'vect__max_df': (0.25, 0.5),
            'vect__ngram_range': ((1, 1), (1, 2)),
            'vect__use_idf': (True, False),
            'clf__C': (0.1, 1, 10),
        }
        df = pd.read_csv('train.tsv', header=0, delimiter='\t')
        # as_matrix() was removed from pandas; to_numpy() replaces it.
        X, y = df['Phrase'], df['Sentiment'].to_numpy()
        # Hold out half of the data; the grid search cross-validates on the rest.
        X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
        grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
        grid_search.fit(X_train, y_train)
        print('Best score: %0.3f' % grid_search.best_score_)
        print('Best parameters set:')
        best_parameters = grid_search.best_estimator_.get_params()
        for param_name in sorted(parameters.keys()):
            print('\t%s: %r' % (param_name, best_parameters[param_name]))

    Classify()
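The `Classify` function reports only the cross-validated score and never uses `X_test` or `y_test`. A possible next step is to score the tuned pipeline on the held-out split. The sketch below does this on a tiny made-up corpus so that it runs without `train.tsv`; the phrases, the reduced parameter grid, and `cv=3` are assumptions chosen to fit the small sample, not settings from the original example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Made-up phrases standing in for the 'Phrase' column (0 = negative,
# 1 = neutral, 2 = positive). They are NOT taken from train.tsv.
negative = ['terrible boring film', 'awful dull plot', 'boring awful acting',
            'dull terrible script', 'awful boring movie', 'terrible dull acting',
            'boring terrible plot', 'dull awful film']
neutral = ['film runs two hours', 'movie set in paris', 'plot follows a family',
           'script adapts a novel', 'film opens in winter',
           'movie stars unknown actors', 'story covers one summer',
           'cast includes newcomers']
positive = ['wonderful charming film', 'brilliant funny plot',
            'charming brilliant acting', 'funny wonderful script',
            'brilliant charming movie', 'wonderful funny acting',
            'charming wonderful plot', 'funny brilliant film']
X = negative + neutral + positive
y = [0] * 8 + [1] * 8 + [2] * 8

# Stratify so every class appears in both splits of this tiny data set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

pipeline = Pipeline([('vect', TfidfVectorizer(stop_words='english')),
                     ('clf', LogisticRegression())])
# A reduced grid; max_df is left out because it can prune every term
# from a corpus this small.
parameters = {'vect__ngram_range': ((1, 1), (1, 2)), 'clf__C': (0.1, 1, 10)}
grid_search = GridSearchCV(pipeline, parameters, cv=3, scoring='accuracy')
grid_search.fit(X_train, y_train)

# GridSearchCV refits the best pipeline on all of X_train by default,
# so predict() uses the tuned model.
y_pred = grid_search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print('Held-out accuracy: %0.3f' % test_accuracy)
print(confusion_matrix(y_test, y_pred, labels=[0, 1, 2]))
```

On the real data, the same two calls (`grid_search.predict` followed by `accuracy_score`) could be added at the end of `Classify`, using the `X_test` and `y_test` it already creates.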
Multi-label classification
In multi-label classification, each instance can be assigned a subset of the set of classes. Examples of multi-label classification include assigning tags to messages posted on a forum, and classifying the objects present in an image. There are two groups of approaches to multi-label classification.
Problem transformation methods are techniques that cast the original multi-label problem as a set of single-label classification problems.
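One common problem transformation, often called binary relevance, trains an independent binary classifier for each label. The sketch below illustrates it under stated assumptions: the forum messages and tags are invented, and `MultiLabelBinarizer` with `OneVsRestClassifier` is one way to express the idea in scikit-learn, not the only one.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical forum messages and their tag sets (illustration only).
messages = [
    'how do I merge two data frames',
    'grid search over pipeline parameters',
    'read a csv file into a data frame',
    'fit a classifier on a data frame column',
    'plot the decision boundary of a classifier',
    'cross validation for a classifier pipeline',
]
tags = [
    {'pandas'},
    {'scikit-learn'},
    {'pandas'},
    {'pandas', 'scikit-learn'},
    {'scikit-learn', 'plotting'},
    {'scikit-learn'},
]

# Transform the label sets into a 0/1 indicator matrix: one column per tag.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)

# Bag-of-words features for the messages.
X_counts = CountVectorizer().fit_transform(messages)

# Given an indicator matrix, OneVsRestClassifier fits one binary
# classifier per column, i.e. one single-label problem per tag.
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_counts, Y)

pred = clf.predict(X_counts)
print(mlb.classes_)
print(mlb.inverse_transform(pred))  # predicted tag tuples per message
```

Each column of `Y` is a separate single-label problem, which is exactly the sense in which the multi-label task has been transformed.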