Multi-class classification assigns an instance to exactly one of a set of classes. scikit-learn uses a strategy called one-vs.-all, or one-vs.-the-rest, to support multi-class classification: it trains one binary classifier per class and predicts the class whose classifier is most confident.
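To make the one-vs.-the-rest idea concrete, here is a minimal sketch in plain Python. It is not how scikit-learn implements the strategy internally. The tiny gradient-descent logistic regression, the helper names (`train_binary`, `fit_one_vs_rest`, `predict_one_vs_rest`) and the toy data are all assumptions made for illustration.

```python
import math

def train_binary(X, y, epochs=1000, lr=0.1):
    """Fit a tiny logistic regression (weights and bias) by stochastic gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi  # gradient of the log loss with respect to z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def score(model, x):
    """Raw decision score of one binary model for one instance."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def fit_one_vs_rest(X, y):
    """Train one binary model per class: that class (1) versus all others (0)."""
    return {c: train_binary(X, [1 if yi == c else 0 for yi in y])
            for c in sorted(set(y))}

def predict_one_vs_rest(models, x):
    """Predict the class whose binary model gives the highest score."""
    return max(models, key=lambda c: score(models[c], x))

# Hypothetical toy data: three well-separated clusters in 2-D.
X = [(0.0, 0.0), (0.2, 0.1), (3.0, 0.0), (3.1, 0.2), (0.0, 3.0), (0.1, 3.2)]
y = [0, 0, 1, 1, 2, 2]

models = fit_one_vs_rest(X, y)
predictions = [predict_one_vs_rest(models, x) for x in X]
print(predictions)
```

With three classes the sketch trains three binary models, so the training cost grows linearly with the number of classes.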
We can download the data from https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data, then use the following steps:
Import the libraries:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
Load the train file:
df = pd.read_csv('train.tsv', header=0, delimiter='\t')
df.head(10)
| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage … | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage … | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |
| 5 | 6 | 1 | of escapades demonstrating the adage that what… | 2 |
| 6 | 7 | 1 | of | 2 |
| 7 | 8 | 1 | escapades demonstrating the adage that what is… | 2 |
| 8 | 9 | 1 | escapades | 2 |
| 9 | 10 | 1 | demonstrating the adage that what is good for … | 2 |
Check the ratio of sentiments:
df['Sentiment'].value_counts()
df['Sentiment'].value_counts() / df['Sentiment'].count()
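One reason to look at the class ratio is that it gives a baseline: a classifier that always predicts the most common label reaches an accuracy equal to that label's share of the data. The sketch below shows the calculation with the standard library. The label list is made up for illustration and is not taken from the Kaggle file, and the function names are hypothetical.

```python
from collections import Counter

def class_ratios(labels):
    """Return the share of each label, like value_counts() divided by count()."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Hypothetical sentiment labels (0 to 4), not the real data.
toy_labels = [2, 2, 2, 2, 2, 1, 3, 3, 0, 4]

ratios = class_ratios(toy_labels)
baseline_label, baseline_accuracy = majority_baseline(toy_labels)
print(ratios)
print(baseline_label, baseline_accuracy)
```

Any accuracy reported by the grid search is only meaningful when compared against this kind of baseline.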
Now run the classification:
def Classify():
    # Chain TF-IDF feature extraction and logistic regression into one estimator.
    pipeline = Pipeline([('vect', TfidfVectorizer(stop_words='english')),
                         ('clf', LogisticRegression())])
    # Hyperparameter values to try; the prefix names the pipeline step.
    parameters = {'vect__max_df': (0.25, 0.5),
                  'vect__ngram_range': ((1, 1), (1, 2)),
                  'vect__use_idf': (True, False),
                  'clf__C': (0.1, 1, 10)}
    df = pd.read_csv('train.tsv', header=0, delimiter='\t')
    # as_matrix() was removed from pandas; to_numpy() is its replacement.
    X, y = df['Phrase'], df['Sentiment'].to_numpy()
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))

Classify()
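It helps to know how much work the grid search above does before running it. The sketch below expands the same parameter grid with `itertools.product` and counts the candidates. The helper `expand_grid` is a hypothetical name, and the fold count of 5 is an assumption: it matches the default `cv` in recent scikit-learn releases, while older releases used 3.

```python
from itertools import product

# Same grid as in Classify() above.
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}

def expand_grid(grid):
    """Return every combination of parameter values as a list of dicts."""
    names = sorted(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[name] for name in names))]

candidates = expand_grid(parameters)
n_folds = 5  # assumption: depends on the installed scikit-learn version
total_fits = len(candidates) * n_folds

print(len(candidates), 'candidates,', total_fits, 'cross-validation fits')
```

Each extra value added to one parameter multiplies the number of candidates, so the grid is worth keeping small on a dataset of this size.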
Multi-label classification
In multi-label classification, each instance can be assigned a subset of the set of classes. Examples include assigning tags to messages posted on a forum and classifying the objects present in an image. There are two groups of approaches for multi-label classification.
Problem transformation methods are techniques that cast the original multi-label problem as a set of single-label classification problems.
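One common problem transformation is often called binary relevance: each class gets its own yes/no target, so a multi-label problem with three classes becomes three independent binary problems. The sketch below shows only the label transformation, not the training step. The tag data and the function name are hypothetical, and this is offered as one example of the idea rather than the specific method this text goes on to use.

```python
def binary_relevance_targets(label_sets):
    """Turn a list of label sets into one 0/1 target list per class."""
    classes = sorted(set().union(*label_sets))
    return {c: [1 if c in labels else 0 for labels in label_sets]
            for c in classes}

# Hypothetical forum posts, each tagged with a subset of the classes.
posts_tags = [
    {'python', 'pandas'},
    {'python'},
    {'sklearn', 'python'},
    {'pandas'},
]

targets = binary_relevance_targets(posts_tags)
for tag, column in targets.items():
    print(tag, column)
```

Each target list can then be used to train an ordinary binary classifier, and the predicted label set for a new instance is the set of classes whose classifier answers yes.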