Multi-class classification

The goal of multi-class classification is to assign an instance to one class from a set of more than two classes. scikit-learn supports multi-class classification with a strategy called one-vs.-all, or one-vs.-the-rest, which trains one binary classifier per class and predicts the class whose classifier is most confident.
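
To make the strategy concrete, here is a minimal sketch (separate from the movie-review task below) that wraps a binary logistic regression in scikit-learn's OneVsRestClassifier on the bundled iris data; the dataset and parameter choices are illustrative only:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Three-class toy data; OneVsRestClassifier fits one binary classifier per class.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)
print(len(ovr.estimators_))       # 3: one underlying binary classifier per class
print(ovr.score(X_test, y_test))  # accuracy of the combined multi-class model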

We can download the data from https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data and then use the following steps:

Import the required modules:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

Load the training file and inspect the first rows:

df = pd.read_csv('train.tsv', header=0, delimiter='\t')
df.head(10)
   PhraseId  SentenceId                                           Phrase  Sentiment
0         1           1  A series of escapades demonstrating the adage …          1
1         2           1  A series of escapades demonstrating the adage …          2
2         3           1                                         A series          2
3         4           1                                                A          2
4         5           1                                           series          2
5         6           1  of escapades demonstrating the adage that what…          2
6         7           1                                               of          2
7         8           1  escapades demonstrating the adage that what is…          2
8         9           1                                        escapades          2
9        10           1  demonstrating the adage that what is good for …          2

Check the distribution of the sentiment classes:

df['Sentiment'].value_counts()
df['Sentiment'].value_counts()/df['Sentiment'].count()
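
pandas can also return the class proportions in a single call; value_counts(normalize=True) is equivalent to dividing the counts by the total as above:

df['Sentiment'].value_counts(normalize=True)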

Now run the classification, using a grid search to tune the TF-IDF vectorizer and the logistic regression classifier:

def Classify():
    # TF-IDF features feed a logistic regression classifier.
    pipeline = Pipeline([('vect', TfidfVectorizer(stop_words='english')),
                         ('clf', LogisticRegression())])
    # Grid of hyperparameters to explore for the vectorizer and the classifier.
    parameters = {'vect__max_df': (0.25, 0.5),
                  'vect__ngram_range': ((1, 1), (1, 2)),
                  'vect__use_idf': (True, False),
                  'clf__C': (0.1, 1, 10)}
    df = pd.read_csv('train.tsv', header=0, delimiter='\t')
    X, y = df['Phrase'], df['Sentiment'].values
    # Hold out half of the data for testing.
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
    # 3-fold cross-validation over every parameter combination.
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, cv=3,
                               verbose=1, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))

Classify()
Fitting 3 folds for each of 24 candidates, totalling 72 fits
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:  1.1min
[Parallel(n_jobs=3)]: Done  72 out of  72 | elapsed:  2.2min finished
Best score: 0.622
Best parameters set:
	clf__C: 10
	vect__max_df: 0.25
	vect__ngram_range: (1, 2)
	vect__use_idf: False
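
To see how the tuned model generalizes, we can score the refit best estimator on the held-out half of the data. The lines below are a sketch intended to run inside Classify(), immediately after grid_search.fit(), so that X_test and y_test are still in scope:

from sklearn.metrics import accuracy_score, classification_report

# GridSearchCV refits the best parameter combination on all of the
# training split, so grid_search.predict uses that best estimator.
predictions = grid_search.predict(X_test)
print('Test accuracy: %0.3f' % accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))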

Multi-label classification

In multi-label classification, each instance can be assigned a subset of the set of classes rather than exactly one class. Examples of multi-label classification include assigning tags to messages posted on a forum and classifying the objects present in an image. There are two groups of approaches to multi-label classification: problem transformation methods and algorithm adaptation methods.

Problem transformation methods are techniques that cast the original multi-label problem as a set of single-label classification problems.
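
The simplest such transformation, often called binary relevance, trains one independent binary classifier per label. The following sketch uses a few made-up forum posts and tags (the data and tag names are hypothetical, purely illustrative): MultiLabelBinarizer turns each instance's tag set into a row of binary indicators, and OneVsRestClassifier then fits one LogisticRegression per indicator column:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical forum posts and their tag sets.
posts = ['how do I tune logistic regression?',
         'pandas read_csv is slow on large files',
         'grid search for regression hyperparameters',
         'loading a tsv file with pandas']
tags = [['sklearn', 'regression'], ['pandas'],
        ['sklearn', 'regression'], ['pandas']]

Y = MultiLabelBinarizer().fit_transform(tags)  # one indicator column per tag
X = TfidfVectorizer().fit_transform(posts)
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)                                  # one binary problem per tag column
print(clf.predict(X))                          # one binary row per post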

 

 
