Task:

Classify Movie plots by genre using word embeddings techniques

Github Repository:

https://goo.gl/ppHX65

TODO:

  • [_] Provide links to relevant papers
  • [_] Further explanations where necessary
  • [_] Include equations where necessary
  • [_] Add 3D embedding plots

Requirements:

Software:


Computing Resources:


  • Operating System: Preferably Linux or macOS (Windows may break, but you can try it)
  • RAM: 4GB
  • Disk Space: 8GB (mostly to store word embeddings)
In [31]:
%load_ext autoreload
%autoreload 2
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
In [32]:
import logging
logging.root.handlers = []  # Jupyter messes up logging so needs a reset
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
from smart_open import smart_open
import pandas as pd
import numpy as np
from numpy import random
import gensim
import nltk
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases
from sklearn import linear_model
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.neighbors import KNeighborsClassifier
from nltk.corpus import stopwords
from helpers import *

%matplotlib inline

Data Exploration


In [115]:
# read the movie plots
df = pd.read_csv('data/movie_reviews/tagged_plots_movielens.csv')
df = df.dropna()

# count the number of words
df['plot'].apply(lambda x: len(x.split(' '))).sum()
Out[115]:
171156

Exercise: Plot tokens by frequency

Note: At least ~500K words are usually recommended to train a word2vec model, so performance may be unusual here.

We only have a fraction of that (~171K words), so some models may not perform very well.

In [34]:
df[:10]
Out[34]:
Unnamed: 0 movieId plot tag
0 0 1 A little boy named Andy loves to be in his roo... animation
1 1 2 When two kids find and play a magical board ga... fantasy
2 2 3 Things don't seem to change much in Wabasha Co... comedy
3 3 6 Hunters and their prey--Neil and his professio... action
4 4 7 An ugly duckling having undergone a remarkable... romance
5 5 9 Some terrorists kidnap the Vice President of t... action
6 6 10 James Bond teams up with the lone survivor of ... action
7 7 15 Morgan Adams and her slave, William Shaw, are ... action
8 8 17 When Mr. Dashwood dies, he must leave the bulk... romance
9 9 18 This movie features the collaborative director... comedy
In [35]:
# classes
my_tags = ['sci-fi' , 'action', 'comedy', 'fantasy', 'animation', 'romance']
df.tag.value_counts().plot(kind="bar", rot=0)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9eadcc4e80>

Note: This is an unbalanced dataset; comedy has significantly more examples than the rest of the classes (roughly 40% of all plots).
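A quick way to verify the class balance (a small addition, not one of the original cells):

# fraction of examples per genre; comedy should come out around 0.40
df.tag.value_counts(normalize=True)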

TODO: Do some more text mining exploration and statistical inferences here

In [36]:
# split the data (90/10)
train_data, test_data = train_test_split(df, test_size=0.1, random_state=42)
In [37]:
len(test_data), len(train_data)
Out[37]:
(243, 2184)
In [38]:
# distribution of the test data
test_data.tag.value_counts().plot(kind="bar", rot=0)
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9ead443828>

Train Naive Models (Baselines)


  • Bag of words
  • N-grams
  • TF-IDF

1. Bag of Words

Using scikit-learn's CountVectorizer

In [39]:
# training
count_vectorizer = CountVectorizer(
    analyzer="word", tokenizer=nltk.word_tokenize,
    preprocessor=None, stop_words='english', max_features=3000)

# fit the vectorizer and build the document-term feature matrix
train_data_features = count_vectorizer.fit_transform(train_data['plot'])

Exercise: The authors only use logistic regression, but we can also train naive Bayes, random forest, gradient boosting, and deep neural network classifiers.

In [40]:
logreg_model = linear_model.LogisticRegression(n_jobs=1, C=1e5)
logreg_model = logreg_model.fit(train_data_features, train_data['tag'])
In [41]:
# observe some features
count_vectorizer.get_feature_names()[2899:2910]
Out[41]:
['warrior',
 'warriors',
 'wars',
 'washington',
 'watch',
 'watches',
 'watching',
 'water',
 'waters',
 'way',
 'wayne']
In [42]:
# each plot is a sparse vector over the 3000 features
train_data_features[0]
Out[42]:
<1x3000 sparse matrix of type '<class 'numpy.int64'>'
	with 43 stored elements in Compressed Sparse Row format>
In [43]:
word_embeddings.predict(count_vectorizer, logreg_model, test_data, my_tags)
accuracy 0.423868312757
confusion matrix
 [[21  2 10  1  4  4]
 [ 4 10  8  0  3  6]
 [10 11 45  3 16  1]
 [ 1  5  3  4  3  0]
 [ 1  3 15  2 11  3]
 [ 8  5  6  2  0 12]]
(row=expected, col=predicted)

Note 1: The classifier performs slightly better than the majority-class baseline: predicting comedy for everything would give ~40% accuracy, while the bag-of-words model achieves ~42%.
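The helper word_embeddings.predict comes from the course's helpers module and is not shown in this notebook; it presumably does something along these lines (the function name and details here are assumptions):

from sklearn.metrics import accuracy_score, confusion_matrix

def predict_and_evaluate(vectorizer, classifier, test_df, tags):
    # vectorize the test plots, predict their genres, and report
    # accuracy plus a confusion matrix (rows=expected, cols=predicted)
    features = vectorizer.transform(test_df['plot'])
    predictions = classifier.predict(features)
    print("accuracy", accuracy_score(test_df['tag'], predictions))
    print("confusion matrix\n", confusion_matrix(test_df['tag'], predictions, labels=tags))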

In [44]:
# words for the comedy genre
comedy_tag_id = word_embeddings.get_tag_index(my_tags, "comedy")
comedy_words = word_embeddings.most_influential_words(logreg_model, count_vectorizer, \
                                                      comedy_tag_id, 3000)
comedy_words = pd.DataFrame(comedy_words)
In [45]:
comedy_words[:10]
Out[45]:
0 1
0 mistaken 11.810842
1 jewish 10.668508
2 suspects 10.104922
3 comedy 9.949616
4 dealer 9.775865
5 comedian 9.634339
6 operation 9.512068
7 stuart 9.323364
8 actress 9.232942
9 dimension 9.097288
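The helper word_embeddings.most_influential_words is also part of the course's helpers module; presumably it ranks the vocabulary by the logistic-regression coefficient for the chosen class, roughly like this sketch (the function name and details are assumptions):

import numpy as np

def top_class_words(model, vectorizer, class_index, n=10):
    # words with the largest logistic-regression coefficients for one class;
    # class_index must follow the ordering of model.classes_ (alphabetical by
    # default), which may differ from the order of my_tags
    feature_names = np.array(vectorizer.get_feature_names())
    coefs = model.coef_[class_index]
    top = np.argsort(coefs)[::-1][:n]
    return list(zip(feature_names[top], coefs[top]))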
In [46]:
# words for the animation genre
animation_tag_id = word_embeddings.get_tag_index(my_tags, "animation")
animation_words = word_embeddings.most_influential_words(logreg_model, count_vectorizer,\
                                                         animation_tag_id, 3000)
animation_words = pd.DataFrame(animation_words)
In [47]:
# the most influential words for the animation category
animation_words[:10]
Out[47]:
0 1
0 nazi 8.946895
1 sisters 8.870921
2 troubled 8.720853
3 affair 8.603380
4 revealed 8.499657
5 relationships 8.112722
6 decide 8.016373
7 spending 7.919799
8 wolf 7.470311
9 photographer 7.268472
In [48]:
# words for the fantasy genre
fantasy_tag_id = word_embeddings.get_tag_index(my_tags, "fantasy")
fantasy_words = word_embeddings.most_influential_words(logreg_model, count_vectorizer,\
                                                       fantasy_tag_id, 3000)
fantasy_words = pd.DataFrame(fantasy_words)

# the most influential words for the fantasy category
fantasy_words[:10]
Out[48]:
0 1
0 national 15.013773
1 chosen 7.682396
2 princess 6.555631
3 prove 6.072531
4 beast 5.901359
5 fan 5.886863
6 moving 5.641193
7 fantasies 5.636695
8 director 5.516928
9 angel 5.485528
In [49]:
# check which words overlap between the comedy and animation genres
word_embeddings.check_word_overlap(comedy_words[:500][0], animation_words[:500][0])
Out[49]:
[[['jay', 12],
  ['insane', 114],
  ['lawyer', 116],
  ['environment', 162],
  ['marry', 166],
  ['did', 170],
  ['sons', 222],
  ['hunting', 302],
  ['boys', 318],
  ['asked', 323],
  ['jobs', 336],
  ['opportunity', 341],
  ['heading', 408],
  ['threatened', 414],
  ['american', 427],
  ['brooklyn', 428],
  ['turned', 429],
  ['late', 445],
  ['chinese', 450],
  ['president', 453],
  ['ready', 456],
  ['look', 469],
  ['england', 488]],
 23]

2. Character N-grams

A character n-gram is a contiguous chunk of text of length n. It is a poor man's tokenizer, but it sometimes works well. The choice of n depends on the language and the corpus. Here we use n-grams of 2 to 5 characters and keep only the 3,000 most frequent ones; 3,000 features makes for a fair comparison, since the previous bag-of-words model had the same size.
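As a tiny illustration of what character n-grams look like (a toy example, not part of the original notebook):

from sklearn.feature_extraction.text import CountVectorizer

# character n-grams of lengths 2-5 extracted from a single word
toy_vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 5))
toy_vectorizer.fit(["wizard"])
print(toy_vectorizer.get_feature_names())
# ['ar', 'ard', 'iz', 'iza', 'izar', 'izard', 'rd', 'wi', 'wiz', ...]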

In [50]:
n_gram_vectorizer = CountVectorizer(
    analyzer="char",
    ngram_range=([2,5]),
    tokenizer=None,
    preprocessor=None,
    max_features=3000)

charn_model = linear_model.LogisticRegression(n_jobs=1, C=1e5)

train_data_features = n_gram_vectorizer.fit_transform(train_data['plot'])

charn_model = charn_model.fit(train_data_features, train_data['tag'])
In [51]:
# some features
n_gram_vectorizer.get_feature_names()[100:120]
Out[51]:
[' chil',
 ' ci',
 ' cit',
 ' cl',
 ' co',
 ' col',
 ' com',
 ' come',
 ' comp',
 ' con',
 ' cont',
 ' cou',
 ' cr',
 ' cre',
 ' cu',
 ' d',
 ' da',
 ' dan',
 ' day',
 ' de']
In [52]:
word_embeddings.predict(n_gram_vectorizer, charn_model, test_data, my_tags)
accuracy 0.399176954733
confusion matrix
 [[17  3 12  1  5  4]
 [ 4  9  9  4  1  4]
 [12 10 41  6 15  2]
 [ 2  1  2  5  1  5]
 [ 5  1 15  1 11  2]
 [ 8  1  7  3  0 14]]
(row=expected, col=predicted)

Note: The model performs poorly (~40% accuracy), roughly on par with (slightly below) the majority-class baseline. Perhaps we can improve it by removing stopwords.

3. TF-IDF

Term Frequency - Inverse Document Frequency (TF-IDF) is useful for ranking how important a word is to a document within a corpus.
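As a reminder, one common formulation (close to scikit-learn's default, which additionally L2-normalizes each document vector) is:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$$

where $\mathrm{tf}(t, d)$ is the count of term $t$ in document $d$, $n$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$.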

In [53]:
tf_vect = TfidfVectorizer(
    min_df=2, tokenizer=nltk.word_tokenize,
    preprocessor=None, stop_words='english')
train_data_features = tf_vect.fit_transform(train_data['plot'])

tfidf_model = linear_model.LogisticRegression(n_jobs=1, C=1e5)
tfidf_model = tfidf_model.fit(train_data_features, train_data['tag'])
In [54]:
tf_vect.get_feature_names()[1000:1010]
Out[54]:
['caesar',
 'cage',
 'caine',
 'cal',
 'calhoun',
 'california',
 'californians',
 'called',
 'calling',
 'callous']
In [55]:
word_embeddings.predict(tf_vect, tfidf_model, test_data, my_tags)
accuracy 0.465020576132
confusion matrix
 [[23  2 12  2  2  1]
 [ 3 10 10  1  3  4]
 [ 9  6 49  0 21  1]
 [ 3  4  1  4  2  2]
 [ 1  2 20  0 11  1]
 [ 9  2  5  1  0 16]]
(row=expected, col=predicted)

Let us do some analysis on the model now

In [56]:
# words for the comedy genre
comedy_tag_id = word_embeddings.get_tag_index(my_tags, "comedy")
comedy_words = word_embeddings.most_influential_words(tfidf_model, tf_vect, comedy_tag_id, 3000)
comedy_words[0:10]
Out[56]:
[['comedy', 24.5992977289301],
 ['jewish', 21.111621307928754],
 ['mistaken', 20.906784380688975],
 ['kindergarten', 20.550050838076675],
 ['duo', 20.039310378863604],
 ['dimension', 19.730576933476108],
 ['beloved', 19.655031499523833],
 ['dealer', 19.190813922478519],
 ['street', 18.766362320888405],
 ['tramp', 18.567543420444927]]
In [57]:
# words for the fantasy genre
fantasy_tag_id = word_embeddings.get_tag_index(my_tags, "fantasy")
fantasy_words = word_embeddings.most_influential_words(tfidf_model, tf_vect, fantasy_tag_id, 3000)
fantasy_words[0:10]
Out[57]:
[['edward', 17.405228729634185],
 ['magical', 16.525441688448286],
 ['kingdom', 15.293991691029369],
 ['adventures', 14.936271113826516],
 ['percy', 14.809172140901389],
 ['moving', 14.203174472333552],
 ['land', 14.153599202700454],
 ['demon', 14.140054691071217],
 ['fan', 14.042094086775878],
 ['king', 13.98948584958028]]

Now let us observe the most influential words for a given movie plot

In [58]:
word_embeddings.most_influential_words_doc(train_data['plot'][0], fantasy_words)
Out[58]:
['boy',
 'loves',
 'playing',
 'come',
 'worry',
 'family',
 'does',
 'mother',
 'action',
 'quickly',
 'tries',
 'rid',
 'ruthless',
 'toy']

Note: As you can see from the list above, some words, such as does and tries, are not particularly relevant to the fantasy category. There is therefore still room for improvement with this model.

Exercise: Head to scikit-learn and try different parameters for the TF-IDF vectorizer, such as modifying ngram_range. It might improve the performance of the classifier.

Averaging Word Vectors (word2vec)

SOURCE: https://code.google.com/archive/p/word2vec/

First, we are going to use a pretrained word2vec model open-sourced by Google.

In [59]:
wv = gensim.models.KeyedVectors.load_word2vec_format(
    "data/google_word2vec/GoogleNews-vectors-negative300.bin.gz",
    binary=True)
wv.init_sims(replace=True)
2017-06-28 22:11:11,385 : INFO : loading projection weights from data/google_word2vec/GoogleNews-vectors-negative300.bin.gz
2017-06-28 22:13:00,195 : INFO : loaded (3000000, 300) matrix from data/google_word2vec/GoogleNews-vectors-negative300.bin.gz
2017-06-28 22:13:00,195 : INFO : precomputing L2-norms of word weight vectors

Exercise: Create a function for exploring the word2vec file and find ways to visualize the results

Preprocessing: Here we tokenize both the training and testing datasets before creating the vectors with word2vec. Think of this as a filtering step. In this case words with len(word)<2 are removed.
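The helper word2vec_helpers.w2v_tokenize_text (used in the commented-out cells below) is not shown; based on the description above, it presumably looks roughly like this sketch (the function name is ours):

import nltk

def w2v_tokenize(text):
    # NLTK sentence- and word-tokenize a plot, dropping tokens shorter
    # than two characters (the filtering step described above)
    return [word
            for sent in nltk.sent_tokenize(text)
            for word in nltk.word_tokenize(sent)
            if len(word) >= 2]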

In [60]:
# test_tokenized = test_data.apply(lambda r: word2vec_helpers.w2v_tokenize_text(r['plot']), axis=1).values
In [61]:
# train_tokenized = train_data.apply(lambda r: word2vec_helpers.w2v_tokenize_text(r['plot']), axis=1).values
In [62]:
# store the data as pickle for retrieving later
#pickle_helpers.convert_to_pickle(train_tokenized, "data/movie_reviews/train.p")

# retrieve pickled version of test and train tokenized
test_tokenized = pickle_helpers.load_from_pickle("data/movie_reviews/test.p")
train_tokenized = pickle_helpers.load_from_pickle("data/movie_reviews/train.p")
In [63]:
train_tokenized[0:10]
Out[63]:
array([ list(['Turkish', 'and', 'his', 'close', 'friend/accomplice', 'Tommy', 'get', 'pulled', 'into', 'the', 'world', 'of', 'match', 'fixing', 'by', 'the', 'notorious', 'Brick', 'Top', 'Things', 'get', 'complicated', 'when', 'the', 'boxer', 'they', 'had', 'lined', 'up', 'gets', 'badly', 'beaten', 'by', 'Pitt', "'pikey", 'slang', 'for', 'an', 'Irish', 'Gypsy', 'who', 'comes', 'into', 'the', 'equation', 'after', 'Turkish', 'an', 'unlicensed', 'boxing', 'promoter', 'wants', 'to', 'buy', 'caravan', 'off', 'the', 'Irish', 'Gypsies', 'They', 'then', 'try', 'to', 'convince', 'Pitt', 'not', 'only', 'to', 'fight', 'for', 'them', 'but', 'to', 'lose', 'for', 'them', 'too', 'Whilst', 'all', 'this', 'is', 'going', 'on', 'huge', 'diamond', 'heist', 'takes', 'place', 'and', 'fistful', 'of', 'motley', 'characters', 'enter', 'the', 'story', 'including', "'Cousin", 'Avi', "'Boris", 'The', 'Blade', "'Franky", 'Four', 'Fingers', 'and', "'Bullet", 'Tooth', 'Tony', 'Things', 'go', 'from', 'bad', 'to', 'worse', 'as', 'it', 'all', 'becomes', 'about', 'the', 'money', 'the', 'guns', 'and', 'the', 'damned', 'dog']),
       list(['In', 'the', 'early', '1960', "'s", 'sixteen', 'year', 'old', 'Jenny', 'Mellor', 'lives', 'with', 'her', 'parents', 'in', 'the', 'London', 'suburb', 'of', 'Twickenham', 'On', 'her', 'father', "'s", 'wishes', 'everything', 'that', 'Jenny', 'does', 'is', 'in', 'the', 'sole', 'pursuit', 'of', 'being', 'accepted', 'into', 'Oxford', 'as', 'he', 'wants', 'her', 'to', 'have', 'better', 'life', 'than', 'he', 'Jenny', 'is', 'bright', 'pretty', 'hard', 'working', 'but', 'also', 'naturally', 'gifted', 'The', 'only', 'problems', 'her', 'father', 'may', 'perceive', 'in', 'her', 'life', 'is', 'her', 'issue', 'with', 'learning', 'Latin', 'and', 'her', 'dating', 'boy', 'named', 'Graham', 'who', 'is', 'nice', 'but', 'socially', 'awkward', 'Jenny', "'s", 'life', 'changes', 'after', 'she', 'meets', 'David', 'Goldman', 'man', 'over', 'twice', 'her', 'age', 'David', 'goes', 'out', 'of', 'his', 'way', 'to', 'show', 'Jenny', 'and', 'her', 'family', 'that', 'his', 'interest', 'in', 'her', 'is', 'not', 'improper', 'and', 'that', 'he', 'wants', 'solely', 'to', 'expose', 'her', 'to', 'cultural', 'activities', 'which', 'she', 'enjoys', 'Jenny', 'quickly', 'gets', 'accustomed', 'to', 'the', 'life', 'to', 'which', 'David', 'and', 'his', 'constant', 'companions', 'Danny', 'and', 'Helen', 'have', 'shown', 'her', 'and', 'Jenny', 'and', 'David', "'s", 'relationship', 'does', 'move', 'into', 'becoming', 'romantic', 'one', 'However', 'Jenny', 'slowly', 'learns', 'more', 'about', 'David', 'and', 'by', 'association', 'Danny', 'and', 'Helen', 'and', 'specifically', 'how', 'they', 'make', 'their', 'money', 'Jenny', 'has', 'to', 'decide', 'if', 'what', 'she', 'learns', 'about', 'them', 'and', 'leading', 'such', 'life', 'is', 'worth', 'forgoing', 'her', 'plans', 'of', 'higher', 'eduction', 'at', 'Oxford']),
       list(['Ollie', 'Trinkie', 'is', 'publicist', 'who', 'has', 'great', 'girlfriend', 'Gertrude', 'whom', 'he', 'marries', 'and', 'they', 'are', 'expecting', 'baby', 'but', 'while', 'he', 'is', 'looking', 'forward', 'to', 'being', 'father', 'he', 'does', "n't", 'lighten', 'his', 'workload', 'Gertrude', 'gives', 'birth', 'but', 'dies', 'in', 'the', 'process', 'Ollie', 'does', "n't", 'live', 'up', 'to', 'his', 'responsibilities', 'as', 'father', 'Eventually', 'the', 'strain', 'and', 'pressure', 'of', 'losing', 'his', 'wife', 'and', 'being', 'father', 'gets', 'to', 'him', 'and', 'he', 'has', 'breakdown', 'which', 'leads', 'to', 'his', 'termination', 'So', 'with', 'nothing', 'much', 'to', 'do', 'he', 'tries', 'to', 'be', 'good', 'father', 'to', 'his', 'daughter', 'Gertie', 'He', 'also', 'meets', 'young', 'woman', 'name', 'Maya', 'who', 'likes', 'him', 'but', 'he', 'is', 'still', 'not', 'over', 'his', 'wife']),
       list(['The', 'film', 'shows', 'the', 'day', 'when', 'Rui', 'and', 'Vani', 'first', 'met', 'It', 'was', 'at', 'their', 'wedding', 'with', 'other', 'partners', 'Vani', 'was', 'going', 'to', 'marry', 'Srgio', 'and', 'Rui', 'would', 'marry', 'Marta', 'in', 'the', 'same', 'church', 'following', 'Vani', "'s", 'marriage', 'While', 'waiting', 'for', 'the', 'ceremony', 'they', 'begin', 'to', 'talk', 'Complications', 'ensue']),
       list(['Underneath', 'the', 'sands', 'of', 'Egypt', 'Anubis', 'an', 'ancient', 'evil', 'spirit', 'has', 'awakened', 'It', "'s", 'up', 'to', 'Yugi', 'who', 'defeated', 'Anubis', 'centuries', 'ago', 'to', 'use', 'his', 'skill', 'and', 'determination', 'to', 'rid', 'the', 'world', 'of', 'evil', 'once', 'again']),
       list(['After', 'successful', 'mission', 'against', 'drug', 'lords', 'the', 'efficient', 'Captain', 'Damien', 'Tomaso', 'is', 'framed', 'at', 'home', 'with', 'three', 'kilograms', 'of', 'heroin', 'planted', 'by', 'the', 'police', 'in', 'his', 'kitchen', 'and', 'he', 'is', 'arrested', 'Meanwhile', 'group', 'of', 'teenagers', 'film', 'the', 'action', 'of', 'dirty', 'agents', 'led', 'by', 'Roland', 'from', 'the', 'security', 'agency', 'executing', 'policemen', 'in', 'their', 'car', 'and', 'then', 'leaving', 'the', 'car', 'with', 'the', 'corpses', 'in', 'the', '13th', 'District', 'to', 'blame', 'the', 'gangs', 'and', 'begin', 'civil', 'war', 'Behind', 'these', 'events', 'the', 'corrupt', 'chief', 'of', 'the', 'security', 'agency', 'Walter', 'Gassman', 'had', 'received', 'huge', 'amount', 'as', 'kickback', 'from', 'the', 'constructor', 'Harriburton', 'that', 'has', 'interest', 'to', 'construct', 'buildings', 'in', 'the', 'poor', 'area', 'and', 'uses', 'the', 'situation', 'to', 'force', 'the', 'President', 'of', 'France', 'to', 'authorize', 'to', 'nuke', 'five', 'towers', 'in', 'the', 'district', 'The', 'teenager', 'with', 'the', 'film', 'is', 'hunted', 'by', 'the', 'police', 'but', 'he', 'delivers', 'the', 'memory', 'card', 'to', 'Leito', 'meanwhile', 'Damien', 'calls', 'him', 'from', 'the', 'precinct', 'asking', 'for', 'help', 'The', 'friends', 'team-up', 'with', 'five', 'dangerous', 'bosses', 'to', 'gather', 'evidences', 'to', 'prove', 'to', 'the', 'president', 'that', 'Gassman', 'has', 'provoked', 'the', 'conflict', 'in', 'their', 'district']),
       list(['Two', 'New', 'Yorkers', 'are', 'accused', 'of', 'murder', 'in', 'rural', 'Alabama', 'while', 'on', 'their', 'way', 'back', 'to', 'college', 'and', 'one', 'of', 'their', 'cousins', '--', 'an', 'inexperienced', 'loudmouth', 'lawyer', 'not', 'accustomed', 'to', 'Southern', 'rules', 'and', 'manners', '--', 'comes', 'in', 'to', 'defend', 'them']),
       list(['man', 'finds', 'out', 'that', 'what', 'you', 'do', "n't", 'say', 'to', 'friend', 'is', 'just', 'as', 'important', 'as', 'what', 'you', 'do', 'is', 'this', 'story', 'of', 'how', 'far', 'you', 'can', 'bend', 'brotherly', 'bond', 'before', 'it', 'snaps', 'Since', 'college', 'confirmed', 'bachelor', 'Ronny', 'Vaughn', 'and', 'happily', 'married', 'Nick', 'James', 'have', 'been', 'through', 'thick', 'and', 'thin', 'Now', 'partners', 'in', 'an', 'auto', 'design', 'firm', 'the', 'two', 'pals', 'are', 'vying', 'to', 'land', 'dream', 'project', 'that', 'would', 'launch', 'their', 'company', 'Ronny', "'s", 'girlfriend', 'Beth', 'Connelly', 'and', 'Nick', "'s", 'wife', 'Geneva', 'Ryder', 'are', 'by', 'their', 'sides', 'But', 'Ronny', "'s", 'world', 'is', 'turned', 'upside', 'down', 'when', 'he', 'inadvertently', 'sees', 'Geneva', 'out', 'with', 'another', 'man', 'and', 'makes', 'it', 'his', 'mission', 'to', 'get', 'answers', 'As', 'the', 'amateur', 'investigation', 'dissolves', 'into', 'mayhem', 'he', 'learns', 'that', 'Nick', 'has', 'few', 'secrets', 'of', 'his', 'own', 'Now', 'with', 'the', 'clock', 'ticking', 'and', 'pressure', 'mounting', 'on', 'the', 'biggest', 'presentation', 'of', 'their', 'careers', 'Ronny', 'must', 'decide', 'what', 'will', 'happen', 'if', 'he', 'reveals', 'the', 'truth', 'to', 'his', 'best', 'friend']),
       list(['Ben', 'Sobol', 'Psychiatrist', 'has', 'few', 'problems', 'His', 'son', 'spies', 'on', 'his', 'patients', 'when', 'they', 'open', 'up', 'their', 'heart', 'his', 'parents', 'do', "n't", 'want', 'to', 'attend', 'his', 'upcoming', 'wedding', 'and', 'his', 'patients', 'problems', 'do', "n't", 'challenge', 'him', 'at', 'all', 'Paul', 'Vitti', 'Godfather', 'has', 'few', 'problems', 'as', 'well', 'Sudden', 'anxiety', 'attacks', 'in', 'public', 'certain', 'disability', 'to', 'kill', 'people', 'and', 'his', 'best', 'part', 'ceasing', 'service', 'when', 'needed', 'One', 'day', 'Ben', 'unfortunately', 'crashes', 'into', 'one', 'of', 'Vitti', "'s", 'cars', 'The', 'exchange', 'of', 'Ben', "'s", 'business', 'card', 'is', 'followed', 'by', 'business', 'visit', 'of', 'Don', 'Paul', 'Vitti', 'himself', 'who', 'wants', 'to', 'be', 'free', 'of', 'inner', 'conflict', 'within', 'two', 'weeks', 'before', 'all', 'the', 'Mafia', 'Dons', 'meet', 'Now', 'Ben', 'Sobol', 'feels', 'somewhat', 'challenged', 'as', 'his', 'wedding', 'is', 'soon', 'his', 'only', 'patient', 'keeps', 'him', #39;busy', 'by', 'regarding', 'Ben', "'s", 'duty', 'as', '24', 'hour', 'standby', 'and', 'the', 'feds', 'keep', 'forcing', 'him', 'to', 'spy', 'on', 'Paul', 'Vitti', 'And', 'how', 'do', 'you', 'treat', 'patient', 'who', 'usually', 'solves', 'problems', 'with', 'gun']),
       list(['robot', 'malfunction', 'creates', 'havoc', 'and', 'terror', 'for', 'unsuspecting', 'vacationers', 'at', 'futuristic', 'adult-themed', 'amusement', 'park'])], dtype=object)

Next, we have to convert the words into a distributed representation; in other words, each token must be converted into a vector. Since we are using word2vec, this conversion is easy.

Now that we have word vectors, we can obtain a vector for an entire document by averaging its word vectors. (Warning: a very naive approach!)

[figure: a document vector obtained by averaging its word vectors]
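A minimal sketch of such document averaging with gensim KeyedVectors (the helper word2vec_helpers.word_averaging_list used below presumably does something similar, plus stacking the results into a matrix):

import numpy as np

def average_word_vectors(wv, tokens):
    # average the word2vec vectors of all in-vocabulary tokens;
    # fall back to a zero vector if none of the tokens are known
    vectors = [wv[word] for word in tokens if word in wv.vocab]
    if not vectors:
        return np.zeros(wv.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)

# e.g. X_train_word_average = np.vstack([average_word_vectors(wv, doc) for doc in train_tokenized])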

In [64]:
X_train_word_average = word2vec_helpers.word_averaging_list(wv,train_tokenized)
In [65]:
X_test_word_average = word2vec_helpers.word_averaging_list(wv,test_tokenized)
In [66]:
X_train_word_average[0]
Out[66]:
array([ 0.05545449,  0.05910575,  0.04342501,  0.07156485, -0.06337085,
       -0.04693267,  0.02318664, -0.0889559 ,  0.06699731,  0.08280275,
       -0.02325549, -0.11232544, -0.01920838,  0.04011356, -0.11890385,
        0.03739267,  0.0560286 ,  0.08226582,  0.02232919, -0.04910114,
       -0.00497308,  0.04151743,  0.02096887, -0.00108779,  0.08315163,
        0.00165053, -0.10190015,  0.0887218 ,  0.04294581,  0.02587495,
       -0.01478133,  0.02559877, -0.03495957, -0.00126466,  0.03815735,
        0.00277985, -0.00618378, -0.0127198 ,  0.04081891,  0.07088073,
        0.09200442, -0.08722563,  0.11710441, -0.04684571, -0.02880416,
       -0.00468458, -0.05068978, -0.00313703,  0.06019693,  0.06222438,
       -0.01847018,  0.11511943, -0.00197043, -0.07153045, -0.03648838,
        0.04271036, -0.02165449, -0.06775223,  0.01724528, -0.07145506,
       -0.04500109,  0.0923021 , -0.07771284, -0.11304772, -0.03642419,
       -0.02397631,  0.00947447,  0.06566761, -0.04847883,  0.07447033,
        0.04596359,  0.00533706,  0.07440901,  0.01862758, -0.10930904,
       -0.05191687,  0.09581147,  0.13321283,  0.02818674,  0.09916298,
        0.00275499, -0.11262483,  0.03205725, -0.00655283, -0.03152488,
       -0.09904213, -0.10709953,  0.10126126, -0.01183448,  0.00032518,
        0.0346867 ,  0.05184319, -0.05309206, -0.13063876, -0.04088032,
       -0.05249476,  0.04398097,  0.01841562, -0.02550774,  0.01531992,
       -0.05255707, -0.05905483,  0.0204562 , -0.02640847, -0.05222479,
       -0.03798182, -0.01838683, -0.08310767,  0.02429862, -0.06920657,
        0.00807706,  0.02362138, -0.00099598, -0.03800139,  0.06231945,
        0.03415101,  0.0561602 , -0.01016205,  0.08438386,  0.07533283,
       -0.13695887, -0.01100177, -0.10242955,  0.03188829, -0.04064427,
       -0.02914017, -0.06386539, -0.06155058, -0.00615242,  0.00615733,
       -0.03677453, -0.10083614, -0.09915902, -0.01499352, -0.0428905 ,
       -0.07342237,  0.00816098, -0.01920961, -0.05493748,  0.04522207,
        0.07613138, -0.05877322,  0.0417468 , -0.01223148,  0.03953735,
        0.01429599, -0.02857977, -0.07490996, -0.07850028,  0.03174346,
        0.06443751,  0.01268353, -0.08018931,  0.06602806, -0.03417638,
        0.00344341, -0.06512572, -0.08546869, -0.02787689, -0.0415597 ,
       -0.02506224,  0.041864  ,  0.05230035,  0.01846709, -0.00877439,
       -0.10832384,  0.03634784, -0.06550397,  0.03707062, -0.0529377 ,
       -0.15230809, -0.06808924, -0.0105572 , -0.09720243, -0.02477943,
       -0.0201604 ,  0.08111939, -0.06768905,  0.0002988 ,  0.0077303 ,
       -0.04402705, -0.08054816, -0.00552738,  0.028412  , -0.03715033,
       -0.04948618,  0.00337431,  0.03152485,  0.1285221 ,  0.04539691,
        0.06230238,  0.00652147,  0.06113409,  0.02588119, -0.05525883,
       -0.00190705, -0.04055991, -0.01651839, -0.08220793, -0.12156011,
        0.05581678,  0.0460983 , -0.02329248,  0.02569922,  0.04048427,
        0.01138046, -0.04218122,  0.01308791,  0.03205488, -0.0349777 ,
       -0.01670968,  0.0374642 ,  0.01190601,  0.06392512, -0.10639601,
        0.03272881,  0.10028993, -0.00271489, -0.10370444, -0.016493  ,
       -0.01203196, -0.01359196, -0.0107542 , -0.03691358,  0.07650792,
       -0.0165319 ,  0.05781083,  0.03541211,  0.06619678, -0.03761706,
        0.04056923, -0.08521704,  0.04542974,  0.05765659,  0.01986002,
       -0.0354447 ,  0.03333832, -0.05173995,  0.06272335,  0.01096975,
        0.04950256, -0.00945826, -0.00653899, -0.1073802 ,  0.01505139,
        0.00831405,  0.0658185 ,  0.06472088,  0.02112927, -0.0161819 ,
        0.03234134,  0.01384802,  0.05193314,  0.05432371,  0.05399175,
       -0.01561716,  0.07086278,  0.00388288, -0.04355494, -0.05447726,
       -0.025471  ,  0.02247729, -0.08989162,  0.06539894,  0.08100092,
        0.13270888, -0.06876201,  0.02076347, -0.05209624,  0.02469205,
        0.06893449,  0.0873263 ,  0.07789882,  0.04711369,  0.0824732 ,
       -0.08542187, -0.04682464, -0.15431304, -0.04475014, -0.01971488,
        0.00911232, -0.0603571 , -0.0022078 ,  0.04911681,  0.04853116,
        0.00946618, -0.12347805,  0.00823581,  0.00914641,  0.02190395,
       -0.0688283 ,  0.03558169, -0.11885665,  0.02563123, -0.02871157,
        0.0118563 ,  0.00246904, -0.02010039,  0.03884799, -0.0357489 ], dtype=float32)

Further exploration of word2vec generated vectors

Some tips here: https://radimrehurek.com/gensim/models/keyedvectors.html

In [67]:
wv.syn0norm[wv.vocab['Murtuza'].index] # (300,)
#wv.vocab['woman'].index
Out[67]:
array([ -8.43075514e-02,   9.63514894e-02,   6.45992905e-02,
         6.89789057e-02,   3.52193899e-02,   1.74272116e-02,
        -6.05846494e-02,  -9.63514894e-02,  -7.00738132e-02,
         2.60951947e-02,  -2.97448728e-02,   4.58034538e-02,
        -4.89056818e-02,   3.55843566e-02,  -6.05846494e-02,
        -3.67248803e-03,   2.93799043e-02,  -3.04748081e-02,
        -3.10222600e-02,   4.96356152e-02,  -1.32118329e-01,
        -6.60591647e-02,   1.71534847e-02,   6.67890981e-02,
         7.81031027e-02,  -9.78113636e-02,  -7.25373439e-03,
         4.31118160e-03,  -7.52746034e-03,  -9.48916152e-02,
        -1.00001164e-01,  -3.50369066e-02,   8.50374922e-02,
         1.72447264e-02,  -3.48544233e-02,  -5.21903895e-02,
        -3.22996452e-02,   1.53286457e-02,  -4.58490755e-03,
        -5.83948418e-02,   9.34317484e-02,   7.95629695e-02,
        -2.14418564e-02,   2.40878724e-02,  -2.64601633e-02,
        -8.72272924e-02,  -6.47247536e-04,  -6.60591647e-02,
         9.63514894e-02,  -9.16981511e-03,  -4.41610999e-02,
         2.59127114e-02,   1.31388396e-01,  -7.26285875e-02,
         2.08031628e-02,   1.40512586e-02,  -1.79746617e-02,
         7.00738132e-02,  -4.10588719e-02,  -1.13869943e-01,
        -1.40512586e-02,  -2.64601633e-02,  -8.46725181e-02,
        -3.04748081e-02,  -4.43435833e-02,   6.49642646e-02,
         9.34317484e-02,  -2.68251300e-02,   5.72999381e-02,
         4.85407114e-02,  -4.43435833e-02,  -2.82850023e-03,
        -5.80298752e-02,  -4.12413590e-02,  -6.38693571e-02,
        -1.71534847e-02,  -7.22636208e-02,   1.38687752e-02,
         8.54024589e-02,  -5.62050343e-02,   5.62050343e-02,
         1.49636781e-02,  -1.76781265e-03,  -5.98547123e-02,
        -5.29203266e-02,   1.96170174e-02,   8.61323923e-02,
        -3.01098414e-02,  -5.29203266e-02,   3.97814848e-02,
         3.30752041e-03,  -1.35038078e-01,  -9.30667818e-02,
         4.39101830e-04,  -1.21169299e-01,  -6.60591647e-02,
         4.44804458e-03,   8.61323923e-02,   7.11687142e-03,
         3.95990014e-02,   4.92706485e-02,  -5.36502600e-02,
         5.14604561e-02,   6.02196828e-02,   1.00731105e-01,
         2.39053890e-02,   2.66426466e-02,   1.29563557e-02,
         2.07119212e-02,   1.46899521e-02,  -3.01098414e-02,
        -3.10222600e-02,   8.94171000e-02,  -6.71540722e-02,
         1.97082590e-02,  -1.54198883e-02,  -3.99639718e-02,
        -5.43801971e-02,   4.41610999e-02,  -1.33578196e-01,
        -8.71360581e-03,   1.19709425e-01,  -1.23359106e-01,
         1.36862909e-02,  -3.32120657e-02,   8.04069627e-04,
        -1.21169299e-01,   2.91974209e-02,   7.26285875e-02,
         5.83948418e-02,  -5.25553562e-02,   8.72272924e-02,
         3.30295824e-02,  -1.47446975e-01,  -2.17155814e-02,
         9.53478273e-03,   4.01464552e-02,  -1.08121699e-02,
        -2.29017269e-02,   9.23368409e-02,   1.05384439e-02,
         6.70628250e-03,   4.72177053e-03,   7.11687133e-02,
        -1.00731105e-01,  -2.60951947e-02,  -1.24887412e-03,
         7.22636208e-02,   7.59132951e-02,   4.54384871e-02,
        -3.55843566e-02,  -7.33585209e-02,  -1.08030461e-01,
        -4.34311628e-02,   1.42337428e-02,  -1.44527242e-01,
        -1.70622431e-02,   1.77921783e-02,   4.21537757e-02,
         5.63419005e-03,  -7.55483285e-02,  -5.07305190e-02,
         3.72267105e-02,  -3.06572914e-02,  -3.75916809e-02,
        -4.81757447e-02,   8.97820666e-02,  -2.35404205e-02,
         4.58034538e-02,   1.23176621e-02,  -1.00001164e-01,
        -7.52746034e-03,  -5.91247790e-02,  -7.73731694e-02,
         1.82483885e-02,   3.39420028e-02,   1.39417693e-01,
        -7.08037466e-02,   6.79752463e-03,   6.02196828e-02,
         1.04380779e-01,   3.64967771e-02,  -1.08121699e-02,
         1.42337428e-02,   1.15421051e-02,  -6.58652745e-04,
         8.10228437e-02,   3.13644159e-05,   1.45987107e-03,
         9.48916189e-03,  -6.82489723e-02,   1.00001164e-01,
        -1.34308144e-01,   2.33807470e-04,  -5.40152304e-02,
        -5.07305190e-02,  -3.08397766e-02,  -3.86865847e-02,
        -3.13872285e-02,  -4.78107780e-02,  -7.02562928e-03,
        -3.44438315e-03,  -1.25001455e-02,  -5.58400676e-02,
         6.16795532e-02,  -7.73731694e-02,  -2.13506147e-02,
         3.95990014e-02,   2.57302281e-02,   1.98907424e-02,
         2.97448728e-02,  -4.12413590e-02,  -6.60591647e-02,
        -2.79200338e-02,   7.44534209e-02,   5.18254228e-02,
        -8.61323923e-02,   1.14599876e-01,  -7.99279436e-02,
        -2.88324542e-02,   2.19893083e-02,   3.01098414e-02,
        -4.70808409e-02,   3.17521952e-02,  -7.00738132e-02,
         6.93438724e-02,  -2.15330981e-02,  -6.45992905e-02,
         1.61498226e-02,  -5.65700047e-02,  -2.21717916e-02,
        -3.32120657e-02,  -8.07491131e-03,  -1.88870821e-02,
         1.37775326e-02,  -2.70076152e-02,   1.02920912e-01,
        -3.77741642e-02,   2.66882684e-03,   8.61323923e-02,
        -7.11687142e-03,   3.72267105e-02,   6.43255701e-03,
        -5.54751009e-02,  -4.19712923e-02,   1.24089038e-02,
        -3.37595195e-02,   3.10222600e-02,  -2.21717916e-02,
        -9.07857344e-03,   1.01461038e-01,   5.54751009e-02,
        -7.52746034e-03,   5.32852933e-02,   5.76649085e-02,
         3.19346786e-02,  -1.18979491e-01,  -4.43435833e-02,
        -5.94897456e-02,  -3.86865847e-02,  -2.93799043e-02,
         6.35043904e-02,  -1.25001455e-02,  -8.35776180e-02,
         1.89783238e-02,   6.56941980e-02,  -3.79566476e-02,
        -2.48178076e-02,   3.57668400e-02,   4.99549648e-03,
         7.44534209e-02,  -1.54746339e-01,  -3.64967771e-02,
         5.36502600e-02,   6.31394237e-02,  -3.10222600e-02,
        -6.56941952e-03,  -2.07119212e-02,   5.10954857e-02,
         2.90149376e-02,  -2.45212717e-03,  -6.89789057e-02,
        -1.05840653e-01,   1.97082590e-02,   2.48178076e-02,
        -1.03559606e-02,   3.95990014e-02,   3.24821323e-02,
         1.66972745e-02,   1.44162271e-02,  -1.92520488e-02,
        -3.21171619e-02,   4.05114219e-02,  -8.10228437e-02,
        -8.68623257e-02,   5.94897456e-02,  -4.85407114e-02,
         3.30295824e-02,  -3.04748081e-02,   6.45992905e-02], dtype=float32)
In [68]:
wv.most_similar(positive=['woman', 'king'], negative=['man'])
Out[68]:
[('queen', 0.7118191719055176),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411403656006)]
In [120]:
wv.most_similar(positive=['GOP', 'Trump'], negative=['money'])
Out[120]:
[('Donald_Trump', 0.492326557636261),
 ('Mexican', 0.43394654989242554),
 ('Aguilar_Zinser', 0.4289255142211914),
 ('México', 0.4275927245616913),
 ('Mexcio', 0.4245557487010956),
 ('Jorge_Castañeda', 0.42415398359298706),
 ('Celaya_Guanajuato', 0.4219090938568115),
 ('Guatemala', 0.41712844371795654),
 ('Bosques_de_las_Lomas', 0.41683387756347656),
 ('Peru', 0.4160282015800476)]

For fun, let us manually compute the cosine similarity of two of our word vectors, which is defined as

$$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$
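A minimal NumPy sketch of this computation (the helper word_embeddings.cosine_measure used below presumably does something equivalent; the function name here is ours):

import numpy as np

def cosine_similarity(u, v):
    # cosine similarity between two 1-D vectors; note that after
    # wv.init_sims(replace=True) the stored vectors are already unit-length,
    # so the plain dot product would give the same result
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# e.g. cosine_similarity(wv['gun'], wv['pistol'])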

In [117]:
word_embeddings.cosine_measure(wv.syn0norm[wv.vocab['gun'].index], wv.syn0norm[wv.vocab['pistol'].index])
Out[117]:
0.76924075586436913

KNN Classification

Now we train a KNN and a logistic regression classifier and observe how they perform on these word-averaged document features (a sketch of the logistic-regression variant is given after the KNN results).

Read about KNN: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

[figure: k-nearest-neighbors illustration]

In [71]:
knn_naive_dv = KNeighborsClassifier(n_neighbors=3, n_jobs=1, algorithm='brute', metric='cosine' )
knn_naive_dv.fit(X_train_word_average, train_data.tag)
Out[71]:
KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='cosine',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
In [72]:
predicted = knn_naive_dv.predict(X_test_word_average)
In [73]:
word_embeddings.evaluate_prediction(predicted, test_data.tag, my_tags)
accuracy 0.38683127572
confusion matrix
 [[22  4  6  1  6  3]
 [ 9  4  9  4  5  0]
 [18  4 44  3 13  4]
 [ 7  3  1  2  1  2]
 [13  0  9  1 11  1]
 [13  2  1  4  2 11]]
(row=expected, col=predicted)

KNN doesn't perform well here (only ~39% accuracy), slightly below the majority-class baseline.
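As mentioned above, a logistic regression can be trained on the same averaged vectors; that cell is not shown in this notebook, but a minimal sketch (results not reported here, run it yourself) would be:

from sklearn import linear_model
from sklearn.metrics import accuracy_score

# logistic regression on the word-averaged document vectors
logreg_w2v = linear_model.LogisticRegression(n_jobs=1, C=1e5)
logreg_w2v.fit(X_train_word_average, train_data['tag'])
predicted_w2v = logreg_w2v.predict(X_test_word_average)
print("accuracy", accuracy_score(test_data['tag'], predicted_w2v))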

In [74]:
test_data.iloc[56]['plot']
Out[74]:
'Scruffy but irresistibly attractive Yau Muk-yan, without a job or a place to live, moves in with sensitive, shy piano tuner Chan Kar-fu. Both are disturbed, then obsessed, by the amateurish piano playing of upstairs neighbour Mok Man-yee. Obsession turns to romance, and romance to fantasy. The film is structured in four "movements": two themes (Yau Muk-yan, Mok Man-yee), a duet (Yau Muk-yan & Mok Man-yee), and a set of variations (a wild fantasy of Chan Kar-fu in his new novel).'
In [75]:
wv.most_similar(positive=[X_test_word_average[56]], restrict_vocab=100000, topn=30)[0:20]
Out[75]:
[('just', 0.5038168430328369),
 ('but', 0.5023055076599121),
 ('in', 0.49800050258636475),
 ('the', 0.4957137107849121),
 ('so', 0.4889165759086609),
 ('actually', 0.4778650999069214),
 ('By_TBT_staff', 0.4743911623954773),
 ('even', 0.47195717692375183),
 ('really', 0.4685917794704437),
 ('anyway', 0.4660709500312805),
 ('vice_versa', 0.45745164155960083),
 ('one', 0.4536522626876831),
 ('sort', 0.4488521218299866),
 ('that', 0.4462577700614929),
 ('dreamy', 0.44418084621429443),
 ('You_EIG', 0.44372493028640747),
 ('kind', 0.442863404750824),
 ('then', 0.44108718633651733),
 ('obviously', 0.4360683262348175),
 ('Chan', 0.43542975187301636)]

The problem here is that the averaged vector for this particular document falls in a region of the space that is not related to the original document.

Exercise: Try to remove stopwords; the function is already provided. Figure out how to do this with the current implementation of the function, and report the accuracy you achieve.

Doc2Vec

Paper: https://cs.stanford.edu/~quocle/paragraph_vector.pdf

It is a semi-supervised approach, since a weak label or tag (you can attach several) is added to each training document before modeling. Read more about semi-supervised approaches here: https://en.wikipedia.org/wiki/Semi-supervised_learning

[figure: doc2vec (paragraph vectors) illustration]

In [76]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
In [77]:
train_tagged = train_data.apply(
    lambda r: TaggedDocument(words=word_embeddings.tokenize_text(r['plot']), tags=[r.tag]), axis=1)
In [78]:
test_tagged = test_data.apply(
    lambda r: TaggedDocument(words=word_embeddings.tokenize_text(r['plot']), tags=[r.tag]), axis=1)
In [79]:
test_tagged.values[50]
Out[79]:
TaggedDocument(words=['troubled', 'psychologist', 'is', 'sent', 'to', 'investigate', 'the', 'crew', 'of', 'an', 'isolated', 'research', 'station', 'orbiting', 'bizarre', 'planet'], tags=['sci-fi'])
In [80]:
trainsent = train_tagged.values
testsent = test_tagged.values
In [81]:
trainsent
Out[81]:
array([ TaggedDocument(words=['turkish', 'and', 'his', 'close', 'friend/accomplice', 'tommy', 'get', 'pulled', 'into', 'the', 'world', 'of', 'match', 'fixing', 'by', 'the', 'notorious', 'brick', 'top', 'things', 'get', 'complicated', 'when', 'the', 'boxer', 'they', 'had', 'lined', 'up', 'gets', 'badly', 'beaten', 'by', 'pitt', "'pikey", 'slang', 'for', 'an', 'irish', 'gypsy', 'who', 'comes', 'into', 'the', 'equation', 'after', 'turkish', 'an', 'unlicensed', 'boxing', 'promoter', 'wants', 'to', 'buy', 'caravan', 'off', 'the', 'irish', 'gypsies', 'they', 'then', 'try', 'to', 'convince', 'pitt', 'not', 'only', 'to', 'fight', 'for', 'them', 'but', 'to', 'lose', 'for', 'them', 'too', 'whilst', 'all', 'this', 'is', 'going', 'on', 'huge', 'diamond', 'heist', 'takes', 'place', 'and', 'fistful', 'of', 'motley', 'characters', 'enter', 'the', 'story', 'including', "'cousin", 'avi', "'boris", 'the', 'blade', "'franky", 'four', 'fingers', 'and', "'bullet", 'tooth', 'tony', 'things', 'go', 'from', 'bad', 'to', 'worse', 'as', 'it', 'all', 'becomes', 'about', 'the', 'money', 'the', 'guns', 'and', 'the', 'damned', 'dog'], tags=['comedy']),
       TaggedDocument(words=['in', 'the', 'early', '1960', "'s", 'sixteen', 'year', 'old', 'jenny', 'mellor', 'lives', 'with', 'her', 'parents', 'in', 'the', 'london', 'suburb', 'of', 'twickenham', 'on', 'her', 'father', "'s", 'wishes', 'everything', 'that', 'jenny', 'does', 'is', 'in', 'the', 'sole', 'pursuit', 'of', 'being', 'accepted', 'into', 'oxford', 'as', 'he', 'wants', 'her', 'to', 'have', 'better', 'life', 'than', 'he', 'jenny', 'is', 'bright', 'pretty', 'hard', 'working', 'but', 'also', 'naturally', 'gifted', 'the', 'only', 'problems', 'her', 'father', 'may', 'perceive', 'in', 'her', 'life', 'is', 'her', 'issue', 'with', 'learning', 'latin', 'and', 'her', 'dating', 'boy', 'named', 'graham', 'who', 'is', 'nice', 'but', 'socially', 'awkward', 'jenny', "'s", 'life', 'changes', 'after', 'she', 'meets', 'david', 'goldman', 'man', 'over', 'twice', 'her', 'age', 'david', 'goes', 'out', 'of', 'his', 'way', 'to', 'show', 'jenny', 'and', 'her', 'family', 'that', 'his', 'interest', 'in', 'her', 'is', 'not', 'improper', 'and', 'that', 'he', 'wants', 'solely', 'to', 'expose', 'her', 'to', 'cultural', 'activities', 'which', 'she', 'enjoys', 'jenny', 'quickly', 'gets', 'accustomed', 'to', 'the', 'life', 'to', 'which', 'david', 'and', 'his', 'constant', 'companions', 'danny', 'and', 'helen', 'have', 'shown', 'her', 'and', 'jenny', 'and', 'david', "'s", 'relationship', 'does', 'move', 'into', 'becoming', 'romantic', 'one', 'however', 'jenny', 'slowly', 'learns', 'more', 'about', 'david', 'and', 'by', 'association', 'danny', 'and', 'helen', 'and', 'specifically', 'how', 'they', 'make', 'their', 'money', 'jenny', 'has', 'to', 'decide', 'if', 'what', 'she', 'learns', 'about', 'them', 'and', 'leading', 'such', 'life', 'is', 'worth', 'forgoing', 'her', 'plans', 'of', 'higher', 'eduction', 'at', 'oxford'], tags=['romance']),
       TaggedDocument(words=['ollie', 'trinkie', 'is', 'publicist', 'who', 'has', 'great', 'girlfriend', 'gertrude', 'whom', 'he', 'marries', 'and', 'they', 'are', 'expecting', 'baby', 'but', 'while', 'he', 'is', 'looking', 'forward', 'to', 'being', 'father', 'he', 'does', "n't", 'lighten', 'his', 'workload', 'gertrude', 'gives', 'birth', 'but', 'dies', 'in', 'the', 'process', 'ollie', 'does', "n't", 'live', 'up', 'to', 'his', 'responsibilities', 'as', 'father', 'eventually', 'the', 'strain', 'and', 'pressure', 'of', 'losing', 'his', 'wife', 'and', 'being', 'father', 'gets', 'to', 'him', 'and', 'he', 'has', 'breakdown', 'which', 'leads', 'to', 'his', 'termination', 'so', 'with', 'nothing', 'much', 'to', 'do', 'he', 'tries', 'to', 'be', 'good', 'father', 'to', 'his', 'daughter', 'gertie', 'he', 'also', 'meets', 'young', 'woman', 'name', 'maya', 'who', 'likes', 'him', 'but', 'he', 'is', 'still', 'not', 'over', 'his', 'wife'], tags=['comedy']),
       ...,
       TaggedDocument(words=['the', 'factual', 'story', 'of', 'spaniard', 'ramon', 'sampedro', 'who', 'fought', 'thirty-year', 'campaign', 'in', 'favor', 'of', 'euthanasia', 'and', 'his', 'own', 'right', 'to', 'die'], tags=['romance']),
       TaggedDocument(words=['in', 'south', 'boston', 'the', 'state', 'police', 'force', 'is', 'waging', 'war', 'on', 'irish-american', 'organized', 'crime', 'young', 'undercover', 'cop', 'billy', 'costigan', 'leonardo', 'dicaprio', 'is', 'assigned', 'to', 'infiltrate', 'the', 'mob', 'syndicate', 'run', 'by', 'gangland', 'chief', 'frank', 'costello', 'jack', 'nicholson', 'while', 'billy', 'quickly', 'gains', 'costello', "'s", 'confidence', 'colin', 'sullivan', 'matt', 'damon', 'hardened', 'young', 'criminal', 'who', 'has', 'infiltrated', 'the', 'state', 'police', 'as', 'an', 'informer', 'for', 'the', 'syndicate', 'is', 'rising', 'to', 'position', 'of', 'power', 'in', 'the', 'special', 'investigation', 'unit', 'each', 'man', 'becomes', 'deeply', 'consumed', 'by', 'their', 'double', 'lives', 'gathering', 'information', 'about', 'the', 'plans', 'and', 'counter-plans', 'of', 'the', 'operations', 'they', 'have', 'penetrated', 'but', 'when', 'it', 'becomes', 'clear', 'to', 'both', 'the', 'mob', 'and', 'the', 'police', 'that', 'there', 'is', 'mole', 'in', 'their', 'midst', 'billy', 'and', 'colin', 'are', 'suddenly', 'in', 'danger', 'of', 'being', 'caught', 'and', 'exposed', 'to', 'the', 'enemy', 'and', 'each', 'must', 'race', 'to', 'uncover', 'the', 'identity', 'of', 'the', 'other', 'man', 'in', 'time', 'to', 'save', 'themselves', 'but', 'is', 'either', 'willing', 'to', 'turn', 'on', 'their', 'friends', 'and', 'comrades', 'they', "'ve", 'made', 'during', 'their', 'long', 'stints', 'undercover'], tags=['action']),
       TaggedDocument(words=['the', 'topalovic', 'family', 'has', 'been', 'in', 'the', 'burial', 'business', 'for', 'generations', 'when', 'the', 'old', '150', 'yrs', 'old', 'pantelija', 'dies', 'five', 'generations', 'of', 'his', 'heirs', 'start', 'to', 'fight', 'for', 'the', 'inheritance'], tags=['comedy'])], dtype=object)
In [82]:
# train the model with the simple gensim Doc2Vec API
# size = dimensionality of the feature vectors
# dm = training algorithm: distributed memory (PV-DM, dm=1) or distributed bag of words (PV-DBOW, dm=0)
doc2vec_model = Doc2Vec(trainsent, workers=1, size=5, iter=20, dm=1)

train_targets, train_regressors = zip(
    *[(doc.tags[0], doc2vec_model.infer_vector(doc.words, steps=20)) for doc in trainsent])
2017-06-28 22:13:28,285 : WARNING : consider setting layer size to a multiple of 4 for greater performance
2017-06-28 22:13:28,286 : INFO : collecting all words and their counts
2017-06-28 22:13:28,287 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2017-06-28 22:13:28,324 : INFO : collected 17168 word types and 6 unique tags from a corpus of 2184 examples and 150640 words
2017-06-28 22:13:28,325 : INFO : Loading a fresh vocabulary
2017-06-28 22:13:28,335 : INFO : min_count=5 retains 3631 unique words (21% of original 17168, drops 13537)
2017-06-28 22:13:28,336 : INFO : min_count=5 leaves 128953 word corpus (85% of original 150640, drops 21687)
2017-06-28 22:13:28,345 : INFO : deleting the raw counts dictionary of 17168 items
2017-06-28 22:13:28,346 : INFO : sample=0.001 downsamples 43 most-common words
2017-06-28 22:13:28,347 : INFO : downsampling leaves estimated 92782 word corpus (72.0% of prior 128953)
2017-06-28 22:13:28,348 : INFO : estimated required memory for 3631 words and 5 dimensions: 1962060 bytes
2017-06-28 22:13:28,357 : INFO : resetting layer weights
2017-06-28 22:13:28,407 : INFO : training model with 1 workers on 3631 vocabulary and 5 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2017-06-28 22:13:29,414 : INFO : PROGRESS: at 42.23% examples, 799030 words/s, in_qsize 2, out_qsize 0
2017-06-28 22:13:30,419 : INFO : PROGRESS: at 84.13% examples, 795161 words/s, in_qsize 2, out_qsize 0
2017-06-28 22:13:30,797 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-06-28 22:13:30,798 : INFO : training on 3012800 raw words (1899475 effective words) took 2.4s, 795208 effective words/s
In [83]:
test_targets, test_regressors = zip(
    *[(doc.tags[0], doc2vec_model.infer_vector(doc.words, steps=20)) for doc in testsent])

Let us see what our data looks like after training and these transformations

In [84]:
test_targets[0:5]
Out[84]:
('comedy', 'romance', 'sci-fi', 'comedy', 'fantasy')
In [85]:
test_regressors[0:5]
Out[85]:
(array([-0.30981189,  0.78878409,  0.27795053, -1.40723765, -0.78624779], dtype=float32),
 array([-0.14812587,  0.67659032,  0.30580592, -0.62821662, -1.07823062], dtype=float32),
 array([-0.40741968,  0.41884324,  0.28939512, -0.16581796, -0.11567296], dtype=float32),
 array([-0.16268265,  0.93405986,  0.48112747, -0.88357091, -0.70414579], dtype=float32),
 array([ 0.06507298,  0.17371821,  0.09560882, -0.10981997, -0.2437499 ], dtype=float32))
In [86]:
d2v_model = linear_model.LogisticRegression(n_jobs=1, C=1e5)
In [87]:
d2v_model = d2v_model.fit(train_regressors, train_targets)
In [88]:
word_embeddings.evaluate_prediction(d2v_model.predict(test_regressors), test_targets, my_tags,title=str(doc2vec_model))
accuracy 0.390946502058
confusion matrix
 [[15  0 12  1  2 12]
 [ 5  0 14  3  5  4]
 [10  1 58  0 14  3]
 [ 2  0  6  1  3  4]
 [ 2  0 22  0  9  2]
 [ 8  0 12  1  0 12]]
(row=expected, col=predicted)

Now let us try a KNN-style approach with doc2vec: each test plot is assigned the tag whose doc2vec vector is most similar to the plot's inferred vector. You will observe how poorly this performs.

In [89]:
knn_test_predictions = [
    doc2vec_model.docvecs.most_similar([pred_vec], topn=1)[0][0]
    for pred_vec in test_regressors
]
word_embeddings.evaluate_prediction(knn_test_predictions, test_targets,my_tags, str(doc2vec_model))
2017-06-28 22:13:33,504 : INFO : precomputing L2-norms of doc weight vectors
accuracy 0.201646090535
confusion matrix
 [[ 0 37  1  0  3  1]
 [ 3 26  0  0  2  0]
 [ 2 56 14  1 12  1]
 [ 0 15  0  0  1  0]
 [ 0 24  4  0  6  1]
 [ 1 28  0  0  1  3]]
(row=expected, col=predicted)

Since doc2vec also gives us a vector for each genre, we can observe which genres are similar to each other

In [90]:
doc2vec_model.docvecs.most_similar('action')
Out[90]:
[('sci-fi', 0.955485463142395),
 ('animation', 0.9373874068260193),
 ('fantasy', 0.5044772624969482),
 ('comedy', 0.35625192523002625),
 ('romance', 0.09751249849796295)]
In [91]:
doc2vec_model.docvecs.most_similar('fantasy')
Out[91]:
[('animation', 0.6903013586997986),
 ('romance', 0.6896251440048218),
 ('action', 0.5044772624969482),
 ('sci-fi', 0.47834616899490356),
 ('comedy', -0.18462462723255157)]

Since words and tags fall in the same vector space, it is also possible to observe which words surround a tag. Notice how well those words describe the tag queried below ('fantasy' in this case).

In [121]:
doc2vec_model.most_similar([doc2vec_model.docvecs['fantasy']])
Out[121]:
[('years', 0.9583119750022888),
 ('fell', 0.9185272455215454),
 ('regina', 0.9117251634597778),
 ('marks', 0.9067319631576538),
 ('infected', 0.9006549119949341),
 ('cunning', 0.8974300622940063),
 ('vampires', 0.892426609992981),
 ('skywalker', 0.8912185430526733),
 ('cyborg', 0.8903250098228455),
 ('unfolds', 0.8898645043373108)]

Visualization

In [93]:
%pylab inline
pylab.rcParams['figure.figsize'] = (15, 10)
Populating the interactive namespace from numpy and matplotlib
In [94]:
#doc2vec_model.wv.vocab
doc2vec_words = word2vec_helpers.unpack_words_from_doc_vector(doc2vec_model)
In [95]:
word2vec_helpers.visualize_vectors(doc2vec_model, doc2vec_words)

Exercise: Try to visualize words and find interesting clusters. There are many creative ways to visualize these powerful vector representations.


Deep IR

Paper: https://arxiv.org/pdf/1504.07295v3.pdf

This approach is very simple: train a separate word2vec model for each class (tag), and then see which model fits each plot best using Bayes' theorem.

Pre-processing: we simply strip non-alphanumeric characters and split each plot into sentences.
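The helper deepir.plots is not shown in the notebook; the per-plot pre-processing is presumably something along these lines (the regex, function name, and exact cleaning rules are assumptions):

import re
from nltk.tokenize import sent_tokenize

def plot_to_sentences(plot):
    # split a plot into sentences, lowercase them, strip non-alphanumeric
    # characters, and tokenize on whitespace
    sentences = sent_tokenize(plot)
    cleaned = [re.sub(r"[^a-z0-9]+", " ", s.lower()).split() for s in sentences]
    return [s for s in cleaned if s]

# each training/test item would then be a dict like {'x': plot_to_sentences(plot), 'y': tag}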

In [96]:
# The corpus is small so can be read into memory
# here we split plot into sentences
revtrain = list(deepir.plots("training", train_data, test_data))
revtest = list(deepir.plots("test",train_data, test_data))
In [97]:
revtrain[0:1]
Out[97]:
[{'x': [['turkish',
    'and',
    'his',
    'close',
    'friend',
    'accomplice',
    'tommy',
    'get',
    'pulled',
    'into',
    'the',
    'world',
    'of',
    'match',
    'fixing',
    'by',
    'the',
    'notorious',
    'brick',
    'top'],
   ['things',
    'get',
    'complicated',
    'when',
    'the',
    'boxer',
    'they',
    'had',
    'lined',
    'up',
    'gets',
    'badly',
    'beaten',
    'by',
    'pitt',
    'pikey',
    'slang',
    'for',
    'an',
    'irish',
    'gypsy',
    'who',
    'comes',
    'into',
    'the',
    'equation',
    'after',
    'turkish',
    'an',
    'unlicensed',
    'boxing',
    'promoter',
    'wants',
    'to',
    'buy',
    'caravan',
    'off',
    'the',
    'irish',
    'gypsies'],
   ['they',
    'then',
    'try',
    'to',
    'convince',
    'pitt',
    'not',
    'only',
    'to',
    'fight',
    'for',
    'them',
    'but',
    'to',
    'lose',
    'for',
    'them',
    'too'],
   ['whilst',
    'all',
    'this',
    'is',
    'going',
    'on',
    'huge',
    'diamond',
    'heist',
    'takes',
    'place',
    'and',
    'fistful',
    'of',
    'motley',
    'characters',
    'enter',
    'the',
    'story',
    'including',
    'cousin',
    'avi',
    'boris',
    'the',
    'blade',
    'franky',
    'four',
    'fingers',
    'and',
    'bullet',
    'tooth',
    'tony'],
   ['things',
    'go',
    'from',
    'bad',
    'to',
    'worse',
    'as',
    'it',
    'all',
    'becomes',
    'about',
    'the',
    'money',
    'the',
    'guns',
    'and',
    'the',
    'damned',
    'dog']],
  'y': 'comedy'}]
In [98]:
# shuffle training set for unbiased word2vec training
np.random.shuffle(revtrain)

Let us look at an example from the category sci-fi

In [99]:
next(deepir.tag_sentences(revtrain, ["sci-fi"]))
Out[99]:
['in',
 'fascist',
 'future',
 'where',
 'all',
 'forms',
 'of',
 'feeling',
 'are',
 'illegal',
 'man',
 'in',
 'charge',
 'of',
 'enforcing',
 'the',
 'law',
 'rises',
 'to',
 'overthrow',
 'the',
 'system']

In the following steps, we train six word2vec models from scratch instead of using the pretrained Google word2vec as in the previous exercises

In [100]:
## training
from gensim.models import Word2Vec
import multiprocessing
In [101]:
## 1. create a w2v learner 
basemodel = Word2Vec(
    workers=multiprocessing.cpu_count(), # use your cores
    iter=100, # iter = sweeps of SGD through the data; more is better
    hs=1, negative=0, # we only have scoring for the hierarchical softmax setup
    )
In [102]:
print(basemodel)
Word2Vec(vocab=0, size=100, alpha=0.025)
In [103]:
basemodel.iter
Out[103]:
100
In [104]:
# 2. build a vocabulary
basemodel.build_vocab(deepir.tag_sentences(revtrain, my_tags))
2017-06-28 22:13:38,994 : INFO : collecting all words and their counts
2017-06-28 22:13:38,996 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2017-06-28 22:13:39,056 : INFO : collected 17544 word types from a corpus of 148422 raw words and 7880 sentences
2017-06-28 22:13:39,057 : INFO : Loading a fresh vocabulary
2017-06-28 22:13:39,070 : INFO : min_count=5 retains 3647 unique words (20% of original 17544, drops 13897)
2017-06-28 22:13:39,070 : INFO : min_count=5 leaves 126255 word corpus (85% of original 148422, drops 22167)
2017-06-28 22:13:39,080 : INFO : deleting the raw counts dictionary of 17544 items
2017-06-28 22:13:39,081 : INFO : sample=0.001 downsamples 42 most-common words
2017-06-28 22:13:39,081 : INFO : downsampling leaves estimated 91197 word corpus (72.2% of prior 126255)
2017-06-28 22:13:39,082 : INFO : estimated required memory for 3647 words and 100 dimensions: 5470500 bytes
2017-06-28 22:13:39,085 : INFO : constructing a huffman tree from 3647 words
2017-06-28 22:13:39,155 : INFO : built huffman tree with maximum node depth 15
2017-06-28 22:13:39,157 : INFO : resetting layer weights
In [105]:
from copy import deepcopy
genremodels = [deepcopy(basemodel) for i in range(len(my_tags))]
In [106]:
# 3. training for each model or each tag
for i in range(len(my_tags)):
    slist = list(deepir.tag_sentences(revtrain, my_tags[i]))
    print(my_tags[i], "genre (", len(slist), ")")
    genremodels[i].train(  slist, total_examples=len(slist), epochs=basemodel.iter)
2017-06-28 22:13:39,592 : INFO : training model with 4 workers on 3647 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=0 window=5
sci-fi genre ( 1157 )
2017-06-28 22:13:40,340 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-06-28 22:13:40,342 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-06-28 22:13:40,345 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-06-28 22:13:40,345 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-06-28 22:13:40,346 : INFO : training on 2163100 raw words (1331086 effective words) took 0.7s, 1780320 effective words/s
2017-06-28 22:13:40,347 : INFO : training model with 4 workers on 3647 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=0 window=5
action genre ( 1370 )
2017-06-28 22:13:41,269 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-06-28 22:13:41,272 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-06-28 22:13:41,276 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-06-28 22:13:41,278 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-06-28 22:13:41,278 : INFO : training on 2577000 raw words (1598414 effective words) took 0.9s, 1735829 effective words/s
2017-06-28 22:13:41,280 : INFO : training model with 4 workers on 3647 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=0 window=5
comedy genre ( 2375 )
2017-06-28 22:13:42,287 : INFO : PROGRESS: at 62.38% examples, 1723566 words/s, in_qsize 7, out_qsize 0
2017-06-28 22:13:42,862 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-06-28 22:13:42,863 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-06-28 22:13:42,866 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-06-28 22:13:42,871 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-06-28 22:13:42,872 : INFO : training on 4515700 raw words (2765161 effective words) took 1.6s, 1744093 effective words/s
2017-06-28 22:13:42,873 : INFO : training model with 4 workers on 3647 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=0 window=5
fantasy genre ( 707 )
2017-06-28 22:13:43,342 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-06-28 22:13:43,345 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-06-28 22:13:43,347 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-06-28 22:13:43,349 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-06-28 22:13:43,349 : INFO : training on 1340700 raw words (822429 effective words) took 0.5s, 1748793 effective words/s
2017-06-28 22:13:43,351 : INFO : training model with 4 workers on 3647 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=0 window=5
animation genre ( 867 )
2017-06-28 22:13:43,952 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-06-28 22:13:43,958 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-06-28 22:13:43,962 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-06-28 22:13:43,963 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-06-28 22:13:43,963 : INFO : training on 1604300 raw words (972786 effective words) took 0.6s, 1603441 effective words/s
2017-06-28 22:13:43,966 : INFO : training model with 4 workers on 3647 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 negative=0 window=5
romance genre ( 1404 )
2017-06-28 22:13:44,910 : INFO : worker thread finished; awaiting finish of 3 more threads
2017-06-28 22:13:44,914 : INFO : worker thread finished; awaiting finish of 2 more threads
2017-06-28 22:13:44,915 : INFO : worker thread finished; awaiting finish of 1 more threads
2017-06-28 22:13:44,917 : INFO : worker thread finished; awaiting finish of 0 more threads
2017-06-28 22:13:44,917 : INFO : training on 2641400 raw words (1630692 effective words) took 0.9s, 1725774 effective words/s

Now we will compute the most likely class for each plot using Bayes' theorem. With a uniform prior over genres,

$$P(\text{genre} \mid \text{plot}) = \frac{P(\text{plot} \mid \text{genre})\, P(\text{genre})}{\sum_{g} P(\text{plot} \mid g)\, P(g)}$$

where each genre-specific word2vec model supplies the (log-)likelihood $P(\text{plot} \mid \text{genre})$.
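The helper deepir.docprob is not shown here; a sketch of the usual Deep IR scoring scheme (the internals of the helper are assumptions) scores every sentence under every genre model, turns the scores into per-sentence posteriors via Bayes' rule, and averages the posteriors within each document:

import numpy as np
import pandas as pd

def doc_probabilities(docs, genre_models):
    # docs: list of documents, each a list of tokenized sentences
    # genre_models: one Word2Vec model (trained with hs=1) per genre
    sentences = [s for doc in docs for s in doc]
    # log P(sentence | genre) from each genre-specific model
    log_lhd = np.vstack([m.score(sentences, len(sentences)) for m in genre_models])
    # Bayes' rule with a uniform prior, computed in a numerically stable way
    lhd = np.exp(log_lhd - log_lhd.max(axis=0))
    probs = (lhd / lhd.sum(axis=0)).T
    # average sentence-level probabilities per document
    doc_ids = [i for i, doc in enumerate(docs) for _ in doc]
    return pd.DataFrame(probs).groupby(doc_ids).mean()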

In [107]:
probs = deepir.docprob( [r['x'] for r in revtest], genremodels )
2017-06-28 22:13:44,932 : INFO : scoring sentences with 4 workers on 3647 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 and negative=0
2017-06-28 22:13:44,934 : INFO : reached end of input; waiting to finish 9 outstanding jobs
2017-06-28 22:13:44,945 : INFO : scoring 894 sentences took 0.0s, 73422 sentences/s
2017-06-28 22:13:44,946 : INFO : scoring sentences with 4 workers on 3647 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 and negative=0
2017-06-28 22:13:44,949 : INFO : reached end of input; waiting to finish 9 outstanding jobs
2017-06-28 22:13:44,960 : INFO : scoring 894 sentences took 0.0s, 66932 sentences/s
2017-06-28 22:13:44,960 : INFO : scoring sentences with 4 workers on 3647 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 and negative=0
2017-06-28 22:13:44,961 : INFO : reached end of input; waiting to finish 9 outstanding jobs
2017-06-28 22:13:44,971 : INFO : scoring 894 sentences took 0.0s, 83436 sentences/s
2017-06-28 22:13:44,972 : INFO : scoring sentences with 4 workers on 3647 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 and negative=0
2017-06-28 22:13:44,974 : INFO : reached end of input; waiting to finish 9 outstanding jobs
2017-06-28 22:13:44,985 : INFO : scoring 894 sentences took 0.0s, 74658 sentences/s
2017-06-28 22:13:44,985 : INFO : scoring sentences with 4 workers on 3647 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 and negative=0
2017-06-28 22:13:44,986 : INFO : reached end of input; waiting to finish 9 outstanding jobs
2017-06-28 22:13:44,996 : INFO : scoring 894 sentences took 0.0s, 83549 sentences/s
2017-06-28 22:13:44,997 : INFO : scoring sentences with 4 workers on 3647 vocabulary and 100 features, using sg=0 hs=1 sample=0.001 and negative=0
2017-06-28 22:13:44,998 : INFO : reached end of input; waiting to finish 9 outstanding jobs
2017-06-28 22:13:45,010 : INFO : scoring 894 sentences took 0.0s, 72172 sentences/s
In [108]:
probs[0:2]
Out[108]:
0 1 2 3 4 5
doc
0 0.140609 0.136355 0.314262 0.057756 0.284332 0.066686
1 0.025672 0.333158 0.311374 0.018236 0.000245 0.311316
In [109]:
predictions = probs.idxmax(axis=1).apply(lambda x: my_tags[x])
In [110]:
predictions[0:2]
Out[110]:
doc
0    comedy
1    action
dtype: object
In [111]:
tag_index = 0
col_name = "out-of-sample prob positive for " + my_tags[tag_index]
probpos = pd.DataFrame({col_name:probs[[tag_index]].sum(axis=1),
                        "true genres": [r['y'] for r in revtest]})
probpos.boxplot(col_name,by="true genres", figsize=(12,5))
Out[111]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f9d8e64ca58>

The box plot above summarizes the out-of-sample probability of each plot being a sci-fi movie, grouped by its true genre.

In [112]:
probpos[0:2]
Out[112]:
out-of-sample prob positive for sci-fi true genres
doc
0 0.140609 comedy
1 0.025672 romance
In [113]:
target = [r['y'] for r in revtest]
In [114]:
word_embeddings.evaluate_prediction(predictions, target, my_tags, "Deep IR with word2vec")
accuracy 0.341563786008
confusion matrix
 [[14  2  6  5  7  8]
 [ 4  7  3  7  6  4]
 [11  9 36  7 18  5]
 [ 8  2  1  2  1  2]
 [ 6  4  9  5 10  1]
 [ 6  4  5  1  3 14]]
(row=expected, col=predicted)

To Be continued

Exercise: Can you explain why we got such poor results for Deep IR? (Hint: the training set contains only about 150K words in total, so each genre-specific model sees roughly 25K words on average.)

Exercise: Feel free to test other approaches proposed by the original authors here

Exercise: Experiment with other types of classifiers, such as random forests and gradient boosting, and, if possible, deep neural networks. More information

Exercise: Try different preprocessing methods, such as stopword removal and the other techniques you learned in the Text Mining course

Exercise: Create better visualizations (it is always good to know your data)