The goal of this notebook is to implement a logistic regression classifier from scratch. I will:
- extract bag-of-words features for a set of important words from Amazon baby product reviews,
- implement the link (sigmoid) function and the derivative of the log likelihood with respect to a single coefficient,
- write a gradient ascent solver to learn the coefficients,
- use the learned coefficients to predict sentiment and measure classification accuracy, and
- identify the words most strongly associated with positive and negative reviews.
import turicreate
For this project, I will be using a subset of the Amazon product reviews dataset. The subset is chosen to contain similar numbers of positive and negative reviews, as the original dataset consisted primarily of positive reviews.
Note that the sentiment assigned is based on rating. I have ignored all reviews with rating=3, since they tend to have neutral sentiment.
Reviews with a rating of 4 or higher are assigned to the positive class, while those with a rating of 2 or lower are negative. In the sentiment column, +1 is used for the positive class label and -1 for the negative class label.
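For reference, this labeling rule could be expressed with an SFrame apply. The following is a minimal sketch, assuming a raw SFrame named raw_reviews with a rating column (the subset loaded below already contains the sentiment column, so this step is not needed here):

raw_reviews = raw_reviews[raw_reviews['rating'] != 3]          # drop neutral (rating=3) reviews
raw_reviews['sentiment'] = raw_reviews['rating'].apply(
    lambda rating: +1 if rating >= 4 else -1)                  # 4-5 stars -> +1, 1-2 stars -> -1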
products = turicreate.SFrame('amazon_baby_subset.sframe/')
#checking out the dataset
products.head()
name | review | rating | sentiment |
---|---|---|---|
Stop Pacifier Sucking without tears with ... | All of my kids have cried non-stop when I tried to ... | 5.0 | 1 |
Nature's Lullabies Second Year Sticker Calendar ... | We wanted to get something to keep track ... | 5.0 | 1 |
Nature's Lullabies Second Year Sticker Calendar ... | My daughter had her 1st baby over a year ago. ... | 5.0 | 1 |
Lamaze Peekaboo, I Love You ... | One of baby's first and favorite books, and i ... | 4.0 | 1 |
SoftPlay Peek-A-Boo Where's Elmo A Childr ... | Very cute interactive book! My son loves this ... | 5.0 | 1 |
Our Baby Girl Memory Book | Beautiful book, I love it to record cherished t ... | 5.0 | 1 |
Hunnt® Falling Flowers and Birds Kids ... | Try this out for a spring project !Easy ,fun and ... | 5.0 | 1 |
Blessed By Pope Benedict XVI Divine Mercy Full ... | very nice Divine Mercy Pendant of Jesus now on ... | 5.0 | 1 |
Cloth Diaper Pins Stainless Steel ... | We bought the pins as my 6 year old Autistic son ... | 4.0 | 1 |
Cloth Diaper Pins Stainless Steel ... | It has been many years since we needed diaper ... | 5.0 | 1 |
As observed above, one column of this dataset is 'sentiment', corresponding to the class label: +1 indicates a review with positive sentiment and -1 indicates one with negative sentiment.
products['sentiment']
dtype: int Rows: 53072 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... ]
I will now be exploring other columns in the dataset. The 'name' column indicates the name of the product. Following is the list of the first 10 products in the dataset. I will then count the number of positive and negative reviews.
products.head(10)['name']
dtype: str Rows: 10 ["Stop Pacifier Sucking without tears with Thumbuddy To Love's Binky Fairy Puppet and Adorable Book", "Nature's Lullabies Second Year Sticker Calendar", "Nature's Lullabies Second Year Sticker Calendar", 'Lamaze Peekaboo, I Love You', "SoftPlay Peek-A-Boo Where's Elmo A Children's Book", 'Our Baby Girl Memory Book', 'Hunnt® Falling Flowers and Birds Kids Nursery Home Decor Vinyl Mural Art Wall Paper Stickers', 'Blessed By Pope Benedict XVI Divine Mercy Full Color Medal', 'Cloth Diaper Pins Stainless Steel Traditional Safety Pin (Black)', 'Cloth Diaper Pins Stainless Steel Traditional Safety Pin (Black)']
print('# of positive reviews =', len(products[products['sentiment']==1]))
print('# of negative reviews =', len(products[products['sentiment']==-1]))
# of positive reviews = 26579 # of negative reviews = 26493
In this section, I will be performing some simple feature cleaning using SFrames. Instead of using all of the words for building bag-of-words features, I will limit the dataset to 193 words (for simplicity). I compiled a list of the 193 most frequent words into a JSON file.
Now, I will load these words from this JSON file:
import json
with open('important_words.json', 'r') as f: # Reads the list of most frequent words
important_words = json.load(f)
important_words = [str(s) for s in important_words]
print(important_words)
['baby', 'one', 'great', 'love', 'use', 'would', 'like', 'easy', 'little', 'seat', 'old', 'well', 'get', 'also', 'really', 'son', 'time', 'bought', 'product', 'good', 'daughter', 'much', 'loves', 'stroller', 'put', 'months', 'car', 'still', 'back', 'used', 'recommend', 'first', 'even', 'perfect', 'nice', 'bag', 'two', 'using', 'got', 'fit', 'around', 'diaper', 'enough', 'month', 'price', 'go', 'could', 'soft', 'since', 'buy', 'room', 'works', 'made', 'child', 'keep', 'size', 'small', 'need', 'year', 'big', 'make', 'take', 'easily', 'think', 'crib', 'clean', 'way', 'quality', 'thing', 'better', 'without', 'set', 'new', 'every', 'cute', 'best', 'bottles', 'work', 'purchased', 'right', 'lot', 'side', 'happy', 'comfortable', 'toy', 'able', 'kids', 'bit', 'night', 'long', 'fits', 'see', 'us', 'another', 'play', 'day', 'money', 'monitor', 'tried', 'thought', 'never', 'item', 'hard', 'plastic', 'however', 'disappointed', 'reviews', 'something', 'going', 'pump', 'bottle', 'cup', 'waste', 'return', 'amazon', 'different', 'top', 'want', 'problem', 'know', 'water', 'try', 'received', 'sure', 'times', 'chair', 'find', 'hold', 'gate', 'open', 'bottom', 'away', 'actually', 'cheap', 'worked', 'getting', 'ordered', 'came', 'milk', 'bad', 'part', 'worth', 'found', 'cover', 'many', 'design', 'looking', 'weeks', 'say', 'wanted', 'look', 'place', 'purchase', 'looks', 'second', 'piece', 'box', 'pretty', 'trying', 'difficult', 'together', 'though', 'give', 'started', 'anything', 'last', 'company', 'come', 'returned', 'maybe', 'took', 'broke', 'makes', 'stay', 'instead', 'idea', 'head', 'said', 'less', 'went', 'working', 'high', 'unit', 'seems', 'picture', 'completely', 'wish', 'buying', 'babies', 'won', 'tub', 'almost', 'either']
Now, I will perform two simple data transformations:
1. Remove punctuation from the review text.
2. Compute word counts (only for the words in important_words).
I will start with Step 1, which can be done as follows:
import string
def remove_punctuation(text):
    try:  # Python 2.x: str.translate accepts a deletechars argument
        text = text.translate(None, string.punctuation)
    except TypeError:  # Python 3.x: build a translation table with str.maketrans instead
        translator = text.maketrans('', '', string.punctuation)
        text = text.translate(translator)
    return text
products['review_clean'] = products['review'].apply(remove_punctuation)
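As a quick check, applying remove_punctuation to one of the raw reviews shown above strips the '!' character:

print(remove_punctuation('Very cute interactive book! My son loves this'))
# Very cute interactive book My son loves this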
Now I will proceed with Step 2. For each word in important_words, I will compute the number of times the word occurs in each review and store that count in a separate column, so the result of this feature processing is one count column per word in important_words.
for word in important_words:
products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
#looking at the revised dataset with word counts
products.head()
name | review | rating | sentiment | review_clean | baby |
---|---|---|---|---|---|
Stop Pacifier Sucking without tears with ... | All of my kids have cried non-stop when I tried to ... | 5.0 | 1 | All of my kids have cried nonstop when I tried to ... | 0 |
Nature's Lullabies Second Year Sticker Calendar ... | We wanted to get something to keep track ... | 5.0 | 1 | We wanted to get something to keep track ... | 0 |
Nature's Lullabies Second Year Sticker Calendar ... | My daughter had her 1st baby over a year ago. ... | 5.0 | 1 | My daughter had her 1st baby over a year ago She ... | 1 |
Lamaze Peekaboo, I Love You ... | One of baby's first and favorite books, and i ... | 4.0 | 1 | One of babys first and favorite books and it is ... | 0 |
SoftPlay Peek-A-Boo Where's Elmo A Childr ... | Very cute interactive book! My son loves this ... | 5.0 | 1 | Very cute interactive book My son loves this ... | 0 |
Our Baby Girl Memory Book | Beautiful book, I love it to record cherished t ... | 5.0 | 1 | Beautiful book I love it to record cherished t ... | 0 |
Hunnt® Falling Flowers and Birds Kids ... | Try this out for a spring project !Easy ,fun and ... | 5.0 | 1 | Try this out for a spring project Easy fun and ... | 0 |
Blessed By Pope Benedict XVI Divine Mercy Full ... | very nice Divine Mercy Pendant of Jesus now on ... | 5.0 | 1 | very nice Divine Mercy Pendant of Jesus now on ... | 0 |
Cloth Diaper Pins Stainless Steel ... | We bought the pins as my 6 year old Autistic son ... | 4.0 | 1 | We bought the pins as my 6 year old Autistic son ... | 0 |
Cloth Diaper Pins Stainless Steel ... | It has been many years since we needed diaper ... | 5.0 | 1 | It has been many years since we needed diaper ... | 0 |
one | great | love | use | would | like | easy | little | seat | old | well | get | also | really | son | time | bought |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
product | good | daughter | much | loves | stroller | put | months | car | still | back | used | recommend | first | even |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
perfect | nice | ... |
---|---|---|
0 | 0 | ... |
0 | 0 | ... |
0 | 1 | ... |
1 | 0 | ... |
0 | 0 | ... |
0 | 0 | ... |
0 | 0 | ... |
0 | 1 | ... |
0 | 0 | ... |
0 | 0 | ... |
As seen above, the SFrame products now contains one column for each of the 193 important_words. As an example, the column perfect contains the number of times the word perfect occurs in each review.
I will now write some code to compute the number of product reviews that contain the word perfect.
contains_perfect=products['perfect'].apply(lambda x: 1 if x>=1 else 0)
print(sum(contains_perfect))
2955
As seen above, 2955 reviews contain the word perfect.
import numpy as np
I will now write a function that extracts columns from an SFrame and converts them into a NumPy array. Two arrays are returned: one representing features and another representing class labels. Note that the feature matrix includes an additional column 'intercept' to account for the intercept term.
def get_numpy_data(data_sframe, features, label):
    data_sframe['intercept'] = 1                 # constant column of 1s for the intercept term
    features = ['intercept'] + features          # prepend the intercept to the feature list
    features_sframe = data_sframe[features]      # select only the requested columns
    feature_matrix = features_sframe.to_numpy()  # convert the features to a 2-D NumPy array
    label_sarray = data_sframe[label]            # select the label column
    label_array = label_sarray.to_numpy()        # convert the labels to a 1-D NumPy array
    return(feature_matrix, label_array)
Converting data into NumPy arrays.
feature_matrix, sentiment = get_numpy_data(products, important_words, 'sentiment')
feature_matrix
array([[1, 0, 0, ..., 0, 0, 0], [1, 0, 0, ..., 0, 0, 0], [1, 1, 0, ..., 0, 0, 0], ..., [1, 0, 0, ..., 0, 0, 0], [1, 0, 1, ..., 0, 0, 0], [1, 0, 0, ..., 0, 0, 0]])
feature_matrix.shape
(53072, 194)
As seen above, there are 194 features in the feature_matrix: the 193 important words plus the intercept.
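A quick sanity check (a minimal sketch) confirms this count:

assert feature_matrix.shape[1] == len(important_words) + 1   # 193 word-count columns + 1 intercept column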
Now, let us see what the sentiment column looks like:
sentiment
array([ 1, 1, 1, ..., -1, -1, -1])
A linear classifier uses training data to associate each feature (in our case, each word) with a weight, or coefficient.
Each weight shows how positively (or negatively) influential a word is. For example, the word good may have a weight of 1, while the word great may have a weight of 1.5 and the word awesome may have a weight of 2.
This will be illustrated further once we learn weights from our training data.
Note that this is called a linear classifier because the output is a weighted sum of the inputs.
For each review, we calculate this weighted sum by multiplying the weight of each word by its count of occurrences in the review, and we call the result the score of the review.
The score is given by:
$$ \mathbf{w}^T h(\mathbf{x}_i) $$
where the feature vector $h(\mathbf{x}_i)$ represents the word counts of important_words in the review $\mathbf{x}_i$ and $\mathbf{w}$ is the vector of learned weights.
If the score for a review is greater than 0, we predict its sentiment to be +1; if the score is less than or equal to 0, we predict -1.
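As a small worked example with the hypothetical weights above, a review containing good once and great twice (and none of the other words) would have score
$$ 1 \times 1 + 2 \times 1.5 = 4 > 0, $$
so its predicted sentiment would be +1.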
We now want to quantify how sure we are of the predicted sentiment, given a set of learned coefficients and the corresponding score.
There are two ends to the spectrum of scores. If the score is very large and positive, i.e. approaching infinity, we are very sure that $$ P(y_i = +1 | \mathbf{x}_i) = 1 $$
Similarly, if the score is very negative, i.e. approaching negative infinity, we are very sure that
$$ P(y_i = +1 | \mathbf{x}_i) = 0 $$
If the score is 0, we cannot tell whether the sentiment is positive or negative, which can be written as
$$ P(y_i = +1 | \mathbf{x}_i) = 0.5 $$
To squeeze the real-valued score into a probability between 0 and 1, we use a link function.
The link function we will be using is the sigmoid function:
$$ \mathrm{sigmoid}(\mathrm{score}) = \frac{1}{1 + \exp(-\mathrm{score})} $$
where score is $\mathbf{w}^T h(\mathbf{x}_i)$.
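A quick numerical check (a minimal sketch) illustrates the behavior of the sigmoid at the two extremes and at zero:

import numpy as np
sigmoid = lambda score: 1. / (1. + np.exp(-score))
print(sigmoid(0.))    # 0.5  -> completely unsure
print(sigmoid(20.))   # ~1.0 -> very sure the sentiment is +1
print(sigmoid(-20.))  # ~0.0 -> very sure the sentiment is -1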
The link function is given by: $$ P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}, $$
where the feature vector $h(\mathbf{x}_i)$ represents the word counts of important_words in the review $\mathbf{x}_i$.
Here we are predicting the probability that a review is positive, given the set of coefficients, the feature_matrix, and the corresponding reviews.
Following is the code that implements the link function:
def predict_probability(feature_matrix, coefficients):
    '''
    Produces a probabilistic estimate for P(y_i = +1 | x_i, w).
    The estimate ranges between 0 and 1.
    '''
    # Dot product of feature_matrix and coefficients gives the score of each data point
    dot_product = np.dot(feature_matrix, coefficients)
    # Compute P(y_i = +1 | x_i, w) using the link (sigmoid) function
    predictions = 1/(1+np.exp(-dot_product))
    # Return the predicted probabilities
    return predictions
Following is how the link function works with matrix algebra
Since the word counts are stored as columns in feature_matrix, each $i$-th row of the matrix corresponds to the feature vector $h(\mathbf{x}_i)$: $$ [\text{feature_matrix}] = \left[ \begin{array}{c} h(\mathbf{x}_1)^T \\ h(\mathbf{x}_2)^T \\ \vdots \\ h(\mathbf{x}_N)^T \end{array} \right] = \left[ \begin{array}{cccc} h_0(\mathbf{x}_1) & h_1(\mathbf{x}_1) & \cdots & h_D(\mathbf{x}_1) \\ h_0(\mathbf{x}_2) & h_1(\mathbf{x}_2) & \cdots & h_D(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ h_0(\mathbf{x}_N) & h_1(\mathbf{x}_N) & \cdots & h_D(\mathbf{x}_N) \end{array} \right] $$
By the rules of matrix multiplication, the score vector containing elements $\mathbf{w}^T h(\mathbf{x}_i)$ is obtained by multiplying feature_matrix and the coefficient vector $\mathbf{w}$. $$ [\text{score}] = [\text{feature_matrix}]\mathbf{w} = \left[ \begin{array}{c} h(\mathbf{x}_1)^T \\ h(\mathbf{x}_2)^T \\ \vdots \\ h(\mathbf{x}_N)^T \end{array} \right] \mathbf{w} = \left[ \begin{array}{c} h(\mathbf{x}_1)^T\mathbf{w} \\ h(\mathbf{x}_2)^T\mathbf{w} \\ \vdots \\ h(\mathbf{x}_N)^T\mathbf{w} \end{array} \right] = \left[ \begin{array}{c} \mathbf{w}^T h(\mathbf{x}_1) \\ \mathbf{w}^T h(\mathbf{x}_2) \\ \vdots \\ \mathbf{w}^T h(\mathbf{x}_N) \end{array} \right] $$
Following is a test of whether the predictions of the previously defined probability function match hand-computed values:
dummy_feature_matrix = np.array([[1.,2.,3.], [1.,-1.,-1]])
dummy_coefficients = np.array([1., 3., -1.])
correct_scores = np.array( [ 1.*1. + 2.*3. + 3.*(-1.), 1.*1. + (-1.)*3. + (-1.)*(-1.) ] )
correct_predictions = np.array( [ 1./(1+np.exp(-correct_scores[0])), 1./(1+np.exp(-correct_scores[1])) ] )
print('The following outputs must match ')
print('------------------------------------------------')
print('correct_predictions =', correct_predictions)
print('output of predict_probability =', predict_probability(dummy_feature_matrix, dummy_coefficients))
The following outputs must match ------------------------------------------------ correct_predictions = [0.98201379 0.26894142] output of predict_probability = [0.98201379 0.26894142]
I will now write a function that computes the derivative of the log likelihood with respect to a single coefficient $w_j$:
$$ \frac{\partial\ell\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) $$
The function accepts two arguments:
- errors: vector containing $\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})$ for all $i$.
- feature: vector containing $h_j(\mathbf{x}_i)$ for all $i$.
def feature_derivative(errors, feature):
    # The derivative is the dot product of errors and feature
    derivative = np.dot(errors, feature)
    # Returning the derivative
    return derivative
Although we could work with the likelihood directly, a transformation of it, called the log likelihood, simplifies the derivation of the gradient and is more numerically stable. Due to its numerical stability, I will use the log likelihood instead of the likelihood to assess the algorithm.
The log likelihood is computed using the following formula:
$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) $$
Following is the function to compute the log likelihood:
def compute_log_likelihood(feature_matrix, sentiment, coefficients):
    indicator = (sentiment==+1)
    scores = np.dot(feature_matrix, coefficients)
    logexp = np.log(1. + np.exp(-scores))
    # Simple check to prevent overflow: when -scores is very large, exp(-scores) overflows to inf,
    # but in that regime ln(1 + exp(-scores)) is well approximated by -scores
    mask = np.isinf(logexp)
    logexp[mask] = -scores[mask]
    lp = np.sum((indicator-1)*scores - logexp)
    return lp
Just to make sure things are running smoothly, I will run the following code block and check that the outputs match.
dummy_feature_matrix = np.array([[1.,2.,3.], [1.,-1.,-1]])
dummy_coefficients = np.array([1., 3., -1.])
dummy_sentiment = np.array([-1, 1])
correct_indicators = np.array( [ -1==+1, 1==+1 ] )
correct_scores = np.array( [ 1.*1. + 2.*3. + 3.*(-1.), 1.*1. + (-1.)*3. + (-1.)*(-1.) ] )
correct_first_term = np.array( [ (correct_indicators[0]-1)*correct_scores[0], (correct_indicators[1]-1)*correct_scores[1] ] )
correct_second_term = np.array( [ np.log(1. + np.exp(-correct_scores[0])), np.log(1. + np.exp(-correct_scores[1])) ] )
correct_ll = sum( [ correct_first_term[0]-correct_second_term[0], correct_first_term[1]-correct_second_term[1] ] )
print('The following outputs must match ')
print('------------------------------------------------')
print('correct_log_likelihood =', correct_ll)
print('output of compute_log_likelihood =', compute_log_likelihood(dummy_feature_matrix, dummy_sentiment, dummy_coefficients))
The following outputs must match ------------------------------------------------ correct_log_likelihood = -5.331411615436032 output of compute_log_likelihood = -5.331411615436032
Now I am ready to implement my own logistic regression. All I have to do is to write a gradient ascent function that takes gradient steps towards the optimum.
from math import sqrt
def logistic_regression(feature_matrix, sentiment, initial_coefficients, step_size, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    for itr in range(max_iter):
        # Predict P(y_i = +1|x_i,w) using the current coefficients
        predictions = predict_probability(feature_matrix, coefficients)
        # Compute indicator value for (y_i = +1)
        indicator = (sentiment==+1)
        # Compute the errors as indicator - predictions
        errors = indicator - predictions
        for j in range(len(coefficients)): # loop over each coefficient
            # feature_matrix[:,j] is the feature column associated with coefficients[j]
            derivative = feature_derivative(errors, feature_matrix[:,j])
            # Take a gradient ascent step for coefficients[j]
            coefficients[j] += step_size * derivative
        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood(feature_matrix, sentiment, coefficients)
            print('iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp))
    return coefficients
Now, I will run the logistic regression solver.
coefficients = logistic_regression(feature_matrix, sentiment, initial_coefficients=np.zeros(194),
step_size=1e-7, max_iter=301)
iteration 0: log likelihood of observed labels = -36780.91768478 iteration 1: log likelihood of observed labels = -36775.13127954 iteration 2: log likelihood of observed labels = -36769.34795095 iteration 3: log likelihood of observed labels = -36763.56769899 iteration 4: log likelihood of observed labels = -36757.79052366 iteration 5: log likelihood of observed labels = -36752.01642492 iteration 6: log likelihood of observed labels = -36746.24540276 iteration 7: log likelihood of observed labels = -36740.47745714 iteration 8: log likelihood of observed labels = -36734.71258803 iteration 9: log likelihood of observed labels = -36728.95079539 iteration 10: log likelihood of observed labels = -36723.19207918 iteration 11: log likelihood of observed labels = -36717.43643934 iteration 12: log likelihood of observed labels = -36711.68387583 iteration 13: log likelihood of observed labels = -36705.93438858 iteration 14: log likelihood of observed labels = -36700.18797754 iteration 15: log likelihood of observed labels = -36694.44464262 iteration 20: log likelihood of observed labels = -36665.77410742 iteration 30: log likelihood of observed labels = -36608.66369535 iteration 40: log likelihood of observed labels = -36551.86072283 iteration 50: log likelihood of observed labels = -36495.36502431 iteration 60: log likelihood of observed labels = -36439.17638957 iteration 70: log likelihood of observed labels = -36383.29456460 iteration 80: log likelihood of observed labels = -36327.71925271 iteration 90: log likelihood of observed labels = -36272.45011567 iteration 100: log likelihood of observed labels = -36217.48677511 iteration 200: log likelihood of observed labels = -35684.56313478 iteration 300: log likelihood of observed labels = -35181.56106069
Class predictions for a data point $\mathbf{x}$ can be computed from the coefficients $\mathbf{w}$ using the following formula: $$ \hat{y}_i = \left\{ \begin{array}{ll} +1 & \mathbf{x}_i^T\mathbf{w} > 0 \\ -1 & \mathbf{x}_i^T\mathbf{w} \leq 0 \\ \end{array} \right. $$
I will write some code to compute class predictions. I will do this in two steps:
1. Compute the scores as a dot product between feature_matrix and the learned coefficients.
2. Threshold the scores at 0: scores greater than 0 map to class +1, and the rest to class -1.
Step 1 can be implemented as follows:
# Compute the scores as a dot product between feature_matrix and coefficients.
scores = np.dot(feature_matrix, coefficients)
# scores.shape
Now Step 2 to compute the class predictions using the scores obtained above:
class_predictions = []
for i in range(len(scores)):
    if scores[i] > 0:
        class_predictions.append(+1)
    else:
        class_predictions.append(-1)
class_predictions = np.array(class_predictions)
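Equivalently, the class predictions can be computed without an explicit loop; a minimal sketch using np.where:

class_predictions = np.where(scores > 0, 1, -1)   # +1 wherever the score is positive, -1 otherwise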
The following code block shows that 21348 (about 40%) of the reviews are predicted to have positive sentiment.
positive =0
for i in range(0,len(class_predictions)):
if class_predictions[i]==+1:
positive +=1
print(positive)
21348
correct=0
incorrect=0
for i in range(0,len(sentiment)):
if sentiment[i]==class_predictions[i]:
correct +=1
else:
incorrect +=1
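The same counts can also be obtained with vectorized NumPy operations (a minimal sketch):

correct = int(np.sum(class_predictions == sentiment))   # number of correctly classified reviews
incorrect = len(sentiment) - correct                    # number of misclassified reviews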
Now I will measure the classification accuracy of the model. Classification accuracy can be computed as follows:
$$ \mbox{accuracy} = \frac{\mbox{# correctly classified data points}}{\mbox{# total data points}} $$
num_mistakes = incorrect
accuracy = correct/len(sentiment)
print("-----------------------------------------------------")
print('# Reviews correctly classified =', len(products) - num_mistakes)
print('# Reviews incorrectly classified =', num_mistakes)
print('# Reviews total =', len(products))
print("-----------------------------------------------------")
print('Accuracy = %.2f' % accuracy)
----------------------------------------------------- # Reviews correctly classified = 39325 # Reviews incorrectly classified = 13747 # Reviews total = 53072 ----------------------------------------------------- Accuracy = 0.74
The accuracy of the model on the predictions made above is 74%.
To find the words that correspond most strongly with positive reviews, I will first do the following:
coefficients = list(coefficients[1:]) # exclude intercept
word_coefficient_tuples = [(word, coefficient) for word, coefficient in zip(important_words, coefficients)]
word_coefficient_tuples = sorted(word_coefficient_tuples, key=lambda x:x[1], reverse=True)
Now, word_coefficient_tuples contains a sorted list of (word, coefficient_value) tuples. The first 10 elements in this list correspond to the words that are most positive.
The 10 words that have the most positive coefficient values are associated with positive sentiment.
word_coefficient_tuples
[('great', 0.06927514999999965), ('love', 0.06900425000000007), ('easy', 0.06748420000000005), ('little', 0.04679045000000026), ('loves', 0.046414200000000044), ('well', 0.030355849999999917), ('perfect', 0.030355849999999917), ('old', 0.020272350000000126), ('nice', 0.018481399999999922), ('soft', 0.017954649999999985), ('daughter', 0.017683749999999956), ('fits', 0.017307500000000003), ('happy', 0.016856000000000055), ('best', 0.014929600000000069), ('also', 0.01491455000000008), ('recommend', 0.014869399999999946), ('baby', 0.01398144999999999), ('comfortable', 0.013454699999999986), ('car', 0.013213899999999966), ('clean', 0.012280800000000036), ('bit', 0.011618599999999958), ('works', 0.011498200000000073), ('son', 0.011437999999999962), ('stroller', 0.011061750000000054), ('size', 0.010730649999999979), ('play', 0.009225650000000026), ('price', 0.00921060000000007), ('room', 0.009165449999999927), ('easily', 0.00907514999999994), ('kids', 0.008488200000000001), ('lot', 0.007615299999999957), ('still', 0.007284200000000025), ('around', 0.007028350000000027), ('need', 0.006501600000000022), ('take', 0.006140400000000018), ('keep', 0.005944749999999961), ('crib', 0.005598599999999981), ('cute', 0.005568500000000013), ('year', 0.0055534499999999876), ('without', 0.005282549999999977), ('set', 0.005237399999999982), ('big', 0.004379549999999989), ('seat', 0.00418389999999998), ('diaper', 0.003897949999999982), ('wish', 0.0036571499999999827), ('use', 0.0031002999999999946), ('though', 0.0027842500000000063), ('babies', 0.0026488000000000167), ('seems', 0.0024682000000000016), ('bag', 0.0023929500000000074), ('enough', 0.0023478000000000114), ('every', 0.002257499999999995), ('able', 0.0022274000000000018), ('many', 0.0020768999999999974), ('makes', 0.002061850000000011), ('pretty', 0.001881249999999992), ('night', 0.0017909500000000025), ('toy', 0.0017307499999999869), ('long', 0.0015802500000000092), ('good', 0.0013695500000000032), ('looking', 0.0012341000000000008), ('us', 0.0007976499999999969), ('think', 0.0007675500000000023), ('purchase', 0.0007675500000000023), ('since', 0.0006020000000000016), ('cover', 0.0005568500000000004), ('won', 0.00012039999999999963), ('looks', -0.00013544999999999982), ('found', -0.0002257499999999991), ('put', -0.0003160499999999996), ('high', -0.0005417999999999993), ('used', -0.0008427999999999955), ('chair', -0.0008427999999999955), ('go', -0.0009782499999999995), ('day', -0.0010685499999999926), ('really', -0.0013093499999999956), ('bottles', -0.0015200499999999959), ('worth', -0.0015200499999999959), ('almost', -0.0015200499999999959), ('side', -0.0016705499999999998), ('hold', -0.0017608499999999902), ('using', -0.0019414500000000095), ('look', -0.002061850000000011), ('amazon', -0.002137099999999985), ('sure', -0.002483250000000003), ('month', -0.0025735499999999917), ('months', -0.002588600000000018), ('find', -0.0027089999999999797), ('getting', -0.0029648500000000124), ('come', -0.003070200000000009), ('head', -0.0033862499999999934), ('small', -0.0035367500000000104), ('second', -0.0038076499999999784), ('place', -0.0038527999999999883), ('together', -0.003897949999999982), ('give', -0.004289249999999999), ('want', -0.004469849999999993), ('wanted', -0.00451499999999999), ('say', -0.0045300499999999895), ('took', -0.004996600000000007), ('know', -0.00511699999999999), ('however', -0.005417999999999959), ('fit', -0.005568500000000013), ('purchased', -0.005703950000000043), ('see', -0.00579425000000003), ('came', -0.005824349999999959), 
('different', -0.005944749999999961), ('buying', -0.006004950000000025), ('gate', -0.006125349999999979), ('last', -0.006185550000000014), ('much', -0.006290899999999979), ('bottle', -0.006351099999999957), ('less', -0.00636615000000003), ('like', -0.006381199999999975), ('actually', -0.006426349999999969), ('make', -0.00656179999999999), ('new', -0.006727349999999993), ('instead', -0.006847750000000042), ('tub', -0.006862799999999976), ('maybe', -0.0068778499999999545), ('started', -0.006998249999999967), ('water', -0.007193900000000036), ('child', -0.007329350000000019), ('right', -0.007524999999999968), ('problem', -0.007555100000000024), ('either', -0.007615299999999957), ('said', -0.007660449999999984), ('went', -0.007735700000000008), ('part', -0.0077507499999999695), ('ordered', -0.007765800000000038), ('top', -0.007871149999999992), ('bottom', -0.007916299999999984), ('anything', -0.007916299999999984), ('quality', -0.007976499999999952), ('weeks', -0.00829255000000003), ('design', -0.00835275), ('made', -0.00854839999999994), ('times', -0.008713949999999958), ('picture', -0.008759099999999978), ('away', -0.00907514999999994), ('stay', -0.009586849999999984), ('pump', -0.009827650000000064), ('open', -0.009917950000000054), ('cup', -0.009948049999999964), ('worked', -0.01023399999999998), ('milk', -0.010414599999999956), ('completely', -0.010685500000000053), ('trying', -0.010715600000000053), ('difficult', -0.010760749999999973), ('piece', -0.011121949999999922), ('box', -0.01148315000000004), ('got', -0.011949700000000037), ('try', -0.012055049999999953), ('going', -0.012115250000000067), ('another', -0.012416249999999926), ('two', -0.012506549999999913), ('idea', -0.012611899999999928), ('unit', -0.01273230000000006), ('working', -0.012807550000000046), ('company', -0.013544999999999974), ('received', -0.013575099999999934), ('bad', -0.013575099999999934), ('one', -0.01392124999999994), ('something', -0.01398144999999999), ('bought', -0.014237300000000029), ('never', -0.014417900000000006), ('hard', -0.014839300000000081), ('cheap', -0.015381100000000028), ('thing', -0.015636950000000052), ('first', -0.015757349999999917), ('broke', -0.016163700000000055), ('plastic', -0.01628409999999994), ('reviews', -0.016404500000000075), ('returned', -0.016630249999999947), ('better', -0.016765700000000033), ('item', -0.01839110000000001), ('buy', -0.019610149999999982), ('way', -0.020317500000000082), ('tried', -0.020483050000000013), ('time', -0.022258949999999934), ('could', -0.022845900000000065), ('thought', -0.023101750000000105), ('waste', -0.025193699999999958), ('return', -0.027977950000000137), ('monitor', -0.028324099999999935), ('disappointed', -0.03005484999999982), ('back', -0.031589949999999895), ('even', -0.03347119999999984), ('get', -0.03396785000000007), ('work', -0.03619524999999997), ('money', -0.04141760000000029), ('product', -0.047452650000000124), ('would', -0.06381199999999962)]
The ten most positive words are: great, love, easy, little, loves, well, perfect, old, nice, and soft.
The ten most negative words are: would, product, money, work, get, even, back, disappointed, monitor, and return.