Implementing logistic regression

The goal of this notebook is to implement a logistic regression classifier. I will:

  1. Load and clean a subset of Amazon product reviews.
  2. Extract bag-of-words features for a fixed list of important words.
  3. Implement the link function and the derivative of the log likelihood.
  4. Fit the model with gradient ascent and evaluate its accuracy.

Firing up Turi Create
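In code, this step amounts to a single import:

```python
import turicreate
```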

Loading review dataset

For this project, I will be using a subset of the Amazon product reviews dataset. The subset is chosen to contain similar numbers of positive and negative reviews, as the original dataset consisted primarily of positive reviews.
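As a minimal sketch of loading the data (the file name below is an assumption; substitute the path to your copy of the subset):

```python
# Load the review subset into an SFrame (file name assumed for illustration)
products = turicreate.SFrame('amazon_baby_subset.sframe')
products.head()
```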

Note that the sentiment assigned is based on rating. I have ignored all reviews with rating=3, since they tend to have neutral sentiment.

Reviews with a rating of 4 or higher are labeled positive, while those with a rating of 2 or lower are labeled negative. In the sentiment column, I used +1 for the positive class label and -1 for the negative class label.
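If the sentiment column needed to be derived from the ratings, a sketch of the transformation described above would look like this:

```python
# Drop neutral reviews (rating == 3), then derive the sentiment label:
# rating >= 4 -> +1, rating <= 2 -> -1
products = products[products['rating'] != 3]
products['sentiment'] = products['rating'].apply(lambda rating: +1 if rating > 3 else -1)
```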

As observed above, one column of this dataset is 'sentiment', corresponding to the class label, with +1 indicating a review with positive sentiment and -1 indicating one with negative sentiment.

I will now explore other columns in the dataset. The 'name' column gives the name of the product. Below are the first 10 products in the dataset, followed by a count of the positive and negative reviews.
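A short sketch of this exploration, using the column names described above:

```python
# First 10 product names
print(products.head(10)['name'])

# Count positive and negative reviews
print('# positive reviews:', (products['sentiment'] == +1).sum())
print('# negative reviews:', (products['sentiment'] == -1).sum())
```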

Applying text cleaning on the review data

In this section, I will perform some simple feature cleaning using SFrames. Instead of using all of the words for building bag-of-words features, I will limit the dataset to 193 words (for simplicity). I compiled a list of the 193 most frequent words into a JSON file.

Now, I will load these words from this JSON file:
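A minimal sketch of loading the word list (the JSON file name is an assumption):

```python
import json

# The file is assumed to contain a JSON array of 193 words
with open('important_words.json') as f:
    important_words = json.load(f)

print(len(important_words))  # 193
```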

Now, I will perform 2 simple data transformations:

  1. Remove punctuation using Python's built-in string functionality.
  2. Compute word counts (only for important_words). Each count records the number of times a given word occurs in a review; as described below, each important word gets its own count column.

I will start with Step 1, which can be done as follows:
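A sketch of Step 1, using Python's built-in string module to strip punctuation:

```python
import string

def remove_punctuation(text):
    # Delete every punctuation character from the review text
    return text.translate(str.maketrans('', '', string.punctuation))

# Guard against missing reviews, then clean the text
products = products.fillna('review', '')
products['review_clean'] = products['review'].apply(remove_punctuation)
```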

Now I will proceed with Step 2. For each word in important_words, I will compute the number of times the word occurs in each review and store that count in a separate column, so the result of this feature processing is one column per word in important_words.
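A sketch of Step 2, creating one count column per word in important_words:

```python
for word in important_words:
    # Bind `word` via a default argument so each column counts its own word
    products[word] = products['review_clean'].apply(
        lambda text, w=word: text.split().count(w))
```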

As seen above, the SFrame products now contains one column for each of the 193 important_words. For example, the column perfect contains the number of times the word perfect occurs in each review.

I will now write some code to compute the number of product reviews that contain the word perfect.
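A sketch of that count, using the per-word column created above:

```python
# 1 if the review mentions 'perfect' at least once, 0 otherwise
products['contains_perfect'] = products['perfect'].apply(lambda count: 1 if count >= 1 else 0)
print(products['contains_perfect'].sum())
```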

As seen above, 2,955 reviews contain the word perfect.

I will now write a function that extracts columns from an SFrame and converts them into a NumPy array. Two arrays are returned: one representing features and another representing class labels. Note that the feature matrix includes an additional column 'intercept' to account for the intercept term.

Converting data into NumPy arrays
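A sketch of this conversion (get_numpy_data is my label for the function):

```python
import numpy as np

def get_numpy_data(data_sframe, features, label):
    # Add a constant column so the intercept is learned like any other weight
    data_sframe['intercept'] = 1
    features = ['intercept'] + features

    # Features -> 2D array, labels -> 1D array
    feature_matrix = data_sframe[features].to_numpy()
    label_array = data_sframe[label].to_numpy()
    return feature_matrix, label_array

feature_matrix, sentiment = get_numpy_data(products, important_words, 'sentiment')
print(feature_matrix.shape)  # (number of reviews, 194)
```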

As seen above, there are 194 features in the feature_matrix (the 193 important words plus the intercept).

Now, let us see what the sentiment column looks like:

A linear classifier uses training data to associate each feature (in our case, each word) with a weight, or coefficient.

Each weight/coefficient for a word count shows how positively (or negatively) influential that word is. For example, the word good may have a weight of 1, while the word great may have a weight of 1.5 and the word awesome a weight of 2.

This will be illustrated later, once we have learned the weights from our training data.

Note that this is called a linear classifier because its output is a weighted sum of the inputs.

For each review, we calculate this weighted sum by multiplying the weight of each word by its count of occurrences in the review; we call this sum the score of the review.

The score is given by:

$$ \mathbf{w}^T h(\mathbf{x}_i) $$

where the feature vector $h(\mathbf{x}_i)$ represents the word counts of important_words in the review $\mathbf{x}_i$, and $\mathbf{w}$ is the vector of learned weights for all words.

If the score for a review is greater than 0, we predict its sentiment to be +1; if the score is less than 0, we predict the sentiment to be -1.

We now want to quantify how sure we are of our predicted sentiment, given a set of learned coefficients and the corresponding score.

There are two ends to the spectrum of scores. If the score is very large and positive, i.e. approaching infinity, we are very sure that $$ P(y_i = +1 | \mathbf{x}_i) = 1 $$

Similarly, if the score is very negative, i.e. approaching negative infinity, we are very sure that

$$ P(y_i = +1 | \mathbf{x}_i) =0 $$

If the score is 0, we are not sure whether the sentiment is positive or negative, which can be described as follows:

$$ P(y_i = +1 | \mathbf{x}_i) =0.5 $$

To squeeze a score on the real line into a probability between 0 and 1, we use what is called a link function.

The link function we will be using is the sigmoid function, defined as follows:

$$ sigmoid(score) = \frac{1}{1 + \exp(-score)} $$

where score is $\mathbf{w}^T h(\mathbf{x}_i)$

The link function is given by: $$ P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}, $$

where the feature vector $h(\mathbf{x}_i)$ represents the word counts of important_words in the review $\mathbf{x}_i$.

Here we are predicting the probability that a review is positive, given the set of coefficients, the feature_matrix, and the corresponding reviews.

Following is the code that implements the link function:
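A sketch of the link function (predict_probability is my name for it); it computes the scores for all reviews at once and passes them through the sigmoid:

```python
def predict_probability(feature_matrix, coefficients):
    # score_i = w^T h(x_i), computed for every review in one matrix product
    scores = np.dot(feature_matrix, coefficients)
    # The sigmoid squeezes each score into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-scores))
```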

Following is how the link function works with matrix algebra:

Since the word counts are stored as columns in feature_matrix, each $i$-th row of the matrix corresponds to the feature vector $h(\mathbf{x}_i)$: $$ [\text{feature_matrix}] = \left[ \begin{array}{c} h(\mathbf{x}_1)^T \\ h(\mathbf{x}_2)^T \\ \vdots \\ h(\mathbf{x}_N)^T \end{array} \right] = \left[ \begin{array}{cccc} h_0(\mathbf{x}_1) & h_1(\mathbf{x}_1) & \cdots & h_D(\mathbf{x}_1) \\ h_0(\mathbf{x}_2) & h_1(\mathbf{x}_2) & \cdots & h_D(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ h_0(\mathbf{x}_N) & h_1(\mathbf{x}_N) & \cdots & h_D(\mathbf{x}_N) \end{array} \right] $$

By the rules of matrix multiplication, the score vector containing elements $\mathbf{w}^T h(\mathbf{x}_i)$ is obtained by multiplying feature_matrix and the coefficient vector $\mathbf{w}$. $$ [\text{score}] = [\text{feature_matrix}]\mathbf{w} = \left[ \begin{array}{c} h(\mathbf{x}_1)^T \\ h(\mathbf{x}_2)^T \\ \vdots \\ h(\mathbf{x}_N)^T \end{array} \right] \mathbf{w} = \left[ \begin{array}{c} h(\mathbf{x}_1)^T\mathbf{w} \\ h(\mathbf{x}_2)^T\mathbf{w} \\ \vdots \\ h(\mathbf{x}_N)^T\mathbf{w} \end{array} \right] = \left[ \begin{array}{c} \mathbf{w}^T h(\mathbf{x}_1) \\ \mathbf{w}^T h(\mathbf{x}_2) \\ \vdots \\ \mathbf{w}^T h(\mathbf{x}_N) \end{array} \right] $$

Following is a test of whether the predictions match when computed with the previously defined probability function:
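As a small, self-contained check (the dummy values below are my own), the vectorised link function should agree with a row-by-row computation of the same formula:

```python
dummy_feature_matrix = np.array([[1., 2., 3.],
                                 [1., -1., -1.]])
dummy_coefficients = np.array([1., 3., -1.])

# Row-by-row: 1 / (1 + exp(-w^T h(x_i))) for each review
row_by_row = np.array([1. / (1. + np.exp(-np.dot(row, dummy_coefficients)))
                       for row in dummy_feature_matrix])

print(np.allclose(row_by_row,
                  predict_probability(dummy_feature_matrix, dummy_coefficients)))  # True
```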

Computing derivative of log likelihood with respect to a single coefficient

$$ \frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right) $$

I will now write a function that computes the derivative of the log likelihood with respect to a single coefficient $w_j$. Following the formula above, the function accepts two arguments: the vector of errors $\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})$ and the feature column $h_j$.
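A sketch of that derivative (feature_derivative is my name for it); per the formula above, it is just the dot product of the errors with the feature column:

```python
def feature_derivative(errors, feature):
    # errors_i = 1[y_i = +1] - P(y_i = +1 | x_i, w); feature is the column h_j
    return np.dot(errors, feature)
```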

Although we could work with the likelihood directly, a transformation of it, called the log likelihood, simplifies the derivation of the gradient and is more numerically stable. For this reason, I will use the log likelihood instead of the likelihood to assess the algorithm.

The log likelihood is computed using the following formula:

$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) $$

Following is the function to compute the log likelihood:
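A sketch of the log likelihood computation, following the formula above (compute_log_likelihood is my name for it):

```python
def compute_log_likelihood(feature_matrix, sentiment, coefficients):
    indicator = (sentiment == +1).astype(int)
    scores = np.dot(feature_matrix, coefficients)
    # (1[y_i = +1] - 1) * score_i  -  ln(1 + exp(-score_i)), summed over all reviews
    return np.sum((indicator - 1) * scores - np.log(1. + np.exp(-scores)))
```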

Just to make sure things are running smoothly, I will run the following code block and check that the outputs match.

Taking gradient steps

Now I am ready to implement my own logistic regression. All I have to do is write a gradient ascent function that takes gradient steps towards the optimum.
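A sketch of the gradient ascent loop (logistic_regression is my name for it), built from the helpers defined above; it updates every coefficient on each iteration and periodically reports the log likelihood:

```python
def logistic_regression(feature_matrix, sentiment, initial_coefficients,
                        step_size, max_iter):
    coefficients = np.array(initial_coefficients, dtype=float)
    for itr in range(max_iter):
        # P(y_i = +1 | x_i, w) for every review
        predictions = predict_probability(feature_matrix, coefficients)
        indicator = (sentiment == +1).astype(int)
        errors = indicator - predictions

        # Take a gradient step for each coefficient w_j
        for j in range(len(coefficients)):
            derivative = feature_derivative(errors, feature_matrix[:, j])
            coefficients[j] += step_size * derivative

        if itr % 50 == 0:
            lp = compute_log_likelihood(feature_matrix, sentiment, coefficients)
            print('iteration %4d: log likelihood = %.8f' % (itr, lp))
    return coefficients
```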

Now, I will run the logistic regression solver.
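An illustrative call (the step size and iteration count here are assumptions, not necessarily the exact values behind the results quoted below):

```python
coefficients = logistic_regression(feature_matrix, sentiment,
                                   initial_coefficients=np.zeros(feature_matrix.shape[1]),
                                   step_size=1e-7, max_iter=301)
```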

Predicting sentiments

Class predictions for a data point $\mathbf{x}$ can be computed from the coefficients $\mathbf{w}$ using the following formula: $$ \hat{y}_i = \left\{ \begin{array}{ll} +1 & \mathbf{x}_i^T\mathbf{w} > 0 \\ -1 & \mathbf{x}_i^T\mathbf{w} \leq 0 \\ \end{array} \right. $$

I will write some code to compute class predictions. I will do this in two steps:

  1. Compute the scores from feature_matrix and the learned coefficients.
  2. Apply the threshold above to each score to obtain the class prediction.

Step 1 can be implemented as follows:
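A sketch of Step 1: the score of every review is one matrix product:

```python
scores = np.dot(feature_matrix, coefficients)
```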

Now for Step 2: compute the class predictions using the scores obtained above.
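A sketch of Step 2, applying the threshold from the formula above:

```python
class_predictions = np.where(scores > 0, +1, -1)
```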

The following code block shows that 21,348 (about 40%) of the reviews are predicted to have positive sentiment.
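A sketch of that count:

```python
num_positive = (class_predictions == +1).sum()
print(num_positive, num_positive / len(class_predictions))
```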

Measuring accuracy

Now I will measure the classification accuracy of the model. Classification accuracy can be computed as follows:

$$ \mbox{accuracy} = \frac{\mbox{# correctly classified data points}}{\mbox{# total data points}} $$
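A sketch of the accuracy computation, comparing the predictions against the true sentiment labels:

```python
num_correct = (class_predictions == sentiment).sum()
accuracy = num_correct / len(sentiment)
print('accuracy = %.2f' % accuracy)
```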

The accuracy of the model on the predictions made above is 74%.

Words that contribute most to positive & negative sentiments

To find the words that correspond most strongly to positive reviews, I will first create a list of (word, coefficient_value) tuples (excluding the intercept) and sort it by coefficient value in descending order:
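A sketch of that step:

```python
# coefficients[0] is the intercept, so pair coefficients[1:] with important_words
word_coefficient_tuples = [(word, coefficient)
                           for word, coefficient in zip(important_words, list(coefficients[1:]))]
word_coefficient_tuples = sorted(word_coefficient_tuples, key=lambda x: x[1], reverse=True)
```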

Now, word_coefficient_tuples contains a sorted list of (word, coefficient_value) tuples. The first 10 elements in this list correspond to the words that are most positive.

Ten "most positive" words

The 10 words that have the most positive coefficient values are associated with positive sentiment.

The ten most positive words are:

The ten most negative words are: