
Twitter sentiment classification - Part 1


Sentiment classification is something of a trend in NLP: it consists of classifying short texts according to their sentiment connotation, a positive or negative feeling. Today we'll use the Sentiment140 dataset to train a classifier model in Python. The dataset consists of 1.6 million sentiment-labeled tweets. Note that the file's encoding must be set manually, as it is not UTF-8.

import re
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

data = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding = 'ISO-8859-1', header = None)
data.columns = ['sentiment','id','date','flag','user','tweet']
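
In Sentiment140, negative tweets are labeled 0 and positive ones 4. As a quick sanity check (not part of the original post), we can look at the label distribution, which should show the two classes evenly balanced:

# Sentiment140 labels: 0 = negative, 4 = positive
print(data['sentiment'].value_counts())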

Let’s take a look at some tweet examples by sentiment:

from random import sample

data_positive = data.loc[data['sentiment'] == 4]
pd.options.display.max_colwidth = 200
data_positive.iloc[sample(range(len(data_positive)), 10)]['tweet']

Flickr is uploading… Oooh, and these pics count towards my 101 things!

@mrdirector09 That’s why you have me

@michaelgrainger lmao! I understand bro…trust me

@kyleandjackieo I.e. I don’t need to know when you’re getting a coffee, and I don’t need to know all your deep thoughts about everything.

At sushi land

just realized my birthday isnt that far away a month and 3 days

@uhhitsangelaa thats good. glad ur feeling better girl!

@mikecane Thanks, Mike!

i just found out that my name means god’s grace in hebrew.

@rahnocerous tired. yr 12 is killing me, albeit slowly. 2 days left and im on 2 week break though

from random import sample

data_negative = data.loc[data['sentiment'] == 0]
pd.options.display.max_colwidth = 200
data_negative.iloc[sample(range(len(data_negative)), 10)]['tweet']

The weather is blowing mines right now and I’m in traffic

@keza34 oh i havent, ive bn sat at home with withdrawels, so not good

Only powder pink slipon vans would have completed this look. I rushed out and forgot my hair ties

@jenleighbarry Hey Jen! Sadly no.. guessing you are!? Awsomeness! Can hear the click-click of your focused eye going to work!

Home. Don’t think i’ll wake up at 5. :-p I had set an alarm for 6 in the kids’ room & forgot to turn it off. I feel bad about that.

i woke up earlier than i wanted to thanks to Prince parade todayy

@janiceromero same thing happened to me.. it either you not use Akismet or just check your comments daily.

ok…am i java rookie…i knw…bt i hope ds openCMS docs make some sense

@clara018 yeah! my day seemed to pass so fast without him update

@cloverdash He’s playing Juan Ignacio Chela…who’s good on clay. very annoying. Fingers crossed though!

We can see that not all tweets are obviously positive or negative; the less obvious ones will be a challenge for our classifier.

Pre-Processing

Pre-processing is a crucial step in our analysis, as it directly influences the model’s performance. We’ll use Python regular expressions to mark all-caps words with ALLCAPS, replace URLs with URL and user mentions with USER, remove all special symbols, mark punctuation repetitions with REPEAT, hashtags with HASHTAG and word-final elongations (e.g. heeyyyyyyy) with ELONG. English contractions are split, extra spaces are removed and, finally, everything is lowercased.

def preprocess_tweets(tweet):
    #Detect ALLCAPS words
    tweet = re.sub(r"([A-Z]+\s?[A-Z]+[^a-z0-9\W]\b)", r"\1 <ALLCAPS> ", tweet)
    #Replace URLs with <URL>
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '<URL> ', tweet)
    #Separate words that are joined by / (e.g. black/brown)
    tweet = re.sub(r"/"," / ", tweet)
    #Replace user mentions with <USER>
    tweet = re.sub(r'@[^\s]+', '<USER>', tweet)
    #Remove all special symbols (keep <> so the tags survive)
    tweet = re.sub(r'[^A-Za-z0-9<>/.!,?\s]+', '', tweet)
    #Mark punctuation repetition with <REPEAT>
    tweet = re.sub(r'(([!])\2+)', '! <REPEAT> ', tweet)
    tweet = re.sub(r'(([?])\2+)', '? <REPEAT> ', tweet)
    tweet = re.sub(r'(([.])\2+)', '. <REPEAT> ', tweet)
    #Mark hashtags with <HASHTAG>
    tweet = re.sub(r'#([^\s]+)', r'<HASHTAG> \1', tweet)
    #Mark word elongation (e.g. heyyyyyy) with <ELONG>
    tweet = re.sub(r'(.)\1{2,}\b', r'\1 <ELONG> ', tweet)
    tweet = re.sub(r'(.)\1{2,}', r'\1 <ELONG> ', tweet)
    #Expand English contractions (note: apostrophes were already removed above, so these rules have little effect)
    tweet = re.sub(r"'ll", " will", tweet)
    tweet = re.sub(r"'s", " is", tweet)
    tweet = re.sub(r"'d", " d", tweet) # Would/Had ambiguity
    tweet = re.sub(r"'re", " are", tweet)
    tweet = re.sub(r"didn't", "did not", tweet)
    tweet = re.sub(r"couldn't", "could not", tweet)
    tweet = re.sub(r"can't", "cannot", tweet)
    tweet = re.sub(r"doesn't", "does not", tweet)
    tweet = re.sub(r"don't", "do not", tweet)
    tweet = re.sub(r"hasn't", "has not", tweet)
    tweet = re.sub(r"'ve", " have", tweet)
    tweet = re.sub(r"shouldn't", "should not", tweet)
    tweet = re.sub(r"wasn't", "was not", tweet)
    tweet = re.sub(r"weren't", "were not", tweet)
    #Collapse extra whitespace
    tweet = re.sub(r'\s+', ' ', tweet)
    #Lower case
    tweet = tweet.lower()

    return tweet
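
To illustrate what the function does, here is how it transforms a made-up example tweet (the tweet and the commented output are mine, not from the original post):

print(preprocess_tweets('@john OMG this is soooo cool!!! http://example.com #happy'))
# prints roughly: '<user> omg <allcaps> this is so <elong> cool! <repeat> <url> happy'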

Let’s use train_test_split to split our data into training and test sets while applying our preprocessing function.

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data, train_size = 0.8, random_state = 42)

sentiment = np.array(data['sentiment'])
tweets = np.array(data['tweet'].apply(preprocess_tweets))

sentiment_train = np.array(train_data['sentiment'])
tweets_train = np.array(train_data['tweet'].apply(preprocess_tweets))

sentiment_test = np.array(test_data['sentiment'])
tweets_test = np.array(test_data['tweet'].apply(preprocess_tweets))

We’ll build a word2count dictionary with a key for every word found in the data and, as the value, the number of times that word appears. A threshold is then chosen so that the retained words account for slightly more than 95% of all word occurrences.

word2count = {}
for tweet in tweets:
    for word in re.findall(r"[\w']+|[.,!?]", tweet):
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1

total_count = np.array(list(word2count.values()))

# Fraction of all word occurrences covered by words that appear more than 75 times
print(sum(total_count[total_count > 75]) / sum(total_count))
0.9551287692699321

Zipf’s Law is an empirical approximation which states that the second most frequent word in a language is used half as often as the first, the third one a third as often as the first, and so on. Mathematically, Zipf’s Law can be stated as $P_n \sim 1/n^\alpha$, where $P_n$ is the frequency of the $n$-th most frequent word and $\alpha$ is approximately 1. This relationship appears as a straight line in a log-log plot, with log(count) on the y-axis and log(rank) on the x-axis:

[Figure: log-log plot of word count against word rank, showing the approximately linear Zipf relationship]
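
A plot like the one above can be reproduced directly from word2count; here is a minimal sketch (the plotting details are my own):

# Sort word counts in decreasing order and plot count against rank on log-log axes
counts = np.array(sorted(word2count.values(), reverse=True))
ranks = np.arange(1, len(counts) + 1)

plt.figure(figsize=(6, 4))
plt.loglog(ranks, counts)
plt.xlabel('word rank (log scale)')
plt.ylabel('word count (log scale)')
plt.title("Zipf's Law on the Sentiment140 vocabulary")
plt.show()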

Thus, word frequency decays rapidly with rank (roughly as a power law), and words that are rarely seen provide little to no information to our model while making it much more complex and sparse. Therefore, only relatively frequent words are included, which still retains most of the information.
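
To get a feel for how much the vocabulary shrinks, we can also count how many distinct words survive the count threshold of 75 (again a quick check of my own, not part of the original analysis):

# Distinct words kept versus the full vocabulary size
frequent_words = [w for w, c in word2count.items() if c > 75]
print(len(frequent_words), 'of', len(word2count), 'distinct words are kept')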

Vectorizing

We’ll build a bag of words, in which each tweet becomes a vector whose length is the number of words in our dictionary, and whose values correspond to how many times each word appears in that tweet. This approach, however, has a big disadvantage: very frequent words (such as the and a) will almost always have the highest counts while actually carrying little information. The TF-IDF (term frequency times inverse document frequency) approach tackles this issue by using a different value: in a simple formulation, the term frequency of a word in a tweet is multiplied by $\log\left(n / (df(d, t) + 1)\right)$, where $n$ is the number of documents and $df(d, t)$ is the number of documents that contain the word (scikit-learn’s TfidfVectorizer uses a smoothed, normalized variant of this idea). The transformed values thus indicate which words stand out as the most distinctive and defining in each tweet.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df = 75)
vectorizer.fit(tweets)

tweets_bow_train = vectorizer.transform(tweets_train)
tweets_bow_test = vectorizer.transform(tweets_test)
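
As a quick sanity check (my addition, not from the original post), we can inspect how many terms survived the min_df cut and the shape of the resulting matrices:

print('vocabulary size:', len(vectorizer.vocabulary_))
print('train matrix shape:', tweets_bow_train.shape)
print('test matrix shape:', tweets_bow_test.shape)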

Next, a logistic regression will be the model of choice to classify the data. As the model receives a huge input vector while having to output a single value, it’s reasonable to expect the weights to be relatively sparse; that is, many words should have little to no influence on the output. Thus, we’ll use an L1 penalty to encourage sparsity in the model. The C parameter is the inverse of the regularization strength: lower C values result in stronger regularization and sparser solutions. We’ll try three different C values. A fourth model with an L2 penalty and C = 1 (the default value) will be fit for comparison.

from sklearn.linear_model import LogisticRegression

regressor1 = LogisticRegression(C = 1, penalty = 'l1', solver = 'liblinear',
                                multi_class = 'ovr', random_state = 42)
regressor1.fit(tweets_bow_train, sentiment_train)

regressor2 = LogisticRegression(C = 0.5, penalty = 'l1', solver = 'liblinear',
                                multi_class = 'ovr', random_state = 42)
regressor2.fit(tweets_bow_train, sentiment_train)

regressor3 = LogisticRegression(C = 0.1, penalty = 'l1', solver = 'liblinear',
                                multi_class = 'ovr', random_state = 42)
regressor3.fit(tweets_bow_train, sentiment_train)

regressor4 = LogisticRegression(solver = 'liblinear', multi_class = 'ovr',
                                random_state = 42)
regressor4.fit(tweets_bow_train, sentiment_train)

We’ll use the area under the ROC curve (AUC) and the F1-score to measure the models’ performance.

from sklearn.metrics import roc_auc_score, f1_score

pred1 = regressor1.predict(tweets_bow_test)
pos_prob1 = regressor1.predict_proba(tweets_bow_test)[:, 1]
auc1 = roc_auc_score(sentiment_test, pos_prob1)
f11 = f1_score(sentiment_test, pred1, pos_label=4)

pred2 = regressor2.predict(tweets_bow_test)
pos_prob2 = regressor2.predict_proba(tweets_bow_test)[:, 1]
auc2 = roc_auc_score(sentiment_test, pos_prob2)
f12 = f1_score(sentiment_test, pred2, pos_label=4)

pred3 = regressor3.predict(tweets_bow_test)
pos_prob3 = regressor3.predict_proba(tweets_bow_test)[:, 1]
auc3 = roc_auc_score(sentiment_test, pos_prob3)
f13 = f1_score(sentiment_test, pred3, pos_label=4)

pred4 = regressor4.predict(tweets_bow_test)
pos_prob4 = regressor4.predict_proba(tweets_bow_test)[:, 1]
auc4 = roc_auc_score(sentiment_test, pos_prob4)
f14 = f1_score(sentiment_test, pred4, pos_label=4)
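
The scores below were printed with a small loop along these lines (the printing code isn’t shown in the original, so this is a reconstruction):

# Print AUC and F1 for each of the four models
for i, (auc, f1) in enumerate(zip([auc1, auc2, auc3, auc4],
                                  [f11, f12, f13, f14]), start=1):
    print(f'Model {i}:')
    print('AUC:', auc)
    print('F1:', f1)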
Model 1:
AUC: 0.8782442518806748
F1: 0.8017438980490371
Model 2:
AUC: 0.878181401427863
F1: 0.8021068750172958
Model 3:
AUC: 0.8724711782629141
F1: 0.7978355389550899
Model 4:
AUC: 0.878032003703207
F1: 0.8012445320682644

Model 1 is the best AUC-wise, while model 2 is the best according to the F1 score. Let’s look at the sparsity of each model:

sparsity1 = np.mean(regressor1.coef_.ravel() == 0) * 100
sparsity2 = np.mean(regressor2.coef_.ravel() == 0) * 100
sparsity3 = np.mean(regressor3.coef_.ravel() == 0) * 100
sparsity4 = np.mean(regressor4.coef_.ravel() == 0) * 100

print('Sparsity with L1 and C = 1: %.2f%%' % sparsity1)
print('Sparsity with L1 and C = 0.5: %.2f%%' % sparsity2)
print('Sparsity with L1 and C = 0.1: %.2f%%' % sparsity3)
print('Sparsity with L2 and C = 1: %.2f%%' % sparsity4)
Sparsity with L1 and C = 1: 15.51%
Sparsity with L1 and C = 0.5: 29.29%
Sparsity with L1 and C = 0.1: 72.18%
Sparsity with L2 and C = 1: 0.00%

It’s quite remarkable that even with 72.18% of its coefficients set to 0, model 3 still achieves performance almost identical to the much denser models. Moreover, sparsity makes the model simpler without hurting performance: the L2 model, with no sparsity at all, is far denser and performs slightly worse on both metrics.

Interpreting the model

Let’s see which words contribute most to positive and to negative sentiment. The third model will be used, as it is the sparsest and therefore the easiest to interpret.

coefs = np.array(regressor3.coef_.ravel())

sorting = coefs.argsort()

high_coefs = []
high_words = []
# 20 largest (most positive) coefficients and their corresponding words
for i in range(-1, -21, -1):
    high_coefs.append(coefs[sorting[i]])
    temp = np.zeros(coefs.shape[0])
    temp[sorting[i]] = 1
    high_words.append(vectorizer.inverse_transform(temp.reshape(1, -1))[0][0])

low_coefs = []
low_words = []
# 20 smallest (most negative) coefficients and their corresponding words
for i in range(20):
    low_coefs.append(coefs[sorting[i]])
    temp = np.zeros(coefs.shape[0])
    temp[sorting[i]] = 1
    low_words.append(vectorizer.inverse_transform(temp.reshape(1, -1))[0][0])

high_coefs = [high_coefs]
low_coefs = [low_coefs]

high_coefs = np.round(high_coefs, 1)
low_coefs = np.round(low_coefs, 1)

fig, ax = plt.subplots(figsize = (10, 2))
im = ax.imshow(high_coefs, cmap = 'YlGn')
ax.set_xticks(np.arange(len(high_words)))
ax.set_xticklabels(list(high_words))

plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")
ax.axes.get_yaxis().set_visible(False)
for i in range(20):
    text = ax.text(i, 0, high_coefs[0][i],
                   ha="center", va="center", color="black")
fig.tight_layout()
plt.savefig('highest_heatmap.png', dpi = 150)

fig, ax = plt.subplots(figsize = (10, 2))
im = ax.imshow(low_coefs, cmap = 'PuBu_r')
ax.set_xticks(np.arange(len(low_words)))
ax.set_xticklabels(list(low_words))

plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
         rotation_mode="anchor")
ax.axes.get_yaxis().set_visible(False)

for i in range(20):
    text = ax.text(i, 0, low_coefs[0][i],
                   ha="center", va="center", color="black")
fig.tight_layout()
plt.savefig('lowest_heatmap.png', dpi = 150)

[Figure: heatmap of the 20 words with the largest positive coefficients]
[Figure: heatmap of the 20 words with the largest negative coefficients]

Many of the words are quite predictable. However, worries stands out among the positive ones. One possible explanation is that tweets containing worries tend to use it lightly (as in “no worries”), while worried is the form that carries the negative connotation.

Let’s take a look at the 10 most confident true positives:

pos_indices = sentiment_test == 4
pos_predicted = pos_prob1 > 0.5
true_positives = pos_indices & pos_predicted
true_positives_rank = np.argsort(pos_prob1[true_positives])
print(tweets_test[true_positives][true_positives_rank[range(-1, -11, -1)]])
['<user> <url> <elong> you look great. and happy. smiling is good. haha. i love your smile.'
'<user> glad it makes you happy. smile '
'<user> welcome and thank you for the followfriday! '
'<user> <user> happy birthday. we love you! thank you '
'<user> yay! <repeat> thanks for the followfriday love! <repeat> '
'<user> welcome home! im glad you had a great time, thanks for the amazing updates '
'<user> im glad you enjoy! thanks!'
'<user> yay <elong> ! thank you! youre awesome! <repeat> '
'<user> your welcome! thanks for sharing the great quote. '
'waves good morning and smiles her best smile! <url> ']

Top 10 true negatives:

neg_indices = sentiment_test == 0
neg_predicted = pos_prob1 <= 0.5
true_negatives = neg_indices & neg_predicted
true_negatives_rank = np.argsort(pos_prob1[true_negatives])
print(tweets_test[true_negatives][true_negatives_rank[range(10)]])
['is sad. i miss u . <repeat> '
'i cant believe farrah fawcett died! so sad '
'rip <allcaps> farrah fawcett! this is so sad '
'so sad to hear farrah fawcett died '
'<user> i had boatloads of sharpies and i didnt go! <repeat> sad sad sad sad so very sad. '
' sad awkwardness' 'im sad. <repeat> i miss my <user> '
'sad i dont know why i sad ' 'sad sorrow weary disappointed '
'i hate i missed roo im so sad ']

Top 10 false negatives:

pos_indices = sentiment_test == 4
neg_predicted = pos_prob1 <= 0.5
false_negatives = pos_indices & neg_predicted
false_negatives_rank = np.argsort(pos_prob1[false_negatives])
print(tweets_test[false_negatives][false_negatives_rank[range(10)]])
['<user> <allcaps> dont be sad. it doesnt make me sad '
'im not sad anymore ' '<user> that is sad! '
'dubsteppin. miss my lovies. '
'saw quotdrag me to hellquot sadly it scared me <elong> hate the ending. '
'btw, bye <user> dang! wish u went along wit them. <repeat> sad sad.'
'the saddest person is texting me telling me about how sad my life is and is getting nothing right, now shes sad '
'<user> poor girl, the fever is horrible! i hate it! get well soon bama! '
'<user> your sad ' '<user> coucou miss ']

Top 10 false positives:

neg_indices = sentiment_test == 0
pos_predicted = pos_prob1 > 0.5
false_positives = neg_indices & pos_predicted
false_positives_rank = np.argsort(pos_prob1[false_positives])
print(tweets_test[false_positives][false_positives_rank[range(-1, -11, -1)]])
['<user> cool! thank you thank you '
'<user> hey say me something haha now that you in love, love, love 8 forget about me? haha luv ya and im so happy because u happy'
'<user> i love you,i love you,i love you youre the most beautiful and sweet girl ever.'
'thank you lol' 'wait for a wonderful day ' '<user> thank you'
'<user> thank you ' '<user> thank you lovie. ' '<user> thank you. '
'<user> thanks. ']

Here we can see that most of these false positives/negatives are probably actually true positives/negatives, which suggests there is mislabeled data in the dataset. Moreover, it shows the model generalizes well, since it assigns what looks like the correct sentiment even to these mislabeled examples.

This whole project shows that even a simple model such as logistic regression can achieve very satisfactory performance with good generalization. Thus, it’s always good practice to begin with a simple model. In the following posts, I’ll use recurrent neural networks to tackle this task.

