Text Processing and Classification Intro (Part 2 — Text Classification)

Sean Yonathan T · Published in Analytics Vidhya · Dec 7, 2021 · 9 min read

https://cfml.se/blog/sentiment_classification/

How would we predict the sentiments of new reviews?

In the previous part of this writing, I covered the steps to analyze text data. The analysis performed in that part was descriptive. This part explains how we would predict the sentiments of new reviews. For this part, I learned a lot from this Medium article.

Link to the previous part of this writing:
https://sea-remus.medium.com/text-processing-and-classification-intro-part-1-sentiment-analysis-7e22a83e1c4

Brief Recap on Part 1

The data that we would like to use has, among others, the following columns: ProductId, ProfileName, HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary, and Text.

We understand that the sentiments of the reviews are actually determined by the “Score” column (which makes modeling technically redundant), but this writing will assume that we don’t have the “Score” column.

By analyzing our text data, we know that there are two sentiments present in the reviews. We classify those sentiments as positive and negative. We will use stopwords to remove redundant words, and we will also add “br”, “href”, “amazon”, “product”, “one”, “find”, “taste”, “flavor”, “good”, “buy”, “make”, and “coffee” to the stopword list, since the analysis showed that those words are also redundant.
From the previous analysis and EDA, we also concluded that we won’t be using the “Summary” column, since some of the users didn’t give a summary of their reviews.

Data Preprocessing

We would first import essential packages and obtain the data using the following code:

#essentials
import pandas as pd
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
#obtaining the data
con = sqlite3.connect('database.sqlite')
raw = pd.read_sql_query("""select ProductId, ProfileName,
                                  HelpfulnessNumerator, HelpfulnessDenominator,
                                  Time, Text,
                                  case when Score >= 4 then 1 else 0 end Sentiment
                           from Reviews""", con)
con.close()

Before we begin the data wrangling, we split the data into three parts: train data, test data, and validation data. The purpose is to separate the test data used for hyperparameter tuning (should there be any) from the data used to imitate new, real-world data.

from sklearn.model_selection import train_test_split
xtrain, xsplit, ytrain, ysplit = train_test_split(raw.drop('Sentiment', axis = 1),
                                                  raw.Sentiment,
                                                  test_size = 0.3,
                                                  random_state = 42)
xtest, xval, ytest, yval = train_test_split(xsplit, ysplit,
                                            test_size = 0.5,
                                            random_state = 42)

We also define a tokenization function: it cleans the text, tokenizes it, removes redundant words (stopwords), and lemmatizes the remaining tokens.

So for those who have zero idea of text processing when reading this writing:

Tokenization in natural language processing basically means splitting sentences (sequences of words) into words (or tokens). We can do this either with the built-in Python string method .split() or with word_tokenize() from the NLTK package.
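For example (a minimal sketch; the sentence is made up, and word_tokenize() needs NLTK's 'punkt' tokenizer data to be downloaded first):

sentence = "This coffee tastes great!"

# plain Python: splits on whitespace only
print(sentence.split())         # ['This', 'coffee', 'tastes', 'great!']

# NLTK: also separates punctuation into its own token
from nltk.tokenize import word_tokenize
print(word_tokenize(sentence))  # ['This', 'coffee', 'tastes', 'great', '!']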

Lemmatization is a process to obtain the root (lemma) of words. An alternative to lemmatization is stemming, which chops words down to their stems using simple rules. The idea of both is to shorten words so that if we find two or more words sharing the same root, we treat them as the same token. For example, say we have the list [‘playing’, ‘plays’, ‘played’]. When stemmed or lemmatized, we would obtain [‘play’, ‘play’, ‘play’]. In some cases stemming works better than lemmatization, while in other cases it is the other way around. Think of this as the normalization-versus-standardization choice for numerical data.
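A minimal sketch of the difference, assuming NLTK and its 'wordnet' corpus are installed (note that the lemmatizer needs a part-of-speech hint to treat these tokens as verbs):

from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ['playing', 'plays', 'played']

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])                   # ['play', 'play', 'play']

lemmatizer = WordNetLemmatizer()
# pos='v' treats the tokens as verbs; with the default (noun)
# the lemmatizer would leave 'playing' and 'played' unchanged
print([lemmatizer.lemmatize(w, pos='v') for w in words])  # ['play', 'play', 'play']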

def tokenize(txt):
    import re
    from wordcloud import STOPWORDS
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.stem.wordnet import WordNetLemmatizer
    stpwrd = set(STOPWORDS)
    #we then add frequent irrelevant words discovered before
    stpwrd.update(['br', 'href', 'amazon', 'product', 'one',
                   'find', 'taste', 'flavor', 'good', 'buy',
                   'make', 'coffee'])
    removeapos = txt.lower().replace("'", '')           #remove any apostrophe
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", removeapos)   #sub weird char to space
    words = word_tokenize(text)
    lemma = [WordNetLemmatizer().lemmatize(w) for w in words if w not in stpwrd]
    return lemma

After defining the tokenization function, we define a function to handle missing values and engineer new variables.

def preprocess_engineer(data, train = True):
    #this is to handle missing values before using the data
    data['ProfileName'] = data['ProfileName'].apply(lambda x: 'Anonymous' if
                                                    (x == 'nan')|(x == 'NaN')|
                                                    (x == 'N/A')|(x == '0')|
                                                    (x == '')|(x == '-1')|
                                                    (x == 'null')|(x == 'Null')|
                                                    (x == 'NA')|(x == 'na')|
                                                    (x == 'none')|(x == 'unknown')
                                                    else x)

    #this is to simulate that after training we would
    #face a number of new reviews
    if train == True:
        countprof = data.ProfileName.value_counts().reset_index()
        countprof.columns = ['ProfileName', 'ProfReviewCount']

        countprod = data.ProductId.value_counts().reset_index()
        countprod.columns = ['ProductId', 'ProdReviewCount']

        countspam = data.groupby(['ProductId',
                                  'ProfileName']).size().reset_index()
        countspam.columns = ['ProductId', 'ProfileName', 'SpamReviewCount']

        engineered = pd.merge(
            pd.merge(
                pd.merge(data, countprof, how = 'left', on = 'ProfileName'),
                countprod, how = 'left', on = 'ProductId'),
            countspam, how = 'left', on = ['ProductId',
                                           'ProfileName'])
        return engineered
    if train == False:
        #this is from train (xtrain is taken from the outer scope)
        countproftrain = xtrain.ProfileName.value_counts().reset_index()
        countproftrain.columns = ['ProfileName', 'ProfReviewCounttrain']

        countprodtrain = xtrain.ProductId.value_counts().reset_index()
        countprodtrain.columns = ['ProductId', 'ProdReviewCounttrain']

        countspamtrain = xtrain.groupby(['ProductId',
                                         'ProfileName']).size().reset_index()
        countspamtrain.columns = ['ProductId',
                                  'ProfileName',
                                  'SpamReviewCounttrain']
        #this is from the newly introduced data
        countprof = data.ProfileName.value_counts().reset_index()
        countprof.columns = ['ProfileName', 'ProfReviewCountadd']

        countprod = data.ProductId.value_counts().reset_index()
        countprod.columns = ['ProductId', 'ProdReviewCountadd']

        countspam = data.groupby(['ProductId',
                                  'ProfileName']).size().reset_index()
        countspam.columns = ['ProductId', 'ProfileName', 'SpamReviewCountadd']

        #this is to add newly introduced data with train
        #ProfileName
        countproffinal = countprof.merge(countproftrain,
                                         how = 'left',
                                         on = 'ProfileName')
        countproffinal.fillna(0, inplace = True)
        countproffinal['ProfReviewCount'] = (countproffinal.ProfReviewCounttrain +
                                             countproffinal.ProfReviewCountadd)
        countproffinal.drop(['ProfReviewCounttrain',
                             'ProfReviewCountadd'],
                            axis = 1, inplace = True)

        #ProductId
        countprodfinal = countprod.merge(countprodtrain,
                                         how = 'left',
                                         on = 'ProductId')
        countprodfinal.fillna(0, inplace = True)
        countprodfinal['ProdReviewCount'] = (countprodfinal.ProdReviewCounttrain +
                                             countprodfinal.ProdReviewCountadd)
        countprodfinal.drop(['ProdReviewCounttrain',
                             'ProdReviewCountadd'],
                            axis = 1, inplace = True)

        #SpamReviewCount
        countspamfinal = countspam.merge(countspamtrain,
                                         how = 'left',
                                         on = ['ProductId', 'ProfileName'])
        countspamfinal.fillna(0, inplace = True)
        countspamfinal['SpamReviewCount'] = (countspamfinal.SpamReviewCounttrain +
                                             countspamfinal.SpamReviewCountadd)
        countspamfinal.drop(['SpamReviewCounttrain',
                             'SpamReviewCountadd'],
                            axis = 1, inplace = True)

        engineered = pd.merge(
            pd.merge(
                pd.merge(data, countproffinal,
                         how = 'left', on = 'ProfileName'),
                countprodfinal,
                how = 'left',
                on = 'ProductId'),
            countspamfinal,
            how = 'left',
            on = ['ProductId', 'ProfileName'])
        return engineered

To summarize the function above, the idea is to fill in missing values in the ProfileName column and to create new variables that capture the number of reviews made by a certain customer, the number of reviews received by a certain product, and the number of repeated reviews made by the same customer on the same product.

Having finished creating the functions, we now apply them.

xtrainpreprocessed = preprocess_engineer(xtrain,train = True)
xtestpreprocessed = preprocess_engineer(xtest,train = False)
xvalpreprocessed = preprocess_engineer(xval,train = False)

Notice that we haven’t used tokenize yet. I will get to it a bit later.

Text data as model input

Before we get into modeling, we must convert the text data into numeric data. There are two basic methods of doing this that I’ve learned so far: the first is Bag of Words (BOW) and the second is Word Embedding (Word2Vec, W2V). I’m not going to dive deep into the theory of how they work; instead, I’m just going to tell you what they do, since this writing is only an introduction.

  • Basically, BOW converts the words in your text data into numbers that represent how often each word shows up in your data. How often a word shows up can be represented by the count of the word (CountVectorizer) or by the weight of the word (TF-IDF). You can Google them to understand them better.
  • Word2Vec is an alternative to BOW. Word2Vec, as its name suggests, converts the words in your data into vectors. Word2Vec uses a neural network to produce the vectors that represent our original words. There are two ways to do this: CBOW (Continuous Bag of Words) and Skip-gram.
https://www.researchgate.net/figure/CBOW-and-Skip-gram-models-architecture-6_fig1_332543231

Say we have this sentence:

I’m gonna go shopping

To explain this briefly:
CBOW tries to predict a word by looking at the other words around it (example: output = “gonna”, input = [“I’m”, “go”, “shopping”]).
Skip-gram tries to predict the surrounding words from a given word (example: input = “gonna”, output = [“I’m”, “go”, “shopping”]).

We’ll be using Word2Vec to convert the text data into numeric data, because one of the downsides of the Bag of Words method is that it often runs into dimensionality problems.

Imaginary dataframe (Bag of Words with CountVectorizer)

Imagine that we have a dataframe like the one above. The number of distinct words in your data becomes the number of columns in your dataframe. The dataframe above assumes that you use CountVectorizer as your Bag of Words approach, so for the first row (first text) of your data there are 3 “get” words, 1 “this” word, 4 “out” words, and 3 “can” words. Now you can imagine how large the input would be if we used the Bag of Words approach on a real corpus.
With Word2Vec, we can handle this by calculating the mean (average) of the word vectors within each text.
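To make the column explosion concrete, here is a minimal sketch using scikit-learn's CountVectorizer (the two toy documents below are made up for illustration; get_feature_names_out requires a fairly recent scikit-learn, older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["get out get out get this out out can can can",
        "this can get out"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['can' 'get' 'out' 'this'] -- one column per distinct word
print(bow.toarray())                       # [[3 3 4 1]
                                           #  [1 1 1 1]]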

Imaginary dataframe (Word Embedding with Word2Vec)

The imaginary dataframe above shows the output of Word2Vec. A vector of zeros means that the word doesn’t appear in that text (e.g., the 3rd text doesn’t have “this” and “can”). It’s common to average the vectors of each datapoint. After averaging, the number of columns no longer equals the number of words; instead, it equals the size of the vector.
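A minimal sketch of this idea with gensim (the toy sentences and the vector_size value here are made up for illustration; sg = 0 selects CBOW while sg = 1 selects Skip-gram):

from gensim.models import Word2Vec
import numpy as np

# toy tokenized "reviews"
sentences = [['great', 'morning', 'brew'],
             ['terrible', 'bitter', 'brew']]

# sg = 0 -> CBOW, sg = 1 -> Skip-gram
model = Word2Vec(sentences, vector_size = 8, min_count = 1, sg = 0, seed = 42)

print(model.wv['brew'].shape)                                         # (8,) -- one fixed-size vector per word
print(np.mean([model.wv[w] for w in sentences[0]], axis = 0).shape)   # (8,) -- averaged vector for one review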

Converting text to numbers

With these concepts briefly understood, we will now code several things at once.

#function to average the word vectors of each review
def AverageVectors(wordvectmod, oritokenizedtxt):
    todict = dict(zip(wordvectmod.wv.index_to_key,
                      wordvectmod.wv.vectors))
    #if a text is empty we should return a vector of zeros
    #with the same dimensionality as all the other vectors
    dim = len(next(iter(todict.values())))
    return np.array([np.mean([todict[w] for w in words if w in todict]
                             or
                             [np.zeros(dim)], axis = 0)
                     for words in oritokenizedtxt
                     ])

#train the Word2Vec model on the tokenized train reviews
from gensim.models import Word2Vec
w2v = Word2Vec(xtrainpreprocessed.Text.apply(tokenize),
               sg = 0,      #CBOW
               hs = 1,      #hierarchical softmax
               seed = 42,
               vector_size = 128)
#convert to vect average
xtrainvect = AverageVectors(w2v, xtrainpreprocessed.Text.apply(tokenize))
xtestvect = AverageVectors(w2v, xtestpreprocessed.Text.apply(tokenize))
xvalvect = AverageVectors(w2v, xvalpreprocessed.Text.apply(tokenize))

What I did in the code above is create a function that accepts the Word2Vec model we trained and averages the word vectors for each datapoint, and then apply that function to our text data.
After the vectors from Word2Vec have been averaged, we convert them into a dataframe and add other features that might be useful for detecting customers’ sentiment.

#add the remaining features to the averaged word vectors
def vecttodata(aftervec, beforevec):
    df = pd.DataFrame(aftervec)
    df['HelpfulnessNumerator'] = beforevec.reset_index(drop=True)['HelpfulnessNumerator']
    df['HelpfulnessDenominator'] = beforevec.reset_index(drop=True)['HelpfulnessDenominator']
    df['Time'] = beforevec.reset_index(drop=True)['Time']
    df['ProfReviewCount'] = beforevec.reset_index(drop=True)['ProfReviewCount']
    df['ProdReviewCount'] = beforevec.reset_index(drop=True)['ProdReviewCount']
    df['SpamReviewCount'] = beforevec.reset_index(drop=True)['SpamReviewCount']
    return df

xtraindf = vecttodata(xtrainvect, xtrainpreprocessed)
xtestdf = vecttodata(xtestvect, xtestpreprocessed)
xvaldf = vecttodata(xvalvect, xvalpreprocessed)

Training Classification Model

On this occasion, I would like to use the XGBoost classifier to build the classification model. The training time is relatively long; it took about an hour for the training process to finish. Data can be homogeneous (each attribute/feature has a similar nature to the others, like text, image, or video data) or heterogeneous (attributes/features have different natures, like age mixed with income). I tried XGBoost because it works well for heterogeneous data, but should readers like to try building a classification model using only the text data as input, I recommend neural networks, since they are probably much faster and preferable for homogeneous datasets.

In the previous part, you could also see that the numbers of positive and negative sentiments are imbalanced, so to handle the imbalance we could either upsample the data or compute sample weights for the training data. I chose the latter. The following is the code I wrote to build and train the sentiment classification model.

from xgboost import XGBClassifier
from sklearn.utils.class_weight import compute_sample_weight

#weight each training sample to compensate for the class imbalance
sample_weights = compute_sample_weight(
    class_weight = 'balanced',
    y = ytrain
)

xgb = XGBClassifier(booster = 'gbtree',
                    objective = 'binary:logistic',
                    eval_metric = 'auc',
                    seed = 42, use_label_encoder = False,
                    num_parallel_tree = 10,
                    n_estimators = 50,
                    verbosity = 2)
xgb.fit(xtraindf, ytrain, sample_weight = sample_weights)
pred = xgb.predict(xtestdf)

from sklearn.metrics import confusion_matrix, classification_report
print(classification_report(ytest, pred, target_names = ['Negative', 'Positive']))
Classification Report on Test Data

Final Proofing & Conclusion

We can say that the model performed well on the test data, but we could improve it further by performing hyperparameter tuning, setting a probability cutoff value, etc. We run the model on the validation data one more time to test how it performs on unseen data.
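The validation run itself isn't shown in the snippets above; a minimal sketch of it, reusing the objects we already defined, would be:

#evaluate the trained model on the held-out validation set
predval = xgb.predict(xvaldf)
print(classification_report(yval, predval, target_names = ['Negative', 'Positive']))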

Classification Report on Validation Data

You can see that the classification report doesn’t differ at all from the classification report for the test data.

This writing, albeit far from perfect, was meant to give a rough step-by-step overview of what we should do when we would like to work with text data. There are a lot of text manipulation methods that I myself haven’t completely understood or even tried to learn, but I hope I could share this little knowledge with other NLP beginners.
As usual, I hope you will leave me any advice regarding my approach in the comment section, and leave a clap should you find this post useful. :)

See you next time!
