Opinion mining of online users’ comments using Natural Language Processing and machine learning

Date

2020-08-28

Abstract

With the widespread popularity of the World Wide Web, an increasing number of people use social media and websites to post their opinions about products or events, or to make decisions based on the opinions and experiences of others. These online opinions are structured or unstructured textual data containing both significant and insignificant information, which has attracted researchers' attention to extracting knowledge from such text. Opinion mining and Natural Language Processing (NLP) techniques help to find information in large volumes of reviews written as unstructured comments. This research proposes a model for classifying online users' feedback and opinions, with the aim of improving classification accuracy and precision over existing research on the same dataset. More precisely, NLP techniques and several supervised machine learning techniques are used to classify users' opinions, and the performance of every classifier is evaluated to identify the best one.
The dataset contains 689 comments extracted from user reviews on Amazon.com, collected and annotated by Minqing Hu and Bing Liu. The selected comments are about the product “Speakers” on Amazon.com. Each comment is written by one user and carries a label that captures the opinion its author expresses: "positive", "negative" or "neutral". The data is provided as an XML file, a semi-structured format. The opinions are pre-processed with NLP techniques, for instance by removing punctuation, URLs, numbers, extra spaces, and stop-words. Features are then extracted with further NLP techniques, for example word tokenization, stemming, Bag of Words, Bag of N-grams, and Term Frequency-Inverse Document Frequency (TF-IDF).
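The pre-processing and feature-extraction steps listed above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the thesis code: the function names (`preprocess`, `ngrams`, `tf_idf`), the tiny stop-word list, and the plain `tf * log(N / df)` weighting are all assumptions made here, and stemming is left out because the thesis relied on NLTK for it.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (hypothetical; a real run would use a fuller one).
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "it", "this", "to", "of"}


def preprocess(text):
    """Lowercase, strip URLs, numbers and punctuation, then drop stop-words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\d+", " ", text)                     # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)                 # remove punctuation
    tokens = text.split()                                # also collapses extra spaces
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs_tokens, n=1):
    """Return one {ngram: weight} dict per document, using tf * log(N / df)."""
    bags = [Counter(ngrams(tokens, n)) for tokens in docs_tokens]
    n_docs = len(bags)
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights
```

Calling `tf_idf([preprocess(c) for c in comments], n=2)` on a list of comment strings would give bigram weights. An n-gram that occurs in every comment gets weight zero under this weighting, which is the usual motivation for TF-IDF over raw counts.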
The proposed method was implemented in Python using the Natural Language Toolkit (NLTK) and other Python libraries. The proposed model reached a peak precision of 88% with a Random Forest of 140 trees on the bigram feature space. Random Forest, Gradient Boosting, Artificial Neural Network, and SVM each reached 87% precision on the trigram feature space.
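As a rough illustration of the best-performing configuration reported above (a Random Forest with 140 trees on bigram features), the sketch below assumes scikit-learn, which the abstract does not name explicitly. The toy comments and labels are invented for this example and are not taken from the dataset, and the Gini criterion is assumed from the keyword list.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy comments and labels, invented for illustration only (not from the dataset).
comments = [
    "great sound quality for the price",
    "really great sound and deep bass",
    "terrible build quality broke quickly",
    "terrible sound and cheap plastic",
    "speakers arrived on monday",
    "speakers arrived in a brown box",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

# Bag of bigrams followed by a 140-tree Random Forest.
model = make_pipeline(
    CountVectorizer(ngram_range=(2, 2)),  # bigrams only
    RandomForestClassifier(n_estimators=140, criterion="gini", random_state=0),
)
model.fit(comments, labels)

# Predicting on the training comments only shows the API;
# a real evaluation needs a held-out test split.
predictions = model.predict(comments)
```

Switching `ngram_range` to `(3, 3)` would give the trigram feature space. Precision on held-out data could then be computed with `sklearn.metrics.precision_score`.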

Keywords

Data mining, opinion mining, Natural Language Processing (NLP), data pre-processing, word tokenization, stemming, term frequency-inverse document frequency, supervised machine learning, random forest, gradient boosting, decision trees, SVMs, gini-index, artificial neural network
