A MACHINE LEARNING CLASSIFICATION MODEL FOR MOVIE REVIEWS USING N-GRAM FEATURES SELECTION

Authors

  • Ademola Abiodun Omilabu
  • Adedeji Oladimeji Adebare
  • Olayinka Olufunmilayo Olusanya
  • Omotayo Joseph Adeyemi

Keywords:

Sentiment Analysis, N-Gram Features, Machine Learning, Movie Reviews, IMDb Dataset, Exploratory Data Analysis.

Abstract

This study addresses consumers' growing reliance on online platforms for informed decision-making, focusing on movie reviews. Using 50,000 reviews from Kaggle’s IMDb dataset, it develops a sentiment classification model that combines N-Gram feature extraction with several machine learning algorithms. The work is motivated by limitations identified in prior methodologies: small datasets, rule-based approaches, and an overreliance on TF-IDF feature extraction. Exploratory Data Analysis (EDA) is carried out in Python, and preprocessing comprises lowercase conversion, tokenization, stop-word removal, and stemming. Central to the methodology is N-Gram feature selection, designed to capture contextual relationships between neighboring words. The algorithms compared are Linear Support Vector Classifier (Linear SVC), Logistic Regression, Decision Trees, Bernoulli Naïve Bayes, and Multinomial Naïve Bayes. Comparative analysis shows that the N-Gram approach, especially when combined with Linear SVC, outperforms the traditional TF-IDF approach, yielding higher accuracy and more balanced confusion matrices. Multinomial Naïve Bayes and Logistic Regression also perform strongly, whereas Decision Trees show limitations in precision. The findings support N-Gram features paired with Linear SVC as a robust framework for sentiment analysis of movie reviews. They carry practical implications for businesses that rely on user-generated content for decision-making and highlight the value of incorporating N-Gram feature extraction into sentiment analysis models.
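
To make the feature-extraction step concrete, the following is a minimal sketch of lowercase conversion, tokenization, stop-word removal, and unigram/bigram counting, using only the Python standard library. It is an illustration under stated assumptions, not the study's implementation: the function names, the tiny stop-word list, and the regular-expression tokenizer are hypothetical, and stemming is omitted because the abstract does not say which stemmer was used.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; the study's actual list
# is not stated. "not" is kept on purpose because it carries sentiment.
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to"}


def preprocess(text):
    """Lowercase the text, tokenize it, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, n_min=1, n_max=2):
    """Count all n-grams of a review for n_min <= n <= n_max."""
    tokens = preprocess(text)
    counts = Counter()
    for n in range(n_min, n_max + 1):
        counts.update(ngrams(tokens, n))
    return counts


features = ngram_counts("This movie was not good, not good at all.")
print(features.most_common(3))
```

In this example the bigram "not good" is counted as a feature in its own right, which is the kind of contextual signal that unigram counts alone cannot express. In practice, count vectors like these would be passed to a classifier; with scikit-learn, for instance, `CountVectorizer(ngram_range=(1, 2))` followed by `LinearSVC` is one common way to do this, though whether the study used that exact configuration is not stated.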

Published

2025-09-22