Analyzing millions of Amazon customer reviews to extract actionable business insights and predict product ratings. This project demonstrates skills in large-scale data processing, feature engineering, statistical analysis, machine learning, and transformer-based NLP modeling. Insights from the project inform product strategy, marketing decisions, and a scalable, automated review-analysis pipeline. 🛒
This project analyzes ~123GB of Amazon customer review data (subset used: 129,794 reviews) from 58,902 products between 2000–2018. The analysis identifies sentiment patterns, predicts product ratings, and evaluates reviewer behavior using structured features and unstructured review text. Methods include statistical testing, hierarchical machine learning models, and transformer-based NLP with DistilBERT.
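Data at the ~123GB scale cannot be loaded into memory at once, so a chunked streaming pass is the typical approach for the subsetting and aggregation steps. A minimal sketch of that pattern, assuming a flat CSV with a `rating` column (the file layout, column name, and chunk size are illustrative assumptions, not the project's actual schema):

```python
import pandas as pd

def summarize_ratings(path, chunksize=100_000):
    """Stream a large review CSV in fixed-size chunks, accumulating
    per-rating counts so the full file never resides in memory."""
    counts = pd.Series(dtype="int64")
    for chunk in pd.read_csv(path, usecols=["rating"], chunksize=chunksize):
        counts = counts.add(chunk["rating"].value_counts(), fill_value=0)
    return counts.astype(int).sort_index()
```

The same loop structure extends to any per-chunk aggregation (token counts, reviewer statistics) as long as the partial results can be merged.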
The dataset shows a clear rating imbalance, with 5-star reviews dominating; this was noted before later phases so the resulting bias could be addressed.
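One standard way to counter such an imbalance during model training is inverse-frequency class weighting, which upweights rare star ratings. A small sketch using the "balanced" weighting scheme (the toy counts below are made up to mimic the skew, not the dataset's real distribution):

```python
import numpy as np

def class_weights(labels):
    """Inverse-frequency ('balanced') weights: each class weight is
    n_samples / (n_classes * class_count), so rare classes weigh more."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy ratings skewed toward 5 stars, mimicking the observed imbalance
ratings = [5] * 60 + [4] * 20 + [3] * 10 + [2] * 6 + [1] * 4
w = class_weights(ratings)
```

These weights can be passed to most classifiers (e.g. `class_weight=` in scikit-learn) so that minority-class errors cost more during fitting.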
Improving accuracy and decreasing validation loss indicate the model was learning effectively without severe overfitting.
This plot highlights per-class performance, avoiding the misleading dominance of the high-rating majority class.
Feature enrichment and removal of leaky features improved Random Forest performance, while the two-stage approach showed the best balance, especially for minority classes.
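The two-stage idea can be sketched as routing: a first model separates low from high ratings, then a specialist model refines the prediction within each group. The lambda "models" below are rule-based placeholders standing in for the project's trained classifiers (the feature names and thresholds are illustrative assumptions):

```python
def two_stage_predict(review, stage1, low_model, high_model):
    """Stage 1 routes a review to a coarse group ('low' = 1-3 stars,
    'high' = 4-5 stars); stage 2 picks the exact star rating."""
    group = stage1(review)
    if group == "low":
        return low_model(review)
    return high_model(review)

# Placeholder stand-ins for trained models (assumptions, not the real ones)
stage1 = lambda r: "low" if r["neg_words"] > r["pos_words"] else "high"
low_model = lambda r: 1 if r["neg_words"] >= 3 else 3
high_model = lambda r: 5 if r["pos_words"] >= 3 else 4

pred = two_stage_predict({"neg_words": 4, "pos_words": 1},
                         stage1, low_model, high_model)
```

Splitting the problem this way lets each stage specialize: the binary router sees a less imbalanced problem, and the within-group models only distinguish adjacent ratings.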
Core: Python, Pandas, NumPy, NLTK
Visualization: Matplotlib, Seaborn, Plotly
ML Models: Logistic Regression, Random Forest, Hierarchical Models
NLP: Hugging Face Transformers (DistilBERT)
Evaluation: Accuracy, Macro F1, Precision/Recall, Confusion Matrices
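Macro F1 is the headline metric above because it averages per-class F1 scores with equal weight, so it is not dominated by the 5-star majority. A from-scratch sketch for clarity (in practice `sklearn.metrics.f1_score(..., average="macro")` computes the same quantity):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean of precision and recall for this class
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

Because every class contributes equally to the average, a model that ignores 1-star reviews is penalized heavily even if overall accuracy stays high.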