Project 3 | High-Fidelity Sentiment Distillation 📄🤖

This project develops a distilled TF-IDF NLP pipeline for high-volume consumer review data. The team reduced feature cardinality to 20,000 key markers while maintaining 89% balanced accuracy, creating a robust, automated, and auditable "Golden Asset" for sentiment modeling.


Links 🔗

Presentation Slides 🎤 Final Report 📄 Team Repo 👥

Project Overview 📄

The project focuses on automated ingestion, distillation, and visualization of 50,000 text records. The NLP pipeline preserves negation markers, reduces noise via NLTK filtering, and applies bi-gram TF-IDF vectorization. The resulting dataset enables accurate sentiment classification while minimizing computational overhead and ensuring ethical oversight.


Project Workflow 🪜


Visuals 📷

Figure 1: Class balance bar chart (Seaborn) showing a perfect 50/50 split (25,000 positive / 25,000 negative) for unbiased modeling.
Figure 2: Dual-axis histogram of review word counts before and after distillation, confirming removal of non-semantic filler.
Figure 3a: Positive word cloud visually auditing the "Golden Asset", confirming key positive sentiment markers are preserved.
Figure 3b: Negative word cloud visually auditing the "Golden Asset", confirming key negative sentiment markers are preserved.
Figure 3c: Quality audit of 500 sampled positive records, confirming the distilled text maintains semantic fidelity.
Figure 3d: Quality audit of 500 sampled negative records, confirming the distilled text maintains semantic fidelity.
Figure 4: Heatmap validating model quality using preserved negation markers and bi-gram TF-IDF, achieving 89% balanced accuracy.
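A heatmap like Figure 4 can be reproduced along these lines. The labels below are toy values for illustration only, not the project's actual predictions:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

ax = sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
ax.set(xlabel="Predicted", ylabel="Actual", title="Sentiment Model Heatmap")
plt.savefig(os.path.join(tempfile.mkdtemp(), "heatmap.png"))
```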

Results 🟰

Dataset completeness: 50,000 records ingested with 100% label fidelity

Feature reduction: 20,000 high-intensity markers retained

Modeling accuracy: 89% with balanced precision/recall

Golden Asset validation: Preserved negation markers and semantic integrity
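The modeling step behind these results can be sketched as below. This is a hedged, self-contained example: the four toy reviews stand in for the 50,000-record corpus, and the Naive Bayes model is one of the two classifiers the stack lists.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import balanced_accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the distilled corpus (1 = positive, 0 = negative)
texts = [
    "great film loved it",
    "not good at all",
    "wonderful acting and story",
    "terrible waste not worth watching",
]
labels = [1, 0, 1, 0]

# Bi-gram TF-IDF capped at 20,000 features, as in the pipeline
vec = TfidfVectorizer(ngram_range=(1, 2), max_features=20_000)
X = vec.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)
preds = clf.predict(X)

# Balanced accuracy averages recall across both classes, so it is
# the right score for the balanced 50/50 dataset
score = balanced_accuracy_score(labels, preds)
```

In practice the score would be computed on a held-out test split rather than the training data shown here.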


Recommendations & Ethical Considerations ⚖️


Technical Stack 🔨

Data Wrangling: pandas, os, glob

Distillation: NLTK, custom stop-word filtering

Modeling: scikit-learn (Naive Bayes, SVM), TF-IDF vectorization

Visualization: Matplotlib, Seaborn, WordCloud

Environment: Python 3.x, Jupyter Notebook, GitHub Pages deployment
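The data-wrangling layer (pandas, os, glob) can be sketched as a simple multi-file ingestion step. File names and the `review` column here are hypothetical stand-ins; the real project reads the 50,000-record corpus.

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small stand-in CSVs so the example is self-contained
workdir = tempfile.mkdtemp()
for name, rows in [("pos.csv", ["good", "great"]), ("neg.csv", ["bad", "awful"])]:
    pd.DataFrame({"review": rows}).to_csv(os.path.join(workdir, name), index=False)

# Ingest every CSV in the folder and concatenate into one frame
frames = [pd.read_csv(p) for p in sorted(glob.glob(os.path.join(workdir, "*.csv")))]
reviews = pd.concat(frames, ignore_index=True)
```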


Conclusion ✅

The framework confirms that high-fidelity, distilled data is essential for effective sentiment modeling. Automated ingestion and distillation preserved the integrity of all 50,000 records, providing a blueprint for transforming noisy text into a model-ready "Golden Asset" and improving analytic efficiency and predictive accuracy.


References 📚