
Drawing the Line: Classifying News Sentiment with SVMs

Overview

Support Vector Machines (SVMs) are effective supervised learning models for classification tasks. They work by finding the hyperplane in a high-dimensional feature space that best separates the classes. Because they can handle complex or overlapping decision boundaries, SVMs are well suited to difficult text classification problems.

In this project, we used Support Vector Machines (SVMs) to classify news articles about the removal of smartphone chargers into three sentiment categories: against, neutral, and pro. Because SVMs maximize the margin between sentiment classes, they are more robust on borderline cases than Naïve Bayes or Decision Trees.

To balance the trade-off between maximizing the margin and minimizing classification error, we tested three SVM kernels (linear, RBF (Radial Basis Function), and polynomial) and tuned the cost parameter C. This let us evaluate how well SVMs detect the subtle linguistic cues in headlines and descriptions that can signal bias or tone.

Data Preparation

To make our text data compatible with Support Vector Machine models, which require numeric, labeled input, we preprocessed and vectorized it. Starting from a dataset of lemmatized news headlines and article descriptions, we labeled each item as pro, neutral, or against the policy requiring the removal of smartphone chargers.

To weight terms according to their significance across all articles, we applied a TF-IDF transformation on top of CountVectorized inputs. This step is crucial because SVMs rely on geometric separation in high-dimensional space, and TF-IDF helps prioritize meaningful words in that space.
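A minimal sketch of this vectorization step, assuming scikit-learn; the example headlines below are hypothetical stand-ins for the real lemmatized dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical lemmatized headlines; the actual dataset is not shown here.
docs = [
    "phone maker drop charger from box",
    "consumer group criticize charger removal",
    "charger policy reduce electronic waste",
]

# Raw term counts first, then TF-IDF weighting layered on top of them.
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)

print(tfidf.shape)  # (n_documents, vocabulary_size)
```

The equivalent one-step `TfidfVectorizer` would also work; the two-stage version mirrors the CountVectorized-then-TF-IDF pipeline described above.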


The dataset was split using stratified sampling into:

  • 80% training set – used to learn the support vectors and hyperplane,

  • 20% testing set – used to evaluate generalization.
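The stratified 80/20 split can be sketched as follows, assuming scikit-learn; the labels are hypothetical placeholders for the real sentiment annotations:

```python
from sklearn.model_selection import train_test_split

# Hypothetical sentiment labels, balanced here only for illustration.
labels = ["pro", "neutral", "against"] * 5          # 15 example rows
features = list(range(len(labels)))                 # placeholder feature rows

# stratify=labels preserves the pro/neutral/against proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.20, stratify=labels, random_state=42
)

print(len(X_train), len(X_test))  # 12 3
```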


Since SVMs work only with numeric, labeled data, representing the features numerically made the dataset fully compatible with our pipeline.

[Screenshots: data preparation code and vectorized output]

This preparation was essential both to let the models distinguish subtle sentiment categories and to enable kernel-based experimentation with linear, RBF, and polynomial mappings.

Code

We implemented and evaluated Support Vector Machines (SVMs) in Python using the Scikit-learn library. The dataset was preprocessed with the TF-IDF transformation to convert the text into a numeric format suitable for linear separation.

Three different kernel types were tested in order to assess the effects of various decision boundaries:

  • Linear – Best for linearly separable data.

  • RBF (Radial Basis Function) – Allows nonlinear boundaries with higher flexibility.

  • Polynomial – Captures interaction between features using degree-based separation.


Each kernel was trained with cost parameter values C = 0.1, 1, and 10, and the most accurate version was kept. For consistency, we report results for C = 1, which offered a good balance between classification accuracy and margin width.
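The kernel-and-C sweep described above can be sketched like this, assuming scikit-learn; synthetic data stands in for the TF-IDF features, so the scores are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF feature matrix; three sentiment classes.
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# Try each kernel at C in {0.1, 1, 10} and record test accuracy for each pair.
results = {}
for kernel in ["linear", "rbf", "poly"]:
    for C in [0.1, 1, 10]:
        clf = SVC(kernel=kernel, C=C).fit(X_train, y_train)
        results[(kernel, C)] = clf.score(X_test, y_test)

best = max(results, key=results.get)
print(best, round(results[best], 3))
```

A `GridSearchCV` over the same parameter grid would be the more idiomatic scikit-learn route; the explicit loop makes the nine kernel/C combinations visible.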

To guarantee a fair comparison, every SVM was trained on the same training/testing split, and results were tracked with classification reports and confusion matrices.
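The evaluation step can be sketched as follows, again on synthetic stand-in data, using scikit-learn's `confusion_matrix` and `classification_report`:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features; three sentiment classes.
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

pred = SVC(kernel="rbf", C=1).fit(X_train, y_train).predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, pred)
print(cm)
print(classification_report(y_test, pred))
```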

Results & Interpretation

To evaluate Support Vector Machines for sentiment classification, we tested three different kernel functions: linear, RBF (Radial Basis Function), and polynomial, each with a regularization parameter (C) set to 1.


Confusion Matrix: Below are the visualized confusion matrices for each kernel. These diagrams highlight how frequently each model correctly or incorrectly predicted the sentiment class.

[Figure: confusion matrices for the linear, RBF, and polynomial kernels]

Classification Report Summary

[Screenshot: classification report summary table]

Kernel Comparison

  • The RBF kernel achieved the highest overall accuracy (53%) and recall for the Neutral class (96%).

  • Linear and Polynomial kernels both reached 49% accuracy, with similar misclassification trends.

  • All models performed well in recognizing Neutral sentiment but struggled with minority classes, especially Against.

These findings show that SVMs can capture subtle variations in text sentiment, especially with nonlinear kernels like RBF. Even so, every model we examined confused some Pro and Against articles because the language used in those two categories overlaps.

Conclusion

Support Vector Machines (SVMs) proved to be the best model for sentiment classification in this project. The RBF kernel achieved the highest overall accuracy (53%) and, with 96% recall, outperformed the other two kernels at identifying Neutral articles.

This performance reflects SVMs' capacity to model intricate, nonlinear decision boundaries in high-dimensional text data, which is an advantage when media sentiment differs only in subtle ways. All models struggled with the overlapping vocabulary of the Pro and Against categories, but SVM handled that ambiguity better than either Decision Trees or Naïve Bayes.


While the linear and polynomial models performed fairly well, the kernel comparison showed that they lacked the flexibility to fully capture the sentiment dynamics in this dataset. The RBF kernel had a distinct advantage because it could adapt to more complex patterns.

Overall, the SVM results highlight how much model selection and hyperparameter tuning matter when working with text data. These findings will guide the development of more precise, context-aware sentiment classifiers in the future, perhaps using ensemble methods or advanced language models like BERT.
