Vineel Rayapati
Data Scientist
From Raw to Refined:
Data Collection & Preprocessing
Data Collection & Sources Overview
Data is at the heart of this project, serving as the foundation for understanding public sentiment on smartphone companies removing chargers from the box. To ensure a diverse and balanced dataset, news articles were gathered from multiple sources, including APIs and web scraping techniques. By leveraging structured data retrieval from NewsAPI.org and MediaStack, along with Google News RSS web scraping, we compiled a rich dataset of 277 articles. This dataset captures various viewpoints, ensuring a fair representation of both Pro-Regulation (supporting charger removal) and Anti-Regulation (opposing charger removal) perspectives.
To obtain a wide range of articles, we used NewsAPI.org, a service that aggregates news from thousands of publishers worldwide. We queried the API using keywords such as "smartphone charger removal," "phone no charger," "Apple charger removal," and "Samsung charger policy". Similarly, MediaStack API was employed to supplement the dataset with additional articles from various news outlets, ensuring comprehensive coverage of the topic. Combined, these two APIs contributed approximately 200 articles to the dataset, covering opinions, regulatory changes, and public reactions.
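The keyword queries described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual collection script: it assumes NewsAPI's `/v2/everything` endpoint with the `q`, `language`, `pageSize`, and `apiKey` parameters, and the helper names (`build_newsapi_url`, `extract_articles`, `deduplicate`, `fetch_all`) are hypothetical. MediaStack uses its own endpoint and parameters and is not shown here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; the source only names the service, not the route used.
NEWSAPI_ENDPOINT = "https://newsapi.org/v2/everything"

# Keywords quoted in the text above.
KEYWORDS = [
    "smartphone charger removal",
    "phone no charger",
    "Apple charger removal",
    "Samsung charger policy",
]


def build_newsapi_url(query, api_key, page_size=100):
    """Build the request URL for one keyword query."""
    params = {"q": query, "language": "en", "pageSize": page_size, "apiKey": api_key}
    return f"{NEWSAPI_ENDPOINT}?{urlencode(params)}"


def extract_articles(payload, query):
    """Flatten one JSON response into a list of simple records."""
    records = []
    for item in payload.get("articles", []):
        records.append({
            "query": query,
            "source": (item.get("source") or {}).get("name"),
            "title": item.get("title"),
            "description": item.get("description"),
            "url": item.get("url"),
            "published_at": item.get("publishedAt"),
        })
    return records


def deduplicate(records):
    """Keep the first record seen for each URL (queries overlap heavily)."""
    seen, unique = set(), []
    for rec in records:
        if rec["url"] not in seen:
            seen.add(rec["url"])
            unique.append(rec)
    return unique


def fetch_all(api_key):
    """Query every keyword and merge the results (performs network calls)."""
    records = []
    for query in KEYWORDS:
        with urlopen(build_newsapi_url(query, api_key), timeout=30) as resp:
            records.extend(extract_articles(json.load(resp), query))
    return deduplicate(records)
```

Deduplicating by URL matters here because the four keyword queries are likely to return many of the same articles.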
In addition to API-based data collection, Google News RSS scraping was performed to extract the latest headlines related to smartphone charger policies. Using Python’s feedparser library, we gathered approximately 77 articles directly from Google's RSS feeds, ensuring that we captured the most recent and evolving discussions. By integrating both API and web scraping techniques, this dataset provides a well-rounded view of the issue, offering valuable insights into public discourse, corporate policies, and environmental concerns related to charger removal.

Data Cleaning & Preprocessing

Data cleaning and preprocessing are essential steps in transforming raw text data into meaningful and structured information for analysis. In this project, the collected news articles underwent several preprocessing techniques to remove noise and standardize the text. First, unnecessary characters, punctuation, and numbers were removed, and all text was converted to lowercase to maintain consistency. Next, stopwords—common but unimportant words like "the," "and," and "is"—were filtered out to focus on the most meaningful terms. Additionally, stemming and lemmatization were applied: stemming reduces words to their root form (e.g., "removing" → "remov"), while lemmatization converts words into their dictionary form (e.g., "removing" → "remove"). These steps help reduce redundancy and improve text representation for further analysis.
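As a rough sketch of the cleaning steps above, the function below lowercases the text, strips non-letter characters, drops stopwords, and optionally stems each token with NLTK's PorterStemmer. The short stopword set and the function name `clean_text` are illustrative assumptions; a full run would more likely use a complete stopword list. Lemmatization is left out because NLTK's WordNetLemmatizer needs the WordNet corpus to be downloaded first.

```python
import re

from nltk.stem import PorterStemmer

# Small illustrative stopword set (an assumption, not the project's list).
STOPWORDS = {"the", "and", "is", "a", "an", "of", "to", "in", "from", "are", "for", "on"}

_stemmer = PorterStemmer()


def clean_text(text, stem=True):
    """Lowercase, remove punctuation and digits, drop stopwords, optionally stem."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters and whitespace only
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    if stem:
        tokens = [_stemmer.stem(tok) for tok in tokens]
    return " ".join(tokens)


# Without stemming, only noise and stopwords are removed.
print(clean_text("Apple is removing the charger in 2020!", stem=False))
# With stemming, "removing" is reduced to its stem.
print(clean_text("Removing chargers"))
```

Returning a single space-joined string keeps the output directly usable as input to a vectorizer.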
After text normalization, feature extraction techniques were applied to convert the text into a numerical format for machine learning models. Two key vectorization methods were used: CountVectorizer, which converts text into a bag-of-words representation based on word frequency, and TF-IDF (Term Frequency-Inverse Document Frequency), which weights each word by how often it appears in a document while down-weighting words that appear across many documents. These vectorized features allow for clustering, classification, and sentiment analysis. The final cleaned dataset includes structured text ready for topic modeling, sentiment analysis, and predictive modeling, giving the later analyses a consistent foundation.
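To make the difference between the two vectorizers concrete, here is a small sketch using scikit-learn with default settings. The three example sentences are made up for illustration and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents standing in for cleaned article text (illustrative only).
docs = [
    "apple removes charger from box",
    "samsung removes charger too",
    "regulators debate charger waste",
]

# Bag-of-words: each cell is the raw count of a word in a document.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# TF-IDF: counts are reweighted so words found in every document matter less.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)

charger_col = tfidf_vec.vocabulary_["charger"]  # appears in all three documents
apple_col = tfidf_vec.vocabulary_["apple"]      # appears in only one document

print("matrix shape:", counts.shape)
print("doc 0 weight for 'charger':", tfidf[0, charger_col])
print("doc 0 weight for 'apple':", tfidf[0, apple_col])
```

Both words occur once in the first document, so their counts are equal, but TF-IDF gives "apple" the higher weight because "charger" appears in every document and so carries little distinguishing information.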
Raw Data

Here is the CSV file in which the data is organized and rearranged.
Cleaned Data


Here is the CSV file containing the count-vectorized and TF-IDF-vectorized data.
Histogram for Sentiment Scores
A histogram of sentiment scores provides a visual representation of how the collected news articles are distributed across different sentiment values. By plotting the VADER sentiment scores, we can observe whether the majority of articles lean toward positive (Pro-Regulation), negative (Anti-Regulation), or neutral sentiment. The x-axis represents the sentiment scores, ranging from -1 (strongly negative) to +1 (strongly positive), while the y-axis shows the number of articles within each sentiment range. This helps identify sentiment trends, such as whether public opinion is mostly neutral or polarized. Analyzing this distribution can offer insights into how the media portrays smartphone charger removal policies and whether certain narratives dominate the discussion.
