
Discovering Hidden Narratives with Clustering

Overview

We use clustering as an unsupervised machine learning approach to explore how news media and public discourse frame the controversial decision by smartphone manufacturers to remove chargers from new phone boxes. With no predefined labels in the data, clustering allows us to group similar articles based on their textual content, helping us uncover dominant themes, recurring narratives, and varying tones across the dataset.


Our dataset contains over 140 news articles scraped from multiple sources, each consisting of a headline and a description. By combining and transforming this text into numerical vectors using TF-IDF, we created a format suitable for clustering algorithms. We implemented two key methods: K-Means Clustering to segment articles into distinct groups based on content similarity, and Hierarchical Clustering using cosine similarity to understand how articles relate in a tree-like structure.


Through this analysis, we expect to find meaningful clusters that reflect different sides of the charger removal debate, such as support for sustainability, consumer frustration, comparisons between tech companies, discussions around charging technology, and the economic/business implications of the policy. Clustering not only gives us a high-level view of the narrative landscape but also reveals how public opinion and media framing differ, subtly or significantly, across outlets and contexts.

Data Preparation

Clustering algorithms such as K-Means and Hierarchical Clustering require unlabeled, numeric data as input. This means we must convert our raw text data — consisting of news article headlines and descriptions — into a structured numerical format that captures the meaning and relevance of words in each document.

To achieve this, we used TF-IDF (Term Frequency-Inverse Document Frequency) vectorization, a technique that transforms text into feature vectors. This method emphasizes words that are unique and informative within a specific article but not overly common across all articles. Before vectorizing, we combined each article’s title and description fields to form a single text input per article.

Once transformed, each article is represented as a high-dimensional numeric vector, where each feature corresponds to a term's TF-IDF weight. These vectors are then used as input for both clustering models.
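The combine-then-vectorize step can be sketched in Python with scikit-learn. The three records below are hypothetical stand-ins for the scraped dataset, not actual articles from it:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical records standing in for the scraped articles.
articles = [
    {"title": "Apple drops the charger from the iPhone box",
     "description": "The company cites environmental benefits."},
    {"title": "Consumers frustrated by missing chargers",
     "description": "Buyers say they now pay extra for adapters."},
    {"title": "Samsung follows Apple on charger removal",
     "description": "Rivals adopt the same packaging policy."},
]

# Combine title and description into a single text field per article.
texts = [a["title"] + " " + a["description"] for a in articles]

# TF-IDF down-weights terms common across the corpus and up-weights
# terms distinctive to a single article.
X = TfidfVectorizer(stop_words="english").fit_transform(texts)
print(X.shape)  # (number of articles, number of unique terms)
```

Each row of the resulting sparse matrix is one article's TF-IDF vector, ready to feed into either clustering model.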

Before Transformation

[Screenshot: article data before TF-IDF transformation]

After Transformation

[Screenshot: TF-IDF feature vectors after transformation]

K-Means Clustering + PCA

We used Python to implement K-Means clustering on the TF-IDF vectorized data. The process started by combining each article’s title and description into one unified text field to capture maximum context. This was then transformed into a TF-IDF matrix using TfidfVectorizer, converting each article into a high-dimensional vector of weighted terms.

We then applied the K-Means algorithm from sklearn.cluster, which attempts to group articles into k clusters by minimizing intra-cluster distances. We experimented with three values of k — 3, 4, and 5 — and used the Silhouette Score to evaluate the effectiveness of each clustering configuration. This score helped us determine how well-separated and cohesive each cluster was.
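The k-selection loop can be sketched as follows; the ten short texts are a hypothetical miniature of the real corpus, so the scores they produce are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical mini-corpus standing in for the ~140 scraped articles.
texts = [
    "apple removes charger citing environmental goals",
    "apple defends charger removal as green policy",
    "consumers angry about missing charger in the box",
    "buyers frustrated by paying extra for an adapter",
    "samsung mocked apple then adopted the same policy",
    "samsung and xiaomi follow apple on packaging",
    "fast charging and usb c technology explained",
    "gan chargers promise faster usb c charging",
    "charger removal cuts shipping costs for makers",
    "business case for smaller lighter phone boxes",
]

X = TfidfVectorizer(stop_words="english").fit_transform(texts)

# Score each candidate k; a higher silhouette indicates tighter,
# better-separated clusters.
for k in (3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"k={k}: silhouette = {score:.3f}")
```

Fixing `random_state` makes the runs reproducible, which matters because K-Means is sensitive to its random initial centroids.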

After identifying the best k, we used Principal Component Analysis (PCA) to reduce the data to two dimensions for easier visualization. This allowed us to generate a scatter plot where each article is a point, colored by its cluster assignment. This PCA projection makes the abstract clustering process more intuitive and visually interpretable.
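The projection step might look like the sketch below (again on a hypothetical mini-corpus); PCA requires a dense array, so the sparse TF-IDF matrix is converted first:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical mini-corpus standing in for the scraped articles.
texts = [
    "apple removes charger citing environmental goals",
    "apple defends charger removal as green policy",
    "consumers angry about missing charger in the box",
    "buyers frustrated by paying extra for an adapter",
    "samsung and xiaomi follow apple on packaging",
    "fast charging and usb c technology explained",
    "gan chargers promise faster usb c charging",
    "charger removal cuts shipping costs for makers",
]

X = TfidfVectorizer(stop_words="english").fit_transform(texts)
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)

# PCA needs a dense array; project to 2 components for plotting.
coords = PCA(n_components=2, random_state=42).fit_transform(X.toarray())
print(coords.shape)  # one (x, y) point per article

# Each row of coords, colored by its entry in labels, becomes one dot
# in the scatter plot (e.g. via matplotlib's plt.scatter).
```

Note that the 2-D projection is only for visualization; the clustering itself happens in the full TF-IDF space.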

Hierarchical Clustering with Cosine Similarity

In addition to K-Means, we used R to perform Hierarchical Clustering — a technique that creates a tree-like structure (dendrogram) by recursively merging similar observations. This method doesn't require us to pre-specify the number of clusters and allows us to visually analyze how closely articles relate to one another.

The workflow began with text preprocessing using the tm package: we removed stopwords, punctuation, and numbers, and applied stemming. A Term-Document Matrix (TDM) was then created to quantify the presence of terms across articles.

We computed Cosine Similarity as our distance metric using the proxy package, which is particularly effective for sparse text data. Cosine similarity captures the angular difference between text vectors, reflecting how similar two documents are in word use rather than magnitude.

Using hclust with Ward's method (ward.D2), we generated a dendrogram that grouped articles based on content similarity. We cut the dendrogram at k = 5 to match our K-Means clustering result and allow for comparison between the two techniques.
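The original workflow ran in R (tm, proxy, hclust). For readers following along in Python, a rough equivalent can be sketched with scipy, using `linkage` as a stand-in for `hclust(..., method = "ward.D2")`; the mini-corpus is hypothetical:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical mini-corpus standing in for the scraped articles.
texts = [
    "apple removes charger citing environmental goals",
    "apple defends charger removal as green policy",
    "consumers angry about missing charger in the box",
    "buyers frustrated by paying extra for an adapter",
    "samsung and xiaomi follow apple on packaging",
    "fast charging and usb c technology explained",
    "gan chargers promise faster usb c charging",
    "charger removal cuts shipping costs for makers",
    "business case for smaller lighter phone boxes",
    "retailers report higher standalone charger sales",
]

X = TfidfVectorizer(stop_words="english").fit_transform(texts)

# Cosine distance = 1 - cosine similarity; condensed form for linkage().
dist = squareform(cosine_distances(X), checks=False)

# Ward linkage, roughly analogous to R's hclust(..., method = "ward.D2").
Z = linkage(dist, method="ward")

# Cut the dendrogram into (at most) 5 clusters to mirror the K-Means run.
clusters = fcluster(Z, t=5, criterion="maxclust")
print(clusters)
```

This sketch skips the stemming step from the R pipeline; the key ideas — cosine distance on sparse text vectors, Ward linkage, cutting the tree at k = 5 — carry over directly.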

Results

Silhouette Scores and K-Means Cluster Evaluation

We tested k = 3, 4, and 5 using the KMeans algorithm. To evaluate how well the clusters formed, we calculated the Silhouette Score for each value of k. This score ranges from -1 to 1, where higher values indicate better cohesion within clusters and better separation between them.
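The interpretation of the score can be seen on toy data where the answer is obvious: two well-separated blobs should score near +1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two obvious, well-separated blobs (toy data, not the article corpus).
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pts)

# Near +1 here: each point is close to its own cluster and far from
# the other cluster.
print(round(silhouette_score(pts, labels), 3))
```

Real TF-IDF data rarely approaches +1 because documents share vocabulary across themes, which is why the modest scores below are still informative for comparing values of k.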

 

Our results were as follows:

[Table: Silhouette Scores for k = 3, 4, and 5]

Despite the relatively low scores — typical for high-dimensional TF-IDF data — k = 5 consistently outperformed the other configurations. This informed our decision to proceed with 5 clusters in both the KMeans and Hierarchical methods.

[Figure: Silhouette Score comparison across values of k]

K-Means Clustering Visualization via PCA

After choosing k = 5, we projected the TF-IDF matrix into 2D using PCA and visualized the resulting clusters. Each dot on the scatterplot represents a news article, and each color represents a different cluster. This visualization revealed that some clusters — like Cluster 2 — are tightly packed and clearly separated from others, suggesting consistent language patterns or strong thematic alignment. Others showed mild overlap, reflecting subtler differences or shared terms across topics.

[Figure: PCA scatter plot of articles colored by K-Means cluster]

Hierarchical Clustering Dendrogram

The dendrogram produced via Hierarchical Clustering provides a top-down view of the relationships between articles. Each branch point represents a merge between two clusters, with the height of the branch indicating how dissimilar the merged items are. We cut the dendrogram at k = 5, which visually aligned well with natural splits in the tree structure. These 5 clusters closely mirrored those found by KMeans, adding robustness to our findings.
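One way to quantify the agreement claimed above is the Adjusted Rand Index, which compares two partitions of the same items. The label vectors below are hypothetical, chosen to show what perfect agreement looks like:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical cluster assignments for the same eight articles from
# the two methods; label values are arbitrary, only grouping matters.
kmeans_labels = [0, 0, 1, 1, 2, 2, 3, 4]
hierarchical_labels = [2, 2, 0, 0, 1, 1, 4, 3]

# 1.0 = identical partitions; values near 0 = chance-level agreement.
ari = adjusted_rand_score(kmeans_labels, hierarchical_labels)
print(ari)  # → 1.0, since the two groupings match exactly
```

Because ARI ignores the arbitrary label numbers, it is well suited to comparing K-Means assignments against a cut dendrogram.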

[Figure: Hierarchical clustering dendrogram cut at k = 5]

Cluster Interpretation and Labels

After obtaining the cluster assignments, we sampled articles from each group and reviewed keywords and text to assign meaningful labels to each cluster. These thematic labels help communicate the high-level patterns that emerged from the unsupervised analysis.

[Table: thematic labels assigned to the five clusters]

Conclusion

Clustering was instrumental in revealing how the charger removal debate is covered across media outlets. Through both K-Means (Python) and Hierarchical Clustering (R), we discovered five clear article groupings that reflect not just technical distinctions, but real-world sentiment and business narratives. Both clustering methods aligned in their segmentation, validating the results.

More importantly, this segmentation helped translate abstract textual patterns into concrete, interpretable themes such as sustainability, consumer frustration, brand strategy, and innovation. These insights not only inform our understanding of the charger removal policy, but also set the stage for deeper discovery. The next phase of this analysis will use Topic Modeling (LDA) to extract the underlying topics from each cluster and Association Rule Mining to explore co-occurring ideas and sentiments in the discourse.
