Vineel Rayapati
Data Scientist
Uncovering Patterns in Tech News with Association Rule Mining
Overview
Association Rule Mining (ARM) is a powerful unsupervised learning technique used to discover interesting relationships or patterns (associations) between items in a dataset. Traditionally used in market basket analysis (e.g., discovering that people who buy "bread" often buy "butter"), ARM is increasingly applied in text mining — where words act as items and documents as transactions.
In this project, we apply ARM to a dataset of news articles related to smartphone chargers. Each article is treated as a transaction, and the preprocessed words from that article represent the items. This approach allows us to uncover how words are contextually associated across multiple articles — helping to surface common product features, consumer concerns, brand mentions, or innovations in charger technology.
Key Concepts and Metrics in ARM
Support Definition: How frequently an itemset appears in the dataset. Formula: Support(A) = (number of transactions containing A) / (total number of transactions).
Confidence Definition: How often items in B appear in transactions that contain A. Formula: Confidence(A → B) = Support(A ∪ B) / Support(A).
Lift Definition: How much more likely B is to appear with A than would be expected if A and B were independent. Formula: Lift(A → B) = Confidence(A → B) / Support(B).
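The three metrics above can be computed directly from a handful of toy transactions. This is a minimal sketch, and the word lists below are illustrative stand-ins, not rows from the actual dataset:

```python
# Toy transactions: each "article" reduced to a set of words.
transactions = [
    {"samsung", "galaxy", "charger"},
    {"samsung", "charger"},
    {"wireless", "charger"},
    {"samsung", "galaxy"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """P(B | A): support of A ∪ B divided by support of A."""
    return support(a | b) / support(a)

def lift(a, b):
    """Confidence divided by the confidence expected if A and B were independent."""
    return confidence(a, b) / support(b)

a, b = {"samsung"}, {"galaxy"}
print(support(a | b))    # 2 of 4 transactions -> 0.5
print(confidence(a, b))  # 0.5 / 0.75 ≈ 0.667
print(lift(a, b))        # 0.667 / 0.5 ≈ 1.333
```

A lift above 1 (as here) means the two words co-occur more often than chance would predict.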
Data Preparation
Apriori Algorithm
This algorithm efficiently finds frequent itemsets using a bottom-up approach, pruning infrequent item combinations based on minimum support thresholds. It reduces computational complexity by eliminating non-promising candidates early in the process.
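The bottom-up pruning can be sketched in a few lines of plain Python. This is a simplified, stdlib-only illustration on hypothetical transactions; the project itself uses mlxtend's implementation:

```python
def apriori_sketch(transactions, min_support):
    """Grow frequent itemsets level by level; only itemsets that meet
    the support threshold are extended, so unpromising candidates are
    pruned early. (Simplified: omits the classic subset-pruning step.)"""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # Level 1: frequent single items.
    current = {frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n >= min_support}
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = sum(s <= t for t in transactions) / n
        # Candidate (k+1)-itemsets from unions of frequent k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        current = {c for c in candidates
                   if sum(c <= t for t in transactions) / n >= min_support}
        k += 1
    return frequent

transactions = [{"samsung", "galaxy", "charger"},
                {"samsung", "charger"},
                {"wireless", "charger"},
                {"samsung", "galaxy"}]
freq = apriori_sketch(transactions, min_support=0.5)
```

With this threshold, "wireless" (support 0.25) is pruned at level 1, so no pair containing it is ever scored.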
Why Transaction Data?
ARM requires transactional, unlabeled data. This means each observation should be a list of items (in our case, words), without any associated class or outcome labels. The goal is to identify how items co-occur across many observations.
Dataset Overview
We used a news dataset containing charger-related articles. Each article includes text under fields like Title, Description, and Content.
Preprocessing Steps
- Lowercasing: Convert all text to lowercase.
- Tokenization: Split content into words.
- Cleaning: Remove stop words, punctuation, numbers, and non-alphabetic tokens.
- Deduplication: Remove repeated words within the same document.
- Transaction Creation: Treat each cleaned article as a transaction of words.
Before Transformation

After Transformation

Code Summary
The transformation and mining process was implemented in Python using nltk (for NLP), mlxtend (for Apriori), and networkx (for visualization).
Processing Pipeline
- Clean and tokenize articles → remove stopwords → generate list of transactions
- Encode data using TransactionEncoder
- Apply the Apriori algorithm (min_support = 0.02)
- Generate rules and compute support, confidence, and lift
- Visualize with network graphs
We extracted association rules based on different thresholds and metrics. The top 15 rules for each key metric (support, confidence, and lift) are shown below:
Top 15 Rules by Support
These are the most frequently co-occurring item pairs in the dataset. The most common associations involved the words "samsung", "galaxi", and "charger".

These rules indicate common product mentions, such as “Samsung Galaxy” and various charger-related terms.
Top 15 Rules by Confidence
These rules are almost always true when the antecedent occurs. Confidence values of 1.0 show perfect reliability within the dataset.

Some rules with 100% confidence suggest strong co-occurrence of words describing feature upgrades (e.g., "vent", "wireless", "upgrad") and device specifications.
Top 15 Rules by Lift
Lift reveals the strongest non-random associations, some of which are more corporate or location-based.

These high-lift associations often relate to specific press releases or corporate statements that consistently use the same phrases, such as "divis", "corpor", and "globe".
Network Visualization

This network graph visualizes the top association rules by lift, highlighting strong non-random word pairings in charger-related news. Words like corpor, divis, and north frequently co-occur, suggesting structured phrasing in corporate or press release content. High lift values indicate tight semantic coupling between these terms across multiple articles.
This graph shows the top association rules by support, emphasizing the most frequently co-occurring word pairs across articles. Terms like samsung, galaxi, ultra, and new form a tight cluster, reflecting common product descriptions. The frequent link between charg and charger highlights consistent usage of core terminology in charger-related news.


This graph illustrates the top association rules by confidence, highlighting rules that are nearly always true when the antecedent occurs. Strong directional links toward samsung from terms like screen, review, and smartphon suggest consistent brand references. Another cluster shows vent, wireless, and upgrad, pointing to high-confidence associations around charger features and innovation.
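Graphs like these can be built with networkx by treating each rule as a directed edge from antecedent to consequent. The rule tuples and lift values below are hypothetical placeholders for the mined output:

```python
import networkx as nx

# Hypothetical top rules (antecedent, consequent, lift) standing in
# for the real rules DataFrame.
top_rules = [
    ("corpor", "divis", 9.1),
    ("divis", "globe", 8.7),
    ("corpor", "globe", 8.5),
    ("north", "corpor", 8.2),
]

G = nx.DiGraph()
for antecedent, consequent, lift in top_rules:
    # Edge direction encodes antecedent -> consequent; lift weights the edge.
    G.add_edge(antecedent, consequent, weight=lift)

# Drawing (omitted here): nx.draw_networkx(G, width=[d["weight"] / 3
# for _, _, d in G.edges(data=True)]) renders thicker edges for higher lift.
print(sorted(G.nodes()))
print(G.number_of_edges())
```

Using a directed graph preserves the asymmetry of the rules, so "screen → samsung" and "samsung → screen" remain distinct edges.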
Conclusion
Through the use of Association Rule Mining on charger-related news articles, we were able to extract valuable insights:
- Brand Clustering: "Samsung" and "Galaxy" consistently co-occur, showing tight brand-product coupling in reporting.
- Feature Innovation: "Wireless", "vent", and "upgrade" appear together in high-confidence rules, pointing to innovation narratives in charger products.
- Corporate Language: High lift values linked words like "corpor", "divis", and "globe", revealing structured vocabulary in press release content.
These insights can aid market trend analysis, content summarization, or even product development by highlighting recurring language around technologies.